Availability Groups: 5 Things

Here are some configurations that I feel are worth noting about AlwaysOn Availability Groups.  There are plenty of other best practices and configuration guides around but these are issues I’ve experienced when supporting existing AlwaysOn AGs and have therefore re-configured them or added these configurations to my standard AG builds.

Tempdb

These Tempdb configurations should ALL be general best practice but it’s even more important with Availability Groups.

  • Tempdb should be on its own separate, FAST drive. a SSD would be ideal
  • Tempdb should be set up to have a minimum of 4 files. Tempdb should have the same number of files as the server has CPU cores.
  • Tempdb should be pre-grown to fill the drive so you do not get any horrible autogrow events occurring, especially after service restarts.
  • Tempdb should be configured the same on all AG nodes.

You really should be configuring tempdb like this by default anyway. Your tempdb is going to be used differently on your primary and your secondary read-only nodes. Your read-only nodes will have different statistics to your primary and these will be stored in tempdb because, well, they are temporary! Also, snapshot isolation is used for all queries on a read-only node so your version store will be taking up lots of room in tempdb too with all those select queries taking place.  If you ever get tempdb space issues on a secondary node, it is more than likely the version store taking up most of that space.

Your primary tempdb will also be used for all your index rebuilds too, if you implement the following.

Index Maintenance

I’ve blogged about this before.  Index rebuilds are not heavily using the transaction logs like reorganizing does so try rebuilding all indexes rather than rebuilding the most heavily fragmented and reorganising the others. Also, use tempdb to sort your indexes (providing it is well optimised!). With your index rebuilds happening elsewhere and no reorganisations going on, your sync speed will improve and your HADR_SYNC_COMMIT waits will reduce. That log won’t inflate as much either, which leads me on to my next tip.

Transaction Logs

Your transaction logs are under a lot more strain when using any form of high availability (AGs or mirroring) therefore having a large transaction log and a drive for your transaction logs with room to grow will help not only with syncing but also when a secondary node goes offline and can’t sync. When this happens, your primary node will not truncate the transaction log until the data has successfully synced to the nodes currently offline. If other nodes are offline for a long time, this can cause a problem not just with log size but also with playing sync catch-up. If you followed the tip above and have stopped reorganising your indexes and are now rebuilding as well as sorting in tempdb then your log is already looking a lot healthier.

SQL Agent and Logins

I’ve touched on this in previous blog posts but it’s important to remember that the SQL Agent and your logins (as well as alerts, operators, linked servers) are not part of your High Availability. Therefore whenever a new job or login is created, you’ll need to create it on your secondary nodes too and specific jobs may need to be aware of whether they are running on the Primary node or not otherwise they could fail. Along with the function below, I’ve set up automation scripts using PowerShell to resolve this all this. You can also use the DBATools PowerShell module to ease the manual labour (specifically copy-dbaAgentJob and copy-dbaLogin).

IF OBJECT_ID('dbo.fn_hadr_group_is_primary', 'FN') IS NOT NULL
  DROP FUNCTION dbo.fn_hadr_group_is_primary;
GO
CREATE FUNCTION dbo.fn_hadr_group_is_primary (@AGName sysname)
RETURNS bit
AS
BEGIN;
  DECLARE @PrimaryReplica sysname;
  SELECT
    @PrimaryReplica = hags.primary_replica
  FROM sys.dm_hadr_availability_group_states hags
  INNER JOIN sys.availability_groups ag ON ag.group_id = hags.group_id
  WHERE ag.name = @AGName;
  IF UPPER(@PrimaryReplica) =  UPPER(@@SERVERNAME)
    RETURN 1; -- primary
    RETURN 0; -- not primary
END;

Cluster configuration changes

This I picked up from another blog called sqlbp. Your high availability isn’t quite so high if you have problems with disks and bandwidth. However, if you do suffer from regular network timeouts, to make your cluster more resilient, you can amend your heartbeat delays and thresholds.

get-cluster | fl *Subnet* #confirm current settings
$cluster = get-cluster
$cluster.CrossSubnetDelay = 4000
$cluster.CrossSubnetThreshold = 10
$cluster.SameSubnetDelay = 2000
$cluster.SameSubnetThreshold = 10
$cluster.RouteHistoryLength = 20

The above settings will increase the heartbeat delay between nodes  to 4 seconds on the cross subnet delay which compensates for network timeouts although the threshold has been set to 10 heartbeats.  Therefore a heartbeat is sent every 4 seconds, meaning there could be a 40 seconds outage before recovery takes place. I’ve set this to 20 seconds for the same subnet.

It’s a fine balance between proactive and reactive here so the above settings may not be exact for your environment. Generally it’s preferred to have a shorter delay and larger threshold but if you are having a lot of network timeouts then increasing the delay and having a smaller threshold means that you are likely to have less failovers but a longer timeout before a failover takes place. Again, not such high availability but it prevents needless failovers due to low bandwidth which has sometimes been a real problem for me. This blog post goes some way into explaining this in more detail.