SQL 2012 AlwaysOn - Recovery from a Disaster

We have configured 3 SQL server in an AlwaysOn Availability group (see here)
Two servers (SQLDEV1 and SQLDEV2) are configured for automatic failover.

To test how long it takes for SQL Server to fail over from SQLDEV1 to SQLDEV2 we connect to the availability group listener name (sqldevcluster1)

and on the "Additional Connection Parameters" tab we need to add

Multisubnetfailover=TRUE

We can check the current status of the servers in the availability group with this query

select rs.role_desc, rs.connected_state_desc, rs.synchronization_health_desc, cs.replica_server_name

from sys.dm_hadr_availability_replica_states AS rs

JOIN sys.dm_hadr_availability_replica_cluster_states as CS ON rs.replica_id=cs.replica_id

and it returns

If we connect to SQLDEV2 and initiate manual failover

ALTER AVAILABILITY GROUP AG1 FAILOVER;

this statement takes 17 seconds to complete.
And if we check the status of the availability group from the session connected to sqldevcluster1 we get the transport-level error (because the original connection was closed). This error message was returned in about 10 seconds.

Msg 121, Level 20, State 0, Line 0

A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The semaphore timeout period has expired.)

If we check the status of the availability group again the query returns the results instantaneously

and we see that the primary role is owned by SQLDEV2.

Now if we stop the SQL Server service on SQLDEV2 then in about 15 seconds we'll see roles swapped as

and if we re-run the query in 5 seconds the connection with SQLDEV2 is established

and in a couple of seconds the synchronization gets to the healthy status.

If we reboot the SQLDEV1 server now the failover process is exactly the same. in about 20 seconds SQLDEV2 takes over the primary role and then automatically synchronizes itself with SQLDEV1 when it comes back online.

Recovery from a Disaster

The Disaster Recovery process is different because it needs some additional actions and manual steps.

When the primary data center is inaccessible the windows cluster service on SQLDEV3 cannot start because there. is no quorum. So, our first step is to force the cluster service to start without quorum.

To do this we need to execute the following command from the elevated command prompt

net start clussvc /forcequorum

After forcing the cluster service to start we will see on SQLDEV3 in SSMS

Now we need to initiate manual failover with possible data lost

ALTER AVAILABILITY GROUP AG1 FORCE_FAILOVER_ALLOW_DATA_LOSS;

And this will make SQLDEV3 our primary as show below

Resuming Normal Operations

When the Primary Data Center comes back online and both SLDEV1 and SQLDEV2 servers are operational we have to resume the replication within the availability group.

We have to connecte to each of SQLDEV1 and SQLDEV2 and resume the replication by executing the following statements

ALTER DATABASE TESTDB1 SET HADR RESUME;

ALTER DATABASE TESTDB2 SET HADR RESUME;

To transfer the primary role back to SQLDEV1 we need temporarily to change the replication mode to synchronous commit on all three servers.

We do this from SQLDEV3 by executing the following statements:

USE master

ALTER AVAILABILITY GROUP AG1 MODIFY REPLICA ON N'SQLDEV1' WITH (AVAILABILITY_MODE=SYNCHRONOUS_COMMIT);

ALTER AVAILABILITY GROUP AG1 MODIFY REPLICA ON N'SQLDEV2' WITH (AVAILABILITY_MODE=SYNCHRONOUS_COMMIT);

ALTER AVAILABILITY GROUP AG1 MODIFY REPLICA ON N'SQLDEV3' WITH (AVAILABILITY_MODE=SYNCHRONOUS_COMMIT);

Then when connected to SQLDEV1 we do automatic failover

ALTER AVAILABILITY GROUP AG1 FAILOVER

And after successful role transfer to SQLDEV1 we change the replication mode on SQLDEV3 back to asynchronous commit

ALTER AVAILABILITY GROUP AG1 MODIFY REPLICA ON N'SQLDEV3' WITH (AVAILABILITY_MODE=ASYNCHRONOUS_COMMIT);

And the final step is to transfer the core Cluster Resource back to a server in the Primary Data Center.

In Failover Cluster Manager

we see that the active virtual IP is 192.168.69.134 which is an IP in the Secondary Data Center.

To move the core cluster resources back to SQLDEV1

Right-click on the cluster name and select

More Actions à Move Core Cluster Resources à Select Node

When completed (in ~10 seconds) the status will show that current Host server is SQLDEV1

legi, intellexi

Search This Blog