Skip to main content

SQL 2012 AlwaysOn - Recovery from a Disaster

We have configured 3 SQL server in an AlwaysOn Availability group (see here)
Two servers (SQLDEV1 and SQLDEV2) are configured for automatic failover.

To test how long it takes for SQL Server to fail over from SQLDEV1 to SQLDEV2 we connect to the availability group listener name (sqldevcluster1)


and on the "Additional Connection Parameters" tab we need to add

Multisubnetfailover=TRUE




We can check the current status of the servers in the availability group with this query


select rs.role_desc, rs.connected_state_desc, rs.synchronization_health_desc, cs.replica_server_name
from sys.dm_hadr_availability_replica_states AS rs
JOIN sys.dm_hadr_availability_replica_cluster_states as CS ON rs.replica_id=cs.replica_id



and it returns

If we connect to SQLDEV2 and initiate manual failover


ALTER AVAILABILITY GROUP AG1 FAILOVER;



this statement takes 17 seconds to complete.
And if we check the status of the availability group from the session connected to sqldevcluster1 we get the transport-level error (because the original connection was closed). This error message was returned in about 10 seconds.


Msg 121, Level 20, State 0, Line 0
A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The semaphore timeout period has expired.)

If we check the status of the availability group again the query returns the results instantaneously

and we see that the primary role is owned by SQLDEV2.

Now if we stop the SQL Server service on SQLDEV2 then in about 15 seconds we'll see roles swapped as


 and if we re-run the query in 5 seconds the connection with SQLDEV2 is established

and in a couple of seconds the synchronization gets to the healthy status.

If we reboot the SQLDEV1 server now the failover process is exactly the same. in about 20 seconds SQLDEV2 takes over the primary role  and then automatically synchronizes itself with SQLDEV1 when it comes back online.

Recovery from a Disaster


The Disaster Recovery process  is different because it needs some additional actions and manual steps.

When the primary data center is inaccessible the windows cluster service on SQLDEV3 cannot start because there. is no quorum. So, our first step is to force the cluster service to start without quorum.
To do this we need to execute the following command from the elevated command prompt

net start clussvc /forcequorum

After forcing the cluster service to start we will see on SQLDEV3 in SSMS


Now we need to initiate manual failover with possible data lost

ALTER AVAILABILITY GROUP AG1 FORCE_FAILOVER_ALLOW_DATA_LOSS;


And this will make SQLDEV3 our primary as show below



Resuming Normal Operations


When the Primary Data Center comes back online and both SLDEV1 and SQLDEV2 servers are operational we have to resume the replication within the availability group.

We have to connecte to each of SQLDEV1 and SQLDEV2 and resume the  replication by executing the following statements

ALTER DATABASE TESTDB1 SET HADR RESUME;

ALTER DATABASE TESTDB2 SET HADR RESUME;



To transfer the primary role back to SQLDEV1 we need temporarily to change the replication mode to synchronous commit on all three servers.
We do this from SQLDEV3 by executing the following statements:

USE master
Go

ALTER AVAILABILITY GROUP AG1 MODIFY REPLICA ON N'SQLDEV1' WITH (AVAILABILITY_MODE=SYNCHRONOUS_COMMIT);
GO
ALTER AVAILABILITY GROUP AG1 MODIFY REPLICA ON N'SQLDEV2' WITH (AVAILABILITY_MODE=SYNCHRONOUS_COMMIT);
GO

ALTER AVAILABILITY GROUP AG1 MODIFY REPLICA ON N'SQLDEV3' WITH (AVAILABILITY_MODE=SYNCHRONOUS_COMMIT);
GO


Then when connected to SQLDEV1 we do automatic failover

ALTER AVAILABILITY GROUP AG1 FAILOVER


And after successful role transfer to SQLDEV1 we change the replication mode on SQLDEV3 back to asynchronous commit

ALTER AVAILABILITY GROUP AG1 MODIFY REPLICA ON N'SQLDEV3' WITH (AVAILABILITY_MODE=ASYNCHRONOUS_COMMIT);
GO


And the final step is to transfer the core Cluster Resource back to a server in the Primary Data Center.
In Failover Cluster Manager



we see that the active virtual IP is 192.168.69.134 which is an IP in the Secondary Data Center.

To move the core cluster resources back to SQLDEV1

Right-click on the cluster name and select


More Actions à Move Core Cluster Resources à Select Node

When completed (in ~10 seconds) the status will show that current Host server is SQLDEV1





Comments

Popular posts from this blog

Create 3-Node Windows 2012 Multi-subnet Cluster

Environment There are two Data centers connected via a WAN link. Two Windows 2012 Servers (called SQLDEV1 and SQLDEV2) are located in the Primary Data Center (on the IP subnet 192.168.79.0/24) and the third server is placed in the Secondary Data Center with the 192.168.69.0/24 subnet. We’ll be creating a three-node Windows cluster with no shared storage on the multi subnet network with a file share witness at the Primary Data Center. We’ll be using a file share witness to protect from the cluster failure in a situation when the network between the Data Centers is unavailable and one of the servers in the Primary Data Center is also down (or being rebooted). The final state will look like depicted above: -           Two Virtual IP’s will be assigned (192.168.76.218 and 192.168.69.134) to the cluster -           The servers at the Primary Data Center will have a vote (Vote=1) and the ...

Joining Windows 10 to Azure AD Domain

As of October 2016 to join Windows 10 computers to Azure AD Domain service requires these steps: Create a VNET in the classic portal . The VNET must be placed to a region where Azure AD domain service is available (( https://azure.microsoft.com/en-us/regions/services/ )  In the classic portal  go to Directory -> Configure and enable the domain service. And wait for ~ 30 min When completed the IP address will be populated Go back to the VNET configuration and add a DNS server with the IP (10.0.0.4 in this case) Create the "AAD DC Administrator" administrators group (again in Directory -> Group). Members of this group are granted administrative privileges on machines that are domain-joined to the Azure AD Domain Services managed domain. Add to the group your users who are supposed to have the administrative access on a Windows 10 computer go to Settings -> Accounts (this is true for Windows 10 version  1607) then select 'Access...

SQL 2012 AlwaysOn: Synchronous vs. Asynchronous commit. Performance impact

Recently I've had a chance to build a 3-server AlwaysOn environment distributed between the primary and secondary data centers. The configuration looks like this: Primary Data Center                         Secondary Data Center                        SQLDEV1                                        SQLDEV3          SQLDEV2 The availability group was crated with synchronous commit replicas on SQLDEV1 and SQLDEV2 and the replica on SQLDEV3 was configured for asynchronous commit. The link between the data centers was not great and when I pinged SQLDEV3 from SQLDEV1 I got these results Approximate round trip times in milli-seconds:     Minimum = 39ms, Maximum = 63ms, Average = 42ms I also created a very simp...