We have configured 3 SQL server in an AlwaysOn Availability group (see here)
Two servers (SQLDEV1 and SQLDEV2) are configured for automatic failover.
To test how long it takes for SQL Server to fail over from SQLDEV1 to SQLDEV2 we connect to the availability group listener name (sqldevcluster1)
and on the "Additional Connection Parameters" tab we need to add
Multisubnetfailover=TRUE
We can check the current status of the servers in the availability group with this query
and it returns
If we connect to SQLDEV2 and initiate manual failover
this statement takes 17 seconds to complete.
And if we check the status of the availability group from the session connected to sqldevcluster1 we get the transport-level error (because the original connection was closed). This error message was returned in about 10 seconds.
If we check the status of the availability group again the query returns the results instantaneously
and we see that the primary role is owned by SQLDEV2.
Now if we stop the SQL Server service on SQLDEV2 then in about 15 seconds we'll see roles swapped as
and if we re-run the query in 5 seconds the connection with SQLDEV2 is established
and in a couple of seconds the synchronization gets to the healthy status.
If we reboot the SQLDEV1 server now the failover process is exactly the same. in about 20 seconds SQLDEV2 takes over the primary role and then automatically synchronizes itself with SQLDEV1 when it comes back online.
The Disaster Recovery process is different because it needs some additional actions and manual steps.
When the primary data center is inaccessible the windows cluster service on SQLDEV3 cannot start because there. is no quorum. So, our first step is to force the cluster service to start without quorum.
To do this we need to execute the following command from the elevated command prompt
net start clussvc /forcequorum
After forcing the cluster service to start we will see on SQLDEV3 in SSMS
When the Primary Data Center comes back online and both SLDEV1 and SQLDEV2 servers are operational we have to resume the replication within the availability group.
We have to connecte to each of SQLDEV1 and SQLDEV2 and resume the replication by executing the following statements
To transfer the primary role back to SQLDEV1 we need temporarily to change the replication mode to synchronous commit on all three servers.
We do this from SQLDEV3 by executing the following statements:
Then when connected to SQLDEV1 we do automatic failover
And after successful role transfer to SQLDEV1 we change the replication mode on SQLDEV3 back to asynchronous commit
And the final step is to transfer the core Cluster Resource back to a server in the Primary Data Center.
In Failover Cluster Manager
we see that the active virtual IP is 192.168.69.134 which is an IP in the Secondary Data Center.
Two servers (SQLDEV1 and SQLDEV2) are configured for automatic failover.
To test how long it takes for SQL Server to fail over from SQLDEV1 to SQLDEV2 we connect to the availability group listener name (sqldevcluster1)
and on the "Additional Connection Parameters" tab we need to add
Multisubnetfailover=TRUE
We can check the current status of the servers in the availability group with this query
select rs.role_desc, rs.connected_state_desc,
rs.synchronization_health_desc,
cs.replica_server_name
from sys.dm_hadr_availability_replica_states AS rs
JOIN sys.dm_hadr_availability_replica_cluster_states as CS ON rs.replica_id=cs.replica_id
and it returns
If we connect to SQLDEV2 and initiate manual failover
ALTER AVAILABILITY GROUP AG1 FAILOVER;
this statement takes 17 seconds to complete.
And if we check the status of the availability group from the session connected to sqldevcluster1 we get the transport-level error (because the original connection was closed). This error message was returned in about 10 seconds.
Msg 121, Level
20, State 0, Line 0
A transport-level
error has occurred when receiving results from the server. (provider: TCP
Provider, error: 0 - The semaphore timeout period has expired.)
Recovery from a Disaster
Now we need to initiate manual failover with possible data lost
ALTER AVAILABILITY GROUP AG1 FORCE_FAILOVER_ALLOW_DATA_LOSS;
And this will make SQLDEV3 our primary as show below
Resuming Normal Operations
ALTER DATABASE TESTDB1 SET HADR RESUME;
ALTER DATABASE TESTDB2 SET HADR RESUME;
USE master
Go
ALTER AVAILABILITY GROUP AG1 MODIFY REPLICA ON N'SQLDEV1' WITH (AVAILABILITY_MODE=SYNCHRONOUS_COMMIT);
GO
ALTER AVAILABILITY GROUP AG1 MODIFY REPLICA ON N'SQLDEV2' WITH (AVAILABILITY_MODE=SYNCHRONOUS_COMMIT);
GO
ALTER AVAILABILITY GROUP AG1 MODIFY REPLICA ON N'SQLDEV3' WITH (AVAILABILITY_MODE=SYNCHRONOUS_COMMIT);
GO
ALTER AVAILABILITY GROUP AG1 FAILOVER
ALTER AVAILABILITY GROUP AG1 MODIFY REPLICA ON N'SQLDEV3' WITH (AVAILABILITY_MODE=ASYNCHRONOUS_COMMIT);
GO
To move the core cluster resources back to SQLDEV1
Right-click on the cluster name and select
More Actions à Move Core Cluster Resources à Select Node
When
completed (in ~10 seconds) the status will show that current Host server is
SQLDEV1
Comments
Post a Comment