Configure a Pacemaker cluster for SQL Server availability groups

Applies to: SQL Server - Linux

This article describes how to create a three-node cluster on Linux using Pacemaker, and add a previously created availability group as a resource in the cluster. For high availability, an availability group on Linux requires three nodes - see High availability and data protection for availability group configurations.

Note

Bias-free communication

This article contains references to the term slave, a term Microsoft considers offensive when used in this context. The term appears in this article because it currently appears in the software. When the term is removed from the software, we will remove it from the article.

SQL Server isn't as tightly integrated with Pacemaker on Linux as it is with Windows Server failover clustering (WSFC). A SQL Server instance isn't aware of the cluster, and all orchestration is from the outside in. Pacemaker provides cluster resource orchestration. Also, the virtual network name is specific to Windows Server failover clustering; there's no equivalent in Pacemaker. Availability group dynamic management views (DMVs) that query cluster information return empty rows on Pacemaker clusters.

You can still create a listener for transparent reconnection after failover, but you have to manually register the listener name in the DNS server with the IP used to create the virtual IP resource (as explained in the following sections).

The following sections walk through the steps to set up a Pacemaker cluster and add an availability group as resource in the cluster for high availability, for each supported Linux distribution.

Roadmap

The steps to create an availability group on Linux servers for high availability are different from the steps on a Windows Server failover cluster. The following list describes the high-level steps:

  1. Install SQL Server. For more information, see Installation guidance for SQL Server on Linux.

  2. Configure a SQL Server Always On availability group for high availability on Linux.

  3. Configure a cluster resource manager, like Pacemaker. These instructions are in this article.

    The way to configure a cluster resource manager depends on the specific Linux distribution.

    Important

    Production environments require a fencing agent for high availability. The examples in this article don't use fencing agents. They are for testing and validation only.

    A Linux cluster uses fencing to return the cluster to a known state. The way to configure fencing depends on the distribution and the environment. Currently, fencing isn't available in some cloud environments.

    Fencing is normally implemented at the operating system level and depends on the environment. Find instructions for fencing in your operating system distributor's documentation.

  4. Add the availability group as a resource in the cluster.

Install and configure Pacemaker on each cluster node

  1. On all nodes, open the firewall ports. Open the ports for the Pacemaker high-availability services, the SQL Server instance, and the availability group endpoint. The default TCP port for an instance of SQL Server is 1433.

    sudo ufw allow 2224/tcp  # pcsd (cluster management daemon)
    sudo ufw allow 3121/tcp  # Pacemaker Remote
    sudo ufw allow 21064/tcp # DLM (Distributed Lock Manager)
    sudo ufw allow 5405/udp  # Corosync cluster communication
    
    sudo ufw allow 1433/tcp # Replace with TDS endpoint
    sudo ufw allow 5022/tcp # Replace with DATA_MIRRORING endpoint
    
    sudo ufw reload
    

    Alternatively, you can disable the firewall, but this isn't recommended in a production environment:

    sudo ufw disable
    
  2. Install the Pacemaker packages. On all nodes, run the following command for Ubuntu 20.04. For more information about installing on previous versions, see Ubuntu HA - MS SQL Server on Azure.

    sudo apt-get install -y pacemaker pacemaker-cli-utils crmsh resource-agents fence-agents corosync python3-azure
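
    If the pacemaker and corosync services aren't enabled to start at boot after installation (whether they are can vary by distribution and version), you can enable them:

    sudo systemctl enable pacemaker corosync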
    
  3. Set the password for the default user that is created when installing Pacemaker and Corosync packages. Use the same password on all nodes.

    sudo passwd hacluster
    

Create the cluster

  1. Before you create a cluster, you must create an authentication key on the primary server and copy it to the other servers participating in the availability group.

    Use the following script to create an authentication key on the primary server:

    sudo corosync-keygen
    

    You can use scp to copy the generated key to other servers:

    sudo scp /etc/corosync/authkey dbadmin@server-02:/etc/corosync
    sudo scp /etc/corosync/authkey dbadmin@server-03:/etc/corosync
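
    On the destination servers, it's worth confirming that the copied key is owned by root and readable only by root, which matches the permissions that corosync-keygen creates:

    sudo chown root:root /etc/corosync/authkey
    sudo chmod 400 /etc/corosync/authkey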
    
  2. To create the cluster, edit the /etc/corosync/corosync.conf file on the primary server:

    sudo vim /etc/corosync/corosync.conf
    

    The corosync.conf file should look similar to the following example:

    totem {
        version: 2
        cluster_name: agclustername
        transport: udpu
        crypto_cipher: none
        crypto_hash: none
    }
    logging {
        fileline: off
        to_stderr: yes
        to_logfile: yes
        logfile: /var/log/corosync/corosync.log
        to_syslog: yes
        debug: off
        logger_subsys {
            subsys: QUORUM
            debug: off
        }
    }
    quorum {
        provider: corosync_votequorum
    }
    nodelist {
        node {
            name: server-01
            nodeid: 1
            ring0_addr: 10.0.0.4
        }
        node {
            name: server-02
            nodeid: 2
            ring0_addr: 10.0.0.5
        }
        node {
            name: server-03
            nodeid: 3
            ring0_addr: 10.0.0.6
        }
    }
    

    Replace the corosync.conf file on other nodes:

    sudo scp /etc/corosync/corosync.conf dbadmin@server-02:/etc/corosync
    sudo scp /etc/corosync/corosync.conf dbadmin@server-03:/etc/corosync
    

    Restart the pacemaker and corosync services:

    sudo systemctl restart pacemaker corosync
    

    Confirm the status of the cluster and verify the configuration:

    sudo crm status
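
    You can also check the Corosync ring status on each node with corosync-cfgtool, which ships with the corosync package:

    sudo corosync-cfgtool -s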
    

Considerations for multiple network interfaces (NICs)

When setting up high availability with servers that have multiple NICs, follow these suggestions:

  • Make sure the hosts file is set up so that the server IP addresses for the multiple NICs resolve to the hostname of the Linux server on each node.

  • When you set up the cluster by using Pacemaker, using the hostnames of the servers causes Corosync to configure all of the NICs. You want the Pacemaker/Corosync communication to go over a single NIC only, so once the Pacemaker cluster is configured, modify the corosync.conf file and update the IP address to the dedicated NIC you want to use for the Pacemaker/Corosync communication.

  • The <hostname> given in the corosync.conf file should be the same as the output of a reverse lookup (ping -a <ip_address>), and should be the short name configured on the host. Make sure the hosts file also maps each IP address to the proper hostname.

The changes to the corosync.conf file example are shown below:

  nodelist {
    node {
        ring0_addr: <ip_address_of_node1_NIC1>
        name: <hostname_of_node1>
        nodeid: 1
    }
    node {
        ring0_addr: <ip_address_of_node2_NIC1>
        name: <hostname_of_node2>
        nodeid: 2
    }
    node {
        ring0_addr: <ip_address_of_node3_NIC1>
        name: <hostname_of_node3>
        nodeid: 3
    }
  }
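
After you update ring0_addr on each node, copy corosync.conf to the other nodes and restart the services, reusing the example host and user names from earlier in this article:

sudo scp /etc/corosync/corosync.conf dbadmin@server-02:/etc/corosync
sudo scp /etc/corosync/corosync.conf dbadmin@server-03:/etc/corosync
sudo systemctl restart pacemaker corosync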

Configure a fencing device

Pacemaker cluster vendors require fencing a failed node by using a fencing device that's configured for a supported cluster setup. When the cluster resource manager can't determine the state of a node or of a resource on a node, fencing brings the cluster to a known state again.

Resource level fencing ensures that no data corruption occurs if there's an outage. You can use resource level fencing, for instance, with DRBD (Distributed Replicated Block Device) to mark the disk on a node as outdated when the communication link goes down.

Node level fencing ensures that a node doesn't run any resources. This is done by resetting the node, and the Pacemaker implementation is called STONITH. Pacemaker supports a great variety of fencing devices, for example, an uninterruptible power supply or management interface cards for servers.

For more information, see Pacemaker Clusters from Scratch and Fencing and Stonith.

Because the node level fencing configuration depends heavily on your environment, we disable it for this tutorial (it can be configured at a later time). Run the following script on the primary node:

sudo crm configure property stonith-enabled=false

In this example, disabling fencing is just for testing purposes. If you plan to use Pacemaker in a production environment, you should plan a fencing implementation depending on your environment and keep it enabled. Contact the operating system vendor for information about fencing agents for any specific distribution.
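
You can confirm the current value of the property with crm configure show, which prints the full cluster configuration:

sudo crm configure show | grep stonith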

Set cluster property cluster-recheck-interval

The cluster-recheck-interval property indicates the polling interval at which the cluster checks for changes in the resource parameters, constraints, or other cluster options. If a replica goes down, the cluster tries to restart the replica at an interval that is bound by the failure-timeout value and the cluster-recheck-interval value. For example, if failure-timeout is set to 60 seconds and cluster-recheck-interval is set to 120 seconds, the restart is tried at an interval that is greater than 60 seconds but less than 120 seconds. You should set failure-timeout to 60 seconds, and cluster-recheck-interval to a value that is greater than 60 seconds. Setting cluster-recheck-interval to a smaller value isn't recommended.

To update the property value to 2 minutes, run:

sudo crm configure property cluster-recheck-interval=2min

If you already have an availability group resource managed by a Pacemaker cluster, Pacemaker package 1.1.18-11.el7 introduced a behavior change for the start-failure-is-fatal cluster setting when its value is false. This change affects the failover workflow. If a primary replica experiences an outage, the cluster is expected to fail over to one of the available secondary replicas. Instead, users notice that the cluster keeps trying to start the failed primary replica. If that primary never comes online (because of a permanent outage), the cluster never fails over to another available secondary replica. Because of this change, a previously recommended configuration to set start-failure-is-fatal is no longer valid, and the setting needs to be reverted back to its default value of true.

Additionally, the AG resource needs to be updated to include the failure-timeout property.

To update the property value to true, run:

sudo crm configure property start-failure-is-fatal=true

Update your existing AG resource property failure-timeout to 60s by running the following command (replace ag1 with the name of your availability group resource):

sudo crm resource meta ag1 set failure-timeout 60s
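
To confirm the change, you can display the resource definition (again, replace ag1 with the name of your availability group resource):

sudo crm configure show ag1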

Install SQL Server resource agent for integration with Pacemaker

Run the following command on all nodes.

sudo apt-get install mssql-server-ha

Create a SQL Server login for Pacemaker

Caution

Your password should follow the SQL Server default password policy. By default, the password must be at least eight characters long and contain characters from three of the following four sets: uppercase letters, lowercase letters, base-10 digits, and symbols. Passwords can be up to 128 characters long. Use passwords that are as long and complex as possible.

  1. On all SQL Server instances, create a server login for Pacemaker.

    The following Transact-SQL creates a login. Replace <password> with your own complex password.

    USE [master];
    GO
    
    CREATE LOGIN [pacemakerLogin]
        WITH PASSWORD = N'<password>';
    
    ALTER SERVER ROLE [sysadmin] ADD MEMBER [pacemakerLogin];
    

    The Pacemaker login requires ALTER, CONTROL, and VIEW DEFINITION permissions on the availability group, after the availability group is created but before any nodes are added to it.
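
    For example, after the availability group exists (this article's examples name it ag1), permissions like the following can be granted with Transact-SQL. Granting VIEW SERVER STATE as well is common so the login can query instance state; this sketch isn't needed if the login is a member of sysadmin, as in the previous example:

    GRANT ALTER, CONTROL, VIEW DEFINITION ON AVAILABILITY GROUP::ag1 TO pacemakerLogin;
    GRANT VIEW SERVER STATE TO pacemakerLogin;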

  2. On all SQL Server instances, save the credentials for the SQL Server login.

    Replace <password> with your own complex password.

    echo 'pacemakerLogin' >> ~/pacemaker-passwd
    echo '<password>' >> ~/pacemaker-passwd
    sudo mv ~/pacemaker-passwd /var/opt/mssql/secrets/passwd
    sudo chown root:root /var/opt/mssql/secrets/passwd
    sudo chmod 400 /var/opt/mssql/secrets/passwd # Only readable by root
    

Create availability group resource

To create the availability group resource, use the sudo crm configure command to set the resource properties. The following example creates a primary/replica type resource ocf:mssql:ag for an availability group with name ag1.

sudo crm

configure

primitive ag1_cluster \
    ocf:mssql:ag \
    params ag_name="ag1" \
    meta failure-timeout=60s \
    op start timeout=60s \
    op stop timeout=60s \
    op promote timeout=60s \
    op demote timeout=10s \
    op monitor timeout=60s interval=10s \
    op monitor timeout=60s on-fail=demote interval=11s role="Master" \
    op monitor timeout=60s interval=12s role="Slave" \
    op notify timeout=60s

ms ms-ag1 ag1_cluster \
    meta master-max="1" master-node-max="1" clone-max="3" \
    clone-node-max="1" notify="true"

commit

Note

When you create the resource, and periodically afterwards, the Pacemaker resource agent automatically sets the value of REQUIRED_SYNCHRONIZED_SECONDARIES_TO_COMMIT on the availability group based on the availability group's configuration. For example, if the availability group has three synchronous replicas, the agent will set REQUIRED_SYNCHRONIZED_SECONDARIES_TO_COMMIT to 1. For details and additional configuration options, see High availability and data protection for availability group configurations.
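
To see the value that the resource agent set, you can query sys.availability_groups on the primary replica; the setting is exposed as a column:

SELECT name, required_synchronized_secondaries_to_commit
FROM sys.availability_groups;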

Create virtual IP resource

To create the virtual IP address resource, run the following command on one node. Before you run the script, replace the example IP address (10.128.16.240) with an available static IP address from your network.

sudo crm configure primitive virtualip \
ocf:heartbeat:IPaddr2 \
params ip=10.128.16.240

There's no virtual server name equivalent in Pacemaker. To use a connection string that points to a server name instead of an IP address, register the virtual IP resource's address and the desired virtual server name in DNS. For DR configurations, register the desired virtual server name and IP address with the DNS servers on both the primary and DR sites.
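
For example, a minimal way to test name resolution without touching DNS is a hosts-file entry on a client machine. Here, ag1-listener is a hypothetical name, and the IP address is the one assigned to the virtualip resource:

# /etc/hosts on a client (testing only; register the name in DNS for production)
10.128.16.240   ag1-listener

Clients can then connect by name, for example with sqlcmd:

sqlcmd -S ag1-listener -U <login> -P '<password>'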

Add colocation constraint

Almost every decision in a Pacemaker cluster, like choosing where a resource should run, is done by comparing scores. Scores are calculated per resource, and the cluster resource manager chooses the node with the highest score for a particular resource. (If a node has a negative score for a resource, the resource can't run on that node.)

Use constraints to configure the decisions of the cluster. Constraints have a score. If a constraint has a score lower than INFINITY, it's only a recommendation. A score of INFINITY means it's mandatory.

To ensure that the primary replica and the virtual IP resource run on the same host, define a colocation constraint with a score of INFINITY. To add the colocation constraint, run the following command on one node (virtualip is the virtual IP resource created in the previous section):

sudo crm configure colocation ag-with-listener INFINITY: virtualip ms-ag1:Master

Add ordering constraint

The colocation constraint has an implicit ordering constraint. It moves the virtual IP resource before it moves the availability group resource. By default, the sequence of events is:

  1. A user issues a resource move command (for example, crm resource move) to move the availability group primary from node1 to node2.

  2. The virtual IP resource stops on node1.

  3. The virtual IP resource starts on node2.

    At this point, the IP address temporarily points to node2 while node2 is still a pre-failover secondary.

  4. The availability group primary on node1 is demoted to secondary.

  5. The availability group secondary on node2 is promoted to primary.

To prevent the IP address from temporarily pointing to the node with the pre-failover secondary, add an ordering constraint.

To add an ordering constraint, run the following command on one node:

sudo crm configure order ag-before-listener Mandatory: ms-ag1:promote virtualip:start

After you configure the cluster and add the availability group as a cluster resource, you can't use Transact-SQL to fail over the availability group resources. SQL Server cluster resources on Linux aren't coupled as tightly with the operating system as they are on a Windows Server Failover Cluster (WSFC). The SQL Server service isn't aware of the presence of the cluster. All orchestration is done through the cluster management tools.
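
For example, a planned manual failover is performed with the cluster tools instead of Transact-SQL. The following is a sketch using crmsh and the resource names from this article; crm resource move creates a location constraint, so remove it afterward with crm resource unmove to let the cluster place resources freely again:

# Move the availability group primary to server-02
sudo crm resource move ms-ag1 server-02

# After the failover completes, remove the location constraint created by the move
sudo crm resource unmove ms-ag1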

