Building a Highly Available HPC Pack Cluster in Azure
In this article we provide the steps and considerations for building a highly available HPC Pack cluster in Azure.
Considerations for Cluster High Availability
A typical HPC Pack cluster consists of: a SQL Server instance hosting the databases that store HPC jobs; a head node that runs critical services such as the Scheduler service and the SDM service; and a set of compute nodes that connect to the head node services and run users' HPC workloads. In addition, a domain controller is needed to authenticate clients. All these components are interconnected through the network.
In an Azure cloud environment, any of the above components may fail: for example, the head node may reboot for a Windows update, or some compute nodes may reboot because you are using low-priority VMs. So how can we set up a highly available HPC Pack cluster that satisfies the following requirements?
If any component mentioned above fails, users' workloads keep running without being cancelled or failed
Tasks running on a failed compute node are re-scheduled to other compute nodes
The cluster continues to provide its functionality, including cluster management and job management
Let's discuss each component's failure scenario and its high availability solution.
Dealing with database failure
You have a couple of choices for a highly available SQL database in the cloud:
Use Azure SQL Database
Use an ARM template to deploy a SQL Server AlwaysOn cluster; you can refer to this blog
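As a sketch of the first option, the Azure SQL resources can be provisioned with the Az PowerShell module before cluster deployment. The resource group, server, and location names below are placeholders; HPC Pack uses several databases (such as HPCScheduler and HPCManagement) that the cluster setup populates:

```powershell
# Sketch: provision an Azure SQL logical server for the HPC Pack databases
# (Az PowerShell module; names here are examples only).
$rg  = "hpc-rg"
$srv = "hpcsqlsrv"   # server name must be globally unique
New-AzSqlServer -ResourceGroupName $rg -ServerName $srv -Location "westus2" `
    -SqlAdministratorCredentials (Get-Credential)
# Create a database for each HPC Pack database, e.g. the scheduler database.
New-AzSqlDatabase -ResourceGroupName $rg -ServerName $srv `
    -DatabaseName "HPCScheduler" -Edition "Standard"
```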
Dealing with Head node failure
Set up a cluster with at least three head nodes. With this configuration, if any head node fails, the active HPC services on that head node move to the remaining head nodes.
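Once the head nodes are deployed, you can confirm their state from HPC PowerShell; a minimal check, assuming the HPC Pack client tools are installed on the machine you run it from:

```powershell
# Sketch: list the head nodes and their health, so you can verify
# the cluster can tolerate a single head node failure.
Get-HpcNode -GroupName HeadNodes |
    Format-Table NetBiosName, NodeState, NodeHealth
```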
Dealing with AD failure
When HPC Pack cannot connect to the domain controller, admins and users are unable to connect to the HPC services, and thus cannot manage the cluster or submit jobs to it. New jobs also fail to start on the domain-joined compute nodes, because the NodeManager service cannot validate the jobs' credentials. You should therefore consider the options below:
Deploy a highly available domain controller together with your HPC Pack cluster in Azure
Use Azure AD Domain Services. During cluster deployment, simply join all your cluster nodes to this domain, and you get the highly available domain service from Azure.
Use the HPC Pack Azure AD integration solution without joining the cluster nodes to any domain; the cluster keeps working as long as the HPC services have connectivity to the Azure AD service.
Dealing with Network failure
The network within an Azure datacenter is itself highly available, so we do not need a backup network.
Building a Highly Available HPC Pack Cluster
We have an ARM template here that can deploy a highly available HPC Pack cluster with options to:
Create Azure SQL databases
Connect to an existing Active Directory domain
Create a 3-head-node HPC Pack cluster
Template: High-availability cluster with Azure SQL databases for Windows workloads with existing Active Directory Domain
This template deploys an HPC Pack cluster with high availability for Windows HPC workloads in an existing Active Directory domain forest. The cluster includes three head nodes, Azure SQL databases, and a configurable number of Windows compute nodes.
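Deploying such a template typically looks like the following; the template and parameter file names are placeholders for the actual template linked above:

```powershell
# Sketch: deploy the high-availability HPC Pack ARM template into a resource
# group (Az PowerShell; the file names below are hypothetical).
New-AzResourceGroup -Name "hpc-rg" -Location "westus2"
New-AzResourceGroupDeployment -ResourceGroupName "hpc-rg" `
    -TemplateFile ".\hpcpack-ha-cluster.json" `
    -TemplateParameterFile ".\hpcpack-ha-cluster.parameters.json"
```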
HPC Pack Cluster Shares
Currently, all HPC Pack ARM templates create the cluster shares on one of the head nodes, which is not highly available: if that head node goes down, the shares are no longer accessible to the HPC services running on the other head nodes. This does not, however, impact running jobs or node management.
With Azure Files, these file shares can be moved to Azure Files shares with SMB permissions to make them highly available. Please refer to this doc.
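A minimal sketch of creating such a share with the Az PowerShell module (the storage account and share names are examples; identity-based SMB permissions still need to be configured as described in the doc):

```powershell
# Sketch: create an Azure Files share that can replace a head-node share.
$rg = "hpc-rg"
$sa = New-AzStorageAccount -ResourceGroupName $rg -Name "hpcshares001" `
    -Location "westus2" -SkuName Standard_LRS
New-AzStorageShare -Name "hpcdiagnostics" -Context $sa.Context
```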
| Share Name | Usage | Default Location | Impact when down | How to make highly available |
|---|---|---|---|---|
| Remote Install Share | After cluster setup, the HPC Pack setup binaries are placed in this share so that client machines and compute machines can install directly from it. | `\\<HN3>\REMINST` | No impact on any existing functionality of the HPC cluster. | The cluster admin can create the same share on the other two head nodes and copy the setup binaries there as well, so the share remains available if any one head node goes down. |
| HPC SOA Registration Share | Stores SOA service registration files. | `\\<HN3>\HpcServiceRegistration` | SOA service jobs that rely on the registration files in this share fail to run. | When registering a new SOA service configuration file, do not put the file in the share; instead use Import High Available Configuration File... in Cluster Manager to import it into the HPC cluster reliable store, so the registration file stays available even when the share is down. |
| HPC SOA Runtime Share | Stores SOA jobs' common data. | `\\<HN3>\Runtime$` | SOA jobs with common data fail. | SOA clients should put the common data into Azure storage so that it is still available even if the runtime share is down. |
| HPC SOA TraceRepository | Repository for SOA diagnostics traces. | `\\<HN3>\TraceRepository` | If SOA diagnostics tracing is turned on, trace collection fails. | Use an Azure Files share. |
| HPC Diagnostics Share | Stores diagnostics test results. | `\\<HN3>\Diagnostics` | HPC diagnostics jobs fail because there is no place to write test results. | The cluster admin can switch to a new diagnostics share before running diagnostic tests, with the HPC PowerShell cmdlet `Set-HpcClusterRegistry -PropertyName DiagnosticsShare -PropertyValue "\\<HN2>\Diagnostics"` |
| CcpSpoolDir | Output spool share for compute nodes. | `\\<HN3>\CcpSpoolDir` | Tasks that write output to the spool fail to write their output data. | Use an Azure Files share. |