You can enable OpenPBS on a CycleCloud cluster by changing the `run_list` in the configuration section of your cluster definition. A PBS Professional (PBS Pro) cluster has two main parts: the primary node, which runs the PBS Pro software on a shared filesystem, and the execute nodes, which mount that filesystem and run the submitted jobs. For example, a simple cluster template snippet might look like this:
```ini
[cluster my-pbspro]

[[node master]]
    ImageName = cycle.image.centos7
    MachineType = Standard_A4 # 8 cores

    [[[configuration]]]
    run_list = role[pbspro_master_role]

[[nodearray execute]]
    ImageName = cycle.image.centos7
    MachineType = Standard_A1 # 1 core

    [[[configuration]]]
    run_list = role[pbspro_execute_role]
```
When you import and start a cluster with this definition in CycleCloud, you get a single primary node. You can add execute nodes to the cluster by using the `cyclecloud add_node` command. For example, to add 10 more execute nodes, run:

```bash
cyclecloud add_node my-pbspro -t execute -c 10
```
PBS resource-based autoscaling
CycleCloud maintains two resources to expand the dynamic provisioning capability: `nodearray` and `machinetype`.
When you submit a job and specify a node array resource, such as `qsub -l nodearray=highmem -- /bin/hostname`, CycleCloud adds nodes to the node array named `highmem`. If that node array doesn't exist, the job stays idle.
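For the `highmem` request above to be matched, a node array with that name must exist in the cluster template. The following is a minimal sketch of such a node array, modeled on the `gpu` example later in this article; the VM size is an illustrative choice, not a value from this article:

```ini
# Hypothetical highmem node array; Standard_E16s_v3 is an illustrative VM size.
[[nodearray highmem]]
    Extends = execute
    MachineType = Standard_E16s_v3
```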
When you specify a machine type resource in a job submission, such as `qsub -l machinetype=Standard_L32s_v2 my-job.sh`, CycleCloud autoscales `Standard_L32s_v2` machines in the `execute` (default) node array. If that machine type isn't available in the `execute` node array, the job stays idle.
You can use these resources together, as in:

```bash
qsub -l nodes=8:ppn=16:nodearray=hpc:machinetype=Standard_HB60rs my-simulation.sh
```

This request autoscales only if `Standard_HB60rs` machines are specified in the `hpc` node array.
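If you prefer to embed the same request in a job script instead of passing it on the command line, a sketch might look like the following; the job name and the `mpirun` launch line are placeholders:

```bash
#!/bin/bash
#PBS -N my-simulation
#PBS -l nodes=8:ppn=16:nodearray=hpc:machinetype=Standard_HB60rs
# Placeholder application launch; replace with your actual MPI command.
mpirun ./my-simulation
```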
Adding extra queues assigned to node arrays
On clusters with multiple node arrays, create separate queues to automatically route jobs to the appropriate VM type. In this example, assume the following `gpu` node array is defined in your cluster template:
```ini
[[nodearray gpu]]
    Extends = execute
    MachineType = Standard_NC24rs

    [[[configuration]]]
    pbspro.slot_type = gpu
```
After you import the cluster template and start the cluster, run the following commands on the server node to create the `gpu` queue:
/opt/pbs/bin/qmgr -c "create queue gpu"
/opt/pbs/bin/qmgr -c "set queue gpu queue_type = Execution"
/opt/pbs/bin/qmgr -c "set queue gpu resources_default.ungrouped = false"
/opt/pbs/bin/qmgr -c "set queue gpu resources_default.place = scatter"
/opt/pbs/bin/qmgr -c "set queue gpu resources_default.slot_type = gpu"
/opt/pbs/bin/qmgr -c "set queue gpu default_chunk.ungrouped = false"
/opt/pbs/bin/qmgr -c "set queue gpu default_chunk.slot_type = gpu"
/opt/pbs/bin/qmgr -c "set queue gpu enabled = true"
/opt/pbs/bin/qmgr -c "set queue gpu started = true"
Note

As shown in the example, the queue definition packs all VMs in the queue into a single virtual machine scale set to support MPI jobs. To define the queue for serial jobs and allow multiple virtual machine scale sets, set `ungrouped = true` for both `resources_default` and `default_chunk`. Set `resources_default.place = pack` if you want the scheduler to pack jobs onto VMs instead of using round-robin allocation. For more information on PBS job packing, see the official PBS Professional OSS documentation.
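Based on that note, a serial (non-MPI) queue might be defined as in the following sketch. The queue name `htc` and slot type `htc` are illustrative; `ungrouped = true` allows nodes from multiple scale sets, and `place = pack` fills nodes before starting new ones:

```bash
/opt/pbs/bin/qmgr -c "create queue htc"
/opt/pbs/bin/qmgr -c "set queue htc queue_type = Execution"
/opt/pbs/bin/qmgr -c "set queue htc resources_default.ungrouped = true"
/opt/pbs/bin/qmgr -c "set queue htc resources_default.place = pack"
/opt/pbs/bin/qmgr -c "set queue htc resources_default.slot_type = htc"
/opt/pbs/bin/qmgr -c "set queue htc default_chunk.ungrouped = true"
/opt/pbs/bin/qmgr -c "set queue htc default_chunk.slot_type = htc"
/opt/pbs/bin/qmgr -c "set queue htc enabled = true"
/opt/pbs/bin/qmgr -c "set queue htc started = true"
```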
PBS Professional configuration reference
The following table describes the PBS Professional (PBS Pro) specific configuration options you can toggle to customize functionality:
| PBS Pro Options | Description |
| --- | --- |
| `pbspro.slots` | The number of slots for a given node to report to PBS Pro. The number of slots is the number of concurrent jobs a node can execute. This value defaults to the number of CPUs on a given machine. You can override this value in cases where you don't run jobs based on CPU but on memory, GPUs, or other resources. |
| `pbspro.slot_type` | The name of the type of 'slot' a node provides. The default is 'execute'. When you tag a job with the hard resource `slot_type=<type>`, the job runs only on machines with the same slot type. This setting lets you create different software and hardware configurations for each node and ensures that the right job is always scheduled on the correct type of node. |
| `pbspro.version` | Default: '18.1.3-0'. This version is currently the default and only option to install and run. More versions of the PBS Pro software might be supported in the future. |
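As a sketch of how these options fit into a cluster template, the following hypothetical node array overrides both settings; the node array name, VM size, and slot count are illustrative, not values from this article:

```ini
[[nodearray bigmem]]
    Extends = execute
    MachineType = Standard_E32s_v3

    [[[configuration]]]
    # Jobs tagged with the hard resource slot_type=bigmem land only on these nodes.
    pbspro.slot_type = bigmem
    # Report two slots per node instead of one slot per CPU.
    pbspro.slots = 2
```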
Connect PBS with CycleCloud
CycleCloud manages OpenPBS clusters through an installable agent called `azpbs`. This agent connects to CycleCloud to read cluster and VM configurations, and it integrates with OpenPBS to process job and host information. All `azpbs` configuration lives in the `autoscale.json` file, usually located at `/opt/cycle/pbspro/autoscale.json`. For example:
"password": "260D39rWX13X",
"url": "https://cyclecloud1.contoso.com",
"username": "cyclecloud_api_user",
"logging": {
"config_file": "/opt/cycle/pbspro/logging.conf"
},
"cluster_name": "mechanical_grid",
Important files
The `azpbs` agent parses the PBS configuration (jobs, queues, and resources) each time it's called. The agent writes this information to the stderr and stdout of the command and to a log file, both at configurable levels. It also logs all PBS management commands (`qcmd`) with their arguments to a file.

You can find all of these files in the `/opt/cycle/pbspro/` directory where the agent is installed.
| File | Location | Description |
| --- | --- | --- |
| Autoscale Config | `autoscale.json` | Configuration for autoscale, the resource map, and CycleCloud access information |
| Autoscale Log | `autoscale.log` | Agent main thread logging, including CycleCloud host management |
| Demand Log | `demand.log` | Detailed log for resource matching |
| qcmd Trace Log | `qcmd.log` | Log of the agent's `qcmd` calls |
| Logging Config | `logging.conf` | Configuration for logging masks and file locations |
Defining OpenPBS Resources
The cyclecloud-pbspro (`azpbs`) project lets you associate OpenPBS resources with Azure VM resources. You define this resource relationship in `autoscale.json`.

The cluster template includes the following default resources:
{"default_resources": [
{
"select": {},
"name": "ncpus",
"value": "node.vcpu_count"
},
{
"select": {},
"name": "group_id",
"value": "node.placement_group"
},
{
"select": {},
"name": "host",
"value": "node.hostname"
},
{
"select": {},
"name": "mem",
"value": "node.memory"
},
{
"select": {},
"name": "vm_size",
"value": "node.vm_size"
},
{
"select": {},
"name": "disk",
"value": "size::20g"
}]
}
The OpenPBS resource named `mem` corresponds to the node attribute `node.memory`, which represents the total memory of a virtual machine. This mapping lets `azpbs` handle a resource request such as `-l mem=4gb` by comparing the job's resource requirements to the node's resources.
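For example, a submission like the following only matches nodes whose `node.memory` satisfies the request; the script name is a placeholder:

```bash
# Request one chunk with 4 GB of memory; my-job.sh is a placeholder script.
qsub -l select=1:mem=4gb my-job.sh
```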
Currently, the disk size is set to `size::20g`. Here's an example of how to handle a VM size-specific disk size:
```json
   {
      "select": {"node.vm_size": "Standard_F2"},
      "name": "disk",
      "value": "size::20g"
   },
   {
      "select": {"node.vm_size": "Standard_H44rs"},
      "name": "disk",
      "value": "size::2t"
   }
```
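You can define additional default resources in the same way. The following sketch maps a hypothetical `ngpus` PBS resource to a `node.gpu_count` node attribute; both names are assumptions for illustration and aren't part of the default template shown above:

```json
   {
      "select": {},
      "name": "ngpus",
      "value": "node.gpu_count"
   }
```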
Autoscale and scale sets
CycleCloud treats spanning and serial jobs differently in OpenPBS clusters. Spanning jobs land on nodes that are part of the same placement group. A placement group has a particular platform meaning (a VirtualMachineScaleSet with SinglePlacementGroup=true), and CycleCloud manages a named placement group for each spanned node set. Use the PBS resource `group_id` for this placement group name.
The `hpc` queue appends the equivalent of `-l place=scatter:group=group_id` by using native queue defaults.
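For example, a spanning job submitted to the `hpc` queue inherits that placement behavior without any extra flags; the select statement and script name below are illustrative:

```bash
# All eight chunks are placed in the same placement group (scale set).
qsub -q hpc -l select=8:ncpus=16 my-simulation.sh
```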
Installing the CycleCloud OpenPBS Agent azpbs
The OpenPBS CycleCloud cluster manages the installation and configuration of the agent on the server node. The preparation steps include setting PBS resources, queues, and hooks. You can also perform a scripted installation outside of CycleCloud.
```bash
# Prerequisite: python3, 3.6 or newer, must be installed and in the PATH
wget https://github.com/Azure/cyclecloud-pbspro/releases/download/2.0.5/cyclecloud-pbspro-pkg-2.0.5.tar.gz
tar xzf cyclecloud-pbspro-pkg-2.0.5.tar.gz
cd cyclecloud-pbspro
# Optional, but recommended. Adds relevant resources and enables strict placement
./initialize_pbs.sh
# Optional. Sets up workq as a colocated, MPI focused queue and creates htcq for non-MPI workloads.
./initialize_default_queues.sh
# Creates the azpbs autoscaler
./install.sh --venv /opt/cycle/pbspro/venv
# Insert your CycleCloud username, password, URL, and cluster name here.
./generate_autoscale_json.sh --install-dir /opt/cycle/pbspro \
    --username user \
    --password password \
    --url https://fqdn:port \
    --cluster-name cluster_name
azpbs validate
```
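As an optional post-install check, you can list the hooks registered with the PBS server; `list hook` is standard `qmgr` syntax, and whether autoscale-related hooks appear (and their names) depends on the package version:

```bash
# List PBS server hooks; the installer is expected to register autoscale hooks,
# but the exact hook names vary by package version.
/opt/pbs/bin/qmgr -c "list hook"
```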
CycleCloud supports a standard set of autostop attributes across schedulers:
| Attribute | Description |
| --- | --- |
| `cyclecloud.cluster.autoscale.stop_enabled` | Enables autostop on this node. [true/false] |
| `cyclecloud.cluster.autoscale.idle_time_after_jobs` | The amount of time (in seconds) for a node to sit idle after completing jobs before it autostops. |
| `cyclecloud.cluster.autoscale.idle_time_before_jobs` | The amount of time (in seconds) for a node to sit idle before completing jobs before it autostops. |
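A minimal sketch of how these attributes can appear in a cluster template's configuration section follows; the 300-second idle times are illustrative values:

```ini
    [[[configuration]]]
    cyclecloud.cluster.autoscale.stop_enabled = true
    # Stop the node after it has been idle for 5 minutes following its last job.
    cyclecloud.cluster.autoscale.idle_time_after_jobs = 300
    # Stop the node if it sits idle for 5 minutes before running any jobs.
    cyclecloud.cluster.autoscale.idle_time_before_jobs = 300
```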
Note

CycleCloud doesn't support the bursting configuration with OpenPBS.

Note

Even though Windows is an officially supported OpenPBS platform, CycleCloud doesn't support running OpenPBS on Windows at this time.