4.7 Submitting Jobs with Slurm

Slurm is the job scheduler used by ParallelCluster. It manages job queues, allocates compute nodes, and handles scaling automatically.

Key Slurm Commands

| Command | Description |
|---|---|
| `sinfo` | List partitions (queues) and node status |
| `squeue` | List queued and running jobs |
| `sbatch script.sh` | Submit a job script |
| `scancel <job-id>` | Cancel a job |
| `scontrol show job <job-id>` | Show detailed job info |

Understanding Node States

When you run sinfo, nodes will show different states:

| State | Description |
|---|---|
| `idle~` | Powered down: no instance is running; one launches when a job is assigned to the node |
| `idle%` | Instance is powering down after sitting idle for the timeout (default 10 minutes) |
| `mix` | Instance running and partially allocated |
| `alloc` | Instance running and fully allocated |
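As an illustration (not real output; partition names, node names, and counts depend entirely on your cluster configuration), a cluster with most nodes powered down might show:

```
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      6  idle~ compute-dy-compute-[3-8]
compute*     up   infinite      1    mix compute-dy-compute-1
compute*     up   infinite      1  alloc compute-dy-compute-2
```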

Job Status Codes

When monitoring with squeue, the ST column shows:

| Code | Status | Description |
|---|---|---|
| `PD` | Pending | Waiting for resource allocation |
| `CF` | Configuring | ParallelCluster is provisioning instances |
| `R` | Running | Job script is executing |
| `CG` | Completing | Job is finishing up |
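If you script around `squeue`, the compact `%t` format field emits these same codes. As a small sketch, a hypothetical helper (`st_name` is not part of Slurm) can expand a code into a readable label:

```shell
# Hypothetical helper: expand an squeue ST code into a readable status.
# In practice you might feed it codes from: squeue -h -o %t
st_name() {
  case "$1" in
    PD) echo "Pending"     ;;
    CF) echo "Configuring" ;;
    R)  echo "Running"     ;;
    CG) echo "Completing"  ;;
    *)  echo "Unknown"     ;;
  esac
}

st_name CF   # prints: Configuring
```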

Writing a Job Script

A basic Slurm job script:

```bash
#!/bin/bash
#SBATCH --job-name=my-simulation
#SBATCH --partition=compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=96
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log

# Load your software
spack load openfoam

# Run your simulation (2 nodes x 96 tasks per node = 192 ranks)
mpirun -np 192 mySimulation
```

Submit it with:

```bash
sbatch my-job.sh
```
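If you want to script around submission, `sbatch --parsable` prints only the job ID, which you can capture and reuse. A sketch (requires a running cluster, so it is not runnable locally; `my-job.sh` is the script above):

```bash
jobid=$(sbatch --parsable my-job.sh)   # prints just the job ID
squeue -j "$jobid"                     # check its status
scancel "$jobid"                       # ...or cancel it
```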

Monitoring Jobs

```bash
# Watch the job queue in real time
watch squeue

# Check node status
sinfo

# View detailed job info
scontrol show job <job-id>
```

How Auto-Scaling Works

ParallelCluster automatically manages compute nodes based on your job queue:

  1. You submit a job with sbatch
  2. Slurm sees the resource request and signals ParallelCluster
  3. ParallelCluster launches the required compute instances
  4. Instances join the cluster and the job starts running
  5. When the job completes, instances stay idle for the cooldown period (default 10 minutes)
  6. After the cooldown, idle instances are terminated

This means you only pay for compute when jobs are actually running.
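The cooldown in step 5 is configurable in the cluster configuration file. A sketch, assuming ParallelCluster 3's YAML schema (`ScaledownIdletime` is specified in minutes):

```yaml
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    ScaledownIdletime: 10   # minutes an instance may sit idle before termination
```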

Multiple Queues

Your cluster can have multiple queues with different instance types:

```bash
# Submit to a specific queue
sbatch --partition=hpc6a my-job.sh

# Submit to a different queue
sbatch --partition=c6i my-job.sh
```
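You can also pin the queue inside the script itself; a `--partition` flag on the `sbatch` command line takes precedence over the script's directive:

```bash
#SBATCH --partition=hpc6a   # default queue for this script; CLI flag overrides it
```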

Cleaning Up

When you’re done with your cluster, delete it to stop all charges:

  • PCUI: Select your cluster → Click Delete
  • CLI: `pcluster delete-cluster --cluster-name my-hpc-cluster`
> **Warning:** Deleting a cluster removes all resources, including storage. Back up your results to S3 before deleting.
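A minimal backup sketch, run on the head node before deletion. The source path and bucket name are placeholders for your own (and the command assumes AWS credentials are available, as they are on a ParallelCluster head node):

```bash
# /shared/results and my-results-bucket are hypothetical; substitute your own
aws s3 sync /shared/results s3://my-results-bucket/results/
```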

Having issues? Check the Troubleshooting & FAQs.