4.7 Submitting Jobs with Slurm

Slurm is the job scheduler used by ParallelCluster. It manages job queues, allocates compute nodes, and handles scaling automatically.

Key Slurm Commands

| Command | Description |
|---|---|
| `sinfo` | List partitions (queues) and node status |
| `squeue` | List queued and running jobs |
| `sbatch script.sh` | Submit a job script |
| `scancel <job-id>` | Cancel a job |
| `scontrol show job <job-id>` | Show detailed job info |

Understanding Node States

When you run sinfo, nodes will show different states:

| State | Description |
|---|---|
| `idle~` | Powered down: no instance is running; one launches when a job is assigned to the node |
| `idle%` | Instance is powering down after sitting idle for the timeout (default 10 minutes) |
| `mix` | Instance running and partially allocated |
| `alloc` | Instance running and fully allocated |
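As an illustration (not real output; partition names, node names, and counts depend entirely on your cluster configuration), a cluster with most nodes powered down might show:

```
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      6  idle~ compute-dy-compute-[3-8]
compute*     up   infinite      1    mix compute-dy-compute-1
compute*     up   infinite      1  alloc compute-dy-compute-2
```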

Job Status Codes

When monitoring with squeue, the ST column shows:

| Code | Status | Description |
|---|---|---|
| `PD` | Pending | Waiting for resource allocation |
| `CF` | Configuring | ParallelCluster is provisioning instances |
| `R` | Running | Job script is executing |
| `CG` | Completing | Job is finishing up |
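If you script around `squeue`, the compact `%t` format field emits these same codes. As a small sketch, a hypothetical helper (`st_name` is not part of Slurm) can expand a code into a readable label:

```shell
# Hypothetical helper: expand an squeue ST code into a readable status.
# In practice you might feed it codes from: squeue -h -o %t
st_name() {
  case "$1" in
    PD) echo "Pending"     ;;
    CF) echo "Configuring" ;;
    R)  echo "Running"     ;;
    CG) echo "Completing"  ;;
    *)  echo "Unknown"     ;;
  esac
}

st_name CF   # prints: Configuring
```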

Writing a Job Script

A basic Slurm job script:

```bash
#!/bin/bash
#SBATCH --job-name=my-simulation
#SBATCH --partition=compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=96
#SBATCH --output=output_%j.log
#SBATCH --error=error_%j.log

# Load your software
spack load openfoam

# Run your simulation (2 nodes x 96 tasks per node = 192 ranks)
mpirun -np 192 mySimulation
```

Submit it with:

```bash
sbatch my-job.sh
```
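If you want to script around submission, `sbatch --parsable` prints only the job ID, which you can capture and reuse. A sketch (requires a running cluster, so it is not runnable locally; `my-job.sh` is the script above):

```bash
jobid=$(sbatch --parsable my-job.sh)   # prints just the job ID
squeue -j "$jobid"                     # check its status
scancel "$jobid"                       # ...or cancel it
```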

Monitoring Jobs

```bash
# Watch the job queue in real time
watch squeue

# Check node status
sinfo

# View detailed job info
scontrol show job <job-id>
```

How Auto-Scaling Works

ParallelCluster automatically manages compute nodes based on your job queue:

  1. You submit a job with sbatch
  2. Slurm sees the resource request and signals ParallelCluster
  3. ParallelCluster launches the required compute instances
  4. Instances join the cluster and the job starts running
  5. When the job completes, instances stay idle for the cooldown period (default 10 minutes)
  6. After the cooldown, idle instances are terminated

This means you only pay for compute when jobs are actually running.
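The cooldown in step 5 is configurable in the cluster configuration file. A sketch, assuming ParallelCluster 3's YAML schema (`ScaledownIdletime` is specified in minutes):

```yaml
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    ScaledownIdletime: 10   # minutes an instance may sit idle before termination
```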

Multiple Queues

Your cluster can have multiple queues with different instance types:

```bash
# Submit to a specific queue
sbatch --partition=hpc6a my-job.sh

# Submit to a different queue
sbatch --partition=c6i my-job.sh
```
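You can also pin the queue inside the script itself; a `--partition` flag on the `sbatch` command line takes precedence over the script's directive:

```bash
#SBATCH --partition=hpc6a   # default queue for this script; CLI flag overrides it
```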

Cleaning Up

When you’re done with your cluster, delete it to stop all charges:

  • PCUI: Select your cluster → Click Delete
  • CLI: `pcluster delete-cluster --cluster-name my-hpc-cluster`
> **Warning:** Deleting a cluster removes all resources, including storage. Back up your results to S3 before deleting.
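A minimal backup sketch, run on the head node before deletion. The source path and bucket name are placeholders for your own (and the command assumes AWS credentials are available, as they are on a ParallelCluster head node):

```bash
# /shared/results and my-results-bucket are hypothetical; substitute your own
aws s3 sync /shared/results s3://my-results-bucket/results/
```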

Having issues? Check the Troubleshooting & FAQs.