SLURM Job Scheduler Guide
SLURM (Simple Linux Utility for Resource Management) is the workload manager used on our cluster. It allocates resources, schedules jobs, and manages job queues.
Cluster Partitions
The cluster is divided into partitions (queues) with different resource limits:
compute
Standard compute partition for general-purpose jobs.
- Max runtime: 7 days
- Max cores per job: 64
- Max memory per job: 256 GB
debug
For quick testing and debugging; jobs get higher scheduling priority but tighter resource limits.
- Max runtime: 1 hour
- Max cores per job: 8
- Max memory per job: 32 GB
long
For long-running jobs that exceed the 7-day limit of the compute partition.
- Max runtime: 30 days
- Max cores per job: 32
- Max memory per job: 128 GB
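These limits can change as the cluster is reconfigured. To check the live settings of any partition, query SLURM directly with the standard scontrol command:
# Show the current limits and state of one partition
scontrol show partition compute
# Show all partitions
scontrol show partition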
Basic Commands
View Cluster Status
# Show all partitions and their status
sinfo
# Show detailed node information
sinfo -N -l
# Show only available (idle) nodes
sinfo -t idle
Submit Jobs
# Submit a batch script
sbatch myjob.sh
# Submit with specific resources
sbatch --partition=debug --time=00:30:00 --cpus-per-task=4 myjob.sh
# Submit interactive job
srun --pty --partition=debug --time=01:00:00 bash
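For quick one-off commands, sbatch can wrap a single command line into a batch script for you, so no script file is needed (--wrap is a standard sbatch option; the resources shown here are just an example):
# Submit a single command without writing a script file
sbatch --partition=debug --time=00:10:00 --wrap="./my_program --quick-test"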
Monitor Jobs
# Show your jobs
squeue -u $USER
# Show all jobs in detail
squeue -l
# Show job details
scontrol show job JOBID
# Find a job's output file, then follow it in real time
scontrol show job JOBID | grep StdOut
tail -f slurm-JOBID.out
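squeue only lists pending and running jobs. If job accounting is enabled on the cluster (it usually is on SLURM installations), sacct reports on jobs that have already finished:
# Show your jobs since midnight, including completed ones
sacct -u $USER --format=JobID,JobName,Partition,Elapsed,State,ExitCode
# Check peak memory use of a finished job
sacct -j JOBID --format=JobID,MaxRSS,Elapsed,State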
Manage Jobs
# Cancel a job
scancel JOBID
# Cancel all your jobs
scancel -u $USER
# Hold a job (prevent it from running)
scontrol hold JOBID
# Release a held job
scontrol release JOBID
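Pending jobs can also be modified in place instead of being cancelled and resubmitted. scontrol update works on jobs that have not started yet; note that raising limits (as opposed to lowering them) may require administrator rights:
# Shorten the time limit of a pending job
scontrol update JobId=JOBID TimeLimit=02:00:00
# Move a pending job to another partition
scontrol update JobId=JOBID Partition=debug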
Sample Job Scripts
Basic Serial Job
#!/bin/bash
#SBATCH --job-name=my_serial_job
#SBATCH --partition=compute
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
# %x expands to the job name, %j to the job ID
#SBATCH --output=%x_%j.out
echo "Starting job at $(date)"
./my_program
echo "Job finished at $(date)"
Parallel MPI Job
#!/bin/bash
#SBATCH --job-name=mpi_job
#SBATCH --partition=compute
#SBATCH --time=02:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --mem-per-cpu=2G
#SBATCH --output=%x_%j.out
module load openmpi
mpirun ./my_mpi_program
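On SLURM clusters, srun can often launch the MPI ranks directly, reading the task layout (here 2 nodes x 16 tasks) from the #SBATCH directives. Whether this works depends on how your MPI library was built, so treat it as an alternative to try:
# Alternative launcher: let SLURM place the 32 tasks itself
srun ./my_mpi_program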
GPU Job (if available)
#!/bin/bash
#SBATCH --job-name=gpu_job
#SBATCH --partition=gpu
#SBATCH --time=04:00:00
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
module load cuda
./my_gpu_program
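When GPUs are allocated through --gres, SLURM restricts the job to those devices, typically by setting CUDA_VISIBLE_DEVICES. A quick sanity check at the start of the script confirms the allocation (nvidia-smi assumes NVIDIA hardware on the node):
# Confirm which GPUs the job can see
echo "Allocated GPUs: $CUDA_VISIBLE_DEVICES"
nvidia-smi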
Best Practices
Tips for Efficient Cluster Usage
- Always request only the resources you need
- Use the debug partition for testing before submitting long jobs
- Specify walltime accurately - jobs with shorter, realistic time limits can backfill into scheduling gaps and start sooner
- Use job arrays for parameter sweeps instead of many individual jobs
- Save intermediate results - jobs may be preempted or fail
- Use scratch space for temporary files and home for important data (see the sketch after this list)
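A minimal sketch of the scratch-then-copy pattern from the last tip. The path /scratch/$USER is an assumption - substitute whatever your cluster provides (some sites expose it as $SCRATCH or $SLURM_TMPDIR):
#!/bin/bash
#SBATCH --job-name=scratch_demo
#SBATCH --partition=compute
#SBATCH --time=01:00:00
# Hypothetical scratch location - replace with your cluster's path
WORKDIR=/scratch/$USER/$SLURM_JOB_ID
mkdir -p "$WORKDIR"
cd "$WORKDIR"
# Stage input, run on fast local storage, then copy results home
cp "$HOME/inputs/data.dat" .
"$SLURM_SUBMIT_DIR/my_program" data.dat
cp results.out "$HOME/results/"
# Clean up scratch
cd && rm -rf "$WORKDIR"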
Job Array Example
Job arrays are useful for running the same computation with different parameters:
#!/bin/bash
#SBATCH --job-name=array_job
#SBATCH --array=1-10
#SBATCH --partition=compute
#SBATCH --time=01:00:00
#SBATCH --cpus-per-task=2
echo "Processing task $SLURM_ARRAY_TASK_ID"
./my_program --input "input_${SLURM_ARRAY_TASK_ID}.dat"
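Two array-specific details are worth knowing: a % suffix on --array caps how many tasks run at once (useful on shared partitions), and the output pattern accepts %A (array job ID) and %a (task ID) so each task writes its own file:
#SBATCH --array=1-100%10        # 100 tasks, at most 10 running at once
#SBATCH --output=%x_%A_%a.out   # one output file per array task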
Additional Resources