Job Submission Tips
Submitting jobs responsibly ensures fair access to compute resources for all ACE users. Request only what you need, test before scaling up, and monitor your jobs to understand their resource usage.
Request Only What You Need
Over-requesting resources wastes cluster capacity and increases your queue wait time.
Nodes and Tasks
# BAD: Requesting more resources than needed
#SBATCH --nodes=10
#SBATCH --ntasks-per-node=32
# ... but your code only uses 1 node
# GOOD: Match resources to your actual needs
#SBATCH --nodes=1
#SBATCH --ntasks=8
Time Limits
Request realistic time limits:
# BAD: Always requesting maximum time
#SBATCH --time=7-00:00:00 # 7 days "just in case"
# GOOD: Based on tested runtime + buffer
#SBATCH --time=04:00:00 # 4 hours (tested: 3 hours + 1 hour buffer)
Why this matters:
- Shorter jobs can backfill into scheduling gaps and start sooner
- Over-requested time reserves resources that others could have used
- The backfill scheduler can only slot your job into a gap if your requested time limit fits inside it
Memory
# Specify memory only if you know your requirements;
# use either --mem or --mem-per-cpu, not both
#SBATCH --mem=16G # Total memory per node
#SBATCH --mem-per-cpu=4G # Memory per CPU core
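Taken together, the directives above form a complete submission script. A minimal sketch (the job name, program, and input file are placeholders; set the values from your own testing):

```shell
#!/bin/bash
#SBATCH --job-name=my_job        # placeholder name
#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --time=04:00:00          # tested runtime + buffer
#SBATCH --mem=16G                # only if you know your requirement

srun ./my_program input.dat      # placeholder program and input
```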
Test Before Scaling
Always verify your job works at small scale before requesting large resources.
Interactive Testing
# Start an interactive session
srun --nodes=1 --ntasks=1 --time=00:30:00 --pty bash
# Test your code runs
./my_program --test-mode
Small Batch Test
# Submit a minimal test job
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --time=00:30:00
./my_program --small-input
Scale Up Gradually
# After small tests succeed, gradually increase
# 1 node → 2 nodes → 4 nodes → target size
# Monitor each step before proceeding
Monitor Resource Usage
Understanding how your jobs use resources helps you request appropriately.
During Job Execution
# Check running jobs
squeue -u $USER
# Get detailed job info
scontrol show job <jobid>
# SSH to the compute node running your job and check usage (if allowed)
ssh <nodename>
top -u $USER
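For batch jobs, sstat can report live usage of a running job without logging into the node; a short sketch (the job ID is a placeholder, and the .batch suffix selects the job's batch step):

```shell
# Live CPU and memory statistics for a running job's batch step
sstat -j <jobid>.batch --format=JobID,AveCPU,AveRSS,MaxRSS
```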
After Job Completion
# View job efficiency
seff <jobid>
# Example output:
# Job ID: 12345
# CPU Efficiency: 85.2%
# Memory Efficiency: 62.3% of 16.00 GB
If CPU or memory efficiency is consistently low, reduce the corresponding request in your next submissions.
Using sacct
# View completed job statistics
sacct -j <jobid> --format=JobID,Elapsed,MaxRSS,MaxVMSize,CPUTime
# View your recent jobs
sacct -u $USER --starttime=2024-01-01 --format=JobID,JobName,Elapsed,State
Job Array Best Practices
Job arrays are efficient for parameter sweeps but require care:
# GOOD: Reasonable array size with throttling
#SBATCH --array=1-100%10 # 100 tasks, max 10 running at once
# BAD: Thousands of tiny tasks flooding the scheduler
#SBATCH --array=1-10000 # No throttling
Throttle Large Arrays
# Limit concurrent tasks
#SBATCH --array=1-1000%50 # Max 50 running simultaneously
Combine Small Tasks
If each task runs for only seconds, combine them:
# Instead of 10,000 one-second tasks
# Create 100 tasks that each process 100 items
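One way to do this is to have each array task loop over a fixed-size chunk of items; a minimal sketch, assuming items are numbered and process_item stands in for the real per-item work:

```shell
#!/bin/bash
# Sketch: each array task handles CHUNK items instead of one.
# process_item is a placeholder for the real per-item command.
process_item() { echo "item $1"; }

CHUNK=100
TASK=${SLURM_ARRAY_TASK_ID:-1}        # set by Slurm inside a job array
START=$(( (TASK - 1) * CHUNK + 1 ))   # first item for this task
END=$(( TASK * CHUNK ))               # last item for this task

for i in $(seq "$START" "$END"); do
  process_item "$i" > /dev/null
done
echo "task $TASK processed items $START-$END"
```

Submitted with --array=1-100, this covers 10,000 items with 100 tasks instead of 10,000.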
Avoid Common Mistakes
Don't Flood the Queue
# BAD: Submitting thousands of jobs in a loop
for i in {1..5000}; do
  sbatch job_$i.sh
done
# GOOD: Use job arrays
#SBATCH --array=1-5000%100
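Inside the array job, each task can select its own input from a shared list via $SLURM_ARRAY_TASK_ID; a sketch (the input list is created inline here purely for illustration — normally it already exists):

```shell
#!/bin/bash
# Sketch: array task N processes line N of a shared input list.
printf 'alpha\nbeta\ngamma\n' > inputs.txt   # stand-in for a real list

TASK=${SLURM_ARRAY_TASK_ID:-1}               # set by Slurm inside a job array
INPUT=$(sed -n "${TASK}p" inputs.txt)        # pick line N of the list
echo "task $TASK processing $INPUT"
```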
Don't Ignore Failed Jobs
# Check for failures
sacct -u $USER --state=FAILED --starttime=2024-01-01
# Investigate before resubmitting
cat slurm-<jobid>.out
Don't Hardcode Paths
# BAD: Hardcoded paths that break when moved
cd /home/username/project
./run.sh
# GOOD: Use environment variables
cd $SLURM_SUBMIT_DIR
./run.sh
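The same idea extends to derived paths: compute them from the submission directory at runtime. A sketch (the results subdirectory is a placeholder; the fallback to pwd lets the script also run outside Slurm, where SLURM_SUBMIT_DIR is unset):

```shell
#!/bin/bash
# Sketch: derive paths from where the job was submitted instead of
# hardcoding them. Slurm sets SLURM_SUBMIT_DIR inside a job.
WORKDIR=${SLURM_SUBMIT_DIR:-$(pwd)}
OUTDIR="$WORKDIR/results"       # placeholder output location
mkdir -p "$OUTDIR"
echo "running from $WORKDIR, writing to $OUTDIR"
```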
Dependency Chains
Use job dependencies instead of sleep loops:
# Submit first job
JOB1=$(sbatch --parsable preprocess.sh)
# Submit dependent job
sbatch --dependency=afterok:$JOB1 analysis.sh
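Longer pipelines chain the same way. A sketch of a three-stage chain with an unconditional cleanup step (script names are placeholders; afterok fires only if the dependency completed successfully, afterany fires regardless of outcome):

```shell
# preprocess -> simulate -> postprocess, plus cleanup that always runs
PRE=$(sbatch --parsable preprocess.sh)
SIM=$(sbatch --parsable --dependency=afterok:$PRE simulate.sh)
sbatch --dependency=afterok:$SIM postprocess.sh
sbatch --dependency=afterany:$SIM cleanup.sh
```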