# Troubleshooting Jobs
When a SLURM job fails to start or runs into errors, there are several tools and techniques to help diagnose and fix the issue.
## Common Issues and Solutions
### Job Won't Start
| Check | Explanation |
|---|---|
| `squeue -u $USER` | View your job's status in the queue. It might be waiting for resources. |
| `scontrol show job <id>` | Get detailed info on job state, pending reasons, and requested resources. |
| Resource limits | SLURM may delay your job if `--mem`, `--cpus-per-task`, or `--time` is too high. |
| Partition constraints | Ensure you're submitting to the correct partition, if applicable. |
| Node availability | Use `sinfo` to check whether nodes are online and available. |
Try lowering resource requests temporarily or switching partitions (if available) to test submission.
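For example, the pending reason is usually visible directly from `squeue` or `scontrol` (the job ID below is a placeholder):

```bash
# Show your queued jobs with state and pending-reason columns
squeue -u $USER --format="%.10i %.12P %.20j %.8T %.30R"

# Inspect one job in detail; Reason= explains why it has not started yet
scontrol show job 12345 | grep -E "JobState|Reason|Partition"
```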
### Job Fails Immediately
| Symptom | Solution |
|---|---|
| `.err` file has errors | Always check this file for Python, R, Bash, or SLURM errors. |
| "command not found" | Ensure the right modules or environments are loaded in the script. |
| Segmentation fault or core dump | Likely a bug in your code or an incompatible module; run a smaller test. |
| Permission denied | Check file and directory permissions in your script or input/output paths. |
| Syntax errors | Review your script syntax carefully (Bash/SLURM). Use `shellcheck` to lint. |
Also confirm that any scripts are executable:
```bash
chmod +x script.sh
```
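A frequent cause of "command not found" is a batch script that never loads the software it calls. A minimal sketch of a script header, assuming your cluster uses environment modules (the module name and script name are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=test_run
#SBATCH --time=00:10:00
#SBATCH --mem=2G
#SBATCH --output=test_run.%j.out
#SBATCH --error=test_run.%j.err

# Load required software before calling it; check what exists with `module avail`
module load python/3.10

python my_analysis.py
```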
## Helpful Debugging Tips
- Add logging: `echo "Starting at $(date)"`
- Print the environment: `env`
- Add `set -x` in Bash scripts to trace execution.
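Put together, a debugging-friendly batch script might start like this (a sketch; the job name and resource requests are placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=debug_run
#SBATCH --time=00:30:00
#SBATCH --mem=4G

set -x                                    # trace each command as it executes
echo "Starting at $(date) on $(hostname)"

env | sort > "env_${SLURM_JOB_ID}.txt"    # save the environment for later inspection

# ... your actual commands here ...

echo "Finished at $(date)"
```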
## Debugging with an Interactive Session
You can test interactively before submitting a batch job:
```bash
salloc --time=01:00:00 --ntasks=1 --mem=4G
```
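Once the allocation is granted, you typically land in a shell tied to those resources; from there you can load modules and rerun the failing step by hand (module and script names below are placeholders):

```bash
# Inside the salloc shell: recreate what the batch script does, one step at a time
module load python/3.10
srun python my_analysis.py          # runs on the allocated resources
exit                                # release the allocation when done
```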
## Still Stuck?
Contact support@ace-bioinformatics.org with:
- Job ID
- Output of `sacct -j <id>`
- Your SLURM script
- Error file output
- Any steps you've already taken to debug
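For the accounting output, something like the following is a reasonable starting point (the job ID and field list are only examples):

```bash
sacct -j 12345 --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS
```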
This helps the support team resolve the issue faster.