# Troubleshooting Jobs
When a SLURM job fails to start or runs into errors, there are several tools and techniques to help diagnose and fix the issue.
## Common Issues and Solutions
### Job Won't Start
| Check | Explanation |
| --- | --- |
| `squeue -u $USER` | View your job's status in the queue. It might be waiting for resources. |
| `scontrol show job <id>` | Get detailed info on job state, pending reasons, and requested resources. |
| Resource limits | SLURM may delay your job if `--mem`, `--cpus-per-task`, or `--time` is too high. |
| Partition constraints | Ensure you're submitting to the correct partition, if applicable. |
| Node availability | Use `sinfo` to check whether nodes are online and available. |
Try lowering resource requests temporarily or switching partitions (if available) to test submission.
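For example, a quick diagnostic pass from the login node might look like the sketch below; `12345` stands in for your actual job ID.

```bash
# Where is my job in the queue, and is it still pending?
squeue -u $USER

# Full details for one job, including the "Reason" it is pending
scontrol show job 12345

# Which partitions and nodes are currently up and idle?
sinfo
```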
### Job Fails Immediately
| Symptom | Solution |
| --- | --- |
| `.err` file has errors | Always check this file for Python, R, Bash, or SLURM errors. |
| `command not found` | Ensure the right modules or environments are loaded in the script. |
| Segmentation fault or core dump | Likely a bug in your code or an incompatible module; run a smaller test. |
| `Permission denied` | Check file and directory permissions in your script and input/output paths. |
| Syntax errors | Review your script syntax carefully (Bash/SLURM). Use `shellcheck` to lint. |
Also confirm that any scripts are executable:

```bash
chmod +x script.sh
```
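As a reference point, a minimal batch script that loads its software explicitly and writes separate output and error files might look like this sketch; the module name and the analysis script are assumptions, so substitute whatever your cluster and workflow actually use.

```bash
#!/bin/bash
#SBATCH --job-name=test-run
#SBATCH --output=test-run.%j.out   # %j expands to the job ID
#SBATCH --error=test-run.%j.err    # check this file first when the job fails
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --mem=2G

# Loading modules inside the script avoids "command not found" errors
# (python/3.11 is an example module name)
module load python/3.11

python my_analysis.py   # placeholder for your actual command
```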
## Helpful Debugging Tips
Add logging:

```bash
echo "Starting at $(date)"
```

Print the environment:

```bash
env
```

Add `set -x` in Bash scripts to trace execution.
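Combined, a debugging preamble near the top of a batch script could look like this sketch; the environment snapshot filename is just an example.

```bash
set -x                                      # trace each command as it runs
echo "Starting at $(date) on $(hostname)"

# Save the job's environment so it can be compared with a working
# interactive session later
env | sort > "env.${SLURM_JOB_ID:-manual}.txt"
```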
## Debugging with an Interactive Session
You can test interactively before submitting a batch job:
```bash
salloc --time=01:00:00 --ntasks=1 --mem=4G
```
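Depending on the cluster configuration, `salloc` either drops you into a shell inside the allocation or lets you launch commands with `srun`. A typical session might look like the following; the module and script names are placeholders.

```bash
# Once the allocation is granted, reproduce the failing steps by hand
module load python/3.11
srun python my_analysis.py

# Release the allocation when you are done
exit
```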
## Still Stuck?
Contact support@ace-bioinformatics.org with:
- Job ID and the output of `sacct -j <id>` (see the example below)
- Your SLURM script
- Error file output
- Any steps you've already taken to debug
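For example, the following captures the accounting summary that is most useful to include; `12345` is a placeholder job ID, and the format fields shown are a common selection rather than an exhaustive list.

```bash
# Summarise the job's state, exit code, runtime, and memory use
sacct -j 12345 --format=JobID,JobName,Partition,State,ExitCode,Elapsed,MaxRSS
```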
This helps the support team resolve the issue faster.