Managing Jobs
In the [/docs/jobs](running jobs) page you can see how to submit jobs to the batch queue. Here we will look at how to monitor the progress of jobs you have submitted, or look at the metrics from jobs you have submitted in the past.
Looking at the queue
Once you have submitted some jobs to the queue there will be no indication of progress until the job is returned to you. However, you can monitor the current state of the queue to see what jobs you have running and where they are. For simple monitoring there is a command line tool called show_queue which will show you what is currently in the queue. If you run this it will show you a summary of all of the jobs that you are currently running. By default it will show jobs from all users, but if you add -u [your username] then it will only show your jobs.
$ show_queue -u inglesfs
JOBID PARTITI NAME USER STATE TIME MIN_MEM TASKS MIN_CPU NODELIST
83842 normal wait_R2 inglesfs PENDING 0:00 1G 1 1 Dependency
89681 normal nf-BISMARK_(EM_He inglesfs RUNNING 4:05:56 20G 1 5 compute-0-8
89677 normal nf-BISMARK_(EM_Bu inglesfs RUNNING 4:07:07 20G 1 5 compute-0-2
89675 normal nf-BISMARK_(EM_Bu inglesfs RUNNING 4:07:09 20G 1 5 compute-0-10
89672 normal nf-BISMARK_(EM_So inglesfs RUNNING 4:11:07 20G 1 5 compute-0-3
89668 normal nf-BISMARK_(EM_He inglesfs RUNNING 4:11:14 20G 1 5 compute-0-7
89664 normal nf-BISMARK_(EM_Bu inglesfs RUNNING 4:13:07 20G 1 5 compute-0-7
You’ll obviously use your own username in the command (unless you want to see what everyone is running)
The STATE column tells you the status of each job. The main ones you’re likely to see are:
• RUNNING The job is running • PENDING The job is paused waiting for resources to become available or dependent jobs to finish
Initially, if you submit multiple jobs you may only see one of them starting, but others should start after a brief delay. The cluster will work through your jobs in order until they are all complete.
Monitoring completed jobs
If you want to look at jobs which have finished running to see if they completed OK, how long they took, and how much memory they used then you can do this using the show_jobs command.
$ show_jobs
ID Date Name CPU Rmem Umem Elapsed Stat
----- ----------- --------------- --- ----- ----- --------- ----
89790 24/09@15:44 test_job 2 2.0 0.0 2s COMP
89789 24/09@15:44 test_job 2 2.0 0.0 2s COMP
89788 24/09@15:44 test_job 2 2.0 0.0 2s COMP
89787 24/09@15:44 test_job 2 2.0 0.0 1s COMP
89786 24/09@15:44 test_job 2 2.0 0.0 1s COMP
89689 24/09@12:26 juypterserv 1 20.0 0.1 33s CANC
89688 24/09@12:23 juypterserv 1 20.0 0.1 44s CANC
89686 24/09@12:17 juypterserv 1 20.0 0.1 2m24s CANC
89684 24/09@12:12 bash 1 19.5 0.0 1m12s COMP
89683 24/09@12:08 juypterserv 1 20.0 0.0 2s FAIL
By default this will show you your 20 most recent jobs which finished in the last 2 weeks. If you want to see more jobs then you can add -n 100 to show you the last 100 jobs.
The last column in this output says what happened to the job
COMP = completed successfully CANC = cancelled by you FAIL = job exited in an error state OOM = job used more memory than it was allocated
The Rmem and Umem show the amount of memory requested, and then actually used.
Removing jobs from the queue
If you decide that having started a job you no longer want it to run then you can delete it from the queue at any time. You can only delete your own jobs though.
To remove a job first use squeue to find the job id for the job you want to remove. You can then use scancel to remove the job. You can specify multiple ids separated by spaces.
$ show_queue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2914 normal test1 rodrigue R 0:04 1 pebble001
2915 normal test2 rodrigue R 0:04 1 pebble001
2916 normal test3 rodrigue R 0:04 1 pebble001
2917 normal test4 rodrigue R 0:04 1 pebble001
2918 normal test5 rodrigue R 0:04 1 pebble002
2919 normal test6 rodrigue R 0:04 1 pebble002
2920 normal test7 rodrigue R 0:04 1 pebble002
2921 normal test8 rodrigue R 0:04 1 pebble002
2922 normal test9 rodrigue R 0:04 1 pebble003
2923 normal test10 rodrigue R 0:04 1 pebble003
$ scancel 2918 2919 2920
$ show_queue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2914 normal test1 rodrigue R 0:04 1 pebble001
2915 normal test2 rodrigue R 0:04 1 pebble001
2916 normal test3 rodrigue R 0:04 1 pebble001
2917 normal test4 rodrigue R 0:04 1 pebble001
2921 normal test8 rodrigue R 0:04 1 pebble002
2922 normal test9 rodrigue R 0:04 1 pebble003
2923 normal test10 rodrigue R 0:04 1 pebble003