How to monitor jobs

This article explains how to monitor batch jobs on LOTUS. It covers:

  • Job information
  • LSF commands for monitoring jobs 
  • History of jobs
  • Inspection of job output files

Job information

Information on your running batch jobs can be obtained via the bjobs command, with similar information available via the qstat command. Note that information on completed jobs is retained only for a limited period: details of jobs that ran in the last week are available via bhist. An example of the output from bjobs is shown below.

$ bjobs
JOBID     USER    STAT  QUEUE         FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
6485340   msmizie RUN   short-serial  jasmin-sci1 host177.jc. R_job[2]   Jul 21 11:46
6485340   msmizie RUN   short-serial  jasmin-sci1 host232.jc. R_job[3]   Jul 21 11:46
6485340   msmizie RUN   short-serial  jasmin-sci1 host150.jc. R_job[8]   Jul 21 11:46
6485340   msmizie RUN   short-serial  jasmin-sci1 host092.jc. R_job[9]   Jul 21 11:46
6485340   msmizie RUN   short-serial  jasmin-sci1 host136.jc. R_job[4]   Jul 21 11:46
6485340   msmizie RUN   short-serial  jasmin-sci1 host267.jc. R_job[1]   Jul 21 11:46
6485340   msmizie RUN   short-serial  jasmin-sci1 host124.jc. R_job[7]   Jul 21 11:46
6485340   msmizie RUN   short-serial  jasmin-sci1 host157.jc. R_job[10]  Jul 21 11:46
6485340   msmizie RUN   short-serial  jasmin-sci1 host198.jc. R_job[6]   Jul 21 11:46
6485340   msmizie RUN   short-serial  jasmin-sci1 host238.jc. R_job[5]   Jul 21 11:46

In this case, a 10-element job array has been submitted to the short-serial queue (QUEUE) from the server jasmin-sci1 (FROM_HOST). All 10 elements are currently running (STAT), each on a different execution host (EXEC_HOST).
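If you want to post-process this listing in a script, the columns can be split on whitespace. The following is a minimal Python sketch, not part of LSF: the names BjobsRow and parse_bjobs are hypothetical, and it assumes every column is populated, as in the listing above. Rows with an empty column would need fixed-width parsing instead.

```python
from collections import namedtuple

# Hypothetical record type mirroring the default bjobs columns.
BjobsRow = namedtuple(
    "BjobsRow",
    "jobid user stat queue from_host exec_host job_name submit_time",
)

# Two rows copied from the example listing, used here as captured text.
SAMPLE = """\
JOBID     USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
6485340   msmizie RUN   short-serial      jasmin-sci1 host177.jc. R_job[2]   Jul 21 11:46
6485340   msmizie RUN   short-serial      jasmin-sci1 host232.jc. R_job[3]   Jul 21 11:46
"""


def parse_bjobs(text):
    """Parse default bjobs output, assuming no column is empty."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    rows = []
    for line in lines[1:]:  # skip the header line
        # The first seven columns are single tokens; the remainder
        # (SUBMIT_TIME) contains spaces and is kept as one field.
        fields = line.strip().split(None, 7)
        if len(fields) == 8:  # skip lines with fewer than eight fields
            rows.append(BjobsRow(*fields))
    return rows


rows = parse_bjobs(SAMPLE)
for row in rows:
    print(row.job_name, row.stat, row.exec_host)
```

In practice the text could come from running bjobs and capturing its output, rather than from a hard-coded string.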

By default, bjobs lists your unfinished jobs (pending, running and suspended). Adding the -a argument also lists recently completed jobs. The -l argument gives detailed output, either for all jobs or, when followed by a job ID, for a single job.

A summary of the number of jobs in different states is returned by bjobs -sum, and the various job states are defined in Table 1, below:

Table 1: job states

Job state   Description
PEND        Waiting in a queue for scheduling and dispatch
RUN         Dispatched to a host and running
DONE        Finished normally with zero exit value
EXIT        Finished with non-zero exit value
PSUSP       Suspended while pending
USUSP       Suspended by user
SSUSP       Suspended by the LSF system
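As an illustration of how these states might be used in a script, the sketch below counts jobs per state and attaches the descriptions from Table 1. It is only an assumption-laden example: the list of observed states is made up, and the names JOB_STATES and summarise_states are hypothetical rather than part of LSF.

```python
from collections import Counter

# Job states and descriptions as given in Table 1.
JOB_STATES = {
    "PEND": "Waiting in a queue for scheduling and dispatch",
    "RUN": "Dispatched to a host and running",
    "DONE": "Finished normally with zero exit value",
    "EXIT": "Finished with non-zero exit value",
    "PSUSP": "Suspended while pending",
    "USUSP": "Suspended by user",
    "SSUSP": "Suspended by the LSF system",
}


def summarise_states(stats):
    """Count jobs per state and report any state not listed in Table 1."""
    counts = Counter(stats)
    unknown = sorted(set(counts) - set(JOB_STATES))
    return counts, unknown


# Made-up STAT values, e.g. as read from the STAT column of bjobs -a.
observed = ["RUN", "RUN", "RUN", "PEND", "PEND", "DONE"]
counts, unknown = summarise_states(observed)

for state, description in JOB_STATES.items():
    if counts[state]:
        print(f"{state:6} {counts[state]:3}  {description}")
```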

LSF commands for monitoring jobs

The most commonly used commands for monitoring batch jobs, and their options, are listed in Table 2, below:

Table 2: LSF commands and options for monitoring jobs

LSF command        Description
bjobs -a           Displays information about all your jobs, including those that recently finished (PEND, RUN, USUSP, PSUSP, SSUSP, DONE and EXIT states)
bjobs -r           Displays all your running jobs (RUN state)
bjobs -p           Displays information about your pending jobs (PEND state) and the reasons why they are pending
bjobs -sum         Shows a summary of the number of jobs in different states
bjobs -x           Lists your running jobs and also shows the nodes which they are running on
bjobs -u all       Lists all jobs running on the cluster
bjobs -u all -x    As above but detailing the executing node for each job
bjobs -l JOBID     Shows detailed information about your job (JOBID = job number) by searching the current event log file
bhist -t -T        Shows chronological history (-t) of jobs within a given time range (-T)
bhist -n 3         Searches lsb.events, lsb.events.1, lsb.events.2
bhist -n 0         Searches all LSF event log files

History of jobs

LSF periodically backs up and prunes the job history log. By default, bhist only displays job history from the current event log file. However, you can also display the history of jobs that completed some time ago and are no longer listed in the active event log.

The -n num_logfiles option tells bhist to search through the specified number of log files instead of only searching the current log file.

Log files are searched in reverse time order. For example, the command bhist -n 3 searches the current event log file and then the two most recent backup files (see Table 2).
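The relationship between the -n value and the files searched can be sketched in a few lines of Python. This is purely illustrative: the function name event_logs_searched is hypothetical, and the file naming (lsb.events, lsb.events.1, and so on) is taken from the entries in Table 2.

```python
def event_logs_searched(num_logfiles):
    """Return the event log files searched for a given bhist -n value.

    Returns None for 0, which Table 2 describes as "all event log files";
    the actual list then depends on what exists on the system.
    """
    if num_logfiles < 0:
        raise ValueError("num_logfiles must be zero or positive")
    if num_logfiles == 0:
        return None
    # The current log comes first, followed by backups from newest to oldest.
    backups = [f"lsb.events.{i}" for i in range(1, num_logfiles)]
    return ["lsb.events"] + backups


print(event_logs_searched(3))
```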

Example:

$ bhist -n 1000  -l  6135690
Job <6135690[1]>, Job Name <Myarray[1]>, User <fchami>, Project <default>, Comm
                     and <#!/bin/bash; #BSUB -q lotus;#BSUB -o output.%J.%I ;#B
                     SUB -e error.%J-%I ;#BSUB -J "Myarray[1-4]";##BSUB -i inpu
                     t.%I ;  echo HOSTNAME set to: ${HOSTNAME};echo LSB_JOBID s
                     et to: $LSB_JOBID;echo LSB_JOBINDEX set to: $LSB_JOBINDEX;
                      cat input.$LSB_JOBINDEX;  #bsub -J "cdcn_array[1-2]" -o o
                     utput.%J.%I -e error.%I -i input.%I /home/users/cdelcano/a
                     rray/array_test>
Mon Jul 18 15:59:22: Submitted from host <lotus.jc.rl.ac.uk>, to Queue <lotus>,
                      CWD <$HOME/0workshop/array_example>, Output File <output.
                     %J.%I>, Error File <error.%J-%I>;
Mon Jul 18 15:59:26: Dispatched to <host135.jc.rl.ac.uk>, Effective RES_REQ <se
                     lect[type == local] order[r15s:pg] same[nodetype] >;
Mon Jul 18 15:59:26: Starting (Pid 31976);
Mon Jul 18 15:59:26: Running with execution home </home/users/fchami>, Executio
                     n CWD </home/users/fchami/0workshop/array_example>, Execut
                     ion Pid <31976>;
Mon Jul 18 15:59:26: Done successfully. The CPU time used is 0.1 seconds;
Mon Jul 18 15:59:26: Post job process done successfully;

Summary of time in seconds spent in various states by  Mon Jul 18 15:59:26
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  4        0        0        0        0        0        4           
-----------------------------------------------------------------------------

In this example, bhist -n 1000 searches up to 1000 event log files, so it reports historical information about a job that is no longer listed in the active event log.
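The time summary at the end of the bhist output is a simple two-line table, so it is easy to read into a script. The sketch below is one possible approach, assuming the header and value lines have been captured as text; parse_time_summary is a hypothetical helper name.

```python
# Header and value lines copied from the bhist example above.
SUMMARY = """\
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  4        0        0        0        0        0        4
"""


def parse_time_summary(text):
    """Map each state name to the number of seconds spent in it."""
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    header, values = lines[0], lines[1]
    return dict(zip(header, (int(v) for v in values)))


times = parse_time_summary(SUMMARY)
print(times["PEND"], "seconds pending out of", times["TOTAL"], "in total")
```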

Inspection of job output files

An example of a job output file, taken from one of the batch jobs in Example Job 1: CMIP5 Archive Analysis, is shown below:

Sender: LSF System <lsfadmin@host105.jc.rl.ac.uk>
Subject: Job 2933066: <python2.7 arctic_mean.py /badc/cmip5/data/cmip5/output1/MPI-M/MPI-ESM-LR/rcp85/mon/atmos/Amon/r3i1p1
/latest/tas/ ./data/> in cluster <lotus> Done

Job <python2.7 arctic_mean.py /badc/cmip5/data/cmip5/output1/MPI-M/MPI-ESM-LR/rcp85/mon/atmos/Amon/r3i1p1/latest/tas/ ./dat
a/> was submitted from host <jasmin-sci1-panfs.ceda.ac.uk> by user <msmizielinski> in cluster <lotus>.
Job was executed on host(s) <host105.jc.rl.ac.uk>, in queue <short-serial>, as user <msmizielinski> in cluster <lotus>.
</home/users/msmizielinski> was used as the home directory.
</home/users/msmizielinski/arctic_mean_example> was used as the working directory.
Started at Wed Apr 27 11:43:48 2016
Results reported on Wed Apr 27 11:43:51 2016

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
python2.7 arctic_mean.py /badc/cmip5/data/cmip5/output1/MPI-M/MPI-ESM-LR/rcp85/mon/atmos/Amon/r3i1p1/latest/tas/ ./data/
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time :                                   1.10 sec.
    Max Memory :                                 55.55 MB
    Average Memory :                             55.55 MB
    Total Requested Memory :                     -
    Delta Memory :                               -
    Max Swap :                                   456 MB
    Max Processes :                              3
    Max Threads :                                4

The output (if any) follows:

/usr/lib/python2.7/site-packages/iris/fileformats/cf.py:794: UserWarning: Missing CF-netCDF measure variable u'areacella', referenced by netCDF variable u'tas'
  warnings.warn(message % (variable_name, nc_var_name))

Some of the important components to note here are:

  • the host used (host105)
  • the command launched
  • the working and home directories
  • the CPU time used (1.10 seconds)
  • the maximum memory used (55.55 MB here), which helps you define appropriate memory requirements and limits (see how to estimate job resources and how to allocate job resources)
  • the standard output, and error in this case, from the command
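When many jobs are involved, the resource usage summary can be extracted from each output file automatically. The following Python sketch shows one way to do this, assuming the summary has the layout shown above; parse_resource_usage is a hypothetical helper, and values are returned as strings because some entries (such as "-") are not numeric.

```python
import re

# Resource usage section copied from the example job output file.
REPORT = """\
Successfully completed.

Resource usage summary:

    CPU time :                                   1.10 sec.
    Max Memory :                                 55.55 MB
    Average Memory :                             55.55 MB
    Total Requested Memory :                     -
    Delta Memory :                               -
    Max Swap :                                   456 MB
    Max Processes :                              3
    Max Threads :                                4

The output (if any) follows:
"""


def parse_resource_usage(text):
    """Return the 'name : value' pairs of the resource usage summary."""
    usage = {}
    in_summary = False
    for line in text.splitlines():
        if line.strip() == "Resource usage summary:":
            in_summary = True
            continue
        if not in_summary:
            continue
        # Summary entries are indented and look like "name : value [unit]".
        match = re.match(r"^\s+(.+?)\s*:\s+(\S+)", line)
        if match:
            usage[match.group(1)] = match.group(2)
        elif line.strip():
            break  # first non-blank, non-indented line ends the summary
    return usage


usage = parse_resource_usage(REPORT)
print("CPU time (s):", float(usage["CPU time"]))
print("Max memory (MB):", float(usage["Max Memory"]))
```

The maximum memory value obtained this way can then inform the memory requested for similar jobs.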
