How to control jobs

This article shows how to control jobs that have been submitted to LOTUS. It covers the following:

  • How to modify job options (wall time, resource reservation)
  • Suspend and resume a job
  • Move a job to the top or bottom of a queue
  • Move a job between queues
  • Kill a job

How to modify job options

Job submission parameters can be modified with the bmod command. Which parameters can be changed depends on the state of the job: pending jobs accept more modifications than running jobs.

Modify a pending job

If your submitted jobs are pending (bjobs shows the job in the "PEND" state) you can modify the job submission parameters. You can also modify entire job arrays or individual elements of a job array.

  • To replace the job command line, run bmod -Z "new_command". For example: bmod -Z "myjob file" 101.
  • To change a specific job parameter, run bmod with the corresponding option, for example -b for the start time. The specified options replace the submitted options.

The following example changes the start time of job 101 to 2:00 a.m.:

$ bmod -b 2:00 101

  • To reset an option to its default submitted value (undo a bmod), append the n character to the option name and do not include an option value.

The following example resets the start time for job 101 back to its default value:

$ bmod -bn 101 
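Because the set of modifiable options depends on the job state, it can help to check the state before calling bmod. The following is a minimal sketch, not part of LSF: job_state is a hypothetical helper that reads bjobs output on standard input and assumes the default column layout (JOBID, USER, STAT, ...), as in the bjobs listings shown in this article.

```shell
# job_state JOBID
# Hypothetical helper: reads bjobs-style output on stdin and prints the STAT
# column (third field) of the line whose first field equals JOBID.
# Assumes the default bjobs column order: JOBID USER STAT QUEUE ...
job_state() {
  awk -v id="$1" '$1 == id { print $3; exit }'
}

# Illustrative usage on a host with LSF available:
#   bjobs 101 | job_state 101
```

If the helper prints PEND, the pending-job modifications described above apply. If it prints RUN, only the more limited set of running-job options can be changed.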

Modify a running job

If your submitted job is running (bjobs shows the job in the "RUN" state) you can modify only some of the job options, including the resource reservation, wall time and memory limit. You must be the job owner or an LSF administrator to modify a running job.

Modify the resource reservation

A job is usually submitted with a resource reservation for the maximum amount required. Run bmod -R to modify the resource reservation for a running job, for example to decrease it once the peak requirement has passed.

For example, to modify the resource reservation for job 101 to 20 GB of memory:

$ bmod -R "rusage[mem=20000]" 101
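The example above expresses 20 GB as mem=20000, which implies that the value is given in MB with 1000 MB per GB. As a small sketch under that assumption, a hypothetical helper (rusage_mem, not an LSF command) can build the string from a whole number of gigabytes:

```shell
# rusage_mem GB
# Hypothetical helper: prints an rusage string for GB gigabytes of memory.
# Assumption: mem is specified in MB and 1 GB is written as 1000 MB,
# matching the "20GB -> mem=20000" example in this article.
rusage_mem() {
  printf 'rusage[mem=%d]\n' "$(( $1 * 1000 ))"
}

# Illustrative usage on a host with LSF available:
#   bmod -R "$(rusage_mem 20)" 101
```

If your site configures LSF memory units differently, adjust the multiplier accordingly.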

Modify other job options

The bmod options for modifying the run-time limit, memory limit, and standard error and output file names of a running job are listed below. In each case the form ending in n resets the option:

  • Run limit: -We <HH:MM> <job_id> | -Wen
  • Memory limit: -M <mem_limit> <job_id> | -Mn
  • Standard error file name: -e <error_file> <job_id> | -en
  • Standard output file name: -o <output_file> <job_id> | -on
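The run limit above is given as <HH:MM>. When you think of a limit in minutes, a small conversion helper avoids arithmetic slips. This is only a sketch: to_hhmm is a hypothetical function name, and the job ID in the usage comment is illustrative.

```shell
# to_hhmm MINUTES
# Hypothetical helper: converts a whole number of minutes into the HH:MM
# form shown for the run-limit option above.
to_hhmm() {
  printf '%02d:%02d\n' "$(( $1 / 60 ))" "$(( $1 % 60 ))"
}

# Illustrative usage on a host with LSF available:
#   bmod -We "$(to_hhmm 90)" 101    # passes 01:30
```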

Suspend and resume a job

You can resume or suspend a job using the bstop and bresume commands. A job can be suspended by its owner or the LSF administrator with the bstop command. These jobs are considered user-suspended and are displayed by bjobs as "USUSP".

When the user restarts the job with the bresume command, the job is not started immediately to prevent overloading. Instead, the job is changed from "USUSP" to "SSUSP" (suspended by the system). The "SSUSP" job is resumed when host load levels are within the scheduling thresholds for that job, similarly to jobs suspended due to high load. 

For example, to stop and then resume job 6678, enter the following:

$ bstop 6678
Job <6678> is being stopped
$ bresume 6678
Job <6678> is being resumed
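Since a resumed job may sit in "SSUSP" for a while before it runs again, a script that follows a bresume call may want to test whether the job is still in one of the two suspended states described above. A minimal sketch, using a hypothetical function name and covering only the states named in this article:

```shell
# is_suspended STATE
# Hypothetical helper: succeeds (exit status 0) if STATE is one of the
# suspended states described in this article, USUSP or SSUSP.
is_suspended() {
  case "$1" in
    USUSP|SSUSP) return 0 ;;
    *)           return 1 ;;
  esac
}

# Illustrative usage on a host with LSF available (default bjobs layout assumed):
#   state=$(bjobs 6678 | awk '$1 == 6678 { print $3 }')
#   if is_suspended "$state"; then echo "job 6678 is still suspended"; fi
```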

Move a job to the top or bottom of a queue

By default, LSF dispatches jobs in a queue in order of arrival (that is, first-come, first-served), subject to the availability of suitable compute hosts. Use the LSF command btop to move a job relative to your first job in the queue, and bbot to move a job relative to your last job in the queue. You must be the user who submitted the job or an LSF administrator. Please consult the LSF documentation for more information.

Move a job between queues

A pending job can be moved from its current queue to a different queue with the bmod -q <queue-name> <jobID> command, where <queue-name> is the queue to which the job is moved. The following example moves job 4105225 from the par-multi queue to the par-single queue:

$ bjobs
4105225   fchami  PEND  par-multi  jasmin-sci1             multinodes Oct 31 14:28

$ bmod -q par-single  4105225  
Parameters of job <4105225> are being changed

$ bjobs
4105225   fchami  PEND  par-single jasmin-sci1             multinodes Oct 31 14:28

Note: Resources that the job requests must be within the resource allocation limits of the selected queue.

Kill a job

You can cancel a running or pending job by killing it. The bkill command causes LSF to send the SIGINT and SIGTERM signals to the job to give it a chance to clean up, and then the SIGKILL signal to kill it. The following example kills the job with job ID 3421:

$ bkill 3421
Job <3421> is being killed

Note: If you run the bjobs command immediately after the bkill command on a running job, it will often show the job as still being in the RUN state. This is normal. There is no need to issue another bkill command; doing so will not kill the job any faster. It can take several minutes for bkill to end a large parallel job.

To kill all of your pending jobs you can use the following combination of LSF and Linux commands where <username> is your username:

$ bkill `bjobs -u <username> |grep PEND |cut -f1 -d" "` 						
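The grep in the pipeline above matches "PEND" anywhere on the line, so a running job whose name happens to contain that text would also be selected. A slightly stricter sketch compares only the state column. The function name pending_ids is hypothetical, and the filter assumes the default bjobs column layout (JOBID, USER, STAT, ...).

```shell
# pending_ids
# Hypothetical helper: reads bjobs-style output on stdin and prints the job ID
# (first field) of every line whose STAT column (third field) is exactly PEND.
pending_ids() {
  awk '$3 == "PEND" { print $1 }'
}

# Illustrative usage on a host with LSF available. The -r flag is a GNU xargs
# option that skips running bkill when no job IDs are found:
#   bjobs -u <username> | pending_ids | xargs -r bkill
```

It is worth running the first two stages of the pipeline on their own and checking the listed IDs before adding the bkill step.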

If a job cannot be killed in the operating system, you can force its removal from LSF. The bkill -r command sends the same series of signals as bkill, but removes the job from the LSF system immediately, without waiting for it to terminate in the operating system.
