Ingest control

Cron is used to manage ingestion jobs. This enables data scientists to schedule jobs to run regularly at certain times, days or dates.

To give ingestion processes a common way of reporting failures and to keep track of their output, each run is controlled by a standard script called icwrapper. The script can also lock the process so that only one instance runs at any one time, and it can email messages summarising the errors and output from the job.
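The locking and output capture described above can be sketched roughly as follows. This is a minimal illustration, not the icwrapper source: the function name run_locked, the lock file path and the status strings are all assumptions, and emailing is left out.

```python
import fcntl
import subprocess
import sys


def run_locked(lock_path, command, timeout_hours=12):
    """Run command under an exclusive lock; return (status, stdout, stderr).

    Hypothetical helper: status is "locked" if another instance holds the
    lock, "fail" on a non-zero exit or a timeout, and "ok" otherwise.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fails at once if already held.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return ("locked", "", "")
        try:
            result = subprocess.run(
                command,
                capture_output=True,
                text=True,
                timeout=timeout_hours * 3600,
            )
        except subprocess.TimeoutExpired:
            return ("fail", "", "timed out")
        status = "ok" if result.returncode == 0 else "fail"
        return (status, result.stdout, result.stderr)


if __name__ == "__main__":
    # The lock path here is an example only.
    print(run_locked("/tmp/example-stream.lock", [sys.executable, "-c", "print('done')"]))
```

The lock is released when the lock file is closed at the end of the with block, so a crashed run does not leave a stale lock behind.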

The icwrapper script is controlled by configuration files with each section defining an ingest stream. See below for options used by the script.

A command line tool "ingest_control" is used to simplify frequent tasks, such as listing ingest streams and loading the cron schedule. It can also be used to interrogate the output and status of the previous run.

Example configuration file

Ingest is controlled via simple configuration files. These are usually under /home/badc/software/datasets, although they can be kept elsewhere if there is a particular reason to do so. A typical file looks like this:

[DEFAULT]
notify_warning = sam.pepler@stfc.ac.uk
notify_fail = sam.pepler@stfc.ac.uk
working_dir = /home/badc/software/datasets/testdata

[test-stream1]
when = 30 8 * * *
script = ./test.py test.conf test-stream1
#script = sleep 300
notify_ok = sam.pepler@stfc.ac.uk
arrivals_users = spepler wgarland
arrivals_wait = 30

[test-stream2]
when = 32 1 * * *
script = /usr/local/ingest_software/ingest_to_archive/trunk/arrivals_monitor.py
notify_ok = sam.pepler@stfc.ac.uk
arrivals_users = wgarland
arrivals_wait = 5
working_dir = /home/badc/software/datasets/testdata
arrivals_monitor_file = am_test.txt

Each section defines an ingest stream. The DEFAULT section allows default attributes to be given to all the streams in one config file.
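The way DEFAULT values flow into each stream can be illustrated with Python's standard configparser module. This is a sketch only: whether ingest_control itself uses configparser is an assumption, load_streams is a hypothetical helper, and the not-a-stream section is invented to show a section without a script option being skipped.

```python
import configparser

EXAMPLE = """
[DEFAULT]
notify_warning = sam.pepler@stfc.ac.uk
notify_fail = sam.pepler@stfc.ac.uk
working_dir = /home/badc/software/datasets/testdata

[test-stream1]
when = 30 8 * * *
script = ./test.py test.conf test-stream1
notify_ok = sam.pepler@stfc.ac.uk

[not-a-stream]
owner = spepler
"""


def load_streams(text):
    """Return {stream name: options}, skipping sections with no script."""
    # Interpolation is disabled so that any '%' in a value is kept literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    streams = {}
    for name in parser.sections():  # sections() never includes DEFAULT
        options = dict(parser[name])  # DEFAULT values are merged in here
        if "script" in options:
            streams[name] = options
    return streams


streams = load_streams(EXAMPLE)
print(sorted(streams))                          # ['test-stream1']
print(streams["test-stream1"]["working_dir"])   # /home/badc/software/datasets/testdata
```

Here test-stream1 never sets working_dir itself, yet the value from DEFAULT is visible on the stream.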

The icwrapper script knows the following stream options:

Option	Example	Default	Comments
owner	spepler		Who to contact about this job
when	* * * * *	None; without a when option the stream is marked for manual running.	Cron time syntax.
do_not_run	on	off	Stops the job being scheduled
errors_ok	on	off	Marks the process as ok-errors rather than warn if there is any standard error output.
notify_warning	s.peplr@stfc.ac.uk
notify_fail	badc@rl.ac.uk sam.pepler@stfc.ac.uk
notify_ok	s.peplr@stfc.ac.uk
lock	on	on	Only one process can run at a time.
script	/x/y/script.py my.conf stream3		This is the only mandatory option. ingest_control will ignore sections without this option.
retry	3	0	Number of times to retry on failure.
timeout	2	12	Hours until the process is killed.
working_dir	/home/badc/software/datasets/omi-toms	Cron scripts run in the home directory; manual scripts run in the user's current directory.	Working directory. Allows relative paths to be used for the script and config file arguments.
cron_host	ingest2.ceda.ac.uk	ingest1.ceda.ac.uk	Run this ingest script on a different ingest host. Requires the full host name.
conda_env	ingest_py	ingest	Switch to this conda environment to run the job.
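The defaults in the table can be read as a simple fallback applied to each stream. The sketch below is illustrative and not taken from icwrapper: the function names, the schedule labels (disabled, manual, scheduled) and the exact mapping from exit code and standard error to a status are assumptions. Only the names ok-errors and warn come from the table above.

```python
# Defaults copied from the options table; the dictionary itself is illustrative.
TABLE_DEFAULTS = {
    "do_not_run": "off",
    "errors_ok": "off",
    "lock": "on",
    "retry": "0",
    "timeout": "12",
    "cron_host": "ingest1.ceda.ac.uk",
    "conda_env": "ingest",
}


def resolve(options):
    """Return the stream options with table defaults filled in."""
    resolved = dict(TABLE_DEFAULTS)
    resolved.update(options)
    return resolved


def schedule_state(options):
    """Hypothetical labels: 'disabled', 'manual' or 'scheduled'."""
    opts = resolve(options)
    if opts["do_not_run"] == "on":
        return "disabled"
    if "when" not in opts:
        return "manual"
    return "scheduled"


def run_status(returncode, stderr_text, options):
    """Assumed mapping from a finished run to a status name."""
    opts = resolve(options)
    if returncode != 0:
        return "fail"
    if stderr_text.strip():
        # errors_ok downgrades standard error output from warn to ok-errors.
        return "ok-errors" if opts["errors_ok"] == "on" else "warn"
    return "ok"


stream = {"script": "./test.py test.conf test-stream1", "when": "30 8 * * *"}
print(schedule_state(stream))                          # scheduled
print(resolve(stream)["timeout"])                      # 12
print(run_status(0, "deprecation notice\n", stream))   # warn
```

Setting errors_ok = on for the same stream would turn the last result into ok-errors, and removing the when option would make schedule_state report manual.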

Running ingest control

Ingest control is installed on ingest1.ceda.ac.uk. The tool is only used by the user badc.

SSTDMSJP01$ ssh badc@ingest1.ceda.ac.uk

(venv27)[badc@ingest1 ~]$ ingest_control 
Ingest control> help
Documented commands (type help <topic>):
========================================
ALLSTART  a           edit  grep  list      q         reload  tailerr   w    
ALLSTOP   crontab     err   kill  listconf  quit      run     tailout   watch
EOF       deregister  f     l     out       register  show    timeline

Undocumented commands:
======================
help

Ingest control>

Use the help command to find out what each command does.