Ingest control
Cron is used to manage ingestion jobs. This enables data scientists to schedule jobs to run regularly at certain times, days or dates.
To give ingestion processes a common way of reporting failure and keeping track of output, each run is controlled by a standard script called icwrapper. The script can also lock the process so that only one instance runs at any one time, and it can email messages summarising the errors and output from the job.
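The locking behaviour can be illustrated with a minimal sketch. This is not the icwrapper source: the function name `run_locked`, the lock file path and the use of `fcntl.flock` are all assumptions chosen for illustration, and the sketch is Unix-only. It runs a command only if nobody else holds the lock, and captures standard output and standard error so they could later be summarised in an email (the emailing step is left out).

```python
import fcntl
import subprocess
import sys


def run_locked(command, lock_path):
    """Run `command` unless another process already holds `lock_path`.

    Hypothetical helper, not part of icwrapper.
    Returns a tuple (ran, returncode, stdout, stderr).
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: raises at once if already held.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            # Another instance is running, so do nothing.
            return (False, None, "", "")
        # Capture output so it can be reported after the run.
        result = subprocess.run(command, capture_output=True, text=True)
        return (True, result.returncode, result.stdout, result.stderr)
    # The lock is released when the file is closed on leaving the with block.


if __name__ == "__main__":
    # Hypothetical lock path, for demonstration only.
    print(run_locked([sys.executable, "-c", "print('hello')"], "/tmp/demo-stream.lock"))
```

A second call made while the first still holds the lock returns `ran == False` instead of starting a duplicate job.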
The icwrapper script is controlled by configuration files in which each section defines an ingest stream. The options used by the script are listed below.
A command line tool "ingest_control" is used to simplify frequent tasks, such as listing ingest streams and loading the cron schedule. It can also be used to interrogate the output and status of the previous run.
Example configuration file
Ingest is controlled via simple configuration files. These are usually under /home/badc/software/datasets, although they can be kept elsewhere if there is a particular reason to do so. A typical file looks like this:
[DEFAULT]
notify_warning= sam.pepler@stfc.ac.uk
notify_fail= sam.pepler@stfc.ac.uk
working_dir = /home/badc/software/datasets/testdata

[test-stream1]
when = 30 8 * * *
script =./test.py test.conf test-stream1
#script = sleep 300
notify_ok= sam.pepler@stfc.ac.uk
arrivals_users = spepler wgarland
arrivals_wait = 30

[test-stream2]
when = 32 1 * * *
script = /usr/local/ingest_software/ingest_to_archive/trunk/arrivals_monitor.py
notify_ok = sam.pepler@stfc.ac.uk
arrivals_users = wgarland
arrivals_wait = 5
working_dir = /home/badc/software/datasets/testdata
arrivals_monitor_file = am_test.txt
Each section defines an ingest stream. The DEFAULT section allows default attributes to be given to all the streams in one config file.
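The DEFAULT behaviour described here matches that of Python's standard `configparser` module, so it can be demonstrated with a short sketch. Whether icwrapper itself uses `configparser` is an assumption. The function name `load_streams` and the `/tmp/override` path are hypothetical, and the config text is a shortened variant of the example file.

```python
import configparser

# Shortened variant of the example file. /tmp/override is a hypothetical
# value used only to show a stream overriding a DEFAULT attribute.
EXAMPLE = """
[DEFAULT]
notify_fail = sam.pepler@stfc.ac.uk
working_dir = /home/badc/software/datasets/testdata

[test-stream1]
when = 30 8 * * *
script = ./test.py test.conf test-stream1

[test-stream2]
when = 32 1 * * *
script = /usr/local/ingest_software/ingest_to_archive/trunk/arrivals_monitor.py
working_dir = /tmp/override
"""


def load_streams(text):
    """Return {stream name: options}, with DEFAULT values filled in."""
    # interpolation=None keeps any '%' characters in values literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    # sections() omits DEFAULT; each section proxy includes inherited keys.
    return {name: dict(parser[name]) for name in parser.sections()}


if __name__ == "__main__":
    for name, options in load_streams(EXAMPLE).items():
        print(name, options["working_dir"], options["notify_fail"])
```

In this sketch test-stream1 inherits both DEFAULT attributes, while test-stream2 inherits notify_fail but uses its own working_dir.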
The icwrapper script knows the following stream options:
Option | Example | Default | Comments |
owner | spepler | | Who to contact about this job. |
when | * * * * * | None. With no when option the stream is marked for running manually. | cron time syntax. |
do_not_run | on | off | Stops the job being scheduled. |
errors_ok | on | off | Marks the process as ok-errors rather than warn if there is any output on standard error. |
notify_warning | s.peplr@stfc.ac.uk | | |
notify_fail | badc@rl.ac.uk sam.pepler@stfc.ac.uk | | |
notify_ok | s.peplr@stfc.ac.uk | | |
lock | on | on | Only one process can run at a time. |
script | /x/y/script.py my.conf stream3 | | The only mandatory option. ingest_control will ignore sections without this option. |
retry | 3 | 0 | Number of times to retry on failure. |
timeout | 2 | 12 | Hours until the process is killed. |
working_dir | /home/badc/software/datasets/omi-toms | cron scripts run in the home directory; manual scripts run in the user's current directory. | Working directory. Allows relative paths to be used for the script and config file arguments. |
cron_host | ingest2.ceda.ac.uk | ingest1.ceda.ac.uk | Run this ingest script on a different ingest host. Requires the full host name. |
conda_env | ingest_py | ingest | Switch to this conda environment to run the job. |
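How the retry, timeout and errors_ok options might interact is sketched below. This is an illustration under stated assumptions, not the icwrapper implementation. The function name `run_stream` is hypothetical. The status names "ok" and "fail" are guesses based on the notify_ok and notify_fail options. Treating a timeout as a failure that can be retried is also an assumption. Only "ok-errors" and "warn" are named in the table.

```python
import subprocess
import sys


def run_stream(command, retry=0, timeout_hours=12.0, errors_ok=False):
    """Run `command` and return a status string (hypothetical helper).

    Defaults mirror the table: retry=0, timeout=12 hours, errors_ok off.
    """
    # One initial attempt plus `retry` further attempts.
    for _ in range(retry + 1):
        try:
            result = subprocess.run(
                command,
                capture_output=True,
                text=True,
                timeout=timeout_hours * 3600,  # subprocess expects seconds
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising this.
            continue
        if result.returncode != 0:
            # Non-zero exit: count as a failure and try again if allowed.
            continue
        if result.stderr:
            # The job succeeded but wrote to standard error.
            return "ok-errors" if errors_ok else "warn"
        return "ok"
    return "fail"


if __name__ == "__main__":
    print(run_stream([sys.executable, "-c", "print('done')"], retry=3, timeout_hours=2))
```

With retry = 3 the command is attempted at most four times before the sketch reports a failure.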
Running ingest control
Ingest control is installed on ingest1.ceda.ac.uk. The tool is only used by the user badc.
SSTDMSJP01$ ssh badc@ingest1.ceda.ac.uk
(venv27)[badc@ingest1 ~]$ ingest_control
Ingest control> help

Documented commands (type help <topic>):
========================================
ALLSTART  a           edit  grep  list      q         reload  tailerr   w
ALLSTOP   crontab     err   kill  listconf  quit      run     tailout   watch
EOF       deregister  f     l     out       register  show    timeline

Undocumented commands:
======================
help

Ingest control>
Use the help command to find out what each command does.