Ingest from GWS
The badc user, which is used for ingestion on the ingest machines, does not have access to content in JASMIN group workspaces (GWSs), which can make ingesting data held on them tricky. The cause is that the badc user cannot be added to the linux groups set up for the large number of GWSs on JASMIN while also retaining active membership of the linux groups needed for archive activities. To get around this, world read access needs to be set up for the required GWS.
Steps for setting up world read access for a GWS
- Contact the GWS manager via the JASMIN helpdesk and obtain permission to give world read access to the GWS on the JASMIN system
- Once permission has been given, contact a JASMIN team member with root access to the GWSs and ask them to set up world read access for the GWS (note to the JASMIN team member: set this as o+rx)
- Check that the GWS can be read from the ingest machine
- Inform the GWS manager that world read access has been set up, and that they may wish to check that other GWS directories which should not be ingested from do not have world read access
- Once ingestion is complete, contact the GWS manager and arrange when the top-level access should be reverted
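The permission change itself can be sketched as follows (the GWS path is a hypothetical example; the actual chmod is run by a JASMIN team member with root access):

```shell
# Hypothetical GWS path - substitute the real one
GWS=/gws/nopw/j04/example_gws

# Grant world read+execute on the top level (o+rx, not recursive)
chmod o+rx "$GWS"

# Confirm the new mode
ls -ld "$GWS"

# Revert once ingestion is complete
chmod o-rx "$GWS"
```

Using o+rx rather than a recursive change keeps the modification confined to the top-level directory, which makes it easy to revert later.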
Ingestion should be possible with the standard ingest tools on the ingest machines.
Note, there are a couple of development items in the pipeline which may help to alleviate this situation in the future. Documentation will be needed for these:
- ingestion initiated from the arrivals service by a reviewer
- ingest from GWS tool
Alternative way to get data from a GWS to archive
(a fudge - not officially recommended - but it works)
This method pulls the data from the GWS to /datacentre/processing3 from where it can then be archived in the usual way.
- Agree with the data provider that the data is to be archived (deposit conditions), and get the full path to the data on the GWS
- Apply for access to the GWS under your own user id via the JASMIN accounts portal: https://accounts.jasmin.ac.uk/
- Log on to a JASMIN sci machine (as your username) and check you can access the data and that the path is correct. It is useful to check the size/shape of the data to transfer.
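Checking the size and shape of the data to transfer can be done with standard tools (the path here is a hypothetical example):

```shell
# Hypothetical GWS data path - substitute the real one
SRC=/gws/nopw/j04/example_gws/data_to_archive

du -sh "$SRC"                 # total size on disk
find "$SRC" -type f | wc -l   # number of files to transfer
```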
- As user badc on an ingest machine, make a directory in /datacentre/processing3 (or similar staging area) to temporarily hold the files.
- From this staging directory (created in the previous step), rsync the data from the GWS using
rsync -a email@example.com:/gws/[full path]/[directory_wanted] . (ie a space then a dot, making the current directory the destination)
For example:
rsync -a firstname.lastname@example.org:/gws/nopw/j04/gotham/wp4/cpdn_extracted_data_extra/b778-845_archive . &
If it fails, check that the path is correct
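One way to check the copy before archiving is to compare the file lists of the GWS source and the staging area (both paths here are hypothetical examples; requires bash for process substitution):

```shell
# Hypothetical source and staging paths - substitute the real ones
SRC=/gws/nopw/j04/example_gws/data_to_archive
DST=/datacentre/processing3/my_staging/data_to_archive

# Empty output from diff means the file lists match
diff <(cd "$SRC" && find . -type f | sort) \
     <(cd "$DST" && find . -type f | sort)
```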
Once the rsync is complete and the data have been checked, you can move them to the archive in one of the usual ways. If it is a large dataset in a directory structure then the ingest route is recommended, as you can specify a regex, multiple threads for the deposit (eg nthreads=10) and arrivals_maxfiles:
This config file /home/badc/software/datasets/wgarland/wendyingest.cfg has an example you can copy/edit to do this - run it manually via ingest_control