Catalogue Content Tidying

Outline

Over time we need to refine the catalogue content, especially older information that remains in a poor state for historical reasons. These notes indicate how to safely tidy up content to ensure that:

  • information is not lost
  • existing URLs that may have been used externally resolve to an equivalent, meaningful resource
  • duplications/variations are avoided

The notes will be in two parts:

  1. General principles and workflows
  2. Specific record types and workflows

NOTE - the comments below refer to the main record types in the catalogue, i.e. Observation, Observation Collection, Project, Platform, Computation and Instrument records. Party records are also covered, but as these are a different type of object in the catalogue they have a dedicated section that needs to be referred to. These notes do NOT touch on other types of record within the catalogue, which are often sub-components of the main record types. For guidance on those please contact the catalogue manager.

General Principles

0. Archive content 'retirement' has a set procedure to follow (including for empty directories)

These notes are for tidying catalogue content where no removal or migration of archive content is involved. If you are tidying up the archive as well, see the Remove Data Procedure.

1. Before deleting anything be aware of the knock-on effects!

As the catalogue is a relational database, any given record may be linked to from other records. For the principal record types there should be a banner at the top of the record indicating any connections that may be present.

Some of the connections are a little hidden, e.g. the link between an Observation and an Instrument/Platform record goes through the intermediate 'Acquisition' object, and there may also be a Composite Process object that sits between the Observation and the Acquisition.
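As a rough illustration of why indirect links matter, the sketch below walks a small, made-up link table to list every record that reaches a given record either directly or through intermediate objects. The record names, the `LINKS` list and the `records_affected_by` function are hypothetical and are not part of the catalogue's own code; the real connections should be checked via the record banner or with the catalogue manager.

```python
from collections import defaultdict

# Hypothetical link table of (source record, target record) pairs.
# An Observation reaches its Instrument/Platform only via intermediates.
LINKS = [
    ("collection-1", "observation-1"),
    ("observation-1", "composite-process-1"),
    ("composite-process-1", "acquisition-1"),
    ("acquisition-1", "instrument-1"),
    ("acquisition-1", "platform-1"),
]


def records_affected_by(target, links):
    """Return every record that links to `target`, directly or indirectly."""
    # Index the links by their target so we can walk them backwards.
    inbound = defaultdict(set)
    for source, dest in links:
        inbound[dest].add(source)

    affected = set()
    stack = [target]
    while stack:
        current = stack.pop()
        for source in inbound[current]:
            if source not in affected:
                affected.add(source)
                stack.append(source)
    return affected


if __name__ == "__main__":
    for record in sorted(records_affected_by("instrument-1", LINKS)):
        print(record)
```

In this toy data, removing `instrument-1` would touch not only the Acquisition but also the Composite Process, the Observation and the Observation Collection above it.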

2. Identify an equivalent end-point/record

Tidying up is often needed because of poorly structured content or duplicates, meaning that we need to re-work the way the cataloguing is done for a part of the archive. Before removing anything, though, we need to identify the replacement record that will take the place of the content we're aiming to remove. Once you've done that you can move on to step 3.

It is also possible that the record is simply orphaned, with no connection to anything else in the catalogue or, more importantly, the archive. In that case there's no equivalent record to replace it and it will be a simple case of deletion. Also check whether this step back from publishing something needs to be recorded, e.g. note on a JIRA ticket for the project that data were not delivered, so that we have a record that X was expected but not delivered (it has been practice at times to spin up catalogue content in anticipation of content being delivered which has then failed to materialise).

3. There's a tool for that...

For cases where there are duplicate records or a bulk operation (say more than 10 records) to be done, there may be an existing tool to do the 'merger', or one can be written to make the process quicker. Check out the Tools section below to see what has been listed and/or speak to the catalogue manager about these.

4. Switch your record linking

If you have other records linking to this record that you wish to remove (e.g. an obsolete party record) then it's important to reconnect those other records to your equivalent record. For some cases we may already have a tool to help with this (see Tools section below), but before that is done there may be a need to complete information transfer...
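The relinking idea can be sketched as follows, using plain dictionaries to stand in for catalogue records. The `records` layout, the `links` field and the `relink` function are assumptions made for illustration only; they do not describe the catalogue's data model or any of the existing merger tools.

```python
def relink(records, obsolete_id, replacement_id):
    """Point links at `obsolete_id` to `replacement_id` instead.

    `records` maps a record id to a dict with a "links" list (a hypothetical
    layout). Returns the ids of the records that were changed.
    """
    changed = []
    for record_id, record in records.items():
        # Leave the two records being merged alone.
        if record_id in (obsolete_id, replacement_id):
            continue
        if obsolete_id not in record["links"]:
            continue
        new_links = []
        for link in record["links"]:
            target = replacement_id if link == obsolete_id else link
            # Avoid a duplicate if the record already linked to the replacement.
            if target not in new_links:
                new_links.append(target)
        record["links"] = new_links
        changed.append(record_id)
    return changed


if __name__ == "__main__":
    records = {
        "obs-a": {"links": ["party-old", "party-x"]},
        "obs-b": {"links": ["party-new", "party-old"]},
        "party-old": {"links": []},
        "party-new": {"links": []},
    }
    print(relink(records, "party-old", "party-new"))
    print(records["obs-a"]["links"], records["obs-b"]["links"])
```

Returning the list of changed records gives a simple audit trail to review before the obsolete record is finally removed.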

5. Retain information and identifiers

The record that you want to retire may have various metadata elements that contain useful information that need to be ported over to your equivalent record. In particular:

  • any previously used 'identifiers'
  • the present 'MOLES 3' URL for the record

This is because we have redirect services in place that do a catalogue look-up to resolve those URLs to the equivalent records. Add these to the 'identifiers' section of the record you are retaining.
Other items to retain include details stored in other fields on the record, especially the Abstract and Migrated Properties. See specific records below for other fields and things to note.
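A minimal sketch of the identifier hand-over described above, again using plain dictionaries rather than real catalogue objects. The field names `identifiers` and `url`, the example values and the `carry_over_identifiers` function are all hypothetical; in practice these values are added through the 'identifiers' section of the retained record.

```python
def carry_over_identifiers(retired, retained):
    """Add the retired record's identifiers and URL to the retained record.

    Existing identifiers on the retained record are kept, order is preserved
    and duplicates are skipped. Returns the merged identifier list.
    """
    merged = list(retained["identifiers"])
    for identifier in [*retired["identifiers"], retired["url"]]:
        if identifier not in merged:
            merged.append(identifier)
    retained["identifiers"] = merged
    return merged


if __name__ == "__main__":
    # Made-up example values; example.org is a placeholder domain.
    retired = {
        "identifiers": ["old-id-1", "shared-id"],
        "url": "https://example.org/catalogue/uuid/retired-record",
    }
    retained = {"identifiers": ["shared-id"]}
    for identifier in carry_over_identifiers(retired, retained):
        print(identifier)
```

Keeping the retired record's own URL among the identifiers is what allows a look-up on the old URL to land on the retained record.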

A word on 'Migrated Properties' 

... aka the "More Info (under review)" content at the bottom of the 'Details' tab.

Moving to the present catalogue from the previous instantiation was mostly a smooth operation mapping content over to equivalent records/fields in the latest catalogue instantiation. However, there were various fields that could not be directly mapped over and so were stored in the 'migrated properties' section, requiring additional work to migrate the information to some equivalent place in due course. In most cases this is things like moving documentation to pages in the CEDA Artefact Service (for content that may be changed), CEDA Document Repository (for fixed content) or moving links to the 'online resource' section of your replacement record.

More needs to be done on this generally, so speak to the catalogue manager about this to prompt further work!

6. Removal of the record

If you're not using a 'merger' tool as noted below, once you've moved content from the record and you're happy to have it removed then you should be good to go!

General Workflows

Observation Records

Party Records

 - see the tools section

Observation/Observation Collection

 - Results. If there was a Result linked to the record then don't delete the Result when coming to remove your Observation record. Instead add this as an 'old data path' on your new Observation record's Result object.

There's a Tool For That... (perhaps)

For duplicate entries that need 'merging', i.e. switching connections from one record over to another record that you want to retain and then removing the obsolete record (note, these merger tools will also handle any existing links to records that are to be de-duplicated):

  • party merger - removes duplicate Party record instances
  • duplicate image remover - removes duplicate image instances
  • add_sam - add the archive manager as the latest CEDA Officer where needed on records
  • new_ceda_officer - adds a given person as the latest CEDA Officer when a previous staff member moves on
  • composite merger - merges composite record content
  • update docs urls - update online resource links
  • tag for export - tags records for export into the NERC Data Catalogue Service
  • add online resource tool - adds new online resources to a set of records defined by a list of UUIDs
  • tag_obs_cols - need to check 
  • procedure merger  - need to check
  • id_merer  - need to check

No suitable tool but you're doing bulk changes? The catalogue manager can help script some of this work.
