Dataverse Integration
Overview
Dataverse is an open source repository for research data. Archivematica can be configured to use a Dataverse Repository as a Transfer Source Location. Dataverse Transfer Source Locations can be configured to display all available datasets or a subset of them.
Datasets are retrieved directly using the Dataverse API and processed using the Dataverse transfer type, which enables some additional processing steps as described below.
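As a rough illustration of what retrieving a dataset through the Dataverse API can look like, the sketch below builds a request against the Dataverse native API endpoint for a single dataset. It is not Archivematica's own implementation. The server URL, the persistent identifier and the helper names `dataset_url` and `fetch_dataset` are hypothetical, and the sketch assumes the `X-Dataverse-key` header is used when an API token is needed.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def dataset_url(base_url, persistent_id):
    """Build the native API URL for one dataset's metadata (hypothetical helper)."""
    query = urlencode({"persistentId": persistent_id})
    return f"{base_url.rstrip('/')}/api/datasets/:persistentId/?{query}"


def fetch_dataset(base_url, persistent_id, api_key=None):
    """Fetch dataset metadata as a dict; needs network access to a real server."""
    request = Request(dataset_url(base_url, persistent_id))
    if api_key:
        # Assumption: the API token is passed in the X-Dataverse-key header.
        request.add_header("X-Dataverse-key", api_key)
    with urlopen(request, timeout=30) as response:
        return json.load(response)


# Example with a made-up server and identifier (URL construction only):
print(dataset_url("https://dataverse.example.org", "doi:10.5072/FK2/EXAMPLE"))
```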
Dataverse integration is supported in Archivematica 1.8 and above, and has been developed and tested using Dataverse version 4.8.6.
Important
As of Archivematica 1.8, the workflow has two known limitations:
- Multiple authors are not captured in the Dataverse METS; only the first author listed is recorded.
- It is not possible to delete packages after extraction when using the Dataverse transfer type.
Other enhancements to the workflow, along with fixes for these two limitations, could be supported in a future release. Please see the issues filed in GitHub.
Selecting Datasets for preservation
When a Dataverse Transfer Source Location is selected in the Transfer tab of the Dashboard, users can browse a list of available datasets. Selecting a directory icon will expand the view to display the list of files included in the dataset. Individual files can’t be selected for Transfer.
When a Dataverse dataset is selected, the transfer type ‘Dataverse’ must also be selected.
Dataset contents
Dataverse provides a metadata file called dataset.json that lists all of the files included in the dataset as well as other descriptive metadata.
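To make the role of dataset.json more concrete, here is a minimal sketch that lists the file names recorded in such a file. The key names (`latestVersion`, `files`, `dataFile`, `filename`, `md5`) are assumptions modelled on the Dataverse native API response and may not match every Dataverse version. The sample content and the helper `list_files` are made up for illustration.

```python
import json

# Made-up sample; the real dataset.json contains many more fields.
SAMPLE_DATASET_JSON = """
{
  "latestVersion": {
    "files": [
      {"dataFile": {"filename": "survey.tab",
                    "md5": "d41d8cd98f00b204e9800998ecf8427e"}},
      {"dataFile": {"filename": "codebook.pdf",
                    "md5": "d41d8cd98f00b204e9800998ecf8427e"}}
    ]
  }
}
"""


def list_files(dataset):
    """Return the file names listed in a parsed dataset.json (assumed layout)."""
    entries = dataset.get("latestVersion", {}).get("files", [])
    return [entry["dataFile"]["filename"] for entry in entries if "dataFile" in entry]


print(list_files(json.loads(SAMPLE_DATASET_JSON)))
```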
When a dataset includes tabular data files, Dataverse creates derivative formats and additional metadata files. See the Dataverse guide describing how a tabular data file bundle works.
Archivematica detects tabular data file bundles and retrieves all derivative files and metadata files.
Processing Dataverse datasets
Archivematica creates a Dataverse METS file to describe the contents and structure of the dataset as retrieved from Dataverse. Archivematica also creates an agents.json file that includes details of the Dataverse instance configured in the Storage Service. This information is used to populate the Dataverse PREMIS agent details in the AIP METS.
Fixity checks are conducted using any checksums provided by Dataverse. Other microservices are carried out as normal (and as configured in the processing configuration). The final AIP will contain descriptive metadata provided by Dataverse, attributes to indicate any derivatives generated by Dataverse, and attributes to indicate the outcome of fixity checks conducted using checksums provided by Dataverse.
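The idea behind a fixity check against a supplied checksum can be sketched as follows. This illustrates the general technique only and is not Archivematica's microservice code. The function names are hypothetical, and MD5 is assumed as the default algorithm because it is a checksum type Dataverse commonly provides.

```python
import hashlib


def file_checksum(path, algorithm="md5", chunk_size=1 << 20):
    """Compute a file's hex digest, reading in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path, expected, algorithm="md5"):
    """Return True when the computed digest matches the supplied checksum."""
    return file_checksum(path, algorithm) == expected.strip().lower()
```

A caller would pass the checksum value taken from the dataset metadata as `expected` and record the boolean result as the outcome of the check.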
Important
When you are processing a Dataverse dataset that includes packaged material (i.e. .zip or .tar files), Archivematica can extract the contents of these files and run preservation microservices on the contents. This occurs during Microservice: Extract packages on the Transfer tab. However, due to a known bug, you must not delete the packages after they have been extracted.
Dataverse METS file
Archivematica generates a Dataverse METS file that describes the contents of the dataset as retrieved from Dataverse. The Dataverse METS includes:
- descriptive metadata about the dataset, mapped to the DDI standard
- a <mets:fileSec> section that lists all files provided, grouped by type (original, metadata or derivative)
- a <mets:structMap> section that describes the structure of the files as provided by Dataverse. This is particularly helpful for understanding which files were provided in a tabular data file bundle.
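The grouping of files by type in the fileSec can be read back with a few lines of Python. The sketch below assumes each group is a `<mets:fileGrp>` element whose `USE` attribute carries the type, and that file locations are given in `xlink:href` on `<mets:FLocat>`. The sample XML and file names are invented and far simpler than a real Dataverse METS.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Invented, heavily simplified sample document.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="original">
      <mets:file ID="f1"><mets:FLocat LOCTYPE="OTHER" xlink:href="survey.sav"/></mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="derivative">
      <mets:file ID="f2"><mets:FLocat LOCTYPE="OTHER" xlink:href="survey.tab"/></mets:file>
    </mets:fileGrp>
    <mets:fileGrp USE="metadata">
      <mets:file ID="f3"><mets:FLocat LOCTYPE="OTHER" xlink:href="survey-ddi.xml"/></mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def files_by_use(mets_xml):
    """Map each fileGrp USE value to the file locations listed under it."""
    root = ET.fromstring(mets_xml)
    groups = defaultdict(list)
    for group in root.iterfind(".//mets:fileGrp", NS):
        use = group.get("USE", "unknown")
        for flocat in group.iterfind(".//mets:FLocat", NS):
            groups[use].append(flocat.get(XLINK_HREF))
    return dict(groups)


print(files_by_use(SAMPLE_METS))
```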
The Dataverse METS is found in the final AIP in this location:
<AIP Name>/data/objects/metadata/transfers/<transfer name>/METS.xml
(This is also where you will find the dataset.json metadata file provided by Dataverse, and the agents.json metadata file created by Archivematica.)
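For scripts that inspect an extracted AIP, the location pattern above can be expressed as a small helper. This is only a convenience sketch: the function name is hypothetical, and the AIP and transfer names are placeholders to be replaced with real values.

```python
from pathlib import PurePosixPath


def dataverse_metadata_paths(aip_name, transfer_name):
    """Return the expected locations of the Dataverse metadata files in an AIP."""
    base = PurePosixPath(
        aip_name, "data", "objects", "metadata", "transfers", transfer_name
    )
    return {name: base / name for name in ("METS.xml", "dataset.json", "agents.json")}


# Placeholder names, for illustration only.
for name, path in dataverse_metadata_paths("my-aip", "my-transfer").items():
    print(name, "->", path)
```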
AIP METS file
The Archival Information Package (AIP) METS file follows the basic structure for a standard Archivematica AIP METS file. Derivatives generated by Dataverse are indicated using the USE attribute of the METS fileGrp element (USE="derivative").
The descriptive metadata (dmdSecs) in the Dataverse METS file are copied over to the AIP METS file.
In the PREMIS Object entity, relationships between original and derivative tabular format files from Dataverse are described using PREMIS semantic units. A PREMIS derivation event indicates the derivative file was generated from the original file, and a Dataverse Agent indicates the Event was carried out by Dataverse prior to ingest, rather than by Archivematica.
Fixity checks that use checksums provided by Dataverse are recorded as PREMIS events, with the eventOutcomeDetailNote semantic unit indicating the source of the checksum.
Configuration
Integration with a Dataverse repository is configured in the Storage Service. For detailed instructions, see the Administrators Manual.