Archivematica Storage Service 0.22.0 is our latest release.

Administering the Storage Service

The Archivematica Storage Service allows the configuration of storage spaces associated with multiple Archivematica pipelines. It allows a storage administrator to configure what storage is available to each Archivematica installation, both local and remote.

Storage Service glossary and organization

Pipelines

A pipeline refers to a single installation of an Archivematica dashboard. One Storage Service can be used to configure multiple Archivematica pipelines.

Spaces

A space models a specific storage device. That device might be a locally-accessible disk, a network share, or a remote system accessible via a protocol like FEDORA, SWIFT, DuraCloud, or LOCKSS. The space provides the Storage Service with configuration to read and/or write data stored within itself.

Packages are not stored directly inside a space; instead, packages are stored within locations, which are organized subdivisions of a space.

Locations

A location is a subdivision of a space. Each location is assigned a specific purpose, such as AIP storage, DIP storage, transfer source or transfer backlog, in order to provide an organized way to structure content within a space.

Packages

The Storage Service is oriented to storing packages. A “package” is a bundle of one or more files transferred from an external service; for example, a package may be an AIP, a backlogged transfer, or a DIP. Each package is stored in a location.

Archivematica configuration

When installing Archivematica, options to configure it with the Storage Service will be presented.

Configuring the Storage Service during Archivematica installation.

If you have installed the Storage Service at a different URL, you should change that here.

The Use default transfer source & AIP storage locations option will attempt to automatically configure default Locations for Archivematica, register a new Pipeline, and generate an error if the Storage Service is not available. Use this option if you want the Storage Service to automatically set up the configured default values.

The Register this pipeline & set up transfer source and AIP storage locations option will only attempt to register a new Pipeline with the Storage Service, and will not generate an error if no Storage Service can be found. It will also open a link to the provided Storage Service URL, so that Locations can be configured manually. Use this option if the default values are not desired, or if the Storage Service is not running yet. Locations will have to be configured manually before any Transfers can be processed or AIPs stored.

If the Storage Service is running, the URL to it should be provided and Archivematica will attempt to register the dashboard UUID as a new Pipeline. Otherwise, the dashboard UUID is displayed and a Pipeline for the Archivematica instance can be manually created and configured. The dashboard UUID is also available in Archivematica under Administration -> General.

Change the port in the web server configuration

The Storage Service uses nginx by default. To change the port, edit the file /etc/nginx/sites-enabled/storage and change the line that says listen 8000;, replacing 8000 with the port you want to use.
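The edit described above can also be scripted. The following is a minimal sketch, not part of the Storage Service: the function name set_listen_port and the sample configuration text are illustrative assumptions, and the pattern only handles the bare `listen <port>;` form, not directives such as `listen 127.0.0.1:8000;`.

```python
import re


def set_listen_port(config_text: str, new_port: int) -> str:
    """Return config_text with every bare `listen <port>;` directive set to new_port."""
    if not 1 <= new_port <= 65535:
        raise ValueError(f"invalid port: {new_port}")
    # Matches "listen", whitespace, digits, optional whitespace, then ";".
    pattern = re.compile(r"(\blisten\s+)\d+(\s*;)")
    return pattern.sub(rf"\g<1>{new_port}\g<2>", config_text)


# Hypothetical excerpt of /etc/nginx/sites-enabled/storage.
sample = "server {\n    listen 8000;\n    server_name _;\n}\n"
updated = set_listen_port(sample, 8001)
print(updated)
```

After writing the updated text back to the file, nginx has to be reloaded for the change to take effect.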

Keep in mind that in a default installation of Archivematica, the dashboard is running in Apache on port 80. It is not possible to make nginx run on port 80 on the same machine. If you install the Storage Service on its own server, you can set it to use port 80.

Make sure to adjust the dashboard UUID in the Archivematica dashboard under Administration -> General.

Pipelines

The pipeline in Archivematica is the Archivematica dashboard. All pipelines need to be registered with the Storage Service using the pipeline’s universally unique identifier (UUID). The UUID can be found in the Archivematica dashboard under Administration -> General Configuration. When you first install Archivematica, it will attempt to register the pipeline’s UUID automatically with the Storage Service, using the description Archivematica on <hostname>.

Storage Service pipelines screen.

A single Storage Service can be connected to many pipelines. To connect a pipeline to an existing Storage Service, click on Create new pipeline on the Pipelines tab of the Storage Service.

Fields:

  • UUID: The unique identifier of the Archivematica pipeline.
  • Description: A description of the pipeline that will be displayed to the user. e.g. Development site.
  • Remote name: the base URL of the pipeline server. This is used for making API calls.
  • API username: The username to use when making API calls to the pipeline.
  • API key: The API key to use when making API calls to the pipeline.
  • Enabled: If checked, this pipeline can access locations associated with it. If unchecked, all locations associated with this pipeline will be disabled.
  • Default Location: If checked, the default locations that have been selected in Administration -> Configuration will be created for or associated with the new pipeline.

Spaces

Spaces contain all of the information needed to connect Archivematica to a storage location. The space is where protocol-specific information, like an NFS export path and hostname, or the username of a system accessible only via SSH, is stored.

Storage Service spaces screen.

Each space has specific configuration fields depending on the access protocol that is selected. See Access protocols below for more information.

Note

A space is usually the immediate parent of the Location folders. For example, if you want to define two transfer source locations located at /home/artefactual/archivematica-sampledata and /home/artefactual/maildir_transfers, the space’s path would be /home/artefactual/.
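As a small illustration of the note above, the space path and the location relative paths can be derived from the absolute location paths. This is only a sketch using the Python standard library; the helper name split_space_and_locations is hypothetical and is not a Storage Service function.

```python
import posixpath


def split_space_and_locations(location_paths):
    """Return (space_path, relative_paths) for a list of absolute location paths."""
    # The space is the deepest directory shared by all locations.
    space_path = posixpath.commonpath(location_paths)
    # Each location is then addressed relative to the space.
    relative_paths = [posixpath.relpath(p, space_path) for p in location_paths]
    return space_path, relative_paths


space, relative = split_space_and_locations([
    "/home/artefactual/archivematica-sampledata",
    "/home/artefactual/maildir_transfers",
])
print(space)     # /home/artefactual
print(relative)  # ['archivematica-sampledata', 'maildir_transfers']
```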

Access protocols

Archipelago

Configuring Archipelago as a storage service space: Archipelago is currently supported for AIP Storage locations only. Archipelago is built on the content management system Drupal, which accepts JSON API calls to create new entities. The Storage Service can send AIPs directly to Archipelago this way, creating new entities of the ‘AIP’ content type. Retrieving, re-ingesting, and deleting AIPs, and integrating the API of the Storage Service with Archivematica, are not yet implemented.

In Archipelago: There are plans by Archipelago’s developers to add AIP as a default object type for the platform. For now, we can create the AIP content type ourselves in a few steps.

  • In the Manage administrative menu, navigate to Structure > Content types (admin/structure/types). The Content types page appears showing all the available types of content.
  • Click Add content type. Name the content type AIP.
  • Add a single strawberry field only to the content type, and this will complete setup to allow transfers from Archivematica.
  • Ensure your user has write permissions via JSON API. To do this, in Archipelago under Administration > Configuration > JSON:API, tick the box “Accept all JSON:API create, read, update, and delete operations.” to allow JSON API writes to be made to Archipelago. Specific permissions for the AIP content type can also be set under Administration > Structure > Content types > AIP > Manage permissions, where you can select whether authenticated users only, or anonymous users as well, can create AIPs.

For more information on creating the content type in Archipelago, see the Drupal instructions.

In Storage Service: AIPs can be sent to an Archipelago instance via JSON API calls. The Archipelago instance URL is required, as well as an Archipelago username and password with write permissions on the platform.

  • When creating the space, the staging path should be set to the default /var/archivematica/storage_service.
  • When creating the AIP storage location in the space, set the relative path to /.

During the transfer to Archipelago, the AIP file is uploaded first. If the file uploads successfully, a new AIP type entity is created in Archipelago. If the AIP file does not upload successfully, the new AIP entity is not created in Archipelago. The resulting entity contains the uploaded file and the corresponding Dublin Core metadata extracted from the mets.xml file associated with the AIP.

Fields:

  • Size: the maximum size allowed for this space. Set to 0 or leave blank for unlimited. This field is optional.
  • Path: the local path on the Storage Service machine to the CIFS share. This field is optional for Archipelago.
  • Staging Path: the absolute path to a staging area. When creating the space, the staging path should be set to the default /var/archivematica/storage_service.
  • Archipelago URL: the hostname of the Archipelago web instance or IP address with port, e.g. archipelago.example.com:8443
  • Archipelago username: Archipelago username with write access required
  • Archipelago password: Archipelago password with write access required

Arkivum

Archivematica can use Arkivum’s A-Stor as an access protocol in Storage Service version 0.7 and higher. A-Stor can expose a CIFS share to the Storage Service so that the Storage Service can copy files to an A-Stor datapool, for example for AIP storage.

Add an entry to /etc/fstab on the Storage Service, then mount the A-Stor CIFS share.

Example:

//ARK00092/astor /mnt/astor cifs defaults,guest,file_mode=0666,dir_mode=0777,uid=archivematica,gid=archivematica,forcegid,forceuid,rw 0 1

In this example, ARK00092 is the name of the appliance and should be resolvable through DNS or be set as an entry in /etc/hosts.

Then, choosing Arkivum as the access protocol, create a new space in the Storage Service.

Fields:

  • Size: the maximum size allowed for this space. Leave blank for unlimited. This field is optional.
  • Path: the local path on the Storage Service machine to the CIFS share, e.g. /mnt/astor
  • Staging Path: the absolute path to a staging area. Must be UNIX filesystem compatible, preferably on the same filesystem as the path, e.g. /mnt/astor/archivematica1/tmp
  • Host: the hostname of the Arkivum web instance or IP address with port, e.g. arkivum.example.com:8443
  • Remote user: the username on the remote machine accessible via passwordless ssh. This field is optional.
  • Remote name: the name or IP of the remote machine. This field is optional.

Dataverse

Dataverse Integration is supported with Archivematica v1.8 (and higher) and Storage Service v0.13 (and higher).

Fields:

  • Size: the maximum size allowed for this space. Leave blank for unlimited. This field is optional.
  • Path: the absolute path to the Space on the local filesystem.
  • Staging path: the absolute path to a staging area. Must be UNIX filesystem compatible and preferably will be located on the same filesystem as the path.
  • Host: Hostname or IP address of the Dataverse instance, e.g. test.dataverse.org
  • API key: the key generated by Dataverse for a specific user account
  • Agent name: a string that will be used in Archivematica METS file to identify this Dataverse instance as a PREMIS agent, e.g. My Institution's Dataverse Instance.
  • Agent type: a string that will be used in Archivematica METS file to identify the type of PREMIS agent named above, e.g. organization.
  • Agent identifier: a string that will be used as an identifier in the Archivematica METS file to uniquely identify the PREMIS agent named above.

Dataverse spaces support Transfer Source Locations (Locations for other purposes are not currently supported). At least one location should be created as a Transfer Source.

Within this location, the relative path can be used to set two of the parameters available in the Dataverse Search API. The q (or “Query”) parameter is a general search parameter. The subtree (or “Subtree”) parameter can be used to indicate a sub-dataverse. For example, the following entry in “Relative Path”:

Query:*
Subtree:Archivematica

will return all datasets within the Archivematica dataverse. The other API parameters are set using the fields described above for the space, or are set with fixed values. The parameters used are:

URL: https://<Host field set in Space configuration>/api/search/

{
  'q': '<Relative Path field set in Location configuration>',
  'sort': 'name',
  'key': '<API Key field set in Space configuration>',
  'start': 50,
  'per_page': 50,
  'show_entity_ids': True,
  'type': 'dataset',
  'subtree': '<Relative Path field set in Location configuration>',
  'order': 'asc'
}

Search results are currently limited to 50 datasets. For repositories with more than 50 datasets, we recommend creating multiple Locations with more specific search criteria. For further details of the API parameters, see the Dataverse API guide.
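To make the mapping from Space and Location fields to the search request more concrete, here is a minimal sketch that builds the request URL. It is not the Storage Service implementation: the function names, the way the Relative Path lines are parsed, the fallback to * when no query is given, and the example API key are all assumptions made for illustration. The start offset is exposed as an argument rather than fixed.

```python
from urllib.parse import urlencode


def parse_relative_path(relative_path: str) -> dict:
    """Parse 'Query:...' and 'Subtree:...' lines into a dict with lowercase keys."""
    parsed = {}
    for line in relative_path.splitlines():
        if ":" not in line:
            continue  # ignore lines that are not "Name:value"
        name, value = line.split(":", 1)
        parsed[name.strip().lower()] = value.strip()
    return parsed


def build_search_url(host: str, api_key: str, relative_path: str, start: int = 0) -> str:
    """Build a Dataverse Search API URL from Space and Location settings."""
    fields = parse_relative_path(relative_path)
    params = {
        "q": fields.get("query", "*"),  # assumption: match everything by default
        "sort": "name",
        "key": api_key,
        "start": start,
        "per_page": 50,
        "show_entity_ids": "true",
        "type": "dataset",
        "order": "asc",
    }
    if "subtree" in fields:
        params["subtree"] = fields["subtree"]
    return f"https://{host}/api/search/?{urlencode(params)}"


url = build_search_url(
    "test.dataverse.org",
    "example-api-key",  # placeholder, not a real key
    "Query:*\nSubtree:Archivematica",
)
print(url)
```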

DuraCloud

Archivematica can use DuraCloud as an access protocol for the Storage Service in version 0.5 and higher. A Storage Service space has a one-to-one relationship to a space within DuraCloud.

To set up your Archivematica instance with DuraCloud, please see Using DuraCloud with Archivematica.

Fields:

  • Access protocol: DuraCloud
  • Size: the maximum size allowed for this space. Leave blank for unlimited. This field is optional.
  • Path: Leave this field blank.
  • Staging path: A location on your local disk where Archivematica can place files for staging purposes, for example /var/archivematica/storage_service/duracloud_staging
  • Host: Hostname of the DuraCloud instance, e.g. site.duracloud.org
  • Username: The username for the Archivematica user that you created in DuraCloud.
  • Password: The password for the Archivematica user that you created in DuraCloud.
  • Duraspace: The name of the space in DuraCloud you are creating this Storage Service space for (e.g. transfer-source, aip-store, etc).

DSpace via REST API

DSpace via REST API locations are supported for both AIP and DIP Storage locations. Because DSpace is typically used as a public-facing system, the behaviour is different than when using other access protocols for AIP Storage. Upon deposit in DSpace, the AIP will be deposited as a single compressed objects file (bitstream), which contains all original and normalized objects as well as the metadata pertaining to them. The DIP will be deposited as uncompressed files with the folder structure flattened.

Presently, the Storage Service and Dashboard are not capable of downloading/reconstituting the AIP - this must be done manually from the DSpace interface.

The DSpace via REST API space offers a way to increase automation. It is possible to supply metadata with the transfer as either a CSV or a JSON file. In the entry for the root folder objects, it is possible to define any number of Dublin Core Metadata Element Set properties, which will be mapped onto the corresponding DSpace record created for the AIP/DIP. In addition, there are three optional custom properties which can be defined:

  • dspace_dip_collection: the UUID of the DSpace collection into which the DIP will be deposited. E.g. a12d749c-7727-4121-b6be-478cacde658f
  • dspace_aip_collection: the UUID of the DSpace collection into which the AIP will be deposited. E.g. 80c3519d-7a07-4830-beae-a868c149ecbe
  • archivesspace_dip_collection: the identifier of the archival object for which a child digital object will be created which will link to the DSpace DIP record. E.g. 135569
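As an illustration of the properties listed above, the following sketch generates a JSON metadata entry for the root folder objects. The layout used here (a list of objects with a filename key and dc.* keys) is an assumption based on the general Archivematica metadata import convention, and the helper name and the example title and creator are hypothetical; only the three custom property names and the example identifiers come from this page.

```python
import json


def build_dspace_metadata(title, creator, dip_collection=None,
                          aip_collection=None, archival_object=None):
    """Return JSON text describing the root `objects` folder for a transfer."""
    entry = {
        "filename": "objects",  # the root folder entry
        "dc.title": title,
        "dc.creator": creator,
    }
    optional = {
        "dspace_dip_collection": dip_collection,
        "dspace_aip_collection": aip_collection,
        "archivesspace_dip_collection": archival_object,
    }
    # Only include the custom properties that were actually supplied.
    entry.update({k: v for k, v in optional.items() if v is not None})
    return json.dumps([entry], indent=2)


metadata_json = build_dspace_metadata(
    title="Example transfer",
    creator="Example creator",
    dip_collection="a12d749c-7727-4121-b6be-478cacde658f",
    aip_collection="80c3519d-7a07-4830-beae-a868c149ecbe",
    archival_object="135569",
)
print(metadata_json)
```

Properties that are omitted here fall back to the defaults configured on the space, as described in the fields below.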

Note

Note that the DSpace via REST API space only supports DSpace 6.x and not other versions of DSpace.

Fields:

  • Size: the maximum size allowed for this space. Leave blank for unlimited. This field is optional.
  • Path: the absolute path to the Space on the local filesystem.
  • Staging path: the absolute path to a staging area. Must be UNIX filesystem compatible and preferably will be located on the same filesystem as the path.
  • REST URL: URL to the “REST” webapp. E.g. http://localhost:8080/rest/; for production systems, this address will be slightly different, such as: https://demo.dspace.org/rest/.
  • User: a username for the DSpace instance with sufficient permissions to permit authentication.
  • Password: the password for the username above.
  • Default DSpace DIP collection id: the UUID of the collection into which the DIP will be deposited barring it being designated in a transfer metadata file.
  • Default DSpace AIP collection id: the UUID of the collection into which the AIP will be deposited barring it being designated in a transfer metadata file.

Note

The following seven fields are optional and can safely be ignored should you only require a connection to DSpace.

Unlike the DSpace via SWORD2 space, which uses the Dashboard administration tab to configure a single ArchivesSpace instance, the DSpace via REST space gives you the option to configure a different ArchivesSpace instance per space.

  • ArchivesSpace URL: URL to the ArchivesSpace server. E.g.: http://sandbox.archivesspace.org/.
  • ArchivesSpace user: ArchivesSpace username to authenticate as
  • ArchivesSpace password: ArchivesSpace password to authenticate with
  • Default ArchivesSpace repository: Identifier of the default ArchivesSpace repository
  • Default ArchivesSpace archival object: Identifier of the default ArchivesSpace archival object barring it being designated in a transfer metadata file
  • Send AIP to Tivoli Storage Manager: this is a feature specific to the requirements of Edinburgh University which sponsored the development of this space. Essentially it executes a bash command using a binary called dsmc.
  • Verify SSL certificates: Requests verifies SSL certificates for HTTPS requests, just like a web browser. By default, SSL verification is enabled, and Requests will throw an SSLError if it is unable to verify the certificate.

DSpace via SWORD2 API

DSpace via SWORD2 locations are currently supported only for AIP Storage locations. Because DSpace is typically used as a public-facing system, the behaviour is different than when using other access protocols for AIP Storage: upon deposit in DSpace, the AIP will be split into two parts:

  • a compressed objects file (bitstream), which contains all original and normalized objects
  • a metadata file (bitstream), which contains all of the bag artifacts, metadata and logs

The metadata bitstream can optionally be restricted; see below. Presently, the Storage Service and Dashboard are not capable of downloading/reconstituting the AIP - this must be done manually from the DSpace interface.

If using DSpace as the AIP location in conjunction with the ArchivesSpace workflow in the Appraisal tab, a post Store AIP hook will send the DSpace handle to the ArchivesSpace digital object record upon AIP storage. The ArchivesSpace configuration is set up in the Dashboard administration tab.

Note

Note that the DSpace via SWORD2 API space makes use of the DSpace REST API to change the permissions of the metadata file. This means that you need to make sure that the REST API is configured; see the DSpace 5 REST API documentation.

The DSpace via SWORD2 API space functionality only supports DSpace 5 and not other versions. See changes in authentication in the DSpace REST API between versions 5 and 6 in the DSpace 6 REST API documentation.

Fields:

  • Size: the maximum size allowed for this space. Leave blank for unlimited. This field is optional.
  • Path: the absolute path to the Space on the local filesystem.
  • Staging path: the absolute path to a staging area. Must be UNIX filesystem compatible and preferably will be located on the same filesystem as the path.
  • Service Document IRI: URL of the service document. E.g. http://demo.dspace.org/swordv2/servicedocument, where servicedocument is the handle to the community or collection being used for deposit.
  • User: a username for the DSpace instance with sufficient permissions to permit authentication.
  • REST URL: URL to the “REST” webapp. E.g. http://localhost:8080/rest/; for production systems, this address will be slightly different, such as: https://demo.dspace.org/rest/.
  • Password: the password for the username above.
  • Restricted metadata policy: Use to restrict access to the metadata bitstream. Must be specified as a list of objects in JSON, e.g. [{"action":"READ","groupId":"5","rpType":"TYPE_CUSTOM"}]. This will override existing policies.
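Because the restricted metadata policy must be a JSON list of objects, it can help to check the value before pasting it into the form. This is a minimal sketch of such a check; the function name parse_metadata_policy is hypothetical, and the check only covers the overall shape, not which keys DSpace accepts.

```python
import json


def parse_metadata_policy(raw: str) -> list:
    """Parse a policy string and confirm it is a JSON list of objects."""
    policy = json.loads(raw)
    if not isinstance(policy, list) or not all(isinstance(item, dict) for item in policy):
        raise ValueError("policy must be a JSON list of objects")
    return policy


# The example value from the field description above.
policy = parse_metadata_policy('[{"action":"READ","groupId":"5","rpType":"TYPE_CUSTOM"}]')
print(policy[0]["action"])  # READ
```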

Fedora via SWORD2

Fedora via SWORD2 is currently supported in the Storage Service as an Access Protocol to facilitate use of the Archidora plugin, which allows ingest of material from Islandora to Archivematica. This workflow is in beta testing as of Storage Service 0.9/Archivematica 1.5/Islandora 7.x-1.6.

Fields:

  • Size: the maximum size allowed for this space. Leave blank for unlimited. This field is optional.
  • Path: the absolute path to the Space on the local filesystem.
  • Staging path: the absolute path to a staging area. Must be UNIX filesystem compatible and preferably will be located on the same filesystem as the path.
  • Fedora user: Fedora user name (for SWORD functionality).
  • Fedora password: Fedora password (for SWORD functionality).
  • Fedora name: Name or IP of the remote Fedora machine.

Note

  • A Location (see below) must also be created, with the purpose of FEDORA Deposits.
  • On the Archivematica dashboard, the IP of the Storage Service needs to be added to the IP whitelist for the REST API, so that transfers will be approved automatically.
  • A post-store callback can be configured, to enable Islandora to list objects that can be deleted once they have been stored by Archivematica. See the Administration section.

GPG encryption on local file system

Creating a GPG encryption space will allow users to create encrypted AIPs and transfers. Only AIP storage, Transfer backlog and Replicator locations can be created in a GPG encryption space.

Encrypted AIPs and transfers can be downloaded unencrypted via the Storage Service and Archivematica dashboard.

Before creating a GPG encryption space ensure that you have created or imported a GPG key on the Administration page.

Fields:

  • Size: the maximum size allowed for this space. Leave blank for unlimited. This field is optional.
  • Path: the absolute path to the Space on the local filesystem.
  • Staging path: the absolute path to a staging area. Must be UNIX filesystem compatible and preferably will be located on the same filesystem as the path.
  • Key: the encryption key to be used for the space.

Important

It is possible to encrypt uncompressed AIPs, which will be stored as tar files.

Uncompressed AIPs do not have pointer files, so if the key for the space is changed and the original key is deleted/unknown, Archivematica will have no record of the key for decryption.

Local Filesystem

Local Filesystem spaces handle storage that is available locally on the machine running the Storage Service. Typically this is the hard drive, SSD or RAID array attached to the machine, but it could also encompass remote storage that has already been mounted. For remote storage that has been locally mounted, we recommend using a more specific Space if one is available.

Fields:

  • Size: the maximum size allowed for this space. Leave blank for unlimited. This field is optional.
  • Path: the absolute path to the Space on the local filesystem.
  • Staging path: the absolute path to a staging area. Must be UNIX filesystem compatible and preferably will be located on the same filesystem as the path.

LOCKSS

Archivematica can store AIPs in a LOCKSS network via LOCKSS-O-Matic, which uses SWORD to communicate between the Storage Service and a Private LOCKSS Network (PLN).

Fields:

  • Size: the maximum size allowed for this space. Leave blank for unlimited. This field is optional.
  • Path: the absolute path to the Space on the local filesystem.
  • Staging path: the absolute path to a staging area. Must be UNIX filesystem compatible and preferably will be located on the same filesystem as the path.
  • Service document IRI: the URL of the LOCKSS-o-matic service document IRI, e.g. http://lockssomatic.example.org/api/sword/2.0/sd-iri.
  • Content Provider ID: the On-Behalf-Of value when communicating with LOCKSS-o-matic.
  • Externally available domain: the base URL for this server that LOCKSS will be able to access. Generally this is the URL for the home page of the Storage Service.
  • Keep local copy?: check this box if you wish to store a local copy of the AIPs even after they are stored in LOCKSS.

Note

When creating a Location for a LOCKSS space (see below), the Purpose of the Location must be AIP Storage.

NFS

NFS spaces are for NFS exports mounted on the Storage Service server and the Archivematica pipeline.

Fields:

  • Size: the maximum size allowed for this space. Leave blank for unlimited. This field is optional.
  • Path: the absolute path to where the space is mounted on the filesystem local to the Storage Service.
  • Staging path: the absolute path to a staging area. Must be UNIX filesystem compatible and preferably will be located on the same filesystem as the path.
  • Remote name: the hostname or IP address of the remote computer exporting the NFS mount.
  • Remote path: the export path on the NFS server
  • Version: the version of the filesystem, e.g. nfs or nfs4, as would be passed to the mount command.
  • Manually mounted: This is a placeholder for a feature that is not yet available.

Pipeline Local Filesystem

Pipeline Local Filesystems refer to the storage that is local to the Archivematica pipeline, but remote to the Storage Service. For this Space to work properly, passwordless SSH must be set up between the Storage Service host and the Archivematica host.

For example, the Storage Service is hosted on storage_service_host and Archivematica is running on archivematica1. The transfer sources for Archivematica are stored locally on archivematica1, but the Storage Service needs access to them. The Space for that transfer source would be a Pipeline Local Filesystem.

Note

Passwordless SSH must be set up between the Storage Service host and the computer Archivematica is running on.

Fields:

  • Size: the maximum size allowed for this space. Leave blank for unlimited. This field is optional.
  • Path: the absolute path to where the space is mounted on the filesystem local to the Storage Service.
  • Staging path: the absolute path to a staging area. Must be UNIX filesystem compatible and preferably will be located on the same filesystem as the path.
  • Remote user: the username on the remote host.
  • Remote name: the hostname or IP address of the computer running Archivematica. This should be SSH accessible from the Storage Service computer.
  • Assume remote host serving files with rsync daemon: if checked, the Storage Service will use rsync daemon-style commands instead of the default rsync with remote shell.
  • Rsync password: the password for the rsync daemon

RClone

rclone is a command-line program to manage files in cloud storage, and is available as an access protocol in Storage Service 0.20 and higher.

The RClone space allows for use of over 40 cloud providers with Archivematica as Transfer Source, AIP Store, DIP Store, and Replicator locations. Configuration of details such as access keys can be done with a configuration file or via environment variables (the recommended method). See the rclone documentation on configuration via environment variables.
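As a sketch of configuration via environment variables: rclone reads remote settings from variables named RCLONE_CONFIG_ followed by the remote name, an underscore, and the option name, all in upper case. The helper below builds such names; the remote name mys3 and the option values are placeholders chosen for illustration, not values required by the Storage Service.

```python
def rclone_env_vars(remote_name: str, options: dict) -> dict:
    """Map rclone config options to RCLONE_CONFIG_<REMOTE>_<OPTION> variables."""
    prefix = f"RCLONE_CONFIG_{remote_name.upper()}_"
    return {prefix + key.upper(): str(value) for key, value in options.items()}


env = rclone_env_vars(
    "mys3",  # hypothetical remote name; enter the same name in the space
    {
        "type": "s3",
        "provider": "AWS",
        "access_key_id": "EXAMPLE-KEY-ID",
        "secret_access_key": "EXAMPLE-SECRET",
    },
)
for name, value in sorted(env.items()):
    print(f"{name}={value}")
```

The variables need to be set in the environment of the process that runs the Storage Service so that rclone can see them.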

Fields:

  • Access protocol: RClone
  • Size: the maximum size allowed for this space. Leave blank for unlimited. This field is optional.
  • Path: Leave this field blank.
  • Staging path: A location on your local disk where Archivematica can place files for staging purposes, for example /var/archivematica/storage_service/rclone_staging.
  • Remote name: Remote name for the rclone configuration to use with this Space. Must match value in environment variables, case-insensitive.
  • Container/Bucket name: Container or bucket name to use in configured remote (optional, depending on service being used via rclone).

Swift

OpenStack’s Swift is available as an access protocol in Storage Service 0.7 and higher.

Fields:

  • Size: the maximum size allowed for this space. Leave blank for unlimited. This field is optional.
  • Path: the absolute path to the Space on the local filesystem.
  • Staging path: the absolute path to a staging area. Must be UNIX filesystem compatible and preferably will be located on the same filesystem as the path.
  • Auth URL: the URL to authenticate against.
  • Auth version: the OpenStack authentication version.
  • Username: the Swift username that will be used for authentication.
  • Password: the password for the above username.
  • Container: the name of the Swift container. To list available containers in your Swift installation, run swift list from the command line.
  • Tenant: the tenant/account name, required when connecting to an auth 2.0 system.
  • Region: the region in Swift. This field is optional.

Note

Swift cannot be used for transfers containing an object over 5GB (for uncompressed packages) or that are 5GB in total (for compressed packages). This applies to the Transfer Source, Backlog, AIP and DIP Locations.

Any object/package over 5GB must be segmented by the application interacting with Swift, a function which is not currently available for the Swift space. See the Swift documentation for large objects.
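The size rule above can be checked before choosing a Swift location. In this sketch, an uncompressed package is represented by the sizes of its individual files and a compressed package by the single size of its archive, so both cases reduce to "no uploaded object may exceed the limit". The function name is hypothetical, and interpreting 5GB as 5 * 1024**3 bytes is an assumption.

```python
# Assumption: "5GB" is interpreted as 5 GiB.
SWIFT_OBJECT_LIMIT = 5 * 1024**3


def fits_in_swift(object_sizes) -> bool:
    """Return True if every object to be uploaded is within the size limit.

    For an uncompressed package, pass the size of each file.
    For a compressed package, pass a single-element list with the archive size.
    """
    return all(size <= SWIFT_OBJECT_LIMIT for size in object_sizes)


# Uncompressed package: three files, one of them larger than the limit.
print(fits_in_swift([10 * 1024**2, 2 * 1024**3, 6 * 1024**3]))  # False
# Compressed package: a single 3 GiB archive.
print(fits_in_swift([3 * 1024**3]))  # True
```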

S3 (Amazon)

Amazon S3 is available as an access protocol as of Storage Service version 0.12. Locations within S3 can be used as AIP Storage, DIP Storage, Replicator and Transfer Source locations.

Fields:

  • Size: the maximum size allowed for this space. Leave blank for unlimited. This field is optional.
  • Path: Not required: see below.
  • Staging path: the absolute path to a staging area on the same server as the Storage Service.
  • S3 Endpoint URL: the URL of the S3 endpoint, e.g. https://s3.amazonaws.com.
  • Access Key ID to authenticate: the public key generated by S3. Not required: see below
  • Secret Access Key to authenticate with: the secret key generated by S3. Not required: see below
  • Region: the region that the S3 instance uses, e.g. us-east-2.
  • S3 Bucket: the name of the designated S3 Bucket. This field is optional.

Note

Not all fields are required when configuring S3 storage. They can however still be used.

When a Path is configured for S3 as well as a Relative Path for the Location (see locations), the Storage Service will attempt to create an S3 bucket that appears as follows to the user (each new line indicates a nested directory). The values for the storage space Path and the Location Relative Path will appear as configured:

s3
└── d6d618dd-2b7f-4177-8b59-10e242066cb7
    └── storage-space-path
        └── storage-location-relative-path
            └── cd20
                └── f886
                    │
                    └── ...additional pair-tree folders...
                        │
                        └── f1f1
                            └── cd20f886-0c40-4202-af80-399b6ca9f1f1.7z

In this example:

  • d6d618dd-2b7f-4177-8b59-10e242066cb7 is the UUID of the S3 storage space that has been configured.
  • cd20f886-0c40-4202-af80-399b6ca9f1f1 is the UUID of the AIP that we want to store.
  • storage-space-path is configured as the storage space Path (no leading slash).
  • storage-location-relative-path is configured as the Location Relative Path (no leading slash).
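The pair-tree layout in the example above can be reproduced with a few lines of Python. This is a sketch for illustration only: the function names are hypothetical, and the four-character folder width and the .7z extension are read off the example rather than taken from the Storage Service source.

```python
import posixpath
import uuid


def uuid_to_pairtree(package_uuid: str, chunk: int = 4) -> str:
    """Split a UUID's hex digits into nested folders of `chunk` characters."""
    hex_digits = uuid.UUID(package_uuid).hex  # 32 hex characters, no hyphens
    parts = [hex_digits[i:i + chunk] for i in range(0, len(hex_digits), chunk)]
    return posixpath.join(*parts)


def aip_key(space_path: str, location_relative_path: str,
            package_uuid: str, extension: str = ".7z") -> str:
    """Build the object key for a package inside the bucket."""
    return posixpath.join(
        space_path,
        location_relative_path,
        uuid_to_pairtree(package_uuid),
        package_uuid + extension,
    )


key = aip_key(
    "storage-space-path",
    "storage-location-relative-path",
    "cd20f886-0c40-4202-af80-399b6ca9f1f1",
)
print(key)
```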

When Access Key ID and Secret Access Key are configured, the Storage Service will attempt to authenticate with those details.

When these values are not configured, the Storage Service will attempt to use authentication details provided by the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.

When the bucket name is configured, that bucket will be used. Otherwise, the ID of the space will be used instead. In either case the bucket will be automatically created if necessary, and if the AWS user has permissions to do so.
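The fallback rules described above for credentials and bucket name can be summarized in a short sketch. The function name and the returned dictionary are illustrative assumptions and do not reflect how the Storage Service structures this logic internally; only the environment variable names and the fallback order come from the text.

```python
import os


def resolve_s3_settings(space_uuid, access_key_id="", secret_access_key="",
                        bucket="", environ=None):
    """Pick credentials and bucket name following the documented fallbacks."""
    environ = os.environ if environ is None else environ
    if access_key_id and secret_access_key:
        # Values configured on the space take precedence.
        credentials = (access_key_id, secret_access_key)
        source = "space"
    else:
        # Otherwise fall back to the standard AWS environment variables.
        credentials = (
            environ.get("AWS_ACCESS_KEY_ID"),
            environ.get("AWS_SECRET_ACCESS_KEY"),
        )
        source = "environment"
    return {
        "credentials": credentials,
        "credential_source": source,
        # With no bucket name configured, the ID of the space is used.
        "bucket": bucket or space_uuid,
    }


settings = resolve_s3_settings(
    "d6d618dd-2b7f-4177-8b59-10e242066cb7",
    environ={"AWS_ACCESS_KEY_ID": "EXAMPLE-ID", "AWS_SECRET_ACCESS_KEY": "EXAMPLE-SECRET"},
)
print(settings["credential_source"], settings["bucket"])
```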

More specifically, the user associated with the access key must be granted permission to the following actions on the S3 service to interact with the bucket:

List permissions

s3:ListBucket

Grants permission to list some or all of the objects in an Amazon S3 bucket (up to 1000).

Missing this permission does not prevent packages from being stored, but raises this exception when the Storage Service tries to browse the space or its locations:

botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the ListObjects operation: Access Denied
ERROR     2024-03-04 12:56:10  django.request:log:log_response:241:  Internal Server Error: /api/v2/location/34664de6-025f-40a2-87f5-8720ce51169d/browse

Read permissions

s3:GetObject

Grants permission to retrieve objects from Amazon S3.

Missing this permission does not prevent packages from being stored, but raises exceptions like these when the Storage Service tries to retrieve the package or extract a file in it:

botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
ERROR     2024-03-04 12:59:36  django.request:log:log_response:241:  Internal Server Error: /api/v2/file/bf96a036-2631-4fb6-bcdb-781d7690163e/download/

botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
ERROR     2024-03-04 12:59:19  django.request:log:log_response:241:  Internal Server Error: /api/v2/file/bf96a036-2631-4fb6-bcdb-781d7690163e/extract_file/

s3:GetBucketLocation

Grants permission to return the Region that an Amazon S3 bucket resides in.

Missing this permission raises this exception when the Storage Service tries to store the package:

locations.models.StorageException: An error occurred (AccessDenied) when calling the GetBucketLocation operation: Access Denied
ERROR     2024-03-04 13:03:04  django.request:log:log_response:241:  Internal Server Error: /api/v2/file/

Write permissions

s3:PutObject

Grants permission to add an object to a bucket.

Missing this permission raises this exception when the Storage Service tries to store the package:

botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the PutObject operation: Access Denied
ERROR     2024-03-04 13:06:26  django.request:log:log_response:241:  Internal Server Error: /api/v2/file/

s3:DeleteObject

Grants permission to remove the null version of an object and insert a delete marker, which becomes the current version of the object.

Missing this permission raises this exception when the Storage Service tries to delete a package:

ERROR     2024-03-05 10:20:41  django.request:log:log_response:241:  Internal Server Error: /packages/package_delete_request/
Traceback (most recent call last):
...
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the DeleteObject operation: Access Denied

s3:CreateBucket

Grants permission to create a new bucket.

This permission is only necessary if you want to give the S3 space the ability to create its bucket when it doesn’t exist yet.

Missing this permission or setting up the incorrect bucket raises this exception when the Storage Service tries to store the package:

botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the CreateBucket operation: Access Denied
ERROR     2024-03-05 08:11:34  django.request:log:log_response:241:  Internal Server Error: /api/v2/file/
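
When reading Storage Service logs, the exceptions listed above can be matched back to the permission that is probably missing. The following is a minimal sketch of such a lookup; the mapping is taken from the examples on this page, the helper name is hypothetical, and other causes (such as a wrong bucket name) can produce the same messages.

```python
import re

# Operation named in the botocore error -> permission discussed on this page.
OPERATION_TO_PERMISSION = {
    "ListObjects": "s3:ListBucket",
    "HeadObject": "s3:GetObject",
    "GetBucketLocation": "s3:GetBucketLocation",
    "PutObject": "s3:PutObject",
    "DeleteObject": "s3:DeleteObject",
    "CreateBucket": "s3:CreateBucket",
}

ERROR_PATTERN = re.compile(
    r"An error occurred \((?P<code>[^)]+)\) when calling the (?P<operation>\w+) operation"
)


def likely_missing_permission(log_line):
    """Return the permission to check first, or None if the line is not recognised."""
    match = ERROR_PATTERN.search(log_line)
    if match is None:
        return None
    return OPERATION_TO_PERMISSION.get(match.group("operation"))


line = ("botocore.exceptions.ClientError: An error occurred (AccessDenied) "
        "when calling the ListObjects operation: Access Denied")
print(likely_missing_permission(line))
```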

In Amazon Web Services these permissions can be granted through IAM policies attached to the user or its groups. For example, a policy that grants these permissions on a bucket called mybucket might look like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket",
        "s3:DeleteObject",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::mybucket/*",
        "arn:aws:s3:::mybucket"
      ]
    }
  ]
}
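
If the same policy is needed for several buckets, it can be generated rather than edited by hand. The sketch below builds an equivalent document for any bucket name and optionally adds s3:CreateBucket for spaces that should create their own bucket. The function name is hypothetical and the optional Sid field is omitted; review the output against your own security requirements before attaching it.

```python
import json

BASE_ACTIONS = [
    "s3:PutObject",
    "s3:GetObject",
    "s3:ListBucket",
    "s3:DeleteObject",
    "s3:GetBucketLocation",
]


def build_bucket_policy(bucket_name, allow_create_bucket=False):
    """Return an IAM policy document (as a dict) scoped to a single bucket."""
    actions = list(BASE_ACTIONS)
    if allow_create_bucket:
        # Only needed when the space should create the bucket itself.
        actions.append("s3:CreateBucket")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": actions,
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}/*",  # objects in the bucket
                    f"arn:aws:s3:::{bucket_name}",    # the bucket itself
                ],
            }
        ],
    }


policy = build_bucket_policy("mybucket", allow_create_bucket=True)
print(json.dumps(policy, indent=2))
```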

Warning

The Path configured in the Storage Space and the Relative Path configured in the Location are best set without a leading slash (/), because a leading slash may affect the ability to delete stored packages on the S3 service. This is a side effect of the third-party library used to make S3 API calls.

If users have stored packages on paths with leading slashes and packages are not being removed from S3, consider modifying the configuration to remove the slash. Do so with caution and be prepared to put the leading-slash back if there is any impact on storing AIPs as a result.

Debugging S3

There are times when the S3 storage adapter may need to be debugged, for example if a transfer cannot complete because it cannot reach your S3 implementation.

The S3 adapter written for the Storage Service relies heavily on the Boto3 S3 SDK (Software Development Kit). The library is hosted on GitHub and the GitHub issues for Boto3 are a good place to start when trying to understand potential upstream issues.

In the Storage Service, debug logging can be increased for the Boto3 adapter. There is more information about overriding the Storage Service defaults in the installation README.md.

In the logging configuration, administrators should be able to find entries for the two primary Boto3 components:

"boto3": {"level": "INFO"},
"botocore": {"level": "INFO"}

Changing the log level for these entries from INFO to DEBUG will output the entire wire-trace between the Storage Service and your S3 implementation through the lens of the Boto3 SDK. The standard Boto3 logging will provide high-level information, and botocore will be much more detailed.

In their documentation, the Boto3 developers are careful to note the following:

Warning

Be aware that when logging anything from ‘botocore’ the full wire trace will appear in your logs. If your payloads contain sensitive data this should not be used in production.

Note

When updating the debug configuration, if the Boto3 entries are not present then they can be added manually. If adding these manually then keep in mind that the change must not compromise the integrity of the JSON.
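
As a sketch of the change described above, the following helper raises both Boto3 loggers to DEBUG and adds the entries if they are missing. It assumes the logging configuration has been loaded into a Python dictionary that follows the standard logging dictConfig layout, with the entries under a "loggers" key; the exact structure of your Storage Service configuration may differ, and the helper name is hypothetical.

```python
import copy


def with_boto_debug(logging_config):
    """Return a copy of the config with boto3 and botocore set to DEBUG."""
    updated = copy.deepcopy(logging_config)  # leave the caller's dict untouched
    loggers = updated.setdefault("loggers", {})
    for name in ("boto3", "botocore"):
        # Create the entry when absent, then raise its level.
        loggers.setdefault(name, {})["level"] = "DEBUG"
    return updated


example_config = {"version": 1, "loggers": {"boto3": {"level": "INFO"}}}
debug_config = with_boto_debug(example_config)
print(debug_config["loggers"])
```

Remember to revert the levels to INFO after debugging, given the warning above about wire traces in production logs.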

Write-Only Replica Staging on local filesystem

Write-Only Replica Staging spaces allow users to stage AIP replicas on a local filesystem for delivery to offline storage systems such as tape robots. Only Replicator locations can be created in a Write-Only Replica Staging space.

Fields:

  • Size: the maximum size allowed for this space. Leave blank for unlimited. This field is optional.
  • Path: the absolute path to the Space on the local filesystem.
  • Staging path: the absolute path to a staging area. Must be UNIX filesystem compatible and preferably will be located on the same filesystem as the path.

Important

Write-Only Replica Staging is a write-once space. Replicas stored in a Replicator location in a Write-Only Replica Staging space cannot be downloaded, deleted, or fixity checked after storage.

Important

Replicas written within a Write-Only Replica Staging space are stored directly within the Replicator location’s path rather than within the typical UUID quad directories. This enables efficient retrieval of packages stored in the space by offline storage systems.

Locations

Locations are contained within spaces and have a defined purpose in the Archivematica system. Each location is associated with at least one pipeline. A pipeline can have multiple instances of any location, and a location can be associated with any number of pipelines, with the exception of Backlog and Currently Processing locations, for which there must be exactly one per pipeline.

Storage Service locations screen.

A location can have one of nine purposes: AIP Recovery, AIP Storage, Currently Processing, DIP Storage, FEDORA Deposit, Storage Service Internal Processing, Transfer Backlog, Transfer Source, or Replicator.

Fields:

  • Purpose: the function that this location will fulfill, e.g. AIP storage. See location purposes for more information.
  • Pipelines: the Archivematica instance(s) that will be able to use this location.
  • Relative Path: the path to this location, relative to the space that contains it.
  • Description: a description of the location to be displayed to the user.
  • Quota: the maximum size allowed for this location. Leave blank for unlimited. This field is optional.
  • Enabled: if checked, this location will be accessible to pipelines associated with it. If unchecked, it will not be available to any pipeline.
  • Set as global default location for its purpose: if checked, this location will be the default location for its purpose unless the user specifically tells Archivematica otherwise during processing.

Note

When setting up a DSpace via SWORD2 location the relative path needs to be the URL of the destination collection for the transfers. E.g.: https://demo.dspace.org/10673/60/.

Location purposes

AIP Recovery

AIP Recovery locations are where the AIP recovery feature looks for an AIP to recover. No more than one AIP recovery location should be associated with a given pipeline. The default value is /var/archivematica/storage_service/recover in a Local Filesystem. This is only required if AIP recovery is used.

AIP Storage

AIP Storage locations are where the completed AIPs are placed for long-term storage. For pipelines using the default locations, the default path is /var/archivematica/sharedDirectory/www/AIPsStore in a Local Filesystem. An AIP storage location is required to store and retrieve AIPs.

Currently Processing

Archivematica uses the Currently Processing location associated with a pipeline to store materials during active processing. Exactly one currently processing location should be associated with a given pipeline. For pipelines using the default locations, the default path is /var/archivematica/sharedDirectory in a Local Filesystem. A currently processing location is required for Archivematica to run.

DIP Storage

The DIP Storage location is used for storing DIPs until such a time that they can be uploaded to an access system. For pipelines using the default locations, the default path is /var/archivematica/sharedDirectory/www/DIPsStore in a Local Filesystem. A DIP storage location is required to store and retrieve DIPs, but it is not required to upload DIPs to access systems.

FEDORA Deposit

A FEDORA Deposit location is used with the Archidora plugin to ingest material from Islandora. This is only available to the FEDORA Space, and is only required for that space.

Storage Service Internal Processing

There should only be one Storage Service Internal Processing location for a Storage Service installation. For pipelines using the default locations, the default path is /var/archivematica/storage_service in a Local Filesystem. This is required for the Storage Service to run, and must be locally available to the Storage Service. It should not be associated with any pipelines.

Transfer Backlog

Transfer Backlog locations store transfers until the user continues processing them. For pipelines using the default locations, the default path is /var/archivematica/sharedDirectory/www/AIPsStore/transferBacklog in a Local Filesystem. This is required to store and retrieve transfers in backlog.

Transfer Source

A list of Transfer Source locations is displayed in the transfer source dropdown on the Archivematica pipeline’s Transfer tab. Any folder in a transfer source can be selected to become a transfer. The default value is /home in a Local Filesystem.

Replicator

Replicator locations can be configured to replicate the AIPs in one or more AIP storage locations. Replicators can be configured for each of the following protocols supported by the Storage Service:

  • DuraCloud.
  • Local filesystem, including GPG encrypted and write-only replica staging.
  • S3.

Replicators are associated with an AIP storage location via the storage location’s configuration page. As such Replicators must be configured before they can be selected for use alongside an AIP store. There is additional information about this under configure location. If you would like your replicated AIPs to be encrypted, create the location in an encrypted space.

Note

If you want the same space to have multiple purposes, multiple locations with different purposes can be created.

Outcomes of a Replicator

One AIP to many Replicators:

An AIP will be replicated n times where there are n configured Replicators for an AIP storage location.

On reingest, replicas are replaced and given new identifiers:

During reingest, previous replicas will be deleted and marked as DELETED within the Storage Service Packages tab. A new replica will be created and marked as UPLOADED. The new replica will be assigned a new UUID. Pointer files will be updated with accurate references between the source and the replica and vice versa.

Reingest can be used to update or delete replicas:

If a Replicator is disabled, reingest can be used to selectively remove replicas from that replica location. If a new Replicator is enabled, reingest can be used to replicate an existing AIP in that replica location.

Creating and updating replicas without reingest:

Replicas can also be created and updated in bulk without the need for reingest using the create AIP replicas management command.

Deletion of the source AIP will delete the replicas:

When the source AIP is deleted via Archivematica or the Storage Service, the replicas will also be deleted. Conversely, when a replica is deleted, only the replica within the given location is affected.

Limitations of a Replicator

Tight-coupling with AIP storage in a Pipeline:

Replication is coupled with AIP storage. As such, if there is a failure when storing a replica, even if the source AIP has succeeded, an Archivematica pipeline will still fail during processing. This affects indexing of the AIP in the Archival Storage tab within Archivematica. It also leaves the replica packages in staging in the Storage Service. The Storage Service user-interface will clearly mark the replica packages’ status as STAGING.

New UUIDs during reingest:

Since the AIP will be given a new UUID during reingest, external services tracking the packages in the Storage Service, especially replicas, will need to synchronize their own record accordingly.

How to configure a location

To create and configure a new Location:

  1. In the Storage Service, navigate to the Spaces tab.

  2. Under the space that you want to add the location to, click on Create Location here.

  3. Choose a purpose (e.g. AIP Storage) and pipeline, and enter a “Relative Path” (e.g. var/mylocation) and human-readable description. The Relative Path is relative to the Path defined in the Space you are adding the Location to. For example, for the default Space, the Path is / so your Location path would be relative to that (in the example here, the complete path would end up being /var/mylocation).

    Note

    If the path you are defining in your Location doesn’t exist, you must create it manually and make sure it is writable by the Archivematica user.

  4. For an AIP storage location, choose Replicator location(s) if desired. To select more than one, hold down Ctrl while clicking the UUID of the replica.

    Note

    One or more Replicator locations must already have been configured to be selected. If a Replicator location is configured afterwards then it can be added via this page.

    Replica AIPs will be created only after a location has been associated with a Replicator, but can also be created retroactively via the create AIP replicas management command.

  5. Save the Location settings.

  6. The new Location will now be available as an option under the appropriate options in the Dashboard, for example as a Transfer location (which must be enabled under the Dashboard “Administration” tab) or as a destination for AIP storage.
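
The way the Relative Path from step 3 combines with the Path of the Space can be illustrated with a short sketch. The helper name is hypothetical and it only models the path arithmetic for a local filesystem; it does not create the directory or check permissions.

```python
import posixpath


def full_location_path(space_path, relative_path):
    """Join a Space path and a Location relative path into one absolute path."""
    # A leading slash on the relative path is dropped so it stays relative.
    return posixpath.normpath(posixpath.join(space_path, relative_path.lstrip("/")))


print(full_location_path("/", "var/mylocation"))  # /var/mylocation
```

The resulting directory is the one that must exist and be writable by the Archivematica user, as noted in step 3.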

Packages

Storage Service packages screen.

A package is a file that Archivematica has stored in the Storage Service, commonly an Archival Information Package (AIP). Dissemination Information Packages (DIPs) which have been stored and transfers which have been sent to backlog will also be listed in the Packages tab.

AIPs cannot be created or deleted through the Storage Service interface, though a deletion request can be submitted through Archivematica that must be approved or rejected by the Storage Service administrator. To learn more about deleting an AIP, see Deleting an AIP.

Stored DIPs can be deleted through the Storage Service interface by choosing the “Delete” option in the “Actions” column.

Deletion requests for transfers are automatically generated when all of the objects from the transfer have successfully been stored in AIPs.

For more information about Fixity Status, see Fixity.

Administration

The Administration section manages the users and settings for the Storage Service.

Storage Service Administration screen.

Configuration

The configuration page allows you to control the behaviour of the Storage Service.

Pipelines are disabled upon creation? allows you to decide whether a newly created Archivematica pipeline can access the Locations that are assigned to it. By disabling newly created pipelines, you can provide some security against unwanted perusal of the files in assigned locations, or use by unauthorized Archivematica instances. This can also be configured individually when creating a pipeline manually through the Storage Service website.

Object counting in spaces is disabled? allows you to disable automatic object counting for spaces that display count information to users, like the space containing the transfer source location. Having object counting enabled can cause delays and timeouts in the dashboard.

A list of directory names with object counts beside them - for example, archivematica-sampledata (3810 objects).

The transfer browser with object counting enabled.

A list of directory names with no object counts beside them - for example, archivematica-sampledata.

The transfer browser with object counting disabled.

Recovery request

These fields allow you to set up notifications for events related to the AIP deletion workflow - for example, deletion approvals or rejections.

Fields:

  • Recovery request: URL to notify: the server that should receive the JSON-encoded event message from the Storage Service.
  • Recovery request notification: Username (optional): A username for basic access authentication.
  • Recovery request notification: Password (optional): A password for basic access authentication.

Default locations

The default location settings allow you to define default locations for any new pipeline that is registered with the Storage Service. You can define default locations for the following:

  • Transfer source
  • AIP storage
  • DIP storage
  • Transfer backlog
  • AIP recovery

Multiple transfer source or AIP storage locations can be configured by holding down Ctrl when selecting them.

A Currently Processing location is also created for every new pipeline, since one is required.

Users

The Users section allows you to manage users for the Storage Service. Only registered users can log into the Storage Service.

There are four user roles: administrator, manager, reviewer, and reader.

Readers can see all information in the pipelines, spaces, locations, and packages tabs, but cannot edit, manage deletion requests, or take any other action in those tabs. In the administration tab, readers can edit their own user profiles. Readers can also change the language of the Storage Service interface for themselves.

Reviewers have all the permissions listed above, and can additionally review, approve, and reject package recovery and deletion requests.

Managers have all the permissions listed above, and can additionally create, edit, disable, and delete pipelines, spaces, and locations. Managers can also configure the Storage Service, create callbacks, and create and delete keys.

Administrators have all the permissions listed above, and can additionally create and manage other Storage Service users. Administrators will also receive email notifications when special events occur.

Version

The version page will display the current version and specific git commit of your installation of the Storage Service.

Service callbacks

Callbacks allow Archivematica Storage Service to make REST calls after performing certain types of actions, so that external services are notified when internal actions have taken place. You can create callbacks to alert external services when an AIP, DIP, or AIC has been stored.

To create a new callback, click on Create new callback. This will bring you to a form where you can enter the callback information.

Fields:

  • Event: Type of event when this callback should be executed (e.g. post-store AIP, post-store DIP)
  • URI: URL to contact upon callback execution.
  • Method: HTTP request method to use in connecting to the URL (e.g. GET, POST)
  • Headers (key/value): the header(s) for the request.
  • Body: Body content for each request. Set the ‘Content-type’ header accordingly.
  • Expected status: Expected HTTP response from the server, used to validate the callback response.
  • Enabled: check the box to enable the callback.

Event types available include:

  • Post-store AIP (source files): Occurs after an AIP has been stored and causes the execution of a request for each source file of the AIP.
  • Post-store AIP, Post-store AIC and Post-store DIP: Occurs after an AIP, AIC or DIP has been stored and causes the execution of a single request for the package.

You can use the following placeholders in the URI and Body fields:

  • Post-store AIP (source files): <source_id> will be replaced by the source file UUID.
  • Post-store AIP, Post-store AIC and Post-store DIP: <package_uuid> will be replaced by the AIP, AIC or DIP UUID. <package_name> will be replaced by the AIP, AIC or DIP name, with the trailing UUID removed.

Note

For AIPs created directly from a transfer, the value that replaces <package_name> will be equal to the name of the transfer after successful completion of the “Change transfer filenames” Microservice.
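
To make the placeholder behaviour concrete, here is a minimal sketch of the substitution. It is not the Storage Service's implementation: the function names and the example URL are hypothetical, and it assumes the stored package name has the form name-UUID, which is how "with the trailing UUID removed" is interpreted here.

```python
import re

# Matches a hyphen followed by a UUID at the very end of a string.
UUID_SUFFIX = re.compile(
    r"-[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}$", re.IGNORECASE
)


def package_name_without_uuid(stored_name):
    """Drop the trailing UUID from a stored package name, if present."""
    return UUID_SUFFIX.sub("", stored_name)


def fill_package_placeholders(template, package_uuid, stored_name):
    """Fill <package_uuid> and <package_name> in a URI or Body template."""
    return (
        template.replace("<package_uuid>", package_uuid)
        .replace("<package_name>", package_name_without_uuid(stored_name))
    )


def fill_source_placeholder(template, source_id):
    """Fill <source_id> for the per-file Post-store AIP (source files) event."""
    return template.replace("<source_id>", source_id)


uri = fill_package_placeholders(
    "https://example.org/notify/<package_uuid>?name=<package_name>",
    "cd20f886-0c40-4202-af80-399b6ca9f1f1",
    "my-transfer-cd20f886-0c40-4202-af80-399b6ca9f1f1",
)
print(uri)
```

The sketch does not URL-encode the substituted values, so names containing spaces or other reserved characters would need additional handling on the receiving side.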

External applications can integrate with Archivematica via the post_store API endpoint or stored endpoints. See the API documentation for more information on using these endpoints.

A callback can be configured for the SCOPE integration. See the SCOPE documentation for how to set up this callback.

Encryption keys

GPG encryption keys are used in spaces to store encrypted AIPs, transfers, or replicated AIPs/transfers. Keys can either be created by the Storage Service or imported.

To create a new key:

  1. Click on Create New Key
  2. Enter the name and email address you want associated with the key.

To import a key:

  1. Click on Import Existing Key
  2. Paste in your key in ASCII-armored format.

Language

Configure language settings for the Storage Service in this area of the Administration tab. Strings are available for translation on the localization platform (Transifex).

License

Archivematica documentation by Artefactual Systems Inc. is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.