Notes on the Open Archival Information System (OAIS)

by wsampson

Back in 2002 the Consultative Committee for Space Data Systems made a recommendation to the ISO for an Open Archival Information System. The recommendation has found broad acceptance, and digital repository software packages like DSpace usually document their varying levels of compliance with it. Since we want our archive to have a future as a federated or cooperating (OAIS terms) archive, and since the terminology and concepts created in this document are widespread, I decided to take some notes on the recommendation as they relate to potential metadata elements we’ll employ.

The recommendation mostly concerns itself with the long term preservation of digital objects, although the framework incorporates metadata for physical objects as well. Broadly, OAIS defines an Information Object as a Data Object coupled with its Representation Information. The Representation Information allows a person to understand how the bits in the Data Object are to be interpreted. An example would be a TIFF file (Data Object) coupled with an ASCII document (Representation Information) detailing the headers, its compression method, etc., like here: TIFF description at Digital Preservation (The Library of Congress). Of course, one might also want Representation Information for the ASCII file, to explain how characters are interpreted in that format. OAIS terms this phenomenon recursive Representation Information, and one might eventually accrue a Representation Network of such digital objects. The recursion stops when the Knowledge Base of the Designated Community has the requisite knowledge to understand the top-most piece of Representation Information.
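
To make the recursion concrete, here is a minimal Python sketch of a Representation Network. The names (InformationObject, missing_representation) and the idea of labelling formats with short strings are my own assumptions for illustration; OAIS prescribes none of this.

```python
from dataclasses import dataclass, field


@dataclass
class InformationObject:
    """A Data Object paired with the Representation Information that explains it."""
    name: str
    data_format: str  # hypothetical short label for the Data Object's format
    representation: list["InformationObject"] = field(default_factory=list)


def missing_representation(obj, knowledge_base):
    """List formats that are neither documented in the network nor already
    understood by the Designated Community (its Knowledge Base)."""
    missing = set()
    seen = set()
    stack = [obj]
    while stack:
        current = stack.pop()
        if id(current) in seen:
            continue  # guard against cycles in the network
        seen.add(id(current))
        if current.data_format in knowledge_base:
            continue  # the community can already read this; the recursion stops
        if not current.representation:
            missing.add(current.data_format)
        stack.extend(current.representation)
    return sorted(missing)


# The TIFF example: an ASCII document describes the TIFF file.
ascii_spec = InformationObject("tiff-description.txt", "ASCII")
tiff = InformationObject("scan.tif", "TIFF", [ascii_spec])

print(missing_representation(tiff, knowledge_base={"ASCII"}))  # []
print(missing_representation(tiff, knowledge_base=set()))      # ['ASCII']
```

If the Designated Community already reads ASCII, the network is complete; if not, the ASCII document itself still needs Representation Information.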

OAIS defines two types of Representation Information: Structure and Semantic. Structure Information describes the data format applied to the bit sequence to derive more meaningful values like characters, pixels, numbers, etc. Semantic Information describes the social meaning behind these higher values (for example that the text characters are English).

OAIS discourages using software that can access and use Data Objects as a replacement for comprehensive Representation Information. Although that would serve the end user well enough for a time, the software itself naturally poses its own obsolescence problem. Of course, the digital material we would like to preserve is mostly software itself. We may have datasets, images, scans, etc., but the majority of digital assets we hold are complete software packages. This includes operating systems, office suites, computer games, console games (on cartridges) and so on. Retrieving Representation Information for all these types of software will be a considerable and ongoing task, as most software will consist of multiple file types.

On a side note, some research (or risk assessment) will need to be done as to what conditions apply to preservation copies and fair use copies. The Digital Millennium Copyright Act can be exceptionally restrictive on this point. This issue falls into a broader topic concerning what control we have (and whether it is sufficient) over Content Information we receive for which we do not own the intellectual property rights. A Submission Agreement can iron out these details in the case of a donor, but materials coming from recycling are another issue to investigate.

Along with the Data Object and Representation Information, which together form the Content Information, OAIS specifies Preservation Description Information (PDI) to ensure the comprehensibility of the former. OAIS lists four types of PDI, each of which gives us something to consider with regard to all our holding types (software, hardware and documents):

  • Provenance: custody information and processing history. Custody information is something we might want to really strive to extract from donors since it constitutes some of the history of personal and professional computing that we are trying to preserve. Other fields: revision history, license holder, registration, copyright.
  • Context: how the Content Information relates to information outside itself. This would be critical and quite formal if we were producing datasets, but as it stands this field seems open to a great deal of interpretation. Possible information could be the use and purpose of the software, chip, computer, etc., conditions of use (where it was used, by whom, etc.), and relations it may have with other materials we hold. Also manufacturing information, help files, user guides, the language of the package.
  • Reference: one or more unique identifiers. This should be contingent on our repository software and metadata set, and could include values like name, author/originator, version number or serial number.
  • Fixity: protection of the Content Information from undocumented alteration. Checksums, CRC, etc. It’s preferable to perform these operations for each file of a piece of software when we are not simply creating disk images of the media.
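
Fixity is the most mechanical of the four, so here is a rough sketch of per-file checksums using only the Python standard library. The function names and the choice of SHA-256 are assumptions on my part; OAIS does not mandate a particular algorithm.

```python
import hashlib
from pathlib import Path


def fixity_manifest(root, algorithm="sha256"):
    """Map each file's path (relative to root) to its hex digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.new(algorithm)
            with path.open("rb") as handle:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[path.relative_to(root).as_posix()] = digest.hexdigest()
    return manifest


def changed_files(old, new):
    """Return paths that were added, removed, or altered between two manifests."""
    return sorted(p for p in old.keys() | new.keys() if old.get(p) != new.get(p))
```

Storing the manifest alongside the PDI and recomputing it later would flag any undocumented alteration: changed_files(stored, fixity_manifest(root)) should come back empty. For a disk image the same idea applies to the single image file.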

Wrapping around all this is the Packaging Information, which explains, identifies and relates the Content Information and PDI. I believe this is where we would store critical information on media-dependent attributes for our software such as tape block sizes, CD-ROM volume information, floppy block sizes and various filesystem information. This is vital information that needs to be preserved at every step, since it constitutes the history of information storage, the conditions of the initial write and so on. Along with a disk image there should be concrete information on these aspects of the original media. In addition, if we have provided another compression/packaging layer on top of this, that information would be located here. For example, if we had compressed our Content Information and PDI into a tarball, the Packaging Information would explain the .tar format.

The Content Information and PDI constitute the Archival Information Package (AIP), which should contain all needed information to allow Long Term Preservation. OAIS also specifies Archival Information Collections, which we may use in the case of donor collections.
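
As a starting point for thinking about the schema, the pieces described so far could be grouped roughly like this in Python. Every class and field name below is hypothetical and the dictionaries are placeholders for whatever metadata set we settle on; this is only a sketch of how the OAIS terms nest, not a proposed implementation.

```python
from dataclasses import dataclass, field

PDI_CATEGORIES = ("provenance", "context", "reference", "fixity")


@dataclass
class PreservationDescription:
    """The four PDI categories, each a free-form mapping for now."""
    provenance: dict = field(default_factory=dict)  # custody, processing history
    context: dict = field(default_factory=dict)     # relations to outside information
    reference: dict = field(default_factory=dict)   # identifiers
    fixity: dict = field(default_factory=dict)      # checksums per file or image

    def empty_categories(self):
        """Names of PDI categories that still have no entries."""
        return [name for name in PDI_CATEGORIES if not getattr(self, name)]


@dataclass
class ArchivalInformationPackage:
    """Content Information plus PDI, with Packaging Information wrapped around."""
    content_files: list        # the Data Object(s), e.g. a disk image
    representation_info: list  # documents explaining the formats involved
    pdi: PreservationDescription
    packaging: dict = field(default_factory=dict)  # e.g. block sizes, filesystem


# Hypothetical example values, for illustration only.
pdi = PreservationDescription(reference={"serial": "A-1"}, fixity={"disk.img": "abc"})
aip = ArchivalInformationPackage(["disk.img"], ["fat12-notes.txt"], pdi)
print(aip.pdi.empty_categories())  # ['provenance', 'context']
```

A check like empty_categories could serve as a simple completeness test at ingest, reminding us which PDI categories still need attention for a given item.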

OAIS also details four migration types we will want to be aware of:

  • Refreshment: simply copying a media instance from its original media to a new, identical piece of media. This would be when a floppy is getting old, looking old, etc., or when we are transferring data to a new hard drive (as routine backup or not). None of the infrastructure that points to AIPs needs to change, and all bits are preserved.
  • Replication: a transfer to a new media type where the Packaging Information, Content Information and PDI are exactly preserved. The infrastructure may need to be adjusted to point to a new location.
  • Repackaging: bits are changed in the Packaging Information as a result of a transfer, because new files and directories were created (though they may share the same structure and names as their originals).
  • Transformation: bits are changed in the Content Information or PDI; the information content is preserved (supposedly).
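
The four types can be read as a small decision procedure, sketched below in Python. It follows the definitions as summarized in these notes rather than the full wording of the recommendation, and the function name and its three boolean parameters are my own invention.

```python
from enum import Enum


class Migration(Enum):
    REFRESHMENT = "refreshment"
    REPLICATION = "replication"
    REPACKAGING = "repackaging"
    TRANSFORMATION = "transformation"


def classify_migration(same_media_type, packaging_changed, content_or_pdi_changed):
    """Pick the migration type from what changed, checking the riskiest case first."""
    if content_or_pdi_changed:
        return Migration.TRANSFORMATION  # bits of Content Information or PDI changed
    if packaging_changed:
        return Migration.REPACKAGING     # only Packaging Information bits changed
    if not same_media_type:
        return Migration.REPLICATION     # all bits preserved, new media type
    return Migration.REFRESHMENT         # all bits preserved, same media type


# Copying a disk image from an aging floppy to a hard drive, bit for bit:
print(classify_migration(same_media_type=False,
                         packaging_changed=False,
                         content_or_pdi_changed=False))  # Migration.REPLICATION
```

The two "changed" flags could in principle be derived by comparing checksums taken before and after the transfer.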

Provided that we use disk images, and do not attempt to move a software package’s individual files and directory structure to new media “by hand”, we should be able to stay in the safe realms of Refreshment and Replication.

The recommendation covers a great deal more in submission protocols, dissemination, management, policy and materials control, but what’s covered here is, I feel, most relevant to keep in mind when designing the metadata schema for our holdings.