### The OAIS Information Model Revisited — Part 1

#### by Alistair Miles

**Introduction & Motivation**

The Reference Model for an Open Archival Information System (OAIS) is an influential standard in the digital preservation domain. It contains an *information model*, which lays out some basic ideas about digital information, how it is encoded, interpreted and packaged. It also contains a *functional model*, which lays out the main functional components that should be present in a digital preservation system.

The CASPAR Project is currently designing and implementing software components for a distributed infrastructure to support digital preservation. The starting point for the design of these components is the OAIS reference model, and in particular, the OAIS information model.

This note captures some initial thoughts on the OAIS information model, working towards answers to the following questions:

- Does the OAIS Information Model make sense?
- Can it be used as the basis for designing software components, within a UML model-driven software engineering process?

This is work in progress, so don’t expect a faultless analysis!

**The OAIS Information Model**

The OAIS information model introduces the notion of *representation information*, and its key role in providing the additional information needed to yield usable *information* from *data*.

Section 2.2 of the OAIS reference model provides a preliminary definition of *information*, and of *information packages*. Section 4.2.1 then provides a more formal, “logical model” for archival information, using UML class diagrams.

Section 2.2 states that “Data interpreted using its Representation Information yields Information,” which is illustrated informally in figure 2-2.

The distinction between *data* and *information* is made to emphasise the point that the preservation of *information content* is the primary goal of a preservation system. That is, it is not good enough to simply preserve sequences of bits; it must be possible to derive at least some usable information from those bits at some point in the future. To be able to derive information from data, some *representation information* is required, which specifies how that data should be *interpreted*.

The relationships between data, information and representation information are then expressed in more detail in section 4.2.1, and in particular in the UML class diagram given in figure 4-10. An equivalent class diagram is shown below.

**Data, Information and Interpretation**

There is a genuine need to distinguish *representation information* as an essential component in the preservation of digitally encoded information, and this need is clearly expressed in OAIS sections 2.2 and 4.2.1.

However, there are a number of ambiguities and inconsistencies both within and between these two sections of the OAIS reference model, which could hinder the use of the OAIS information model as a robust starting point for a software design process.

**Interpretation as an Act**

Let us start again from the informal statement that, “Data interpreted using its Representation Information yields Information.”

This statement does **not** seem to suggest that, crudely stated, data + representation information = information. Rather, it suggests that, if you have some data and some representation information, then you will be *capable of interpreting* that data to yield some information. I.e. an *act of interpretation* is required in order to yield information from data using representation information.

To put this another way, it suggests that representation information is involved in some sort of *transformation* or *function* which maps data onto information.

We could illustrate this intuition using a UML activity diagram, for example:
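The same intuition can also be sketched in code. The following is only a minimal illustration, not something taken from the OAIS text: the `RepresentationInformation` class, its `encoding` field and the `interpret` function are all hypothetical names, and a character encoding stands in for representation information in general.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepresentationInformation:
    """Hypothetical stand-in: here it only names a character encoding."""
    encoding: str


def interpret(data: bytes, rep_info: RepresentationInformation) -> str:
    """The act of interpretation: maps data onto information."""
    return data.decode(rep_info.encoding)


# The same sequence of bits...
data = b"caf\xc3\xa9"

# ...yields different information under different representation information.
as_utf8 = interpret(data, RepresentationInformation("utf-8"))      # 'café'
as_latin1 = interpret(data, RepresentationInformation("latin-1"))  # 'cafÃ©'
```

The point of the sketch is that neither `data` nor the representation information is the information; the information only appears once `interpret` is applied.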

**Representation Information as Information**

A question that naturally arises at this stage is, what is representation information? Is it information, or is it data? The name suggests that it is information — which could therefore have been obtained by the interpretation of some other data. In other words, OAIS section 2.2 suggests that there is no fundamental distinction between *information* and *representation information* — one person’s information could be another person’s representation information.

We could extend the UML activity diagram above to illustrate this notion of representation information as information, being itself derived from the interpretation of some other data:

This diagram could obviously be extended upwards or downwards in the same way forever — clearly illustrating the basic idea from the OAIS reference model that representation information itself depends on other representation information, leading to a *network of dependencies* between representation information.
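As a rough sketch of what such a network of dependencies could look like in code, the example below records, for each item, the representation information needed to interpret it, and then collects everything that is required directly or indirectly. The item names and the `depends_on` mapping are invented for illustration and do not come from the OAIS reference model.

```python
def required_representation_information(depends_on, item):
    """Collect all representation information needed, directly or
    indirectly, to interpret `item`."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for dependency in depends_on.get(current, ()):
            if dependency not in seen:
                seen.add(dependency)
                stack.append(dependency)
    return seen


# Hypothetical network: each key is interpreted using the items in its list.
depends_on = {
    "observations.dat": ["format description (text)"],
    "format description (text)": ["character encoding table", "English dictionary"],
    "character encoding table": ["English dictionary"],
}

needed = required_representation_information(depends_on, "observations.dat")
```

In this sketch the traversal simply stops at items with no recorded dependencies; where a real network should stop is a separate question that the sketch does not address.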

**The Components of Information**

Now consider the statement made in section 4.2.1.1 that, “the Information Object is *composed of* a Data Object […] and the Representation Information that allows for the full interpretation of the data into meaningful information,” (my emphasis). This statement is reflected in figure 4-10 by the *aggregation associations* between the classes *Information Object*, *Data Object* and *Representation Information*.

From Enterprise Architect’s UML help:

> An *aggregation* relationship is a type of association that shows that an element contains or is composed of other elements. Used in Class models to show how more complex elements (aggregates) are built from a collection of simpler elements (component parts; eg. a car from wheels, tires, motor and so on).

Thus, it seems that section 4.2.1 makes a fairly vague statement that information is *composed of* data and representation information, whereas section 2.2 makes the stronger statement that information can be *derived from* data using representation information (via an act of interpretation). While these two statements are not necessarily inconsistent, it is clear that the UML class diagram in figure 4-10 **does not** fully capture our intuitions about data and information from section 2.2.

The aggregation association is typically used to indicate a *part-whole relationship* between two classes. However, the idea that data and representation information are the *parts of* information seems quite different from the idea that information is obtained (derived) from the interpretation of data (by some act).

The idea that information *contains* data seems even less intuitive.
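To make the contrast concrete, here is one literal reading of the part-whole view as code. This is a sketch under my own assumptions: the class and field names are chosen to echo figure 4-10, and the example values are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InformationObject:
    """Part-whole reading: the object merely holds its two parts."""
    data_object: bytes
    representation_information: str


obj = InformationObject(
    data_object=b"\x00\x2a",
    representation_information="16-bit big-endian unsigned integer",
)

# Nothing in the structure above performs an interpretation. To get at the
# information (the number 42), something must still act on the parts:
value = int.from_bytes(obj.data_object, "big")
```

The aggregate stores the bits and a description of them, but the number itself only appears after the final line, which is exactly the act that the part-whole model leaves out.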

**Explicitly Modeling the Interpretation of Data**

There is another ambiguity and potential inconsistency between the informal definitions in section 2.2 and the more detailed information model in section 4.2.1.

Let us assume that representation information is information (and not data). According to figure 4-10, data is *interpreted using* representation information. However, representation information is also *interpreted using* representation information. There are clearly two completely different notions of interpretation here. In the first, data *can be* interpreted using representation information; in the second, representation information is itself *the result of* interpretation of some other data using some other representation information.

On a closely related point, the *cardinality* of representation information in the interpretation of data is also ambiguous. In figure 4-10, data is interpreted using exactly one item of representation information; however, representation information is interpreted using any number of items of other representation information. Either there are two completely separate notions of interpretation here, or there is an inconsistency.

We can resolve these issues by replacing the UML class diagram of figure 4-10 with a different class model, in which the notion of *an interpretation* (of some data, using some representation information) is made explicit.

There are a number of alternatives here. One way of making the notion of an interpretation explicit is to take a purely static view, and define a class (*Interpretation*):

Note the following points about this class diagram:

- An instance of Interpretation involves exactly one instance of Data, one or more instances of Information playing the role of representation information, and yields exactly one instance of Information.
- An instance of Information is associated with exactly one instance of Interpretation (by which it was obtained).
- An instance of Data is associated with any number of Interpretations.
- An instance of Information can play the role of representation information in any number of Interpretations.
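The points above could be approximated in code roughly as follows. This is a sketch, not a definitive rendering of the class model: the attribute names are my own, and one deliberate departure is labelled in the comments. Because every item of Information would otherwise need an Interpretation that itself needs Information, `obtained_by` is allowed to be `None` for information at the edge of the model.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Data:
    bits: bytes


@dataclass(eq=False)
class Information:
    content: str
    # ASSUMPTION: None marks information whose own interpretation is not
    # modelled; the class model above requires exactly one Interpretation.
    obtained_by: Optional[Interpretation] = None


@dataclass(eq=False)
class Interpretation:
    data: Data                                            # exactly one
    representation_information: Tuple[Information, ...]   # one or more
    yields: Information                                   # exactly one

    def __post_init__(self):
        if not self.representation_information:
            raise ValueError("at least one item of representation information is required")
        if self.yields.obtained_by is not None:
            raise ValueError("this Information was already obtained by another Interpretation")
        # Record the association from Information back to its Interpretation.
        self.yields.obtained_by = self


code_table = Information("ASCII code table")   # edge of the model
data = Data(b"42")
reading = Information("the text '42'")
interpretation = Interpretation(data, (code_table,), reading)
```

Note that `data` and `code_table` can be reused in any number of further `Interpretation` instances, whereas `reading` is tied to the single instance that yielded it.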

A second alternative is to take an entirely behavioural view of interpretation, and define an interface (*Interpreter*) which exposes an *interpret* operation:

Note that, informally speaking, the *interpret* operation takes as input an item of data and a number of items of information (as representation information), and returns a new item of information.
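A behavioural view along these lines might be sketched as an abstract base class. Everything below is illustrative: the `DelimitedTextInterpreter` class and the `encoding`, `delimiter` and `fields` keys are hypothetical, and a plain `dict` stands in for an item of information.

```python
from abc import ABC, abstractmethod
from typing import Dict, Sequence

Data = bytes
Information = Dict[str, object]  # loose stand-in for an item of information


class Interpreter(ABC):
    @abstractmethod
    def interpret(
        self, data: Data, representation_information: Sequence[Information]
    ) -> Information:
        """Return new information derived from the data."""


class DelimitedTextInterpreter(Interpreter):
    """Hypothetical interpreter for a single delimited text record."""

    def interpret(self, data, representation_information):
        if not representation_information:
            raise ValueError("at least one item of representation information is required")
        # Combine all supplied items into one specification.
        spec = {}
        for item in representation_information:
            spec.update(item)
        values = data.decode(spec["encoding"]).split(spec["delimiter"])
        return dict(zip(spec["fields"], values))


record = DelimitedTextInterpreter().interpret(
    b"2006-03-01,12.5",
    [{"encoding": "ascii", "delimiter": ","}, {"fields": ["date", "temperature"]}],
)
# record == {'date': '2006-03-01', 'temperature': '12.5'}
```

Here two separate items of representation information (one describing the encoding, one naming the fields) are needed together, which is why the operation accepts a sequence rather than a single item.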

A third alternative is to take a hybrid view, and view an *Interpretation* as the result of an *interpret* operation:

Note that in this model, the *interpret* operation returns an instance of *Interpretation*, which has an association back to the *Interpreter* which created it.
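A minimal sketch of this hybrid view is given below, again under stated assumptions: the attribute names (`yields`, `created_by`) and the `EncodingInterpreter` class are hypothetical, and plain `bytes` and `str` values stand in for data and information.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Interpretation:
    """Static record of one act of interpretation."""
    data: bytes
    representation_information: Tuple[str, ...]
    yields: str
    created_by: Interpreter  # association back to the Interpreter


class Interpreter:
    def interpret(
        self, data: bytes, representation_information: Sequence[str]
    ) -> Interpretation:
        raise NotImplementedError


class EncodingInterpreter(Interpreter):
    """Hypothetical: treats the first item as a character encoding name."""

    def interpret(self, data, representation_information):
        rep_info = tuple(representation_information)
        if not rep_info:
            raise ValueError("at least one item of representation information is required")
        text = data.decode(rep_info[0])
        return Interpretation(data, rep_info, text, created_by=self)


interpreter = EncodingInterpreter()
result = interpreter.interpret(b"hello", ["ascii"])
```

The returned object keeps the inputs, the resulting information and the component that produced it together, which could be useful for recording how a given piece of information was obtained.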

This third option is probably the most useful from the point of view of defining software components that can, for example, automatically interpret some data according to machine-processable representation information (e.g. EAST specifications, DEDSL data dictionaries etc.). However, I’ll try to explore that in more detail in further notes.