The OAIS Information Model Revisited — Part 2.

by Alistair Miles

Previously, in The O.A.I.S. Information Model Revisited – Part 1, I explored the abstract notions of data, information, representation information and interpretation.

I found that the O.A.I.S. notion of interpretation makes most sense when viewed as an act or operation, taking data and representation information as input, yielding new information as output.

In this note, I’d like to explore these ideas further, and see how they relate to some real world examples of digital preservation.

“Recipes” for Interpreting Archived Data

In particular, I’m interested in the “recipes” that tell you how to convert a sequence of bits into something more useful.

This is a fundamental requirement for any preservation archive – when retrieving an archived information item, you need the bits that encode that information, but you also need to know how to turn those bits into something else, something you can use.

The O.A.I.S. Information Model acknowledges this, by highlighting the need for representation information, but does it go far enough? Does the model really help us to understand the problems of reconstructing a useful artefact from an archived sequence of bits?

Automation and Virtualisation

Finally, it should be possible, at least to a limited extent, to automate the reconstruction of useful artefacts from sequences of bits. This connects us with the notion of virtualisation, which is not explored in O.A.I.S.

Can the O.A.I.S. Information Model help us to build more automation into the interpretation of archived data?

What does virtualisation mean, in the context of the O.A.I.S. Information Model?

Example: An XHTML Document

Take, for example, a simple Web page, written in English, and encoded as an XHTML 1.0 document using the UTF-8 character encoding.

Given the encoded sequence of bits, how do we reconstruct the Web page?

One possible recipe is as follows:

  1. Obtain a sequence of Unicode characters from the sequence of bits, using the UTF-8 encoding standard.

  2. Obtain an XML document (technically speaking, an XML infoset) from the sequence of Unicode characters, by parsing according to the XML 1.0 standard.

  3. Obtain an XHTML document from the XML infoset, by processing according to the XHTML 1.0 standard and the Document Object Model (level 2).

  4. Render the XHTML document to screen, print, voice or other media, following the rendering rules given in the XHTML 1.0 standard.
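The four steps above can be sketched in code. This is only a minimal illustration, assuming Python's standard library: the function names, the `sample` document and the plain-text "rendering" are my own inventions, an `ElementTree` tree is only an approximation of an XML infoset, and the check in step 3 is far weaker than real XHTML 1.0 and DOM processing. It also assumes the document uses no named entities from the XHTML DTDs (such as `&nbsp;`), which this parser would not resolve.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def decode_characters(bits: bytes) -> str:
    """Step 1: sequence of bits -> sequence of Unicode characters (UTF-8)."""
    return bits.decode("utf-8")


def parse_xml(characters: str) -> ET.Element:
    """Step 2: Unicode characters -> XML tree (approximating the infoset)."""
    return ET.fromstring(characters)


def as_xhtml(root: ET.Element) -> ET.Element:
    """Step 3: accept the tree as XHTML only if the root is an XHTML <html>."""
    if root.tag != f"{{{XHTML_NS}}}html":
        raise ValueError("root element is not an XHTML <html> element")
    return root


def render_text(document: ET.Element) -> str:
    """Step 4: a deliberately crude rendering: body text, whitespace-normalised."""
    body = document.find(f"{{{XHTML_NS}}}body")
    if body is None:
        return ""
    return " ".join("".join(body.itertext()).split())


def reconstruct(bits: bytes) -> str:
    """Apply the four steps of the recipe in order."""
    return render_text(as_xhtml(parse_xml(decode_characters(bits))))


# Hypothetical archived data object: a tiny XHTML page encoded as UTF-8.
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Hello</title></head>"
    "<body><p>Caf\u00e9 au lait</p></body>"
    "</html>"
).encode("utf-8")

print(reconstruct(sample))
```

Note that each function consumes the output of the previous one, so the order of the steps is part of the recipe and is not recoverable from the individual standards alone.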

This isn’t the only recipe we could follow, but it illustrates the point that, however we do it, a number of steps will typically be involved.

What does this mean for my analysis so far?

Blurring the Line Between Data and Information

The first question this example raises is, where does the data end and the information begin?

It’s easy to see that the original sequence of bits is a data object, but what about the sequence of Unicode characters? The XML infoset? The XHTML document? The Web page rendered on screen? At what point does the data become information?

Until now, we’ve stuck with the O.A.I.S. notion that data interpreted using representation information yields information. But can we say that a sequence of bits can be interpreted as a sequence of Unicode characters, or that a sequence of Unicode characters can be interpreted as an XML infoset, or that an XML infoset can be interpreted as an XHTML document?

If we can use “interpretation” in this way, then clearly we have to relax our distinction between data and information – the dividing line between data and information becomes blurred, and doesn’t help us to understand what we mean by interpretation.

Cooking with Representation Information

In the example above, we have a data object – the encoded Web page – and some items of representation information – the UTF-8, XML 1.0, XHTML 1.0 and DOM standards.

However, if an archive were to supply someone, say 20 years in the future, with the data object and these four items of representation information, this wouldn’t be very useful. The person might eventually figure out what order to apply the standards in and what to do with them, from reading and understanding the standards themselves. However, it would be much more useful if the person were also given some sort of a recipe along with the representation information, which tells them what to do with it all – at least in outline.

It would be even better if the person were also given some help in finding and composing some software components to implement the recipe.
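To make the idea of a recipe travelling alongside the representation information more concrete, here is one possible shape for such a record. It is a sketch under my own assumptions, not anything defined by O.A.I.S.: the class names `RepInfo`, `RecipeStep` and `Recipe`, and the `outline` method, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RepInfo:
    """One item of representation information, e.g. a standard."""
    name: str


@dataclass(frozen=True)
class RecipeStep:
    """One step: what it produces, and which items it relies on."""
    produces: str
    uses: Tuple[RepInfo, ...]


@dataclass(frozen=True)
class Recipe:
    """An ordered list of steps, starting from the archived data object."""
    starts_from: str
    steps: Tuple[RecipeStep, ...]

    def outline(self) -> List[str]:
        """Return a human-readable outline, one line per step."""
        lines = []
        current = self.starts_from
        for number, step in enumerate(self.steps, start=1):
            standards = " and ".join(item.name for item in step.uses)
            lines.append(
                f"{number}. {current} -> {step.produces} (using {standards})"
            )
            current = step.produces
        return lines


UTF8 = RepInfo("UTF-8")
XML = RepInfo("XML 1.0")
XHTML = RepInfo("XHTML 1.0")
DOM = RepInfo("DOM Level 2")

web_page_recipe = Recipe(
    starts_from="sequence of bits",
    steps=(
        RecipeStep("sequence of Unicode characters", (UTF8,)),
        RecipeStep("XML infoset", (XML,)),
        RecipeStep("XHTML document", (XHTML, DOM)),
        RecipeStep("rendered Web page", (XHTML,)),
    ),
)

for line in web_page_recipe.outline():
    print(line)
```

A record like this says nothing about how each step is implemented, but it does give the future reader the order of application, which the four standards on their own do not.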

Structure and Semantics

The O.A.I.S. Information Model makes a distinction between three types of representation information – structure, semantics, and other.

From section 4.2.1.3.1, the structure information of the representation information describes

“the format, or data structure concepts, which are to be applied to the bit sequences and that in turn result in more meaningful values such as characters, numbers, pixels, arrays, tables, etc.”

From the same section, the semantic information will include

“special meanings associated with all the elements of the Structural Information, operations that may be performed on each data type, and their interrelationships.”

Finally,

“Representation Information contains both Structure Information and Semantic Information, although in some implementations the distinction is subjective.” (My emphasis.)

In our example above, knowing that the Web page is written in English is clearly semantic information, and the UTF-8 encoding standard is clearly structure information; but the categorisation as either structure or semantics is not so clear for the others.

Conclusions?

So, to draw some tentative conclusions, based on a very limited analysis so far …

Given a sequence of bits, some representation information is clearly needed to turn those bits into something useful. But simply providing items of representation information, without any instructions about what to do with them, is not enough.

Indicating that some items of representation information contain structure information, and others contain semantic information, has limited usefulness. It may be useful, because the structure information is typically used before the semantic information. However, there may be several layers of structure and semantics, and we still won’t have all the instructions we need.

Providing a recipe, which specifies a work flow within which various items of representation information are applied, is much more useful. Such a work flow is useful without needing to make any distinction between data and information. Such a work flow is also useful without needing to know whether some item of representation information is about structure or semantics.
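The point that a work flow needs neither distinction can be shown with a very small sketch. The names `Transform`, `run_workflow` and `trace_workflow` are hypothetical, and the three steps in the example are arbitrary stand-ins: the runner simply passes each output to the next step, without ever asking whether a value is "data" or "information", or whether a step is about structure or semantics.

```python
from typing import Any, Callable, List, Sequence

# A step is just a function from one value to the next.
Transform = Callable[[Any], Any]


def run_workflow(steps: Sequence[Transform], value: Any) -> Any:
    """Apply each step in order and return the final result."""
    for step in steps:
        value = step(value)
    return value


def trace_workflow(steps: Sequence[Transform], value: Any) -> List[Any]:
    """Like run_workflow, but keep every intermediate value."""
    stages = [value]
    for step in steps:
        stages.append(step(stages[-1]))
    return stages


# Arbitrary example steps: bytes -> characters -> words -> word count.
steps = [
    lambda bits: bits.decode("utf-8"),
    lambda text: text.split(),
    len,
]

print(run_workflow(steps, b"to be or not to be"))
print([type(stage).__name__ for stage in trace_workflow(steps, b"to be")])
```

Every intermediate value in the trace is simultaneously the output of one interpretation and the input to the next, which is exactly where the data/information line blurs.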

This is not to say that the notions of data, information, representation information, structure information and semantic information aren’t useful. They are all useful when it comes to understanding the problem, and planning solutions at a high level.

However, when it comes to designing concrete systems at lower levels of detail, grounded in the practicalities of handling and working with archived data, these notions tend to break down. They don’t stand up to a formal analysis, and are probably not suitable for direct translation into formal specifications or software designs.

Moving Forward – Virtualisation Work Flows

My tentative conclusions suggest that, if we want to provide the kinds of “recipe” for reconstructing archived information I’ve hinted at above, then we need to go beyond the O.A.I.S. Information Model.

Clearly the notion of work flows is important, as is virtualisation. These I hope to explore in further notes.
