The OAIS Information Model Revisited, Part 3: Towards Models for Interpretation/Virtualisation Recipes

by Alistair Miles

In this note, I begin to explore the use of the Eriksson-Penker UML extensions for business process modeling, as a tool for modeling the processes or work flows required to successfully interpret or virtualise a digital object.

Previously, in part 1 of this series, I explored the abstract notions of data, information, representation information and interpretation, as defined by the OAIS Information Model. In part 2, I tried to apply these notions to a simple example of a Web page. I found that we need to go beyond the OAIS Information Model if we want to capture and represent the “recipes” that take you from a sequence of bits to something more useful, in the general case where there may be multiple steps or stages required to process, virtualise or render a digital object.

Recipes and Dependencies

Take again the example from part 2 of a simple Web page, encoded as an XHTML 1.0 Transitional document using the UTF-8 character encoding, and stored as a single sequence of bits.

I’m interested in modeling the “recipe” that tells me how to turn the encoded sequence of bits back into a Web page, because this recipe will define the “dependencies” for the preserved object. By “dependency” I mean those items of information and/or software that are required to execute the recipe — the ingredients and utensils, to use the cooking analogy. Note that by “execution” I do not necessarily mean execution by a computer — steps in a recipe might well be entirely manual.

If I knew what these dependencies were, I could then compare them with the knowledge and software currently held by the designated community (DC), and decide which of the dependencies also need to be preserved.

I could also design a system which computes any “gaps” that arise between the knowledge and software held by the designated community and those required for execution of the recipe. This is one of the goals of the CASPAR project.
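As a rough illustration of such a gap computation, here is a minimal sketch that treats dependencies and community holdings as plain sets of labels. The function name and the labels are hypothetical, chosen only for this example; they are not taken from OAIS or from CASPAR.

```python
# Hypothetical sketch: dependencies and holdings are modelled as sets of labels.
def compute_gaps(required: set, held_by_dc: set) -> set:
    """Return the dependencies that the designated community does not hold."""
    return required - held_by_dc


# Illustrative labels only.
recipe_dependencies = {"UTF-8 decoder", "XHTML 1.0 Transitional renderer"}
dc_holdings = {"UTF-8 decoder", "PDF viewer"}

gaps = compute_gaps(recipe_dependencies, dc_holdings)
print(sorted(gaps))
```

Anything left in `gaps` is a candidate for preservation alongside the object itself.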

Modeling the Recipe as a Process

Take the first step in decoding and rendering the Web page — turning a sequence of bits into a sequence of UNICODE characters.

The diagram below models this step in a general way, using the Eriksson-Penker UML extensions for business process modeling. The model defines an atomic process “Decode UTF-8”, which takes as input a “Bit Sequence” and generates as output a “UNICODE Character Sequence”.

Decode UTF-8 Process
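When a decoder is available, this atomic process is a single call in most programming languages. The sketch below uses Python's built-in `bytes.decode`; the sample bytes and the function name `decode_utf8` are made up for illustration.

```python
# A hypothetical stored bit sequence, written here as five bytes.
bit_sequence = bytes([0x63, 0x61, 0x66, 0xC3, 0xA9])


def decode_utf8(bits: bytes) -> str:
    """Atomic process: a bit sequence in, a sequence of characters out."""
    # bytes.decode raises UnicodeDecodeError if the input is not valid UTF-8.
    return bits.decode("utf-8")


characters = decode_utf8(bit_sequence)
print(len(bit_sequence), "bytes ->", len(characters), "characters")
```

The last two bytes together encode one character, so five bytes yield four characters.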

Ideally, this step would be automated, i.e. a piece of software would be used to execute the atomic process. If I understand Eriksson-Penker correctly, we could model this piece of software as a resource, acting as a supply object in the process.

Decode UTF-8 Process (2)

From this model, we can deduce that execution of the recipe depends on a piece of software which is capable of decoding UTF-8. As long as the designated community has a piece of software which can fulfill this contract, then the recipe can be executed, and nothing else need be preserved.
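One way to make this deduction mechanical is to record the process, its input and output types, and its supply objects as data. The following is only a sketch: the class `Process` and its fields are my own invention, not part of the Eriksson-Penker extensions or any UML tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    """A hypothetical record of an atomic process and its supply objects."""
    name: str
    input_type: str
    output_type: str
    supply_objects: frozenset = frozenset()

    def can_execute(self, available: set) -> bool:
        # Executable when every supply object is among the available resources.
        return self.supply_objects <= available


decode = Process(
    name="Decode UTF-8",
    input_type="Bit Sequence",
    output_type="UNICODE Character Sequence",
    supply_objects=frozenset({"UTF-8 decoder"}),
)

print(decode.can_execute({"UTF-8 decoder", "Web browser"}))
print(decode.can_execute({"Web browser"}))
```

The first check succeeds because the community holds a decoder; the second fails and signals that something further must be preserved.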

However, what if at some point in the future, the designated community no longer has such a piece of software? The process to decode UTF-8 would have to be done manually, or a new piece of software would have to be written, to execute the recipe. In either case, some information would be required — the UTF-8 encoding standard. We could model this standard as information, acting as a supply object in the process.

Decode UTF-8 Process (3)
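To give a feel for what "writing a new piece of software from the standard" involves, here is a deliberately simplified decoder written directly from the UTF-8 bit layout. It is a sketch, not a conforming implementation: it does not reject overlong encodings or surrogate code points, both of which the standard forbids. The function name is hypothetical.

```python
def decode_utf8_by_hand(data: bytes) -> str:
    """Decode UTF-8 by following the byte patterns of the encoding."""
    chars = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:               # 0xxxxxxx: one byte
            code_point, extra = lead, 0
        elif lead >> 5 == 0b110:      # 110xxxxx: two bytes
            code_point, extra = lead & 0x1F, 1
        elif lead >> 4 == 0b1110:     # 1110xxxx: three bytes
            code_point, extra = lead & 0x0F, 2
        elif lead >> 3 == 0b11110:    # 11110xxx: four bytes
            code_point, extra = lead & 0x07, 3
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + extra >= len(data):
            raise ValueError(f"truncated sequence at offset {i}")
        for cont in data[i + 1 : i + 1 + extra]:
            if cont >> 6 != 0b10:     # continuation bytes are 10xxxxxx
                raise ValueError(f"invalid continuation byte after offset {i}")
            code_point = (code_point << 6) | (cont & 0x3F)
        chars.append(chr(code_point))
        i += 1 + extra
    return "".join(chars)


sample = "caf\u00e9 \u20ac \U0001F600".encode("utf-8")
print(decode_utf8_by_hand(sample) == sample.decode("utf-8"))
```

Even this short routine shows that the standard alone is a workable, if more laborious, substitute for a ready-made decoder.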

Complications — The Roles of Information and Software

A complication arises when we want to say, “if you have a UTF-8 decoder, use that; otherwise you’ll need the UTF-8 encoding standard.” In other words, the recipe so far depends on either a UTF-8 decoder or the UTF-8 encoding standard, but not on both at the same time: either one will do.

If we don’t capture this either/or clause, we end up in difficulty. To see why, consider the case where the designated community is in possession of a software component that can decode UTF-8, but does not have knowledge of the UTF-8 encoding standard itself. In this case, we would wrongly conclude that there was a “gap” in our preservation system, because the UTF-8 encoding standard dependency is not met by the designated community’s knowledge base, when in fact the standard isn’t needed.
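The difference between the two readings can be shown with a small sketch. Here a step's dependency is modelled as a list of alternative resource sets, any one of which suffices. The representation and the names are assumptions made for this example only.

```python
def is_satisfied(alternatives: list, holdings: set) -> bool:
    """True if at least one alternative is fully covered by the holdings."""
    return any(option <= holdings for option in alternatives)


# Either a decoder or the encoding standard will do.
decode_step = [{"UTF-8 decoder"}, {"UTF-8 encoding standard"}]
dc_holdings = {"UTF-8 decoder"}

# Naive reading: treat every listed item as required at once.
naive_gap = set().union(*decode_step) - dc_holdings

# Either/or reading: the step is fine if any one alternative is met.
real_gap_exists = not is_satisfied(decode_step, dc_holdings)

print(naive_gap)
print(real_gap_exists)
```

The naive reading reports the encoding standard as missing, while the either/or reading correctly reports no gap.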

Also, note that there are two possible responses to the situation where the designated community no longer has software to decode UTF-8. We could preserve some software, which can be compiled and/or executed on the DC’s current software platform; or we could preserve the UTF-8 encoding standard.

If we instead model a separate process to construct a UTF-8 decoder, then we can at least compute the priority of the dependencies. We can detect when a vital resource in the network of dependencies is not available, and then decide between supplying the information required to construct the missing component and archiving the component itself.

Decode UTF-8 Process (4)
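A sketch of how such a network of dependencies might be walked is given below. It assumes a hypothetical table, `PRODUCED_BY`, recording which resources can be constructed by another process and what that process needs, and it assumes the network contains no cycles.

```python
# Hypothetical table: resource -> inputs of the process that can construct it.
PRODUCED_BY = {
    "UTF-8 decoder": ["UTF-8 encoding standard"],
}


def missing_for(resource: str, holdings: set, produced_by: dict) -> set:
    """Return the base resources still missing in order to obtain `resource`."""
    if resource in holdings:
        return set()
    if resource in produced_by:
        # Not held, but constructible: recurse into the constructing process.
        missing = set()
        for needed in produced_by[resource]:
            missing |= missing_for(needed, holdings, produced_by)
        return missing
    # Neither held nor constructible: this is a real gap.
    return {resource}


print(missing_for("UTF-8 decoder", {"UTF-8 decoder"}, PRODUCED_BY))
print(missing_for("UTF-8 decoder", {"UTF-8 encoding standard"}, PRODUCED_BY))
print(missing_for("UTF-8 decoder", set(), PRODUCED_BY))
```

Only the third case reports a gap, and it names the information that would need to be supplied. Archiving the decoder itself remains the alternative response.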

It’s worth noting that there are at least three types of “dependency” in the discussion so far. There is the dependency of a process on software resources required to execute the process. There is the dependency of a process on information resources which may be required to execute the process. And there is the dependency of one software component or system on another.

Options and Responsibilities — Extending Recipes into the Archive

Finally, suppose the UNICODE standard falls out of use and is replaced by some XYZ character encoding. If the DC has an XYZ decoder, another option may become available: convert UTF-8 to XYZ. We can model this choice. Note, however, that what we are modeling is then no longer strictly a recipe we expect to pass entirely to the recipient. Rather, the archive could take responsibility for the conversion step, which is a migration activity. This model might therefore be a bit confusing, but it is interesting to consider.

Decode UTF-8 Process (5)
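To make the split of responsibilities concrete, the sketch below records for each option which party must hold which resource, then lists the options that are currently feasible. XYZ is the fictional encoding from the text, and all option names, party names and resource labels are hypothetical.

```python
# Each option lists (party, resource) pairs that must all be satisfied.
options = {
    "Decode UTF-8 directly": [
        ("recipient", "UTF-8 decoder"),
    ],
    "Migrate UTF-8 to XYZ, then decode XYZ": [
        ("archive", "UTF-8 to XYZ converter"),
        ("recipient", "XYZ decoder"),
    ],
}

# Hypothetical holdings of each party at some future date.
holdings = {
    "archive": {"UTF-8 to XYZ converter"},
    "recipient": {"XYZ decoder"},
}


def feasible_options(options: dict, holdings: dict) -> list:
    """Return the names of options whose every requirement is currently met."""
    return [
        name
        for name, needs in options.items()
        if all(res in holdings.get(party, set()) for party, res in needs)
    ]


print(feasible_options(options, holdings))
```

Tagging each requirement with a party is one way to keep the archive's migration step distinct from the recipe handed to the recipient.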