Versioning and the Web

by Alistair Miles

This post looks at some of the problems of identifying, describing and linking “versions” of “digital objects”, from the point of view of the Web, drawing especially on the Architecture of the Web published by W3C. These thoughts were stimulated by the recent kickoff meeting of the new Version Information Framework (VIF) project, at which versioning was discussed in the context of adding value to digital repositories. I hope this post provides some useful input to the VIF project team.

One of the difficulties of talking about information is that it is so intangible. A book on a shelf or a file on a computer hard drive is tangible enough, but when we come to talk about a book in the abstract (e.g. the complete works of Shakespeare) or a Web page, things become less tangible. One of the problems is that our normal vocabulary for talking about things at these intangible/virtual/abstract levels is not very well developed: for example, we use “book” to refer both to a physical book on a shelf and to the abstract notion of a book (which may have many copies on many shelves).

The Functional Requirements for Bibliographic Records (FRBR) specification goes a long way towards improving this situation, giving us vocabulary for four separate levels of abstraction — Work, Expression, Manifestation and Item. However, these concepts can be ambiguous and hard to apply in some situations. Nevertheless, they are an important reference point.

Web Architecture

On the Web, we face a similar situation, in that the vocabulary for talking about different levels of abstraction and the relationships between them is not well developed. However, the Web is by nature a virtual information space, and the Architecture of the Web provides some very interesting mechanisms for talking about information as a virtual commodity. The Web has also had to deal with the conceptual relationships between an information resource as a virtual entity (e.g. a Web page), and the many different media types (i.e. data formats, e.g. HTML, XML, PDF) and languages (e.g. English, Japanese etc.) which can be used to transmit or carry information.

From the Architecture of the Web, Volume One:

The World Wide Web uses relatively simple technologies with sufficient scalability, efficiency and utility that they have resulted in a remarkable information space of interrelated resources, growing across languages, cultures, and media. In an effort to preserve these properties of the information space as the technologies evolve, this architecture document discusses the core design components of the Web. They are identification of resources, representation of resource state, and the protocols that support the interaction between agents and resources in the space.

According to the Architecture of the Web, the Web is built from “resources”, and in particular, “information resources”. In practice, the notion of an “information resource” is hard to pin down, but an “information resource” is roughly defined as a resource whose essential characteristics can be conveyed in a message. Resources are identified by URIs, and you can use a URI to access a resource. The most common type of access involves requesting a “representation” of a resource, which is also known as “dereferencing a URI”. This is what your Web browser does when you type a URI in the address bar and click “Go”.

Content Negotiation

One of the interesting properties of the Web is the ability to provide several alternative representations in different media types from the same URI, which is supported in the HTTP protocol. This is also known as “content negotiation”. When a user agent (e.g. a Web browser) requests a representation of a resource, it specifies which media types it can “accept” in response. The server then sends a response in the media type best matching the request.

To take a very simple example, the URI “http://www.w3.org/Icons/w3c_main” identifies the main W3C logo icon. This image is available in two different media types — GIF and PNG. You can vary which representation you receive by changing the “Accept” header in the HTTP request.

So the URI “http://www.w3.org/Icons/w3c_main” clearly identifies something virtual, which is “above” (at a higher level of abstraction than) the notion of media type or format. Let’s call this a content negotiable resource.
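The server-side choice described above can be sketched in a few lines of Python. This is only a rough illustration, assuming a simplified reading of the “Accept” header (media ranges with optional q-values, no other parameters); the function names parse_accept, matches, specificity and negotiate are made up for this example and are not taken from any library or from the HTTP specification.

```python
def parse_accept(header):
    """Parse an Accept header into a list of (media_range, q) pairs."""
    ranges = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        pieces = [p.strip() for p in part.split(";")]
        media_range = pieces[0].lower()
        q = 1.0  # a range without an explicit q-value counts as q=1
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        ranges.append((media_range, q))
    return ranges


def matches(media_range, media_type):
    """True if a range such as image/* or */* covers the concrete type."""
    if media_range == "*/*":
        return True
    range_main, _, range_sub = media_range.partition("/")
    type_main, _, type_sub = media_type.partition("/")
    return range_main == type_main and range_sub in ("*", type_sub)


def specificity(media_range):
    """Rank ranges: */* is least specific, type/subtype is most specific."""
    if media_range == "*/*":
        return 0
    if media_range.endswith("/*"):
        return 1
    return 2


def negotiate(accept_header, available):
    """Return the available media type the client prefers, or None."""
    ranges = parse_accept(accept_header)
    best_type, best_q = None, 0.0
    for media_type in available:
        candidates = [(specificity(r), q) for r, q in ranges
                      if matches(r, media_type)]
        if not candidates:
            continue
        # The most specific matching range decides the q-value for this type.
        _, q = max(candidates)
        if q > best_q:
            best_type, best_q = media_type, q
    return best_type


# The two media types mentioned for the W3C logo.
LOGO_TYPES = ["image/gif", "image/png"]
print(negotiate("image/png", LOGO_TYPES))
print(negotiate("image/gif;q=0.9, image/png;q=0.5", LOGO_TYPES))
```

When two types tie on q-value, this sketch simply keeps whichever comes first in the list of available types; a real server is free to break ties differently.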

Media type is not the only axis along which representations of a resource may vary. A multilingual Web site can provide representations in different languages from the same URI. For example, the Debian Web site “http://www.debian.org” provides alternative representations in English, French, Spanish and many other languages — i.e. this is also a content negotiable resource.

So it’s clear that an information resource might be a content negotiable resource — something for which representations might be provided in one or more media types and one or more languages.

However, many information resources in the Web only provide a single representation, in some media type and language. For example, “http://www.w3.org/Icons/w3c_main.png” only provides a PNG representation of the main W3C logo. Let’s call this a content invariant resource.

The interesting thing for our discussion of “versioning” is that we might express a relationship between a content negotiable resource A and a content invariant resource B, where the single representation provided by B is equivalent to one of the representations provided by A.

If we had some appropriate RDF vocabulary, we might even state this formally, e.g.


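# links between resources
# (<A> is a content negotiable resource; <B1> and <B2> are content invariant resources; all three are placeholder names)
<A> rdf:type vif:ContentNegotiable.
<B1> rdf:type vif:ContentInvariant.
<B2> rdf:type vif:ContentInvariant.
<A> vif:contentVariant <B1>.
<A> vif:contentVariant <B2>.
<B1> vif:contentAlternate <B2>.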

# definition of the vocabulary we used above
vif:ContentNegotiable rdfs:subClassOf vif:InformationResource.
vif:ContentInvariant rdfs:subClassOf vif:InformationResource.
vif:contentVariant rdfs:domain vif:ContentNegotiable; rdfs:range vif:ContentInvariant.
vif:contentAlternate rdfs:domain vif:ContentInvariant; rdfs:range vif:ContentInvariant; rdf:type owl:SymmetricProperty.

If this metadata were made available to an application (user agent), the application could then make a user aware that various alternative representations of some Web resource were available, for example.
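As a rough sketch of what such an application might do, the Python below holds a few statements as plain (subject, predicate, object) tuples and looks up the alternatives for a resource. The resource names A, B1 and B2 are placeholders, the vif: terms are the hypothetical vocabulary from this post, and the helper names variants and alternates are invented for the example; a real application would more likely use an RDF library.

```python
# Placeholder statements: A is content negotiable, B1 and B2 are content
# invariant. None of these names refer to real resources.
TRIPLES = [
    ("A", "vif:contentVariant", "B1"),
    ("A", "vif:contentVariant", "B2"),
    ("B1", "vif:contentAlternate", "B2"),
]


def variants(resource, triples):
    """Content invariant resources listed as variants of a negotiable one."""
    return sorted(o for s, p, o in triples
                  if s == resource and p == "vif:contentVariant")


def alternates(resource, triples):
    """Resources linked by vif:contentAlternate, read in both directions
    because the property is declared symmetric."""
    found = set()
    for s, p, o in triples:
        if p != "vif:contentAlternate":
            continue
        if s == resource:
            found.add(o)
        elif o == resource:
            found.add(s)
    return sorted(found)


print(variants("A", TRIPLES))     # variants offered by the negotiable resource
print(alternates("B2", TRIPLES))  # found even though the triple starts at B1
```

The second lookup works only because the code honours the symmetry declared for vif:contentAlternate; without that declaration, a statement made about B1 would say nothing about B2.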

Changes Over Time

The other interesting property of the Web architecture is that things can change over time. For example, if you go to “http://www.bbc.co.uk/news” today, you’ll see something different from what was there yesterday. Yet, the URI “http://www.bbc.co.uk/news” identifies an information resource, so this resource must be changing over time. Let’s call this a changeable resource.

On the other hand, some Web resources have never changed, and are promised never to change. For example, “http://www.w3.org/TR/2004/REC-webarch-20041215/” identifies a time-specific “edition” or “version” of the Architecture of the World Wide Web technical report. Let’s call this an unchanging resource. However, note that W3C also provides the URI “http://www.w3.org/TR/webarch/”, which always corresponds to the latest version in the report series.

The interesting thing for our discussion of “versioning” is that we might express a relationship between a changeable resource A and an unchanging resource B, where the representation provided by B is equivalent to the representation provided by A at some specific point in time. If we had some appropriate RDF vocabulary, we might state this formally, e.g.


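# links between resources
# (<A> is a changeable resource; <B1> and <B2> are unchanging resources; all three are placeholder names)
# (vif:priorState is read here as pointing from a snapshot to the one before it)
<A> rdf:type vif:Changeable.
<B1> rdf:type vif:Unchanging.
<B2> rdf:type vif:Unchanging.
<A> vif:snapshot <B1>.
<A> vif:snapshot <B2>.
<B2> vif:priorState <B1>.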

# definition of vocabulary used above
vif:Changeable rdfs:subClassOf vif:InformationResource.
vif:Unchanging rdfs:subClassOf vif:InformationResource.
vif:snapshot rdfs:domain vif:Changeable; rdfs:range vif:Unchanging.
vif:priorState rdfs:domain vif:Unchanging; rdfs:range vif:Unchanging.

If this metadata were made available to a user agent, the application could then make the user aware that a more recent representation of some Web resource was available, or that a history of changes to that resource was available, for example.
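A minimal Python sketch of that behaviour follows, again using plain tuples rather than an RDF library. The names A, B1 and B2 are placeholders, and the code assumes that vif:priorState points from a snapshot to the one immediately before it; the helper names latest_snapshot, history and is_out_of_date are invented for the example.

```python
# Placeholder statements: A is changeable, B1 and B2 are unchanging snapshots,
# and B1 is assumed to be the state that preceded B2.
TRIPLES = [
    ("A", "vif:snapshot", "B1"),
    ("A", "vif:snapshot", "B2"),
    ("B2", "vif:priorState", "B1"),
]


def latest_snapshot(resource, triples):
    """Return the snapshot that is no other snapshot's prior state, or None
    if the statements do not single one out."""
    snapshots = {o for s, p, o in triples
                 if s == resource and p == "vif:snapshot"}
    superseded = {o for s, p, o in triples
                  if p == "vif:priorState" and s in snapshots}
    candidates = snapshots - superseded
    return candidates.pop() if len(candidates) == 1 else None


def history(snapshot, triples):
    """Follow vif:priorState links from a snapshot back to the earliest one."""
    prior = {s: o for s, p, o in triples if p == "vif:priorState"}
    chain = [snapshot]
    seen = {snapshot}
    # The 'seen' set guards against looping forever on cyclic data.
    while chain[-1] in prior and prior[chain[-1]] not in seen:
        chain.append(prior[chain[-1]])
        seen.add(chain[-1])
    return chain


def is_out_of_date(snapshot, resource, triples):
    """True if a different, more recent snapshot of the resource is known."""
    latest = latest_snapshot(resource, triples)
    return latest is not None and latest != snapshot


print(latest_snapshot("A", TRIPLES))
print(history("B2", TRIPLES))
print(is_out_of_date("B1", "A", TRIPLES))
```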

Design Patterns

So on the Web, there are content-negotiable resources and content-invariant resources; and there are changeable resources and unchanging resources. The Web Architecture itself is entirely indifferent as to which is which. However, we might make use of these different classes to provide some useful functionality to users of digital repositories, accessible through the Web.

Given that any information resource in the Web could be either changeable or unchanging, content-negotiable or content-invariant, there are four different possibilities to consider for every resource. This makes for a fairly complicated set of possible interrelationships. However, we might define some design patterns, which could be useful in particular situations — for example, in exposing information from digital repositories as part of the Web.

We might, for example, describe a three-level pattern.

At the top level is a changeable, content-negotiable information resource.

At the second level is a set of unchanging, content-negotiable information resources. These are less abstract, corresponding to a specific snapshot or “revision” of a resource.

At the bottom level is a set of unchanging, content-invariant information resources. These are the most concrete entities, corresponding to a specific content variant of a specific snapshot of a resource.

The diagram below is an attempt to illustrate this pattern.

[Figure: Version pattern]

The interesting thing about this pattern is that we might say very concretely how it should be implemented in the Web, i.e. how Web servers could be set up to help user agents better understand the differences and interrelationships between the various information resources. For example, the URI denoting the top-level information resource could be set up to redirect to the URI of the most recent snapshot at the second level. The URIs at the second level could then be set up to content-negotiate directly as per the HTTP protocol, while using the “Content-Location” HTTP header to give the user agent a clue as to how to link to a specific content variant.
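To make this server behaviour a little more tangible, here is a small Python sketch of the three levels as a lookup function. Everything in it is an assumption made for illustration: the paths and the date are made up, the choice of a 302 redirect is just one plausible option, and the “Accept” handling is deliberately naive (exact media types and “*/*” only, q-values ignored).

```python
# Hypothetical repository layout; every path and date below is made up.
LATEST = {"/report": "/report/2007-01-01"}   # top level -> latest snapshot
VARIANTS = {                                  # snapshot -> content variants
    "/report/2007-01-01": {
        "text/html": "/report/2007-01-01.html",
        "application/pdf": "/report/2007-01-01.pdf",
    },
}


def respond(path, accept="*/*"):
    """Return (status, headers) for a GET request on one of the three levels."""
    if path in LATEST:
        # Top level (changeable): redirect to the most recent snapshot.
        return 302, {"Location": LATEST[path]}
    if path in VARIANTS:
        # Second level (unchanging, content-negotiable): pick a variant and
        # name it in Content-Location so the user agent can link to it.
        variants = VARIANTS[path]
        for media_type in (t.split(";")[0].strip() for t in accept.split(",")):
            if media_type == "*/*":
                media_type = next(iter(variants))  # first listed variant
            if media_type in variants:
                return 200, {
                    "Content-Type": media_type,
                    "Content-Location": variants[media_type],
                    "Vary": "Accept",
                }
        return 406, {}  # nothing the client accepts
    for variants in VARIANTS.values():
        for media_type, location in variants.items():
            if path == location:
                # Bottom level (unchanging, content-invariant).
                return 200, {"Content-Type": media_type}
    return 404, {}


print(respond("/report"))
print(respond("/report/2007-01-01", "application/pdf"))
print(respond("/report/2007-01-01.html"))
```

A user agent following this chain ends up knowing three URIs for what it is looking at: the changeable one it started from, the snapshot it was redirected to, and the specific content variant named in “Content-Location”.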

We could also cook up some RDF vocabulary to formally express all of these interrelationships, as shown in the examples above, then publish this in the Web or embed it in Web pages. This would allow user agents to be fully aware of the structure of the information space, and do even more intelligent things like make users aware that they are viewing an out-of-date version and that a more recent version is available, or that content variants are available in a range of languages.

Of course, this is only one of a number of possible design patterns, which need further exploration. But hopefully, this at least gives a few ideas as to how the architecture of the Web — a virtual information space — might shed light on a discussion of identifying, describing and linking “versions” of “digital objects”.
