Recently I’ve been involved in designing a software system (dsn-chassis) to support data-sharing for the World-Wide Antimalarial Resistance Network (WWARN). We’re also involved in developing and maintaining software for other data-sharing networks, such as MalariaGEN. Thus we have a vested interest in figuring out what requirements these data-sharing networks have in common, so we can identify common standards and software components that can be adapted and re-used. Naturally, we want to minimise the overall effort of building and maintaining supporting infrastructure for each new data-sharing community.
Applications and Services
One shift in perspective that I’ve encouraged is a move from viewing the software required to support a data-sharing network as a single, monolithic web application, to viewing it as a collection of web services and applications that are loosely coupled via open protocols and formats. The hope is that this view will make it easier to identify the most generic components of the required infrastructure, such as metadata persistence and query, or authentication and authorisation. We could then adopt service protocols and formats that give us the broadest possible choice for re-using and building on existing open-source software, the least dependency on or lock-in to any one vendor or product, and the flexibility to extend and customise to cope with those requirements that are unique to a given data-sharing network.
That’s the ideal, anyway.
Metadata Persistence and Query
One capability that is common to the data-sharing networks we’re involved in is management of metadata relating to scientific data being shared within the network. This metadata includes, among other things, information about the study from which the data originates, such as the scientific protocols (i.e., procedures) that were used in generating the data. This metadata can be complicated, and varies substantially between different types of experiment. It is, however, needed to evaluate the quality and comparability of data from different studies, which is a prerequisite for aggregating those data in a sensible way. To get a sense for the type of metadata that needs to be captured, check out an early prototype of the study questionnaire being developed for WWARN.
Although the exact nature of this metadata will differ from one data-sharing network to the next, the basic capabilities to persist (CRUD) and query those metadata are common. Hence, one of the first things we’ve done is to take a look at various APIs and protocols for persisting and querying arbitrary packages of metadata, and their available implementations.
Atom Publishing Protocol
One protocol that stood out was the Atom Publishing Protocol and the related Atom Format, for a number of reasons. First, it follows the REST architectural style, which implies a number of helpful constraints. Second, there are at least a few reasonably mature open-source implementations (e.g., eXist, AtomServer) that are geared not just towards blog publishing but towards arbitrary data and metadata management. Third, big players like Google and Microsoft are standardising on Atom as a basis for their Web service APIs – not that I like to follow the flock, but reading the Google Data APIs documentation especially gave me some confidence that many of the difficult practical issues have already been encountered and solutions found.
One of the most compelling reasons I’ve found to use the REST style in designing a Web service API, and to follow the approach of mapping the four basic persistence-related operations (Create, Retrieve, Update, Delete) on to the appropriate HTTP verbs (POST, GET, PUT, DELETE), is simplicity and ease of implementation. A second reason is the ability to decouple the persistence protocol from the data model, at least to some extent. In my experience, these two factors in particular contribute to being able to rapidly develop prototypes, and to carry as little baggage as possible forward as those prototypes (and the underlying data models) inevitably evolve towards a production system.
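Concretely, the mapping works out something like the following sketch, where the collection and member URLs are illustrative, not our actual paths:

```http
POST   /atom/studies      # Create: add a new entry to the Studies collection
GET    /atom/studies/7    # Retrieve: fetch a representation of study 7
PUT    /atom/studies/7    # Update: replace study 7 with a modified representation
DELETE /atom/studies/7    # Delete: remove study 7
```

Because the verbs and status codes are fixed by HTTP, the only thing the client and server really need to agree on beyond AtomPub itself is the shape of the entry content.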
We’ve also been working with the eXist open-source XML database, which implements AtomPub out of the box, as our metadata persistence service implementation. One of the nice things about eXist is that it will store whatever XML you throw at it, so you can fiddle with your Atom content extensions to your heart’s content without ever needing to touch the server-side code or configuration. Another nice feature is that it supports Web service endpoints implemented as XQuery scripts, which is a very convenient way to add a wide range of query service capabilities to the data and metadata you’ve stored via the Atom service.
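For instance, a simple query endpoint over the stored entries might look something like this in eXist XQuery (the collection path, request parameter and matching logic are all illustrative, not our actual service):

```xquery
xquery version "1.0";
declare namespace atom = "http://www.w3.org/2005/Atom";

(: Return all study entries whose title contains the given search term.
   In eXist, a .xql file like this can be exposed directly as an HTTP endpoint. :)
let $q := request:get-parameter("q", "")
for $entry in collection("/db/atom/studies")//atom:entry
where contains($entry/atom:title, $q)
return $entry
```

The appeal is that the query logic lives entirely in the database; no server-side Java needs to change as the content extensions evolve.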
Our current, early prototype of a data management system for WWARN uses eXist as the Atom service implementation, and a GWT application that we’ve rolled ourselves as the user interface.
While this approach has proved excellent during the development of early prototypes, there are a number of challenges in moving the system towards production strength that are causing me some consternation. My main reason for writing this article is to highlight those issues, in the hope that others can provide some helpful ideas and advice.
I (somewhat pithily) entitled this article “REST-not-so-easy” because, while some of these issues are specific to AtomPub, others seem common to any Web service API based on the REST style. It could very well be that I’m missing a few rather obvious and simple solutions; I rather hope that’s the case.
So, on to the issues…
Retrieving Graphs of Linked Entries
When building our user interface, we commonly found that a single view required us to retrieve data not just from a single Atom entry, but rather from a graph of linked Atom entries. For example, in our data model, Dataset entries are linked to Study entries, which represent the study from which the dataset originates. When the user is viewing a dataset, they also want to view some basic information about the linked study, such as its title and summary, in addition to a hyperlink that allows them to navigate to further information about the study.
The client application can, of course, perform two HTTP GET operations to retrieve this simple graph of two linked Atom entries, which is easy enough. However, we have other examples where the graph to be retrieved has a depth of 2 or more. In these cases, implementing the client application becomes far simpler if the whole graph can be retrieved with a single HTTP GET operation, rather than many. A single GET is also better for latency, which is an important concern for us where users may be located in parts of the world with poor network bandwidth.
A workaround for this that we cooked up early on was to develop query services that return a single, root Atom entry as the result, with the required outbound and inbound links expanded inline, i.e., with each linked Atom entry embedded within the corresponding atom:link element. I was initially unsure of the sanity of this approach, especially given that including an atom:entry directly within an atom:link element does seem to break the Atom Format spec. However, I found some discussion from folks at Microsoft and Google on doing inline link expansion, which gave me some confidence that this idea isn’t so crazy. I also found a suggestion that returning the graph of linked Atom entries as a feed, rather than as a single Atom entry with links expanded inline, is a more elegant way to go, which I have some sympathy for. However, we’ve gone with the inline expansion approach for now, and it is working well so far. For example implementations in XQuery, see the chassis dataset query service and the supporting function library.
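To make this concrete, here is a rough sketch of an expanded entry as a query service might return it (the IDs, URLs and rel value are invented for illustration, not our actual vocabulary):

```xml
<entry xmlns="http://www.w3.org/2005/Atom">
  <id>tag:example.org,2009:datasets/42</id>
  <title>Dataset 42</title>
  <link rel="http://example.org/rel/study" href="/atom/edit/studies/7">
    <!-- the linked entry, expanded inline by the query service;
         note that nesting an entry inside atom:link is not
         sanctioned by the Atom Format spec -->
    <entry>
      <id>tag:example.org,2009:studies/7</id>
      <title>Example antimalarial efficacy study</title>
      <summary>Basic study information needed by the dataset view.</summary>
    </entry>
  </link>
</entry>
```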
There is an obvious gotcha here, which is that if you retrieve an Atom entry with links expanded inline, then PUT that entry back to the edit link URL, you will end up storing the linked entries inline too, which leads to all sorts of interesting errors. So, any view that needs to do a PUT must first retrieve a fresh, unexpanded representation of the entry, before it can be modified and PUT back to the server.
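One way to guard against this in the client, as an alternative to always re-fetching, is to strip any inline entries before the PUT. Here is a minimal sketch in Java using only the JDK’s DOM APIs; the class name and sample XML are invented for illustration:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class EntryCleaner {

    static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    // Remove any atom:entry elements nested inside atom:link elements,
    // so the entry is safe to PUT back to its edit link URL.
    static String stripInlineEntries(String entryXml) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);
        Document doc = f.newDocumentBuilder()
                .parse(new InputSource(new StringReader(entryXml)));

        NodeList links = doc.getElementsByTagNameNS(ATOM_NS, "link");
        for (int i = 0; i < links.getLength(); i++) {
            Element link = (Element) links.item(i);
            NodeList children = link.getChildNodes();
            // iterate backwards because removal shifts the live node list
            for (int j = children.getLength() - 1; j >= 0; j--) {
                Node child = children.item(j);
                if (child.getNodeType() == Node.ELEMENT_NODE
                        && ATOM_NS.equals(child.getNamespaceURI())
                        && "entry".equals(child.getLocalName())) {
                    link.removeChild(child);
                }
            }
        }

        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String expanded =
            "<entry xmlns='http://www.w3.org/2005/Atom'><title>Dataset A</title>"
          + "<link rel='http://example.org/rel/study' href='/atom/edit/studies/7'>"
          + "<entry><title>Study 7</title></entry></link></entry>";
        System.out.println(stripInlineEntries(expanded));
    }
}
```

This only protects against one specific failure mode, of course; re-fetching a fresh, unexpanded representation before modifying it remains the simpler and safer habit.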
I’ve also glanced at an article on AtomServer which describes the idea of aggregate feeds created by joining separate collections. It looks like this might provide a similar capability, but in a quite different way. However, if I have understood the idea correctly, it looks like you could only ever fetch a graph of entries one level deep.
Referential Integrity; Broken Links
Our simple data model currently comprises entities such as Dataset, Study and Review. There are, of course, associations between these entities, such as an association between a Dataset and the Study from which it originated, as mentioned above. When we map our logical data model onto Atom, each type of entity maps onto a distinct collection of Atom entries. Each type of entity also gets its own Atom content extension, which means that we stick the entity data into some XML nested in the atom:content element of the Atom entry. Associations between entities are mapped onto links between Atom entries. I.e., we use the atom:link element to represent associations between entities, where the value of the rel attribute represents the type of the association.
To create a new association, a client retrieves a representation of the entry that is the source (i.e., subject) of the association, adds a new atom:link element with the appropriate rel attribute (describing the association type) and href attribute (pointing to the target/object of the association), and then PUTs the entry representation back to the entry’s edit link URL.
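As a sketch, the PUT that creates a Dataset-to-Study association might carry something like the following (the rel URI, paths and content type are made up for illustration):

```xml
<!-- PUT /atom/edit/datasets/42 -->
<entry xmlns="http://www.w3.org/2005/Atom">
  <id>tag:example.org,2009:datasets/42</id>
  <title>Dataset 42</title>
  <!-- the new association: this dataset originates from study 7 -->
  <link rel="http://example.org/rel/study" href="/atom/edit/studies/7"/>
  <content type="application/xml">
    <!-- entity data in our content extension goes here -->
  </content>
</entry>
```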
By default, eXist’s Atom implementation does not do any referential integrity checks on the links in an Atom entry. This is perfectly sensible, because no integrity constraints have been declared, and because the link could equally point to an Atom entry anywhere in the Web, not just other Atom entries located on the same service.
However, if an Atom entry that is the target of one or more links is deleted, then all of those links will be broken. How should the service deal with this, if at all? Is it OK to do as the Web does, and leave it up to the client (and the user) to deal with broken links? Or should the service be a bit smarter, e.g. by noisily preventing deletion of entries that are link targets, or by silently deleting links to deleted targets, or by some other mechanism? What about new links that are created to non-existent targets? Should those be prevented somehow? And should the service differentiate between links whose targets are entries hosted by the same Atom service and links whose targets are elsewhere in the Web? If so, how? I.e., are there different kinds of referential integrity that need to be considered?
Our current, very tentative, position is to do as the Web does, and leave it to the client to handle broken or bad links. The client will typically do this by simply notifying the user of a broken link. The client will provide the facility to allow the user to fix broken links, but what the user does next is up to them. However, we haven’t implemented any of this functionality in the client application yet, so I don’t know what the consequences will be.
Transactions
Sometimes, what is a single operation from the user workflow point of view, such as creating a study, or creating a dataset, or updating a study with new information, maps onto a single Atom Protocol request, such as an HTTP POST or PUT. However, often a single user operation maps onto multiple Atom Protocol requests. For example, creating a new revision of a data file involves four HTTP requests, which are a POST (create a new media entry that represents the data file revision), a PUT (add some metadata about the media entry, e.g., the original file name), a GET (retrieve a fresh representation of the data file entry), and finally a PUT (link the data file entry to the new revision).
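Sketched as a request sequence (the URLs are illustrative, not our actual paths), those four steps look like this:

```http
POST /atom/edit/datafiles      # 1. create a media entry for the new revision
PUT  /atom/edit/datafiles/99   # 2. update the revision's metadata, e.g. original file name
GET  /atom/edit/datafiles/42   # 3. fetch a fresh representation of the data file entry
PUT  /atom/edit/datafiles/42   # 4. PUT it back with a link added to the new revision
```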
Any of these requests could fail, leaving the data in an inconsistent state. How should we handle this possibility?
The obvious answer is to provide some sort of transactional capability, such that the client can invoke these requests within the context of a single transaction that will either completely succeed or fail with no change to the data. But that raises two further questions: how do you add transactions to the Atom Protocol, and how do you implement them?
In my not-very-extensive searching of the Web, I have not found much in the way of discussion and/or implementation of transactional capabilities for REST-style Web service APIs, except for some work on transactional support for JAX-RS-based applications. An alternative to a protocol extension is simply to design the Web service API to expose only those operations that are atomic from the user workflow point of view, then handle transaction-type issues behind the scenes. But that would mean leaving AtomPub behind and layering another API on top, which seems to completely defeat the purpose of going for REST/AtomPub in the first place.
The other option is, of course, to leave it up to the client to deal with inconsistent data, which might include trying to clean it up automatically, or simply notifying the user of problems and leaving the rest up to them, as with the discussion of broken links above. But that would leave open all sorts of weird and wonderfully wrong possible states of the system.
If someone were to point me to an extension to the Atom Protocol that allows multiple Atom requests to be carried out in the context of a single transaction, and an existing, mature, open-source implementation, I would be very happy… I think.
AtomServer has a capability to combine several Atom requests into a “batch”, via a custom extension to the Atom Protocol, based on the batch processing capability in the Google Data APIs. Although I couldn’t find any mention of adding transactional processing of an entire batch in the AtomServer documentation, it does mention that a batch is processed as a single database operation, and so it shouldn’t be too difficult to wrap that with a transaction. I’m not sure how that would impact the AtomServer batch response model, however.
Coarse-Grained Updates
The Atom Publishing Protocol, as with any CRUD-style API, is relatively coarse-grained, in the sense that if you want to update any field of an entry, you have to update the whole thing at once. This coarse-grained nature is good from one point of view, because it means that the details of the data model are kept separate from the design of the protocol. The data model can then evolve without needing any change to the protocol specification or implementation. However, it does mean that the protocol can be quite inefficient and wasteful in terms of network bandwidth. I.e., if you’ve stored a reasonable amount of data in a single Atom entry, but all you want to do is change the title, you’ll have to PUT the whole thing. This can mean noticeable latency in the client application, especially where network bandwidth is poor.
On the face of it, this issue is less serious than dealing with broken links or inconsistent data, but it does make other protocols that provide the ability to submit a fine-grained change set or change request for a particular data or metadata record (such as the Talis Platform API) start to look quite appealing.
When you delve a bit deeper, this issue also interacts with the issue of authorisation and access control, which I’ll come to below.
Authentication
I’ll just mention this briefly to say that, for us, authentication is not an issue. We use a Spring Security filter to implement authentication using HTTP Basic in the development environment, and webauth in production.
I have seen articles on inventing new authentication protocols for Atom to work around restrictions imposed by their server environment, but those restrictions don’t apply to us.
As a side note, I suspect there is a bug in the authentication code in the eXist Atom servlet, as we had some pain trying to turn it off and use Spring Security for authentication instead (for HTTP Basic authentication, we had to add a filter that removes the Authorization header before it gets to the eXist Atom servlet, otherwise the servlet’s own authentication process was activated), but I haven’t pinpointed it yet. I’d like to see eXist factor out and consolidate all authentication code from its Atom and XQuery servlets into dedicated filters, which would make for easier integration with Spring Security or other authentication frameworks… but that’s another story.
Security, Authorisation and Access Control
I’ve saved the best for last.
Within a data-sharing network, different roles are typically defined, each with different privileges/permissions/authorities/rights/ACLs/… For example, for WWARN we have defined a submitter role, a person who wants to share original research data with WWARN; a gatekeeper role, who performs an initial review of submitted data and decides whether or not the data should be accepted for curation; a curator role, whose job is to clean up and standardise data submitted from different studies so that it can be sensibly aggregated; a coordinator, who oversees the operations of the data-sharing network; and an administrator, who installs, configures and maintains the systems.
There are things people should be able to do. For example, a submitter creates studies, datasets, and data files, and can submit a dataset to the network. A gatekeeper reviews submissions and assigns a curator if the submission is accepted. A curator creates derived data files and reviews curated data for validity and conformance with standard data dictionaries.
There are, of course, things people should not be able to do. For example, a submitter cannot review their own submission or decide that their submission should be accepted by the network. Neither can a submitter assign a curator to their submission. These capabilities are reserved for the gatekeeper role.
These permissions can be implemented at the user-interface level. I.e., the user-interface can expose only those functionalities that are permitted for the user’s role(s). However, if the client application is using a Web service to implement some or all of these operations, then the Web service API must also have appropriate constraints, otherwise users would be able to hack the API and do things they shouldn’t. If the Web service API is based on Atom, then the problem we have is how to implement the appropriate authorisation constraints in a way that works with Atom and existing implementations.
Some constraints are simple to implement. For example, the constraint that only a submitter can create a study can be implemented by allowing POST requests to the Studies Atom collection URL only if the user has the submitter role. I.e., constraints that map a specific class of CRUD operation on a specific Atom collection to a specific role are straightforward, and could, for example, be implemented using a Spring Security filter and URL patterns.
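For instance, with Spring Security’s XML namespace configuration, a rule of this kind might be declared roughly as follows. The URL patterns and role names here are invented, and the method attribute on intercept-url assumes a Spring Security version that supports per-method rules:

```xml
<http xmlns="http://www.springframework.org/schema/security">
  <!-- only submitters may create new studies -->
  <intercept-url pattern="/atom/edit/studies/**" method="POST"
                 access="ROLE_SUBMITTER"/>
  <!-- any authenticated user may read them -->
  <intercept-url pattern="/atom/edit/studies/**" method="GET"
                 access="IS_AUTHENTICATED_FULLY"/>
  <http-basic/>
</http>
```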
Others are not so simple. For example, a submitter who creates a new study becomes the owner of that study. The original owner may also grant ownership to other submitters; the owner(s) of a study are the only people who can update information about that study. This raises the questions: how do we represent entry-specific access control constraints in the Atom Protocol and/or Format, and what implementations (if any) are available?
The eXist database does have its own solution for Atom security. You can declare access control constraints for a specific Atom entry by including an exist:permissions element within the Atom entry. However, this presents three problems for us. First, it requires that we use eXist’s database of users, but we have to integrate with another, external database of users, so we would have to keep the two synchronised. Second, it is based on the Unix/Linux model of file-system permissions, which is too inflexible. For example, we want to enable a user to grant arbitrary permissions for a given entry to arbitrary collections of specified users; having a single owner/group for each entry means you cannot do this sort of thing. Third, the format is specific to eXist. If we buy into their permissions format, we will have a job porting it to another Atom implementation such as AtomServer, should we need to do that at some point in the future.
The Google Data APIs have a different approach. See for example the Google Calendar API section on sharing calendars. Each access control rule is an Atom entry. There is one collection of access control rule entries for each calendar. Each access control rule has a “scope”, which typically specifies a user, and a “role”, which is typically a permission such as “read” or “editor”. See also the gAcl namespace reference.
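Schematically, an access control rule entry in this style looks something like the following. This is adapted loosely from the Calendar documentation, so treat the exact term and value URIs as approximations:

```xml
<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:gAcl="http://schemas.google.com/acl/2007">
  <category scheme="http://schemas.google.com/g/2005#kind"
            term="http://schemas.google.com/acl/2007#accessRule"/>
  <!-- the scope identifies who the rule applies to -->
  <gAcl:scope type="user" value="someone@example.com"/>
  <!-- the role says what they may do -->
  <gAcl:role value="http://schemas.google.com/gCal/2005#editor"/>
</entry>
```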
The Google approach has the flexibility we need, but I’m baulking at implementing something like this ourselves on top of eXist or another Atom implementation.
And there is another issue here. It’s not hard to imagine situations where you might want different users to be able to update only specific parts of an Atom entry, but not the whole thing. I.e., you might want to have fine-grained, within-entry access control rules for different users. This sort of thing can be handled in a purely Java environment, for example, using Spring method security and Spring domain object security. However, it’s not obvious how to implement this sort of thing in a general way for REST/Atom, and so we’ve deliberately designed our data model to avoid needing this kind of rule. I.e., rules are only ever defined at the granularity of a single entry, nothing finer. This means that authorisation considerations have a significant impact on the design of the data model. Not necessarily an issue, but interesting nonetheless.
We need a way forward here, which means minimum coding for us (we want to focus our effort on the applications, rather than the underlying services), and maximum simplicity and portability.
If you’ve read this and have any thoughts, ideas or suggestions about any of the issues above, really anything at all, no matter how trivial, please do drop me an email, or add a comment to this article, or join the dsn-chassis group and post a message there. Thanks for reading.