DC2009 Talk Notes – Towards Semantic Web Deployment: Experiences with Knowledge Organisation Systems, Library Catalogues, and Fruit Flies
by Alistair Miles
First, let me say that it has been my pleasure to attend Dublin Core conferences since 2005 in Madrid.
Thanks to the organisers here for putting on a great conference, and for inviting me to give this talk; it is a real honour.
I will post notes from this talk on my blog at purl.org/net/aliman – so don’t worry if you miss anything.
It is also excellent timing, as after at least 5 years of talking about it, I can finally say that, on 18 August 2009, SKOS – the Simple Knowledge Organisation System – was published as a W3C Recommendation, thanks in large part to the experience and support of this community.
I’d like to talk more about SKOS later in this presentation.
Before I get into the main body of my talk, let me say that one of the lessons I’ve learned in working on Semantic Web Deployment, especially in the last 2 years, is that big ideas can look great when viewed from a distance, but the reality on the ground will always be far more complex and surprising than you anticipate, as I’m sure many of you can testify.
In particular, the cost/benefit trade-offs for investing in a particular technology or approach can vary wildly from one situation to the next. Nothing is more important than keeping an open mind.
So, although I love talking about these big ideas, I’m going to try to put them to one side for this morning (although I’m sure I’ll still be tempted by one or two). Instead, I’d like to simply share a few experiences, and hopefully give you at least a flavour of where some of the opportunities and the challenges might lie.
Now, given that we’ve been talking about metadata for at least two days now, I thought I’d spend a few minutes talking about something completely different.
These are fruit flies, of the species Drosophila melanogaster.
On the left is a male, on the right a female. Females are about 2.5 millimeters long, males are slightly smaller.
When I first saw this image, I thought it was a photograph. But, when I looked closer, I noticed it was in fact a drawing, and it was signed “E Wallace”.
Edith Wallace, I found out, was curator of stocks and artist for Thomas Hunt Morgan, who c. 1908 began using fruit flies in experimental studies of heredity at Columbia University.
Heredity simply means the passing of biological traits, such as green eyes or brown hair, from parents to offspring.
I’d like to talk more about Thomas Hunt Morgan, but before I do, it’s worth noting that when Charles Darwin published his book On The Origin of Species by Means of Natural Selection in 1859, which of course depends in its central argument on the inheritance of traits from one generation to the next, the underlying mechanisms of that inheritance, i.e., of heredity, were completely unknown.
It was, of course, Gregor Mendel’s work on pea plants, published first in 1865, but not widely known until the turn of the 20th century, which first suggested that organisms inherit traits in a discrete, distinct way. It is these discrete units or particles of inheritance that we now call genes.
However, at the turn of the century, although the theory of genes was accepted, it still was not known which molecules in the cell carried these genes.
It was T. H. Morgan’s aim to determine what these molecules were.
He needed a suitable animal to study, and chose Drosophila, primarily because of cost and convenience – they are cheap and easy to culture, have a short life cycle and lots of offspring.
Morgan was looking for heritable mutations to study, and spent a long time looking before he discovered a few flies with white instead of the usual red eyes. The white-eyed mutation only occurs in male flies, i.e., it is a sex-linked trait. The fact that the trait depends on the sex of the individual suggested to Morgan that the genes responsible for the mutation reside on the sex chromosome … and that the chromosomes generally are the carriers of genetic information.
Morgan and his students also discovered that some traits, such as wing length and eye shape, do not get mixed up randomly from one generation to the next, but rather tend to be inherited together.
They demonstrated that these traits tend to go together because the genes responsible are on the same chromosome, and are in fact quite close together on that chromosome, and hence tend to stay together when chromosomes recombine during formation of sperm and egg.
They then used these observations to construct the very first genetic maps, that is, maps of where different genes are in relation to each other on the chromosomes. This was the first crucial step in understanding how a genome is organised.
Morgan started something of a trend when he chose Drosophila, because for the last century Drosophila has been one of the most intensively studied model organisms.
It is fair to say that much of what we know today about genetics, and the molecular mechanisms underlying development, behaviour, aging, and many other biological processes common to flies and humans, we know from research on fruit flies.
In the last decade, that research has entered an entirely new phase.
In 2000, the complete genome sequence of Drosophila melanogaster was published. For the first time, we gained a complete picture of the location and DNA sequence of every gene in the genome. Publication of those data has unlocked entirely new methods and avenues of scientific investigation.
However, many questions remain unanswered.
What do all those genes actually do?
Where and when do they play a part in various biological processes?
And, especially, how do genetic differences between individuals relate to different biological outcomes?
E.g., translating that question from flies to humans, why do individuals with one genotype tend to be resistant to a disease such as malaria, whereas others don’t?
Answering these types of questions is the domain of functional genomics.
A functional genomic study typically asks, in relation to some biological process like development of sperm or the eye, …
What genes are active, where, when and how much?
Which genes are interacting with each other?
What happens if you stop a gene from working (knock it out), then restore it?
For example, this image shows that a particular gene called schumacher-levy is active in the developing Drosophila embryo, and is localised to a specific organ, in this case the developing gonad.
Other advances in biotechnology over the last decade have revolutionised this type of research.
For example, we now have a number of tools for carrying out high-throughput experiments. These are experiments where, rather than looking at just a handful of genes, we can look at 100s or even 1000s of genes at a time.
The bottom line: these high-throughput technologies, in addition to rapid advances in genome sequencing technology, generate a very large amount of highly heterogeneous data.
And our ability to generate new data on unprecedented scales is accelerating. Really, the pace of change is staggering.
To give you an idea of the rate and scale of new advances, consider that, in July 2008 the Wellcome Trust Sanger Institute, the primary institute for DNA sequencing in the UK, announced that, every two minutes, they produce as much DNA sequence as was deposited in the whole first five years of the international DNA sequence databases, from 1982 to 1987.
Finding, comparing, and integrating these data is a critical challenge for ongoing biological research in all organisms, and is especially critical in translating findings from model organisms such as Drosophila to human health.
Because this is such a critical problem, much excellent work has already gone into making Drosophila data publicly available and accessible.
For example, FlyBase provides access to primary genome sequence data on all sequenced Drosophila species. It is the primary reference point for all Drosophila genome-related data.
FlyBase also establishes a controlled vocabulary for all Drosophila genes, which is a vital tool in integrating data from multiple sources, because genes are often the point of intersection between data sources.
BDGP embryo in situ database is an example of a database holding the output of high-throughput functional genomic studies. They publish images that depict the expression of a gene within the developing fly embryo, at various stages during the course of embryo development, for thousands of genes.
FlyAtlas also publishes data from high-throughput experiments, but using a different technique. They use DNA microarrays to get quantitative data on gene expression in different tissues, so the data tell you not only whether a gene is active, but also how much it is being expressed.
Much of the focus has, to date, been on providing a direct user-interface to each source of data. So each data provider has a set of web-based tools for a human researcher to query and visualise the data.
However, the focus is shifting towards enabling data to be harvested and integrated across databases in an automated way, because it is recognised that this could save much time and effort, and because some questions just couldn’t be answered any other way.
And there are projects making headway here; for example, FlyMine uses a conventional data warehouse approach to integrate data on Drosophila. However, it is no small challenge, and there is a pressing need to make the end products much more usable and, because public funding will always be scarce, to make the whole process as cost-effective, scalable and sustainable as possible.
With this in mind, in January 2008, I moved to the Zoology Department at the University of Oxford, to work with a team there led by Dr David Shotton on a small project called FlyWeb.
FlyWeb asked two questions…
1. Can we build tools for Drosophila biologists that cut down the effort required to search across different sources of gene expression data?
2. Under the hood, what (semantic) web tools and design patterns help us to build cross-database information systems, and ensure that they are robust, performant and quick to build?
To answer these questions, we set about building a proof of concept, which is deployed at openflydata.org.
How does openflydata.org work?
Each of the cross-database search applications I’ve demonstrated is, simply, a mashup.
Why did we use the mashup approach?
Simplest thing we could think to do. Also gave us flexibility to experiment with semantic web technology, without totally committing to it. I.e., we could mix semantic web and other solutions, if it proved easier to do so.
The bottom line was, we wanted to produce useful, compelling functionality for a biological researcher, in a reasonable time frame and with a reasonable level of performance and reliability. If semantic web helped, great, if not, fine, we’ll try something else.
Of course, it would have been great if each data source had provided a web service endpoint for their data with the necessary query functionality … but they didn’t, so we made some ourselves.
The approach we took was, for each data source, we first converted the data to RDF (the Resource Description Framework, one of the key Semantic Web Standards), then loaded the data into an off-the-shelf open-source RDF storage system (Jena TDB written by Andy Seaborne, now at Talis), then mounted each RDF store as a SPARQL endpoint.
What is a SPARQL endpoint?
It is, simply, a web service endpoint that uses the SPARQL protocol as its interface (API).
What is the SPARQL protocol? It is a simple HTTP-based protocol for sending SPARQL queries to be evaluated against a given data store.
What are SPARQL queries? SPARQL is a query language, roughly analogous to SQL (read only), but built for the Web. Basically, it gives you a way to ask pretty much any question you want to ask of a given set of data.
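To make the protocol a little more concrete, here is a minimal sketch of what a client does when it talks to a SPARQL endpoint: it puts the query text into a `query` parameter and sends an ordinary HTTP request. The endpoint URL and the query below are assumptions for illustration only; they are not the openflydata.org services.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint URL, for illustration only.
ENDPOINT = "http://example.org/sparql"

# A small SELECT query: up to ten things and their labels.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?thing ?label
WHERE { ?thing rdfs:label ?label }
LIMIT 10
"""


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build an HTTP GET request carrying the query in the `query` parameter."""
    url = endpoint + "?" + urlencode({"query": query})
    # Ask for the results in the SPARQL JSON results format.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)
print(request.full_url)

# To run the query against a live endpoint, one would then do something like:
#   from urllib.request import urlopen
#   import json
#   with urlopen(request) as response:
#       results = json.load(response)
```

The sketch only builds the request, so it runs without a network connection; sending it is left commented out because the endpoint is made up.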
Why did we use RDF & SPARQL?
1. Rapid prototyping. Leverage off-the-shelf, open source, software, such as Jena, or Mulgara (see David Wood). In principle, we only had to write the software to convert the data from each source to RDF. We could then use OTS software to deploy an RDF database and SPARQL endpoint. (Caveat…)
2. SPARQL is a simple protocol with an expressive query language. That means it’s easy to write code for, but gives the client application (in our case, the mashups) the power to ask any question it wants. (Caveat…)
Point 2. also means we can offer these SPARQL endpoints as a service to the biological community, so others with a bit of savvy can ask their own questions. In particular, they can ask questions that we (the service provider) haven’t thought of, which (in theory) promotes innovative re-use and exploitation of the data.
You’ll notice I haven’t yet talked about either of the two key themes of this conference, semantic interoperability or linked data.
I haven’t mentioned these yet because the point I’d like to make is that, depending on the context, there *may* be compelling, practical, short-term reasons to evaluate Semantic Web-based technology for a data integration project …
1. The (relative) ease of deploying a web service endpoint for querying a data source; i.e., of making the data accessible, lowering the barriers to re-use.
2. The (relative) simplicity of exploiting those web services to prototype light weight cross-database search and on-the-fly data integration applications. (pardon the pun)
Of course, the best choice of technology depends greatly on the existing technological context, and on the expertise available to you. As with relational databases, XML, or any family of related technologies, becoming productive with a new technology requires an investment in terms of people, training and time.
A second caveat is that, the more open and expressive a query protocol like SPARQL is, the harder it becomes to guarantee the performance and availability of a service using that protocol. It is a denial-of-service type problem. If anyone can ask any question they like, some people will ask hard questions, whether intentionally or by accident, which could degrade service performance for others.
SPARQL is thus a double edged sword. On the one hand, its open nature and expressivity is a major advantage, but that openness creates challenges when it comes to providing reliable and performant services to others.
I see resolving this tension as one of the key challenges for the semantic web community in making the technology widely applicable. We explored some strategies for mitigating these issues in FlyWeb, but we certainly did not find all of the answers.
Now, let’s talk about semantic interoperability and linked data.
The two classic problems you encounter when integrating data from different sources are…
1. schema alignment … each data source has a different data model … they structure their data in a different way … using different names for similar types of entities and relationships … or using similar names for what are actually very different types of entities and relationships …
2. coreference resolution … each data source may use different identifiers for the same thing (e.g., genes) … or (more rarely) identifiers may clash …
Our approach to schema alignment was not to try to completely align all our data sources in a single step.
Rather, we tried to pick the low-hanging fruit, and take an incremental approach, making use of existing data models as much as possible.
So, for example, the data from FlyBase come from a relational database, which is structured according to a relational schema developed over a number of years by the Generic Model Organism Database community, a schema they call Chado.
When we transformed FlyBase’s data to RDF, we used the Chado schema to help design the RDF data structures we were generating. In fact, we went one step further than that, and we semi-automatically generated an OWL ontology from the Chado relational schema. This ensured that we took a systematic approach to the data transformation, and that the definitions for the data structures in the output RDF could be grounded in the definitions already established by Chado.
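As a rough illustration of what generating an ontology from a relational schema can look like, the sketch below turns each table into a class, each plain column into a datatype property, and each foreign key into an object property, and writes the result as Turtle. This is not the FlyWeb generator: the mapping rules, the namespace and the hand-written two-table schema description are all assumptions made for this example.

```python
# Hypothetical namespace for the generated ontology terms.
BASE = "http://example.org/chado#"

# A tiny, simplified, hand-written description of two tables. A real
# generator would read this information from the database catalogue.
SCHEMA = {
    "organism": {"columns": ["genus", "species"], "foreign_keys": {}},
    "feature": {
        "columns": ["name", "uniquename"],
        "foreign_keys": {"organism_id": "organism"},
    },
}


def schema_to_owl(schema: dict, base: str = BASE) -> str:
    """Emit Turtle: table -> class, column -> property, foreign key -> link."""
    lines = [
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        f"@prefix : <{base}> .",
        "",
    ]
    for table, spec in schema.items():
        cls = table.capitalize()
        lines.append(f":{cls} a owl:Class .")
        for column in spec["columns"]:
            lines.append(
                f":{table}_{column} a owl:DatatypeProperty ; rdfs:domain :{cls} ."
            )
        for column, target in spec["foreign_keys"].items():
            lines.append(
                f":{table}_{column} a owl:ObjectProperty ; "
                f"rdfs:domain :{cls} ; rdfs:range :{target.capitalize()} ."
            )
    return "\n".join(lines)


ontology = schema_to_owl(SCHEMA)
print(ontology)
```

The attraction of a mechanical mapping like this is that every term in the output RDF can be traced back to a table or column that the source community has already defined and documented.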
Our approach to coreference resolution was, similarly, to make use of existing controlled vocabularies.
Our biggest problem was identifying genes. A single gene might be known in the scientific literature by many different names. This has been a perennial problem for Drosophila biologists, and a big part of what FlyBase has done has been to establish a definitive controlled vocabulary for Drosophila gene names, and curate a list of known synonyms for each gene.
So we used FlyBase’s unique gene identifier system as a foundation, constructed a set of URIs for Drosophila genes based on the FlyBase identifier. We then used these URIs to link data from each of the various sources.
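The idea can be sketched in a few lines: whatever name a data source uses for a gene, look it up in a synonym table, find the one identifier, and build the URI from that. The URI base, the synonym table and the identifiers below are placeholders invented for this sketch (they only follow the general "FBgn plus seven digits" shape of FlyBase gene identifiers); they are not the URIs we actually used.

```python
import re
from typing import Optional

# Hypothetical URI base, for illustration only.
GENE_URI_BASE = "http://example.org/flybase/gene/"

# Toy synonym table. The identifiers are placeholders, not real assignments.
SYNONYMS = {
    "white": "FBgn0000001",
    "w": "FBgn0000001",
    "schumacher-levy": "FBgn0000002",
}

# General shape of a FlyBase gene identifier: "FBgn" followed by 7 digits.
FBGN_PATTERN = re.compile(r"^FBgn\d{7}$")


def gene_uri(name_or_id: str) -> Optional[str]:
    """Resolve a gene name, synonym or identifier to a single URI.

    Lookups are case-sensitive. Returns None when the name is unknown, so
    that the caller can decide how to handle unmatched records.
    """
    key = name_or_id.strip()
    if FBGN_PATTERN.match(key):
        return GENE_URI_BASE + key
    flybase_id = SYNONYMS.get(key)
    return GENE_URI_BASE + flybase_id if flybase_id else None


print(gene_uri("w"))
print(gene_uri("white"))
print(gene_uri("not-a-known-gene"))
```

Because both "w" and "white" resolve to the same URI, records from sources that use either name end up attached to the same node.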
You’ll notice I said “link data” just there. What do I mean by that?
Well, I’d like to make a distinction between two types of linking.
1. “semantically linked” – data from different sources use a common set of URIs to identify data entities, e.g., people, places, genes, diseases, (… or any two URIs that identify the same entity have been explicitly mapped)
2. “web linked” – data are semantically linked, and URIs resolve to data so links can be followed by a crawler … this is what most people mean when they talk about “linked data”
We might also describe a third notion, …
3. “semantically aligned” – data from different sources use a common schema, that is, they share a common data model, (… or their respective data models have been explicitly mapped)
To build our cross-search applications, we went for the low-hanging fruit. I.e., we did just enough, and no more, to get them working.
This meant a very small amount of semantic alignment. In fact, it was quite possible to work around differences between data models, as long as those data models were understood. We certainly did not need to accomplish a complete and perfect alignment of all data models, before we could start building prototypes.
We also did not make the data web linked, i.e., we did not publish true linked data. Why not? Because it didn’t serve our immediate needs. We needed performant and queryable web services to data for each source, so we could build a mashup that selected the data it needed. Whether the data were actually linked in the web was, for this project, not relevant.
We did, however, work on semantically linking the data, i.e., mapping differences in the identifiers used, especially for genes, and this was absolutely critical to getting a reasonable level of recall and precision.
Which is not to say that I think true, web linked data is a bad idea. There may be other reasons for deploying web linked data, which would have been relevant especially if we had wanted to go beyond a proof-of-concept system.
But the point is, you can get a quick win if you make data available via a queryable web service. You save a lot of time and effort if data from multiple sources are semantically linked. But your data certainly don’t need to be perfectly and completely semantically aligned before you can start using them.
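A sketch of why semantic linking is enough for a mashup: if each source's results carry the same gene URIs, combining them is just a matter of grouping on that URI, even though the other fields differ from source to source. The records, field names and URIs below are invented for illustration and do not come from any of the databases mentioned.

```python
from collections import defaultdict

# Toy records standing in for query results from two different sources.
IMAGE_RESULTS = [
    {"gene": "http://example.org/gene/G1", "image": "http://example.org/img/1.jpg"},
    {"gene": "http://example.org/gene/G2", "image": "http://example.org/img/2.jpg"},
]
EXPRESSION_RESULTS = [
    {"gene": "http://example.org/gene/G1", "tissue": "testis", "signal": 120.5},
    {"gene": "http://example.org/gene/G3", "tissue": "ovary", "signal": 8.0},
]


def merge_by_gene(*sources):
    """Group rows from several (name, rows) sources by their shared gene URI.

    The sources keep their own field names; only the "gene" key is shared.
    """
    merged = defaultdict(list)
    for source_name, rows in sources:
        for row in rows:
            fields = {k: v for k, v in row.items() if k != "gene"}
            merged[row["gene"]].append((source_name, fields))
    return dict(merged)


combined = merge_by_gene(
    ("images", IMAGE_RESULTS),
    ("expression", EXPRESSION_RESULTS),
)
for gene, entries in combined.items():
    print(gene, entries)
```

Note that nothing here requires the two sources to agree on a schema; the client only has to understand each source's fields well enough to display them.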
I’d like to leave flies now, and return to SKOS.
As you all know better than I do, one of the cornerstones of information retrieval for many years has been the development of controlled structured vocabularies, such as the Library of Congress Subject Headings, or the Dewey Decimal Classification, or the Agrovoc Thesaurus.
Ever since the advent of the Web, there has been a desire to make better use of these valuable tools, to help organise and connect information as it emerges from closed silos and is shared via the Web.
Hopefully, the Simple Knowledge Organisation System (SKOS) will go some way towards enabling that to happen.
SKOS provides a common, standard, data model, for controlled structured vocabularies like thesauri, taxonomies and classification schemes.
This means that, if you own or have developed a controlled vocabulary, and would like to make it available for others to use, you can use SKOS to publish your vocabulary as linked data in the Web.
Because SKOS is now a standard, your data will be linkable with other vocabularies published in a similar way, and (hopefully) compatible with a variety of different software systems.
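To give a flavour of what SKOS data looks like, the following sketch writes a single concept as Turtle, using the SKOS properties for a preferred label, alternative labels and broader concepts. The vocabulary URIs and labels are made up for this example, and the helper function is a simplification of my own (for instance, it does not escape quotation marks inside labels).

```python
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"


def concept_to_turtle(uri, pref_label, alt_labels=(), broader=(), lang="en"):
    """Serialise one concept as Turtle using core SKOS properties.

    Simplification: labels are assumed not to contain quotation marks.
    """
    parts = ["a skos:Concept", f'skos:prefLabel "{pref_label}"@{lang}']
    parts += [f'skos:altLabel "{label}"@{lang}' for label in alt_labels]
    parts += [f"skos:broader <{b}>" for b in broader]
    body = " ;\n    ".join(parts)
    return f"@prefix skos: <{SKOS_NS}> .\n\n<{uri}>\n    {body} ."


# Hypothetical concept in a hypothetical vocabulary.
turtle = concept_to_turtle(
    "http://example.org/vocab/fruit-flies",
    "Fruit flies",
    alt_labels=["Drosophila"],
    broader=["http://example.org/vocab/insects"],
)
print(turtle)
```

The `skos:broader` link is what ties one concept to another, and it is the same kind of link whether the target is in your own vocabulary or in someone else's.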
If you were at DC2008, you would have heard Ed Summers talk about his work to deploy the Library of Congress Subject Headings as linked data, using SKOS.
His initial work was deployed at an experimental site, but since then, based on Ed’s work, the Library of Congress has deployed their new Authorities and Vocabularies Service at id.loc.gov.
The first service deployed there is, of course, the LCSH.
To explain what LOC have done, for each heading in the LCSH, LOC have minted a URI for that heading.
For example, the URI http://id.loc.gov/authorities/sh95000541#concept identifies the LCSH heading for the World Wide Web.
If you plug that URI into the location bar of your browser, you’ll get a conventional web page providing a summary of that heading.
With a small change in the way you make that request, you can also retrieve a machine-readable representation of that heading. I.e., you can get data. Those data are structured using SKOS.
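One common way to make that "small change" is HTTP content negotiation: the client sends the same request, but with an Accept header naming a data format rather than HTML. The sketch below assumes that this is the mechanism in play and that RDF/XML is one of the formats on offer; it only builds the request, and does not check how the service responds today.

```python
from urllib.request import Request

# The heading URI quoted in the talk.
HEADING_URI = "http://id.loc.gov/authorities/sh95000541#concept"


def build_data_request(uri: str, media_type: str = "application/rdf+xml") -> Request:
    """Build a request that asks for data rather than a web page."""
    # The fragment (#concept) is never sent to the server, so strip it.
    document_url = uri.split("#", 1)[0]
    return Request(document_url, headers={"Accept": media_type})


request = build_data_request(HEADING_URI)
print(request.full_url)
print(request.get_header("Accept"))

# To fetch the data from a live service, one would then do something like:
#   from urllib.request import urlopen
#   with urlopen(request) as response:
#       rdf_xml = response.read()
```

A browser sends an Accept header that prefers HTML, which is why the same URI gives a person a web page and gives a program data.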
Each heading is, of course, linked to other headings, so you could, if you wanted to, follow the links from one heading to the next, collecting data along the way.
Alternatively, if you want to re-use the entire LCSH in your own application, you can download the whole thing in bulk, again as data, structured using SKOS.
I hope what the Library of Congress have done with LCSH achieves three things.
First, I hope it means more people use the LCSH. For all its quirks, the LCSH, like many other vocabularies, is an invaluable resource, and I hope we will see it turned to wild and wonderful new uses.
Second, I hope it encourages other projects to re-use the LCSH URIs, to link their metadata records to LCSH via the Web. That would make it much easier to make use of links between metadata records across existing collections.
Third, I hope LCSH serves as a hub for linking other vocabularies as they emerge into the Web.
I think it is not unrealistic to imagine that, in a short time, we could see LCSH as a hub in a web of linked vocabularies, with that web of vocabularies itself serving as a hub for a much larger and broader web of linked metadata.
That this is possible is demonstrated by the fact that the LCSH has already been linked to another vocabulary, the French RAMEAU vocabulary, used by the Bibliotheque nationale de France.
A second piece of work I’d like to highlight is Michael Panzer’s work on dewey.info.
One other project I would like to highlight from within the DCMI community is the work of Jon Phipps and Diane Hillman on the NSDL metadata registry.
In fact, to call it a registry doesn’t, in my mind, do it justice, because it is a complete vocabulary development, maintenance and publication platform, built using SKOS.
And it doesn’t only cover vocabularies, it covers metadata schemas (a.k.a., element sets) too.
And not only does the registry make all of the underlying data accessible via normal HTTP requests, (which you could use to implement linked data), it provides a SPARQL endpoint too, so you can query the schemas and vocabularies however you like.
The LCSH, dewey.info, and the NSDL metadata registry are just three examples of recent uses of SKOS; a good source for more is the SKOS Implementation Report.
Let me be the first to say that SKOS isn’t perfect. Neither does it cover every eventuality. While thesauri, classification schemes and subject heading systems do have something in common, they also exhibit diversity. In many cases, that diversity is not just a historical artifact, but exists for good reason, because the vocabulary is adapted for a specialised purpose.
Our goal with SKOS was to capture enough of this commonality to enable some interoperability, but to provide an extensible foundation from which different communities could innovate, and explore solutions to their own particular problems.
We’ve seen already, at this conference, for example, in Michael Panzer and Marcia Zeng’s presentation, how work is well underway to develop extensions to SKOS for classification schemes.
This photo was taken during a discussion of how to extend SKOS for the Japanese National Diet Library Subject Headings.
On this note, I’d like to share a small insight, which I gained while I was working on SKOS, thanks to my colleagues in the W3C Semantic Web Deployment Working Group.
When I started working on SKOS, I thought that developing a standard was about getting everyone to do the same thing. I.e., it was about uniformity.
Now, I have a different perspective.
Consider that, if the developers of the original Web standards had tried to think of every possible way the Web might be used, then tried to design a complete system to accommodate all those possibilities they could think of, the Web would probably not exist today, for two reasons.
First, they would still be here today, imagining new possibilities, and arguing about how to deal with conflicting requirements.
Second, they would have built a system that was too complicated.
Whether by intention, inspiration, or accident, the original Web standards have not led to uniformity, but have rather led to an explosion of innovation and diversity.
Thus, my insight is that a good standard, at least for the Web, is one that provides a platform for innovation. It mustn’t try to do too much. Of course, it must be clear about everything that is within its scope, and so provide a sound basis for interoperability. But it should be aggressive about limiting its scope. And it must be flexible and extensible, to accommodate differences, and to enable unexpected ideas to be realised.
Striking this balance is, of course, far from easy, and I have no way of knowing whether SKOS has found the right balance. However, the people I have met through my work on SKOS continue to be an inspiration, and I hope at the very least it will provide a stepping stone to the future.
RDA, FRBR and RDF
Of course, when it comes to sharing and linking metadata, controlled vocabularies and SKOS are only a small part of the picture.
We also need standards for sharing and linking the metadata itself, standards that provide a basis for interoperability but that also can, as Michael Crandall beautifully illustrated on Tuesday, accommodate the richness and complexity of our descriptions, and of the artifacts they describe, be they literary works, works of art, or the results of scientific inquiry.
Here, too, there are opportunities to build on a significant body of previous work.
Two such bodies of work are the Functional Requirements for Bibliographic Records (FRBR), and the Anglo-American Cataloging Rules (AACR), which is the precursor to the Resource Description and Access (RDA) specification.
Now, I am not an expert on bibliographic metadata, so I cannot comment on the details of these standards.
However, I can tell you that it is possible to take the data models underlying FRBR and RDA, to publish those data models using Semantic Web standards, and then to use those models as a framework for transforming existing metadata records to RDF and publishing them as linked data in the Web.
Earlier this year, I did a very modest amount of work with the DCMI-RDA task group, proving the concept. Using a set of cataloging examples, and using the RDA elements schema developed by the task group, I developed some patterns for representing bibliographic metadata as RDF.
I then tested these on a larger scale, using a dump of just under 7 million MARC records from LOC. I showed that at least some of the metadata from the MARC records could be transformed to RDF using the FRBR schema and RDA elements schema and vocabularies.
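The general shape of such a transformation can be sketched as a lookup table from MARC fields to properties, applied record by record. The sketch below is not the patterns I developed: it uses two Dublin Core terms as stand-ins for the RDA and FRBR properties, a made-up record and record URI, and a deliberately tiny mapping, which is also why only some of the metadata comes through.

```python
DCTERMS = "http://purl.org/dc/terms/"

# (MARC tag, subfield code) -> property URI. A stand-in mapping for
# illustration; the real work used the RDA elements and FRBR schemas.
FIELD_MAP = {
    ("245", "a"): DCTERMS + "title",
    ("100", "a"): DCTERMS + "creator",
}


def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def record_to_ntriples(record_uri, fields):
    """Convert (tag, subfield, value) tuples to a list of N-Triples lines."""
    triples = []
    for tag, code, value in fields:
        prop = FIELD_MAP.get((tag, code))
        if prop is None:
            continue  # unmapped fields are skipped, so coverage is partial
        triples.append(f'<{record_uri}> <{prop}> "{escape_literal(value)}" .')
    return triples


# A made-up record, already parsed into (tag, subfield, value) tuples.
RECORD = [
    ("100", "a", "Darwin, Charles"),
    ("245", "a", "On the origin of species"),
    ("300", "a", "502 p."),
]

triples = record_to_ntriples("http://example.org/record/1", RECORD)
print("\n".join(triples))
```

Increasing coverage then means extending the mapping table, which is where most of the modelling effort lies.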
Next steps for this work would be to increase the coverage of the converted data, and to publish it not only as linked data but also via queryable (e.g., SPARQL) web services, which would drastically reduce the barrier to re-use of this fantastic resource.
Data-Sharing Networks for Malaria Research
I’d like to conclude my talk by returning to scientific research.
In June this year, after the FlyWeb Project finished, I moved up the hill in Oxford to join the Centre for Genomics and Global Health, which is a joint research programme of Oxford University and the Wellcome Trust Sanger Institute, directed by Prof. Dominic Kwiatkowski.
The main focus of our research is to assist the global campaign to eliminate malaria.
According to the WHO’s World Malaria Report 2008, half the world’s population is at risk of malaria, and an estimated 247 million cases led to nearly 881,000 deaths in 2006. Small children remain by far the most likely to die of the disease.
The recent advances in biotechnology that I mentioned earlier, in particular the rapid advances in DNA sequencing and genotyping technology are, of course, being brought to bear on the problem.
One of the most promising approaches is called genomic epidemiology, which combines genetic data from the lab with clinical data from the field, to understand why, for example, some people are less susceptible to serious infection than others. It is hoped that understanding how the natural mechanisms of protective immunity against malaria work may contribute to the development of an effective malaria vaccine.
The crunch is that this type of research cannot be done on a small scale. Because genomic epidemiology involves analysing hundreds of thousands of points of variation in the human genome, and searching for associations between these genetic differences and different disease outcomes, a large number of samples (i.e., patients) need to be included, to gain the necessary statistical power to find genuine associations.
Thus genomic epidemiology requires research collaboration on an unprecedented scale. And the key to enabling this type of research is data-sharing.
Will the Semantic Web and linked data help? I hope so. The Web, in all its chaos and diversity, will certainly play a pivotal role. But the challenges are too broad to be solved by one family of technologies alone.
Many of the key challenges are social, rather than technological.
For example, enabling a scientific community (i.e., one not trained in data modeling) to quickly reach working agreements on data standards, and enabling scientists to translate between their own view of their data, and a standardised view of their data, is vital. Here, bridging the gap between the technology and the people has never been more important.
Similarly, reaching agreements on when data may be shared, and on how it may be used, is key. Because, of course, in addition to ensuring credit is received for individual scientific research, personal genetic and clinical data is highly sensitive, and there are strict ethical rules about data use and privacy.
In spite of these challenges, I remain hopeful. I am especially encouraged by the openness of communities like this one, and by the willingness of those communities to share their experience and expertise.
With that, I hope you enjoy the rest of the conference, and thank you for listening.