Alistair Miles

Blog, migrated

I’ve moved to using GitHub pages for blogging. This site is no longer active.

MalariaGEN Informatics Blog

This is just a short post to say that my colleagues and I have started a MalariaGEN Informatics Blog; that’s where I’m mostly posting at the moment.

Using SPARQL for Biological Data Integration – Reflections on openflydata.org and the FlyWeb Project

It’s now almost 18 months since the end of the FlyWeb project and the development of the proof-of-concept site openflydata.org, so I thought it was high time to write up a few reflections. Thanks to Dr David Shotton, head of the Image Bioinformatics Research Group, for giving me the chance to work on FlyWeb; it was a great project.

If you want to know more about the technical side of the work, see the paper “OpenFlyData: An exemplar data web integrating gene expression data on the fruit fly Drosophila melanogaster” in the Journal of Biomedical Informatics.

Integrating Gene Expression Data

We wanted to help reduce the amount of time spent by a Drosophila functional genetics research group on experimental design and on validating experimental results. Experimental design includes selecting genes that might be relevant to the biological function of interest (in this case, male fertility). Validating results includes checking your gene expression data against other published data for the same gene – a discrepancy suggests an artifact or problem in your data, or in the published data … either way it’s likely to be important.

The problem here is a common refrain – the relevant data are not all found in the same place. Trawling and querying multiple sites and manually compiling the results takes a lot of time. Could we build some tools that help bring these data together?

Technology Hypothesis – Data Webs, Semantic Web & SPARQL

We wanted to be as user-driven as possible, i.e., to stay focused on what the researchers needed, and to be open-minded about technology, using whatever tools made us most productive.

But we did have a technology hypothesis, which was part of the reason why JISC funded the FlyWeb project. Our hypothesis was that building a data integration solution for our Drosophila researchers using Semantic Web standards and open source software would be (a) feasible, and (b) reasonably cost-efficient. David Shotton and Graham Klyne, the original proposal authors, had also previously developed a vision for “data webs”, an architectural pattern for integrating a set of biological data within a specific domain or for a specific purpose. Would the data webs pattern help us to build a solution?

So how did we marry these two forces: on the one hand being user-driven, on the other having a technology hypothesis that we wanted to test?

Well, in the spirit of agile, we tried to make our development iterations as short as possible. I.e., we tried to work in a way that meant we had something to put in front of users in the shortest possible time frame. When we were discussing architectural patterns and technologies, and several alternatives looked to be of similar complexity or difficulty, we favoured approaches involving RDF, OWL or SPARQL, and that were closer to the original data webs vision.

However, our goal was not to prove that a solution based on semweb standards and tech was any cheaper or better than a non-semweb alternative; just that it was possible and not prohibitively expensive. This is interesting because, if data integration solutions in related problem domains were all based on semweb standards, then they also might play together, as well as solving their own particular problems … or so the argument goes. I.e., there would be some re-use or re-purposing benefit from each individual data integration solution, and maybe some network effect, if everyone used SPARQL, for example. Of course there would be work involved in linking two data web solutions, but it might be less because at least we’d bottom out at the same standards – RDF, SPARQL, and maybe even some shared ontologies.

But you can’t even begin to talk about network effects if you can’t first show that you can solve specific problems effectively and cheaply. I.e., solutions need to make sense locally.

SPARQL Mashups

An architectural pattern that we adopted early in the project was the “SPARQL mashups” pattern. A SPARQL mashup is an HTML+JavaScript application that runs entirely in the browser, and that retrieves data directly from two or more SPARQL endpoints via the SPARQL protocol.

To see a SPARQL mashup in action, go to http://openflydata.org/search/gene-expression, click the “show logger” link at the bottom of the page (or open Firebug if you’re in Firefox), then type “schuy” into the search box. You should see SPARQL queries being sent to various SPARQL endpoints and result sets being returned in the JSON SPARQL results format.

For example, here’s the query that finds genes from FlyBase matching the query term “aly”:

# Select feature short name, unique name, annotation ID, and official full name, given 
# any label and where feature is D. melanogaster gene.

PREFIX xsd: 
PREFIX chado: 
PREFIX skos:     
PREFIX so: 
PREFIX syntype: 

SELECT DISTINCT ?uniquename ?name ?accession ?fullname WHERE {

  ?feature skos:altLabel "aly" ; 
    a so:SO_0000704 ;
    chado:organism  ;
    chado:uniquename ?uniquename ;
    chado:name ?name ; 
    chado:feature_dbxref [ 
      chado:accession ?accession ; 
      chado:db <http://openflydata.org/id/flybase/db/FlyBase_Annotation_IDs>
    ] .

  OPTIONAL {
    ?fs 
      chado:feature ?feature ; 
      chado:is_current "true"^^xsd:boolean ;
      chado:synonym [ 
        a syntype:FullName ;
        chado:name ?fullname ; 
      ] ;
      a chado:Feature_Synonym .
  }

}

Each of the panels in the UI corresponds (more-or-less) to a data source. The search term is first used in a SPARQL query to the FlyBase endpoint to find matching genes. If there is only a single gene matching the query, the gene is automatically selected, and further SPARQL queries are then sent to other data sources (e.g., FlyAtlas, BDGP, Fly-TED) to retrieve gene expression data relevant to that gene. If more than one gene matches the query (e.g., try “aly”) the user has to select a gene before the next set of queries is dispatched.

Why did we use the SPARQL mashup pattern?

Well, it allowed us to use some off-the-shelf open source software. All we had to do was code a transformation from the data in its published format to RDF. Once we had an RDF dump for each data source, we loaded the data into a triple store (we used Jena TDB) then deployed the store as a SPARQL endpoint via a SPARQL protocol server (we used Joseki initially, then SPARQLite).

Once we had a SPARQL endpoint for each data source, we could develop a simple HTML+JavaScript application in the spirit of a conventional mashup, using the SPARQL protocol as the API to the data.

A nice feature of using SPARQL here is that you don’t have to think about the API to the data, at least not from a web service point of view. The SPARQL protocol and query language basically give you an API for free. All you have to figure out is what query you need to get the right data. And you don’t need to write any code on the server side, other than that required to transform your data to RDF.
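
To make this concrete, here is a minimal sketch (in Python, using the third-party requests package) of what “the SPARQL protocol as the API” means from the client side: an HTTP request carrying the query as a parameter, with results coming back in the JSON results format. The endpoint URL and the predicate in the query are illustrative, not the actual openflydata.org details.

import requests

ENDPOINT = "http://openflydata.org/sparql/flybase"  # illustrative endpoint URL

QUERY = """
SELECT ?gene ?name WHERE {
  ?gene <http://example.org/chado#name> ?name .
} LIMIT 10
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
response.raise_for_status()

# The SPARQL JSON results format: a "head" listing the variables, plus
# "results"/"bindings" holding one dict per solution.
for binding in response.json()["results"]["bindings"]:
    print(binding["gene"]["value"], binding["name"]["value"])

Swapping in a different query is just a matter of changing the query string; nothing on the server side needs to change.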

Also, because your API supports a query language (SPARQL), you don’t need to know up-front exactly what data you need or what questions you’re going to ask (although obviously it helps to have a rough idea). I.e., if you get half-way through coding your mashup and realise you need to query the data in a different way, or retrieve more or less data, you just tweak the SPARQL query you’re sending. There are no consequences for your server-side code; your API can already handle it.

This also means your API can handle unanticipated use cases. I.e., if someone else wants to query the data for a completely different purpose, chances are they can already do it – the expressiveness of SPARQL means that the chances others will be able to use your data are high. Although this wasn’t a motivation in our project, we liked the idea.

Dealing With Big(-ish) RDF Data

As we scaled up from initial prototypes, we hit a few snags. The biggest challenge was dealing with the FlyBase data, which amounted to about 180 million triples in our final dump. Also, queries had to be quick, because users of the mashup apps are waiting for SPARQL queries to evaluate in real time. Here are a few tricks we found for working with RDF data at this scale.

  • Fast data loading – For data loading, we found we could get between 15,000 and 30,000 triples per second from Jena TDB on a 64-bit platform. That meant the FlyBase dataset loaded in somewhere between 1.5 and 3 hours. To load the data, we fired up a large EC2 instance, and loaded the data onto an EBS volume. When the load was done, we detached the volume and attached it to a small instance which hosted the query endpoint, and shut down the large instance to keep running costs down. We didn’t try this, but using a RAID 0 array and striping your data across multiple EBS volumes might increase load performance even further (there’s a nice article by Eric Hammond on using RAID 0 on EC2).
  • Everything has to be streaming – The transformation from source format (e.g., relational database) to RDF has to be streaming. The SPARQL query engine has to be streaming. And the SPARQL protocol implementation has to be streaming. That’s part of why we rolled our own SPARQL protocol implementation in the end (SPARQLite) – Joseki at the time did not write result sets in a streaming fashion, for valid reasons, but this limits scalability.
  • To get good query performance we pre-calculated some data. E.g., when we wanted to do a case-insensitive match against RDF literals in a query pattern, we computed the lower-case version of the literal and added it to the data as extra triples, then wrote queries with literals in lower case too – rather than, say, using a regex filter. SPARQL queries go much faster when they have a concrete literal or URI node to work from early in the query; queries with loose patterns and FILTERs can be very slow, because you’re pushing a lot of triples through the filters. We did also try using the SPARQL-Lucene integration (LARQ) for text matches, but couldn’t get this quite fast enough (sub 3s) for the FlyBase gene name queries, although it was used heavily in some other projects (CLAROS and MILARQ). You can also make queries go faster by shortening query paths. E.g., if you have a pattern you want to query like { ?x :p ?y . ?y :q “foo”. } your query may go faster if you first invent a new predicate :r and compute some new triples via a rule or query like CONSTRUCT { ?x :r ?z } WHERE { ?x :p ?y . ?y :q ?z. }, then add these triples to your dataset and query using the pattern { ?x :r “foo” } instead. (There’s a sketch of the lower-casing trick after this list.)
  • Beware that how you write your query may make a difference. Depending on which optimiser you use, TDB will do some re-ordering of the query to make it go faster (I believe to put more selective bits earlier), but if you know your data well (statistics are helpful) then writing the query with this in mind can help the query engine. E.g., if you have a triple pattern with a specific predicate and a specific subject or object that you know should only have a few matches, put this right at the top of the query. Basically, put the most discriminating parts of the query as early as possible. This also means that often triple patterns with rdf:type are not that helpful early on, because they don’t narrow down the results much, although this is what you tend to put first for readability.
  • Test-driven data – When you generate a large RDF dataset, you need to be sure you got the transformation right, and that the data are as you expect, otherwise you can waste a lot of time. I.e., you need to be able to test your triples. We designed some simple test harnesses for our data, where a set of test SPARQL queries were run against the data. Each SPARQL query was an ASK or SELECT and the test case defined an expectation for each query result. (A sketch of such a harness follows this list.) For very large datasets, you may also want to code some sanity checks on the n-triples dump before trying to load it into a triplestore and test with SPARQL, e.g., scanning with grep and/or awk to find triples with predicates you expect to be there.
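
The following sketch illustrates the lower-casing trick from the list above, using rdflib. The file name and the chado predicate URIs are hypothetical, and in practice we did this kind of thing as part of the bulk transformation to RDF rather than as a separate pass.

import rdflib
from rdflib import Literal, Namespace

CHADO = Namespace("http://example.org/chado#")  # hypothetical namespace

g = rdflib.Graph()
g.parse("flybase.nt", format="nt")  # hypothetical N-Triples dump

# Add a lower-cased copy of each name literal as an extra triple, so that
# queries can match a concrete lower-case literal instead of relying on a
# slow, case-insensitive regex FILTER.
for subject, name in g.subject_objects(CHADO.name):
    g.add((subject, CHADO.name_lowercase, Literal(str(name).lower())))

g.serialize("flybase-plus-lowercase.nt", format="nt")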
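
And here is a minimal sketch of the test-driven data idea, again using rdflib against a hypothetical N-Triples dump; in the project the same kind of test queries were also run against the deployed SPARQL endpoints. The chado predicate URI is illustrative (SO_0000704 is the Sequence Ontology term for “gene”).

import rdflib

g = rdflib.Graph()
g.parse("flybase.nt", format="nt")  # hypothetical N-Triples dump

# Each test pairs an ASK query with the expected answer.
TESTS = [
    # the dump should contain at least one gene feature
    ("ASK { ?f a <http://purl.obolibrary.org/obo/SO_0000704> }", True),
    # no gene feature should be missing a uniquename (hypothetical predicate)
    ("ASK { ?f a <http://purl.obolibrary.org/obo/SO_0000704> . "
     "FILTER NOT EXISTS { ?f <http://example.org/chado#uniquename> ?u } }", False),
]

for query, expected in TESTS:
    answer = g.query(query).askAnswer
    print("ok" if answer == expected else "FAIL", "-", query[:60])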

Open SPARQL Endpoints – Mitigating Denial of Service

Above I mildly extolled the virtues of SPARQL as an API to data – anyone can write the query they need to extract the data they want, and you don’t need to anticipate all requirements a priori.

The obvious downside to the expressiveness of SPARQL and the openness of SPARQL endpoints is that they are vulnerable to accidental or intentional denial of service attacks. I.e., someone can write a hard query and tie up your query engine’s compute and/or memory resources, if not crash your box.

Although deploying a production service or guaranteeing service levels wasn’t part of our remit, we were concerned that unless we could mitigate this vulnerability, SPARQL outside the firewall would never really be useful beyond a proof-of-concept. I.e., we would never be able to advertise our endpoints as a production web service, so that others could write mashups or other applications that query the data and depend on the service.

We spent a bit of time working on this, and it may be a solved problem now in newer query engines, but at the time our approach was to place some limits on the queries that open endpoints would accept. For example, SPARQLite endpoints could be configured to disallow queries with triple patterns with variable predicates, or FILTER or OPTIONAL clauses, or to enforce a LIMIT on all queries’ result sets. This is not a complete solution, because you could still write hard queries, but at least it removed some of the obvious attacks. A better solution would probably involve monitoring queries’ resource usage and killing any that take too long or consume too many resources – a bit like how Amazon’s SimpleDB places limits on service usage, including a 5 second maximum query execution time.
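
As a rough illustration of this kind of gatekeeping, the sketch below uses rdflib’s SPARQL parser to reject queries containing triple patterns with variable predicates before they ever reach the query engine. It assumes the shape of rdflib’s query algebra (basic graph patterns carrying their triples under a “triples” key), which may vary between versions; the checks in SPARQLite itself were implemented differently.

from rdflib.plugins.sparql import prepareQuery
from rdflib.plugins.sparql.parserutils import CompValue
from rdflib.term import Variable

def triple_patterns(node):
    """Recursively yield the (s, p, o) patterns found in a parsed query's algebra."""
    if isinstance(node, CompValue):
        for key, value in node.items():
            if key == "triples":
                for triple in value:
                    yield triple
            else:
                yield from triple_patterns(value)
    elif isinstance(node, (list, tuple, set)):
        for item in node:
            yield from triple_patterns(item)

def check_query(query_string):
    """Raise ValueError if any triple pattern has a variable in predicate position."""
    algebra = prepareQuery(query_string).algebra
    for s, p, o in triple_patterns(algebra):
        if isinstance(p, Variable):
            raise ValueError("queries with variable predicates are not accepted")

check_query("SELECT ?g WHERE { ?g a <http://example.org/Gene> } LIMIT 10")  # accepted
# check_query("SELECT * WHERE { ?s ?p ?o }")  # raises ValueError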

Mapping to RDF

The elephant in the room here is mapping the data to RDF, and that’s where a lot of the work went. All of our data sources came in some non-RDF format, either as CSV files or a relational database. For the CSV sources we hand-coded RDF transformations as Python scripts. For the relational databases, we made heavy use of D2RQ, although we did not use D2R Server to transform SPARQL queries to SQL on-the-fly, due to performance and scalability issues. Instead, we used the D2R dump utility to generate a complete RDF dump of each SQL data source in n-triples format, then loaded that into a Jena TDB triplestore, which backed our SPARQL endpoints.

The main issue was the time it takes to design a mapping from a fairly complex relational schema like Chado to RDF. Rather than trying to find one or more existing, published, ontologies to use in the RDF outputs of the mapping, and designing the mappings by hand, we tried a different approach. Inspired by model-driven engineering, we developed a Python utility which, driven by some simple annotations on the source SQL schema definition, generated both a suitable OWL ontology and a complete D2RQ mapping file. This worked well with a schema like Chado which has consistent structural patterns and naming conventions. There’s a worked example in the supplementary information (S4) to the OpenFlyData paper in the Journal of Biomedical Informatics.
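
To give a flavour of the model-driven approach (this is not the actual utility we wrote, just a sketch), the following generates a D2RQ class map plus property bridges for a single table from a minimal description. The table, column and vocab names are illustrative; the d2rq: terms are from the standard D2RQ mapping vocabulary.

# Sketch: generate a D2RQ mapping fragment (Turtle) for one table.

PREFIXES = """\
@prefix d2rq:  <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#> .
@prefix map:   <#> .
@prefix vocab: <http://example.org/chado#> .
"""

def class_map(table, primary_key, columns):
    """Emit a d2rq:ClassMap for the table and one d2rq:PropertyBridge per column."""
    lines = [
        "map:%s a d2rq:ClassMap ;" % table,
        "    d2rq:dataStorage map:database ;",
        '    d2rq:uriPattern "%s/@@%s.%s@@" ;' % (table, table, primary_key),
        "    d2rq:class vocab:%s ." % table.capitalize(),
        "",
    ]
    for column in columns:
        lines += [
            "map:%s_%s a d2rq:PropertyBridge ;" % (table, column),
            "    d2rq:belongsToClassMap map:%s ;" % table,
            "    d2rq:property vocab:%s ;" % column,
            '    d2rq:column "%s.%s" .' % (table, column),
            "",
        ]
    return "\n".join(lines)

print(PREFIXES)
print(class_map("feature", "feature_id", ["uniquename", "name"]))

The real utility also emitted a matching OWL class and properties for each table and column, so that the RDF generated from the database was grounded in an ontology derived from the Chado schema itself.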

The problem with this approach is, of course, that you end up with one schema/ontology per data source. Initially we thought this would force us to do some ontology alignment and to map everything to a common ontology, but we quickly realised this just wasn’t necessary. The mashup applications quite happily query each source according to its own ontology, and have just enough knowledge of what each ontology means to integrate the results in a sensible way. I.e., you can develop applications that work with multiple data sources without perfect (or even partial) ontology alignment. Obviously, aligning ontologies is desirable, but that can be a long-term ambition – using ontologies derived from the source data at least gets you started, and gets you talking about data semantics rather than getting bogged down by differences in syntax, formats or protocol (because RDF and SPARQL are the interlingua for these).

Lasting Impressions

The message I took away from this project is that, if you already have some data, and you want to make the data available to web application developers and other hackers in a useful way, then SPARQL can be a good option. It’s fairly straightforward (even, dare I say, fun) to code simple HTML+JavaScript mashups that bring data from different SPARQL endpoints together on-the-fly (pardon the pun). SPARQL won’t be a panacea, and you may find some queries just aren’t quick enough to evaluate in real time, so you may have to find ways to optimise these queries when moving to production, but it’s worth doing some benchmarking, as triplestores like Jena TDB are quick for certain types of query.

The pain comes when you need to convert data to RDF. But you don’t need to get hung up on finding the right ontologies or designing a perfect or even complete mapping. Convert what you need, using a custom ontology that is designed for your application or generated from the source data, and just get going – you’ll have plenty of iterations to refactor the data.

Would I use SPARQL again? Yes, for read-only data services and data integration webapps, I’d definitely consider it. And there are some new features coming in SPARQL 1.1 which look very useful. If someone solves the denial-of-service problem for open SPARQL endpoints (and they may already have) then the case for SPARQL as a data-sharing standard is compelling. Certainly an area to watch.

Apache, Authentication and MySQL

I just spent a couple of hours trying to configure an Apache 2.2 server to do BASIC authentication using a MySQL database of usernames and passwords. The standard way to do this is via the mod_auth_mysql module, but much of the documentation on the web is out of date or has some hidden gotchas. Here is what I got to work.

For reference, I’m using Ubuntu 10.04 with all software installed via APT (apache2, mysql-server, libapache2-mod-auth-mysql).

To install mod_auth_mysql…

$ sudo apt-get install libapache2-mod-auth-mysql
$ sudo a2enmod auth_mysql 

The biggest gotcha is that the configuration documentation for mod_auth_mysql is badly out of date. There have been some substantial changes to the configuration parameter names since that was written, although I could not find any definitive documentation of the new configuration parameters. There are a couple of other gotchas in there too; I’ll come to those in a minute.

Before configuring Apache, I set up a test database of usernames and passwords. This is what I did…

$ mysql -uroot -p
mysql> grant all on auth.* to auth_user@localhost identified by 'XXX';
mysql> flush privileges;
mysql> create database auth;
mysql> use auth;
mysql> CREATE TABLE user_info ( user_name CHAR(100) NOT NULL, user_passwd CHAR(100) NOT NULL, PRIMARY KEY (user_name) );
mysql> INSERT INTO `user_info` VALUES ('test', MD5('test'));
mysql> CREATE TABLE user_group ( user_name char(100) NOT NULL, user_group char(100) NOT NULL, PRIMARY KEY (user_name,user_group) );
mysql> INSERT INTO `user_group` VALUES ('test', 'test-group');

Note the length of the user_passwd field. 100 characters is probably more than needed, but you will definitely need more than the 20 characters suggested in some documentation if you want to use a password hash like MD5. (If the field is too short, then password hashes will get truncated when they’re inserted into the database.)

Then I configured mod_auth_mysql to authenticate users for my whole domain. In the appropriate virtual host configuration file (e.g., /etc/apache2/sites-enabled/000-default) I added the following…

<Location />

# these lines force authentication to fall through to mod_auth_mysql
AuthBasicAuthoritative Off
AuthUserFile /dev/null

# begin auth_mysql configuration
AuthMySQL On
AuthMySQL_Host localhost
AuthMySQL_User auth_user
AuthMySQL_Password XXXX
AuthMySQL_DB auth
AuthMySQL_Password_Table user_info
AuthMySQL_Username_Field user_name
AuthMySQL_Password_Field user_passwd
AuthMySQL_Empty_Passwords Off
AuthMySQL_Encryption_Types PHP_MD5
AuthMySQL_Authoritative On
#AuthMySQL_Non_Persistent Off
#AuthMySQL_Group_Table user_group
#AuthMySQL_Group_Field user_group

# generic auth configuration
AuthType Basic
AuthName "auth_mysql test"
Require valid-user

</Location>

Note the “PHP_MD5” encryption type. (Some of the documented encryption types don’t seem to be available, e.g., “MD5”.)

Then…

$ sudo apache2ctl -t # check syntax
$ sudo apache2ctl restart

Then when browsing to the host, I get an authentication challenge, and can log in with username “test” and password “test”.

Using mod_authn_dbd Instead

There is another way to get Apache to use a relational database to look up usernames and passwords when authenticating – mod_authn_dbd. That module seems more current and has up-to-date documentation; see, e.g., the Apache 2.2 mod_authn_dbd module docs and the Apache 2.2 docs on password encryption.

Note however that you cannot use normal MD5 encryption to store passwords in the database with this module. If you want to use MD5 you have to use the special Apache MD5 algorithm.
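
For example, if you populate the password table from a script, something like the following sketch (using the third-party passlib package) produces hashes in the Apache-specific apr1 MD5 format, which is the same format the htpasswd utility generates:

from passlib.hash import apr_md5_crypt

# Hash a password in Apache's apr1 MD5 format, suitable for storing in the
# column that mod_authn_dbd's AuthDBDUserPWQuery looks up.
hashed = apr_md5_crypt.hash("test")
print(hashed)                                # e.g. $apr1$...$...
print(apr_md5_crypt.verify("test", hashed))  # True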

Also note that to get this working with MySQL you will need to install the MySQL driver for DBD, which you can do via APT:

$ sudo apt-get install libaprutil1-dbd-mysql

If you get a message like “DBD: Can’t load driver file apr_dbd_mysql.so” then this is what you need to do – don’t believe the articles that tell you you need to recompile APR 🙂

Configure Exim4 on Ubuntu to use GMail as Smart Host

This is just a short post to say that, to configure exim4 to use gmail as a smart host on Ubuntu 9.04, I did only the following, and no more…

user@host:~$ sudo dpkg-reconfigure exim4-config

Choose mail sent by SMARTHOST; received via SMTP or fetchmail.
Machine handling outgoing mail for this host (smarthost):

smtp.gmail.com::587

(All other questions I left as default.)
Then…

user@host:~$ sudo emacs /etc/exim4/passwd.client

…and add the following line:

*:yourAccountName@gmail.com:y0uRpaSsw0RD

Please note, I know next to nothing about exim4 configuration, so caveat emptor.

CGGH and Data-Sharing Networks: Background

This post provides a bit of background to my current work on research data-sharing networks, as a member of the Centre for Genomics and Global Health (CGGH).

Centre for Genomics and Global Health

The Centre for Genomics and Global Health (CGGH), a joint research programme of Oxford University and the Wellcome Trust Sanger Institute, is tasked with providing support for data-sharing networks that enable clinicians and researchers around the world to collaborate effectively on large-scale research projects.

MalariaGEN

The largest data-sharing network that CGGH currently supports is MalariaGEN, the Malaria Genomic Epidemiology Network. MalariaGEN is a partnership of researchers in 21 countries who are using genomic epidemiology to understand how protective immunity against malaria works, which is a fundamental problem in malaria vaccine development. MalariaGEN has been operational since 2008.

CGGH acts as the MalariaGEN Resource Centre, providing scientific and operational support for MalariaGEN’s research and training activities. A key aspect of this operational support is the design, development and hosting of Web-based information systems that are used by MalariaGEN to manage data shared by MalariaGEN’s research partners. CGGH previously developed and currently hosts a system called Topheno, which is the system used by MalariaGEN to manage data sharing. Many lessons have been learned in the development and use of Topheno, and much of my current work builds on that experience.

WWARN

A second data-sharing network that CGGH supports is WWARN, the World-Wide Antimalarial Resistance Network. WWARN is a global collaboration working to ensure that anyone affected by malaria receives effective and safe drug treatment. WWARN’s aim is to provide quality-assured intelligence, based on the balance of currently-available scientific data, to track the emergence of malarial drug resistance. WWARN is due to begin operations in the first half of 2010.

CGGH has responsibility for WWARN’s scientific informatics module, which includes in its scope the design, development and hosting of Web-based information systems to support WWARN’s data-sharing operations. These systems are currently under development.

Common Features of WWARN and MalariaGEN

There are some key similarities between WWARN and MalariaGEN.

In both cases, the operational workflow begins with the submission of original research data, usually by a researcher who is/was involved in the study from which the data originates, acting from their host institution (usually a university).

In both cases, data are submitted to the network from a distributed community of researchers. In the case of MalariaGEN, the set of researchers submitting data to MalariaGEN is delimited by the set of partners who have signed up to one of MalariaGEN’s Consortial Projects. For WWARN, the set of researchers submitting data is envisaged to be slightly more open-ended, with researchers submitting any original data that is relevant to one of WWARN’s four scientific modules.

In both cases, data are not primarily captured for submission to MalariaGEN or WWARN, but are captured as part of an independently funded original research study. Each study from which data originates has its own scientific objectives, which may be related to the objectives of the data-sharing network, but if so are usually more specific and finer-grained. The subjects for each original study are usually drawn from at most a handful of locations within a single country. The data-sharing networks then work to aggregate the data from many independent studies, in a reliable and scientifically valid manner, to conduct coarser-grained analyses across larger scales of time, space and biology than are considered in any one original study.

This last point has a number of important consequences. For example, because data are being primarily captured for an original research study, and not for the data-sharing network’s secondary analyses, the network is not in a position to mandate the manner or format of data collection and representation. Data may be collected for a range of purposes using different means and a diversity of representations. The data-sharing network must learn to deal with this heterogeneity, and this forms a large part of the network’s data management operations.

Also, because the data-sharing network is not the primary endpoint for the data, those involved in the secondary analysis of shared data typically have to cajole researchers into submitting their data, because doing so means time out from their primary research activities. Therefore, the data-sharing network wants to minimise the obstacles it presents to those submitting data, and to find ways in which it can add value for the submitters’ primary research, even though that research will not be perfectly aligned with the secondary research activities and goals of the network as a whole.

Other Data-Sharing Networks

In addition to MalariaGEN and WWARN, CGGH is also involved in supporting an informal network of researchers working on the malaria parasite (Plasmodium) genomes. Here the main focus is on generating and analysing detailed genome sequence data using next-generation sequencing technology, although there may also be a need to share and aggregate other, related data. Finally, CGGH is involved in the UKCRC Modernising Medical Microbiology project, which may involve management of data from a number of different sources, although some of these sources will have been collected for health reasons and not for research.

Thus, involvement in data-sharing networks is a fundamental feature of CGGH’s activities. Although CGGH’s involvement is far from limited to informatics, and also encompasses sample and data management, statistics, ethics and programme management, nevertheless a key responsibility is the development of Web-based information systems that support the operational activities of a data-sharing network. It is also worth noting that the development or extension of those systems is often the rate-limiting step in establishing a data-sharing network or enabling it to adapt to a new type of data or analysis.

Generic Information Systems for Data-Sharing Networks

It is thus of urgent strategic importance to CGGH to identify those requirements for information systems that are common across these data-sharing networks. Once these requirements are understood, we need to identify a set of existing software and services that can be adopted and deployed to fulfil those common requirements. The underlying driver is to minimise the amount of time and effort spent on designing, developing and running common infrastructure, and thus make available as much effort as possible to deal with those requirements that are unique to the scientific activities of a particular data-sharing network.

What Next?

We’re working to identify and document many of the key requirements that are known to be common at least between MalariaGEN and WWARN. This work should hopefully then provide a basis for finding and evaluating existing software and services, and for designing a reference architecture which provides the highest possible point of departure for developing information systems for each data-sharing network. We’re going to need lots of help with that, so please feel free to contact me if you think you might be able to help.

DC2009 Talk Notes – Towards Semantic Web Deployment: Experiences with Knowledge Organisation Systems, Library Catalogues, and Fruit Flies

First, let me say that it has been my pleasure to attend Dublin Core conferences since 2005 in Madrid.

Thanks to the organisers here for putting on a great conference, and for inviting me to give this talk, it is a real honour.

I will post notes from this talk on my blog at purl.org/net/aliman – so don’t worry if you miss anything.

It is also excellent timing, as after at least 5 years of talking about it, I can finally say that, on 18 August 2009 SKOS – the Simple Knowledge Organisation System – was published as a W3C Recommendation, thanks in large part to the experience and support of this community.

I’d like to talk more about SKOS later in this presentation.

Before I get into the main body of my talk, let me say first that, one of the lessons I’ve learned in working on Semantic Web Deployment, especially in the last 2 years, is that big ideas can look great when viewed from a distance, but the reality on the ground will always be far more complex and surprising than you anticipate, as I’m sure many of you can testify to.

In particular, the cost/benefit trade-offs for investing in a particular technology or approach can vary wildly from one situation to the next. Nothing is more important than keeping an open mind.

So, although I love talking about these big ideas, I’m going to try to put them to one side for this morning (although I’m sure I’ll still be tempted by one or two). Instead, I’d like to simply share a few experiences, and hopefully give you at least a flavour of where some of the opportunities and the challenges might lie.

Now, given that we’ve been talking about metadata for at least two days now, I thought I’d spend a few minutes talking about something completely different.

Fruit Flies

These are fruit flies, of the species Drosophila melanogaster.

On the left is a male, on the right a female. Females are about 2.5 millimeters long, males are slightly smaller.

When I first saw this image, I thought it was a photograph. But, when I looked closer, I noticed it was in fact a drawing, and it was signed “E Wallace”.

Edith Wallace, I found out, was curator of stocks and artist for Thomas Hunt Morgan, who c. 1908 began using fruit flies in experimental studies of heredity at Columbia University.

Heredity simply means the passing of biological traits, such as green eyes or brown hair, from parents to offspring.

I’d like to talk more about Thomas Hunt Morgan, but before I do, it’s worth noting that when Charles Darwin published his book On The Origin of Species by Means of Natural Selection in 1859, which of course depends in its central argument on the inheritance of traits from one generation to the next, the underlying mechanisms of that inheritance, i.e., of heredity, were completely unknown.

It was, of course, Gregor Mendel’s work on pea plants, published first in 1865, but not widely known until the turn of the 20th century, which first suggested that organisms inherit traits in a discrete, distinct way. It is these discrete units or particles of inheritance that we now call genes.

However, at the turn of the century, although the theory of genes was accepted, it still was not known which molecules in the cell carried these genes.

It was T. H. Morgan’s aim to determine what these molecules were.

He needed a suitable animal to study, and chose Drosophila, primarily because of cost and convenience – they are cheap and easy to culture, have a short life cycle and lots of offspring.

Morgan was looking for heritable mutations to study, and spent a long time looking before he discovered a few flies with white instead of the usual red eyes. The white-eyed mutation only occurs in male flies, i.e., it is a sex-linked trait. The fact that the trait depends on the sex of the individual suggested to Morgan that the genes responsible for the mutation reside on the sex chromosome … and that the chromosomes generally are the carriers of genetic information.

Morgan and his students also discovered that some traits, such as wing length and eye shape, do not get mixed up randomly from one generation to the next, but rather tend to be inherited together.

They demonstrated that the reason why these traits tend to go together is because the genes responsible are on the same chromosome, and are in fact quite close together on the same chromosome, and hence tend to stay together when chromosomes recombine during formation of sperm and egg.

They then used these observations to construct the very first genetic maps, that is, maps of where different genes are in relation to each other on the chromosomes. This was the first crucial step in understanding how a genome is organised.

Morgan started something of a trend when he chose Drosophila, because for the last century Drosophila has been one of the most intensively studied model organisms.

It is fair to say that much of what we know today about genetics, and the molecular mechanisms underlying development, behaviour, aging, and many other biological processes common to flies and humans, we know from research on fruit flies.

In the last decade, that research has entered an entirely new phase.

In 2000, the complete genome sequence of Drosophila melanogaster was published. For the first time, we gained a complete picture of the location and DNA sequence of every gene in the genome. Publication of those data has unlocked entirely new methods and avenues of scientific investigation.

However, many questions remain unanswered.

What do all those genes actually do?

Where and when do they play a part in various biological processes?

And, especially, how do genetic differences between individuals relate to different biological outcomes?

E.g., translating that question from flies to humans, why do individuals with one genotype tend to be resistant to a disease such as malaria, whereas individuals with another genotype are not?

Answering these types of questions is the domain of functional genomics.

A functional genomic study typically asks, in relation to some biological process like development of sperm or the eye, …

What genes are active, where, when and how much?

Which genes are interacting with each other?

What happens if you stop a gene from working (knock it out), then restore it?

For example, this image shows that a particular gene called schumacher-levy is active in the developing Drosophila embryo, and is localised to a specific organ, in this case the developing gonad.

Other advances in biotechnology over the last decade have revolutionised this type of research.

For example, we now have a number of tools for carrying out high-throughput experiments. These are experiments where, rather than looking at just a handful of genes, we can look at 100s or even 1000s of genes at a time.

The bottom line: these high-throughput technologies, in addition to rapid advances in genome sequencing technology, generate a very large amount of highly heterogeneous data.

And our ability to generate new data on unprecedented scales is accelerating. Really, the pace of change is staggering.

To give you an idea of the rate and scale of new advances, consider that, in July 2008 the Wellcome Trust Sanger Institute, the primary institute for DNA sequencing in the UK, announced that, every two minutes, they produce as much DNA sequence as was deposited in the whole first five years of the international DNA sequence databases, from 1982 to 1987.

Finding, comparing, and integrating these data is a critical challenge for ongoing biological research in all organisms, and is especially critical in translating findings from model organisms such as Drosophila to human health.

Because this is such a critical problem, much excellent work has already gone into making Drosophila data publicly available and accessible.

For example, FlyBase provides access to primary genome sequence data on all sequenced Drosophila species. It is the primary reference point for all Drosophila genome-related data.

FlyBase also establishes a controlled vocabulary for all Drosophila genes, which is a vital tool in integrating data from multiple sources, because genes are often the point of intersection between data sources.

The BDGP embryo in situ database is an example of a database holding the output of high-throughput functional genomic studies. They publish images that depict the expression of a gene within the developing fly embryo, at various stages during the course of embryo development, for thousands of genes.

FlyAtlas also publishes data from high-throughput experiments, but using a different technique. They use DNA microarrays to get quantitative data on gene expression in different tissues, so the data tell you not only whether a gene is active, but also how much it is being expressed.

Much of the focus has, to date, been on providing a direct user-interface to each source of data. So each data provider has a set of web-based tools for a human researcher to query and visualise the data.

However, the focus is shifting towards enabling data to be harvested and integrated across databases in an automated way, because it is recognised that this could save much time and effort, and because some questions just couldn’t be answered any other way.

And there are projects making headway here, for example, FlyMine uses a conventional data warehouse approach to integrate data on Drosophila. However, it is no small challenge, and there is a pressing need to make the end products much more usable, and because public funding will always be scarce, to make the whole process as cost-effective, scalable and sustainable as possible.

FlyWeb Project

With this in mind, in January 2008, I moved to the Zoology Department at the University of Oxford, to work with a team there led by Dr David Shotton on a small project called FlyWeb.

FlyWeb asked two questions…

1. Can we build tools for Drosophila biologists that cut down the effort required to search across different sources of gene expression data?

2. Under the hood, what (semantic) web tools and design patterns help us to build cross-database information systems, and ensure that they are robust, performant and quick to build?

To answer these questions, we set about building a proof of concept, which is deployed at openflydata.org.

How does openflydata.org work?

Each of the cross-database search applications I’ve demonstrated is, simply, a mashup.

It is a lightweight HTML+JavaScript application that runs in the browser and fetches data in real time from several different web service endpoints.

Why did we use the mashup approach?

It was the simplest thing we could think of doing. It also gave us flexibility to experiment with semantic web technology, without totally committing to it. I.e., we could mix semantic web and other solutions, if it proved easier to do so.

The bottom line was, we wanted to produce useful, compelling functionality for a biological researcher, in a reasonable time frame and with a reasonable level of performance and reliability. If semantic web helped, great, if not, fine, we’ll try something else.

Of course, it would have been great if each data source had provided a web service endpoint for their data with the necessary query functionality … but they didn’t, so we made some ourselves.

The approach we took was, for each data source, we first converted the data to RDF (the Resource Description Framework, one of the key Semantic Web Standards), then loaded the data into an off-the-shelf open-source RDF storage system (Jena TDB written by Andy Seaborne, now at Talis), then mounted each RDF store as a SPARQL endpoint.

What is a SPARQL endpoint?

It is, simply, a web service endpoint that uses the SPARQL protocol as its interface (API).

What is the SPARQL protocol? It is a simple HTTP-based protocol for sending SPARQL queries to be evaluated against a given data store.

What are SPARQL queries? SPARQL is a query language, roughly analogous to SQL (read only), but built for the Web. Basically, it gives you a way to ask pretty much any question you want to ask of a given set of data.

Why did we use RDF & SPARQL?

1. Rapid prototyping. Leverage off-the-shelf, open-source software, such as Jena or Mulgara (see David Wood). In principle, we only had to write the software to convert the data from each source to RDF. We could then use OTS software to deploy an RDF database and SPARQL endpoint. (Caveat…)

2. SPARQL is a simple protocol with an expressive query language. That means it’s easy to write code for, but gives the client application (in our case, the mashups) the power to ask any question it wants. (Caveat…)

Point 2 also means we can offer these SPARQL endpoints as a service to the biological community, so others with a bit of savvy can ask their own questions. In particular, they can ask questions that we (the service provider) haven’t thought of, which (in theory) promotes innovative re-use and exploitation of the data.

You’ll notice I haven’t yet talked about either of the two key themes of this conference, semantic interoperability or linked data.

I haven’t mentioned these yet because, the point I’d like to make is that, depending on the context, there *may* be compelling, practical, short-term reasons to evaluate Semantic Web-based technology for a data integration project …

1. The (relative) ease of deploying a web service endpoint for querying a data source; i.e., of making the data accessible, lowering the barriers to re-use.

2. The (relative) simplicity of exploiting those web services to prototype light weight cross-database search and on-the-fly data integration applications. (pardon the pun)

Caveats…

Of course, the best choice of technology depends greatly on the existing technological context, and on the expertise available to you. As with relational databases, XML, or any family of related technologies, becoming productive with a new technology requires an investment in terms of people, training and time.

A second caveat is that, the more open and expressive a query protocol like SPARQL is, the harder it becomes to guarantee the performance and availability of a service using that protocol. It is a denial-of-service type problem. If anyone can ask any question they like, some people will ask hard questions, whether intentionally or by accident, which could degrade service performance for others.

SPARQL is thus a double edged sword. On the one hand, its open nature and expressivity is a major advantage, but that openness creates challenges when it comes to providing reliable and performant services to others.

I see resolving this tension as one of the key challenges for the semantic web community in making the technology widely applicable. We explored some strategies for mitigating these issues in FlyWeb, but we certainly did not find all of the answers.

Now, let’s talk about semantic interoperability and linked data.

The two classic problems you encounter when integrating data from different sources are…

1. schema alignment … each data source has a different data model … they structure their data in a different way … using different names for similar types of entities and relationships … or using similar names for what are actually very different types of entities and relationships …

2. coreference resolution … each data source may use different identifiers for the same thing (e.g., genes) … or (more rarely) identifiers may clash …

Our approach to schema alignment was not to try to completely align all our data sources in a single step.

Rather, we tried to pick the low-hanging fruit, and take an incremental approach, making use of existing data models as much as possible.

So, for example, the data from FlyBase come from a relational database, which is structured according to a relational schema developed over a number of years by the Generic Model Organism Database community, a schema they call Chado.

When we transformed FlyBase’s data to RDF, we used the Chado schema to help design the RDF data structures we were generating. In fact, we went one step further than that, and we semi-automatically generated an OWL ontology from the Chado relational schema. This ensured that we took a systematic approach to the data transformation, and that the definitions for the data structures in the output RDF could be grounded in the definitions already established by Chado.

Our approach to coreference resolution was, similarly, to make use of existing controlled vocabularies.

Our biggest problem was identifying genes. A single gene might be known in the scientific literature by many different names. This has been a perennial problem for Drosophila biologists, and a big part of what FlyBase has done has been to establish a definitive controlled vocabulary for Drosophila gene names, and curate a list of known synonyms for each gene.

So we used FlyBase’s unique gene identifier system as a foundation, and constructed a set of URIs for Drosophila genes based on the FlyBase identifiers. We then used these URIs to link data from each of the various sources.

You’ll notice I said “link data” just there. What do I mean by that?

Well, I’d like to make a distinction between two types of linking.

1. “semantically linked” – data from different sources use a common set of URIs to identify data entities, e.g., people, places, genes, diseases (… or any two URIs that identify the same entity have been explicitly mapped)

2. “web linked” – data are semantically linked, and URIs resolve to data so links can be followed by a crawler … this is what most people mean when they talk about “linked data”

We might also describe a third notion, …

3. “semantically aligned” – data from different sources use a common schema, that is, they share a common data model, (… or their respective data models have been explicitly mapped)

To build our cross-search applications, we went for the low-hanging fruit. I.e., we did just enough, and no more, to get them working.

This meant a very small amount of semantic alignment. In fact, it was quite possible to work around differences between data models, as long as those data models were understood. We certainly did not need to accomplish a complete and perfect alignment of all data models, before we could start building prototypes.

We also did not make the data web linked, i.e., we did not publish true linked data. Why not? Because it didn’t serve our immediate needs. We needed performant and queryable web services to data for each source, so we could build a mashup that selected the data it needed. Whether the data were actually linked in the web was, for this project, not relevant.

We did, however, work on semantically linking the data, i.e., mapping differences in the identifiers used, especially for genes, and this was absolutely critical to getting a reasonable level of recall and precision.

Which is not to say that I think true, web linked data is a bad idea. There may be other reasons for deploying web linked data, which would have been relevant especially if we had wanted to go beyond a proof-of-concept system.

But the point is, you can get a quick win if you make data available via a queryable web service. You save a lot of time and effort if data from multiple sources are semantically linked. But your data certainly don’t need to be perfectly and completely semantically aligned before you can start using them.

SKOS

I’d like to leave flies now, and return to SKOS.

As you all know better than I do, one of the cornerstones of information retrieval for many years has been the development of controlled structured vocabularies, such as the Library of Congress Subject Headings, or the Dewey Decimal Classification, or the Agrovoc Thesaurus.

Ever since the advent of the Web, there has been a desire to make better use of these valuable tools, to help organise and connect information as it emerges from closed silos and is shared via the Web.

Hopefully, the Simple Knowledge Organisation System (SKOS) will go some way towards enabling that to happen.

SKOS provides a common, standard, data model, for controlled structured vocabularies like thesauri, taxonomies and classification schemes.

This means that, if you own or have developed a controlled vocabulary, and would like to make it available for others to use, you can use SKOS to publish your vocabulary as linked data in the Web.

Because SKOS is now a standard, your data will be linkable with other vocabularies published in a similar way, and (hopefully) compatible with a variety of different software systems.

If you were at DC2008, you would have heard Ed Summers talk about his work to deploy the Library of Congress Subject Headings as linked data, using SKOS.

His initial work was deployed at an experimental site, but since then, based on Ed’s work, the Library of Congress has deployed their new Authorities and Vocabularies Service at id.loc.gov.

The first service deployed there is, of course, the LCSH.

To explain what LOC have done: for each heading in the LCSH, they have minted a URI.

For example, the URI http://id.loc.gov/authorities/sh95000541#concept identifies the LCSH heading for the World Wide Web.

If you plug that URI into the location bar of your browser, you’ll get a conventional web page providing a summary of that heading.

With a small change in the way you make that request, you can also retrieve a machine-readable representation of that heading. I.e., you can get data. Those data are structured using SKOS.
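
For example, a request along the lines of the following sketch (in Python, using the requests package) asks for RDF/XML via content negotiation; the exact formats and behaviour of id.loc.gov may well have changed since this was written.

import requests

# The LCSH heading for "World Wide Web". Note we request the resource URI
# without the #concept fragment, since fragments are never sent to the server.
uri = "http://id.loc.gov/authorities/sh95000541"

response = requests.get(uri, headers={"Accept": "application/rdf+xml"})
print(response.headers.get("Content-Type"))
print(response.text[:400])  # SKOS data: skos:prefLabel, skos:broader, ...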

Each heading is, of course, linked to other headings, so you could, if you wanted to, follow the links from one heading to the next, collecting data along the way.

Alternatively, if you want to re-use the entire LCSH in your own application, you can download the whole thing in bulk, again as data, structured using SKOS.

I hope what the Library of Congress have done with LCSH achieves three things.

First, I hope it means more people use the LCSH. For all its quirks, the LCSH, like many other vocabularies, is an invaluable resource, and I hope we will see it turned to wild and wonderful new uses.

Second, I hope it encourages other projects to re-use the LCSH URIs, to link their metadata records to LCSH via the Web. That would make it much easier to make use of links between metadata records across existing collections.

Third, I hope LCSH serves as a hub for linking other vocabularies as they emerge into the Web.

I think it is not unrealistic to imagine that, in a short time, we could see LCSH as a hub in a web of linked vocabularies, with that web of vocabularies itself serving as a hub for a much larger and broader web of linked metadata.

That this is possible is demonstrated by the fact that the LCSH has already been linked to another vocabulary, the French RAMEAU vocabulary, used by the Bibliotheque nationale de France.

A second piece of work I’d like to highlight is Michael Panzer’s work on dewey.info.

One other project I would like to highlight from within the DCMI community is the work of Jon Phipps and Diane Hillmann on the NSDL metadata registry.

In fact, to call it a registry doesn’t, in my mind, do it justice, because it is a complete vocabulary development, maintenance and publication platform, built using SKOS.

And it doesn’t only cover vocabularies, it covers metadata schemas (a.k.a., element sets) too.

And not only does the registry make all of the underlying data accessible via normal HTTP requests, (which you could use to implement linked data), it provides a SPARQL endpoint too, so you can query the schemas and vocabularies however you like.

The LCSH, dewey.info, and the NSDL metadata registry are just three examples of recent uses of SKOS; a good source for more is the SKOS Implementation Report.

Let me be the first to say that SKOS isn’t perfect. Neither does it cover every eventuality. While thesauri, classification schemes and subject heading systems do have something in common, they also exhibit diversity. In many cases, that diversity is not just a historical artifact, but exists for good reason, because the vocabulary is adapted for a specialised purpose.

Our goal with SKOS was to capture enough of this commonality to enable some interoperability, but to provide an extensible foundation from which different communities could innovate, and explore solutions to their own particular problems.

We’ve seen already, at this conference, for example, in Michael Panzer and Marcia Zeng’s presentation, how work is well underway to develop extensions to SKOS for classification schemes.

This photo was taken during a discussion of how to extend SKOS for the Japanese National Diet Library Subject Headings.

On this note, I’d like to share a small insight, which I gained while I was working on SKOS, thanks to my colleagues in the W3C Semantic Web Deployment Working Group.

When I started working on SKOS, I thought that developing a standard was about getting everyone to do the same thing. I.e., it was about uniformity.

Now, I have a different perspective.

Consider that, if the developers of the original Web standards had tried to think of every possible way the Web might be used, then tried to design a complete system to accommodate all those possibilities they could think of, the Web would probably not exist today, for two reasons.

First, they would still be here today, imagining new possibilities, and arguing about how to deal with conflicting requirements.

Second, they would have built a system that was too complicated.

Whether by intention, inspiration, or accident, the original Web standards have not led to uniformity, but have rather led to an explosion of innovation and diversity.

Thus, my insight is that a good standard, at least for the Web, is one that provides a platform for innovation. It mustn’t try to do too much. Of course, it must be clear about everything that is within its scope, and so provide a sound basis for interoperability. But it should be aggressive about limiting its scope. And it must be flexible and extensible, to accommodate differences, and to enable unexpected ideas to be realised.

Striking this balance is, of course, far from easy, and I have no way of knowing whether SKOS has found the right balance. However, the people I have met through my work on SKOS continue to be an inspiration, and I hope at the very least it will provide a stepping stone to the future.

RDA, FRBR and RDF

Of course, when it comes to sharing and linking metadata, controlled vocabularies and SKOS are only a small part of the picture.

We also need standards for sharing and linking the metadata itself, standards that provide a basis for interoperability but that can also, as Michael Crandall beautifully illustrated on Tuesday, accommodate the richness and complexity of our descriptions, and of the artifacts they describe, be they literary works, works of art, or the results of scientific inquiry.

Here, too, there are opportunities to build on a significant body of previous work.

Two such bodies of work are the Functional Requirements for Bibliographic Records (FRBR) and the Anglo-American Cataloguing Rules (AACR), the precursor to the Resource Description and Access (RDA) specification.

Now, I am not an expert on bibliographic metadata, so I cannot comment on the details of these standards.

However, I can tell you that it is possible to take the data models underlying FRBR and RDA, to publish those data models using Semantic Web standards, and then to use those models as a framework for transforming existing metadata records to RDF and publishing them as linked data in the Web.

Earlier this year, I did a very modest amount of work with the DCMI-RDA task group, proving the concept. Using a set of cataloging examples and the RDA elements schema developed by the task group, I developed some patterns for representing bibliographic metadata as RDF.

I then tested these on a larger scale, using a dump of just under 7 million MARC records from LOC. I showed that at least some of the metadata from the MARC records could be transformed to RDF using the FRBR schema and RDA elements schema and vocabularies.
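
To give a concrete flavour of the pattern, here is a minimal sketch in Python using rdflib. The namespaces, property names and example values are placeholders rather than the published FRBR and RDA vocabularies; the point is simply to show entities being minted as URIs and described with element-style properties.

# Minimal sketch of the general pattern (not the actual conversion code):
# mint URIs for FRBR-style Work/Expression/Manifestation entities and attach
# RDA-style element properties. Namespaces and terms below are placeholders.
from rdflib import Graph, Namespace, URIRef, Literal

FRBR = Namespace("http://example.org/frbr/")         # placeholder namespace
RDA = Namespace("http://example.org/rdaelements/")   # placeholder namespace

g = Graph()
work = URIRef("http://example.org/work/1")
expression = URIRef("http://example.org/expression/1")
manifestation = URIRef("http://example.org/manifestation/1")

g.add((work, RDA.titleOfTheWork, Literal("An Example Title")))
g.add((expression, FRBR.realizationOf, work))           # expression realizes the work
g.add((manifestation, FRBR.embodimentOf, expression))   # manifestation embodies the expression
g.add((manifestation, RDA.publishersName, Literal("Example Press")))

print(g.serialize(format="turtle"))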

Next steps for this work would be to increase the coverage of the converted data, and to publish it not only as linked data but also via queryable (e.g., SPARQL) web services, which would drastically reduce the barrier to re-use of this fantastic resource.

Data-Sharing Networks for Malaria Research

I’d like to conclude my talk by returning to scientific research.

In June this year, after the FlyWeb Project finished, I moved up the hill in Oxford to join the Centre for Genomics and Global Health, which is a joint research programme of Oxford University and the Wellcome Trust Sanger Institute, directed by Prof. Dominic Kwiatkowski.

The main focus of our research is to assist the global campaign to eliminate malaria.

According to the WHO’s World Malaria Report 2008, half the world’s population is at risk of malaria, and an estimated 247 million cases led to around 881,000 deaths in 2006. Small children remain by far the most likely to die of the disease.

The recent advances in biotechnology that I mentioned earlier, in particular in DNA sequencing and genotyping technology, are of course being brought to bear on the problem.

One of the most promising approaches is called genomic epidemiology, which combines genetic data from the lab with clinical data from the field, to understand why, for example, some people are less susceptible to serious infection than others. It is hoped that understanding how the natural mechanisms of protective immunity against malaria work may contribute to the development of an effective malaria vaccine.

The crunch is that this type of research cannot be done on a small scale. Because genomic epidemiology involves analysing hundreds of thousands of points of variation in the human genome, and searching for associations between these genetic differences and different disease outcomes, a large number of samples (i.e., patients) need to be included, to gain the necessary statistical power to find genuine associations.
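
To give a flavour of what a single-marker association test looks like, here is a toy sketch in Python (using scipy, with made-up allele counts); a real genome-wide analysis repeats this kind of test across hundreds of thousands of markers and uses far more sophisticated methods and corrections.

# Toy illustration of a single-marker association test: compare allele counts
# in cases and controls with a chi-square test. The counts are made up.
from scipy.stats import chi2_contingency

# rows: cases, controls; columns: counts of allele A, allele a
table = [[1200, 800],
         [1000, 1000]]

chi2, p_value, dof, expected = chi2_contingency(table)
print("chi-square = %.2f, p = %.3g" % (chi2, p_value))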

Thus genomic epidemiology requires research collaboration on an unprecedented scale. And the key to enabling this type of research is data-sharing.

Will the Semantic Web and linked data help? I hope so. The Web, in all its chaos and diversity, will certainly play a pivotal role. But the challenges are too broad to be solved by one family of technologies alone.

Many of the key challenges are social, rather than technological.

For example, it is vital to enable a scientific community (i.e., one not trained in data modeling) to reach working agreements on data standards quickly, and to enable scientists to translate between their own view of their data and a standardised view of it. Here, bridging the gap between the technology and the people has never been more important.

Similarly, reaching agreements on when data may be shared, and on how it may be used, is key. This is partly about ensuring that individual researchers receive credit for their work; but personal genetic and clinical data are also highly sensitive, and there are strict ethical rules governing data use and privacy.

In spite of these challenges, I remain hopeful. I am especially encouraged by the openness of communities like this one, and by the willingness of those communities to share their experience and expertise.

With that, I hope you enjoy the rest of the conference, and thank you for listening.

SKOS is a W3C Recommendation

Just a short post to say that the Simple Knowledge Organization System (SKOS) Reference is now a W3C Recommendation.

W3C issued the following press release: From Chaos, Order: W3C Standard Helps Organize Knowledge

I’m proud to have been a part of this work, and extremely grateful to all those who have supported and contributed over the last 5 years.

Running GWT Unit Tests in Manual Mode from Eclipse

A little tidbit: if you want to run GWT unit tests in manual mode from Eclipse, right-click the test case class and select Run As > GWT JUnit Test as you would normally, which will create a run configuration for you. The first time round, this will run the test in hosted mode. To get the test to run in manual mode, go to Run Configurations, select the run configuration for your test, then in the VM arguments box under the Arguments tab enter the following …


-Dgwt.args="-manual"

SKOS is a Candidate Recommendation

Almost two months ago now, the Semantic Web Deployment Working Group published the SKOS Reference Candidate Recommendation. Since then, we’ve had a good number of high quality implementations (see also Sean’s SKOS implementations spreadsheet), which is excellent news.

FlyWeb – Working Across Databases for Drosophila Functional Genomics

Over the last year or so, my main priority has been the FlyWeb Project. Unfortunately, FlyWeb was supported by short-term funding (18 months), and is coming to an end soon. Here are a few belated notes on what we did and why we did it…

The main goal of FlyWeb was to minimize the time required for a researcher in the domain of Drosophila (fruit fly) functional genomics, with no informatics training, to find and compare gene expression data from different databases on a large number of genes. With this in mind, we developed openflydata.org, which hosts a set of cross-database gene expression data search applications.

The applications are all pure JavaScript, built using a custom library called FlyUI. They fetch data AJAX-style directly from four SPARQL endpoints, one for each of the four sources of genomic data. On the server side, we use Jena TDB as the underlying RDF storage and query engine, and SPARQLite as the SPARQL protocol server. The whole thing runs on a small EC2 instance.
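
For a flavour of the kind of query these applications send, here is a rough sketch in Python using SPARQLWrapper (the real applications are JavaScript). The endpoint URL, predicates and gene name are placeholders rather than the actual openflydata.org vocabulary.

# Rough sketch (illustration only) of a gene-expression lookup against a
# SPARQL endpoint. Endpoint URL, predicates and gene name are placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://example.org/sparql")  # hypothetical endpoint
sparql.setQuery("""
    SELECT ?result
    WHERE {
        ?gene <http://example.org/vocab#name> "wg" .
        ?result <http://example.org/vocab#expressionOf> ?gene .
    }
""")
sparql.setReturnFormat(JSON)
print(sparql.query().convert())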

Further details on our work to convert the four data sources to RDF, together with bulk RDF downloads, SPARQL endpoints and more, are also available online.

Change of Contact Details

This is just a short post to say that I’m moving to a new role shortly, and so my contact details are changing also.

To reach me via email, use: alimanfoo at gmail dot com

I will be on leave from 15-31 May.

Semantic Web Deployment Final Face-to-Face

The W3C Semantic Web Deployment Working Group is kicking off its final face-to-face meeting at the Library of Congress in Washington, D.C. The main purpose of the meeting is to resolve outstanding issues for the Simple Knowledge Organization System (SKOS), which are summarised on the meeting agenda.

As an aside, I heard recently about the deployment of the Library of Congress Subject Headings (LCSH) as linked data in the Web, using SKOS. This nice work provides a great backdrop to our meeting.

Installing RDFLIB on Windows, and Making it Work with PyDev

I had some trouble installing RDFLIB, the Python RDF library, on my Windows Vista laptop, and getting everything to work with PyDev in Eclipse.

I have Python 2.5 installed from the MSI. When I ran python setup.py install from the RDFLIB download directory, I got the message:

error: Python was built with Visual Studio 2003;
extensions must be built with a compiler than can generate compatible binaries.
Visual Studio 2003 was not found on this system. If you have Cygwin installled,
you can try compiling with MingW32, by passing "-c mingw32" to setup.py.

I have Cygwin installed, so I installed Cygwin’s Python 2.5, then used that to run python setup.py install, which worked fine.

However, when I tried to use “Run As … Python unit-test” from within Eclipse (with PyDev installed), it didn’t work. Apparently, there are compatibility problems between PyDev and Cygwin, mostly related to Windows path names.

So I went back to trying to install RDFLIB using the Windows Python. I could run python setup.py build -c mingw32 (with gcc-mingw32 installed and Cygwin’s binaries directory on my path), but I still couldn’t run python setup.py install because the ‘install’ command doesn’t accept the ‘-c’ argument.

Eventually, I made it work by creating a cfg file for distutils (distutils.cfg), e.g. /c/Python2x/Lib/distutils/distutils.cfg, containing:

[build]
compiler=mingw32

as described here at the end of the page, under “One Last Step”.

I.e., once I had created the cfg file, and the Cygwin binaries were on my path, I could run python setup.py install using the Windows Python, which also works with PyDev in Eclipse.
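
As a quick sanity check that the install works (for example, run as a script or PyDev unit test), something like the following should build a small in-memory graph. Note that the import path for Graph differs between rdflib 2.x and later versions.

# Sanity check for the rdflib install: build a tiny graph and print its triples.
from rdflib.Graph import Graph   # rdflib 2.x; in later versions: from rdflib import Graph
from rdflib import URIRef, Literal

g = Graph()
g.add((URIRef("http://example.org/thing"),
       URIRef("http://example.org/label"),
       Literal("hello rdflib")))

print("%d triple(s) in the graph" % len(g))
for s, p, o in g:
    print("%s %s %s" % (s, p, o))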

Request for Comments — SKOS Reference — W3C Working Draft 25 January 2008

The W3C Semantic Web Deployment Working Group has announced the publication of the SKOS Reference as a W3C First Public Working Draft:

This is a substantial update to and replacement for the previous SKOS Core Vocabulary Specification W3C Working Draft dated 2 November 2005. The publication has been announced in the W3C news, and a request for comments has been sent to various mailing lists.

The abstract from this new specification:

This document defines the Simple Knowledge Organization System (SKOS), a common data model for sharing and linking knowledge organization systems via the Semantic Web.

Many knowledge organization systems, such as thesauri, taxonomies, classification schemes and subject heading systems, share a similar structure, and are used in similar applications. SKOS captures much of this similarity and makes it explicit, to enable data and technology sharing across diverse applications.

The SKOS data model provides a standard, low-cost migration path for porting existing knowledge organization systems to the Semantic Web. SKOS also provides a light weight, intuitive language for developing and sharing new knowledge organization systems. It may be used on its own, or in combination with formal knowledge representation languages such as the Web Ontology language (OWL).

This document is the normative specification of the Simple Knowledge Organization System. It is intended for readers who are involved in the design and implementation of information systems, and who already have a good understanding of Semantic Web technology, especially RDF and OWL.

For an informative guide to using SKOS, see the upcoming SKOS Primer.

Synopsis

Using SKOS, conceptual resources can be identified using URIs, labeled with lexical strings in one or more natural languages, documented with various types of note, linked to each other and organized into informal hierarchies and association networks, aggregated into concept schemes, and mapped to conceptual resources in other schemes. In addition, labels can be related to each other, and conceptual resources can be grouped into labeled and/or ordered collections.
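
To make the synopsis concrete, here is a minimal sketch in Python using rdflib that builds a single concept with a few of these features: a URI identifier, preferred and alternative labels in two languages, a documentation note, a broader link and membership of a concept scheme. The concept and scheme URIs are placeholders; the SKOS namespace is the real one.

# Minimal sketch: a SKOS concept with labels, a note, a hierarchy link and a scheme.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF

SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")
EX = Namespace("http://example.org/scheme/")   # placeholder concept scheme namespace

g = Graph()
animals = EX.animals   # a concept scheme (placeholder URI)
mammal = EX.mammal     # a broader concept (placeholder URI)
cat = EX.cat           # the concept being described (placeholder URI)

g.add((cat, RDF.type, SKOS.Concept))
g.add((cat, SKOS.prefLabel, Literal("cat", lang="en")))
g.add((cat, SKOS.prefLabel, Literal("chat", lang="fr")))
g.add((cat, SKOS.altLabel, Literal("domestic cat", lang="en")))
g.add((cat, SKOS.scopeNote, Literal("Felis catus only.", lang="en")))
g.add((cat, SKOS.broader, mammal))
g.add((cat, SKOS.inScheme, animals))

print(g.serialize(format="turtle"))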

On the OAIS Information Model as a Platform-Independent Model (PIM) in a Model-Driven Software Architecture

Abstract

This short paper summarises some work done on the possibility of using the OAIS information model as a basis for the model-driven design and implementation of components within a digital preservation software architecture. Two model transformations were defined using the Enterprise Architect template language. The first model transformation transforms a platform-independent UML class model (PIM) into a set of UML interfaces specific to the Java 1.5 platform (here called a Java API model). The second model transformation transforms a platform-independent UML class model (PIM) into a set of UML classes specific to the Java 1.5 platform, implementing the interfaces generated by the first model transformation (here called a Java implementation model). Both were applied to the OAIS information model as PIM, and the generated models are presented here with discussion.


Using PicaJet and Flickr to Manage Photos on the Desktop and Online

I’ve been looking around for something to help me manage my burgeoning photo collection. I’ve got a Sony Ericsson K800 and a Nikon D40, and between the two of them I’m generating quite a few images. Adobe Photoshop Album Starter Edition came with my mobile phone software, so I tried that to start with. The tagging interface worked well for me — a quick once through tagging with who, where, when and occasionally what is all I ever have time for, and is usually enough to allow me to find an image again. However, the two things that bugged me about Photoshop Album were (1) that there was no integration with Flickr, so if I uploaded photos I’d have to retag them completely, and (2) I couldn’t export my photo catalog or move it between computers easily.

After a not too exhaustive search on the Web, I found PicaJet, and downloaded the free edition. I was encouraged because the tagging interface is great (very similar to Photoshop Album), and because PicaJet has an integrated Flickr uploader which preserves all of your tagging. I also discovered that the photo catalog can be easily exported, so in a nutshell, PicaJet ticks my boxes. You can do quite a lot with the free edition — tag photos, upload to flickr, some basic editing. I’ll be upgrading to PicaJet FX (the full version, around £30) mainly because I want to be able to do more with the tag categories — in the free version you can only have a two-level hierarchy, and you can’t add new top-level categories.

I tried Picasa2, but that doesn’t have any tagging support or Flickr integration.

I also downloaded Microsoft Photo Gallery, which advertises Flickr integration. The installation process was painfully slow, then the application crashed when I tried to launch it on my bog-standard Windows XP machine.

SKOS and RDFa in e-Learning

The W3C’s Semantic Web Deployment Working Group is developing two new technologies which may be relevant to e-learning: the Simple Knowledge Organization System (SKOS) and RDFa.

SKOS is a lightweight language for representing intuitive, semi-formal conceptual structures. So, for example, the figure below (taken from the SKOS Core Guide) depicts concepts with intuitive hierarchical and associative relationships to other concepts, and with preferred and alternative labels in one (or more) languages — these are the kinds of structures that can be expressed using SKOS. Once expressed in this form, conceptual structures can easily be published on the Web, shared between applications, linked/mapped to other conceptual structures and so on. Typically, these conceptual structures are used as tools for navigating around complex or unfamiliar subject areas, for retrieving information across languages, and for bringing together related information from different sources.

RDFa is a language for embedding richly structured data and metadata within Web pages. This allows a Web page to expose much of its underlying meaning to applications, enabling a range of new functionality within Web clients and the exchange of data between Web sites, services, and users’ desktop applications. For example, a Web page about a new music album can use RDFa to embed structured data expressing facts about that album, such as the track listing, artist, and links to sample media files. A Web browser with a suitable plugin or extension can use this data to offer new functions to the user, such as downloading the track listing with available samples to the user’s music library, or comparing prices from online vendors.

Both of these technologies are on the W3C Recommendation track, and are scheduled for completion in April 2008.
