Using SPARQL for Biological Data Integration – Reflections on openflydata.org and the FlyWeb Project

by Alistair Miles

It’s now almost 18 months since the end of the FlyWeb project and the development of the proof-of-concept site openflydata.org, so I thought it was high time to write up a few reflections. Thanks to Dr David Shotton, head of the Image Bioinformatics Research Group, for giving me the chance to work on FlyWeb; it was a great project.

If you want to know more about the technical side of the work, see the paper “OpenFlyData: An exemplar data web integrating gene expression data on the fruit fly Drosophila melanogaster” in the Journal of Biomedical Informatics.

Integrating Gene Expression Data

We wanted to help reduce the amount of time spent by a Drosophila functional genetics research group on experimental design and on validating experimental results. Experimental design includes selecting genes that might be relevant to the biological function of interest (in this case, male fertility). Validating results includes checking your gene expression data against other published data for the same gene – a discrepancy suggests an artifact or problem in your data, or in the published data … either way it’s likely to be important.

The problem here is a common refrain – the relevant data are not all found in the same place. Trawling and querying multiple sites and manually compiling the results takes a lot of time. Could we build some tools that help bring these data together?

Technology Hypothesis – Data Webs, Semantic Web & SPARQL

We wanted to be as user-driven as possible, i.e., to stay focused on what the researchers needed, and to be open-minded about technology, using whatever tools made us most productive.

But we did have a technology hypothesis, which was part of the reason why JISC funded the FlyWeb project. Our hypothesis was that building a data integration solution for our Drosophila researchers using Semantic Web standards and open source software would be (a) feasible, and (b) reasonably cost-efficient. David Shotton and Graham Klyne, the original proposal authors, had also previously developed a vision for “data webs”, an architectural pattern for integrating a set of biological data within a specific domain or for a specific purpose. Would the data webs pattern help us to build a solution?

So how did we marry these two forces: on the one hand being user-driven, on the other having a technology hypothesis that we wanted to test?

Well, in the spirit of agile, we tried to make our development iterations as short as possible. I.e., we tried to work in a way that meant we had something to put in front of users in the shortest possible time frame. When we were discussing architectural patterns and technologies, and several alternatives looked to be of similar complexity or difficulty, we favoured approaches involving RDF, OWL or SPARQL, and that were closer to the original data webs vision.

However, our goal was not to prove that a solution based on semweb standards and tech was any cheaper or better than a non-semweb alternative; just that it was possible and not prohibitively expensive. This is interesting because, if data integration solutions in related problem domains were all based on semweb standards, then they also might play together, as well as solving their own particular problems … or so the argument goes. I.e., there would be some re-use or re-purposing benefit from each individual data integration solution, and maybe some network effect, if everyone used SPARQL, for example. Of course there would be work involved in linking two data web solutions, but it might be less because at least we’d bottom out at the same standards – RDF, SPARQL, and maybe even some shared ontologies.

But you can’t even begin to talk about network effects if you can’t first show that you can solve specific problems effectively and cheaply. I.e., solutions need to make sense locally.

SPARQL Mashups

An architectural pattern that we adopted early in the project was the “SPARQL mashups” pattern. A SPARQL mashup is an HTML+JavaScript application that runs entirely in the browser, and that retrieves data directly from two or more SPARQL endpoints via the SPARQL protocol.

To see a SPARQL mashup in action, go to http://openflydata.org/search/gene-expression, click the “show logger” link at the bottom of the page (or open Firebug if you’re using Firefox), then type “schuy” into the search box. You should see SPARQL queries being sent to various SPARQL endpoints, and result sets being returned in the SPARQL JSON results format.

For example, here’s the query that finds genes from FlyBase matching the query term “aly”:

# Select feature short name, unique name, annotation ID, and official full name, given 
# any label and where feature is D. melanogaster gene.
#
# NB: the chado, so and syntype namespace IRIs below (and the D. melanogaster organism
# IRI further down) are reconstructions and may differ from the exact IRIs used by the
# openflydata.org vocabularies; the xsd and skos IRIs are the standard ones.

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX chado: <http://purl.org/net/chado/schema/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX so: <http://purl.org/obo/owl/SO#>
PREFIX syntype: <http://openflydata.org/id/flybase/vocab/synonym_type/>

SELECT DISTINCT ?uniquename ?name ?accession ?fullname WHERE {

  ?feature skos:altLabel "aly" ; 
    a so:SO_0000704 ;
    chado:organism <http://openflydata.org/id/flybase/organism/Drosophila_melanogaster> ;
    chado:uniquename ?uniquename ;
    chado:name ?name ; 
    chado:feature_dbxref [ 
      chado:accession ?accession ; 
      chado:db <http://openflydata.org/id/flybase/db/FlyBase_Annotation_IDs>
    ] .

  OPTIONAL {
    ?fs 
      chado:feature ?feature ; 
      chado:is_current "true"^^xsd:boolean ;
      chado:synonym [ 
        a syntype:FullName ;
        chado:name ?fullname ; 
      ] ;
      a chado:Feature_Synonym .
  }

}

Each of the panels in the UI corresponds (more-or-less) to a data source. The search term is first used in a SPARQL query to the FlyBase endpoint to find matching genes. If only a single gene matches the query, that gene is automatically selected, and further SPARQL queries are then sent to other data sources (e.g., FlyAtlas, BDGP, Fly-TED) to retrieve gene expression data relevant to that gene. If more than one gene matches the query (e.g., try “aly”), the user has to select a gene before the next set of queries is dispatched.
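
To make the protocol mechanics concrete, here is a minimal sketch of the round trip each panel makes, written in Python rather than the browser JavaScript we actually used: a query goes to the endpoint as an HTTP GET with a URL-encoded query parameter, and the bindings come back as JSON. The endpoint URL and query are illustrative only, not the exact ones used by the application.

import json
import urllib.parse
import urllib.request

ENDPOINT = "http://openflydata.org/query/flybase"  # illustrative endpoint URL

QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?feature WHERE {
  ?feature skos:altLabel "aly" .
}
LIMIT 10
"""

def sparql_select(endpoint, query):
    """Run a SELECT query over the SPARQL protocol and yield result rows as dicts."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(request) as response:
        results = json.load(response)
    for binding in results["results"]["bindings"]:
        # each binding maps variable names to {"type": ..., "value": ...} objects
        yield {var: node["value"] for var, node in binding.items()}

if __name__ == "__main__":
    for row in sparql_select(ENDPOINT, QUERY):
        print(row)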

Why did we use the SPARQL mashup pattern?

Well, it allowed us to use some off-the-shelf open source software. All we had to do was code a transformation from the data in its published format to RDF. Once we had an RDF dump for each data source, we loaded the data into a triple store (we used Jena TDB) then deployed the store as a SPARQL endpoint via a SPARQL protocol server (we used Joseki initially, then SPARQLite).

Once we had a SPARQL endpoint for each data source, we could develop a simple HTML+JavaScript application in the spirit of a conventional mashup, using the SPARQL protocol as the API to the data.

A nice feature of using SPARQL here is that you don’t have to think about the API to the data, at least not from a web service point of view. The SPARQL protocol and query language basically give you an API for free. All you have to figure out is what query you need to get the right data. And you don’t need to write any code on the server side, other than that required to transform your data to RDF.

Also, because your API supports a query language (SPARQL), you don’t need to know up-front exactly what data you need or what questions you’re going to ask (although obviously it helps to have a rough idea). I.e., if you get half-way through coding your mashup and realise you need to query the data in a different way, or retrieve more or less data, you just tweak the SPARQL query you’re sending. I.e., there are no consequences for your server-side code, your API can already handle it.

This also means your API can handle unanticipated use cases. I.e., if someone else wants to query the data for a completely different purpose, chances are they can already do it – the expressiveness of SPARQL means that the chances others will be able to use your data are high. Although this wasn’t a motivation in our project, we liked the idea.

Dealing With Big(-ish) RDF Data

As we scaled up from initial prototypes, we hit a few snags. The biggest challenge was dealing with the FlyBase data, which amounted to about 180 million triples in our final dump. Also, queries had to be quick, because users of the mashup apps wait for SPARQL queries to evaluate in real time. Here are a few tricks we found for working with RDF data at this scale:

  • Fast data loading – We found we could get between 15,000 and 30,000 triples per second into Jena TDB on a 64-bit platform, which meant the FlyBase dataset loaded in somewhere between 1.5 and 3 hours. To load the data, we fired up a large EC2 instance and loaded the data onto an EBS volume. When the load was done, we detached the volume, attached it to a small instance which hosted the query endpoint, and shut down the large instance to keep running costs down. We didn’t try this, but striping your data across multiple EBS volumes in a RAID 0 array might increase load performance even further (there’s a nice article by Eric Hammond on using RAID 0 on EC2).
  • Everything has to be streaming – The transformation from source format (e.g., relational database) to RDF has to be streaming. The SPARQL query engine has to be streaming. And the SPARQL protocol implementation has to be streaming. That’s part of why we ended up rolling our own SPARQL protocol implementation (SPARQLite) – Joseki at the time did not write result sets in a streaming fashion, for valid reasons, but this limits scalability.
  • Pre-compute data for query performance – To get good query performance we pre-calculated some data. E.g., when we wanted to do a case-insensitive match against RDF literals in a query pattern, we computed the lower-case version of the literal and added it to the data as extra triples, then wrote queries with literals in lower case too – rather than, say, using a regex filter (there’s a sketch of this trick just after this list). SPARQL queries go much faster when they have a concrete literal or URI node to work from early in the query; queries with loose patterns and FILTERs can be very slow, because you’re pushing a lot of triples through the filters. We also tried the SPARQL–Lucene integration (LARQ) for text matches, but couldn’t get it quite fast enough (sub-3 s) for the FlyBase gene name queries, although it was used heavily in some other projects (CLAROS and MILARQ). You can also make queries go faster by shortening query paths. E.g., if you have a pattern you want to query like { ?x :p ?y . ?y :q "foo" . }, your query may go faster if you first invent a new predicate :r and compute some new triples via a rule or a query like CONSTRUCT { ?x :r ?z } WHERE { ?x :p ?y . ?y :q ?z . }, then add these triples to your dataset and query using the pattern { ?x :r "foo" } instead.
  • Query formulation matters – Beware that how you write your query can make a difference. Depending on which optimiser you use, TDB will do some re-ordering of the query to make it go faster (I believe putting more selective parts earlier), but if you know your data well (statistics are helpful) then writing the query with this in mind can help the query engine. E.g., if you have a triple pattern with a specific predicate and a specific subject or object that you know should only have a few matches, put this right at the top of the query. Basically, put the most discriminating parts of the query as early as possible. This also means that triple patterns with rdf:type are often not that helpful early on, because they don’t narrow down the results much, although this is what you tend to put first for readability.
  • Test-driven data – When you generate a large RDF dataset, you need to be sure you got the transformation right and that the data are as you expect, otherwise you can waste a lot of time. I.e., you need to be able to test your triples. We designed some simple test harnesses for our data, where a set of test SPARQL queries was run against the data. Each SPARQL query was an ASK or SELECT, and the test case defined an expectation for each query result (a minimal harness along these lines is sketched after this list). For very large datasets, you may also want to code some sanity checks on the n-triples dump before trying to load it into a triplestore and test with SPARQL, e.g., scanning with grep and/or awk to find triples with the predicates you expect to be there.
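
As promised above, here is a minimal sketch of the lower-casing trick, assuming labels are published via skos:altLabel (as in the FlyBase query earlier) and using a made-up predicate for the lower-cased copies. It streams an n-triples dump line by line, so it also illustrates the “everything has to be streaming” point.

import re
import sys

LABEL_PREDICATE = "<http://www.w3.org/2004/02/skos/core#altLabel>"
LOWERCASE_PREDICATE = "<http://example.org/vocab/altLabelLowerCase>"  # hypothetical

# crude pattern for a triple with a plain literal object, e.g.
#   <subject> <predicate> "object" .
# (literals with language tags or datatypes are simply passed through untouched)
TRIPLE = re.compile(r'^(\S+)\s+(\S+)\s+"(.*)"\s*\.\s*$')

def add_lowercase_labels(lines):
    """Yield the original lines, plus an extra lower-cased triple for each label."""
    for line in lines:
        yield line
        match = TRIPLE.match(line)
        if match and match.group(2) == LABEL_PREDICATE:
            subject, _, literal = match.groups()
            yield '%s %s "%s" .\n' % (subject, LOWERCASE_PREDICATE, literal.lower())

if __name__ == "__main__":
    # stream stdin to stdout so memory use stays flat however big the dump is
    sys.stdout.writelines(add_lowercase_labels(sys.stdin))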
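
And here is a minimal sketch of the test harness idea from the last bullet, again assuming an illustrative endpoint URL and made-up test queries; the real harnesses also ran SELECT queries with expected results, as described above.

import json
import urllib.parse
import urllib.request

ENDPOINT = "http://openflydata.org/query/flybase"  # illustrative endpoint URL

# each test is (description, ASK query, expected answer)
TESTS = [
    ("at least one feature has the label 'aly'",
     'PREFIX skos: <http://www.w3.org/2004/02/skos/core#> '
     'ASK { ?feature skos:altLabel "aly" }',
     True),
    ("no feature has an empty unique name",
     'PREFIX chado: <http://example.org/chado/> '  # hypothetical namespace
     'ASK { ?feature chado:uniquename "" }',
     False),
]

def sparql_ask(endpoint, query):
    """Run an ASK query over the SPARQL protocol and return its boolean answer."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)["boolean"]

if __name__ == "__main__":
    failures = 0
    for description, query, expected in TESTS:
        actual = sparql_ask(ENDPOINT, query)
        failures += actual != expected
        print("%s: %s" % ("ok" if actual == expected else "FAIL", description))
    raise SystemExit(1 if failures else 0)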

Open SPARQL Endpoints – Mitigating Denial of Service

Above I mildly extolled the virtues of SPARQL as an API to data – anyone can write the query they need to extract the data they want, and you don’t need to anticipate all requirements a priori.

The obvious downside to the expressiveness of SPARQL and the openness of SPARQL endpoints is that they are vulnerable to accidental or intentional denial of service. I.e., someone can write a hard query and tie up your query engine’s compute and/or memory resources, if not crash your box.

Although deploying a production service or guaranteeing service levels wasn’t part of our remit, we were concerned that unless we could mitigate this vulnerability, SPARQL outside the firewall would never really be useful beyond a proof-of-concept. I.e., we would never be able to advertise our endpoints as a production web service, so that others could write mashups or other applications that query the data and depend on the service.

We spent a bit of time working on this, and it may be a solved problem in newer query engines, but at the time our approach was to place some limits on the queries that open endpoints would accept. For example, SPARQLite endpoints could be configured to disallow queries with variable predicates in their triple patterns, or with FILTER or OPTIONAL clauses, or to enforce a LIMIT on all queries’ result sets. This is not a complete solution, because you could still write hard queries, but at least it removed some of the obvious attacks. A better solution would probably involve monitoring queries’ resource usage and killing any that take too long or consume too many resources – a bit like how Amazon’s SimpleDB places limits on service usage, including a 5 second maximum query execution time.
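
For illustration only, here is a naive, string-level sketch of that kind of gatekeeping. SPARQLite itself worked on the parsed query (which is also what you would need in order to spot variable predicates), so treat this as a cartoon of the idea rather than a description of how it was implemented; it also assumes SELECT queries, where appending a LIMIT makes sense.

import re

MAX_LIMIT = 1000
FORBIDDEN_KEYWORDS = ("OPTIONAL", "FILTER")

def restrict_query(query):
    """Return a restricted form of the query, or raise ValueError if it is refused."""
    for keyword in FORBIDDEN_KEYWORDS:
        # crude keyword check; can be fooled by e.g. the word FILTER inside a literal
        if re.search(r"\b%s\b" % keyword, query, re.IGNORECASE):
            raise ValueError("queries using %s are not accepted" % keyword)
    match = re.search(r"\bLIMIT\s+(\d+)\s*$", query, re.IGNORECASE)
    if match and int(match.group(1)) <= MAX_LIMIT:
        return query
    if match:
        # clamp an over-large LIMIT
        return query[:match.start()] + "LIMIT %d" % MAX_LIMIT
    # no LIMIT at all: append one
    return query.rstrip() + "\nLIMIT %d" % MAX_LIMIT

if __name__ == "__main__":
    print(restrict_query("SELECT ?s WHERE { ?s a ?type }"))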

Mapping to RDF

The elephant in the room here is mapping the data to RDF, and that’s where a lot of the work went. All of our data sources came in some non-RDF format, either as CSV files or a relational database. For the CSV sources we hand-coded RDF transformations as Python scripts. For the relational databases, we made heavy use of D2RQ, although we did not use D2R Server to translate SPARQL queries to SQL on-the-fly, due to performance and scalability issues; instead we used the D2RQ dump utility to generate a complete RDF dump of each SQL data source in n-triples format, then loaded that into a Jena TDB triplestore, which backed our SPARQL endpoints.
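
A hand-coded CSV transformation of this kind is simple in spirit: read rows, mint a URI for each record, write out n-triples. Here is a minimal sketch of that shape, with a made-up base URI and vocabulary; the real scripts and vocabularies were specific to each data source.

import csv
import sys

BASE = "http://example.org/id/expression/"      # hypothetical URI scheme
VOCAB = "http://example.org/vocab/expression/"  # hypothetical vocabulary

def escape_literal(value):
    """Escape a string for use as an n-triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n"))

def csv_to_ntriples(rows, out):
    """Write one resource per CSV row, with one triple per (non-empty) column."""
    # assumes simple, IRI-safe column names in the CSV header
    for i, row in enumerate(rows):
        subject = "<%s%d>" % (BASE, i)
        for column, value in row.items():
            if value:
                out.write('%s <%s%s> "%s" .\n'
                          % (subject, VOCAB, column, escape_literal(value)))

if __name__ == "__main__":
    # usage: python csv2nt.py source.csv > source.nt
    with open(sys.argv[1], newline="") as f:
        csv_to_ntriples(csv.DictReader(f), sys.stdout)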

The main issue was the time it takes to design a mapping from a fairly complex relational schema like Chado to RDF. Rather than trying to find one or more existing, published ontologies to use in the RDF output of the mapping and designing the mappings by hand, we tried a different approach. Inspired by model-driven engineering, we developed a Python utility which, driven by some simple annotations on the source SQL schema definition, generated both a suitable OWL ontology and a complete D2RQ mapping file. This worked well with a schema like Chado which has consistent structural patterns and naming conventions. There’s a worked example in the supplementary information (S4) to the OpenFlyData paper in the Journal of Biomedical Informatics.
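
To give a flavour of the idea (not of our actual utility, which was driven by annotations on the Chado SQL schema and also dealt with keys, joins and foreign-key relationships), here is a toy sketch that, given a table name and its columns, prints skeleton OWL declarations and a skeleton D2RQ mapping with one ClassMap and one PropertyBridge per column. The namespaces and the example table are made up.

ONT = "http://example.org/ontology/chado/"  # hypothetical generated-ontology namespace

D2RQ_PREAMBLE = """@prefix d2rq: <http://www.wiwiss.fu-berlin.de/suhl/bizer/D2RQ/0.1#> .
@prefix map: <#> .
@prefix ont: <%s> .
# (a map:database resource of type d2rq:Database, holding the JDBC connection
#  details, is assumed to be declared elsewhere in the mapping file)
""" % ONT

def class_name(table):
    return table.capitalize()

def generate_owl(table, columns):
    # prefix declarations for owl:, rdfs: and ont: omitted for brevity
    lines = ["ont:%s a owl:Class ." % class_name(table)]
    for column in columns:
        lines.append("ont:%s a owl:DatatypeProperty ; rdfs:domain ont:%s ."
                     % (column, class_name(table)))
    return "\n".join(lines)

def generate_d2rq(table, columns, key):
    lines = [D2RQ_PREAMBLE,
             "map:%s a d2rq:ClassMap ;" % table,
             "    d2rq:dataStorage map:database ;",
             '    d2rq:uriPattern "%s/@@%s.%s@@" ;' % (table, table, key),
             "    d2rq:class ont:%s ." % class_name(table)]
    for column in columns:
        lines.append("map:%s_%s a d2rq:PropertyBridge ;" % (table, column))
        lines.append("    d2rq:belongsToClassMap map:%s ;" % table)
        lines.append("    d2rq:property ont:%s ;" % column)
        lines.append('    d2rq:column "%s.%s" .' % (table, column))
    return "\n".join(lines)

if __name__ == "__main__":
    # a made-up, Chado-like table for illustration
    print(generate_owl("feature", ["uniquename", "name"]))
    print()
    print(generate_d2rq("feature", ["uniquename", "name"], key="feature_id"))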

The problem with this approach is, of course, that you end up with one schema/ontology per data source. Initially we thought this would force us to do some ontology alignment and to map everything to a common ontology, but we quickly realised this just wasn’t necessary. The mashup applications quite happily query each source according to its own ontology, and have just enough knowledge of what each ontology means to integrate the results in a sensible way. I.e., you can develop applications that work with multiple data sources without perfect (or even partial) ontology alignment. Obviously, aligning ontologies is desirable, but that can be a long-term ambition – using ontologies derived from the source data at least gets you started, and gets you talking about data semantics rather than getting bogged down by differences in syntax, formats or protocol (because RDF and SPARQL are the interlingua for these).

Lasting Impressions

The message I took away from this project is that, if you already have some data, and you want to make the data available to web application developers and other hackers in a useful way, then SPARQL can be a good option. It’s fairly straightforward (even, dare I say, fun) to code simple HTML+JavaScript mashups that bring data from different SPARQL endpoints together on-the-fly (pardon the pun). SPARQL won’t be a panacea: you may find some queries just aren’t quick enough to evaluate in real time, and you may have to find ways to optimise them when moving to production. But it’s worth doing some benchmarking, as triplestores like Jena TDB are quick for certain types of query.

The pain comes when you need to convert data to RDF. But you don’t need to get hung up on finding the right ontologies or designing a perfect or even complete mapping. Convert what you need, using a custom ontology that is designed for your application or generated from the source data, and just get going – you’ll have plenty of iterations to refactor the data.

Would I use SPARQL again? Yes, for read-only data services and data integration webapps, I’d definitely consider it. And there are some new features coming in SPARQL 1.1 which look very useful. If someone solves the denial-of-service problem for open SPARQL endpoints (and they may already have) then the case for SPARQL as a data-sharing standard is compelling. Certainly an area to watch.
