Jun Zhao, Alistair Miles and Graham Klyne
Image Bioinformatics Research Group, Department of Zoology, University of Oxford
FlyWeb: the way to go for biological data integration

FlyWeb Application
- To answer questions about “what does this gene do?”
  - Gene expression images
  - Sequence and ESTs (expressed sequence tags) of the gene
  - Publications about the gene
  - ...
- A first example of the Image Web that our group is developing
- Investigate the feasibility of existing Semantic Web tools and technologies for real applications

Gene expression images
- Reveal gene expression patterns at different developmental stages
- Important for identifying genes of interest and for verifying a picture of probable gene functions

FlyWeb demonstration
- Run the application: http://openflydata.org/flyui/build/apps/imagemashup2/
- Two examples:
  - Single gene query (aos1)
  - Use gene synonyms to enhance gene matching (rbf)

More than one synonym for gene “rbf”

How does it work?
- Data from 3 independent sources:
  - www.flybase.org: model organism reference database, gene names and identifiers
  - www.fruitfly.org (BDGP): embryo in situ images
  - www.fly-ted.org: testis in situ images
- All data accessed via SPARQL
- Pure Ajax user application
- Essentially, a mashup using a SPARQL API

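As a rough illustration of what "accessed via SPARQL" involves for a client, the sketch below encodes a query as a SPARQL protocol GET URL and flattens a response in the standard SPARQL JSON results format. It is written in Python rather than the browser JavaScript the application actually uses, the function names are made up for this example, and the sample response is invented; no request is sent.

```python
import json
from urllib.parse import urlencode

# Endpoint URL as given in the slides; nothing here assumes it is still live.
FLYTED_ENDPOINT = "http://openflydata.org/query/flyted"


def build_query_url(endpoint: str, sparql: str) -> str:
    """Encode a query as a SPARQL protocol GET request URL."""
    return endpoint + "?" + urlencode({"query": sparql})


def rows_from_results(results_json: str) -> list[dict[str, str]]:
    """Flatten a SPARQL JSON results document into plain dicts of values."""
    doc = json.loads(results_json)
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in doc["results"]["bindings"]
    ]


# Invented response body, shaped like the SPARQL JSON results format.
SAMPLE_RESPONSE = json.dumps({
    "head": {"vars": ["thumbnailURL", "caption"]},
    "results": {"bindings": [
        {"thumbnailURL": {"type": "uri", "value": "http://example.org/t1.jpg"},
         "caption": {"type": "literal", "value": "example caption"}},
    ]},
})

url = build_query_url(FLYTED_ENDPOINT, "SELECT * WHERE { ?s ?p ?o } LIMIT 1")
rows = rows_from_results(SAMPLE_RESPONSE)
```

Each data source would get its own URL of this kind, and the mashup is then a matter of combining the flattened rows on the client.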
The client side
- FlyUI: a library of JavaScript widgets as front ends to SPARQL data sources
  - Built on the Yahoo User Interface (YUI) library
  - Widgets are composed in a browser to create the complete application
- Each widget provides:
  - A Service that implements SPARQL queries
  - A Model encapsulating SPARQL query results
  - A Renderer
- The in situ search application
  - GeneFinder widget
  - FlyTED Image widget
  - BDGP Image widget

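The Service / Model / Renderer split can be sketched as follows. This is not FlyUI code: FlyUI is JavaScript built on YUI, while this is a Python analogue with invented class names, a stubbed fetch function in place of a real endpoint call, and a query string that omits its PREFIX declarations.

```python
from typing import Callable

Row = dict[str, str]


class ImageService:
    """Service: builds the SPARQL query and delegates execution."""

    def __init__(self, fetch: Callable[[str], list[Row]]) -> None:
        self._fetch = fetch  # injected so the sketch needs no network

    def find_images(self, gene_uri: str) -> list[Row]:
        query = (
            "SELECT DISTINCT * WHERE { ?fullImageURL "
            f"flyted:associatesToGene <{gene_uri}> ; "
            "flyted:thumbnail ?thumbnailURL ; rdfs:label ?caption }"
        )
        return self._fetch(query)


class ImageModel:
    """Model: holds query results and notifies subscribers on change."""

    def __init__(self) -> None:
        self.rows: list[Row] = []
        self._listeners: list[Callable[[list[Row]], None]] = []

    def subscribe(self, listener: Callable[[list[Row]], None]) -> None:
        self._listeners.append(listener)

    def set_rows(self, rows: list[Row]) -> None:
        self.rows = rows
        for listener in self._listeners:
            listener(rows)


class ImageRenderer:
    """Renderer: turns model rows into display lines (stand-in for DOM updates)."""

    def __init__(self, model: ImageModel) -> None:
        self.lines: list[str] = []
        model.subscribe(self.render)

    def render(self, rows: list[Row]) -> None:
        self.lines = [f"{row['caption']}: {row['thumbnailURL']}" for row in rows]


def fake_fetch(query: str) -> list[Row]:
    # Stub standing in for a call to a SPARQL endpoint.
    return [{"caption": "example caption", "thumbnailURL": "http://example.org/t1.jpg"}]


service = ImageService(fake_fetch)
model = ImageModel()
renderer = ImageRenderer(model)
model.set_rows(service.find_images("http://openflydata.org/id/flyted/gene-geneName"))
```

Keeping the query logic in the Service means a widget can be pointed at a different endpoint, or tested against a stub as here, without touching the rendering code.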
Gene name mapping
- FlyTED and BDGP use different gene names
  - FlyTED data are derived from spreadsheets with an imperfectly controlled gene name vocabulary
  - BDGP's data are annotated using FlyBase's unique FBgn numbers
- Use FlyBase for automatic gene mapping
- Additional input from scientists to disambiguate many-to-many mappings
- Mappings are stored as a JSON file to assist the “GeneFinder” widget (there is no use for RDF/OWL reasoning at this stage)

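A lookup over such a JSON mapping might look like the sketch below. The layout of the real FlyWeb mapping file is not shown in the slides, so the structure here (name to list of identifiers) and the identifiers themselves are placeholders chosen for illustration.

```python
import json

# Placeholder mapping: the names and FBgn-style identifiers are invented.
MAPPING_JSON = """
{
  "name-a": ["FBgn0000001"],
  "Name-B": ["FBgn0000002", "FBgn0000003"]
}
"""


def load_mapping(text: str) -> dict[str, list[str]]:
    """Parse the mapping and normalise the names to lower case."""
    raw = json.loads(text)
    return {name.lower(): ids for name, ids in raw.items()}


def resolve(mapping: dict[str, list[str]], name: str) -> list[str]:
    """Return every identifier mapped to a name.

    An empty list means the name is unknown; more than one entry means
    the name is ambiguous and needs a human decision.
    """
    return mapping.get(name.strip().lower(), [])


mapping = load_mapping(MAPPING_JSON)
unique = resolve(mapping, "name-a")
ambiguous = resolve(mapping, " NAME-B ")
missing = resolve(mapping, "name-c")
```

Returning a list rather than a single identifier keeps the many-to-many cases visible to the caller instead of silently picking one.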
SPARQL queries
- Free-text matching
- Case-insensitive searching
  - Very important for our users
  - Too expensive using a SPARQL FILTER
  - Instead, pre-generate lower-case gene names and load them into the FlyBase RDF database

FlyBase gene query:

    SELECT * WHERE {
      ?gene fbutil:anyName "userInput"^^xs:string ;
            a chado:Feature ;
            chado:name ?symbol ;
            chado:uniquename ?flybaseID .
      OPTIONAL { ?gene chado:dbxref [ chado:accession ?annotationSymbol ] . }
      OPTIONAL { ?gene chado:synonym [ chado:name ?synonym ] . }
      OPTIONAL { ?gene chado:synonym [ a syntype:FullName ; chado:name ?fullName ] . }
    }

FlyTED image query (geneName is a placeholder for the gene being looked up):

    SELECT DISTINCT * WHERE {
      ?fullImageURL flyted:associatesToGene <http://openflydata.org/id/flyted/gene-geneName> ;
                    flyted:associatesToGene ?gene ;
                    flyted:thumbnail ?thumbnailURL ;
                    rdfs:seeAlso ?flytedURL ;
                    rdfs:label ?caption
    }

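The lower-casing approach described above can be sketched in two halves: generate an extra lower-case name triple for each gene at load time, and lower-case the user's input before it goes into the exact-match query. The helper names and the example gene URI are invented; the `fbutil:anyName` property and `xs:string` datatype follow the query shown in the slides, and prefix declarations are assumed to exist elsewhere.

```python
def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for use inside a quoted literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def any_name_triples(gene_uri: str, names: list[str]) -> list[str]:
    """Emit one lower-case fbutil:anyName triple per distinct name."""
    lowered = sorted({name.lower() for name in names})
    return [
        f'<{gene_uri}> fbutil:anyName "{escape_literal(name)}"^^xs:string .'
        for name in lowered
    ]


def gene_query(user_input: str) -> str:
    """Build an exact-match query on the lower-cased user input (no FILTER)."""
    needle = escape_literal(user_input.strip().lower())
    return (
        'SELECT * WHERE { ?gene fbutil:anyName "' + needle + '"^^xs:string ; '
        "a chado:Feature ; chado:name ?symbol ; chado:uniquename ?flybaseID . }"
    )


# The gene URI is a placeholder; the name variants collapse to one triple.
triples = any_name_triples("http://example.org/gene/1", ["Rbf", "RBF", "rbf"])
query = gene_query("  RBF ")
```

Because both sides are normalised the same way, the store only has to do an exact literal match, which is the kind of lookup an index can serve directly, unlike a regex FILTER.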
The RDF data sources
- FlyBase and BDGP: relational databases
- FlyTED: an image repository built using Eprints
- FlyAtlas (forthcoming): tissue-specific Drosophila gene expression levels, as a single spreadsheet

Creating RDF from data sources
- D2RQ mapping
  - FlyBase and BDGP, native relational databases
  - Conservative mapping, with minimum interpretation
- OAI2SPARQL
  - Harvesting N3 RDF metadata via the OAI-PMH protocol, which Eprints supports out of the box
  - Further details in the ESWC 2008 paper
- Custom Python program
  - FlyAtlas
  - Generating N3 from a spreadsheet table

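For the spreadsheet route, a conversion program could be as small as the sketch below. This is not the FlyWeb converter: the column names, the `ex:` namespace and the property names are all invented for illustration, and a CSV export of the spreadsheet is assumed as input.

```python
import csv
import io

# Invented namespace; the real FlyAtlas vocabulary is not given in the slides.
PREFIX = "@prefix ex: <http://example.org/flyatlas/> ."


def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for an N3 string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def rows_to_n3(csv_text: str) -> str:
    """Turn each spreadsheet row into one N3 statement group."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = [PREFIX]
    for row in reader:
        subject = f"ex:probe-{row['probe_id']}"
        gene = escape_literal(row["gene"])
        tissue = escape_literal(row["tissue"])
        level = float(row["level"])  # fails loudly on non-numeric cells
        lines.append(
            f'{subject} ex:gene "{gene}" ; ex:tissue "{tissue}" ; ex:level {level} .'
        )
    return "\n".join(lines)


# Invented sample data with hypothetical column headers.
SAMPLE_CSV = "probe_id,gene,tissue,level\n1001,geneA,testis,12.5\n"
n3 = rows_to_n3(SAMPLE_CSV)
```

Like the D2RQ mappings, a converter of this kind stays conservative: one row becomes one group of statements, with no interpretation beyond typing the numeric column.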
More about the data sources
- Bulk download
  - http://openflydata.org/dump/flybase, ~8m triples
  - http://openflydata.org/dump/bdgp, ~1m triples
  - http://openflydata.org/dump/flyted, ~30,000 triples
- SPARQL endpoint
  - http://openflydata.org/query/flybase
  - http://openflydata.org/query/bdgp
  - http://openflydata.org/query/flyted
- Schema
  - http://purl.org/net/chado/schema/
  - http://purl.org/net/flybase/synonym-types/
  - http://purl.org/net/bdgp/schema/

SPARQL server
- Amazon EC2 (Elastic Compute Cloud):
  - To run SPARQL endpoints
  - To host the demo you've just seen
- Jena TDB as triple store
  - For better loading performance: ~6K tps (triples per second) for ~9M triples to Amazon Elastic Block Storage (EBS)
  - For better querying performance
- SPARQLite
  - Home-grown SPARQL protocol implementation
  - More later
- Apache, Tomcat, mod_jk, etc.

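To put the loading figure in perspective, the arithmetic below estimates the wall-clock time for a full load, assuming the quoted rate holds steady for the whole dataset (an assumption; the slides give only the approximate rate and size).

```python
TRIPLES = 9_000_000   # ~9M triples, as quoted
RATE_TPS = 6_000      # ~6K triples per second, as quoted

load_seconds = TRIPLES / RATE_TPS
load_minutes = load_seconds / 60
```

That works out to 1,500 seconds, or about 25 minutes, for a complete reload.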
SPARQLite protocol
- http://sparqlite.googlecode.com
- Also a platform for exploring SPARQL service quality concerns (more later)
- Motivation
  - Enable streaming
  - Create a database connection pool
- Designed for Jena TDB/SDB + Postgres
- Restricted forms of query (SELECT, ASK)
- Restricted query result format (e.g. only JSON)

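One way to restrict an endpoint to SELECT and ASK is to inspect the query form before execution. The sketch below is a deliberately naive check and not SPARQLite's implementation: it skips PREFIX and BASE declarations with a regular expression, does not handle comments in the prologue, and rejects anything it cannot classify. A production service would more likely ask a real SPARQL parser for the query type.

```python
import re
from typing import Optional

ALLOWED_FORMS = {"SELECT", "ASK"}

# Leading whitespace followed by any number of PREFIX / BASE declarations.
_PROLOGUE = re.compile(
    r"\s*(?:(?:PREFIX\s+[^\s:]*:\s*<[^>]*>|BASE\s*<[^>]*>)\s*)*",
    re.IGNORECASE,
)
_FORM = re.compile(r"(SELECT|ASK|CONSTRUCT|DESCRIBE)\b", re.IGNORECASE)


def query_form(sparql: str) -> Optional[str]:
    """Return the query form keyword in upper case, or None if unrecognised."""
    rest = sparql[_PROLOGUE.match(sparql).end():]
    match = _FORM.match(rest)
    return match.group(1).upper() if match else None


def is_allowed(sparql: str) -> bool:
    """Accept only the query forms the endpoint is willing to run."""
    return query_form(sparql) in ALLOWED_FORMS


select_query = (
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
    "SELECT * WHERE { ?s rdfs:label ?o }"
)
construct_query = "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }"
```

Failing closed (unknown forms are refused) matters more here than parsing every legal query, since the point of the restriction is to bound what the service can be asked to do.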
Lessons
- RDF provides a uniform and flexible data model
  - RDF dump is cheaper and quicker
  - Maintaining a separate SPARQL endpoint for each data source makes data updates easier to handle than a data warehouse approach
- RDF facilitates data re-use and re-purposing
- SPARQL raises the point of departure for an application
- Benefits for the future
  - Linking to other data sources
  - Querying genes using the Fly Anatomy ontology
  - Magic of inference

Performance
- Loading: our datasets total ~10 million triples
  - Jena / RDB / Postgres: OK with <1M triples
  - Jena / SDB / Postgres: better, but load performance problems with larger datasets
  - Jena / TDB: much better load performance (~6K triples per second), even on a 32-bit system with Amazon EBS storage (but not so good with local EC2 store)
  - Virtuoso performs reasonably well
- Querying, particularly text matching and case-insensitive search
  - Problems with using the SPARQL regex filter, the only mechanism for case-insensitive search in SPARQL
  - Tried with OpenLink Virtuoso: still ~10 seconds for a case-insensitive search
  - Any suggestions?

Further lessons
- SPARQL results streaming
  - Resolves out-of-memory errors for large datasets
  - Joseki / SDB / Postgres can be made to stream results, but only over a single JDBC connection, which causes performance problems with concurrent requests
  - Therefore, SPARQLite
- The openness of SPARQL
  - SPARQL is an inherently open query language and protocol
  - Open endpoints are vulnerable to simple queries that can overload the service, exposing them to denial-of-service style attacks (whether intended or not)
  - Futures: API key mechanism? Restricted SPARQL profiles?

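The idea behind results streaming can be shown with a generator that writes a SPARQL JSON results document piece by piece, so the server never holds the full result set in memory. This is an illustrative Python sketch with an invented function name, not how SPARQLite or Joseki implement it.

```python
import json
from typing import Iterable, Iterator


def stream_results(variables: list[str], bindings: Iterable[dict]) -> Iterator[str]:
    """Yield a SPARQL JSON results document one chunk at a time.

    `bindings` may be any iterable, such as a database cursor, so only one
    row is held in memory at once.
    """
    yield '{"head": ' + json.dumps({"vars": variables}) + ', "results": {"bindings": ['
    first = True
    for binding in bindings:
        if not first:
            yield ","
        yield json.dumps(binding)
        first = False
    yield "]}}"


def fake_cursor() -> Iterator[dict]:
    # Stand-in for rows arriving lazily from the triple store.
    for name in ("rbf", "aos1"):
        yield {"symbol": {"type": "literal", "value": name}}


chunks = list(stream_results(["symbol"], fake_cursor()))
document = json.loads("".join(chunks))
```

In a servlet or WSGI handler each chunk would be written to the response as it is produced; the chunks are collected into a list here only so the result can be checked.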
Future directions
- Adding new data sources:
  - FlyAtlas: tissue-specific Drosophila gene expression levels
  - More information from FlyBase, e.g. references
- More applications:
  - Find all gene expression images for a gene's neighbours
  - Find all genes related to “blood pressure”
  - ...
- Linked data (dereferenceable, follow-your-nose)
  - We're thinking about this, but our application does not currently need it
- How to control and predict quality of service for open SPARQL endpoints

Acknowledgement
- Alistair Miles, Graham Klyne and David Shotton
- Dr Helen White-Cooper and her research group
- BBSRC, for funding the building of the FlyTED database
- BDGP and FlyBase, for making the data available
- JISC, for funding the FlyWeb project
- The Jena team, esp. Andy Seaborne