The Explicator Project: Integrating Astronomy Data with Semantic Web Tools. Alasdair J G Gray. Information Management Group Seminar, University of Manchester, 12th August 2009
The Explicator Project Duration: July 2007 – September 2009 Team: • Stuart Chalmers (Computing Science, Glasgow), February 2009 – September 2009 • Alasdair J G Gray (Computing Science, Glasgow), July 2007 – January 2009 Investigators: • Norman Gray (Physics and Astronomy, Leicester/Glasgow) • Paul Millar (Physics and Astronomy, Glasgow) • Iadh Ounis (Computing Science, Glasgow) • Graeme Stewart (Physics and Astronomy, Glasgow)
Outline • Motivation: The Virtual Observatory • Semantic Data Discovery • Which data sources potentially contain relevant data? • Semantic Data Integration • Can SPARQL be used to express scientific queries? • Can existing archives be exposed with semantic tools? • Can RDB2RDF tools extract large volumes of data? A.J.G. Gray — IMG Seminar, University of Manchester
Context: Astronomy • Data collected across electromagnetic spectrum • Traditionally analysed within one wavelength • Data collection is • expensive • time consuming • Existing data • large quantities • freely available Image: Wikipedia
Virtual Observatory “facilitate the international coordination and collaboration necessary for the development and deployment of the tools, systems and organizational structures necessary to enable the international utilization of astronomical archives as an integrated and interoperating virtual observatory.”
Searching for Brown Dwarfs • Data sets: • Near Infrared, 2MASS/UK Infrared Deep Sky Survey • Optical, APMCAT/Sloan Digital Sky Survey • Complex colour/motion selection criteria • Similar problems Image: AstroGrid
Deep Field Surveys • Observations in multiple wavelengths • Radio to X-Ray • Searching for new objects • Galaxies, stars, etc • Requires correlations across many catalogues • ISO • Hubble • SCUBA • etc Image: Hubble Space Telescope
Virtual Observatory: The Problems Locate, retrieve, and interpret relevant data • Heterogeneous publishers • Archive centres • Research labs • Heterogeneous data • Relational • XML • Image Files
Virtual Observatory: The Problems Locate, retrieve, and interpret relevant data • Which data sources contain relevant data? • How do I query the relevant data sources? • How can I interpret/combine/analyse the data?
Finding relevant data sources Which data sources contain relevant data?
Which data sources do I use? • VO registry • 65,000+ entries • Many mirrored services • VOExplorer • Registry search tool • Resources tagged with keywords • 6df • survey • galaxy • galaxies • redshift • redshifts • 2mass
Analysis of Registry Keywords
Problems: • Plural/singular • Case • Abbreviations • Different tags • Specificity of tags
Keyword counts (thanks to Sébastien Derriere for this data): 75 Star; 52 Galaxy; 37 Stars; 36 Galaxies; 16 AGN; 12 Cluster of Galaxies; 12 Nebulae; 11 Planets; 10 GRB; 10 Globular Clusters; 8 Star Cluster; 7 Nebula; 6 Variable stars; 5 Hot stars; 5 Pulsar; 4 supernova; 3 Clusters of Galaxies; 3 Infrared:stars; 3 Quasars: general; 3 Supernova; 3 White dwarfs; 3 galaxies; 2 Comets; 2 Cool stars; 2 Extragalactic Source; 2 Extragalactic objects; 2 Infrared: stars; 2 Interstellar medium; 2 QSO; 2 QSOs; 2 SNR; 2 Variable Star; 2 White Dwarf; 2 clusters of galaxies; 2 stars; 1 Asteroids; 1 BL Lac; 1 Be/X-ray binary stars; 1 Binary stars; ...
Analysis of Registry Keywords Problems: • Plural/singular • Case Solution (standard IR techniques): • Stemming • Star & Stars become Star • Galaxy & Galaxies become Galax • Case normalisation • lowercase
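The stemming and case-normalisation step can be sketched as follows. This is a minimal illustration, not the stemmer used in the project: `toy_stem` is a hypothetical suffix-stripping rule written only to reproduce the two examples on the slide (Star/Stars and Galaxy/Galaxies), and it handles single-word keywords only. The counts are the single-word Star and Galaxy variants from the registry keyword analysis.

```python
from collections import Counter

# (count, keyword) pairs copied from the registry keyword analysis.
RAW_COUNTS = [
    (75, "Star"), (52, "Galaxy"), (37, "Stars"), (36, "Galaxies"),
    (3, "galaxies"), (2, "stars"),
]


def toy_stem(word: str) -> str:
    """Lowercase a single word and strip a plural-like suffix.

    Hypothetical toy rule, NOT a real stemming algorithm such as Porter's.
    """
    word = word.lower()  # case normalisation
    if word.endswith("ies"):
        return word[:-3]  # galaxies -> galax
    if word.endswith("y"):
        return word[:-1]  # galaxy -> galax
    if word.endswith("s"):
        return word[:-1]  # stars -> star
    return word


def merge_counts(pairs):
    """Sum the counts of all keywords that share a stem."""
    merged = Counter()
    for count, keyword in pairs:
        merged[toy_stem(keyword)] += count
    return merged


merged = merge_counts(RAW_COUNTS)
print(dict(merged))
```

After merging, the six spelling variants collapse into two tags, which is the effect the slide describes. Abbreviations such as QSO versus Quasars are untouched by this kind of processing.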
Analysis of Registry Keywords Problems: • Abbreviations • Different tags • Specificity of tags Solution: Need to understand semantics!
Semantic Options • Folksonomies • Keyword tags, freely chosen • Vocabulary • Controlled list of words with definitions • Taxonomy • Relationships: Broader/Narrower/Related • Thesaurus • Synonyms, antonyms, see also • Ontology • Formal specification of a shared conceptualisation – OWL. “Vocabulary” is used here to cover vocabularies, taxonomies, and thesauri. Image: Leonard Cohen Search
Controlled Vocabulary
A set of terms with: • Label • Synonyms • Definition • Relationships to other terms: • Broader term • Narrower term • Related term
Example: • Label: “Spiral galaxy” • Synonym: “Spiral nebula” • Definition: “A galaxy having a spiral structure” • Relationships carrying semantic information: • BT: “Galaxy” • NT: “Barred spiral galaxy” • RT: “Spiral arm”
Existing Vocabularies in Astronomy • Journal Keywords • Developed for tagging papers • 311 terms • Actively used • Astronomy Visualization Metadata (AVM) • Tagging images • 217 terms • Actively used • IAU Thesaurus • Developed for libraries in 1993 • 2,551 terms • Never really used • Unified Content Descriptor (UCD) • Tagging resource data • 473 terms • Actively used
Common Vocabulary Format Requirements: • Provide term identifiers • Unambiguous tagging • Capture semantic relationships • Poly-hierarchy structure • Machine processable • Allows inter-operability • “Machine intelligence” • Avoids problems of: • Spelling • Case • Plurality problems • Tags • Automated reasoning: • Interested in all “Supernova” • Items tagged as “1a Supernova” also returned
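The automated-reasoning requirement (a search for “Supernova” also returns items tagged “1a Supernova”) amounts to following broader-term links. A minimal sketch, assuming a hand-written `BROADER` table and made-up resource identifiers (`res-A` and so on); only the term names come from the slides.

```python
# term -> list of broader terms; a list allows a poly-hierarchy.
BROADER = {
    "1a Supernova": ["Supernova"],
    "Barred spiral galaxy": ["Spiral galaxy"],
    "Spiral galaxy": ["Galaxy"],
}

# Hypothetical tagged resources (identifiers are illustrative only).
RESOURCES = {
    "res-A": "1a Supernova",
    "res-B": "Supernova",
    "res-C": "Barred spiral galaxy",
}


def ancestors(term):
    """Return every term reachable by repeatedly following broader links."""
    seen = set()
    stack = list(BROADER.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(BROADER.get(current, []))
    return seen


def search(query_term):
    """Resources tagged with the query term or with any narrower term."""
    return sorted(
        resource
        for resource, tag in RESOURCES.items()
        if tag == query_term or query_term in ancestors(tag)
    )


print(search("Supernova"))  # res-A and res-B
print(search("Galaxy"))     # res-C, found through two broader links
```

The visited set guards against cycles, which a loosely curated vocabulary may contain.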
SKOS • W3C standard for sharing vocabularies • Based on RDF • Semantic model for describing resources • Provides URI for each term • Captures properties of terms • Encodes relationships between terms • Enables automated reasoning • Standard serialisations • “Looser” semantics than OWL • Adopted by IVOA as a standard for vocabularies
Example SKOS Vocabulary Term
Example: • Label: “Spiral galaxy” • Synonym: “Spiral nebula” • Definition: “A galaxy having a spiral structure” • Relationships: • BT: “Galaxy” • NT: “Barred spiral galaxy” • RT: “Spiral arm”
In Turtle notation:
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
<#spiralGalaxy> a skos:Concept ;
    skos:prefLabel "Spiral galaxy"@en ;
    skos:altLabel "Spiral nebula"@en ;
    skos:definition "A galaxy having a spiral structure"@en ;
    skos:broader <#galaxy> ;
    skos:narrower <#barredSpiralGalaxy> ;
    skos:related <#spiralArm> .
Inter-operable Vocabularies
Which vocabulary should I use? • One that you know! • Closest match to your needs
Vocabulary terms related using mappings • Part of the SKOS standard • One mapping file per pair of vocabularies
Inter-vocabulary mappings: • Broad match: more general term • Narrow match: more specific term • Related match: associated term • Exact match: equivalent term • Close match: similar but not equivalent term
Putting it all together • Use vocabulary concepts for • Tagging (using URI) • Resources in the registry • VOEvent packets • Searching by vocabulary concept • User keyword search converted to vocabulary URI • Provides semantic advantages • Reasoning about terms • Relationships (Intra-vocabulary) • Mappings (Inter-vocabulary) • Requires a mechanism to convert a string to a concept
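The simplest possible mechanism for converting a string to a concept is an exact lookup over normalised preferred and alternative labels. The sketch below assumes a small in-memory `CONCEPTS` table built from the spiral-galaxy example; it is a baseline illustration only, far simpler than ranked retrieval with an IR platform.

```python
# Hypothetical concept records keyed by URI fragment; labels follow the
# spiral-galaxy example used earlier in the talk.
CONCEPTS = {
    "#spiralGalaxy": {"prefLabel": "Spiral galaxy", "altLabels": ["Spiral nebula"]},
    "#galaxy": {"prefLabel": "Galaxy", "altLabels": []},
}


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def build_index(concepts):
    """Map each normalised label to the set of concept URIs that carry it."""
    index = {}
    for uri, record in concepts.items():
        for label in [record["prefLabel"], *record["altLabels"]]:
            index.setdefault(normalise(label), set()).add(uri)
    return index


def lookup(keyword, index):
    """Return the concept URIs whose label exactly matches the keyword."""
    return sorted(index.get(normalise(keyword), set()))


INDEX = build_index(CONCEPTS)
print(lookup("  SPIRAL   nebula ", INDEX))
print(lookup("quasar", INDEX))
```

An exact lookup returns nothing for near misses such as misspellings or partial phrases, which is one reason a ranked matching approach is attractive.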
Vocabulary Explorer • Search and browse vocabularies • Configure • Vocabularies • Mappings • Uses Terrier Information Retrieval Platform • Matching mechanisms • Ranking results http://explicator.dcs.gla.ac.uk/WebVocabularyExplorer/
Search Results • Terrier IR Platform • Evaluation over 59 queries • nDCG (normalised discounted cumulative gain) evaluation model, which distinguishes highly relevant, relevant, and not relevant results
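For readers unfamiliar with nDCG, a minimal sketch of one common formulation follows. The gain values (2 for highly relevant, 1 for relevant, 0 for not relevant) and the log2 discount are assumptions; the talk does not state which variant or cutoff was used in the evaluation.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: each gain is divided by log2(rank + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(gains):
    """DCG of the ranking divided by the DCG of the ideal reordering.

    The ideal ranking is built from the same judged results, which is a
    simplification: unretrieved relevant items are ignored.
    """
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


# Assumed gains: 2 = highly relevant, 1 = relevant, 0 = not relevant.
good_ranking = [2, 1, 0]
poor_ranking = [0, 1, 2]
print(ndcg(good_ranking), ndcg(poor_ranking))
```

A ranking that places the highly relevant term first scores 1.0, and the score falls as relevant terms slip down the list.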
Finding the Right Term: Conclusions • Vocabularies improve search • Remove ambiguity • Increase precision and recall • Enable • Reasoning about relevance • Faceted browsing • Provided tools for working with vocabularies • Reliable search from keyword string to vocabulary term • Exploration of vocabularies • Mapping terms across vocabularies (not shown)
Extracting relevant data How do I query the relevant data sources?
Virtual Observatory: The Problems Locate, retrieve, and interpret relevant data • Heterogeneous publishers • Archive centres • Research labs • Heterogeneous data • Relational • XML • Image Files
A Data Integration Approach • Heterogeneous sources • Autonomous • Local schemas • Homogeneous view • Mediated global schema • Mapping • LAV: local-as-view • GAV: global-as-view
Relies on agreement of a common global schema
[Diagram: Query1 … Queryn posed over a Global Schema, linked by Mappings to Wrapper1 … Wrapperi … Wrapperk over DB1 … DBi … DBk]
P2P Data Integration Approach • Heterogeneous sources • Autonomous • Local schemas • Heterogeneous views • Multiple schemas • Mappings • From sources to common schema • Between pairs of schemas • Require common integration data model
Can RDF do this?
[Diagram: Query1 … Queryn posed over Schema1 … Schemaj, linked by Mappings to Wrapper1 … Wrapperi … Wrapperk over DB1 … DBi … DBk]
Integrating Using RDF • Data resources • Expose schema and data as RDF • Need a SPARQL endpoint • Allows multiple • Access models • Storage models • Easy to relate data from multiple sources
We will focus on exposing relational data sources
[Diagram: SPARQL query over a Common Model (RDF), linked by Mappings to an RDF / Relational Conversion over a Relational DB and an RDF / XML Conversion over an XML DB]
RDB2RDF: Two Approaches
Extract-Transform-Load: • Data replicated as RDF • Data can become stale • Native SPARQL query support • Limited optimisation mechanisms • Existing RDF stores: • Jena • Sesame
Query-driven Conversion: • Data stored as relations • Native SQL query support • Highly optimised access methods • SPARQL queries must be translated • Existing translation systems: • D2RQ • SquirrelRDF
System Test Hypothesis • Is it viable to perform query-driven conversions to facilitate data access from a data model that an astronomer is familiar with? • Can RDB2RDF tools feasibly expose large science archives for data integration?
[Diagram: SPARQL queries over a Common Model (RDF), linked by Mappings to RDB2RDF over a Relational DB and an RDF / XML Conversion over an XML DB]
Astronomical Test Data Set • SuperCOSMOS Science Archive (SSA) • Data extracted from scans of Schmidt plates • Stored in a relational database • About 4TB of data, detailing 6.4 billion objects • Fairly typical of astronomical data archives • Schema designed using 20 real queries • Personal version contains • Data for a specific region of the sky • About 0.1% of the data • About 500MB Image: SuperCOSMOS Science Archive
Analysis of Test Data • Using personal version • About 500MB in size (similar size to related work) • Organised in 14 Relations • Number of attributes: 2 – 152 • 4 relations with more than 20 attributes • Number of rows: 3 – 585,560 • Two views • Complex selection criteria in views Makes this different from business cases and previous work!
Is SPARQL expressive enough? Can the 20 sample queries be expressed in SPARQL?
Real Science Queries
Query 5: Find the positions and (B,R,I) magnitudes of all star-like objects within delta mag of 0.2 of the colours of a quasar of redshift 2.5 < z < 3.5
SPARQL:
SELECT ?ra ?decl ?sCorMagB ?sCorMagR2 ?sCorMagI
WHERE {
  …<bindings>…
  FILTER (?sCorMagB - ?sCorMagR2 >= 0.05 && ?sCorMagB - ?sCorMagR2 <= 0.80)
  FILTER (?sCorMagR2 - ?sCorMagI >= -0.17 && ?sCorMagR2 - ?sCorMagI <= 0.64)
}
SQL:
SELECT ra, dec, sCorMagB, sCorMagR2, sCorMagI
FROM ReliableStars
WHERE (sCorMagB - sCorMagR2 BETWEEN 0.05 AND 0.80)
  AND (sCorMagR2 - sCorMagI BETWEEN -0.17 AND 0.64)
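The two versions of Query 5 differ mainly in how the colour ranges are written: SQL has the `BETWEEN` shorthand, while SPARQL needs an explicit pair of comparisons. A minimal sketch of that rewriting step, where `between_filter` is a hypothetical helper and not part of any RDB2RDF tool:

```python
def between_filter(expression: str, low: float, high: float) -> str:
    """Expand `expression BETWEEN low AND high` into a SPARQL FILTER.

    Hypothetical helper for illustration; bounds are inclusive, matching
    the semantics of SQL BETWEEN.
    """
    return f"FILTER ({expression} >= {low} && {expression} <= {high})"


# The two colour cuts of Query 5.
filters = [
    between_filter("?sCorMagB - ?sCorMagR2", 0.05, 0.80),
    between_filter("?sCorMagR2 - ?sCorMagI", -0.17, 0.64),
]
print("\n".join(filters))
```

Note that the arithmetic expression is repeated in each comparison, so the SPARQL form grows with every range condition in the query.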
Analysis of Test Queries
Expressivity of SPARQL
Features: • Select-project-join • Arithmetic in body • Conjunction and disjunction • Ordering • String matching • External function calls (extension mechanism)
Limitations: • Range shorthands • Arithmetic in head • Math functions • Trigonometry functions • Sub queries • Aggregate functions • Casting
Analysis of Test Queries Expressible queries: 1, 2, 3, 5, 6, 14, 15, 17, 19
Can RDB2RDF tools feasibly expose large science archives for data integration?
Experiment • Time query evaluation • 5 out of 20 queries used • No joins
Systems compared: • Relational DB (baseline) • MySQL v5.1.25 • RDB2RDF tools • D2RQ v0.5.2 • SquirrelRDF v0.1 • RDF triple stores • Jena v2.5.6 (SDB) • Sesame v2.1.3 (Native)
[Diagram: SQL query to Relational DB; SPARQL query to RDB2RDF over Relational DB; SPARQL query to Triple store]
Experimental Configuration • 8 identical machines • 64-bit Intel Quad Core Xeon 2.4 GHz • 4 GB RAM • 100 GB hard drive • Java 1.6 • Linux • 10 repetitions
Performance Results
[Chart: query evaluation times in ms for the systems compared. Data labels as extracted, without their system and query assignment: 485,932; 372,561; 21,492; 17,793; 19,984; 3,450; 7,468; 5,339; 2,733; 7,229; 4,090; 1,307]
The Show Stopper: Query Translation • Each bound variable resulted in a self-join • RDBMS cannot optimise for this • RDBMSs perform badly with self-joins • Each row retrieved with a separate query • 1 query becomes n queries, where n is the cardinality of the relation • Predicate selection performed in the RDB2RDF tool • No RDBMS optimisation possible
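The two translation patterns named above can be illustrated with SQLite. This is a sketch under stated assumptions: the `star` table, its columns, and the SQL text are hypothetical, and the talk does not show the SQL that D2RQ or SquirrelRDF actually generated. It shows only why one self-join per bound variable, or one query per row, does redundant work compared with a direct projection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE star (objid INTEGER PRIMARY KEY, ra REAL, decl REAL, magB REAL)"
)
conn.executemany(
    "INSERT INTO star VALUES (?, ?, ?, ?)",
    [(1, 10.5, -30.1, 18.2), (2, 11.0, -29.8, 19.7), (3, 12.3, -31.4, 17.9)],
)

# What a hand-written SQL query would look like: one scan of one table.
DIRECT = "SELECT ra, decl, magB FROM star ORDER BY objid"

# Naive translation: every bound variable (?ra, ?decl, ?magB) gets its own
# copy of the table, joined back on the key.
SELF_JOIN = """
SELECT t1.ra, t2.decl, t3.magB
FROM star AS t1
JOIN star AS t2 ON t2.objid = t1.objid
JOIN star AS t3 ON t3.objid = t1.objid
ORDER BY t1.objid
"""

direct_rows = conn.execute(DIRECT).fetchall()
self_join_rows = conn.execute(SELF_JOIN).fetchall()


def per_row_fetch(connection):
    """Second pattern: fetch the keys, then issue one query per row."""
    ids = [row[0] for row in connection.execute("SELECT objid FROM star ORDER BY objid")]
    rows = [
        connection.execute(
            "SELECT ra, decl, magB FROM star WHERE objid = ?", (objid,)
        ).fetchone()
        for objid in ids
    ]
    return rows, len(ids)  # len(ids) = number of per-row queries issued


per_row_rows, query_count = per_row_fetch(conn)
print(direct_rows == self_join_rows == per_row_rows, query_count)
```

All three strategies return the same rows, but the per-row strategy issues as many queries as the relation has rows, which for a relation of 585,560 rows is prohibitive.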
Extracting Relevant Data: Conclusions • SPARQL not expressive enough for real (astronomy) queries • RDBMS benefits from 30+ years of research • Query optimisation • Indexes • RDF stores are improving • Require existing data to be replicated • RDB2RDF tools show promise • Need to exploit relational database
Can RDB2RDF Tools Feasibly Expose Large Science Archives for Data Integration? Not currently! More work needed on query translation…
Conclusions & Future Work
Traditional integration challenges and the Semantic Web solution for each:
• Locating data: SKOS Vocabularies • Search based on Terrier IR Platform • Currently linking to resource content
• Extracting relevant data: RDB2RDF Tools • Requires improved query translation
• Understanding data: Semantic model mappings • Follow “chains” of mappings • Relies on RDB2RDF work