550 likes | 678 Views
Using Semantic Web Technology to Integrate Scientific Data. Alasdair J G Gray University of Manchester. Outline. Motivation: Astronomy & The Virtual Observatory Data Integration Challenges Locating the relevant data Requesting the required data Understanding the returned data.
E N D
Using Semantic Web Technology to Integrate Scientific Data Alasdair J G Gray University of Manchester
Outline • Motivation: Astronomy & The Virtual Observatory • Data Integration Challenges • Locating the relevant data • Requesting the required data • Understanding the returned data Using Semantic Web tools and technology to overcome these challenges A.J.G. Gray - Bolzano-Bozen Seminar
Context: Astronomy • Data collected across electromagnetic spectrum • Analysed within one wavelength • Data collection is • expensive • time consuming • Existing data • large quantities • freely available Image: Wikipedia A.J.G. Gray - Bolzano-Bozen Seminar
Virtual Observatory “facilitate the international coordination and collaboration necessary for the development and deployment of the tools, systems and organizational structures necessary to enable the international utilization of astronomical archives as an integrated and interoperating virtual observatory.” A.J.G. Gray - Bolzano-Bozen Seminar
Searching for Brown Dwarfs • Data sets: • Near Infrared, 2MASS/UK Infrared Deep Sky Survey • Optical, APMCAT/Sloan Digital Sky Survey • Complex colour/motion selection criteria • Similar problems • Halo White Dwarfs Image: AstroGrid A.J.G. Gray - Bolzano-Bozen Seminar
Deep Field Surveys • Observations in multiple wavelengths • Radio to X-Ray • Searching for new objects • Galaxies, stars, etc • Requires correlations across many catalogues • ISO • Hubble • SCUBA • etc A.J.G. Gray - Bolzano-Bozen Seminar
Virtual Observatory: The Problems Locate, retrieve, and interpret relevant data • Heterogeneous publishers • Archive centres • Research labs • Heterogeneous data • Relational • XML • Image Files Virtual Observatory A.J.G. Gray - Bolzano-Bozen Seminar
Virtual Observatory: The Problems Locate, retrieve, and interpret relevant data • Which data sources contain relevant data? • How do I query the relevant data sources? • How can I interpret/combine/analyse the data? Virtual Observatory A.J.G. Gray - Bolzano-Bozen Seminar
Finding relevant data sources Which data sources contain relevant data? A.J.G. Gray - Bolzano-Bozen Seminar
Which data sources do I use? • VO registry • 65,000+ entries • Many mirrored services • VOExplorer • Registry search tool • Resources tagged with keywords • 6df • survey • galaxy • galaxies • redshift • redshifts • 2mass A.J.G. Gray - Bolzano-Bozen Seminar
Analysis of Registry Keywords Problems: • Plural/singular • Case • Abbreviations • Different tags • Specificity of tags Thanks to Sébastien Derriere for this data. 75 Star 52 Galaxy 37 Stars 36 Galaxies 16 AGN 12 Cluster of Galaxies 12 Nebulae 11 Planets 10 GRB 10 Globular Clusters 8 Star Cluster 7 Nebula 6 Variable stars 5 Hot stars 5 Pulsar 4 supernova 3 Clusters of Galaxies 3 Infrared:stars 3 Quasars: general 3 Supernova 3 White dwarfs 3 galaxies 2 Comets 2 Cool stars 2 Extragalactic Source 2 Extragalactic objects 2 Infrared: stars 2 Interstellar medium 2 QSO 2 QSOs 2 SNR 2 Variable Star 2 White Dwarf 2 clusters of galaxies 2 stars 1 Asteroids 1 BL Lac 1 Be/X-ray binary stars 1 Binary stars ... A.J.G. Gray - Bolzano-Bozen Seminar
Analysis of Registry Keywords Problems: • Plural/singular • Case Solution: (standard IR techniques) • Stemming • Star & Stars become Star • Case normalisation • lowercase 75 Star 52 Galaxy 37 Stars 36 Galaxies 16 AGN 12 Cluster of Galaxies 12 Nebulae 11 Planets 10 GRB 10 Globular Clusters 8 Star Cluster 7 Nebula 6 Variable stars 5 Hot stars 5 Pulsar 4 supernova 3 Clusters of Galaxies 3 Infrared:stars 3 Quasars: general 3 Supernova 3 White dwarfs 3 galaxies 2 Comets 2 Cool stars 2 Extragalactic Source 2 Extragalactic objects 2 Infrared: stars 2 Interstellar medium 2 QSO 2 QSOs 2 SNR 2 Variable Star 2 White Dwarf 2 clusters of galaxies 2 stars 1 Asteroids 1 BL Lac 1 Be/X-ray binary stars 1 Binary stars ... A.J.G. Gray - Bolzano-Bozen Seminar
Analysis of Registry Keywords Problems: • Abbreviations • Different tags • Specificity of tags Solution: Need to understand semantics! 75 Star 52 Galaxy 37 Stars 36 Galaxies 16 AGN 12 Cluster of Galaxies 12 Nebulae 11 Planets 10 GRB 10 Globular Clusters 8 Star Cluster 7 Nebula 6 Variable stars 5 Hot stars 5 Pulsar 4 supernova 3 Clusters of Galaxies 3 Infrared:stars 3 Quasars: general 3 Supernova 3 White dwarfs 3 galaxies 2 Comets 2 Cool stars 2 Extragalactic Source 2 Extragalactic objects 2 Infrared: stars 2 Interstellar medium 2 QSO 2 QSOs 2 SNR 2 Variable Star 2 White Dwarf 2 clusters of galaxies 2 stars 1 Asteroids 1 BL Lac 1 Be/X-ray binary stars 1 Binary stars ... A.J.G. Gray - Bolzano-Bozen Seminar
Semantic Options • Folksonomies • Keyword tags, freely chosen • Vocabulary • Controlled list of words with definitions • Taxonomy • Relationships: Broader/Narrower/Related • Thesaurus • Synonyms, antonyms, see also • Ontology • Formal specification of a shared conceptualisation – OWL “Vocabulary” used to covervocabularies, taxonomies, and thesauri. Image: Leonard Cohen Search A.J.G. Gray - Bolzano-Bozen Seminar
What is a Vocabulary? A.J.G. Gray - Bolzano-Bozen Seminar
What is a Controlled Vocabulary? • A set of terms with: • Label • Synonyms • Definition • Relationships to other terms: • Broader term • Narrower term • Related term • Example: • “Spiral galaxy” • “Spiral nebula” • “A galaxy having a spiral structure” • Relationships carrying semantic information: • BT: “Galaxy” • NT: “Barred spiral galaxy” • RT: “Spiral arm” A.J.G. Gray - Bolzano-Bozen Seminar
Existing Vocabularies in Astronomy • Journal Keywords • Developed for tagging papers • 311 terms • Actively used • Astronomy Visualization Metadata (AVM) • Tagging images • 217 terms • Actively used • IAU Thesaurus • Developed for libraries in 1993 • 2,551 terms • Never really used • Unified Content Descriptor (UCD) • Tagging resource data • 473 terms • Actively used A.J.G. Gray - Bolzano-Bozen Seminar
Common Vocabulary Format Requirements: • Provide term identifiers • Unambiguous tagging • Capture semantic relationships • Poly-hierarchy structure • Machine processable • Allows inter-operability • “Machine intelligence” • Avoids problems of: • Spelling • Case • Plurality problems • Tags • Automated reasoning: • Interested in all “Supernova” • Items tagged as “1a Supernova” also returned A.J.G. Gray - Bolzano-Bozen Seminar
SKOS • W3C standard for sharing vocabularies • Based on RDF • Semantic model for describing resources • Provides URI for each term • Captures properties of terms • Encodes relationships between terms • Enables automated reasoning • Standard serialisations • “Looser” semantics than OWL • Adopted by IVOA as a standard for vocabularies A.J.G. Gray - Bolzano-Bozen Seminar
Example SKOS Vocabulary Term Example “Spiral galaxy” “Spiral nebula” “A galaxy having a spiral structure” Relationships: BT: “Galaxy” NT: “Barred spiral galaxy” RT: “Spiral arm” In turtle notation #spiralGalaxy a concept; prefLabel “Spiral galaxy”@en; altLabel “Spiral nebula”@en; definition “A galaxy having a spiral structure”@en; broader #galaxy; narrower #barredSpiralGalaxy; related #spiralArm . A.J.G. Gray - Bolzano-Bozen Seminar
Inter-operable Vocabularies Which vocabulary should I use? Inter-vocabulary mappings Broad match: more general term Narrow match: more specific term Related match: associated term Exact match: equivalent term Close match: similar but not equivalent term • One that you know! • Closest match to your needs • Vocabulary terms related using mappings • Part of the SKOS standard • One mapping file per pair of vocabularies A.J.G. Gray - Bolzano-Bozen Seminar
Mapping Editor A.J.G. Gray - Bolzano-Bozen Seminar
Putting it all together • Use vocabulary concepts for • Tagging (using URI) • Resources in the registry • VOEvent packets • Searching by vocabulary concept • User keyword search converted to vocabulary URI • Provides semantic advantages • Reasoning about terms • Relationships (Intra-vocabulary) • Mappings (Inter-vocabulary) • Requires a mechanism to convert a string to a concept A.J.G. Gray - Bolzano-Bozen Seminar
Vocabulary Explorer • Search and browse vocabularies • Configure • Vocabularies • Mappings • Based on Information Retrieval techniques • Matching mechanisms • Ranking results http://explicator.dcs.gla.ac.uk/WebVocabularyExplorer/ A.J.G. Gray - Bolzano-Bozen Seminar
Search Results • Evaluation over 59 queries • nDCG evaluation model (distinguishes highly relevant/relevant/not relevant) A.J.G. Gray - Bolzano-Bozen Seminar
Vocabulary Explorer Screenshot A.J.G. Gray - Bolzano-Bozen Seminar
Vocabulary Explorer Screenshot A.J.G. Gray - Bolzano-Bozen Seminar
Vocabulary Explorer Screenshot A.J.G. Gray - Bolzano-Bozen Seminar
Vocabulary Explorer Screenshot A.J.G. Gray - Bolzano-Bozen Seminar
Finding the Right Term: Conclusions • Vocabularies improve search • Remove ambiguity • Increase precision and recall • Enable • Reasoning about relevance • Faceted browsing • Provided tools for working with vocabularies • Reliable search from keyword string to vocabulary term • Exploration of vocabularies • Mapping terms across vocabularies A.J.G. Gray - Bolzano-Bozen Seminar
Extracting relevant data How do I query the relevant data sources? A.J.G. Gray - Bolzano-Bozen Seminar
Virtual Observatory: The Problems Locate, retrieve, and interpret relevant data • Heterogeneous publishers • Archive centres • Research labs • Heterogeneous data • Relational • XML • Image Files Virtual Observatory A.J.G. Gray - Bolzano-Bozen Seminar
A Data Integration Approach Query1 • Heterogeneous sources • Autonomous • Local schemas • Homogeneous view • Mediated global schema • Mapping • LAV: local-as-view • GAV: global-as-view Queryn Relies on agreement of a common global schema Global Schema Mappings Wrapper1 Wrapperk Wrapperi DB1 DBk DBi A.J.G. Gray - Bolzano-Bozen Seminar
P2P Data Integration Approach Query1 Queryn • Heterogeneous sources • Autonomous • Local schemas • Heterogeneous views • Multiple schemas • Mappings • From sources to common schema • Between pairs of schema • Require common integration data model Can RDF do this? Schemaj Schema1 Mappings Wrapper1 Wrapperk Wrapperi DB1 DBk DBi A.J.G. Gray - Bolzano-Bozen Seminar
Resource Description Framework IAU:Star • W3C standard • Designed as a metadata data model • Contains semantic details • Ideal for linking distributed data • Queried through SPARQL rdf:type #Sun #name The Sun #foundIn The Galaxy #name #MilkyWay #name Milky Way rdf:type IAU:BarredSpiral A.J.G. Gray - Bolzano-Bozen Seminar
SPARQL • Declarative query language • Select returned data (projection) • Graph or tuples • Attributes to return • Describe structure of desired results • Filter data (selection) • W3C standard • Syntactically similar to SQL • Should be easy for astronomers to learn! A.J.G. Gray - Bolzano-Bozen Seminar
Integrating Using RDF Common Model (RDF) • Data resources • Expose schema and data as RDF • Need a SPARQL endpoint • Allows multiple • Access models • Storage models • Easy to relate data from multiple sources SPARQL query We will focus on exposing relational data sources Mappings RDF / Relational Conversion RDF / XML Conversion Relational DB XML DB A.J.G. Gray - Bolzano-Bozen Seminar
RDB2RDF: Two Approaches Extract-Transform-Load Query-driven Conversion Data stored as relations Native SQL query support Highly optimised access methods SPARQL queries must be translated Existing translation systems D2RQ SquirrelRDF • Data replicated as RDF • Data can become stale • Native SPARQL query support • Limited optimisation mechanisms Existing RDF stores • Jena • Seasame A.J.G. Gray - Bolzano-Bozen Seminar
System Test Hypothesis Common Model (RDF) Is it viable to perform query-driven conversions to facilitate data access from a data model that an astronomer is familiar with? Can RDB2RDF tools feasibly expose large science archives for data integration? SPARQL query SPARQL query Mappings RDB2RDF RDF / XML Conversion Relational DB XML DB A.J.G. Gray - Bolzano-Bozen Seminar
Astronomical Test Data Set • SuperCOSMOS Science Archive (SSA) • Data extracted from scans of Schmidt plates • Stored in a relational database • About 4TB of data, detailing 6.4 billion objects • Fairly typical of astronomical data archives • Schema designed using 20 real queries • Personal version contains • Data for a specific region of the sky • About 0.1% of the data • About 500MB Image: SuperCOSMOS Science Archive A.J.G. Gray - Bolzano-Bozen Seminar
Analysis of Test Data • Using personal version • About 500MB in size (similar size to related work) • Organised in 14 Relations • Number of attributes: 2 – 152 • 4 relations with more than 20 attributes • Number of rows: 3 – 585,560 • Two views • Complex selection criteria in views Makes this different from business cases and previous work! A.J.G. Gray - Bolzano-Bozen Seminar
IsSPARQL expressive enough? Can the 20 sample queries be expressed in SPARQL? A.J.G. Gray - Bolzano-Bozen Seminar
Real Science Queries Query 5: Find the positions and (B,R,I) magnitudes of all star-like objects within delta mag of 0.2 of the colours of a quasar of redshift 2.5 < z < 3.5 SPARQL: SELECT ?ra ?decl ?sCorMagB?sCorMagR2 ?sCorMagI WHERE { …<bindings>… FILTER (?sCorMagB – ?sCorMagR2 >= 0.05 && ?sCorMagB - ?sCorMagR2 <= 0.80) FILTER (?sCorMagR2 – ?sCorMagI >= -0.17 && ?sCorMagR2 - ?sCorMagI <= 0.64)} SQL: SELECT ra, dec, sCorMagB,sCorMagR2, sCorMagI FROM ReliableStars WHERE (sCorMagB-sCorMagR2 BETWEEN 0.05 AND 0.80) AND (sCorMagR2-sCorMagI BETWEEN -0.17 AND 0.64) A.J.G. Gray - Bolzano-Bozen Seminar
Analysis of Test Queries A.J.G. Gray - Bolzano-Bozen Seminar
Expressivity of SPARQL Features Limitations Range shorthands Arithmetic in head Math functions Trigonometry functions Sub queries Aggregate functions Casting • Select-project-join • Arithmetic in body • Conjunction and disjunction • Ordering • String matching • External function calls (extension mechanism) A.J.G. Gray - Bolzano-Bozen Seminar
Analysis of Test Queries Expressible queries: 1, 2, 3, 5, 6, 14, 15, 17, 19 A.J.G. Gray - Bolzano-Bozen Seminar
Can RDB2RDF tools feasibly expose large science archives for data integration? A.J.G. Gray - Bolzano-Bozen Seminar
Experiment • Time query evaluation • 5 out of 20 queries used • No joins • Systems compared: • Relational DB (Base line) • MySQL v5.1.25 • RDB2RDF tools • D2RQ v0.5.2 • SquirrelRDF v0.1 • RDF Triple stores • Jena v2.5.6 (SDB) • Sesame v2.1.3 (Native) SPARQL query SPARQL query SQL query RDB2RDF Relational DB Triple store Relational DB A.J.G. Gray - Bolzano-Bozen Seminar
Experimental Configuration • 8 identical machines • 64 bit Intel Quad Core Xeon 2.4GHz • 4GB RAM • 100 GB Hard drive • Java 1.6 • Linux • 10 repetitions A.J.G. Gray - Bolzano-Bozen Seminar
Performance Results 485,932 372,561 21,492 17,793 19,984 3,450 5,339 2,733 7,229 4,090 1,307 7,468 ms 1 A.J.G. Gray - Bolzano-Bozen Seminar