1 / 54

Using Semantic Web Technology to Integrate Scientific Data

Using Semantic Web Technology to Integrate Scientific Data. Alasdair J G Gray University of Manchester. Outline. Motivation: Astronomy & The Virtual Observatory Data Integration Challenges Locating the relevant data Requesting the required data Understanding the returned data.

aleron
Download Presentation

Using Semantic Web Technology to Integrate Scientific Data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Using Semantic Web Technology to Integrate Scientific Data Alasdair J G Gray University of Manchester

  2. Outline • Motivation: Astronomy & The Virtual Observatory • Data Integration Challenges • Locating the relevant data • Requesting the required data • Understanding the returned data Using Semantic Web tools and technology to overcome these challenges A.J.G. Gray - Bolzano-Bozen Seminar

  3. Context: Astronomy • Data collected across electromagnetic spectrum • Analysed within one wavelength • Data collection is • expensive • time consuming • Existing data • large quantities • freely available Image: Wikipedia A.J.G. Gray - Bolzano-Bozen Seminar

  4. Virtual Observatory “facilitate the international coordination and collaboration necessary for the development and deployment of the tools, systems and organizational structures necessary to enable the international utilization of astronomical archives as an integrated and interoperating virtual observatory.” A.J.G. Gray - Bolzano-Bozen Seminar

  5. Searching for Brown Dwarfs • Data sets: • Near Infrared, 2MASS/UK Infrared Deep Sky Survey • Optical, APMCAT/Sloan Digital Sky Survey • Complex colour/motion selection criteria • Similar problems • Halo White Dwarfs Image: AstroGrid A.J.G. Gray - Bolzano-Bozen Seminar

  6. Deep Field Surveys • Observations in multiple wavelengths • Radio to X-Ray • Searching for new objects • Galaxies, stars, etc • Requires correlations across many catalogues • ISO • Hubble • SCUBA • etc A.J.G. Gray - Bolzano-Bozen Seminar

  7. Virtual Observatory: The Problems Locate, retrieve, and interpret relevant data • Heterogeneous publishers • Archive centres • Research labs • Heterogeneous data • Relational • XML • Image Files Virtual Observatory A.J.G. Gray - Bolzano-Bozen Seminar

  8. Virtual Observatory: The Problems Locate, retrieve, and interpret relevant data • Which data sources contain relevant data? • How do I query the relevant data sources? • How can I interpret/combine/analyse the data? Virtual Observatory A.J.G. Gray - Bolzano-Bozen Seminar

  9. Finding relevant data sources Which data sources contain relevant data? A.J.G. Gray - Bolzano-Bozen Seminar

  10. Which data sources do I use? • VO registry • 65,000+ entries • Many mirrored services • VOExplorer • Registry search tool • Resources tagged with keywords • 6df • survey • galaxy • galaxies • redshift • redshifts • 2mass A.J.G. Gray - Bolzano-Bozen Seminar

  11. Analysis of Registry Keywords Problems: • Plural/singular • Case • Abbreviations • Different tags • Specificity of tags Thanks to Sébastien Derriere for this data. 75 Star 52 Galaxy 37 Stars 36 Galaxies 16 AGN 12 Cluster of Galaxies 12 Nebulae 11 Planets 10 GRB 10 Globular Clusters 8 Star Cluster 7 Nebula 6 Variable stars 5 Hot stars 5 Pulsar 4 supernova 3 Clusters of Galaxies 3 Infrared:stars 3 Quasars: general 3 Supernova 3 White dwarfs 3 galaxies 2 Comets 2 Cool stars 2 Extragalactic Source 2 Extragalactic objects 2 Infrared: stars 2 Interstellar medium 2 QSO 2 QSOs 2 SNR 2 Variable Star 2 White Dwarf 2 clusters of galaxies 2 stars 1 Asteroids 1 BL Lac 1 Be/X-ray binary stars 1 Binary stars ... A.J.G. Gray - Bolzano-Bozen Seminar

  12. Analysis of Registry Keywords Problems: • Plural/singular • Case Solution: (standard IR techniques) • Stemming • Star & Stars become Star • Case normalisation • lowercase 75 Star 52 Galaxy 37 Stars 36 Galaxies 16 AGN 12 Cluster of Galaxies 12 Nebulae 11 Planets 10 GRB 10 Globular Clusters 8 Star Cluster 7 Nebula 6 Variable stars 5 Hot stars 5 Pulsar 4 supernova 3 Clusters of Galaxies 3 Infrared:stars 3 Quasars: general 3 Supernova 3 White dwarfs 3 galaxies 2 Comets 2 Cool stars 2 Extragalactic Source 2 Extragalactic objects 2 Infrared: stars 2 Interstellar medium 2 QSO 2 QSOs 2 SNR 2 Variable Star 2 White Dwarf 2 clusters of galaxies 2 stars 1 Asteroids 1 BL Lac 1 Be/X-ray binary stars 1 Binary stars ... A.J.G. Gray - Bolzano-Bozen Seminar

  13. Analysis of Registry Keywords Problems: • Abbreviations • Different tags • Specificity of tags Solution: Need to understand semantics! 75 Star 52 Galaxy 37 Stars 36 Galaxies 16 AGN 12 Cluster of Galaxies 12 Nebulae 11 Planets 10 GRB 10 Globular Clusters 8 Star Cluster 7 Nebula 6 Variable stars 5 Hot stars 5 Pulsar 4 supernova 3 Clusters of Galaxies 3 Infrared:stars 3 Quasars: general 3 Supernova 3 White dwarfs 3 galaxies 2 Comets 2 Cool stars 2 Extragalactic Source 2 Extragalactic objects 2 Infrared: stars 2 Interstellar medium 2 QSO 2 QSOs 2 SNR 2 Variable Star 2 White Dwarf 2 clusters of galaxies 2 stars 1 Asteroids 1 BL Lac 1 Be/X-ray binary stars 1 Binary stars ... A.J.G. Gray - Bolzano-Bozen Seminar

  14. Semantic Options • Folksonomies • Keyword tags, freely chosen • Vocabulary • Controlled list of words with definitions • Taxonomy • Relationships: Broader/Narrower/Related • Thesaurus • Synonyms, antonyms, see also • Ontology • Formal specification of a shared conceptualisation – OWL “Vocabulary” used to covervocabularies, taxonomies, and thesauri. Image: Leonard Cohen Search A.J.G. Gray - Bolzano-Bozen Seminar

  15. What is a Vocabulary? A.J.G. Gray - Bolzano-Bozen Seminar

  16. What is a Controlled Vocabulary? • A set of terms with: • Label • Synonyms • Definition • Relationships to other terms: • Broader term • Narrower term • Related term • Example: • “Spiral galaxy” • “Spiral nebula” • “A galaxy having a spiral structure” • Relationships carrying semantic information: • BT: “Galaxy” • NT: “Barred spiral galaxy” • RT: “Spiral arm” A.J.G. Gray - Bolzano-Bozen Seminar

  17. Existing Vocabularies in Astronomy • Journal Keywords • Developed for tagging papers • 311 terms • Actively used • Astronomy Visualization Metadata (AVM) • Tagging images • 217 terms • Actively used • IAU Thesaurus • Developed for libraries in 1993 • 2,551 terms • Never really used • Unified Content Descriptor (UCD) • Tagging resource data • 473 terms • Actively used A.J.G. Gray - Bolzano-Bozen Seminar

  18. Common Vocabulary Format Requirements: • Provide term identifiers • Unambiguous tagging • Capture semantic relationships • Poly-hierarchy structure • Machine processable • Allows inter-operability • “Machine intelligence” • Avoids problems of: • Spelling • Case • Plurality problems • Tags • Automated reasoning: • Interested in all “Supernova” • Items tagged as “1a Supernova” also returned A.J.G. Gray - Bolzano-Bozen Seminar

  19. SKOS • W3C standard for sharing vocabularies • Based on RDF • Semantic model for describing resources • Provides URI for each term • Captures properties of terms • Encodes relationships between terms • Enables automated reasoning • Standard serialisations • “Looser” semantics than OWL • Adopted by IVOA as a standard for vocabularies A.J.G. Gray - Bolzano-Bozen Seminar

  20. Example SKOS Vocabulary Term Example “Spiral galaxy” “Spiral nebula” “A galaxy having a spiral structure” Relationships: BT: “Galaxy” NT: “Barred spiral galaxy” RT: “Spiral arm” In turtle notation #spiralGalaxy a concept; prefLabel “Spiral galaxy”@en; altLabel “Spiral nebula”@en; definition “A galaxy having a spiral structure”@en; broader #galaxy; narrower #barredSpiralGalaxy; related #spiralArm . A.J.G. Gray - Bolzano-Bozen Seminar

  21. Inter-operable Vocabularies Which vocabulary should I use? Inter-vocabulary mappings Broad match: more general term Narrow match: more specific term Related match: associated term Exact match: equivalent term Close match: similar but not equivalent term • One that you know! • Closest match to your needs • Vocabulary terms related using mappings • Part of the SKOS standard • One mapping file per pair of vocabularies A.J.G. Gray - Bolzano-Bozen Seminar

  22. Mapping Editor A.J.G. Gray - Bolzano-Bozen Seminar

  23. Putting it all together • Use vocabulary concepts for • Tagging (using URI) • Resources in the registry • VOEvent packets • Searching by vocabulary concept • User keyword search converted to vocabulary URI • Provides semantic advantages • Reasoning about terms • Relationships (Intra-vocabulary) • Mappings (Inter-vocabulary) • Requires a mechanism to convert a string to a concept A.J.G. Gray - Bolzano-Bozen Seminar

  24. Vocabulary Explorer • Search and browse vocabularies • Configure • Vocabularies • Mappings • Based on Information Retrieval techniques • Matching mechanisms • Ranking results http://explicator.dcs.gla.ac.uk/WebVocabularyExplorer/ A.J.G. Gray - Bolzano-Bozen Seminar

  25. Search Results • Evaluation over 59 queries • nDCG evaluation model (distinguishes highly relevant/relevant/not relevant) A.J.G. Gray - Bolzano-Bozen Seminar

  26. Vocabulary Explorer Screenshot A.J.G. Gray - Bolzano-Bozen Seminar

  27. Vocabulary Explorer Screenshot A.J.G. Gray - Bolzano-Bozen Seminar

  28. Vocabulary Explorer Screenshot A.J.G. Gray - Bolzano-Bozen Seminar

  29. Vocabulary Explorer Screenshot A.J.G. Gray - Bolzano-Bozen Seminar

  30. Finding the Right Term: Conclusions • Vocabularies improve search • Remove ambiguity • Increase precision and recall • Enable • Reasoning about relevance • Faceted browsing • Provided tools for working with vocabularies • Reliable search from keyword string to vocabulary term • Exploration of vocabularies • Mapping terms across vocabularies A.J.G. Gray - Bolzano-Bozen Seminar

  31. Extracting relevant data How do I query the relevant data sources? A.J.G. Gray - Bolzano-Bozen Seminar

  32. Virtual Observatory: The Problems Locate, retrieve, and interpret relevant data • Heterogeneous publishers • Archive centres • Research labs • Heterogeneous data • Relational • XML • Image Files Virtual Observatory A.J.G. Gray - Bolzano-Bozen Seminar

  33. A Data Integration Approach Query1 • Heterogeneous sources • Autonomous • Local schemas • Homogeneous view • Mediated global schema • Mapping • LAV: local-as-view • GAV: global-as-view Queryn Relies on agreement of a common global schema Global Schema Mappings Wrapper1 Wrapperk Wrapperi DB1 DBk DBi A.J.G. Gray - Bolzano-Bozen Seminar

  34. P2P Data Integration Approach Query1 Queryn • Heterogeneous sources • Autonomous • Local schemas • Heterogeneous views • Multiple schemas • Mappings • From sources to common schema • Between pairs of schema • Require common integration data model Can RDF do this? Schemaj Schema1 Mappings Wrapper1 Wrapperk Wrapperi DB1 DBk DBi A.J.G. Gray - Bolzano-Bozen Seminar

  35. Resource Description Framework IAU:Star • W3C standard • Designed as a metadata data model • Contains semantic details • Ideal for linking distributed data • Queried through SPARQL rdf:type #Sun #name The Sun #foundIn The Galaxy #name #MilkyWay #name Milky Way rdf:type IAU:BarredSpiral A.J.G. Gray - Bolzano-Bozen Seminar

  36. SPARQL • Declarative query language • Select returned data (projection) • Graph or tuples • Attributes to return • Describe structure of desired results • Filter data (selection) • W3C standard • Syntactically similar to SQL • Should be easy for astronomers to learn! A.J.G. Gray - Bolzano-Bozen Seminar

  37. Integrating Using RDF Common Model (RDF) • Data resources • Expose schema and data as RDF • Need a SPARQL endpoint • Allows multiple • Access models • Storage models • Easy to relate data from multiple sources SPARQL query We will focus on exposing relational data sources Mappings RDF / Relational Conversion RDF / XML Conversion Relational DB XML DB A.J.G. Gray - Bolzano-Bozen Seminar

  38. RDB2RDF: Two Approaches Extract-Transform-Load Query-driven Conversion Data stored as relations Native SQL query support Highly optimised access methods SPARQL queries must be translated Existing translation systems D2RQ SquirrelRDF • Data replicated as RDF • Data can become stale • Native SPARQL query support • Limited optimisation mechanisms Existing RDF stores • Jena • Seasame A.J.G. Gray - Bolzano-Bozen Seminar

  39. System Test Hypothesis Common Model (RDF) Is it viable to perform query-driven conversions to facilitate data access from a data model that an astronomer is familiar with? Can RDB2RDF tools feasibly expose large science archives for data integration? SPARQL query SPARQL query Mappings RDB2RDF RDF / XML Conversion Relational DB XML DB A.J.G. Gray - Bolzano-Bozen Seminar

  40. Astronomical Test Data Set • SuperCOSMOS Science Archive (SSA) • Data extracted from scans of Schmidt plates • Stored in a relational database • About 4TB of data, detailing 6.4 billion objects • Fairly typical of astronomical data archives • Schema designed using 20 real queries • Personal version contains • Data for a specific region of the sky • About 0.1% of the data • About 500MB Image: SuperCOSMOS Science Archive A.J.G. Gray - Bolzano-Bozen Seminar

  41. Analysis of Test Data • Using personal version • About 500MB in size (similar size to related work) • Organised in 14 Relations • Number of attributes: 2 – 152 • 4 relations with more than 20 attributes • Number of rows: 3 – 585,560 • Two views • Complex selection criteria in views Makes this different from business cases and previous work! A.J.G. Gray - Bolzano-Bozen Seminar

  42. IsSPARQL expressive enough? Can the 20 sample queries be expressed in SPARQL? A.J.G. Gray - Bolzano-Bozen Seminar

  43. Real Science Queries Query 5: Find the positions and (B,R,I) magnitudes of all star-like objects within delta mag of 0.2 of the colours of a quasar of redshift 2.5 < z < 3.5 SPARQL: SELECT ?ra ?decl ?sCorMagB?sCorMagR2 ?sCorMagI WHERE { …<bindings>… FILTER (?sCorMagB – ?sCorMagR2 >= 0.05 && ?sCorMagB - ?sCorMagR2 <= 0.80) FILTER (?sCorMagR2 – ?sCorMagI >= -0.17 && ?sCorMagR2 - ?sCorMagI <= 0.64)} SQL: SELECT ra, dec, sCorMagB,sCorMagR2, sCorMagI FROM ReliableStars WHERE (sCorMagB-sCorMagR2 BETWEEN 0.05 AND 0.80) AND (sCorMagR2-sCorMagI BETWEEN -0.17 AND 0.64) A.J.G. Gray - Bolzano-Bozen Seminar

  44. Analysis of Test Queries A.J.G. Gray - Bolzano-Bozen Seminar

  45. Expressivity of SPARQL Features Limitations Range shorthands Arithmetic in head Math functions Trigonometry functions Sub queries Aggregate functions Casting • Select-project-join • Arithmetic in body • Conjunction and disjunction • Ordering • String matching • External function calls (extension mechanism) A.J.G. Gray - Bolzano-Bozen Seminar

  46. Analysis of Test Queries Expressible queries: 1, 2, 3, 5, 6, 14, 15, 17, 19 A.J.G. Gray - Bolzano-Bozen Seminar

  47. Can RDB2RDF tools feasibly expose large science archives for data integration? A.J.G. Gray - Bolzano-Bozen Seminar

  48. Experiment • Time query evaluation • 5 out of 20 queries used • No joins • Systems compared: • Relational DB (Base line) • MySQL v5.1.25 • RDB2RDF tools • D2RQ v0.5.2 • SquirrelRDF v0.1 • RDF Triple stores • Jena v2.5.6 (SDB) • Sesame v2.1.3 (Native) SPARQL query SPARQL query SQL query RDB2RDF Relational DB Triple store Relational DB A.J.G. Gray - Bolzano-Bozen Seminar

  49. Experimental Configuration • 8 identical machines • 64 bit Intel Quad Core Xeon 2.4GHz • 4GB RAM • 100 GB Hard drive • Java 1.6 • Linux • 10 repetitions A.J.G. Gray - Bolzano-Bozen Seminar

  50. Performance Results 485,932 372,561 21,492 17,793 19,984 3,450 5,339 2,733 7,229 4,090 1,307 7,468 ms 1 A.J.G. Gray - Bolzano-Bozen Seminar

More Related