
Using Ontologies to Enable Access to Multiple Heterogeneous Databases (CARDGIS)


Presentation Transcript


  1. Using Ontologies to Enable Access to Multiple Heterogeneous Databases (CARDGIS)
  Eduard Hovy, Information Sciences Institute, University of Southern California (in collaboration with Columbia University)

  2. Context: CARDGIS Project
  Goal: enable access to multiple, heterogeneous Federal agency data sources through a single interface using standardized nomenclature, while accounting for semantic variability.
  Sources:
  • Energy Information Administration (quarterly CD-ROM).
  • Bureau of Labor Statistics (http://stats.bls.gov).
  • Census Bureau (CD-ROM for 1992 data).
  • California Energy Commission (weekly data at http://energy.ca.gov).

  3. System Architecture
  Construction phase: deploy DBs, extend ontology.
  • Ontology Construction: DB analysis, text analysis.
  • Integrated Ontology: global terminology, source descriptions, integration axioms.
  User phase: compose query.
  • User Interface: ontology browser, query constructor.
  Access phase: create DB query, retrieve data.
  • Query Processor: reformulation, cost optimization.
  • Sources: R, S, T.
  (A sketch of the reformulation step follows below.)
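To make the Query Processor's reformulation step concrete, here is a minimal sketch, assuming a toy mapping from global ontology terms to source-specific table and column names; the source names, schema names, and the mapping table are illustrative assumptions, not the actual CARDGIS source descriptions.

```python
# A minimal sketch of ontology-driven query reformulation, assuming a toy
# mapping from a global ontology term to source-specific table/column names.
# The source names and mappings are illustrative, not the real CARDGIS ones.

SOURCE_DESCRIPTIONS = {
    'BLS':    {'income': ('wages_tbl', 'annual_wage'), 'region': ('wages_tbl', 'msa_code')},
    'Census': {'income': ('pums92', 'hh_income'),      'region': ('pums92', 'county_fips')},
}

def reformulate(global_query, source):
    """Rewrite a query over ontology terms into one over a source's own schema."""
    mapping = SOURCE_DESCRIPTIONS[source]
    table, column = mapping[global_query['select']]
    _, filter_col = mapping[global_query['where'][0]]
    return (f"SELECT {column} FROM {table} "
            f"WHERE {filter_col} = '{global_query['where'][1]}'")

# One global query, two source-specific SQL strings.
q = {'select': 'income', 'where': ('region', '06037')}
for src in SOURCE_DESCRIPTIONS:
    print(src, '->', reformulate(q, src))
```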

  4. So What is an Ontology?
  Desiderata:
  • "anchor points" for terminology variants (salary, income, …),
  • wide coverage,
  • some degree of taxonomic organization for inference / program-behavior control.
  A terminological (not domain) ontology. (A small illustration follows below.)
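To make the "anchor point" idea concrete, here is a minimal sketch of a concept node that several terminology variants attach to, with superclass links supporting simple taxonomic inference; the class layout, concept names, and term sets are illustrative assumptions, not SENSUS structures.

```python
# A minimal sketch of an 'anchor point': one concept node that several
# terminology variants (salary, income, wages, ...) all point to, with
# superclass links for taxonomic inference. Names here are illustrative.

from dataclasses import dataclass, field

@dataclass
class Concept:
    name: str
    parents: list = field(default_factory=list)   # multiple superclasses allowed
    terms: set = field(default_factory=set)       # surface variants anchored here

    def isa(self, other):
        """True if `other` is reachable via superclass links (simple DFS)."""
        return other is self or any(p.isa(other) for p in self.parents)

asset  = Concept('ASSET')
money  = Concept('MONEY', parents=[asset])
income = Concept('PERSONAL-INCOME', parents=[money],
                 terms={'salary', 'income', 'wages', 'earnings'})

term_index = {t: income for t in income.terms}   # term -> anchor concept
print(term_index['salary'].name)                 # PERSONAL-INCOME
print(income.isa(asset))                         # True: taxonomic inference
```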

  5. ISI's SENSUS Ontology
  • Taxonomy, with multiple superclass links.
  • Approx. 90,000 items.
  • Top level: Penman Upper Model (ISI).
  • Body: WordNet (Princeton), rearranged.
  • Used at ISI for machine translation, text summarization, and database access.
  http://vigor.isi.edu:8002/sensus2/

  6. 3 Ways of Building Ontologies
  1. Combine existing knowledge resources: ontology alignment.
  2. Learn from texts and the Web: extract word families for thousands of concepts.
  3. Parse dictionary definitions: extract information and place it into the ontology.

  7. 1. Cross-Ontology Alignment
  Why create a new ontology? Merge and re-use existing ones!
  Problem: automatically find corresponding concepts.
  1. Text matches (see the sketch below):
  • concept names (cognates; reward for delimiter confluence...)
  • textual definitions (string matching, demorphing, stop words...) [Knight & Luk 94, Dalianis & Hovy 98]
  2. Hierarchy matches:
  • shared superconcepts, to filter ambiguity [Knight & Luk 94]
  • semantic distance [Agirre et al. 94]
  3. Data item and form matches:
  • inter-concept relations [Ageno et al. 94; Rigau & Agirre 95]
  • slot-filler restrictions [Okumura & Hovy 94]
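As a rough illustration of the text-match heuristics only, the following sketch scores a candidate concept pair by name similarity and by definition-word overlap after stop-word removal; the stop list, scoring weights, and the two toy concepts are assumptions, not the Knight & Luk or Dalianis & Hovy implementations.

```python
# A rough sketch of the name / definition text-match heuristics used to
# propose cross-ontology alignments. Stop list, scoring weights, and the
# two toy concepts are illustrative assumptions.

from difflib import SequenceMatcher

STOP = {'a', 'an', 'the', 'of', 'or', 'and', 'to', 'in'}

def name_score(a, b):
    """Cognate-style similarity between concept names (delimiters stripped)."""
    norm = lambda s: s.lower().replace('-', '').replace('_', '')
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

def defn_score(a, b):
    """Jaccard overlap of definition words after stop-word removal."""
    wa = {w for w in a.lower().split() if w not in STOP}
    wb = {w for w in b.lower().split() if w not in STOP}
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

c1 = ('WAGE-INCOME', 'money earned regularly for work or services')
c2 = ('wage_income', 'payment received for work performed')

score = 0.6 * name_score(c1[0], c2[0]) + 0.4 * defn_score(c1[1], c2[1])
print(round(score, 2))   # high score -> propose as an alignment candidate
```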

  8. Cross-Ontology Alignment Results (1996, 1997)
  Ontologies:
  • SENSUS Upper Model (350 concepts)
  • CYC top region (2,400 concepts) [Lenat; Lehmann 96]
  • MIKROKOSMOS (4,790 concepts) [Mahesh 96]
  • SENSUS top region (6,768 concepts)
  Recall (how many links were missed?): difficult to count (roughly 32.4 million candidate pairs).
  Precision (how many suggested links are correct?):
  • 0.252 (strict)
  • 0.517 (lenient)
  After 5 runs, 883 suggestions (= 13% of SENSUS candidates):
  • correct: 244 (= 3.6%)
  • near miss: 256 (= 3.8%)
  • wrong: 383 (= 5.6%)

  9. 2. The Websucker
  Corpus:
  • Training set, WSJ 1987: 16,137 texts (32 topics).
  • Test set, WSJ 1988: 12,906 texts (31 topics).
  • Texts indexed into categories by humans.
  Signature data:
  • 300 terms each, ranked using tf.idf (see the sketch below).
  • Word forms: single words, demorphed words, multi-word phrases.
  • How many terms in signatures? 5, 10, 15, …, 300 terms.
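For orientation, here is a minimal sketch of how a tf.idf-ranked topic signature can be built; the toy corpus and the particular tf.idf weighting are assumptions and not the Websucker's actual pipeline.

```python
# A minimal sketch of building a tf.idf-ranked topic signature.
# The toy corpus and the particular tf.idf weighting are assumptions;
# the Websucker used 300 terms per topic over WSJ categories.

import math
from collections import Counter

docs = {
    'aviation': ['aircraft engine wing propeller flight',
                 'pilot aircraft fuel engine aviation'],
    'finance':  ['stock market bond yield index',
                 'market trading bond price'],
}

# Document frequency over all documents in all topics.
all_docs = [d.split() for texts in docs.values() for d in texts]
df = Counter(w for d in all_docs for w in set(d))
N = len(all_docs)

def signature(topic, n=5):
    """Top-n terms for a topic, ranked by tf * idf."""
    tf = Counter(w for d in docs[topic] for w in d.split())
    scored = {w: c * math.log(N / df[w]) for w, c in tf.items()}
    return sorted(scored.items(), key=lambda x: -x[1])[:n]

print(signature('aviation'))   # e.g. [('aircraft', ...), ('engine', ...), ...]
```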

  10. Pollution on the Web
  Three example signatures (term, weight):
  • MORTICE (33.7982), WOODWORKING (20.9227), TENNON (20.9227), JOINERY (17.7038), WOOD (15.8356), HARDWOOD (14.4849), JASON (14.4849), DOTH (12.8755), BRASH (12.8755), OAK (12.8281), WEDGE (11.9118), FURNITURE (10.0792), TOOL (9.19486), SHAFT (8.17321)
  • STAR (75.1358), ORION (55.8937), PYRAMID (42.1494), DNA (41.2331), SOUL (31.1539), IMPLOSION (23.8236), KHUFU (19.3133), GOLD (18.3897), RECURSION (18.3258), BELLATRIX (17.7038), OSIRIS (17.7038), PHI (16.4932), EMBED (16.4932), MAGNETIC (16.4932)
  • AIRCRAFT (207.998), ENGINE (178.677), WING (138.36), PROPELLER (122.317), FLY (103.187), AIRPLANE (98.0431), AVIATION (96.5663), FLIGHT (85.3079), AIR (80.1996), WARBIRDS (72.4247), PILOT (71.4707), MPH (65.987), CONTROL (65.9729), FUEL (62.3078)
  Cleanup: try various methods: tf.idf, χ², Latent Semantic Analysis...
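One of the cleanup methods mentioned above is χ²; as a rough sketch, the chi-squared statistic below measures how strongly a term is associated with a topic's documents versus the rest of the collection. The contingency counts are invented for illustration and are not Websucker data.

```python
# A rough sketch of one possible cleanup filter: the chi-squared statistic
# for how strongly a term is associated with a topic versus the rest of the
# collection. The counts below are illustrative, not Websucker data.

def chi_squared(a, b, c, d):
    """2x2 contingency: a = topic docs with term, b = topic docs without,
    c = other docs with term, d = other docs without."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0

# A term concentrated in the topic (e.g. 40 of 50 topic docs, 10 of 950 others):
print(round(chi_squared(40, 10, 10, 940), 1))   # large -> keep the term

# A noisy term that occurs a little everywhere is only weakly associated:
print(round(chi_squared(2, 48, 30, 920), 2))    # small -> likely pollution
```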

  11. 3. Dictionary Extraction
  Step 1: find an unencumbered dictionary (Webster 1913).
  Step 2: reformat and then parse entries (http://www.isi.edu/natural-language/dpp/). Example:
    <hw>Babel</hw> <pos>n</pos> <sn>2</sn>
    [ SENT [ NP OR [ NP A/DT place/NN ] [ NP scene/NN ] ] [ PP of/IN [ NP AND [ NP noise/NN ] [ NP confusion/NN ] ] ] ] ;/:
    [ SENT [ NP a/DT confused/JJ mixture/NN ] [ PP of/IN [ NP sounds/NNS ] ] ,/, as/IN [ PP of/IN [ NP languages/NNS ] ] ] ./.
  Step 3: identify individual propositions and their heads (see the sketch below).
  Step 4: convert prepositions to semantic relations (EM algorithm).
  Step 5: place entries into the ontology (not yet done).
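As an illustration of Step 3, here is a minimal sketch that walks a bracketed parse like the Babel entry above, pulls out the head noun(s) of the first NP as a genus-term candidate, and collects preposition/head pairs as candidate relations; it is not the ISI dpp tool, and the helper functions are assumptions.

```python
# A minimal sketch (not the ISI dpp tool) of pulling a genus term and
# preposition/head pairs out of a bracketed definition parse like the
# Babel example above. The bracket format and tags come from the slide;
# everything else is an illustrative assumption.

def parse_brackets(tokens):
    """Recursively build (label, children) trees from '[ LABEL ... ]' tokens."""
    node = []
    while tokens:
        tok = tokens.pop(0)
        if tok == '[':
            label = tokens.pop(0)
            node.append((label, parse_brackets(tokens)))
        elif tok == ']':
            return node
        else:
            node.append(tok)
    return node

def heads(tree, want=('NN', 'NNS')):
    """Yield the word part of word/TAG leaves whose tag is in `want`."""
    for item in tree:
        if isinstance(item, tuple):
            yield from heads(item[1], want)
        elif '/' in item and item.rsplit('/', 1)[1] in want:
            yield item.rsplit('/', 1)[0]

defn = ("[ SENT [ NP OR [ NP A/DT place/NN ] [ NP scene/NN ] ] "
        "[ PP of/IN [ NP AND [ NP noise/NN ] [ NP confusion/NN ] ] ] ] ;/:")

tree = parse_brackets(defn.split())
sent_label, sent_children = tree[0]

# Genus term candidate(s): head noun(s) of the first NP under SENT.
first_np = next(c for c in sent_children if isinstance(c, tuple) and c[0] == 'NP')
print('genus:', list(heads([first_np])))          # ['place', 'scene']

# Candidate relations: (preposition, PP head noun) pairs.
for label, children in (c for c in sent_children if isinstance(c, tuple)):
    if label == 'PP':
        prep = next(t for t in children if isinstance(t, str) and t.endswith('/IN'))
        print('relation candidate:', prep.split('/')[0], '->', list(heads(children)))
```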

  12. Disambiguating Extracted Information
  Identify propositions and their parts:
  • Impression: "A communicating [of a mold or trait] [by an external force or influence]"
  • Reflection: "The return [of light or sound waves] [by or as if by a mirror]"
  Which semantic relation does each preposition express?
  • by = AGENT or PATH? (communication by force; return by mirror; return by road)
  • of = OWNER or NUMBER-PART or SOURCE or …? (the car of Joe; 1 of 15 people smoke; man of La Mancha)
  Apply the EM algorithm to disambiguate (a toy sketch follows below).
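Here is a toy EM-style sketch of the preposition-to-relation step: a latent relation is assigned to each extracted (governor, preposition, head) triple, a few seed labels break the symmetry, and E/M steps re-estimate the relation distributions. The relation inventory, seeds, smoothing constant, and the tiny corpus are illustrative assumptions, not the actual ISI algorithm or data.

```python
# A toy EM sketch for the prepositions-to-relations step (slide 12).
# The relation inventory, seed labels, smoothing, and the tiny "corpus"
# below are illustrative assumptions, not the actual ISI data.

from collections import defaultdict

RELATIONS = ['AGENT', 'PATH']

# (governor, preposition, pp_head) triples extracted from definitions.
corpus = [
    ('communicating', 'by', 'force'),
    ('return', 'by', 'mirror'),
    ('return', 'by', 'road'),
    ('travel', 'by', 'road'),
]

# A few soft seed labels (e.g. from hand-tagged examples) to break symmetry.
seeds = {('communicating', 'by', 'force'): {'AGENT': 0.9, 'PATH': 0.1},
         ('travel', 'by', 'road'):        {'AGENT': 0.1, 'PATH': 0.9}}

# Parameters: P(relation) and P(pp_head | relation), initialized flat.
p_rel = {r: 1.0 / len(RELATIONS) for r in RELATIONS}
p_head = defaultdict(lambda: 1.0)   # (relation, head) -> weight
heads_seen = {t[2] for t in corpus}

for _ in range(20):
    # E-step: posterior over relations for each triple.
    posteriors = []
    for triple in corpus:
        if triple in seeds:
            post = dict(seeds[triple])
        else:
            scores = {r: p_rel[r] * p_head[(r, triple[2])] for r in RELATIONS}
            z = sum(scores.values())
            post = {r: s / z for r, s in scores.items()}
        posteriors.append(post)

    # M-step: re-estimate P(relation) and P(head | relation) from soft counts.
    rel_count = defaultdict(float)
    head_count = defaultdict(float)
    for triple, post in zip(corpus, posteriors):
        for r, p in post.items():
            rel_count[r] += p
            head_count[(r, triple[2])] += p
    total = sum(rel_count.values())
    p_rel = {r: rel_count[r] / total for r in RELATIONS}
    p_head = {(r, h): (head_count[(r, h)] + 0.1) / (rel_count[r] + 0.1 * len(heads_seen))
              for r in RELATIONS for h in heads_seen}

# 'return by road' drifts toward PATH via the seeded 'road' evidence;
# 'return by mirror' stays near the prior with this tiny corpus.
for triple, post in zip(corpus, posteriors):
    print(triple, {r: round(p, 2) for r, p in post.items()})
```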

  13. Dictionary Extraction Results
  Evaluation for sentence #1: "As a prefix to english words."
    0.000000000621871299: NIL relation<abst PHRASAL speech_act
    Score: 1/1 = 1
  Evaluation for sentence #13: "Gives up to underwriters."
    0.000000041080864587: create,make NIL RECIPIENT capitalist<so
    0.000000038652300894: transmit_thou NIL RECIPIENT capitalist<so
    Score: 1/2 = 0.5
  Evaluation for sentence #14: "Gives all claim to the property."
    0.000000002594561718: emit,utter human_action PHRASAL possessn>tr
    0.000000002564569212: chnge_pos human_action PHRASAL possessn>tr
    0.000000002451809783: create,make human_action PHRASAL possessn>tr
    0.000000002368122454: cogitate human_action PHRASAL possessn>tr
    0.000000002366411877: utilize human_action PHRASAL possessn>tr
    0.000000002307022303: transmit_thou human_act PHRASAL possessn>tr
    0.000000002177555675: transfer>comm human_act PHRASAL possessn>tr
    0.000000002049017956: chnge>go_mad human_act PHRASA possessn>tr
    Score: 1/8 = 0.125
  Ambiguity reduction:
    Readings  Instances
      60          1
      48          1
      36          1
      24          1
      18          7
      12          8
      10          2
       6        764
       5         12
       4         20
       3        108
       2        310
       1        902
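For orientation only, the ambiguity-reduction table can be summarized with a quick weighted average; the short script below is just arithmetic over the table above, not a number reported in the talk.

```python
# Weighted-average ambiguity from the readings/instances table above
# (arithmetic over the slide's numbers, not a figure from the talk).

table = {60: 1, 48: 1, 36: 1, 24: 1, 18: 7, 12: 8, 10: 2,
         6: 764, 5: 12, 4: 20, 3: 108, 2: 310, 1: 902}

instances = sum(table.values())
mean_readings = sum(r * n for r, n in table.items()) / instances
print(instances, round(mean_readings, 2))   # 2137 instances, ~3.27 readings each
```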

  14. The Future: A Terminology Standard?
  Reasons for terminology standardization:
  1. Non-duplication: similar domain models are built for many applications.
  2. Consistency: across experts within a domain, and across domains.
  3. Efficient model building: complex, with many decisions required simultaneously.
  ANSI Ad Hoc Group on Ontology Standards (NCITS): draw together ontology work worldwide.
  Participants: IBM (Santa Teresa), Stanford, ISI, CYC, TextWise, EDR, CSLI, NMSU, Lawrence Livermore, OnTek, Government...
  Meetings: 3/96, 9/96, 3/97, 11/97, 1/98, (6/98)…

  15. Questions?
