Using Ontologies to Enable Access to Multiple Heterogeneous Databases
Eduard Hovy, Information Sciences Institute, University of Southern California (in collaboration with Columbia University)
CARDGIS
Context: CARDGIS Project
Goal: enable access to multiple, heterogeneous Federal agency data sources through a single interface using standardized nomenclature, while accounting for semantic variability.
• Sources:
  • Energy Information Administration (quarterly CD-ROM).
  • Bureau of Labor Statistics (http://stats.bls.gov).
  • Census Bureau (CD-ROM for 1992 data).
  • California Energy Commission (weekly data at http://energy.ca.gov).
System Architecture
• Construction phase:
  • Deploy DBs
  • Extend ontology
• User phase:
  • Compose query
• Access phase:
  • Create DB query
  • Retrieve data
Components:
• Ontology Construction
  - DB analysis
  - text analysis
• Integrated Ontology
  - global terminology
  - source descriptions
  - integration axioms
• User Interface
  - ontology browser
  - query constructor
• Query Processor
  - reformulation
  - cost optimization
• Sources: R, S, T
So What is an Ontology?
• Desiderata:
  • ‘anchor points’ for terminology variants (salary, income…),
  • wide coverage,
  • some degree of taxonomic organization for inference/program behavior control.
• Terminological (not domain) ontology.
ISI’s SENSUS Ontology
• Taxonomy, multiple superclass links.
• Approx. 90,000 items.
• Top level: Penman Upper Model (ISI).
• Body: WordNet (Princeton), rearranged.
• Used at ISI for machine translation, text summarization, database access.
http://vigor.isi.edu:8002/sensus2/
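A taxonomy with multiple superclass links is a directed acyclic graph rather than a tree, so a concept can inherit along several paths. The following is a minimal sketch of that data structure only. The concept names and links are hypothetical and are not taken from SENSUS, and `ancestors` and `shared_superconcepts` are illustrative helper names.

```python
from collections import deque

# Hypothetical toy taxonomy: each concept maps to its direct superclasses.
# A concept may list more than one superclass, so this is a DAG, not a tree.
SUPERCLASSES = {
    "salary": ["income"],
    "wage": ["income"],
    "income": ["financial-quantity"],
    "financial-quantity": ["quantity", "economic-object"],
    "quantity": ["thing"],
    "economic-object": ["thing"],
    "thing": [],
}

def ancestors(concept, superclasses=SUPERCLASSES):
    """Return every superconcept reachable from `concept` (excluding itself)."""
    seen = set()
    queue = deque(superclasses.get(concept, []))
    while queue:
        parent = queue.popleft()
        if parent not in seen:
            seen.add(parent)
            queue.extend(superclasses.get(parent, []))
    return seen

def shared_superconcepts(a, b, superclasses=SUPERCLASSES):
    """Superconcepts that two concepts have in common."""
    return ancestors(a, superclasses) & ancestors(b, superclasses)
```

With a structure like this, terminology variants such as "salary" and "wage" can both be anchored under a common concept and retrieved through it.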
3 Ways of Building Ontologies
1. Combine existing knowledge resources: ontology alignment.
2. Learn from texts and Web: extract word families for thousands of concepts.
3. Parse dictionary definitions: extract information and place into ontology.
1. Cross-Ontology Alignment
Why create a new ontology? Merge and re-use existing ones!
Problem: automatically find corresponding concepts.
1. Text Matches
  • concept names (cognates; reward for delimiter confluence...)
  • textual definitions (string matching, demorphing, stop words...) [Knight & Luk 94, Dalianis & Hovy 98]
2. Hierarchy Matches
  • shared superconcepts, to filter ambiguity [Knight & Luk 94]
  • semantic distance [Agirre et al. 94]
3. Data Item and Form Matches
  • inter-concept relations [Ageno et al. 94; Rigau & Agirre 95]
  • slot-filler restrictions [Okumura & Hovy 94]
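The text-match heuristics above can be illustrated with a small sketch. It is not the alignment code from the cited work: it assumes a generic character-level similarity for concept names (Python's `difflib.SequenceMatcher`) and a Jaccard overlap of non-stop-words for definitions, and it omits demorphing. The stop-word list, the equal weighting, and the function names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Small illustrative stop-word list (an assumption, not from the source).
STOP_WORDS = {"a", "an", "the", "of", "or", "and", "to", "in", "for", "by", "as"}

def name_similarity(name_a, name_b):
    """Character-level similarity of two concept names, ignoring case and delimiters."""
    def norm(s):
        return s.lower().replace("-", " ").replace("_", " ")
    return SequenceMatcher(None, norm(name_a), norm(name_b)).ratio()

def content_words(definition):
    """Lower-cased words of a definition with punctuation and stop words removed."""
    words = {w.strip(".,;:()").lower() for w in definition.split()}
    return words - STOP_WORDS - {""}

def definition_overlap(def_a, def_b):
    """Jaccard overlap of the content words of two textual definitions."""
    a, b = content_words(def_a), content_words(def_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def match_score(concept_a, concept_b, name_weight=0.5):
    """Combine both signals; each concept is a (name, definition) pair."""
    (name_a, def_a), (name_b, def_b) = concept_a, concept_b
    return (name_weight * name_similarity(name_a, name_b)
            + (1 - name_weight) * definition_overlap(def_a, def_b))
```

Pairs whose combined score clears some threshold would then be passed on to the hierarchy and form matches for filtering.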
Cross-Ontology Alignment Results (1996, 1997)
• Ontologies:
  • SENSUS Upper Model (350)
  • CYC top region (2400) [Lenat; Lehmann 96]
  • MIKROKOSMOS (4790 concepts) [Mahesh 96]
  • SENSUS top region (6768)
• Recall (how many links were missed?): difficult to count! … 32.4 million pairs
• Precision (how many suggested links are correct?):
  • 0.252 (strict)
  • 0.517 (lenient)
• After 5 runs: 883 suggestions (= 13% of SENSUS candidates)
  • correct: 244 (= 3.6%)
  • near miss: 256 (= 3.8%)
  • wrong: 383 (= 5.6%)
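The figures on this slide can be cross-checked with a little arithmetic. The sketch below rests on two assumptions that the slide does not state: that the 32.4 million pairs are the product of the MIKROKOSMOS and SENSUS top-region sizes, and that the percentages after 5 runs are taken relative to the 6768 SENSUS top-region concepts. Under those assumptions the results agree with the slide's figures to within rounding.

```python
# Sizes quoted on the slide.
mikrokosmos_concepts = 4790
sensus_top_concepts = 6768

# Assumption: every MIKROKOSMOS concept is paired with every SENSUS top-region concept.
candidate_pairs = mikrokosmos_concepts * sensus_top_concepts

# Judged suggestions after 5 runs, as quoted on the slide.
judged = {"correct": 244, "near miss": 256, "wrong": 383}
total_suggestions = sum(judged.values())

# Assumption: the slide's percentages are relative to the SENSUS top region.
shares = {label: count / sensus_top_concepts for label, count in judged.items()}
suggested_share = total_suggestions / sensus_top_concepts

print(f"candidate pairs: {candidate_pairs:,}")
print(f"suggestions: {total_suggestions} ({suggested_share:.1%} of SENSUS candidates)")
for label, share in shares.items():
    print(f"{label}: {share:.2%}")
```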
2. The Websucker
• Corpus
  • Training set WSJ 1987: 16,137 texts (32 topics).
  • Test set WSJ 1988: 12,906 texts (31 topics).
  • Texts indexed into categories by humans.
• Signature data
  • 300 terms each, using tf.idf.
  • Word forms: single words, demorphed words, multi-word phrases.
• How many terms in signatures?
  • 5, 10, 15, …, 300 terms.
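Building a topic signature with tf.idf can be sketched as follows. The slide does not say which tf.idf variant was used, so this example assumes one common form: term frequency pooled over all texts of a topic, multiplied by the log of (number of topics / number of topics containing the term). The function name `topic_signatures` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def topic_signatures(docs_by_topic, size=300):
    """Map each topic to its `size` highest-weighted (term, weight) pairs.

    docs_by_topic: {topic: [tokenised text, ...]}, each text a list of terms.
    """
    # tf: term counts pooled over all texts of a topic.
    tf = {topic: Counter(tok for doc in docs for tok in doc)
          for topic, docs in docs_by_topic.items()}
    # df: number of topics in which each term occurs at least once.
    df = Counter(term for counts in tf.values() for term in counts)
    n_topics = len(tf)

    signatures = {}
    for topic, counts in tf.items():
        weighted = {term: count * math.log(n_topics / df[term])
                    for term, count in counts.items()}
        # Highest weight first; ties broken alphabetically for determinism.
        ranked = sorted(weighted.items(), key=lambda kv: (-kv[1], kv[0]))
        signatures[topic] = ranked[:size]
    return signatures

# Toy data (hypothetical), with signatures cut to 2 terms.
DOCS = {
    "aviation": [["aircraft", "engine", "the"], ["aircraft", "wing", "the"]],
    "woodwork": [["wood", "oak", "the"], ["wood", "joinery", "the"]],
}
SIGNATURES = topic_signatures(DOCS, size=2)
```

Varying `size` from 5 to 300 corresponds to the question of how many terms a signature needs.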
Pollution on the Web
Example signatures (term, weight):
• <MORTICE, w=33.7982> <WOODWORKING, w=20.9227> <TENNON, w=20.9227> <JOINERY, w=17.7038> <WOOD, w=15.8356> <HARDWOOD, w=14.4849> <JASON, w=14.4849> <DOTH, w=12.8755> <BRASH, w=12.8755> <OAK, w=12.8281> <WEDGE, w=11.9118> <FURNITURE, w=10.0792> <TOOL, w=9.19486> <SHAFT, w=8.17321>
• <STAR, w=75.1358> <ORION, w=55.8937> <PYRAMID, w=42.1494> <DNA, w=41.2331> <SOUL, w=31.1539> <IMPLOSION, w=23.8236> <KHUFU, w=19.3133> <GOLD, w=18.3897> <RECURSION, w=18.3258> <BELLATRIX, w=17.7038> <OSIRIS, w=17.7038> <PHI, w=16.4932> <EMBED, w=16.4932> <MAGNETIC, w=16.4932>
• <AIRCRAFT, w=207.998> <ENGINE, w=178.677> <WING, w=138.36> <PROPELLER, w=122.317> <FLY, w=103.187> <AIRPLANE, w=98.0431> <AVIATION, w=96.5663> <FLIGHT, w=85.3079> <AIR, w=80.1996> <WARBIRDS, w=72.4247> <PILOT, w=71.4707> <MPH, w=65.987> <CONTROL, w=65.9729> <FUEL, w=62.3078>
Cleanup: try various methods: tf.idf, χ², Latent Semantic Analysis...
3. Dictionary Extraction
Step 1: find unencumbered dictionary (Webster 1913).
Step 2: reformat and then parse entries (http://www.isi.edu/natural-language/dpp/).
  <hw>Babel</hw> <pos>n</pos> <sn>2</sn>
  [ SENT [ NP OR [ NP A/DT place/NN ] [ NP scene/NN ] ] [ PP of/IN [ NP AND [ NP noise/NN ] [ NP confusion/NN ] ] ] ] ;/:
  [ SENT [ NP a/DT confused/JJ mixture/NN ] [ PP of/IN [ NP sounds/NNS ] ] ,/, as/IN [ PP of/IN [ NP languages/NNS ] ] ] ./.
Step 3: identify individual propositions and their heads.
Step 4: convert prepositions to semantic relations (EM algorithm).
Step 5: place entries into ontology (not yet done).
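The entry format shown in Step 2 can be read back into a data structure with a few lines of code. This is a minimal sketch, not the parser behind the URL above. It assumes that `hw`, `pos`, and `sn` stand for headword, part of speech, and sense number, that brackets are space-separated as in the example, and that words carry Penn-Treebank-style tags after a slash. The function names are illustrative.

```python
import re

ENTRY = "<hw>Babel</hw> <pos>n</pos> <sn>2</sn>"
PARSE = ("[ SENT [ NP OR [ NP A/DT place/NN ] [ NP scene/NN ] ] "
         "[ PP of/IN [ NP AND [ NP noise/NN ] [ NP confusion/NN ] ] ] ]")

def read_fields(entry):
    """Extract <tag>value</tag> fields into a dict."""
    return dict(re.findall(r"<(\w+)>(.*?)</\1>", entry))

def read_tree(text):
    """Turn a space-separated bracketed parse into nested lists."""
    stack = [[]]
    for token in text.split():
        if token == "[":
            stack.append([])
        elif token == "]":
            if len(stack) == 1:
                raise ValueError("unbalanced brackets")
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(token)
    if len(stack) != 1:
        raise ValueError("unbalanced brackets")
    return stack[0]

def nouns(node):
    """Collect words tagged NN or NNS anywhere in the tree, in order."""
    found = []
    for child in node:
        if isinstance(child, list):
            found.extend(nouns(child))
        elif "/" in child and child.rsplit("/", 1)[1] in ("NN", "NNS"):
            found.append(child.rsplit("/", 1)[0])
    return found

FIELDS = read_fields(ENTRY)
TREE = read_tree(PARSE)
```

A nested-list tree of this kind is a convenient starting point for Step 3, where propositions and their heads are picked out.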
Disambiguating Extracted Info.
Identify propositions and their parts:
• Impression: “A communicating [of a mold or trait] [by an external force or influence]”
• Reflection: “The return [of light or sound waves] [by or as if by a mirror]”
by = AGENT or PATH?
  communication by force; return by mirror; return by road
of = OWNER or NUMBER-PART or SOURCE or …?
  the car of Joe; 1 of 15 people smoke; man of La Mancha
• Apply EM algorithm to disambiguate.
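How EM might be applied to this kind of ambiguity can be shown on a toy model. The sketch below is not the method used in the project, whose details the slides do not give. It assumes each observation is a (preposition, object class) pair, that the hidden relation is drawn from the candidates listed above, and that a handful of seed pseudo-counts break the symmetry between relations. The object classes, seeds, and function names are all hypothetical.

```python
from collections import defaultdict

# Candidate relations per preposition, as listed on the slide.
CANDIDATES = {"by": ["AGENT", "PATH"], "of": ["OWNER", "NUMBER-PART", "SOURCE"]}

# Hypothetical seed pseudo-counts (relation, object class) that break symmetry.
SEEDS = {("AGENT", "force"): 1.0, ("PATH", "location"): 1.0,
         ("OWNER", "person"): 1.0, ("NUMBER-PART", "group"): 1.0,
         ("SOURCE", "location"): 1.0}

def normalise(counts):
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}

def em_relations(observations, iterations=20, smoothing=0.1):
    """Estimate P(relation | prep) and P(object class | relation) with EM."""
    classes = sorted({c for _, c in observations} | {c for _, c in SEEDS})
    relations = [r for rels in CANDIDATES.values() for r in rels]
    p_rel = {prep: {r: 1.0 / len(rels) for r in rels}
             for prep, rels in CANDIDATES.items()}
    p_cls = {r: normalise({c: SEEDS.get((r, c), 0.0) + smoothing for c in classes})
             for r in relations}
    for _ in range(iterations):
        rel_counts = {prep: defaultdict(float) for prep in CANDIDATES}
        cls_counts = {r: defaultdict(float) for r in relations}
        # E-step: spread each observation over its candidate relations.
        for prep, cls in observations:
            posterior = normalise({r: p_rel[prep][r] * p_cls[r][cls]
                                   for r in CANDIDATES[prep]})
            for r, weight in posterior.items():
                rel_counts[prep][r] += weight
                cls_counts[r][cls] += weight
        # M-step: re-estimate from expected counts (seeds act as a prior).
        for prep, rels in CANDIDATES.items():
            p_rel[prep] = normalise({r: rel_counts[prep][r] + smoothing for r in rels})
        for r in relations:
            p_cls[r] = normalise({c: cls_counts[r][c] + SEEDS.get((r, c), 0.0)
                                  + smoothing for c in classes})
    return p_rel, p_cls

def best_relation(prep, cls, p_rel, p_cls):
    """Most probable relation for a preposition and object class."""
    return max(CANDIDATES[prep], key=lambda r: p_rel[prep][r] * p_cls[r][cls])

# Toy observations (hypothetical object classes for the slide's examples).
OBSERVATIONS = [("by", "force"), ("by", "force"), ("by", "location"),
                ("of", "person"), ("of", "group"), ("of", "location")]
P_REL, P_CLS = em_relations(OBSERVATIONS)
```

On this toy data, `best_relation("by", "location", P_REL, P_CLS)` picks PATH, matching the "return by road" reading. A case such as "return by mirror" would need an object class that the seeds do not cover.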
Dictionary Extraction Results
Evaluation for sentence #1: "As a prefix to english words."
  0.000000000621871299: NIL relation<abst PHRASAL speech_act
  Score: 1/1 = 1
Evaluation for sentence #13: "Gives up to underwriters."
  0.000000041080864587: create,make NIL RECIPIENT capitalist<so
  0.000000038652300894: transmit_thou NIL RECIPIENT capitalist<so
  Score: 1/2 = 0.5
Evaluation for sentence #14: "Gives all claim to the property."
  0.000000002594561718: emit,utter human_action PHRASAL possessn>tr
  0.000000002564569212: chnge_pos human_action PHRASAL possessn>tr
  0.000000002451809783: create,make human_action PHRASAL possessn>tr
  0.000000002368122454: cogitate human_action PHRASAL possessn>tr
  0.000000002366411877: utilize human_action PHRASAL possessn>tr
  0.000000002307022303: transmit_thou human_act PHRASAL possessn>tr
  0.000000002177555675: transfer>comm human_act PHRASAL possessn>tr
  0.000000002049017956: chnge>go_mad human_act PHRASA possessn>tr
  Score: 1/8 = 0.125
Ambiguity reduction
  Readings   Instances
  60         1
  48         1
  36         1
  24         1
  18         7
  12         8
  10         2
  6          764
  5          12
  4          20
  3          108
  2          310
  1          902
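The ambiguity-reduction table can be summarized with a few lines of arithmetic. The sketch assumes that "Readings" is the number of readings remaining for an instance and "Instances" is how many instances ended with that number. That interpretation is inferred from the column headings and is not stated on the slide.

```python
# Readings remaining -> number of instances, copied from the table.
READINGS_TO_INSTANCES = {
    60: 1, 48: 1, 36: 1, 24: 1, 18: 7, 12: 8, 10: 2,
    6: 764, 5: 12, 4: 20, 3: 108, 2: 310, 1: 902,
}

total_instances = sum(READINGS_TO_INSTANCES.values())
total_readings = sum(r * n for r, n in READINGS_TO_INSTANCES.items())

# Share of instances left with exactly one reading (fully disambiguated).
unambiguous_share = READINGS_TO_INSTANCES[1] / total_instances
# Average number of readings left per instance.
mean_readings = total_readings / total_instances

print(f"instances: {total_instances}")
print(f"fully disambiguated: {unambiguous_share:.1%}")
print(f"mean readings per instance: {mean_readings:.2f}")
```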
The Future: Terminology Standard?
Reasons for terminology standardization:
1. Non-duplication
  • similar domain models built for many applications
2. Consistency
  • across experts within domain, and across domains
3. Efficient model building
  • complex: many decisions required simultaneously
ANSI Ad Hoc Group on Ontology Standards (NCITS): draw together ontology work worldwide.
IBM (Santa Teresa), Stanford, ISI, CYC, TextWise, EDR, CSLI, NMSU, Lawrence Livermore, OnTek, Government...
Meetings: 3/96, 9/96, 3/97, 11/97, 1/98, (6/98)…
Questions?