500 likes | 519 Views
This paper discusses the use of the SALT project to enhance ontology generation by integrating large-scale termbases with ontological structures and converting them into machine-readable XML format.
E N D
Peppering knowledge sources with SALT (Boosting conceptual content for ontology generation) Deryle Lonsdale, Yihong Ding, David W. Embley, Alan Melby Brigham Young University lonz@byu.edu, {ding,embley}@cs.byu.edu, akm@byu.edu AAAI 2002 WS
Acknowledgements • Co-authors (Embley, Ding) • EU Fifth Framework IST/HLT 3.4.1 • NSF Information and Intelligent Systems grant IIS-0083127 • Gerhard Budin (Eurodicautom data) • Sergei Nirenburg (Mikrokosmos ontology) AAAI 2002 WS
Outline • Termbases and lexicons: (re)use(s) • The SALT and TIDIE projects • Data modeling and data resources • Termbase conversion • Ontology generation • Results and evaluation • Conclusions AAAI 2002 WS
Termbases • Terminology databases for humans in multilingual documentation industry • Several models, formats; often concept-oriented in nature • Termium, Eurodicautom, etc. AAAI 2002 WS
Lexicons • NLP applications: IR, MT, NLU, speech understanding • Widely varying data formats • Description at various levels of linguistic theory AAAI 2002 WS
Sharing resources • Integration is the trend • Lexicons (OLIF for MT system lexicons) • Termbases (MARTIF for human termbases) • Lexicons and termbases • Needed: principled data-modeling approach • Wide variety of information to be treated • Wide range of formats currently in use AAAI 2002 WS
The SALT project • SALT: Standards-based Access service to multilingual Lexicons and Terminologies (www.ttt.org/salt). • International cooperation, standards for coding and interchange of linguistic data, and the combining of technologies • Several partners (BYU TRG, KSU, etc.) • Data modeling approach to addresses the problem of interchange among diverse collections of such data, including their ontological substructure AAAI 2002 WS
The SALT approach • Goal: provide • Modularity • differentiate core structure vs. data category specification • Coherence • use a meta-model • Flexibility • Support interoperable alternative representations • Modular meta-model approach • Implemented in various settings • Ongoing refinement: model’s coverage AAAI 2002 WS
The TIDIE project • TIDIE: Target-based Independent-of-Document Information Extraction (www.deg.byu.edu) • Ontology-based data extraction • Conceptual modeling of real-world applications • Narrow, data-rich domains • Leverage (or build) custom ontologies for target-based extraction AAAI 2002 WS
Leverage this … … to do this Information exchange Source Target Information Extraction Schema Matching AAAI 2002 WS
Information Extraction • Examine/retrieve information from documents to fill information from user-supplied template • Requires some user-oriented specification of information • Our approach: finding, extracting, structuring, and synthesizing information is easier given a conceptual-model-based ontology AAAI 2002 WS
Extracting pertinent information from documents AAAI 2002 WS
Year Price 1..* 1..* 1..* has has Make 1..* Mileage 0..1 0..1 0..1 0..1 has has Car 0..1 0..1 0..* is for PhoneNr has has 1..* Model 0..1 1..* 1..* has Feature 1..* Extension A Conceptual Modeling Solution AAAI 2002 WS
Car-Ads Ontology Car [->object]; Car [0..1] has Year [1..*]; Car [0..1] has Make [1..*]; Car [0...1] has Model [1..*]; Car [0..1] has Mileage [1..*]; Car [0..*] has Feature [1..*]; Car [0..1] has Price [1..*]; PhoneNr [1..*] is for Car [0..*]; PhoneNr [0..1] has Extension [1..*]; Year matches [4] constant {extract “\d{2}”; context "([^\$\d]|^)[4-9]\d,[^\d]"; substitute "^" -> "19"; }, … End; AAAI 2002 WS
Car Feature 0001 Auto 0001 AC 0002 Black 0002 4 door 0002 tinted windows 0002 Auto 0002 pb 0002 ps 0002 cruise 0002 am/fm 0002 cassette stero 0002 a/c 0003 Auto 0003 jade green 0003 gold Car Year Make Model Mileage Price PhoneNr 0001 1989 Subaru SW $1900 (363)835-8597 0002 1998 Elandra (336)526-5444 0003 1994 HONDA ACCORD EX 100K (336)526-1081 Recognition and Extraction AAAI 2002 WS
Lexical resources for data modeling • Information extraction also requires knowledge representations with terminological and conceptual content. • Extraction ontology knowledge sources must: • be of a general nature • contain meaningful relationships • already exist in machine-readable form • have a straightforward conversion into XML. • This paper: create, leverage large-scale termbase • some ontological structure • reformatted according to the SALT standard • converted into μK-compliant XML for use by the ontology generator AAAI 2002 WS
Eurodicautom • Well-known, widely-used termbase • > 1 million concept entries • Wide range of topics • Entries are multilingual • Entry information: sources cited, input/approval dates, … • Single-word terms (e.g. “generator”) or multi-word expressions (e.g. “black humus”) • Entries each have Lenoch subject-area code • Hierarchical representation for classifying terms (and by extension their related concepts) AAAI 2002 WS
Partial Eurodicautom entry %%CM AG4 CH6 GO6 %%DA %%VE lavmosetørv %%RF A.Klougart %%EN %%VE black humus %%RF CILF,Dict.Agriculture,ACCT,1977 %%IT %%VE humus nero %%RF BTB %%ES %%VE humus negro %%RF CILF,Dict.Agriculture,ACCT,1977 %%SV %%VE sumpjord %%RF Mats Olsson,SLU(1997) AAAI 2002 WS
Sample Lenoch codes AD Public Administration - Private Administration - Offices AD1 general aspects of the subject field AD2 public and private organisations AD3 publications & documentary search AD31 documentation and information systems AD4 administrative staff AD5 public procurement AD51 expropriation in the public interest TEH testing methods TEH1 general aspects of testing methods TEH2 non-destructive testing TEH21 chemical tests TEH22 photometrical testing TEH221 X-ray spectrometrical testing AAAI 2002 WS
Converting the termbase • Use several thousand English terms and their subject codes • %%CM line lists three Lenoch codes: • AG4 (representing the subclass AGRONOMY), • CH6 (representing ANALYTICAL-CHEMISTRY) • GO6 (representing GEOMORPH-OLOGY). • Convert termbase entries via the SALT-developed TBX termbase exchange framework • XML-based refinement of MARTIF • Convert to μK XML format used by ontology engine • Result: TBX-mediated conversion from native Eurodicautom terms to the final XML-specified ontology (μK) • Lenoch codes re-interpreted as typical hierarchical relations (e.g. IS-A and SUBCLASS) AAAI 2002 WS
Conversion process SALT Eurodicautom (native) Eurodicautom (TBX) Eurodicautom (μK) Lenoch AAAI 2002 WS
Eurodicautom-TBX encoding <?xml version="1.0" encoding="UTF-8"?> <martif xmlns="x-schema:XLTcsV04.xml" lang="en" type="DXLT"> <martifHeader> <fileDesc> <sourceDesc> <p>sample Eurodicautom entry</p> </sourceDesc> </fileDesc> <encodingDesc> <p type="DCSName">DXLTdv04.xml</p> </encodingDesc> </martifHeader> <text> <body> <termEntry id="eid-EDIC-BTB-DAG77-63"> <admin type="originatingInstitution">BTB</admin> <admin type="projectSubset">DAG77</admin> <descrip type="reliabilityCode">4</descrip> <langSet lang="pt"> <ntig> <termGrp> <term>souto</term> <termNote type="termType">fullForm</termNote> </termGrp> <admin type="conceptID">BTB-DAG77-63</admin> <admin target="bib-Mock" type="sourceIdentifier">V.Correia,Engº Agrónomo,PDR Vale do Lima</admin> </ntig> <ntig> <termGrp> <term>minifúndio</term> <termNote type="termType">fullForm</termNote> </termGrp> <admin type="conceptID">BTB-DAG77-63</admin> <admin target="bib-Mock" type="sourceIdentifier">V.Correia,Engº Agrónomo,PDR Vale do Lima</admin> </ntig> </langSet> </termEntry> AAAI 2002 WS
Derived XML ontology <RECORD> <CONCEPT>xenobiotic substances</CONCEPT> <SLOT>SUBCLASSES</SLOT> <FACET>VALUE/FACET> <FILLER>hazardous raw materials </FILLER> <UID>0</UID> </RECORD> <RECORD> <CONCEPT>physical nuisances</CONCEPT> <SLOT>SUBCLASSES</SLOT> <FACET>VALUE/FACET> <FILLER>ambient light</FILLER> <UID>0</UID> </RECORD> <RECORD> <CONCEPT>financial statistics</CONCEPT> <SLOT>IS-A</SLOT> <FACET>VALUE/FACET> <FILLER>economic statistics</FILLER> <UID>0</UID> </RECORD> …. AAAI 2002 WS
Ontology generation • Goal: specify an ontology for information extraction purposes • Problem: complex, tedious, costly • Ideally: automatically generate schemas, ontologies • Source: natural-language text, tables, etc. AAAI 2002 WS
Ontology generation overview AAAI 2002 WS
Knowledge sources • Mikrokosmos (μK) ontology • About 5,000 hierarchically-arranged concepts • Fairly high connectivity ( about 14 inter-concept links per node) • Fairly general content, inheritance of properties • Data frame library • regular-expression templates for matching structured low-level lexical items (e.g. measurements, dates, currency expressions, and phone numbers) • provide information for conceptual matching via inheritance • Lexicons (e.g. onomastica, WordNet synsets) • Domain-specific training documents AAAI 2002 WS
Knowledge integration AAAI 2002 WS
Methodology • Preprocess input knowledge sources: • Integrate: map lexicon content and data frame templates to nodes in the merged ontology • Extract: match information from training documents collection • Parse, tokenize, regularize lexical content • Generate the ontology: four-stage generation process • concept selection • relationship retrieval • constraint discovery • refinement of the output ontology AAAI 2002 WS
Processing input documents AAAI 2002 WS
Concept selection • Finding which subset of the ontology’s concepts is of interest to a user • Concepts are selected via string matches between textual content and the ontological data. • Three different selection heuristics • concept-name matching • concept-value matching • data-frame pattern matching • String matching plus: • word synonym matching: WordNet synonym sets • multi-word term matching: bag-of-words(CAPITAL-CITY is considered a synonym of capital and city) AAAI 2002 WS
Concept selection algorithm PROCEDURE ConceptSelection(Tdoc, Kbase) SourceDoc = Parse(Tdoc); PrimarySelectedConceptsList = MikroSelection(M-Ontology); SecondarySelectedConceptsList = DataFrameSelection(DF-Library); ConflictHandling(); SelectedSubgraphGeneration(); AAAI 2002 WS
Basic Selection Strategy • Select from Mikrokosmos Ontology • Afghanistan • smaller than Texas. • Area: 648,000 sq. km. • Capital--Kabul, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population:17.7 million. • Agriculture: Wheat, corn, barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton. AAAI 2002 WS
Basic Selection Strategy • Select from Mikrokosmos Ontology • concept names and their synonyms • Afghanistan • smaller than Texas. • Area<GeographicalArea>: 648,000 sq. km. • Capital<CapitalCity><FinancialCapital>--Kabul, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population<Population>:17.7 million. • Agriculture:Wheat, corn, barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton. AAAI 2002 WS
Basic Selection Strategy • Select from Mikrokosmos Ontology • concept names and their synonyms • concept values and their synonyms • Afghanistan<Nation> • smaller than Texas<USState>. • Area<GeographicalArea>: 648,000 sq. km. • Capital<CapitalCity><FinancialCapital>--Kabul<CapitalCity>, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population<Population>:17.7 million. • Agriculture:Wheat<FoodStuff><AgriculturalProduct>, corn, barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton. AAAI 2002 WS
Basic Selection Strategy • Select from Mikrokosmos Ontology • concept names and their synonyms • concept values and their synonyms • Select from Data Frame Libraries • Afghanistan • smaller than Texas. • Area: 648,000 sq. km. • Capital--Kabul, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population:17.7 million. • Agriculture: Wheat, corn, barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton. AAAI 2002 WS
Basic Selection Strategy • Select from Mikrokosmos Ontology • concept names and their synonyms • concept values and their synonyms • Select from Data Frame Libraries • extract result based on the data frames • Afghanistan • smaller than Texas. • Area: 648,000<Area><Mileage> sq. km. • Capital--Kabul, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population:17.7<Time> million<Population><Price>. • Agriculture: Wheat, corn, barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton. AAAI 2002 WS
Concept conflict resolution • Arrive at an internally consistent set of selected concepts. • Two levels of resolution • Document-level resolution • Knowledge-source resolution • Criteria: lexical occurrence, proximity, length and distribution of words and terms • Preferences from among knowledge sources specifying matches • Other default strategies AAAI 2002 WS
Document-Level Conflict • Afghanistan • smaller than Texas. • Area: 648,000<Area><Mileage> sq. km. • Capital<CapitalCity><FinancialCapital>--Kabul<CapitalCity>, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population:17.7<Time> million<Population><Price>. • Agriculture: Wheat, corn, barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton. AAAI 2002 WS
Concept-Level Conflict • Afghanistan • smaller than Texas. • Area<GeographicalArea>: 648,000<Area> sq. km. • Capital--Kabul, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population<Population>: 17.7 million<Population>. • Agriculture: Wheat<FoodStuff><AgriculturalProduct>, corn, barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton. AAAI 2002 WS
Relationship retrieval • Ontology structure: directed graph, nodes are concepts • Conceptual relationship: all paths connecting concepts generated at given stage • Theoretical solution: find all the paths in the graph (NP-complete) • When multiple paths do exist, take the shortest path between 2 concepts (Cf. μK Onto-Search algorithm) • Dijkstra’s (polynomial) algorithm to compute the most salient relationships between concepts • Distance threshold on path length to prune weak relationships • Construct schemas, or linked conceptual configurations, from the relationships posited in the previous step. • Primary concept selected (or posited): highest connectivity • Cardinalities inferred from observed relationships AAAI 2002 WS
Participation Constraints • Afghanistan<Nation> • smaller than Texas. • Area: 648,000 sq. km. • Capital—Kabul<CapitalCity>, • Other cities--Kandahar Mazar-e-Sharif Konduz • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population: 17.7 million. • Agriculture: Wheat, corn, barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton. CapitalCity [1:1] IsA.CITY.PartOf Nation [1:1] AAAI 2002 WS
Participation Constraints (2) • Afghanistan<Nation> • smaller than Texas. • Area: 648,000 sq. km. • Capital--Kabul<City>, • Other cities<City>--Kandahar<City> Mazar-e-Sharif<City> Konduz<City> • Terrain: Landlocked; mostly mountains and desert. • Climate: Dry, with cold winters and hot summers. • Population: 17.7 million. • Agriculture: Wheat, corn, barley,rice, cotton, fruit, nuts, karakul pelts, wool, mutton. City [1:1] PartOf Nation [1:*] AAAI 2002 WS
Refining results • Output ontology: may require hand-crafting • can be done in a text editor (flat ASCII ontology) • Considerable expertise required: • markup syntax • specification of conceptual relations. • familiarity with regular-expression writing • Possible solution: ontology editors for typical end-users • With rich enough knowledge sources and a good set of training documents, however, we believe that the generation of extraction ontologies can be fully automatic. AAAI 2002 WS
Testing the system • Input: various of U.S. Department of Energy abstracts • Knowledge base: • μK ontology • Energy sub-hierarchy of Eurodicautom terms (300) AAAI 2002 WS
Sample application document The trend in supply and demand of fuel and the fuels for electric power generation, iron manufacturing and transportation were reviewed from the literature published in Japan and abroad in 1986. FY 1986 was a turning point in the supply and demand of energy and also a serious year for them because the world crude oil price dropped drastically and the exchange rate of yen rose rapidly since the end of 1985 in Japan as well. The fuel consumption for steam power generation in FY 1986 shows the negative growth for two successive years as much as 98.1%, or 65,730,000 kl in heavy oil equivalent, to that in the previous year. The total energy consumption in the iron and steel industry in 1986 was 586 trillion kcal (626 trillion kcal in the previous year). The total sales amount of fuel in 1986 was 184,040,000 kl showing a 1.5% increase from that in the previous year. The concept Best Mix was proposed as the ideal way in the energy industry. (21 figs, 2 tabs, 29 refs) AAAI 2002 WS
Sample output -- energy2 Information Ontology energy2 [-> object]; energy2 [0:*] has Alloy [1:*]; energy2 [0:*] has Consumption [1:*]; energy2 [0:*] has CrudeOil [1:*]; energy2 [0:*] has ForProfitCorporation [1:*]; energy2 [0:*] has FossilRawMaterials [1:*]; energy2 [0:*] has Gas [1:*]; energy2 [0:*] has Increase [1:*]; energy2 [0:*] has LinseedOil [1:*]; energy2 [0:*] has MetallicSolidElement [1:*]; energy2 [0:*] has Ores [1:*]; energy2 [0:*] has Produce [1:*]; energy2 [0:*] has RawMaterials [1:*]; energy2 [0:*] has RawMaterialsSupply [1:*]; Alloy [0:*] MadeOf.SOLIDELEMENT.Subclasses MetallicSolidElement [0:*]; Alloy [0:*] IsA.METAL.StateOfMatter.SOLID.Subclasses CrudeOil [0:*]; Alloy [0:*] IsA.PHYSICALOBJECT.ThemeOf.PHYSICALEVENT.Subclasses Produce [0:*]; AmountAttribute [0:*] IsA.SCALARATTRIBUTE.MeasuredIn.MEASURINGUNIT Consumption [0:*] IsA.FINANCIALEVENT.Agent Human [0:*]; ControlEvent [0:*] IsA.SOCIALEVENT.Agent Human [0:*]; ControlEvent [0:*] IsA.SOCIALEVENT.Location.PLACE.Subclasses Nation [0:*]; CountryName [0:*] NameOf Nation [0:*]; CountryName [0:*] IsA.REPRESENTATIONALOBJECT.OwnedBy Human [0:*]; CrudeOil [0:*] IsA.PHYSICALOBJECT.Location.PLACE.Subclasses Nation [0:*]; CrudeOil [0:*] IsA.PHYSICALOBJECT.OwnedBy Human [0:*]; CrudeOil [0:*] IsA.PHYSICALOBJECT.ThemeOf.GROW.Subclasses GrowAnimate [0:*]; CrudeOil [0:*] IsA.PHYSICALOBJECT.ThemeOf.PHYSICALEVENT.Subclasses Increase [0:*]; CrudeOil [0:*] IsA.PHYSICALOBJECT.ThemeOf.PHYSICALEVENT.Subclasses Combine [0:*]; CrudeOil [0:*] IsA.PHYSICALOBJECT.ThemeOf.PHYSICALEVENT.Subclasses Display [0:*]; CrudeOil [0:*] IsA.PHYSICALOBJECT.ThemeOf.PHYSICALEVENT.Subclasses Produce [0:*]; Custom [0:*] IsA.ABSTRACTOBJECT.ThemeOf.MENTALEVENT.Subclasses AddUp [0:*]; Display [0:*] IsA.PHYSICALEVENT.Theme.PHYSICALOBJECT.Subclasses Gas [0:*]; Display [0:*] IsA.PHYSICALEVENT.Theme.PHYSICALOBJECT.OwnedBy Human [0:*]; ForProfitCorporation [0:*] OwnedBy Human [0:*]; ForProfitCorporation [0:*] IsA.CORPORATION.HasNationality Nation [0:*]; Gas [0:*] IsA.PHYSICALOBJECT.Location.PLACE.Subclasses Nation [0:*]; Gas [0:*] IsA.PHYSICALOBJECT.ThemeOf.GROW.Subclasses GrowAnimate [0:*]; LinseedOil [0:*] IsA.PHYSICALOBJECT.ThemeOf.PHYSICALEVENT.Subclasses Increase [0:*]; AAAI 2002 WS
Evaluation • Several dozen relationships are generated • Correct: relationship is posited between the concept CRUDE-OIL and the action PRODUCE; the role is Theme, meaning that one can PRODUCE CRUDE-OIL • Incorrect: relationship between GAS and GROW • Precision: relatively low (around 75%) due to high number of matches • Recall: better (around 90%) • Note: it’s easier for a human to refine the system’s output by rejecting spurious relationships (i.e. deleting false positives) than to specify relationships that the system has missed. AAAI 2002 WS
How to improve results • Less general, more focused ontologies • Richer ontological structure • More types of hierarchical relationships (beyond IS-A and its inverse, SUB-CLASSES) • Deeper hierarchies (maximum 4 in Lenoch) • Note: TBX supports several data types for conceptual encoding AAAI 2002 WS
Related work • Lexical chaining in NLP • extracting and associating chains of word-based relationships from text • relating words and terms to resources like WordNet • Widely used in text categorization, automatic summarization, and topic detection and tracking • Our contributions: • integrating disparate knowledge sources for similar tasks • Discovering and generating a compatible set of ontological relationships AAAI 2002 WS
Conclusions • The knowledge acquisition bottleneck impacts ontology construction for information extraction. • Terminographers and lexicographers codify information that can be advantageous for work in semantic-based processing. • Integrating these two disparate areas, it is possible to leverage large-scale terminological and conceptual information with relationship-rich semantic resources in order to reformulate, match, and merge retrieved information of interest to a user. • Possible future applications: • Knowledge-focused personal agents • Customized search, filtering, and extraction tools • Individually tailored views of data via integration, organization, and summarization • Lots of work still to be done… AAAI 2002 WS