KYOTO (ICT-211423)
Knowledge Yielding Ontologies for Transition-Based Organization
FP7: Intelligent Content and Semantics
http://www.kyoto-project.eu/
Piek Vossen, VU University Amsterdam
Asian Language Resources Summit, Phuket, March, 2009
Overview
• Background information
• Baseline for retrieval in environment domain
• System architecture
• Knowledge mining
• Conclusions
KYOTO (ICT-211423) Overview
• Title: Knowledge Yielding Ontologies for Transition-Based Organization
• Funded:
  • 7th Framework Program-ICT of the European Union: Intelligent Content and Semantics
  • Taiwan and Japan funded by national grants
• Goal:
  • Open and free platform for knowledge sharing across languages and cultures
  • Wiki environment that allows people in the field to maintain their knowledge and agree on meaning without knowledge engineering skills
  • Bootstrap through open text mining & concept learning
  • Enables knowledge transition and information search across different target groups, transgressing linguistic, cultural and geographic boundaries
  • Enables deep semantic search for facts and knowledge
• URL: http://www.kyoto-project.eu/
• Duration: March 2008 – March 2011
• Effort: 364 person months of work
Consortium
• Vrije Universiteit Amsterdam (Amsterdam, The Netherlands)
• Consiglio Nazionale delle Ricerche (Pisa, Italy)
• Berlin-Brandenburg Academy of Sciences and Humanities (Berlin, Germany)
• Euskal Herriko Unibertsitatea (San Sebastian, Spain)
• Academia Sinica (Taipei, Taiwan)
• National Institute of Information and Communications Technology (Kyoto, Japan)
• Irion Technologies (Delft, The Netherlands)
• Synthema (Rome, Italy)
• European Centre for Nature Conservation (Tilburg, The Netherlands)
• Subcontractors:
  • World Wide Fund for Nature (Zeist, The Netherlands)
  • Masaryk University (Brno, Czech Republic)
KYOTO (ICT-211423) Overview
• Languages: English, Dutch, Italian, Spanish, Basque, Chinese, Japanese
• Domain: Environmental domain, BUT usable in any domain
• Global: Both European and non-European languages
• Available: Free, as open source system and data (GPL)
• Future perspective: Content standardization that supports world wide communication
Baseline for environment domain
• Mainly use Google, first 10 hits, no advanced options
• Textual search with linguistic enhancements but no real semantic search:
  • polluted water….
  • polluting water….
• Growing time & information pressure:
  • deliver up-to-date information from diverse & dynamic sources
  • regional, local situations ► no general source
  • various subdomains ► government, legal, biology, health, industry
  • difficult access ► scientific publications
  • no time to read ► too much information and work pressure
  • dependent on trust: scientists ► environmentalists ► government ► general public
High-level targets & Low-level questions
• High-level target (about 300 questions collected):
  • Are there huge negative effects with regard to ecological networks and alien invasive species?
• Low-level facts that support answering the high-level targets:
  • cases of alien invasion
  • number of species
  • causal relations associated with these (increments of) invasions
  • causes related to ecological networks
  • restricted to the same time and location boundaries
Baseline retrieval results
6 persons, 30 high-level questions
KYOTO's Solution
• Text mining:
  • Massive and accurate indexing of facts from vast amounts of text;
  • In any language/culture from scattered sources;
  • Again and again to detect trends and changes;
  • Direct relation between knowledge modeling effort and text mining
• Knowledge modeling:
  • automatic learning of terms and concepts from text in any language;
  • formalization of knowledge in computer usable format -> wordnets & ontologies
• Community software:
  • For experts in the field and not knowledge engineers
• Continuous and collaborative effort:
  • adapt to the changing domain;
  • consensus in the field;
  • consensus across languages and cultures
• Produce interoperable, formal, standardized knowledge structures;
• Relate knowledge structure to expressions in languages
Distributed, diverse & dynamic data
[Figure: KYOTO workflow in six numbered steps. Users: citizens, governments, companies, environmental organizations. Components: captured text; Tybot (term yielding robot); wordnets; ontology with top level (Abstract, Physical), middle level (Process, Substance) and domain level (H2O, CO2, Pollution, CO2 Emission, Greenhouse Gas); Wikyoto (maintain terms & concepts); Kybot (knowledge yielding robot); Text & Fact Index; Semantic Search.]
Example from the figure:
• Capture text: "Sudden increase of CO2 emissions in 2008 in Europe"
• Index facts:
  • Process: Emission
  • Involves: CO2
  • Property: increase, sudden
  • When: 2008
  • Where: Europe
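The indexed fact in this slide can be pictured as a small record that a semantic search filters on. The sketch below is only an illustration: the slot names follow the slide (Process, Involves, Property, When, Where), while the `Fact` class and the `matches` helper are hypothetical and not part of the KYOTO software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One indexed fact; slot names follow the slide, the class itself is hypothetical."""
    process: str
    involves: str
    properties: tuple = ()
    when: str = ""
    where: str = ""


def matches(fact, **constraints):
    """Return True if every given slot constraint holds for the fact."""
    for slot, wanted in constraints.items():
        value = getattr(fact, slot)
        if isinstance(value, tuple):
            # Multi-valued slot: the wanted value must be one of the entries.
            if wanted not in value:
                return False
        elif value != wanted:
            return False
    return True


# The fact extracted from "Sudden increase of CO2 emissions in 2008 in Europe".
fact_index = [
    Fact(process="Emission", involves="CO2",
         properties=("increase", "sudden"), when="2008", where="Europe"),
]

# A semantic query: all emission facts located in Europe.
hits = [f for f in fact_index if matches(f, process="Emission", where="Europe")]
print(len(hits))
```

Querying on slots rather than on strings is what separates this kind of search from the keyword baseline: "polluted water" and "polluting water" would end up in different slots of different facts.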
Data Flow Diagram of Kyoto System
[Figure labels: Original Document Base; Linguistic Processor; Kyoto Annotation Format (KAF), Semantic & Syntactic Base; Tybot (Term Extractor); Term Base; Wikyoto (Wiki Term Editor); Multilingual Knowledge Base (wordnets and ontologies, interlinked); Kybot (Fact Extractor); Fact Base; Keyword Search; Semantic Search. Users: Concept User, Fact User, End Users. Three steps are numbered 1 to 3.]
Kyoto Annotation Format (KAF)
• Kyoto Annotation Format (Level 1), a multi-layered annotation format for:
  • Tokenization and word form segmentation
  • POS tagging
  • Lemmatization and term extraction
  • Constituency tagging
  • Dependency tagging
[Figure label: ENG-3.0-107695012-N]
Semantic Annotation
• Semantic Annotation Format for:
  • Named Entity Recognition (time, events, quant. …)
  • Word Sense Disambiguation (D-WSD)
  • Semantic Role Labeling (SRL)
• KAF level 2 (SemKAF)
[Figure labels: no synsets; ENG-3.0-107630294-N]
KAF annotation: WSD
<term tid="t4" type="open" lemma="population" pos="N">
  <span> <target id="w4"/> </span>
  <senseAlt>
    <sense sensecode="EN-17-00861095-n" />
    <sense sensecode="EN-17-00859568-n" />
.......
<term tid="t4" type="open" lemma="population" pos="N">
  <span> <target id="w4"/> </span>
  <senseAlt>
    <sense sensecode="EN-17-00859568-n" confidence="0.80"/>
    <sense sensecode="EN-17-00257849-n" confidence="0.13"/>
    <sense sensecode="EN-17-00962397-n" confidence="0.07"/>
  </senseAlt>
</term>
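A consumer of this annotation would typically pick the sense with the highest confidence. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the element and attribute names are taken from the slide, but `best_sense` is a hypothetical helper and the real KAF tooling may work differently.

```python
import xml.etree.ElementTree as ET

# The disambiguated term from the slide, with the attribute quotes closed.
KAF_TERM = """
<term tid="t4" type="open" lemma="population" pos="N">
  <span><target id="w4"/></span>
  <senseAlt>
    <sense sensecode="EN-17-00859568-n" confidence="0.80"/>
    <sense sensecode="EN-17-00257849-n" confidence="0.13"/>
    <sense sensecode="EN-17-00962397-n" confidence="0.07"/>
  </senseAlt>
</term>
"""


def best_sense(term_xml):
    """Return (lemma, sensecode, confidence) for the top-ranked sense, or None."""
    term = ET.fromstring(term_xml.strip())
    scored = [
        (float(sense.get("confidence", "0")), sense.get("sensecode"))
        for sense in term.findall("./senseAlt/sense")
    ]
    if not scored:
        return None
    confidence, sensecode = max(scored)
    return term.get("lemma"), sensecode, confidence


print(best_sense(KAF_TERM))
```

Senses without a `confidence` attribute, as in the first fragment on the slide, are treated as confidence 0 in this sketch; that default is an assumption, not something the slide specifies.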
Data formats
Level of annotation          Standard format
Morpho-syntax annotation     KAF <= (MAF, SYNAF, SEMAF)
Semantic annotation          KAF <= (MAF, SYNAF, SEMAF)
Terms representation         TMF
Facts annotation             KAF
Wordnets                     Wordnet-LMF
Ontologies                   OWL
Knowledge mining
• Concept mining (Tybots):
  • Extract terms and relations in a language
  • Map the terms to an existing wordnet
  • Ontologize terms to concepts and axioms
• Fact mining (Kybots):
  • Define logical patterns
  • Define expression rules in a language
What Tybots do...
• Input: text documents
• Linguistic processors generate KAF annotation (sequential):
  • morpho-syntactic analysis
  • semantic roles
  • named entities
  • wordnet and ontology mappings
• Output: term hierarchies in TMF (generic):
  • structural parent relations
  • quantified structural and semantic relations
  • statistical data
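One simple way to picture "structural parent relations" is to link each multiword term to the longest trailing part of it that is itself a term, so that "CO2 emission" hangs under "emission". The sketch below assumes exactly that head-final heuristic for English; it is not the Tybot algorithm, and both function names are hypothetical.

```python
def structural_parent(term, vocabulary):
    """Return the longest proper suffix of `term` that is itself a known term."""
    words = term.split()
    for start in range(1, len(words)):
        candidate = " ".join(words[start:])
        if candidate in vocabulary:
            return candidate
    return None


def build_hierarchy(terms):
    """Map every term to its structural parent (None for root terms)."""
    vocabulary = set(terms)
    return {term: structural_parent(term, vocabulary) for term in terms}


# Terms that appear in the deck's emission example.
terms = ["emission", "gas emission", "CO2 emission",
         "gas", "greenhouse gas", "area", "agricultural area"]

hierarchy = build_hierarchy(terms)
for term, parent in hierarchy.items():
    print(f"{term} -> {parent}")
```

A real term extractor would also attach the statistical data mentioned above (for example term frequencies) and semantic relations obtained from the wordnet mapping; neither is modeled here.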
[Figure: Tybot concept mining. Stages shown: Source Documents; Linguistic Processors (morpho-syntactic analysis); TYBOT Concept Miners; Term hierarchy; Synthesize (English Wordnet); Ontologize (Ontology); Axiomatize (conceptual modeling).]
• Example phrase: [[the emission]NP [of greenhouse gases]PP [in agricultural areas]PP] NP
• Term hierarchy: emission, gas emission, CO2 emission; gas, greenhouse gas, CO2; area, agricultural area
• English Wordnet synsets: location:3, region:3, geographical area:1, area:1, rural area:1, farmland:2; substance:1, gas:1, greenhouse gas:1; naturalprocess:1, emission:3, CO2 emission:2
• Ontology: Abstract, Physical; Process, Substance; Chemical Reaction, H2O, CO2, GreenhouseGas, GlobalWarming, CO2Emission, WaterPollution
• Axioms: (instance s1 Substance) (instance e1 Warming) (katalyist s1 e1)
What Kybots do
• Input:
  • KAF annotations of text: sequential & encoded by language
  • Conceptual frame from the ontology
  • Expression rules for frame to language mapping:
    • Wordnet in a language
    • Morpho-syntactic mapping rules
• Output: a database of facts in FactAF (generic):
  • aggregated facts
  • inferred facts
  • language neutral
Fact mining
• KYBOT = Knowledge Yielding Robot
• Logical expression:
  • (instance, e1, Burn) (instance, e2, Warming) (cause, e1, e2)
  • (instance, s1, CO2) (instance, e1, GlobalWarming) (katalyist, s1, e1)
• Expression rules per language:
  • [N[s1]V[e1]]S e.g. "CO2 is emitted", "fine dust blocks sun-light"
  • [N[s1]N[e1]]N e.g. "CO2 emission", "sun-light blocking"
  • [[N[e1]][prep][N[s2]]]NP e.g. "emission of CO2", "sun light blocking by fine dust"
• Ontology * Wordnets:
  • Capabilities: WNT -> adjectives ("explosive", "toxic"), WNT -> nouns ("explosive", "poison")
  • Causes: WNT -> verbs ("eat"), WNT -> nouns ("consumption")
  • Process: DamageProcess, ProduceProcess
• Kybot compiler:
  • kybots = logical pattern + ontology + WN[Lx] + ER[Lx]
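The combination of a logical pattern, an ontology lookup and per-language expression rules can be sketched as pattern matching over POS-tagged lemmas. Everything below is a simplified assumption: the lexicon stands in for the wordnet and ontology mapping, the POS tag "P" for prepositions and the relation name "patient" are chosen for illustration, and the real Kybot compiler is not shown on the slide.

```python
# Hypothetical lemma -> ontology class lookup (stands in for wordnet + ontology).
LEXICON = {"CO2": "CO2", "emission": "Emission"}
PROCESS_CLASSES = {"Emission"}

# Two English expression rules from the slide, as (POS tag, variable) sequences.
# A variable of None means the token is matched but not bound.
EXPRESSION_RULES = [
    [("N", "s1"), ("N", "e1")],               # "CO2 emission"
    [("N", "e1"), ("P", None), ("N", "s1")],  # "emission of CO2"
]


def apply_rules(tokens, relation="patient"):
    """Match every rule against every token window and emit logical tuples."""
    facts = []
    for pattern in EXPRESSION_RULES:
        width = len(pattern)
        for start in range(len(tokens) - width + 1):
            window = tokens[start:start + width]
            # Skip windows whose POS tags do not fit the rule.
            if any(tok[1] != slot[0] for tok, slot in zip(window, pattern)):
                continue
            bound = {slot[1]: LEXICON.get(tok[0])
                     for tok, slot in zip(window, pattern) if slot[1]}
            # Require a known substance and a known process class.
            if bound["s1"] is None or bound["e1"] not in PROCESS_CLASSES:
                continue
            facts.append([
                ("instance", "s1", bound["s1"]),
                ("instance", "e1", bound["e1"]),
                (relation, "s1", "e1"),
            ])
    return facts


print(apply_rules([("CO2", "N"), ("emission", "N")]))
print(apply_rules([("emission", "N"), ("of", "P"), ("CO2", "N")]))
```

Both surface forms produce the same tuples, which is the point of separating the language-neutral logical pattern from the language-specific expression rules.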
Fact mining by Kybots
[Figure: Source Documents; Linguistic Processors; Morpho-syntactic analysis (KAF); Wordnets & Linguistic Expressions; Ontology (Generic: Abstract, Physical, Patient, Substance, Process; Domain: Chemical Reaction, H2O, CO2, CO2 emission, water pollution); Logical Expressions; Fact analysis.]
• Morpho-syntactic analysis (KAF): [[the emission]NP [of greenhouse gases]PP [in agricultural areas]PP] NP
• Fact analysis:
  • [the emission]NP ► Process: e1
  • [of greenhouse gases]PP ► Patient: s2
  • [in agricultural areas]PP ► Location: a3
• Further steps:
  • semantic role labelling
  • time & place
  • aggregation from all relevant phrases and documents
  • inferencing
  • adding trust and reliability
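The fact analysis of the bracketed phrase can be approximated with a few lines of chunk handling. This sketch assumes a tiny preposition-to-role table for English and treats every NP chunk as the process; both are simplifications made for illustration, and `assign_roles` is a hypothetical helper rather than Kybot code.

```python
import re

# Matches innermost chunks such as "[of greenhouse gases]PP".
CHUNK = re.compile(r"\[([^\[\]]+)\](NP|PP)")

# Assumed preposition -> role table for English.
PREPOSITION_ROLES = {"of": "Patient", "in": "Location"}


def assign_roles(bracketed):
    """Return (role, text) pairs for the chunks of a bracketed phrase."""
    roles = []
    for text, label in CHUNK.findall(bracketed):
        words = text.split()
        if label == "NP":
            # Simplification: the NP chunk is taken to name the process.
            roles.append(("Process", " ".join(words)))
        else:
            role = PREPOSITION_ROLES.get(words[0].lower())
            if role:
                roles.append((role, " ".join(words[1:])))
    return roles


phrase = "[[the emission]NP [of greenhouse gases]PP [in agricultural areas]PP] NP"
for role, text in assign_roles(phrase):
    print(f"{role}: {text}")
```

In the system described here the role decision would come from the ontology and wordnets rather than from a fixed preposition table, since "of" and "in" signal different roles in other contexts.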
[Figure: Kyoto Server architecture. Data formats: facts in RDF, wordnets in LMF, ontologies in OWL-DL. Resources (some as plugins): G-WN, G-KON, DE-WN, DE-KON, DE-TN, SUMO, DOLCE, GEO, WIKIPEDIA, FRAMENET. Processing: pdf source, Smart Kytext, Tybots, KAF, Kybots, FactAF. Some parts are marked Hidden, others Shown.]
• Simplified term fragment: population, marine species, terrestrial species
• Simplified ontology fragment: Group, ?Population
• Text snippets:
  • ".... populations such as terrestrial and marine species ....."
  • ".... populations declined ..... terrestrial and marine species .. in forests ..... declined"
• Interview questions:
  • Do populations consist of marine species?
  • Are terrestrial species a type of populations?
  • Do populations always consist of marine species?
  • Are terrestrial species never marine species?
Kyoto Knowledge Base
[Figure: a central Ontology with Domain Ontology extensions, surrounded by the wordnets WnJP, WnIT, WnNL, WnES, WnEN, WnEU and WnCH, each with its own Domain extension.]
Ultimate goal
• Global standardization and anchoring of meaning such that:
  • Machines can start to approach text understanding -> semantic web connects to the current web
  • Communities can dynamically maintain knowledge, concepts and their terms in an easy to use system
  • Cross-linguistic and cross-cultural sharing and communication of knowledge is enabled
• Establish a Global-Wordnet-Grid: formalization of Wikipedia for humans AND machines across languages
Global WordNet Grid
[Figure: wordnets of eight languages linked through an Inter-Lingual Ontology (Object, Device, TransportDevice); the links are numbered 1 to 3.]
• German words: Fahrzeug, Auto, Zug
• English words: vehicle, car, train
• Spanish words: vehículo, auto, tren
• Italian words: veicolo, auto, treno
• Dutch words: voertuig, auto, trein
• Estonian words: liiklusvahend, auto, killavoor
• French words: véhicule, voiture, train
• Czech words: dopravní prostředník, auto, vlak
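The grid idea, in which words of different languages meet at a shared ontology concept, can be sketched as a lookup table keyed by concept. The words come from the figure; the assignment of the "vehicle" words to TransportDevice is an assumption, and the concept names "Car" and "Train" are invented here for illustration because the figure only names Object, Device and TransportDevice.

```python
# Concept -> {language code: word}. "Car" and "Train" are hypothetical concept names.
GRID = {
    "TransportDevice": {"de": "Fahrzeug", "en": "vehicle", "es": "vehículo",
                        "it": "veicolo", "nl": "voertuig", "fr": "véhicule"},
    "Car": {"de": "Auto", "en": "car", "es": "auto",
            "it": "auto", "nl": "auto", "fr": "voiture"},
    "Train": {"de": "Zug", "en": "train", "es": "tren",
              "it": "treno", "nl": "trein", "fr": "train"},
}


def translate(word, source, target):
    """Go from a word to its concept, then to the target-language word."""
    for concept, words in GRID.items():
        if words.get(source) == word:
            return concept, words.get(target)
    return None, None


print(translate("Zug", "de", "nl"))
print(translate("voiture", "fr", "en"))
```

Because every language links to the ontology rather than to every other language, adding a ninth wordnet needs one new mapping instead of eight.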
[Figure: Linking Open Data dataset cloud (http://richard.cyganiak.de/2007/10/lod/), extended with domain resources: wordnet terms (environment, legal, medical, sailing), ontology concepts (environment, legal, medical, sailing) and facts (environment, legal, medical).]
Kyoto main assets
• Wiki platform (WIKYOTO) for connecting, transferring and controlling knowledge and information across people and computers
• Term yielding robots (TYBOT): software that extracts terms and concepts from documents
• Knowledge yielding robots (KYBOT): fact extraction software that generates a comprehensive list of facts from a collection of sources
• Fact repositories & fact alert: reports changes in facts on a collection of sources
• Domain WORDNETS and domain ONTOLOGIES
• Create the backbone for the Global Wordnet Grid
What makes KYOTO unique?
• Integrates & combines all ► knowledge engineering, language engineering, wikis, term & concept learning, fact mining from text in and across languages, & standardization
• Direct relation between concept modeling and text mining ► make it worth the effort
• Wikyoto community tool ► hides technology and complex knowledge and language representation
• Operated by community people and not by knowledge engineers and language technology people ► exploits massive labor force of communities all over the world
What makes KYOTO unique?
• Text mining and ontology learning are usually developed for separate languages
  ► KYOTO is multilingual, cross-lingual & cross-cultural
  ► cross-lingual and cross-cultural semantic interoperability
• Text mining and ontology learning are often limited to a specific domain and/or application ► KYOTO for any domain and application
• Text mining and ontology learning usually do not relate the terms and concepts to generic language and knowledge resources ► KYOTO anchors knowledge from a community to general vocabulary and likewise to other communities
Contribution of KYOTO
• KYOTO delivers a Web 2.0 environment for community based control
  • Connects people across languages and cultures
  • Establishes consensus and knowledge transition
• KYOTO learns terms and concepts from text documents
  • Stored as structures that people and computers understand
  • hundreds of thousands of sources in the environment domain
  • in many different languages
  • spread all over the world
  • changing every day
• KYOTO enables semantic search and fact extraction
  • Software can partially understand language and exploit Web 1.0 data
  • Understanding is helped by the terms and concepts defined for each language
[Figure labels: html, pdf, xls; TYBOT, KYBOT, WIKYOTO; environment facts; Wordnet environment terms; Ontology environment concepts]