Ontologies: Contributions from Language Technology
Paul Buitelaar
DFKI GmbH, Language Technology Lab
DFKI Competence Center Semantic Web
Saarbrücken, Germany
Overview
• Ontologies and the Semantic Web
  • Semantic Web Intro
  • Ontologies and Knowledge Markup
  • Ontology Development
  • Ontology Lifecycle & Language Technology
• Language Technology
  • Levels of Automatic Linguistic Analysis
• Ontologies in Multilingual Information Access
  • A Medical Example: MuchMore Project
  • Semantic Resources in the Medical Domain
  • Demo MuchMore System
  • Language Technology in Annotation and Indexing
• Conclusions
  • MuchMore for the Legal Domain…
Semantic Web
[Diagram labels: Semantic Web Services; Semantic Web; Knowledge Markup; Ontologies; Intelligent Man-Machine Interface]
Ontology-based Knowledge Markup
Semantic Metadata
• Metadata, e.g. Dublin Core: Title, Author, etc.
• Semantic: Formal Properties of Objects of Class Author
Knowledge Markup
xmlns:jobs="http://www.jobs.org/daml+oil-jobs-ontology#"
<jobs:systems-analyst> John Smith </jobs:systems-analyst>
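The knowledge markup on this slide can be read back by any namespace-aware XML parser. The following is a minimal sketch, assuming the markup is wrapped in a hypothetical `doc` element (the slide does not show the enclosing element); only the namespace URI and the `jobs:systems-analyst` element come from the slide.

```python
import xml.etree.ElementTree as ET

# Namespace URI as given on the slide.
JOBS_NS = "http://www.jobs.org/daml+oil-jobs-ontology#"

# "doc" is a hypothetical wrapper element added only to make the fragment well-formed.
markup = (
    f'<doc xmlns:jobs="{JOBS_NS}">'
    "<jobs:systems-analyst> John Smith </jobs:systems-analyst>"
    "</doc>"
)

def extract_instances(xml_text):
    """Return (class URI, instance text) pairs for all namespaced elements."""
    root = ET.fromstring(xml_text)
    pairs = []
    for elem in root.iter():
        # ElementTree expands a prefixed tag to "{namespace}localname".
        if elem.tag.startswith("{"):
            namespace, local = elem.tag[1:].split("}", 1)
            pairs.append((namespace + local, (elem.text or "").strip()))
    return pairs

instances = extract_instances(markup)
for class_uri, text in instances:
    print(class_uri, "->", text)
```

Each pair links a text span ("John Smith") to an ontology class, which is what distinguishes knowledge markup from plain descriptive metadata.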
Semantic Web Architecture Layered Architecture (Tim Berners-Lee)
Knowledge Markup Languages
Syntax: Semantics
• XML, XML Schema, Namespaces: Interpretation Context, Data Types
• RDF, RDF Schema: Formalization of Classes (Inheritance), Properties
• OWL (DAML+OIL): Formalization of Classes, Class Definitions, Properties, Property Types (e.g. Transitivity)
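To illustrate what a property type such as transitivity adds, here is a minimal sketch of the inference it licenses: if a property is declared transitive, every chained pair of facts implies a new one. The `part_of` facts are hypothetical, and the naive fixed-point loop stands in for what a real OWL reasoner would do.

```python
def transitive_closure(pairs):
    """Return all (a, c) facts implied by treating `pairs` as a transitive relation."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # (a, b) and (b, d) together imply (a, d).
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

# Hypothetical facts for a property declared transitive (e.g. part_of).
facts = {("cpu", "motherboard"), ("motherboard", "computer")}
closure = transitive_closure(facts)
print(sorted(closure))
```

The inferred pair `("cpu", "computer")` is not stated anywhere in the input; it follows only from the declared property type.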
Ontologies: Basic Idea
• Definition
  • “… Explicit, Formal Specification of a Shared Conceptualization of a Domain of Interest”
    T. Gruber, Towards principles for the design of ontologies used for knowledge sharing. Int. J. of Human and Computer Studies, 1994
• Purpose
  • Knowledge Sharing (e.g. between Agents)
  • Inference (over Sets of Instances)
• Related Areas, e.g.
  • Terminologies, Controlled Vocabulary, Thesauri, Taxonomies, Semantic Lexicons, Wordnets, etc.
  • Conceptual Models, Schemas, etc.
Ontologies: Applications, e.g.
• Semantic Web Services
  • Interoperability for (Semantic) Web Services
• Intelligent Agents
  • Domain Models for Intelligent Agents
• Text Interpretation
  • Ontology-aware Information Extraction
• Multimedia Integration
  • Ontology-based Alignment of Extracted Objects in Text, Audio, Video
• Intelligent Search/Navigation
  • Ontology-based Indexing in Web-Retrieval
Ontologies: Development
• Ontology Editor / KB Management
  • Most Widely Used: Protégé (Stanford University, Medical Informatics, USA)
  • Originally for Development and Maintenance of Medical Expert Systems
• Other, e.g.
  • KAON: University of Karlsruhe - AIFB, Germany
  • WebOde: UPM - Ontology Group, Madrid, Spain
  • WebOnto: Open University - KMI, UK
• Overview at XML.com by Michael Denny: Ontology Building: A Survey of Editing Tools
[Screenshot: Class Hierarchy and Slot Descriptions for http://dmag.upf.es/ontologies/2003/12/ipronto.owl]
Ontology Lifecycle
[Cycle diagram, stages as labeled: Populating, Validating, Creating, Deploying, Evolving, Maintaining]
LT in the Ontology Lifecycle
Language Technology (LT) for Ontology:
• Creating & Evolving
  • Linguistic Analysis to Extract Classes / Relations
  • Documents (Text) → Classes, Relations/Properties → Ontology (Knowledge)
• Populating (Knowledge Base Generation)
  • Linguistic Analysis to Extract Instances
  • Documents (Text) → Instances
Language Technology = Automated Linguistic Analysis
Linguistic Analysis: Example
The Dell computer with a flat screen had to be rejected because of a failure in the motherboard.
[Diagram labels: flat screen; Dell computer; has-a; reject; has-a; animate-entity; motherboard; failure; location-of]
Part-of-Speech, Morphology
Part-of-Speech
• e.g.: noun, verb, adjective, preposition, …
• PoS tag sets may have between 10 and 50 (or more) tags
Morphology
• Most languages have inflection and declension, e.g.:
  • Singular/Plural: computer, computers
  • Present/Past: reject, rejected
• Many languages also have complex (de)composition, e.g.:
  • Flachbildschirm (flat screen) > flach + Bildschirm > flach + Bild + Schirm
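Compound decomposition of the kind shown for Flachbildschirm can be approximated by matching a word against a lexicon of known stems. The sketch below is a simplification under stated assumptions: the four-entry lexicon is hypothetical, everything is lowercased, and German linking elements (such as -s- or -n-) are ignored. It is not the analysis component used in the talk.

```python
# Hypothetical toy lexicon of known stems (lowercased).
LEXICON = {"flach", "bild", "schirm", "bildschirm"}

def segmentations(word, lexicon):
    """Return every way to split `word` into a sequence of lexicon entries."""
    word = word.lower()
    if not word:
        return [[]]  # one way to segment the empty string: no parts
    results = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in lexicon:
            # Recurse on the remainder and prepend the matched stem.
            for rest in segmentations(word[i:], lexicon):
                results.append([head] + rest)
    return results

for parts in segmentations("Flachbildschirm", LEXICON):
    print(" + ".join(parts))
```

Both readings from the slide are produced: the coarse split (flach + bildschirm) and the fully decomposed one (flach + bild + schirm). Choosing between them is a separate decision, for example preferring the split whose parts map to known concepts.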
Phrases, Terms, Named Entities
Semantic Units
• Phrases (e.g. nominal - NP, prepositional - PP)
  • NP: a flat screen
  • PP: with a flat screen
  • NP (recursive): the Dell computer with a flat screen; a failure in the motherboard
• Terms (domain-specific phrases)
  • Dell computer
  • Dell computer with a flat screen
• Named Entities (phrases corresponding to dates, names, …)
  • COMPANY: Dell
  • COMPANY: Dell Computer Corporation
  • PERSON: Michael Dell
Dependency Structure
Semantic Structure
• Dependencies between Predicates and Arguments
  the Dell computer with a flat screen had to be rejected
  PRED: reject
  ARG1: ENTITY
  ARG2: ‘the Dell computer with a flat screen’
• ‘Logical Form’:
  reject(x,y) & animate-entity(x) & computer(y) & …
• The Dell computer that has been rejected was claimed to have suffered from handling.
  reject(e1,x1,y1) & animate-entity(x1) & Dell_computer(y1) & claim(e2,x2,e3) & animate-entity(x2) & suffer_from(e3,y1,y2) & handling(y2)
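A logical form like the one above is a conjunction of predications that share variables. As a minimal sketch, assuming a plain tuple representation (the names `render` and `mentions` are illustrative, not from the talk), it can be stored and printed as follows:

```python
# Each predication is (predicate name, tuple of argument variables).
lf = [
    ("reject", ("e1", "x1", "y1")),
    ("animate-entity", ("x1",)),
    ("Dell_computer", ("y1",)),
]

def render(predications):
    """Render predications as a conjunction in the slide's notation."""
    return " & ".join(f"{pred}({','.join(args)})" for pred, args in predications)

def mentions(predications, variable):
    """Return the predicates in which `variable` occurs as an argument."""
    return [pred for pred, args in predications if variable in args]

print(render(lf))
print(mentions(lf, "y1"))
```

Shared variables are what tie the predications together: `y1` links the rejected object to its description as a Dell computer.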
MuchMore Project
http://muchmore.dfki.de
• Demonstration Prototype
  • Real-Life Medical Scenario for Cross-Lingual Information Retrieval
• Research & Development
  • Combined Data- and Knowledge-Driven
• Performance Evaluation
  • Performance Comparison of Existing and Novel Methods
Semantic Resources
Medical Domain
• UMLS: Unified Medical Language System
  • Medical MetaThesaurus (only MeSH2001 is used)
    • English, German, Spanish, …
    • 730,000 Concepts
    • 9 Relations (Broader, Narrower, …)
  • Semantic Network
    • 134 Semantic Types
    • 54 Semantic Relations
General
• WordNet (EN), GermaNet (DE), EuroWordNet (“linked”)
MetaThesaurus, SemNet
C0019682|ENG|P|L0019682|PF|S0048631|HIV|0|
C0019682|ENG|S|L0020103|PF|S0049688|HTLV-III|0|
C0019682|ENG|S|L0020128|VS|S0049756|Human Immunodeficiency Virus|0|
C0019682|ENG|S|L0020128|VWS|S0098727|Virus, Human Immunodeficiency|0|
C0019682|FRE|P|L0168651|PF|S0233132|HIV|3|
C0019682|FRE|S|L0206547|PF|S0277133|VIRUS IMMUNODEFICIENCE HUMAINE|3|
C0019682|GER|P|L0413854|PF|S0538136|HIV|3|
C0019682|GER|S|L1261793|PF|S1503739|Humanes T-Zell-lymphotropes Virus Typ III|3|
Concept Names: 1,734,706 ENGLISH 1,462,202 GERMAN 66,381 other languages
• Each CUI (Concept Unique Identifier) is mapped to one out of 134 Semantic Types or TUI (Type Unique Identifier)
  • Clozapine: C0009079
  • Pharmacologic Substance: T121
• Semantic Types are organized in a Network through 54 Relations
  • T121|T154|T047
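The pipe-delimited concept-name rows above can be grouped by concept and language with a few lines of code. This is a minimal sketch: the field names in `FIELDS` are an assumption based on the usual layout of the Metathesaurus concept-name file and should be checked against the UMLS release in use; the sample rows are copied from the slide.

```python
from collections import defaultdict

# Assumed column names for the eight fields of each row.
FIELDS = ("cui", "lat", "ts", "lui", "stt", "sui", "str", "lrl")

def parse_row(line):
    """Split one pipe-delimited row into a dict keyed by FIELDS."""
    parts = line.rstrip("\n").split("|")
    # Rows end with a trailing "|", so the final split element is empty and dropped.
    return dict(zip(FIELDS, parts[:len(FIELDS)]))

def names_by_language(lines):
    """Map CUI -> language code -> list of concept names."""
    index = defaultdict(lambda: defaultdict(list))
    for line in lines:
        row = parse_row(line)
        index[row["cui"]][row["lat"]].append(row["str"])
    return index

rows = [
    "C0019682|ENG|P|L0019682|PF|S0048631|HIV|0|",
    "C0019682|ENG|S|L0020128|VS|S0049756|Human Immunodeficiency Virus|0|",
    "C0019682|GER|P|L0413854|PF|S0538136|HIV|3|",
]
index = names_by_language(rows)
print(dict(index["C0019682"]))
```

Because all language variants share one CUI, such an index gives the language-independent pivot that cross-lingual retrieval relies on.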
Annotation & Indexing
• Token (with Part-of-Speech)
  • German: Kreuzbandes
  • English: ligaments
• Lemma (or Sequence of Lemmas - Decomposition)
  • German: Faserknorpel → Faser + Knorpel
  • English: ligament
• UMLS Concept Code and Semantic Type
  • ligament: C0022745_T030
• MeSH Code
  • A2.513
• Semantic Relation (over a Pair of UMLS Concepts)
  • C0022745_T030 interconnects C0047693_T065
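The combined codes on this slide pair a concept identifier with a semantic type, joined by an underscore. A minimal sketch for unpacking them, assuming the pattern seen here (a "C" plus seven digits, then a "T" plus three digits); the `Annotation` type and function names are illustrative only:

```python
import re
from typing import NamedTuple

class Annotation(NamedTuple):
    cui: str  # Concept Unique Identifier
    tui: str  # Type Unique Identifier (semantic type)

# Pattern inferred from the codes shown on the slide.
CODE = re.compile(r"^(C\d{7})_(T\d{3})$")

def parse_code(code):
    """Split a combined code such as 'C0022745_T030' into CUI and TUI."""
    match = CODE.match(code)
    if match is None:
        raise ValueError(f"not a CUI_TUI code: {code!r}")
    return Annotation(match.group(1), match.group(2))

def parse_relation(text):
    """Parse 'CODE relation CODE' into (Annotation, relation, Annotation)."""
    left, relation, right = text.split()
    return parse_code(left), relation, parse_code(right)

relation = parse_relation("C0022745_T030 interconnects C0047693_T065")
print(relation)
```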
Relations
• UMLS Semantic Network specifies 54 types of relations between 134 semantic types
  • Pharmacologic Substance affects Cell Function
• Relations are generic and potentially false
  • Therapeutic Procedure method_of Occupation, Discipline
  • * discectomy method_of history
• Relations are ambiguous
  • Therapeutic Procedure prevents Neoplastic Process
  • Therapeutic Procedure complicates Neoplastic Process
  • Therapeutic Procedure affects Neoplastic Process
  • Therapeutic Procedure treats Neoplastic Process
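Both problems named on this slide follow from projecting type-level relations onto concept pairs. The sketch below is an illustration under stated assumptions: the lookup tables contain only the examples from the slide, the type assignments for "discectomy" and "history" are read off the slide's starred example, and the function names are hypothetical.

```python
# Type-level relations, limited to the examples on the slide.
TYPE_RELATIONS = {
    ("Pharmacologic Substance", "Cell Function"): ["affects"],
    ("Therapeutic Procedure", "Occupation, Discipline"): ["method_of"],
    ("Therapeutic Procedure", "Neoplastic Process"): [
        "prevents", "complicates", "affects", "treats",
    ],
}

# Concept-to-type assignments implied by the slide's starred example.
CONCEPT_TYPES = {
    "discectomy": "Therapeutic Procedure",
    "history": "Occupation, Discipline",
}

def type_relations(type1, type2):
    """Relations the network allows between two semantic types."""
    return TYPE_RELATIONS.get((type1, type2), [])

def candidate_relations(concept1, concept2):
    """Project type-level relations onto a concept pair (over-generates)."""
    type1, type2 = CONCEPT_TYPES[concept1], CONCEPT_TYPES[concept2]
    return [(concept1, rel, concept2) for rel in type_relations(type1, type2)]

print(candidate_relations("discectomy", "history"))
print(type_relations("Therapeutic Procedure", "Neoplastic Process"))
```

The first call returns the false relation marked with an asterisk on the slide, and the second returns four competing relations for one type pair, so a type lookup alone can only propose candidates that further evidence has to filter.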
Example
Discontinuation of heparin is a simple and essential maneuver, and anticoagulation has to be continued by alternative drugs.
Example: Terms/Concepts
Discontinuation of heparin is a simple and essential maneuver, and anticoagulation has to be continued by alternative drugs.
Terms:
• C0019134 Heparin
• C0005790 Blood coagulation tests
• C0013227 Pharmaceutical preparations
Example: Relations
Discontinuation of heparin is a simple and essential maneuver, and anticoagulation has to be continued by alternative drugs.
Terms:
• C0019134 Heparin
• C0005790 Blood coagulation tests
• C0013227 Pharmaceutical preparations
Relations:
• C0019134 interacts_with C0013227
• C0005790 analyses C0019134
• C0005790 analyses C0013227
Conclusions
MuchMore for the Legal Domain…
Resources
• Legal Domain Ontology with…
• …Large-scale Terminology for Multiple Languages, or if not available…
• …Large Legal Domain Corpora in Multiple Languages for Term Extraction…
• …and for Relation Extraction if Ontology Needs to be Constructed/Adapted
Tools
• Linguistic Analysis (PoS, Morphology, Term Grammars, etc.)…
• …for Multiple Languages…
• …Tuned to the Legal Domain…
• Information Retrieval Infrastructure, Interface Design, etc.