Language Technology I 2005/06. Knowledge Extraction/Semantic Web. Paul Buitelaar German Research Center for Artificial Intelligence (DFKI). Overview. Semantic Web Introduction Semantic Web Representation and Query Languages Semantic Web Tools Ontologies and Knowledge Markup
Language Technology I 2005/06 Knowledge Extraction/Semantic Web Paul Buitelaar German Research Center for Artificial Intelligence (DFKI)
Overview
• Semantic Web
  • Introduction
  • Semantic Web Representation and Query Languages
  • Semantic Web Tools
• Ontologies and Knowledge Markup
  • Ontologies and other Knowledge Organization Systems
  • Knowledge Markup for Ontology Population
  • Ontology Life-Cycle
• Knowledge Extraction
  • Ontology Population
  • Ontology Learning
Web Web Docs, Data
Web > Semantic Web Web Docs, Data Knowledge Markup
Web > Semantic Web Web Docs, Data Knowledge Markup Ontologies
Web > Semantic Web Knowledge Markup Ontologies
Accessing the Semantic Web - Machines Semantic Web Services Knowledge Markup Ontologies
Accessing the Semantic Web - Humans Semantic Web Services Knowledge Markup Ontologies Intelligent Man-Machine Interface
Semantic Web Layer cake • Introduced by Tim Berners-Lee in 2001 • Built upon existing WWW standards
Resource Description Framework (RDF)
• RDF is an extensible language for expressing graph structures
• Serializes to XML

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="… rdf-syntax-ns#"
         xmlns:rdfs="… rdf-schema#"
         xmlns="http://example.org">
  <rdf:Description rdf:nodeID="node1">
    <name>DFKI GmbH</name>
    <location>Kaiserslautern</location>
    <www rdf:resource="http://www.dfki.de" />
  </rdf:Description>
</rdf:RDF>

Graph: node1 --name--> "DFKI GmbH"; node1 --location--> "Kaiserslautern"; node1 --www--> http://www.dfki.de
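The graph on this slide can also be read as three subject-predicate-object triples. The following is a minimal sketch in plain Python, not an RDF library; the `Triple` alias, the `GRAPH` list and the `objects` helper are illustrative names, and `_:node1` simply follows the usual blank-node notation for `rdf:nodeID="node1"`.

```python
from typing import List, Tuple

# A triple is (subject, predicate, object); plain strings stand in for RDF terms.
Triple = Tuple[str, str, str]

# The three edges of the example graph, one triple per edge.
GRAPH: List[Triple] = [
    ("_:node1", "name", "DFKI GmbH"),
    ("_:node1", "location", "Kaiserslautern"),
    ("_:node1", "www", "http://www.dfki.de"),
]


def objects(graph: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return every object attached to `subject` via `predicate`."""
    return [o for s, p, o in graph if s == subject and p == predicate]


if __name__ == "__main__":
    print(objects(GRAPH, "_:node1", "location"))
```

Storing the graph as a flat list of triples mirrors how the XML serialization is interpreted: each child element of `rdf:Description` contributes one edge.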
RDF Schema (RDFS)
• Adds a vocabulary for representing classes and properties to RDF
Example graph: Student is-a Person; Teacher is-a Person; Student enrolledIn Course; Teacher teaches Course; Person name rdfs:Literal
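What a class vocabulary buys in practice is inference over the is-a links. The sketch below assumes the example figure reads as "Student is-a Person" and "Teacher is-a Person"; the dictionary layout and the function names are hypothetical, not part of RDFS itself.

```python
from typing import Dict, Set

# Assumed reading of the slide's figure: Student and Teacher are both Persons.
SUBCLASS_OF: Dict[str, Set[str]] = {
    "Student": {"Person"},
    "Teacher": {"Person"},
}


def superclasses(cls: str, subclass_of: Dict[str, Set[str]]) -> Set[str]:
    """All direct and indirect superclasses of `cls` (transitive closure)."""
    seen: Set[str] = set()
    stack = list(subclass_of.get(cls, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(subclass_of.get(parent, ()))
    return seen


def inferred_types(asserted: str, subclass_of: Dict[str, Set[str]]) -> Set[str]:
    """An instance of `asserted` is also an instance of every superclass."""
    return {asserted} | superclasses(asserted, subclass_of)


if __name__ == "__main__":
    print(sorted(inferred_types("Student", SUBCLASS_OF)))
```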
Web Ontology Language (OWL)
• OWL - Based on Description Logics
• Adds further modelling vocabulary on top of RDFS
Layers (syntax: semantics contributed)
• XML, XML Schema, Namespaces: Interpretation Context, Data Types
• RDF, RDF Schema: Formalization of Classes (Inheritance), Properties
• OWL: Formalization of Classes, Class Definitions, Properties, Property Types (e.g. Transitivity)
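Property types such as transitivity are what let a reasoner derive facts that were never stated. A minimal sketch of that one inference in plain Python follows; the `locatedIn` pairs are hypothetical illustration data, and a real OWL reasoner does far more than this loop.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]

# Hypothetical assertions for a property declared transitive, e.g. locatedIn.
LOCATED_IN: Set[Pair] = {
    ("Kaiserslautern", "Rhineland-Palatinate"),
    ("Rhineland-Palatinate", "Germany"),
}


def transitive_closure(pairs: Set[Pair]) -> Set[Pair]:
    """Add (a, d) whenever (a, b) and (b, d) are present, until nothing changes."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


if __name__ == "__main__":
    for pair in sorted(transitive_closure(LOCATED_IN)):
        print(pair)
```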
Semantic Web Query Languages - SPARQL
• SPARQL - query language developed by W3C
• Syntactically based on SQL
• Results available as XML Documents

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?foafName
WHERE {
  ?x foaf:name ?foafName .
  OPTIONAL { ?x foaf:mbox ?mbox }
}
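The OPTIONAL clause in this query means that a person without a mailbox still appears in the result. The sketch below hand-evaluates that one pattern over a small list of triples to show the behaviour; it is not a SPARQL engine, and the data and function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Solution = Dict[str, Optional[str]]

# Hypothetical data: two people, only one of them has a mailbox.
DATA: List[Triple] = [
    ("_:a", "foaf:name", "Alice"),
    ("_:a", "foaf:mbox", "mailto:alice@example.org"),
    ("_:b", "foaf:name", "Bob"),
]


def match_name_optional_mbox(data: List[Triple]) -> List[Solution]:
    """Evaluate { ?x foaf:name ?foafName . OPTIONAL { ?x foaf:mbox ?mbox } }."""
    solutions: List[Solution] = []
    for s, p, o in data:
        if p != "foaf:name":
            continue
        mboxes = [o2 for s2, p2, o2 in data if s2 == s and p2 == "foaf:mbox"]
        # OPTIONAL: keep the solution even if the optional pattern matches nothing.
        for mbox in (mboxes or [None]):
            solutions.append({"x": s, "foafName": o, "mbox": mbox})
    return solutions


def select(solutions: List[Solution], variables: List[str]) -> List[Solution]:
    """Project each solution onto the selected variables."""
    return [{v: sol[v] for v in variables} for sol in solutions]


if __name__ == "__main__":
    print(select(match_name_optional_mbox(DATA), ["foafName"]))
```

Without OPTIONAL, the mailbox pattern would act as a filter and the second person would be dropped from the result.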
Semantic Web Tools
• Programming APIs
  • Jena - Java
  • Redland - Python, …
  • RAP - PHP
• Editors
  • Protégé
  • OntoStudio
  • Triple20 - Prolog
• Storage
  • Sesame
  • OntoBroker
Ontologies in Philosophy
• Ontology is a branch of philosophy that deals with the nature and the organization of reality
• Science of Being (Aristotle, Metaphysics)
  • What characterizes being?
  • Eventually, what is being?
Ontologies in Computer Science
• Ontology refers to an engineering artifact
  • a specific vocabulary used to describe a certain reality
  • a set of explicit assumptions regarding the intended meaning of the vocabulary
• An Ontology is
  • an explicit specification of a conceptualization [Gruber 93]
  • a shared understanding of a domain of interest [Uschold/Gruninger 96]
Why Develop an Ontology?
• Make domain assumptions explicit
  • Easier to change domain assumptions
  • Easier to understand and update legacy data
• Separate domain knowledge from operational knowledge
  • Re-use domain and operational knowledge separately
• A community reference for applications
• Shared understanding of what information means
Types of Ontologies [Guarino, 98]
• Top-level ontologies describe very general concepts like space, time, event, which are independent of a particular problem or domain. It seems reasonable to have unified top-level ontologies for large communities of users.
• Domain ontologies describe the vocabulary related to a generic domain by specializing the concepts introduced in the top-level ontology.
• Task ontologies describe the vocabulary related to a generic task or activity by specializing the top-level ontologies.
• Application ontologies are the most specific ontologies. Concepts in application ontologies often correspond to roles played by domain entities while performing a certain activity.
Ontologies and Their Relatives
Figure (spectrum from informal to formal): Catalog / ID; Terms / Glossary; Thesauri; Informal Is-a; Formal Is-a; Formal Instance; Frames; Value Restrictions; Axioms (Disjoint, Inverse Relations, ...); General logical constraints
Knowledge Organization Systems
• Semantic Lexicons, e.g. WordNet
  • … group together words according to lexical semantic relations like synonymy, hyponymy, meronymy, antonymy, etc.
• Thesauri
  • … group together domain terms according to a set of taxonomic relations, including broader term, narrower term, sibling, etc.
• Semantic Networks and Ontologies
  • … group together classes of objects according to a set of relations that originate in the nature of the domain of application.
  • Ontologies are defined by a formal semantics, but semantic networks may be informally defined. Therefore all ontologies are semantic networks, but not all semantic networks are ontologies.
Thesauri - Examples

MeSH (Medical Subject Headings) is organized by terms (currently over 250,000) that correspond to a specific medical subject. For each such term a list of syntactic, morphological or semantic variants is given.
MeSH Heading: Databases, Genetic
Entry Terms: Genetic Databases; Genetic Sequence Databases; OMIM; Online Mendelian Inheritance in Man; Genetic Data Banks; Genetic Data Bases; Genetic Databanks; Genetic Information Databases
See Also: Genetic Screening

EuroVoc covers terminology in all of the official EU languages for all fields that concern the EU institutions, e.g., politics, trade, law, science, energy, agriculture, 27 such fields in total.
MT 3606 natural and applied sciences
UF gene pool; genetic resource; genetic stock; genotype; heredity
BT1 biology
BT2 life sciences
NT1 DNA
NT1 eugenics
RT genetic engineering (6411)
Semantic Networks - Examples

UMLS (Unified Medical Language System) integrates linguistic, terminological and semantic information. The Semantic Network consists of 134 semantic types and 54 relations between types.
Pharmacologic Substance affects Pathologic Function
Pharmacologic Substance causes Pathologic Function
Pharmacologic Substance complicates Pathologic Function
Pharmacologic Substance diagnoses Pathologic Function
Pharmacologic Substance prevents Pathologic Function
Pharmacologic Substance treats Pathologic Function

GO (Gene Ontology) allows for "consistent descriptions of gene products in different databases, including several of the world's major repositories for plant, animal and microbial genomes…" Organizing principles are molecular function, biological process and cellular component.
Accession: GO:0009292
Ontology: biological process
Synonyms: broad: genetic exchange
Definition: In the absence of a sexual life cycle, the processes involved in the introduction of genetic information to create a genetically different individual.
Term Lineage:
all : all (164142)
GO:0008150 : biological process (115947)
GO:0007275 : development (11892)
GO:0009292 : genetic transfer (69)
Example Ontology Consider an Example Ontology for the Newspaper Domain
Knowledge Markup
• Ontologies are used to semantically organize and retrieve data (structured, textual, multimedia) through knowledge markup
• Knowledge Markup from Text is based on Named-Entity Recognition, Semantic Tagging (Term to Class Mapping) and Relation Extraction
Consider the following example:

<news:story xmlns:jobs="http://www.jobs.org/owl-jobs#"
            xmlns:com="http://www.companies.org/owl-companies#"
            xmlns:it="http://www.it.net/owl-it#">
"We were surprised by several of the results, particularly the order of finish," said <jobs:SystemsAnalyst>Dan Olds</jobs:SystemsAnalyst>. <com:Company>IBM</com:Company> finished first with very strong results, and <com:Company>HP</com:Company> scored a solid number two; we expected to see <com:Company>Sun Microsystems</com:Company> challenging for first place or at least a strong second place. As the largest <it:operatingsystem>UNIX</it:operatingsystem> vendor in terms of number of installed systems, a third place finish should put their management on notice that their installed base may be vulnerable.
</news:story>
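The term-to-class mapping step of semantic tagging can be illustrated with a simple lookup table. The sketch below is an assumption-laden toy: the `GAZETTEER` dictionary and the `tag` function are hypothetical, matching is by exact string only, and named-entity recognition and relation extraction are not covered.

```python
import re
from typing import Dict

# Hypothetical gazetteer: surface term -> ontology class (prefix:ClassName).
GAZETTEER: Dict[str, str] = {
    "IBM": "com:Company",
    "HP": "com:Company",
    "Sun Microsystems": "com:Company",
    "UNIX": "it:operatingsystem",
}


def tag(text: str, gazetteer: Dict[str, str]) -> str:
    """Wrap every gazetteer term found in `text` in an element named after its class."""
    # Longest terms first, so multi-word terms win over any shorter overlap.
    terms = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")

    def wrap(match: "re.Match[str]") -> str:
        cls = gazetteer[match.group(1)]
        return f"<{cls}>{match.group(1)}</{cls}>"

    return pattern.sub(wrap, text)


if __name__ == "__main__":
    print(tag("IBM finished first, and HP scored a solid number two.", GAZETTEER))
```

A lookup like this cannot produce the `jobs:SystemsAnalyst` annotation for "Dan Olds" in the example; that requires recognizing a previously unseen name and its role from context.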
Knowledge Markup - Images Semantic Annotation of Medical Images (miAKT Project - UK)
Knowledge Markup - Images Semantic Annotation of Video (SmartMedia – DFKI KM)
Ontology Life-Cycle
Figure (cycle of phases, each with its activity):
• Create/Select: Development and/or Selection
• Populate: Knowledge Base Generation
• Validate: Consistency Checks
• Evolve: Extension, Modification
• Maintain: Usability Tests
• Deploy: Knowledge Retrieval
Knowledge Extraction • Ontology Population & Ontology Learning
Ontology Life-Cycle – Ontology Population (life-cycle figure repeated)
Ontology Population with SOBA
• SOBA: SmartWeb Ontology-based Annotation
• Application Context
  • SmartWeb (http://www.smartweb-projekt.de/) – German Project around World-Cup 2006
  • Integrates
    • Multimodal Dialog Processing
    • IR-based Question Answering
    • Ontology-Based Information Extraction
    • Semantic Web Services
• Ontology-Based Information Extraction …
  • Combines:
    • Semantic Wrapping of Semi-Structured Data
    • Semantic and Linguistic Annotation of Free Text
    • Inference Rules for Instantiation and Integration of Annotated Entities and Events
• … and Display
  • Ontology-driven Hyperlink Generation for Display of Extracted Information
SOBA – Processing and Data Flow
Figure (labels): Ontologies; Documents; Wrapping of Semi-Structured Data; PDF Analysis; Image Extraction; Linguistic Annotation; Named Entity Recognition & Semantic Tagging; Inference Rules for Instantiation & Integration; Knowledge Base
SWIntO: SmartWeb Integrated Ontology
• SWIntO (by AIFB, DFKI KM/IUI, EML) covers
  • Foundational (DOLCE) and General (SUMO) Knowledge
  • Domain- and Task-Specific Knowledge
    • Football / Sport Events
    • Navigation, Discourse, Multimedia
    • other
Figure (class hierarchy excerpt): SmartDOLCE:Entity … SmartSUMO:Attribute, SmartSUMO:Proposition, SmartSUMO:SocialRole … SportEvent:FootballPlayer, SportEvent:FootballOrganizationPerson … SportEvent:Goalkeeper, SportEvent:FootballClubPresident …
SmartWeb Corpus
• (Growing) Web Corpus through Monitor on
  • http://fifaworldcup.yahoo.com/
  • http://www.uefa.com/competitions/worldcup
• Semi-Structured Data
  • Tabular: Match Reports, Teams, etc.
• Free Text
  • Match Reports
  • Image Captions
Information Extraction from Free Text MatchEvent [Score, Team1, Team2] FootballPlayer
Information Extraction from Image Captions FoulEvent [FootballPlayer] FootballPlayer
Linguistic and Semantic Annotation

Mark Crossley saved twice with his legs from Huckerby.

Named Entity Recognition & Semantic Tagging
[Mark Crossley GOALKEEPER][saved GOALKEEPER_ACTION] twice with his legs from [Huckerby PLAYER].

Linguistic Annotation
[Mark Crossley GOALKEEPER : SUBJ][saved PRED : GOALKEEPER_ACTION] twice [with his legs PP_OBJ][from [Huckerby PLAYER] PP_ADJUNCT].

[ GOALKEEPER_ACTION = 'save', GOALKEEPER = 'Mark Crossley', PLAYER = 'Huckerby', MANNER = 'legs' ]
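The last step on this slide, going from annotated chunks to a filled frame, can be sketched as follows. Everything here is an assumption for illustration: the `Chunk` type, the hand-supplied head words, the flattening of the nested PLAYER annotation, and especially the rule that a PP_OBJ chunk fills the MANNER slot.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Chunk:
    text: str            # surface string of the chunk
    head: str            # head word or lemma, supplied by hand in this sketch
    sem: Optional[str]   # semantic tag, if any
    gf: Optional[str]    # grammatical function, if any


# The annotated example sentence, flattened into chunks.
CHUNKS: List[Chunk] = [
    Chunk("Mark Crossley", "Mark Crossley", "GOALKEEPER", "SUBJ"),
    Chunk("saved", "save", "GOALKEEPER_ACTION", "PRED"),
    Chunk("twice", "twice", None, None),
    Chunk("with his legs", "legs", None, "PP_OBJ"),
    Chunk("from Huckerby", "Huckerby", "PLAYER", "PP_ADJUNCT"),
]


def build_frame(chunks: List[Chunk]) -> Dict[str, str]:
    """Fill one slot per semantically tagged chunk, plus MANNER from a PP object."""
    frame: Dict[str, str] = {}
    for chunk in chunks:
        if chunk.sem is not None:
            frame[chunk.sem] = chunk.head
        elif chunk.gf == "PP_OBJ":
            # Assumed rule: an untagged prepositional object expresses manner.
            frame["MANNER"] = chunk.head
    return frame


if __name__ == "__main__":
    print(build_frame(CHUNKS))
```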
Annotation/Extraction Example
• Example Sentence from Match Report
  Allerdings ist Petrow fuer die Partie gegen Schweden gesperrt und kann erst gegen Ungarn eingesetzt werden.
  "However, Petrow is suspended for the match against Sweden and cannot be fielded again until the match against Hungary."
• Annotated/Extracted Information (with SProUT IE Tool - DFKI-LT)
  player_action & [GAME_EVENT "Ban", AGENT player & [SURNAME "PETROW"], IN_MATCH game & [TEAM2 "SWE", TOURNAMENT "Match"]]
  team & [NAME "HUN"]
Knowledge Base Generation
• Transformation of SProUT Output to F-Logic via Declarative Mappings, e.g.:

<type orig="player" target="dolce#natural-person-denomination">
  <link type="dolce#natural-person" method="dolce#HAS-DENOMINATION" id=""/>
  <map>
    <simple-mapping>
      <input>
        <arg orig="GIVEN_NAME" target="VAR1"/>
      </input>
      <output method="dolce#FIRSTNAME" value="VAR1"/>
    </simple-mapping>
    <simple-mapping>
      <input>
        <arg orig="SURNAME" target="VAR1"/>
      </input>
      <output method="dolce#LASTNAME" value="VAR1"/>
    </simple-mapping>
  </map>
</type>
SProUT to F-Logic

SProUT:
<FS type="player_action">
  <F name="GAME_EVENT"><FS type="world champion"/></F>
  <F name="ACTION_TIME"><FS type="1990"/></F>
  <F name="ACTION_LOCATION"><FS type="Italy"/></F>
  <F name="AGENT">
    <FS type="player">
      <F name="SURNAME"><FS type="Buchwald"/></F>
      <F name="GIVEN_NAME"><FS type="Guido"/></F>
    </FS>
  </F>
</FS>

F-Logic:
soba#player124:sportevent#FootballPlayer
  [sportevent#impersonatedBy -> soba#Guido_BUCHWALD].
soba#Guido_BUCHWALD:dolce#"natural-person"
  [dolce#"HAS-DENOMINATION" -> soba#Guido_BUCHWALD_Denomination].
soba#Guido_BUCHWALD_Denomination:dolce#"natural-person-denomination"
  [dolce#LASTNAME -> "Buchwald"; dolce#FIRSTNAME -> "Guido"].
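The effect of the two simple mappings (GIVEN_NAME to dolce#FIRSTNAME, SURNAME to dolce#LASTNAME) can be sketched as a small string-building function. This is not the SOBA implementation: the function name is hypothetical, and the instance naming scheme (given name, underscore, upper-cased surname) is only inferred from the example output on this slide.

```python
from typing import Dict

# Feature name in the extraction output -> F-Logic method, as in the two simple mappings.
SIMPLE_MAPPINGS: Dict[str, str] = {
    "SURNAME": "dolce#LASTNAME",
    "GIVEN_NAME": "dolce#FIRSTNAME",
}


def denomination_fact(features: Dict[str, str], mappings: Dict[str, str]) -> str:
    """Render the natural-person-denomination fact for one extracted player."""
    # Assumed naming scheme, inferred from the slide: Guido + BUCHWALD.
    instance = (
        f'soba#{features["GIVEN_NAME"]}_{features["SURNAME"].upper()}_Denomination'
    )
    slots = "; ".join(
        f'{method} -> "{features[feature]}"'
        for feature, method in mappings.items()
        if feature in features
    )
    return f'{instance}:dolce#"natural-person-denomination"\n  [{slots}].'


if __name__ == "__main__":
    player = {"SURNAME": "Buchwald", "GIVEN_NAME": "Guido"}
    print(denomination_fact(player, SIMPLE_MAPPINGS))
```

Keeping the mapping as data rather than code is the point of the declarative approach: supporting a new feature means adding one table entry, not changing the transformation logic.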
A Complex Example

semistruct#"Bolivien_vs_Brasilien_09_Oct_05_16_00_Luis_CRISTALDO":sportevent#FieldMatchFootballPlayer
  [ externalRepresentation@(de) ->> "Luis CRISTALDO (7)";
    sportevent#number -> 7;
    sportevent#impersonatedBy -> semistruct#"Luis_CRISTALDO" ].

semistruct#"Bolivien_vs_Brasilien_09_Oct_05_16_00"
  [ sportevent#matchEvents -> soba#ID25 ].

soba#ID25:sportevent#Foul
  [ sportevent#commitedBy -> semistruct#"Bolivien_vs_Brasilien_09_Oct_05_16_00_Luis_CRISTALDO" ].

mediainst#ID67:media#Picture
  [ media#URL -> "http://fifaworldcup.yahoo.com/06/de/photos/index.html?aid=124155&d=1";
    media#shows -> soba#ID25 ].
Ontology Life-Cycle – Ontology Learning (life-cycle figure repeated)
Ontology Learning Layer Cake
Layers, from top to bottom:
• Rules & Axioms
• Relations: cure(dom:DOCTOR, range:DISEASE)
• Taxonomy: is_a(DOCTOR, PERSON)
• Concepts: DISEASE:=<Int, Ext, Lex>
• (Multilingual) Synonyms: {disease, illness, Krankheit}
• Terms: disease, doctor, hospital
Introduced in: Philipp Cimiano, PhD Thesis, University of Karlsruhe, forthcoming