Knowledge Extraction based on Discourse Representation Theory and Linguistic Frames
• Valentina Presutti¹, Francesco Draicchio¹, Aldo Gangemi¹,²
• ¹ STLab, ISTC-CNR, Rome (IT)
• ² LIPN, Univ. Paris 13-CNRS-Sorbonne Cité (FR)
Machine reading with FRED
• Example: "The New York Times reported that John McCarthy died. He invented the programming language LISP."
• The resulting graph exhibits: frames/events, a meta-level, semantic roles, a custom namespace, types and taxonomy, vocabulary alignment, co-reference, resolution and linking
• Visit us later today at the demo session!
Outline • Robust Ontology Learning (ROL) • Statistical approaches • Rule-based approaches • Boxer-based frame detection – comparative evaluation • From NL to RDF-OWL and linked data • Ongoing work/evaluation, conclusions
Robust ontology learning (ROL)
• Fast and accurate NL to RDF/OWL transformation
• Mid-to-strong variety of machine reading
• Good design quality of the resulting ontologies
• More than entities + relations: aggregated data (frames, events) vs. sparse data
• Frame-based representation ⟶ good design quality
• [Coppola et al., ESWC-2009] shows that n-ary ontology design patterns (ODPs) can be easily derived from frames and have equivalent conceptual expressivity (with formal semantics in addition); see the sketch below
• [Blomqvist, ISWC-2009] provides evidence that the performance of OL methods improves when the learning cycle is augmented with ODPs
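A minimal sketch of how a frame occurrence can be rendered as an n-ary ODP in RDF (the namespaces and rdflib usage here are illustrative assumptions, not FRED's actual output):

```python
# Illustrative sketch: a Departing frame occurrence reified as an n-ary
# relation instance, with semantic roles as object properties.
# Namespaces are hypothetical stand-ins, not FRED's actual vocabulary.
from rdflib import Graph, Namespace, RDF

DOM = Namespace("http://www.example.org/domain/")   # hypothetical domain namespace
VN = Namespace("http://www.example.org/verbnet/")   # stand-in for VerbNet role IRIs

g = Graph()
g.bind("domain", DOM)
g.bind("vn", VN)

# The frame occurrence is an individual typed by the frame class...
g.add((DOM.depart_1, RDF.type, DOM.Depart))
# ...and each participant fills a role, so the n-ary relation stays first-class.
g.add((DOM.depart_1, VN.agent, DOM.Cabeza_De_Vaca))
g.add((DOM.depart_1, VN.source, DOM.Spain))

print(g.serialize(format="turtle"))
```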
Requirements for ROL on the Web • Ability to map natural language (Web of documents, still the major part) to RDF/OWL representations (Web of data) • Ability to capture accurate semantic structures (e.g. complex relations or frames) • Easy adaptation to the principles of linked data publishing (IRI, links) • Minimal computing time
Central question
• Can we map natural language to accurate and complete logical form for the Web, in reasonable time?
• State-of-the-art statistical frame detection is not yet generalizable and fast; OIE relation extraction is not yet accurate and complete enough
• State-of-the-art rule-based frame detection is more generalizable, reasonably fast, and decently accurate/complete, but it is not yet portable to Web standards, not integrated with resolution and linking, and brittle on wild text
• Warning: full-fledged (average-human) NLU requires too many layers to be considered achievable at this time in history
State of the art: what CAN we do with statistical approaches?
Kinds of text analyses • Leading example: • “In early 1527, Cabeza De Vaca departed Spain as the treasurer of the Narvaez royal expedition to occupy the mainland of North America. After landing near Tampa Bay, Florida on April 15, 1528, Cabeza De Vaca and three other men would be the only survivors of the expedition party of 600 men.”
Shallow parsing (Alchemy)
• Language recognition: English
• Named entity recognition: Person (1): the only survivors; GeographicFeature (1): Tampa Bay; Continent (1): North America; Country (1): Spain; StateOrCounty (1): Florida
• Concept tagging (8): United States, Florida, Gulf of Mexico, North America, Tampa (Florida), Narváez expedition, American Civil War, Tampa Bay
• Subject classification: Culture & Politics (confidence: 0.6598)
• Relation (action) extraction
Entity resolution (Semiosearch Wikifier)
• Wikipedia-based topic detection
• Named entity resolution
Relation extraction with ReVerb (Etzioni et al.)
• Extraction 1 (confidence 0.906): Cabeza De Vaca ___ departed ___ Spain
  • Sentence: In early 1527, Cabeza De Vaca departed Spain as the treasurer of the Narvaez royal expedition to occupy the mainland of North America.
  • POS tags: IN JJ CD , NNP NNP NNP VBD NNP IN DT NN IN DT NNP JJ NN TO VB DT NN IN NNP NNP .
  • Chunks: B-PP B-NP I-NP O B-NP I-NP I-NP B-VP B-NP B-PP B-NP I-NP I-NP I-NP I-NP I-NP I-NP B-VP I-VP B-NP I-NP I-NP I-NP I-NP O
  • Normalized: cabeza de vaca ___ depart ___ spain
• Extraction 2 (confidence 0.366): three other men ___ would be the only survivors of ___ the expedition party of 600 men
  • Sentence: After landing near Tampa Bay, Florida on April 15, 1528, Cabeza De Vaca and three other men would be the only survivors of the expedition party of 600 men.
  • POS tags: IN NN IN NNP NNP , NNP IN NNP CD , CD , NNP NNP NNP CC CD JJ NNS MD VB DT JJ NNS IN DT NN NN IN CD NNS .
  • Chunks: B-PP B-NP B-PP B-NP I-NP O B-NP B-PP B-NP I-NP I-NP I-NP O B-NP I-NP I-NP O B-NP I-NP I-NP B-VP I-VP B-NP I-NP I-NP I-NP I-NP I-NP I-NP I-NP I-NP I-NP O
  • Normalized: # other men ___ be survivor of ___ the expedition party of # men
Frame detection with Semafor • Probabilistic frame parser (Das et al.)
Rule-based Approaches: Deep parsing, computational semantics, DRT
Deep parsing
• Usage of syntactic parse trees
• Formal semantic interpretation on top of syntax (c&c output for the leading example is shown on the next slide)
Categorial parsing (c&c)
One Prolog fact per token: sentence id, position, token, lemma, POS, chunk, NE tag, CCG category.
w(1, 1, 'In', 'in', 'IN', 'I-PP', 'O', '(S[X]/S[X])/NP').
w(1, 2, 'early', 'early', 'JJ', 'I-NP', 'I-DAT', 'N/N').
w(1, 3, '1527', '1527', 'CD', 'I-NP', 'I-DAT', 'N').
w(1, 4, ',', ',', ',', 'O', 'I-DAT', ',').
w(1, 5, 'Cabeza', 'Cabeza', 'NNP', 'I-NP', 'I-ORG', 'N/N').
w(1, 6, 'De', 'De', 'NNP', 'I-NP', 'I-ORG', 'N/N').
w(1, 7, 'Vaca', 'Vaca', 'NNP', 'I-NP', 'I-ORG', 'N').
w(1, 8, 'departed', 'depart', 'VBD', 'I-VP', 'O', '((S[dcl]\NP)/PP)/NP').
w(1, 9, 'Spain', 'Spain', 'NNP', 'I-NP', 'O', 'N').
w(1, 10, 'as', 'as', 'IN', 'I-PP', 'O', 'PP/NP').
w(1, 11, 'the', 'the', 'DT', 'I-NP', 'O', 'NP[nb]/N').
w(1, 12, 'treasurer', 'treasurer', 'NN', 'I-NP', 'O', 'N').
w(1, 13, 'of', 'of', 'IN', 'I-PP', 'O', '(NP\NP)/NP').
w(1, 14, 'the', 'the', 'DT', 'I-NP', 'O', 'NP[nb]/N').
w(1, 15, 'Narvaez', 'Narvaez', 'NNP', 'I-NP', 'I-ORG', 'N/N').
w(1, 16, 'royal', 'royal', 'NN', 'I-NP', 'O', 'N/N').
w(1, 17, 'expedition', 'expedition', 'NN', 'I-NP', 'O', 'N').
w(1, 18, 'to', 'to', 'TO', 'I-VP', 'O', '(S[to]\NP)/(S[b]\NP)').
w(1, 19, 'occupy', 'occupy', 'VB', 'I-VP', 'O', '(S[b]\NP)/NP').
w(1, 20, 'the', 'the', 'DT', 'I-NP', 'O', 'NP[nb]/N').
w(1, 21, 'mainland', 'mainland', 'NN', 'I-NP', 'O', 'N').
w(1, 22, 'of', 'of', 'IN', 'I-PP', 'O', '(NP\NP)/NP').
w(1, 23, 'North', 'North', 'NNP', 'I-NP', 'I-LOC', 'N/N').
w(1, 24, 'America', 'America', 'NNP', 'I-NP', 'I-LOC', 'N').
w(1, 25, '.', '.', '.', 'O', 'O', '.').
w(1, 26, 'After', 'after', 'IN', 'I-PP', 'O', '(S[X]/S[X])/NP').
w(1, 27, 'landing', 'landing', 'NN', 'I-NP', 'O', 'N').
w(1, 28, 'near', 'near', 'IN', 'I-PP', 'O', '(NP\NP)/NP').
w(1, 29, 'Tampa', 'Tampa', 'NNP', 'I-NP', 'I-LOC', 'N/N').
w(1, 30, 'Bay', 'Bay', 'NNP', 'I-NP', 'I-LOC', 'N').
w(1, 31, ',', ',', ',', 'O', 'O', ',').
w(1, 32, 'Florida', 'Florida', 'NNP', 'I-NP', 'I-LOC', 'N').
w(1, 33, 'on', 'on', 'IN', 'I-PP', 'O', '(NP\NP)/NP').
w(1, 34, 'April', 'April', 'NNP', 'I-NP', 'I-DAT', 'N/N[num]').
w(1, 35, '15', '15', 'CD', 'I-NP', 'I-DAT', 'N[num]').
w(1, 36, ',', ',', ',', 'I-NP', 'I-DAT', ',').
w(1, 37, '1528', '1528', 'CD', 'I-NP', 'I-DAT', 'N\N').
w(1, 38, ',', ',', ',', 'O', 'O', ',').
w(1, 39, 'Cabeza', 'Cabeza', 'NNP', 'I-NP', 'I-ORG', 'N/N').
w(1, 40, 'De', 'De', 'NNP', 'I-NP', 'I-ORG', 'N/N').
w(1, 41, 'Vaca', 'Vaca', 'NNP', 'I-NP', 'I-ORG', 'N').
w(1, 42, 'and', 'and', 'CC', 'O', 'O', 'conj').
w(1, 43, 'three', 'three', 'CD', 'I-NP', 'O', 'N/N').
w(1, 44, 'other', 'other', 'JJ', 'I-NP', 'O', 'N/N').
w(1, 45, 'men', 'man', 'NNS', 'I-NP', 'O', 'N').
w(1, 46, 'would', 'would', 'MD', 'I-VP', 'O', '(S[dcl]\NP)/(S[b]\NP)').
w(1, 47, 'be', 'be', 'VB', 'I-VP', 'O', '(S[b]\NP)/NP').
w(1, 48, 'the', 'the', 'DT', 'I-NP', 'O', 'NP[nb]/N').
w(1, 49, 'only', 'only', 'JJ', 'I-NP', 'O', 'N/N').
w(1, 50, 'survivors', 'survivor', 'NNS', 'I-NP', 'O', 'N').
w(1, 51, 'of', 'of', 'IN', 'I-PP', 'O', '(NP\NP)/NP').
w(1, 52, 'the', 'the', 'DT', 'I-NP', 'O', 'NP[nb]/N').
w(1, 53, 'expedition', 'expedition', 'NN', 'I-NP', 'O', 'N/N').
w(1, 54, 'party', 'party', 'NN', 'I-NP', 'O', 'N').
w(1, 55, 'of', 'of', 'IN', 'I-PP', 'O', '(NP\NP)/NP').
w(1, 56, '600', '600', 'CD', 'I-NP', 'O', 'N/N').
w(1, 57, 'men', 'man', 'NNS', 'I-NP', 'O', 'N').
w(1, 58, '.', '.', '.', 'O', 'O', '.').
DRT
• Discourse Representation Theory
• Hans Kamp: "A theory of truth and semantic representation", 1981
• A sentence meaning is taken to be an update operation on a context
• DRT represents "the discourse context" as a discourse representation structure (DRS). A DRS includes:
  • A set of referents: the entities that have been introduced into the context
  • A set of conditions: the predicates that are known to hold of these entities
• Basically (a fragment of) FOL, with simplified quantification
• DRT is bound to the visible ("in praesentia") linguistic context
• "Every look he gives you, I get sicker and sicker." (Tristan to Isolde)
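A minimal sketch of a DRS in executable form, using NLTK's DRT implementation (an illustrative assumption; Boxer's own notation differs slightly):

```python
# A DRS for "Cabeza De Vaca departed Spain": referents in the top
# compartment, conditions below; fol() translates to first-order logic.
from nltk.sem.drt import DrtExpression

dexpr = DrtExpression.fromstring
drs = dexpr(r'([x, y], [cabeza_de_vaca(x), spain(y), depart(x, y)])')

print(drs)        # -> ([x,y],[cabeza_de_vaca(x), spain(y), depart(x,y)])
print(drs.fol())  # -> exists x y.(cabeza_de_vaca(x) & spain(y) & depart(x,y))
```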
Boxer
• Implementation of computational semantics (J. Bos) with DRT output and Davidsonian predicates: reification of n-ary relations
• Semantic role labelling with Proto, VerbNet or FrameNet roles
• Pragmatic grasp, with statistical NER and sense tagging (M. Ciaramita's SST), tense logic, co-reference, presupposition, sentence integration, entailment, ...
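A sketch of the Davidsonian reification in the same illustrative NLTK notation as above: the event becomes a discourse referent e, and the n-ary relation decomposes into binary role conditions on e.

```python
# Neo-Davidsonian DRS sketch: the departing event is itself a referent,
# so semantic roles (and modifiers) attach to it as binary conditions.
from nltk.sem.drt import DrtExpression

dexpr = DrtExpression.fromstring
drs = dexpr(r'([e, x, y], [depart(e), agent(e, x), theme(e, y), '
            r'cabeza_de_vaca(x), spain(y)])')

print(drs.fol())
# -> exists e x y.(depart(e) & agent(e,x) & theme(e,y)
#                  & cabeza_de_vaca(x) & spain(y))
```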
Extending Boxer for frame detection
• Minimal adaptation of Boxer
• Boxer's reference knowledge base enriched with updated Semlink (VN2FN) mappings
• Boxer uses VerbNet roles by default
• There is still room for improving the detection algorithm (e.g., frame disambiguation)
• Currently Boxer selects the first candidate frame, with no ranking criteria (see the sketch below)
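A minimal sketch of that selection strategy, with hypothetical Semlink-style entries (the real VN2FN mapping table is much larger):

```python
# Hypothetical VerbNet-to-FrameNet lookup; entries are illustrative only.
from typing import Optional

VN2FN = {
    # (verb lemma, VerbNet class) -> candidate FrameNet frames
    ("depart", "leave-51.2"): ["Departing", "Quitting_a_place"],
    ("invent", "create-26.4"): ["Achieving_first", "Invention"],
}

def detect_frame(lemma: str, vn_class: str) -> Optional[str]:
    """Return the first candidate frame, with no ranking (as noted above)."""
    candidates = VN2FN.get((lemma, vn_class), [])
    return candidates[0] if candidates else None

print(detect_frame("depart", "leave-51.2"))  # Departing
```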
Frame detection: comparative evaluation
• SEMAFOR vs. Boxer
• SEMAFOR is the best-performing tool for frame detection so far
• Task 19 (Semeval '07): frame detection task
• Given a textual sentence, the system has to automatically extract facts from it and predict the FrameNet frame structures that best fit those facts
Evaluation setting 1/2
• SEMAFOR was trained on the Semeval '07 target corpus after the challenge, hence we built a new benchmark:
  • A FrameNet-annotated corpus of sample frame sentences
  • 1214 sentences, each fully annotated with at least one frame
  • Automatically generated by randomly selecting sentences from the whole set of annotated samples
• Measures: precision, recall, coverage, and time-efficiency
Evaluation setting 2/2
• We have built an evaluation component implementing the same criteria used in the Semeval '07 task evaluation:
  • It compares the results of the two systems against the gold standard provided by the benchmark (1)
(1) frame detection only from verbs
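A rough sketch of these measures (the actual Semeval '07 scorer is more involved; the triple-based representation here is an assumption for illustration):

```python
# Gold and predicted annotations as (sentence_id, target_verb, frame) triples.
def evaluate(gold: set, predicted: set) -> dict:
    tp = len(gold & predicted)                      # exact-match true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Coverage read here as: share of gold targets receiving any prediction.
    gold_targets = {(s, v) for s, v, _ in gold}
    pred_targets = {(s, v) for s, v, _ in predicted}
    coverage = (len(gold_targets & pred_targets) / len(gold_targets)
                if gold_targets else 0.0)
    return {"precision": precision, "recall": recall,
            "f1": f1, "coverage": coverage}

gold = {(1, "depart", "Departing"), (2, "die", "Death")}
pred = {(1, "depart", "Departing"), (2, "die", "Dying")}
print(evaluate(gold, pred))  # precision 0.5, recall 0.5, f1 0.5, coverage 1.0
```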
Evaluation results
• With only exact match: SEMAFOR has the better F-score
• With exact as well as partial match: comparable performance, and a substantial increase in Boxer precision
• Boxer is much faster
Why are these good results?
• They show that a rule-based approach can easily reach performance comparable to the current best frame detection tool, which is based on a probabilistic approach
• The reduced coverage can be explained and fixed:
  • Boxer currently exploits only partial FrameNet knowledge, while SEMAFOR encodes most of it by being trained on FrameNet data
  • Boxer coverage can be improved by integrating FrameNet-LOD
  • The detection algorithm can be further improved by (i) enhancing it with frame disambiguation, and (ii) extending the set of Semlink mappings between VerbNet and FrameNet
Why are these good results? (cont.)
• The benchmark used put SEMAFOR at an advantage over Boxer: SEMAFOR has been trained on FrameNet full-text annotations
• Boxer needs no training: using SEMAFOR (or other probabilistic tools) on corpora independent of FrameNet would require proper training
Output format
• A main problem of existing (probabilistic) approaches to frame detection is that they do not provide a logical form as output
• Boxer provides a DRT-based logical representation of natural language
• FRED translates/refactors Boxer output to OWL/RDF, and enriches it with NER, resolution, WSD, linking, and alignment
The leading example • In early 1527, Cabeza De Vaca departed Spain as the treasurer of the Narvaez royal expedition to occupy the mainland of North America. After landing near Tampa Bay, Florida on April 15, 1528, Cabeza De Vaca and three other men would be the only survivors of the expedition party of 600 men. • What happens if we submit it to FRED?
FRED on STLab tools (1)
(1) http://wit.istc.cnr.it/stlab-tools/fred/
FRED RDF graph output, exhibiting: frames/events, semantic roles, a custom namespace, co-reference, vocabulary alignment, types and taxonomy, resolution and linking
Another example
• The New York Times reported that John McCarthy died. He invented the programming language LISP.
• The interesting part here is resolving co-reference across the two sentences, and dealing with the "surface" meta-level (the declarative proposition introduced by reporting)
Rule-based DRT2OWL
• Translation rules (a sketch of the general pattern follows below):
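A minimal sketch of the translation pattern, assuming Boxer's DRS conditions have already been parsed into (predicate, arguments) pairs; the namespace and the two rules shown are illustrative, not FRED's actual rule set:

```python
# Unary DRS conditions pred(x) become class assertions; binary role
# conditions role(e, x) become object-property assertions.
from rdflib import Graph, Namespace, RDF

DOM = Namespace("http://www.example.org/domain/")  # hypothetical namespace

def drs_conditions_to_rdf(conditions):
    g = Graph()
    g.bind("domain", DOM)
    for pred, args in conditions:
        if len(args) == 1:      # depart(x)  -> domain:x rdf:type domain:Depart
            g.add((DOM[args[0]], RDF.type, DOM[pred.capitalize()]))
        elif len(args) == 2:    # agent(e,x) -> domain:e domain:agent domain:x
            g.add((DOM[args[0]], DOM[pred], DOM[args[1]]))
    return g

g = drs_conditions_to_rdf([("depart", ["depart_1"]),
                           ("agent", ["depart_1", "cabeza_de_vaca"])])
print(g.serialize(format="turtle"))
```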
Rule-based DRT2OWL
• Boxer built-in types (1) alignment to Semantic Web entities
(1) http://www.ontologydesignpatterns.org/ont/boxer/boxer.owl
Issues when porting (Boxer) DRT to RDF • Discourse referent variables: implicit or explicit? • No terminology recognition/extraction • No term compositionality • No periphrastic relations for properties (e.g. survivorOf) • Redundant variables • Missing definitional pragmatics • Implicit local restrictions • Many types of redundant “boxing” for embedded propositions, non-standard negation, etc. • How to effectively map to RDF/OWL?
Heuristics to map Boxer DRT to RDF (1/2)
• Default predicates
  • H_pred1: special vocabulary for defaults, e.g. boxer:Per rdf:type owl:Class
  • H_pred2: mappings to existing vocabularies, e.g. foaf:Person ; dul:associatedWith ; time:Interval
• Default roles (SRL binary predicates)
  • H_rol: VerbNet/FrameNet vocabularies, e.g. verbnet:agent rdf:type owl:ObjectProperty
• Domain predicates
  • H_dom: default namespace, customizable, e.g. domain:Expedition ; domain:of
• Discourse referent variables: implicit or explicit?
  • H_ref: only referents of predicates are materialized, e.g. domain:survivor_1 rdf:type domain:Survivor ; domain:depart_1 rdf:type domain:Depart
• Terminology recognition/extraction
  • H_term: co-referential predicate merging, e.g. domain:ExpeditionParty
• Term compositionality
  • H_termc: create taxonomies by "unmerging" predicates, e.g. domain:ExpeditionParty rdfs:subClassOf domain:Party (see the sketch after the next slide)
• Periphrastic relations
  • H_peri: generate "webby" properties, e.g. domain:survivorOf rdf:type owl:ObjectProperty
Heuristics to map Boxer DRT to RDF (2/2)
• Getting rid of redundant variables
  • H_red: unify variables based on a local Unique Name Assumption
• Boolean constructs
  • H_boo: special relations between domain predicates or individuals, e.g. domain:man_1 owl:differentFrom domain:man_2
• Propositions as arguments (meta-level)
  • H_pro: special relation between events, e.g. domain:report_1 vn:theme domain:depart_1
• Definitional pragmatics
  • H_def: bypassing discourse referents and forcing universal quantification, e.g. domain:WindInstrument rdfs:subClassOf domain:MusicalInstrument
• Implicit local restrictions (optional, when the task wants it)
  • H_res: inducing anonymous classes, e.g. domain:WindInstrument rdfs:subClassOf (locationOf some domain:Contain)
• Heuristics implemented as rules (currently via scripts, next as Stanbol Rules / SPARQL 1.1)
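As one example, a minimal sketch of H_termc (term "unmerging"); the camel-case convention for merged terms is an assumption for illustration:

```python
# "Unmerge" a merged multi-word term into a taxonomy by stripping the
# leftmost modifier at each step, e.g. ExpeditionParty -> Party.
import re
from rdflib import Graph, Namespace, RDFS

DOM = Namespace("http://www.example.org/domain/")  # hypothetical namespace

def unmerge(term):
    """Yield (subclass, superclass) name pairs from a camel-case term."""
    parts = re.findall(r'[A-Z][a-z]+', term)
    while len(parts) > 1:
        sub = "".join(parts)
        parts = parts[1:]                 # drop the leftmost modifier
        yield sub, "".join(parts)

g = Graph()
g.bind("domain", DOM)
for sub, sup in unmerge("ExpeditionParty"):
    g.add((DOM[sub], RDFS.subClassOf, DOM[sup]))

print(g.serialize(format="turtle"))
# domain:ExpeditionParty rdfs:subClassOf domain:Party .
```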
Hybridisation
• Identity of entities or schema elements, and entity linking, are crucial for LOD production
• NER and entity resolution to DBpedia with Semiosearch Wikifier (trained on Wikipedia)
• WSD to e.g. WordNet-OWL with off-the-shelf NLP components (e.g. UKB, trained on WordNet)
• Vocabulary alignment (FOAF, SKOS, DBPO, DOLCE+DnS Ultralite, …)
FRED
• Available online at http://wit.istc.cnr.it/stlab-tools/fred/
• Available as a web service: http://stlab.istc.cnr.it/stlab/FRED (a hedged usage sketch follows below)
• Soon available as an open source project
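A sketch of calling the service over HTTP; the parameter name and content negotiation below are assumptions for illustration (check the service page for the actual API):

```python
# Hypothetical call pattern: plain HTTP GET returning an RDF serialization.
import requests

FRED_ENDPOINT = "http://wit.istc.cnr.it/stlab-tools/fred"

resp = requests.get(
    FRED_ENDPOINT,
    params={"text": "John McCarthy died."},        # assumed parameter name
    headers={"Accept": "application/rdf+xml"},     # assumed content negotiation
)
print(resp.status_code)
print(resp.text[:300])
```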
News: Tìpalo (1) – a FRED application
• Automatic typing of Wikipedia entities based on FRED
• Results are very good
• An indirect evaluation of FRED performance
• To be presented at ISWC 2012
• Preview demo available today and online (1)
(1) http://wit.istc.cnr.it/stlab-tools/tipalo
Ongoing and future work
• Evaluate FRED results based on domain/task coverage and accuracy wrt ontology design criteria
• Compare FRED to other ROL tools, e.g.:
  • OntoCase (uses ODPs for augmenting Text2Onto results)
  • LODifier (based on DRT)
• More heuristics, e.g. for special text types
• Multilinguality
• Compliance with NIF
FRED online @ http://wit.istc.cnr.it/stlab-tools/fred/ Questions? Don’t miss FRED demo today!