L-ISA: Learning Domain Specific ISA Relations from the Web. Alessandra Potrich and Emanuele Pianta, Fondazione Bruno Kessler - IRST, Trento, Italy. LREC 2008, Marrakech, 31 May 2008.
Overview • Learning ISA relations in the patent processing domain (the PATExpert Project) • The L-ISA algorithm • Evaluation • Future Work
Ontology Learning/Population • Ontology Learning: acquisition of new concepts and of the relations between them • e.g. a device is an artifact • Ontology Population: acquisition of factual knowledge about specific instances • e.g. Einstein is an instance of a scientist • e.g. Einstein was born in 1879
PATExpert • Funded by the European Union • Aim: improving patent retrieval, summarization, paraphrasing, classification and valuation through shallow and deep semantic analysis • Main semantic analysis task: recognizing occurrences of KB concepts and relations • Proof of concept in two domains • Optical Recording • Machine Tools • Focus of the presentation: Ontology Learning in the Optical Recording domain
Optical Recording Domain Ontology (ORDO) • Based on the OWL formalism • Built in three stages • 200 manually crafted concepts, starting from a list of the most frequent terms in a reference corpus • Pro-ISA: ontology learning algorithm based on the projection of WordNet fragments onto ORDO • L-ISA: ontology learning algorithm based on the acquisition of ISA templates from the Web
Patent Concept Annotation • Given a target word: • Disambiguate it by assigning a WordNet (WN) synset whose domain is compatible with the optical recording domain (exploiting WORDNET-DOMAINS) • If the synset is linked to an ORDO concept, annotate the target word with that ORDO concept • Otherwise: apply Pro-ISA • Otherwise: apply L-ISA
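The fallback order on this slide can be sketched as a small cascade. This is only an illustration: the function name `annotate`, the callables passed in, and the toy mapping are hypothetical stand-ins for the real WordNet, Pro-ISA and L-ISA components, and the sketch assumes L-ISA is tried whenever the earlier steps yield nothing.

```python
from typing import Callable, Dict, Optional, Tuple


def annotate(
    term: str,
    disambiguate: Callable[[str], Optional[str]],
    synset_to_ordo: Dict[str, str],
    pro_isa: Callable[[str], Optional[str]],
    l_isa: Callable[[str], Optional[str]],
) -> Tuple[Optional[str], str]:
    """Return (concept, strategy) following the slide's fallback order."""
    synset = disambiguate(term)  # None when the term is not in WordNet
    if synset is not None:
        concept = synset_to_ordo.get(synset)
        if concept is not None:
            return concept, "direct"  # synset already linked to ORDO
        concept = pro_isa(synset)
        if concept is not None:
            return concept, "pro-isa"  # projected from WordNet
    concept = l_isa(term)  # last resort: learn from the Web
    if concept is not None:
        return concept, "l-isa"
    return None, "unlinked"


# Toy usage with stub components (all values are illustrative).
links = {"compact_disk": "ordo:cd"}
result = annotate("CD", lambda t: "compact_disk", links, lambda s: None, lambda t: None)
```

Passing the three strategies in as callables keeps the cascade itself independent of how each strategy is implemented.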
Choosing the right sense • Senses for the word “CD”: 1. cadmium, Cd, atomic_number_48 (CHEMISTRY) 2. candle, candela, cd, standard candle (PHYSICS) 3. certificate of deposit, CD (MONEY) 4. compact disk, compact disc, CD (COMPUTER, MUSIC)
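A minimal sketch of domain-based sense selection for the “CD” example follows. The set of domains treated as compatible with optical recording is an assumption made for illustration; the slides do not list it, and the real system reads the labels from WORDNET-DOMAINS.

```python
from typing import List, Optional, Set, Tuple

# Assumption: which WORDNET-DOMAINS labels count as compatible is not
# given in the source; these two are chosen only to make the example work.
COMPATIBLE_DOMAINS: Set[str] = {"COMPUTER", "MUSIC"}

# Senses of "CD" with their domain labels, as listed on the slide.
SENSES_CD: List[Tuple[str, Set[str]]] = [
    ("cadmium", {"CHEMISTRY"}),
    ("candle", {"PHYSICS"}),
    ("certificate_of_deposit", {"MONEY"}),
    ("compact_disk", {"COMPUTER", "MUSIC"}),
]


def pick_sense(
    senses: List[Tuple[str, Set[str]]],
    compatible: Set[str] = COMPATIBLE_DOMAINS,
) -> Optional[str]:
    """Return the first sense sharing a domain label with the target domain."""
    for synset, domains in senses:
        if domains & compatible:
            return synset
    return None  # no compatible sense: the term stays undisambiguated


chosen = pick_sense(SENSES_CD)
```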
Direct Concept Annotation • Lemma: CD • Synset: {compact_disk, compact_disc, CD} • KB concept: ordo:cd
Pro-ISA 1: Looking for a WN-to-ORDO link • Lemma: cross-talk • WordNet ISA chain, most general synset first: {event} (linked to sumo:Process) • {happening, occurrence, occurrent, natural_event} • {trouble} • {noise, interference, disturbance} • {crosstalk, XT} -{cross_talk, cross-talk, crosstalk_amount}
Pro-ISA 2: Projecting ISA chains (WN -> ORDO) • Lemma: cross-talk • {event}: sumo:Process • {happening, occurrence, occurrent, natural_event}: auto_ordo:happening • {trouble}: auto_ordo:trouble • {noise, interference, disturbance}: auto_ordo:noise • {crosstalk, XT} -{cross_talk, cross-talk, crosstalk_amount}: auto_ordo:crosstalk
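The two Pro-ISA steps can be sketched on the cross-talk example: climb the WordNet hypernym chain until a synset with an existing ontology link is found, then create one new concept per intermediate synset. The dictionaries and the function `project_chain` are hypothetical toy structures, each synset is abbreviated to one word, and the `auto_ordo:` naming simply mirrors the labels on the slide.

```python
from typing import Dict, List, Tuple

# Toy WordNet fragment from the slide: synset -> its hypernym.
HYPERNYM: Dict[str, str] = {
    "crosstalk": "noise",
    "noise": "trouble",
    "trouble": "happening",
    "happening": "event",
}
# Synsets that already have a link to an ontology concept.
LINKED: Dict[str, str] = {"event": "sumo:Process"}


def project_chain(
    synset: str,
    hypernym: Dict[str, str] = HYPERNYM,
    linked: Dict[str, str] = LINKED,
) -> List[Tuple[str, str]]:
    """Return new (concept, parent_concept) ISA edges, most general first."""
    chain = [synset]
    while chain[-1] not in linked:  # step 1: look for a linked ancestor
        parent = hypernym.get(chain[-1])
        if parent is None:
            return []  # reached the top without finding any link
        chain.append(parent)
    edges = []
    parent_concept = linked[chain[-1]]
    for node in reversed(chain[:-1]):  # step 2: project downwards
        concept = f"auto_ordo:{node}"
        edges.append((concept, parent_concept))
        parent_concept = concept
    return edges


edges = project_chain("crosstalk")
```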
From Pro-ISA to L-ISA • In 15% of cases the target word is not in WordNet, so Pro-ISA cannot be applied • In those cases we try to exploit the Web • Why not the patent corpus itself? • ISA relations are not frequent in restricted corpora • Patents often contain concept definitions with local scope • We don’t want idiosyncratic concept definitions, but common, shared definitions
Learning ISA relations from a corpus … • … by exploiting linguistic patterns expressing the ISA relation (Hearst, 1992; Hearst, 1998; Mititelu, 2006) • Many patterns have been presented in the literature, but • Few evaluations of pattern reliability (except Snow 2006) • Even fewer task-oriented evaluations in domain-specific, concrete application scenarios • This paper: an attempt to provide both kinds of evaluation in a real-world, challenging scenario such as patent semantic analysis
Lexico-Syntactic Patterns • Patterns reported in the literature have the form NP1 isa-phrase NP2, where NP1 and NP2 are syntactic noun phrases and the isa-phrase is a sequence of tokens • In our case we are looking for the hypernym of a specific target term, so a pattern is either Term-NP isa-phrase Hyper-NP or Hyper-NP isa-phrase Term-NP
L-ISA • Google (or any other web search engine) does not allow searching for lexico-syntactic patterns… • So, we proceed in three steps • Snippet acquisition from Google • Lexico-syntactic filtering • Semantic filtering
L-ISA: Snippet Acquisition • Suppose we cannot link the term “photodetector” to any ORDO concept • We want to exploit the following lexico-syntactic pattern: <TERM-NP> “is an” <HYPER-NP> • Submit the following query string to Google: “photodetector is an” • Keep the first 100 snippets (at most), e.g. “... upper frequencies, the PIN waveguide photodetector is an attractive device, since it is possible to reduce transit time without ..” • Convert the HTML snippets into plain text
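The query construction and snippet clean-up can be sketched as below. The helper names are hypothetical, the tag stripping is a deliberately crude regular expression, and the call to the search engine itself is left out because the slides do not describe how it was made.

```python
import html
import re
from typing import List

MAX_SNIPPETS = 100  # the slide keeps at most the first 100 snippets


def build_query(term: str, isa_phrase: str = "is an") -> str:
    """Instantiate <TERM-NP> isa-phrase as an exact-phrase query string."""
    return f'"{term} {isa_phrase}"'


def snippet_to_text(snippet_html: str) -> str:
    """Strip tags, decode HTML entities and normalize whitespace."""
    no_tags = re.sub(r"<[^>]+>", "", snippet_html)
    return re.sub(r"\s+", " ", html.unescape(no_tags)).strip()


def clean_snippets(raw_snippets: List[str]) -> List[str]:
    """Keep the first MAX_SNIPPETS snippets and convert them to plain text."""
    return [snippet_to_text(s) for s in raw_snippets[:MAX_SNIPPETS]]


query = build_query("photodetector")
text = snippet_to_text("the PIN waveguide <b>photodetector is an</b>  attractive device &amp; more")
```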
L-ISA: Lexico-syntactic Filtering • Annotate snippets with TextPro (PoS, lemma, chunk) • Recognize <Term-NP> isa-phrase <Hyper-NP> in the annotated snippets
L-ISA: Lexico-syntactic Filtering • Filter out the TERM-NP: • if the target term is modified (e.g. “PIN waveguide photodetector” above) • if it looks like a proper name (e.g. an uppercase letter in the middle of a sentence) • Keep the HYPER-NP: • only if it fits one of a restricted number of PoS patterns: (N | AN | NN | NNN | ANN | XNN | R Vpastpart AXN)
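The two filters can be sketched over already tokenized and PoS-tagged noun phrases. Several things here are assumptions: the function names, the idea that determiners have been stripped before matching, and the treatment of the slide's tags as literal symbols. The last pattern on the slide (R Vpastpart AXN) is left out because its segmentation is not clear from the source.

```python
from typing import List, Sequence

# PoS patterns copied from the slide, one tag per token; "X" is kept as a
# literal tag because the slide does not say what it stands for.
ALLOWED_HYPER_PATTERNS = {
    ("N",), ("A", "N"), ("N", "N"), ("N", "N", "N"),
    ("A", "N", "N"), ("X", "N", "N"),
}


def keep_term_np(np_tokens: List[str], target_tokens: List[str],
                 sentence_initial: bool) -> bool:
    """Reject modified target terms and likely proper names."""
    if [t.lower() for t in np_tokens] != [t.lower() for t in target_tokens]:
        return False  # extra modifiers, e.g. "PIN waveguide photodetector"
    if not sentence_initial and any(t[:1].isupper() for t in np_tokens):
        return False  # capitalized mid-sentence: looks like a proper name
    return True


def keep_hyper_np(pos_tags: Sequence[str]) -> bool:
    """Accept the hypernym NP only if its tag sequence is an allowed pattern."""
    return tuple(pos_tags) in ALLOWED_HYPER_PATTERNS


modified = keep_term_np(["PIN", "waveguide", "photodetector"], ["photodetector"], False)
plain = keep_term_np(["photodetector"], ["photodetector"], False)
```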
Semantic Filtering • Keep only those HYPER-NPs that are compatible with the Optical Recording domain, by checking • whether the HYPER-NP is already a label in one of the known ontologies (SUMO, ORDO, AUTO-ORDO) • whether it is present in a WordNet synset with a WORDNET-DOMAINS label compatible with the Optical Recording domain • Candidate hypernyms for photodetector
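The two compatibility checks amount to a simple disjunction, sketched here with in-memory lookups. The data structures, the example labels and the set of compatible domains are illustrative assumptions, not the actual resources used by the system.

```python
from typing import Dict, List, Set


def is_domain_compatible(
    hyper_np: str,
    ontology_labels: Set[str],
    wn_domains: Dict[str, List[Set[str]]],
    compatible_domains: Set[str],
) -> bool:
    """Keep a candidate hypernym if either check from the slide succeeds."""
    label = hyper_np.lower()
    if label in ontology_labels:  # already a label in SUMO, ORDO or AUTO-ORDO
        return True
    # Otherwise at least one WordNet synset of the label must carry a
    # domain label compatible with the target domain.
    return any(domains & compatible_domains for domains in wn_domains.get(label, []))


# Toy data (hypothetical): one domain set per WordNet synset of each word.
labels = {"device"}
domains_by_word = {"sensor": [{"ELECTRONICS"}], "alternative": [{"FACTOTUM"}]}
compatible = {"ELECTRONICS", "OPTICS"}
kept = [c for c in ["device", "sensor", "alternative", "thing"]
        if is_domain_compatible(c, labels, domains_by_word, compatible)]
```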
Candidate Selection • Candidates are weighted according to • Frequency and reliability of the patterns in which the hypernym occurs • Variety of patterns • Membership in specific ontologies (manual ORDO, AUTO-ORDO or SUMO, in decreasing order of preference)
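The slides name the weighting criteria but not the formula, so the following is only one possible way to combine them. The additive form, the bonus values and the function name `score` are all hypothetical; only the preference order ORDO, AUTO-ORDO, SUMO comes from the slide.

```python
from typing import Dict, Optional

# Hypothetical bonus values; only their relative order is taken from the slide.
ONTOLOGY_BONUS: Dict[str, float] = {"ORDO": 3.0, "AUTO-ORDO": 2.0, "SUMO": 1.0}


def score(
    pattern_counts: Dict[str, int],
    reliability: Dict[str, float],
    ontology: Optional[str] = None,
) -> float:
    """Combine frequency x reliability, pattern variety and ontology membership."""
    evidence = sum(n * reliability.get(p, 0.0) for p, n in pattern_counts.items())
    variety = sum(1 for n in pattern_counts.values() if n > 0)
    return evidence + variety + ONTOLOGY_BONUS.get(ontology, 0.0)


counts = {"is a": 2, "such as": 1}
rel = {"is a": 0.5, "such as": 1.0}
best = score(counts, rel, "ORDO")
```

With these toy numbers the evidence term is 2 x 0.5 + 1 x 1.0 = 2.0, two distinct patterns add 2, and the ORDO bonus adds 3.0.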
Evaluating the Reliability of ISA Patterns • Assessment of the reliability of the patterns reported in the literature as predictors of the ISA relation • Around 80 templates • On three target terms: “groove”, “photodetector” and “magnetic head” • Google returned around 9,000 snippets • Only the snippets that passed lexico-syntactic filtering were manually evaluated (about 1,450) • Guideline: try to interpret the intentions of the author (does he/she really intend to say that X is a subclass of Y, beyond inappropriate phrasing, and even if you know that it is not true?) • The results of this evaluation are exploited in weighting the hypernym candidates
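One natural reading of pattern reliability is the fraction of manually judged snippets in which a pattern really expresses an ISA relation. The sketch below computes that ratio per pattern. The ratio definition is an assumption, since the slides do not spell out the measure, and the judgments shown are made-up toy data rather than results from the evaluation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pattern_reliability(judgments: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each pattern to correct / evaluated over manual judgments."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for pattern, is_isa in judgments:
        totals[pattern] += 1
        correct[pattern] += int(is_isa)
    return {p: correct[p] / totals[p] for p in totals}


# Toy judgments (illustrative only): (pattern, annotator says it is ISA).
toy = [("is a", True), ("is a", False), ("such as", True)]
reliability = pattern_reliability(toy)
```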
Evaluating the L-ISA accuracy • Measuring the accuracy of the L-ISA algorithm in finding the hypernym of a given domain concept • Test set: the 100 most frequent terms that we were not able to link to the ORDO ontology using the Pro-ISA learning strategy • Including “wrong” target terms (caused by errors of the linguistic processors, e.g. a past participle instead of a noun) • Accuracy: 78.6%
Future Work • Extend the evaluation set • Measure inter-coder agreement • Use Machine Learning to optimize the weights associated with templates