
L-ISA Learning Domain Specific ISA relations from the WEB

L-ISA: Learning Domain Specific ISA relations from the WEB. Alessandra Potrich and Emanuele Pianta, Fondazione Bruno Kessler - IRST, Trento, Italy. LREC 2008, Marrakech, 31 May 2008.


Presentation Transcript


  1. L-ISA: Learning Domain Specific ISA relations from the WEB • Alessandra Potrich and Emanuele Pianta • Fondazione Bruno Kessler - IRST, Trento, Italy • LREC 2008, Marrakech, 31 May 2008

  2. Overview • Learning ISA relations in the patent processing domain (the PatExpert Project) • The L-ISA algorithm • Evaluation • Future Work

  3. Ontology Learning/Population • Ontology Learning: acquisition of new concepts and relations between them • e.g., a device is an artifact • Ontology Population: acquisition of factual knowledge about specific instances • e.g. Einstein is an instance of a scientist • e.g. Einstein was born in 1879

  4. PATExpert • Funded by the European Union • Aim: improving patent retrieval, summarization, paraphrasing, classification and valuing through shallow and deep semantic analysis • Main semantic analysis task: recognizing occurrences of KB concepts and relations • Proof of concept on two domains • Optical Recording • Machine Tools • Focus of the presentation: Ontology Learning in the Optical Recording domain

  5. Optical Recording Domain Ontology (ORDO) • Based on the OWL formalism • Built in three stages • 200 manually crafted concepts, starting from a list of the most frequent terms in a reference corpus • Pro-ISA: ontology learning algorithm based on projection of WordNet fragments onto ORDO • L-ISA: ontology learning algorithm based on acquisition of ISA templates from the Web

  6. Patent Concept Annotation • Given a target word: • Disambiguate it by assigning a WN synset whose domain is compatible with the optical recording domain (exploiting WORDNET-DOMAINS) • If the synset is linked to an ORDO concept, annotate the target word with the ORDO concept • Otherwise: apply Pro-ISA • Otherwise: apply L-ISA
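
The cascade on this slide can be sketched as a chain of fallbacks. This is only an illustration: the two lookup tables are toy stand-ins for WordNet/WORDNET-DOMAINS and ORDO, and the `pro_isa` and `l_isa` callables are hypothetical placeholders for the two learning algorithms, assumed to return a concept name or `None`.

```python
# Toy stand-ins (assumptions): word -> domain-compatible synset, synset -> ORDO concept.
SYNSET_FOR = {"CD": "compact_disk", "cross-talk": "crosstalk"}
ORDO_LINK = {"compact_disk": "ordo:cd"}


def annotate(word, pro_isa, l_isa):
    """Return a concept for `word`, trying direct link, then Pro-ISA, then L-ISA."""
    synset = SYNSET_FOR.get(word)
    if synset is not None:
        if synset in ORDO_LINK:
            return ORDO_LINK[synset]      # direct concept annotation
        concept = pro_isa(synset)         # synset exists but has no ORDO link
        if concept is not None:
            return concept
    # Word not in WordNet, or Pro-ISA found nothing: fall back to the Web.
    return l_isa(word)
```

For example, `annotate("CD", ...)` resolves directly to `ordo:cd`, while a word missing from `SYNSET_FOR` goes straight to the `l_isa` fallback.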

  7. Choosing the right sense • Senses for the word “CD”: 1. cadmium, Cd, atomic_number_48 (CHEMISTRY) 2. candle, candela, cd, standard candle (PHYSICS) 3. certificate of deposit, CD (MONEY) 4. compact disk, compact disc, CD (COMPUTER, MUSIC)
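
A minimal sketch of domain-based sense selection for the “CD” example. The sense list is taken from the slide; the set of domain labels treated as compatible with optical recording is an assumption made for illustration, as is the function name `choose_sense`.

```python
# Senses of "CD" with their WORDNET-DOMAINS labels, as listed on the slide.
SENSES_CD = [
    ("cadmium", {"CHEMISTRY"}),
    ("candle", {"PHYSICS"}),
    ("certificate_of_deposit", {"MONEY"}),
    ("compact_disk", {"COMPUTER", "MUSIC"}),
]

# Assumption: labels considered compatible with the optical recording domain.
COMPATIBLE = {"COMPUTER", "MUSIC"}


def choose_sense(senses, compatible):
    """Return the first sense sharing at least one domain label with `compatible`."""
    for synset, domains in senses:
        if domains & compatible:
            return synset
    return None  # no sense fits the domain


print(choose_sense(SENSES_CD, COMPATIBLE))
```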

  8. Direct Concept Annotation • Lemma: CD • Synset: {compact_disk, compact_disc, CD} • KB concept: ordo:cd

  9. Pro-ISA 1: Looking for a WN-to-ORDO link • {event} sumo:Process • {happening, occurrence, occurrent, natural_event} • {trouble} • {noise, interference, disturbance} • {crosstalk, XT} -{cross_talk, cross-talk, crosstalk_amount} • Lemma: cross-talk

  10. Pro-ISA 2: Projecting ISA chains (WN -> ORDO) • {event} sumo:Process • {happening, occurrence, occurrent, natural_event} auto_ordo:happening • {trouble} auto_ordo:trouble • {noise, interference, disturbance} auto_ordo:noise • {crosstalk, XT} -{cross_talk, cross-talk, crosstalk_amount} auto_ordo:crosstalk • Lemma: cross-talk
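
The projection step can be illustrated as follows: walk up the WordNet hypernym chain until a synset with an existing ontology link is found, then create an `auto_ordo:` concept for every synset below it. This is a sketch of the idea as the slide depicts it, not the authors' implementation; synsets are reduced to a single head word and `project_chain` is a hypothetical name.

```python
def project_chain(chain, links):
    """Project a hypernym chain onto the ontology.

    chain: synset head words, from the target synset upward.
    links: synset head word -> already known ontology concept.
    Returns (child, parent) ISA pairs, or None if no linked ancestor exists.
    """
    for i, synset in enumerate(chain):
        if synset in links:
            # Every synset below the linked ancestor becomes a new auto_ordo concept.
            concepts = ["auto_ordo:" + s for s in chain[:i]] + [links[synset]]
            return list(zip(concepts, concepts[1:]))
    return None


chain = ["crosstalk", "noise", "trouble", "happening", "event"]
for child, parent in project_chain(chain, {"event": "sumo:Process"}):
    print(child, "ISA", parent)
```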

  11. From Pro-ISA to L-ISA • In 15% of cases, the target word is not in WordNet, so Pro-ISA cannot be applied • Then try and exploit the WEB • Why not the patent corpus itself? • Isa relations are not frequent in restricted corpora • Patents often contain concept definitions with local scope • We don’t want idiosyncratic concept definitions, but common, shared definitions.

  12. Learning ISA relations from a corpus … • … by exploiting linguistic patterns expressing the ISA relation (Hearst, 1992; Hearst, 1998; Mititelu, 2006) • Many patterns have been presented in the literature, but • Few evaluations of pattern reliability (except Snow 2006) • Even fewer task-oriented evaluations in domain-specific, concrete application scenarios. • This paper: an attempt to provide both kinds of evaluation in a real-world, challenging scenario such as patent semantic analysis.

  13. Lexico-Syntactic Patterns • Patterns reported in the literature: NP1 isa-phrase NP2, where isa-phrase is a sequence of tokens and NP1, NP2 are syntactic noun phrases • In our case we are looking for the hypernym of a specific target term: • Term-NP isa-phrase Hyper-NP • Hyper-NP isa-phrase Term-NP

  14. L-ISA • Google (or any other web search engine) does not allow searching for lexico-syntactic patterns… • So, we proceed in three steps • Snippet acquisition from Google • Lexico-syntactic filtering • Semantic filtering

  15. L-ISA: Snippet Acquisition • Suppose we cannot link the term “photodetector” to any ORDO concept. • We want to exploit the following lexico-syntactic pattern: <TERM-NP> “is an” <HYPER-NP> • Submit the following query string to Google: “photodetector is an” • Keep the first 100 snippets (at most), e.g. “... upper frequencies, the PIN waveguide photodetector is an attractive device, since it is possible to reduce transit time without ..” • Convert the HTML snippets into plain text.
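
The query construction and snippet clean-up can be sketched with the Python standard library alone. The search call itself is left out, since the slides do not say how Google was queried; the function names and the regex-based tag stripping are assumptions chosen for brevity.

```python
import html
import re


def build_query(term, isa_phrase):
    """Build an exact-phrase query such as "photodetector is an" (quotes included)."""
    return f'"{term} {isa_phrase}"'


def snippet_to_text(snippet_html):
    """Strip tags, decode HTML entities and normalise whitespace."""
    text = re.sub(r"<[^>]+>", "", snippet_html)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def keep_first(snippets, limit=100):
    """Keep at most `limit` snippets, as on the slide."""
    return [snippet_to_text(s) for s in snippets[:limit]]


print(build_query("photodetector", "is an"))
print(snippet_to_text("the PIN waveguide <b>photodetector is an</b> attractive device"))
```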

  16. L-ISA: Lexico-syntactic Filtering • Annotate snippets with TextPro (PoS, lemma, chunk) • Recognize <Term-NP> isa-phrase <Hyper-NP> in the annotated snippets

  17. L-ISA: Lexico-syntactic Filtering • Filter out TERM-NP: • if the target term is modified (e.g. “PIN waveguide photodetector” above) • if it looks like a proper name (e.g. uppercase letter in the middle of a sentence). • Keep HYPER-NP: • only if it fits one of a restricted number of PoS patterns: (N | AN | NN | NNN | ANN | XNN | R Vpastpart AXN)
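
A rough sketch of the two filters, assuming the chunker has already delivered each noun phrase as a list of tokens (determiners removed) and each hypernym phrase as a list of one-letter PoS tags. The tag letters follow the slide, whose tag set is not explained further; the last pattern on the slide (R Vpastpart AXN) is left out because its exact reading is unclear. Function names are hypothetical.

```python
# One-letter PoS sequences accepted for HYPER-NP (subset of the slide's list).
ALLOWED_HYPER_PATTERNS = {"N", "AN", "NN", "NNN", "ANN", "XNN"}


def keep_hyper_np(tags):
    """Accept the hypernym phrase only if its tag sequence is an allowed pattern."""
    return "".join(tags) in ALLOWED_HYPER_PATTERNS


def keep_term_np(np_tokens, target, sentence_initial):
    """Reject a TERM-NP that is modified or that looks like a proper name."""
    # Modified term: the noun phrase contains more than the bare target term.
    if [t.lower() for t in np_tokens] != target.lower().split():
        return False
    # Proper-name heuristic: a capitalised token that is not sentence-initial.
    for i, tok in enumerate(np_tokens):
        if tok[:1].isupper() and not (sentence_initial and i == 0):
            return False
    return True


print(keep_term_np(["PIN", "waveguide", "photodetector"], "photodetector", False))
print(keep_hyper_np(["A", "N"]))
```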

  18. Semantic Filtering • Keep only those HYPER-NPs compatible with the Optical Recording domain, by checking • whether the HYPER-NP is already a label in one of the known ontologies (SUMO, ORDO, AUTO-ORDO) • whether it is present in a WordNet synset with a WORDNET-DOMAINS label compatible with the Optical Recording domain. • [Table: candidate hypernyms for photodetector]
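
The two checks can be sketched as a single predicate. All data below is invented for illustration: the ontology labels, the WordNet domain table and the set of compatible domains are assumptions, not values from the presentation.

```python
def semantically_compatible(hyper, ontology_labels, wn_domains, compatible):
    """Keep a candidate hypernym if it is a known ontology label or if WordNet
    assigns it a domain label compatible with the target domain."""
    if hyper in ontology_labels:
        return True
    return bool(wn_domains.get(hyper, set()) & compatible)


# Illustrative data only (assumed, not from the slides).
ontology_labels = {"device", "sensor"}
wn_domains = {"transducer": {"ELECTRONICS"}, "plant": {"BOTANY"}}
compatible = {"ELECTRONICS", "COMPUTER"}

candidates = ["device", "transducer", "plant", "thing"]
print([c for c in candidates
       if semantically_compatible(c, ontology_labels, wn_domains, compatible)])
```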

  19. Candidate Selection • Candidates are weighed according to • Frequency and Reliability of patterns where the hypernym occurs • Variety of patterns • Belonging to specific ontologies (manual ORDO, AUTO-ORDO or SUMO, in decreasing preference order)
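
One possible way to combine these criteria into a single score is shown below. The slides give no formula, so the additive form, the `variety_weight` parameter and the bonus values are all assumptions; only the preference order ORDO > AUTO-ORDO > SUMO comes from the slide.

```python
# Assumed bonus values; only their relative order is taken from the slide.
ONTOLOGY_BONUS = {"ORDO": 3.0, "AUTO-ORDO": 2.0, "SUMO": 1.0}


def score(occurrences, reliability, ontology=None, variety_weight=0.5):
    """Score one candidate hypernym.

    occurrences: one pattern name per snippet in which the candidate was found.
    reliability: pattern name -> reliability estimate in [0, 1].
    ontology:    ontology in which the candidate is already a label, if any.
    """
    # Frequency and reliability: each occurrence counts as much as its pattern is reliable.
    freq_rel = sum(reliability.get(p, 0.0) for p in occurrences)
    # Variety: number of distinct patterns supporting the candidate.
    variety = len(set(occurrences))
    return freq_rel + variety_weight * variety + ONTOLOGY_BONUS.get(ontology, 0.0)


reliability = {"is a": 0.5, "such as": 0.25}
print(score(["is a", "is a", "such as"], reliability, ontology="ORDO"))
```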

  20. Evaluating the Reliability of ISA Patterns • Assessment of the reliability of the patterns reported in the literature as predictors of the ISA relation • Around 80 templates • On three target terms: “groove”, “photodetector” and “magnetic head”. • Google returned around 9,000 snippets • Only snippets passing lexico-syntactic filtering were manually evaluated (about 1,450) • Guideline: try to interpret the intentions of the author (does he/she really intend to say that X is a subclass of Y, beyond inappropriate phrasing, and even if you know that it is not true?) • Results of this evaluation are exploited in weighting the hypernym candidates
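
One simple way to turn such manual judgements into per-pattern reliability estimates is the fraction of evaluated snippets judged correct. The slides do not state how reliability was actually computed, so this proportion and the name `pattern_reliability` are assumptions, and the judgements in the example are made up.

```python
from collections import defaultdict


def pattern_reliability(judgements):
    """judgements: iterable of (pattern, is_correct) pairs from manual evaluation.
    Returns pattern -> fraction of its evaluated snippets judged correct."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for pattern, ok in judgements:
        total[pattern] += 1
        if ok:
            correct[pattern] += 1
    return {p: correct[p] / total[p] for p in total}


# Made-up judgements for illustration.
sample = [("is a", True), ("is a", False), ("such as", True)]
print(pattern_reliability(sample))
```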

  21. Evaluating the L-ISA accuracy • Measuring the accuracy of the L-ISA algorithm in finding the hypernym of a given domain concept • Test set: the 100 most frequent terms that we were not able to link to the ORDO ontology using the Pro-ISA learning strategy • Including “wrong” target terms (caused by errors of the linguistic processors, e.g. a past participle instead of a noun) • Accuracy: 78.6%

  22. Future Work • Extend evaluation set • Inter-coder agreement • Use Machine Learning to optimize the weights associated with templates
