Finding Domain Terms using Wikipedia Jorge Vivaldi Palatresi Applied Linguistics Institute Universitat Pompeu Fabra jorge.vivaldi@upf.edu Horacio Rodríguez Hontoria TALP Research Center Universitat Politécnica de Catalunya horacio@lsi.upc.es
Outline • Introduction • Related approaches • Methodology • Evaluation • Conclusions and future work
Introduction • Problem: automatically extracting terminological units from specialized texts • Result: a list of all the WP categories and page titles that our system considers to belong to the domain of interest.
Related approaches • Magnini et al., 2000 • Montoyo et al., 2001 • Missikoff et al., 2002 • Vivaldi, Rodríguez, 2002 • Vivaldi, Rodríguez, 2004 • Bernardini et al., 2006 • Cui et al., 2008
Graph structure of Wikipedia [Figure: graph linking WP categories (A, B, C, D, E, F, G, …) and WP pages (P1, P2, P3, …). Other elements shown: redirection table, disambiguation pages, interwiki links, external links, InfoBox]
Methodology: overview [Diagram labels: domain, WP top categories, categories, pages, domain categories filtering, domain pages filtering, bootstrapping, final domain term set] Main steps: 1) Find the domain name as a category in WP. 2) Look for all the subcategories/pages related to the domain. 3) Extract all descendants of the domain name, avoiding loops. 4) Remove proper names and service classes. 5) Filter categories and pages.
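Step 3 (extracting all descendants while avoiding loops) can be sketched as a breadth-first walk with a visited set. This is only an illustration: the toy `subcats` and `pages` dictionaries and the function name `collect_descendants` are hypothetical and stand in for whatever access to the WP category graph the system actually uses.

```python
from collections import deque


def collect_descendants(top_category, subcats, pages):
    """Walk the category graph breadth-first from top_category.

    subcats maps a category to its direct subcategories;
    pages maps a category to the titles of its pages.
    The `seen` set guarantees each category is expanded once,
    so cycles in the category graph cannot cause an endless walk.
    """
    seen = {top_category}
    queue = deque([top_category])
    domain_pages = set()
    while queue:
        cat = queue.popleft()
        domain_pages.update(pages.get(cat, ()))
        for child in subcats.get(cat, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen, domain_pages


# Hypothetical toy graph containing a loop (Polymers -> Organic chemistry).
subcats = {
    "Chemistry": ["Organic chemistry", "Chemists"],
    "Organic chemistry": ["Polymers"],
    "Polymers": ["Organic chemistry"],
}
pages = {
    "Chemistry": ["Atom"],
    "Organic chemistry": ["Alkane"],
    "Polymers": ["Nylon"],
}

cats, pgs = collect_descendants("Chemistry", subcats, pages)
print(sorted(cats))
print(sorted(pgs))
```

The walk only gathers candidates; categories such as the hypothetical "Chemists" would still have to be handled by the later proper-name removal and filtering steps.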
Methodology: filtering • Category level • Page level
Methodology: filtering, category level [Diagram labels: Top Category of the Domain; category C; direct super-categories of C (CatSet1); direct neutral super-categories; Category Score of C computed from CatSet1]
Methodology: filtering, page level [Diagram labels: Top Category of the Domain; neutral categories; category C and its pages; page P; categories of P (CatSet2); Page Score of P computed from CatSet2]
Methodology: page filtering • Additional category filtering using page scores: • catTerm: set of pages associated with a category • MicroStrict: accept cat if the number of elements of catTerm with positive scoring is greater than the number of elements with negative scoring • MicroLoose: the same, with a greater-or-equal test • Macro: instead of counting the pages with positive/negative scoring, we use the components of such scores.
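The three voting schemes can be sketched as follows. The slide does not give the exact form of a page score, so this sketch assumes each page in catTerm carries a (positive, negative) pair of score components, that a page counts as positively scored when its positive component is larger, and that Macro compares the summed components with a strict test. The function name `accept_category` and the sample values are hypothetical.

```python
def accept_category(page_scores, mode="micro_strict"):
    """Decide whether to accept a category from the scores of its pages.

    page_scores: list of (positive, negative) components, one pair per
    page in catTerm (an assumed representation, not given in the slides).
    """
    if mode == "macro":
        # Assumed reading of Macro: compare the summed score components.
        total_pos = sum(p for p, _ in page_scores)
        total_neg = sum(n for _, n in page_scores)
        return total_pos > total_neg

    # Micro variants: count pages rather than summing components.
    n_pos = sum(1 for p, n in page_scores if p > n)
    n_neg = sum(1 for p, n in page_scores if p < n)
    if mode == "micro_strict":
        return n_pos > n_neg
    if mode == "micro_loose":
        return n_pos >= n_neg
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical catTerm: one strongly positive page, two mildly negative ones.
mixed = [(0.9, 0.1), (0.2, 0.3), (0.1, 0.2)]
# Hypothetical catTerm with one positive and one negative page (a tie).
tied = [(0.6, 0.4), (0.1, 0.9)]

for mode in ("micro_strict", "micro_loose", "macro"):
    print(mode, accept_category(mixed, mode), accept_category(tied, mode))
```

Under these assumptions the schemes can disagree: `mixed` is rejected by both Micro tests but accepted by Macro, while `tied` is accepted only by MicroLoose.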
Page filtering example: “semantics” (in the Computing domain) [Diagram labels: Computing, theoretical computer science, software, software engineering, formal methods, semantics] Categories of the page “semantics”: {linguistics, philosophy of language, semiotics, theoretical computer science, philosophical logic} WPCD(semantics) = 0.25
Category filtering example using page scores: “chemistry”
Evaluation • Partial evaluation on “chemistry” and “astronomy”: • Tested against Magnini et al., 2000 (WordNet 1.6) • Low coverage: 25% for Chemistry and 15% for Astronomy • Full evaluation on “Medicine”: • Tested against SNOMED-CT Spanish Edition (2009) • Wide coverage of the clinical domain: 800K terms
Conclusions • Good results when evaluated against a specialised resource • Term list filtering must be improved (e.g., by eliminating proper names)
Future work • Apply this method to other languages/domains • Improve filtering using the in/out links of selected pages • Improve filtering by also using the page content • Use this WP knowledge to improve a term extractor