
Finding Domain Terms using Wikipedia

Jorge Vivaldi Palatresi, Applied Linguistics Institute, Universitat Pompeu Fabra (jorge.vivaldi@upf.edu). Horacio Rodríguez Hontoria, TALP Research Center, Universitat Politécnica de Catalunya (horacio@lsi.upc.es).



  1. Finding Domain Terms using Wikipedia Jorge Vivaldi Palatresi Applied Linguistics Institute Universitat Pompeu Fabra jorge.vivaldi@upf.edu Horacio Rodríguez Hontoria TALP Research Center Universitat Politécnica de Catalunya horacio@lsi.upc.es

  2. Outline • Introduction • Related approaches • Methodology • Evaluation • Conclusions and future work

  3. Introduction • Problem: to automatically extract terminological units from specialized texts • Result: a list of all the Wikipedia (WP) categories and page titles that our system considers to belong to the domain of interest.

  4. Related approaches • Magnini et al., 2000 • Montoyo et al., 2001 • Missikoff et al., 2002 • Vivaldi, Rodríguez, 2002 • Vivaldi, Rodríguez, 2004 • Bernardini et al., 2006 • Cui et al., 2008

  5. Graph structure of Wikipedia [Figure: graph of WP categories (A to G) and WP pages (P1 to P3), together with the redirection table, disambiguation pages, interwiki links, external links and InfoBox]

  6. Methodology: overview [Diagram labels: domain, WP top categories, categories, pages, domain categories filtering, domain pages filtering, bootstrapping, final domain term set] Main steps: 1) Find the domain name in WP as a category. 2) Look for all the subcategories and pages related to the domain. 3) Extract all descendants of the domain name, avoiding loops. 4) Remove proper names and service classes. 5) Filter categories and pages.
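Step 3 above (extracting all descendants while avoiding loops) can be illustrated with a minimal sketch. The slides do not show how the traversal is implemented, so everything here is an assumption: the toy graph, its category and page names, and the function `collect_descendants` are hypothetical, and a breadth-first walk with a visited set is just one common way to avoid cycles in a category graph.

```python
from collections import deque

# Hypothetical toy category graph: category -> (subcategories, pages).
# The names are illustrative only and are not taken from the slides.
TOY_GRAPH = {
    "Chemistry": (["Organic chemistry", "Chemists"], ["Molecule"]),
    "Organic chemistry": (["Chemistry"], ["Alkane"]),  # links back to the root
    "Chemists": ([], ["Marie Curie"]),
}


def collect_descendants(graph, top_category):
    """Breadth-first walk from top_category.

    The visited set guarantees each category is expanded once,
    so cycles in the category graph cannot cause an infinite loop.
    """
    visited = {top_category}
    pages = set()
    queue = deque([top_category])
    while queue:
        category = queue.popleft()
        subcategories, category_pages = graph.get(category, ([], []))
        pages.update(category_pages)
        for sub in subcategories:
            if sub not in visited:
                visited.add(sub)
                queue.append(sub)
    return visited, pages


categories, pages = collect_descendants(TOY_GRAPH, "Chemistry")
```

In this toy graph the page "Marie Curie" is still collected, which is the kind of proper name that step 4 is meant to remove afterwards.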

  7. Methodology: filtering Category level Page level

  8. Methodology: filtering (category level) [Diagram labels: top category of the domain, direct super-categories (CatSet1), direct neutral super-categories, category C, category score (CatSet1)]

  9. Methodology: filtering (page level) [Diagram labels: top category of the domain, neutral categories, categories (CatSet2), pages, page P, category C, page score (CatSet2)]

  10. Methodology: category filtering

  11. Methodology: page filtering • Additional category filtering using page scores: • catTerm: set of pages associated with a category • MicroStrict: accept cat if the number of elements of catTerm with positive scoring is greater than the number of elements with negative scoring • MicroLoose: idem with a greater-or-equal test • Macro: instead of counting the pages with positive/negative scoring, we use the components of such scores.
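The three acceptance tests above can be sketched as follows. This is a hedged illustration and not the authors' code. It assumes that each page score is a signed number, and, for Macro, that each score splits into a (positive, negative) pair of components whose sums are compared with a strict test; the slide does not define the components, so that reading is an assumption. All function names are hypothetical.

```python
def micro_strict(page_scores):
    """Accept if more pages score positive than negative."""
    positive = sum(1 for score in page_scores if score > 0)
    negative = sum(1 for score in page_scores if score < 0)
    return positive > negative


def micro_loose(page_scores):
    """Same as micro_strict, but ties are accepted."""
    positive = sum(1 for score in page_scores if score > 0)
    negative = sum(1 for score in page_scores if score < 0)
    return positive >= negative


def macro(page_components):
    """Compare summed score components instead of page counts.

    ASSUMPTION: each page contributes a (positive_part, negative_part)
    pair; the slide does not specify what the components are.
    """
    total_positive = sum(pos for pos, _ in page_components)
    total_negative = sum(neg for _, neg in page_components)
    return total_positive > total_negative


# Hypothetical catTerm scores for one category: two positive, two negative.
scores = [0.4, -0.1, -0.2, 0.3]
components = [(0.4, 0.0), (0.0, 0.1), (0.0, 0.2), (0.3, 0.0)]

strict_ok = micro_strict(scores)
loose_ok = micro_loose(scores)
macro_ok = macro(components)
```

With these made-up values the category is rejected by MicroStrict (a 2 to 2 tie), accepted by MicroLoose, and accepted by Macro because the positive components outweigh the negative ones.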

  12. Page filtering example: “semantics” (in Computing domain) [Diagram labels: theoretical computer science, Computing, semantics, software, software engineering, formal methods] semantics: {linguistics, philosophy of language, semiotics, theoretical computer science, philosophical logic} WPCD(semantics) = 0.25

  13. Category filtering example using pages score: “chemistry”

  14. Evaluation • Partial evaluation: “chemistry” and “astronomy” • Test against Magnini et al., 2000 (WordNet 1.6) • Low coverage: 25% for chemistry and 15% for astronomy • Full evaluation: “medicine” • Test against SNOMED-CT Spanish Edition (2009) • Wide coverage of the clinical domain: 800K terms
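Testing an extracted term list against a reference resource can be sketched as a set comparison. The slides do not state which measures were used or how terms were matched, so this is an assumption throughout: the sketch computes precision and recall over lowercased, whitespace-trimmed strings, and the term lists and the helper `evaluate` are made up for illustration.

```python
def normalize(term):
    """ASSUMPTION: matching is done on lowercased, trimmed strings."""
    return term.strip().lower()


def evaluate(extracted_terms, reference_terms):
    """Return (precision, recall) of extracted terms against a reference list."""
    extracted = {normalize(t) for t in extracted_terms}
    reference = {normalize(t) for t in reference_terms}
    matched = extracted & reference
    precision = len(matched) / len(extracted) if extracted else 0.0
    recall = len(matched) / len(reference) if reference else 0.0
    return precision, recall


# Made-up example data, not taken from the evaluation in the slides.
extracted = ["Aspirin", "insulin", "Madrid"]
reference = ["aspirin", "insulin", "fever", "cough"]

precision, recall = evaluate(extracted, reference)
```

When the reference resource itself covers only part of the domain, as reported for the partial evaluation, terms missing from the reference count against precision even if they are valid, so such figures should be read with that limitation in mind.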

  15. Partial evaluation

  16. Full evaluation

  17. Conclusions • Good results when evaluated against a specialised resource • Term list filtering must be improved (e.g. by eliminating proper names)

  18. Future work • Apply this method to other languages/domains • Improve filtering using the in/out links of selected pages • Improve filtering by also using the page content • Use this WP knowledge to improve a term extractor

  19. Finding Domain Terms using Wikipedia Jorge Vivaldi Palatresi Applied Linguistics Institute Universitat Pompeu Fabra jorge.vivaldi@upf.edu Horacio Rodríguez Hontoria TALP Research Center Universitat Politécnica de Catalunya horacio@lsi.upc.es
