1 / 24

Unsupervised multi–word term recognition in Welsh

Learn about multi-word term recognition in Welsh, linguistic filtering, cost criteria, term variation, and how FlexiTerm provides flexibile term recognition and normalization in Welsh text.

aronv
Download Presentation

Unsupervised multi–word term recognition in Welsh

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Unsupervised multi–word term recognition in Welsh Prof. Irena Spasićspasici@cardiff.ac.uk

  2. Terms • What are terms? • means of conveying scientific & technical information • linguistic representations of domain-specific concepts • e.g. tablet

  3. Multi–word terms • computer science recurrent neural network (RNN) • mathematics dot product • biology stem cell • chemistry fatty acid • medicine chronic obstructive pulmonary disease (COPD) • law reasonable doubt • economics quasi-autonomous non-government organisation (QUANGO) • intelligence weapon of mass distraction (WMD)

  4. Collocation • combination of words that co-occur more often than would be expected by chance

  5. Problems • potentially unlimited number of domains • dynamic nature of some domains • computer science: generative adversarial network • medicine: swine flu • dictionaries are not always up to date • user–generated content such as blogs, where lay users use non–standard terminology • medicine: full knee replacement  total knee replacement (TKR) • dictionaries are not always suitable

  6. Alternatives • automatic term recognition (ATR) • recognising terms in text without a dictionary • potentially distinctive properties • syntactic structure • frequency distribution • approaches • tagging/parsing + pattern matching • counting

  7. Linguistic filtering (Justeson & Katz, 1995) • preferred phrase structures • terms are mostly noun phrasescontaining adjectives, nouns, possessives and prepositions • ( JJ | NN )+ NN • e.g. mean/NN squared/JJ error/NN • ( NN | JJ )* NN POS ( NN | JJ )* NN • e.g. Zipf/NN 's/POS law/NN • ( NN | JJ )* NN IN ( NN | JJ )* NN • e.g. law/NN of/IN large/JJ numbers/NN

  8. Cost criteria (Kita et al, 1994) • collocations are recurrent word sequences • recurrence is captured by the absolute frequency • a simple absolute frequency approach does notwork! • frequency(sub-sequence) > frequency(sequence) • e.g. f('in spite')  f('in spite of') • cost: K() = (||  1)  (f() f()) • ,  ... word sequences,  = uv • || ... length (number of words in ) • f() ... frequency of 

  9. Multi–word term recognition • hybrid solution • linguistic filters are used to extract candidate terms • ... which are then ranked using cost–like criteria • C-value (Frantzi & Ananiadou, 1999; Nenadić, Spasić & Ananiadou, 2002) • e.g. anterior cruciate ligament, posterior cruciate ligament • the method favours longer, more frequently and independently occurring term candidates

  10. Term variation • C–value works well when terms are used consistently, i.e. when they do not vary in structure and content • however, terms may vary: • orthographic variation, e.g. posterolateral corner vs. postero–lateral corner vs. postero lateral corner • morphological variationinflection, e.g. lateral meniscus vs. lateral menisci derivation, e.g. meniscus tear vs. meniscal tear • syntactic variation, e.g. stone in kidney vs. kidney stone

  11. FlexiTerm:Flexible term recognition

  12. Method overview • FlexiTerm is an open-source, stand-alone application for automatic term recognition • similarly to C–value, FlexiTerm performs term recognition in two stages: • lexico–syntactic filters are used to select term candidates • term candidates are scored using a formula that estimates their collocational stability • major difference: the flexibility with which term candidates are compared in order to neutralise syntactic, morphological & orthographic variation

  13. Normalisation • in order to neutralise variation, all term candidates are normalised • treat each term candidate as a bag of words • remove punctuation (e.g. ' in possessives), numbers and stop words including prepositions (e.g. of) • remove any lowercase tokens with 2 characters(e.g. Baker's cyst vs. vitamin D) • stem each remaining token hypoxia at rest  {hypoxia, rest} resting hypoxia • add similar tokens to the bag of words (cont.)

  14. Token similarity • many types of morphological variation are effectively neutralised with stemming • e.g. transplant & transplantation will be reduced to the same stem • exact string matching will not link orthographic variants • e.g. haemorrhage & hemorrhage are stemmed to haemorrhag & hemorrhag respectively • easily identified using lexical similarity (edit distance) • phonetic similarity is also important in dealing with new phenomena such as SMS language, e.g. l8 ~ late

  15. Syntactic variation • termhood formula: • term candidate: order does not matter! solves the problem of syntactic variation!

  16. FlexiTerm in Welsh

  17. Title

  18. Evaluation • parallel corpora • languages: Welsh, English • domains: education, politics, health • size: 100 documents per corpus • source: Welsh Government web site • silver standard • compare Welsh output to English • manually map each term in Welsh to its equivalent in English and vice versa

  19. Evaluation • precision: for each Welsh term, check whether its equivalent appeared in the English output • recall: for each English term, check whether its equivalent appeared in the Welsh output • kappa: we compared the differences in ranking using weighted kappa coefficient

  20. Further insights • term candidate selection • rheoliad/NN y/DTcyngor/NN (council regulation) • data/NNbiometrig/? (biometric data) • term candidate normalisation • better performance of the Welsh lemmatiser required • stemmer needed • termhood calculation • ansawdd1 gofal2 iechyd3 quality1of healthcare2

  21. http://www.corcencc.org/

  22. Thank you! Questions?

  23. Title

More Related