
Presentation Transcript


  1. Categorizing Unknown Words: Using Decision Trees to Identify Names and Misspellings

  2. Janine Toole, Simon Fraser University, Burnaby, BC, Canada • From ANLP-NAACL Proceedings, April 29-May 4, 2000 (pp. 173-179)

  3. Goal: automatic categorization of unknown words • Unknown Words (UknWrds): word not contained in lexicon of NLP system • "unknown-ness" - property relative to NLP system

  4. Motivation • Degraded system performance in presence of unknown words • Disproportionate effect possible • Min (1996) - only 0.6% of words in 300 e-mails misspelled • Result - 12% of the sentences contained an error (discussed in Min and Wilson, 1998) • Difficulties translating live closed captions (CC) • 5 seconds to transcribe dialogue, no post-edit

  5. Reasons for unknown words • Proper name • Misspelling • Abbreviation or number • Morphological variant

  6. And my favorite... • Misspoken words • Examples (courtesy H. K. Longmore): • *I'll phall you on the cone (call, phone) • *I did a lot of hiking by mysummer this self (myself this summer)

  7. What to do? • Identify class of unknown word • Take action based on goals of system and class of word • Correct spelling • Expand abbr. • Convert number format

  8. Overall System Architecture • Multiple components, one per category • Return confidence measure (Elworthy, 1998) • Evaluate results from each component to determine category • One reason for approach: take advantage of existing research

  9. Simplified Version: Names & Spelling Errors • Decision tree architecture • combine multiple types of evidence about word • Results combined using weighted voting procedure • Evaluation: live CC data - replete with a wide variety of UknWrds

  10. Name Identifier • Proper names ==> proper name bucket • Others ==> discard • PN: person, place, or concept, typically requiring caps in English

  11. Problems • CC is ALL CAPS! • No confidence measure with existing PN Recognizers • Perhaps future PNRs will work?

  12. Solution • Build custom PNR

  13. Decision Trees • Highly explainable - readily understand features affecting analysis • Well suited for combining a variety of info. • Don't grow tree from seed - use IBM's Intelligent Miner suite • Ignore DT algorithm - point is application of DT

  14. Proper Names - Features • 10 features specified per UknWrd • POS and detailed POS of the UknWrd and of the words 2 before and 2 after • Rule-based system for detailed tags • In-house statistical parser for POS • Would include a feature indicating presence of initial upper case if the data had it
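A minimal sketch of how such a +/-2 POS-window feature vector might look; the tag labels and function name are hypothetical stand-ins, since the paper's in-house parser and rule-based tagger are not published:

```python
def pos_window_features(tags, i, pad="<PAD>"):
    """POS tags of the unknown word at index i and its +/-2 neighbours,
    padded at sentence boundaries."""
    feats = {}
    for offset in range(-2, 3):
        j = i + offset
        feats[f"pos_{offset:+d}"] = tags[j] if 0 <= j < len(tags) else pad
    return feats

# Hypothetical tag sequence for "THE NEW WEBPAGE IS LIVE",
# with the unknown word at index 2:
tags = ["DT", "JJ", "UNK", "VBZ", "JJ"]
print(pos_window_features(tags, 2))
# {'pos_-2': 'DT', 'pos_-1': 'JJ', 'pos_+0': 'UNK', 'pos_+1': 'VBZ', 'pos_+2': 'JJ'}
```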

  15. Misspellings • Unintended, orthographically incorrect representation • Relative to NLP system • 1 or more additions, deletions, substitutions, reversals, or punctuation changes

  16. Orthography • Word: orthography • or.thog.ra.phy \o.r-'tha:g-r*-fe-\ n • 1a: the art of writing words with the proper letters according to standard usage • 1b: the representation of the sounds of a language by written or printed symbols • 2: a part of language study that deals with letters and spelling

  17. Misspellings - Features • Derived from prior research (including own) • Abridged list of features used • Corpus freq., word length, edit distance, Ispell info, char seq. freq., Non-Engl. chars

  18. Misspellings Features (cont.) • Word length - (Agirre et al., 1998) • Predictions for correct spelling more accurate if |w| > 4

  19. Misspellings Features (cont.) • Edit distance • 1 edit operation == 1 substitution, addition, deletion, or reversal • 80% of errors w/in 1 edit distance of intended word • 70% w/in 1 edit distance of intended word • Unix spell checker: ispell • edit distance = distance from UknWrd to closest ispell suggestion, or 30 if ispell offers no suggestion
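The paper does not publish its distance implementation, but the four operations listed match a restricted Damerau-Levenshtein distance; a minimal sketch under that assumption:

```python
def edit_distance(a, b):
    """Restricted Damerau-Levenshtein: substitutions, additions, deletions,
    and adjacent-character reversals each count as one edit."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # addition
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # reversal
    return d[m][n]

print(edit_distance("temt", "tempt"))      # 1: one deletion
print(edit_distance("floyda", "florida"))  # 2 edits
```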

  20. Misspellings Features (cont.) • Char. seq. freq. • wful, rql, etc. • composite of individual char. seq. frequencies • relevance to 1 tree vs. many • Non-English chars - transmission noise in the CC case, or foreign names
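The exact composite the paper builds from individual sequence frequencies is not spelled out; one plausible form, scoring a word by its rarest character trigram, as a hedged sketch:

```python
from collections import Counter

def char_trigram_counts(corpus_words):
    """Count character trigrams over a corpus word list."""
    counts = Counter()
    for w in corpus_words:
        for i in range(len(w) - 2):
            counts[w[i:i + 3]] += 1
    return counts

def rarest_trigram_freq(word, counts):
    """Score a word by its least frequent trigram: misspellings and noise
    like 'sxetion' or 'fwlamg' tend to contain unattested sequences."""
    grams = [word[i:i + 3] for i in range(len(word) - 2)]
    return min((counts[g] for g in grams), default=0)

counts = char_trigram_counts(["section", "nation", "awful", "lawful"])
print(rarest_trigram_freq("section", counts))  # > 0: all trigrams attested
print(rarest_trigram_freq("sxetion", counts))  # 0: 'sxe' unseen
```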

  21. Decision Time • Misspelling module says it's not a misspelling, PNR says it's a name -> name • Both negative -> neither misspelling nor name • What if both are positive? • The one with the highest confidence measure wins • Confidence measure • per leaf, calculated from training data • correct predictions / total # of predictions at leaf
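A sketch of the decision logic just described; the module interface (each returning a yes/no plus a per-leaf confidence) is an assumed shape, not the paper's actual code:

```python
def categorize(word, name_module, misspell_module):
    """Each module returns (is_positive, confidence), where confidence is
    the fraction of correct training predictions at the decision-tree leaf
    the word lands in."""
    is_name, name_conf = name_module(word)
    is_missp, missp_conf = misspell_module(word)
    if is_name and is_missp:             # both claim the word: tie-break
        return "name" if name_conf >= missp_conf else "misspelling"
    if is_name:
        return "name"
    if is_missp:
        return "misspelling"
    return "other"                       # neither name nor misspelling
```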

  22. Evaluation - Dataset • 7000 cases of UknWrds • 2.6 million word corpus • Live business news captions • 70.4% manually ID'd as names • 21.3% as misspellings • Rest - other types of UknWrds

  23. Dataset (cont.) • 70% of Dataset randomly selected as training corpus • Remainder (2100) for test corpus • Test data - 10 samples, random selection with replacement • Total of 10 test datasets
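A small sketch of that resampling step, assuming each of the 10 test sets is drawn from the 2100 held-out cases (the exact sample size is not stated on the slide):

```python
import random

def make_test_sets(held_out, k=10, seed=0):
    """Draw k test sets by random selection with replacement from the
    held-out 30% of the data."""
    rng = random.Random(seed)
    return [rng.choices(held_out, k=len(held_out)) for _ in range(k)]
```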

  24. Evaluation - Training • Train a misspelling DT with misspelling features only • Train a misspelling DT with misspelling & name features • Train a name DT with name features only • Train a name DT with name & misspelling features

  25. Misspelling DT Results - Table 3 • Baseline - no recall • 1st decision tree - 73.8% recall • 2nd decision tree - increase in precision, decrease in recall by a similar amount • Name features not predictive for ID'ing misspellings in this domain • Not surprising - 8 of 10 features deal with information external to the word itself

  26. Misspelling DT failures • 2 classes of omissions • Misidentifications • Foreign words

  27. Omission type 1 • Words with typical characteristics of English words • Differ from intended word by addition or deletion of a syllable • creditability for credibility • coordinatored for coordinated • representives for representatives

  28. Omission type 2 • Words differing from intended word by deletion of a blank • webpage • crewmembers • rainshower

  29. Fixes • Fix for 2nd type • feature to specify whether the UknWrd can be split into 2 known words • Fix for 1st type more difficult • homophonic relationship • phonetic distance feature
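The blank-deletion fix is straightforward to sketch; the lexicon here is a toy stand-in for the system's real word list:

```python
def splits_into_known_words(word, lexicon):
    """Type-2 fix: does the unknown word become two known words when a
    deleted blank is restored (e.g. 'webpage' -> 'web' + 'page')?"""
    return any(word[:i] in lexicon and word[i:] in lexicon
               for i in range(1, len(word)))

lexicon = {"web", "page", "crew", "members", "rain", "shower"}
print(splits_into_known_words("webpage", lexicon))     # True
print(splits_into_known_words("rainshower", lexicon))  # True
print(splits_into_known_words("sxetion", lexicon))     # False
```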

  30. Name DT Results - Table 4 • 1st tree • precision is large improvement • recall is excellent • 2nd tree • increased recall & precision • unlike 2nd misspelling DT - why?

  31. Name DT failures • Not ID'd as a name - Names with determiners • the steelers, the pathfinder • Adept at individual people, places • trouble with names having similar distributions to common nouns

  32. Name DT failures (cont.) • Incorrectly ID'd as name • Unusual character sequences: sxetion, fwlamg • Misspelling identifier correctly ID's as misspellings • Decision-making component needs to resolve these

  33. Unknown Word Categorizer • Precision = # of correct misspelling or name categorizations / total # of times a word was identified as a misspelling or name • Recall = # of times the system correctly ID's a misspelling or name / # of misspellings and names existing in the data
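Those two definitions as a direct computation; the counts below are illustrative placeholders, not figures from the paper's tables:

```python
def precision_recall(correct, predicted, actual):
    """precision = correct categorizations / all name-or-misspelling predictions
    recall = correct categorizations / all names and misspellings in the data"""
    return correct / predicted, correct / actual

p, r = precision_recall(correct=1800, predicted=1900, actual=1926)  # made-up counts
print(f"precision = {p:.3f}, recall = {r:.3f}")
```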

  34. Confusion matrix of tie-breaker • Table 5 - good results • 5% of cases needed the confidence measure • In the majority of cases the decision-maker rules in favor of the name prediction

  35. Confusion matrix (cont.) • Name DT has better results, likely to have higher confidence measures • UknWrd ID'd as a name when it is a misspelling (37 cases) • Phonetic relation with intended word - temt/tempt, floyda/Florida

  36. Encouraging Results • Productive approach • Future focus • Improve existing components • features sensitive to distinction between names & misspellings • Develop components to ID remaining types • abbr., morph variants, etc. • Alternative decision-making process

  37. Portability • Few linguistic resources required • Corpus of new domain (language) • Spelling suggestions • ispell avail. for many languages • POS tagger

  38. Possible portability problems • Edit distance • assumes words consist of alphabetic chars having undergone subst/add/del • Less useful for Chinese, Japanese • General approach still transferable • consider means by which misspellings differ from intended words • identify features to capture differences

  39. Related Research • Assume all UknWrds are misspellings • Rely on capitalization • Expectations from scripts • Rely on world knowledge of situation • e.g. naval ship-to-shore messages

  40. Related Research (cont.) • (Baluja et al., 1999) - DT classifier to ID PNs in text • 3 features: word level, dictionary level, POS information • Highest F-score: 95.2% • slightly higher than the name module

  41. But... • Different tasks • ID all words & phrases that are PNs • vs. ID only those words which are UknWrds • Different data - case information • If word-level features (case) excluded, F-score of 79.7%

  42. Conclusion • UknWrd Categorizer to ID misspellings & names • Individual components, specializing in identifying a particular class of UknWrd • 2 Existing components use DTs • Encouraging results in a challenging domain (live CC transcripts)!
