Categorizing Unknown Words: Using Decision Trees to Identify Names and Misspellings
Janine Toole, Simon Fraser University, Burnaby, BC, Canada • From ANLP-NAACL Proceedings, April 29-May 4, 2000 (pp. 173-179)
Goal: automatic categorization of unknown words • Unknown Words (UknWrds): word not contained in lexicon of NLP system • "unknown-ness" - property relative to NLP system
Motivation • Degraded system performance in presence of unknown words • Disproportionate effect possible • Min (1996) - only 0.6% of words in 300 e-mails misspelled • Result - 12% of the sentences contained an error (discussed in (Min and Wilson, 1998)). • Difficulties translating live closed captions (CC) • 5 seconds to transcribe dialogue, no post-edit
Reasons for unknown words • Proper name • Misspelling • Abbreviation or number • Morphological variant
And my favorite... • Misspoken words • Examples (courtesy H. K. Longmore): • *I'll phall you on the cone (call, phone) • *I did a lot of hiking by mysummer this self (myself this summer)
What to do? • Identify class of unknown word • Take action based on goals of system and class of word • Correct spelling • Expand abbr. • Convert number format
Overall System Architecture • Multiple components, one per category • Return confidence measure (Elworthy, 1998) • Evaluate results from each component to determine category • One reason for approach: take advantage of existing research
Simplified Version: Names & Spelling Errors • Decision tree architecture • combine multiple types of evidence about the word • Results combined using weighted voting procedure • Evaluation: live CC data - replete with a wide variety of UknWrds
Name Identifier • Proper names ==> proper name bucket • Others ==> discard • PN: person, place, or concept, typically requiring capitalization in English
Problems • CC is ALL CAPS! • No confidence measure with existing PN Recognizers • Perhaps future PNRs will work?
Solution • Build custom PNR
Decision Trees • Highly explainable - readily understand features affecting analysis • Well suited for combining a variety of info. • Don't grow tree from seed - use IBM's Intelligent Miner suite • Ignore DT algorithm - point is application of DT
Proper Names - Features • 10 features specified per UknWrd • POS and detailed POS of the UknWrd and of the 2 words on either side • Rule-based system for detailed tags • in-house statistical parser for POS • Would include a feature indicating presence of initial upper case if the data had it
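The windowed tagging features above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature names, the pad tag, and the `(coarse, detailed)` tag-pair representation are assumptions for the sake of the example.

```python
def context_pos_features(tags, i, pad="BOS/EOS"):
    """Collect the 10 name-DT features: coarse and detailed POS of
    the unknown word and of the two words on either side.
    `tags` is a list of (coarse_pos, detailed_pos) pairs for the
    sentence; `i` indexes the unknown word.  Positions that fall
    off the sentence edge receive a pad tag (an assumption here)."""
    feats = {}
    for offset in range(-2, 3):
        j = i + offset
        coarse, detailed = tags[j] if 0 <= j < len(tags) else (pad, pad)
        feats[f"pos_{offset:+d}"] = coarse
        feats[f"dpos_{offset:+d}"] = detailed
    return feats
```

Five window positions times two tag granularities yields the 10 features per unknown word mentioned on the slide.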
Misspellings • Unintended, orthographically incorrect representation • Relative to NLP system • 1 or more additions, deletions, substitutions, reversals, punctuation
Orthography • Word: orthography • or.thog.ra.phy \o.r-'tha:g-r*-fe-\ n • 1a: the art of writing words with the proper letters according to standard usage • 1b: the representation of the sounds of a language by written or printed symbols • 2: a part of language study that deals with letters and spelling
Misspellings - Features • Derived from prior research (including own) • Abridged list of features used • Corpus freq., word length, edit distance, Ispell info, char seq. freq., Non-Engl. chars
Misspellings Features (cont.) • Word length - (Agirre et al., 1998) • Predictions for the correct spelling are more accurate if |w| > 4
Misspellings Features (cont.) • Edit distance • 1 edit distance == 1 substitution, addition, deletion, or reversal • Prior studies report 70-80% of misspellings fall within 1 edit distance of the intended word • Unix spell checker: ispell • edit distance feature = distance from UnkWrd to closest ispell suggestion, or 30 if ispell offers no suggestion
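The edit operations listed above (substitution, addition, deletion, reversal) correspond to Damerau-style edit distance. A minimal dynamic-programming sketch, not taken from the paper:

```python
def edit_distance(a, b):
    """Damerau-style edit distance: substitutions, additions,
    deletions, and reversals (adjacent transpositions) cost 1 each."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of a's prefix
    for j in range(n + 1):
        d[0][j] = j                      # insert all of b's prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # addition
                          d[i - 1][j - 1] + cost) # substitution
            # reversal of two adjacent characters
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[m][n]
```

On the slide's later examples: `edit_distance("temt", "tempt")` is 1 (one addition), while `edit_distance("floyda", "florida")` is 2.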
Misspellings Features (cont.) • Char. Seq. Freq. • wful, rql, etc. • composite of individual char. seq. • relevance to 1 tree vs. many • Non-English - Transmission noise in CC case, or Foreign names
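One way the composite character-sequence feature could be realized is as the frequency of the word's rarest character trigram in a reference corpus; sequences like "rql" score near zero. This is a sketch under that assumption, not the paper's exact formulation:

```python
def min_trigram_freq(word, trigram_freq):
    """Composite char-sequence feature: the corpus frequency of the
    word's rarest character trigram.  Unseen trigrams count as 0,
    so noise like 'sxetion' or 'fwlamg' scores very low."""
    grams = [word[i:i + 3] for i in range(len(word) - 2)]
    return min((trigram_freq.get(g, 0) for g in grams), default=0)
```

Folding the individual sequence counts into one composite value keeps the feature usable in a single tree, rather than requiring one tree per character sequence.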
Decision Time • Misspelling module says it's not a misspelling, PNR says it's a name -> name • Both negative -> neither misspelling nor name • What if both are positive? • One with highest confidence measure wins • Confidence measure • per leaf, calculated from training data • correct predictions / total # of predictions at leaf
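The decision logic above is simple enough to sketch directly. Function names and the `"other"` label are illustrative, but the branching and the per-leaf confidence ratio follow the slide:

```python
def leaf_confidence(correct, total):
    """Per-leaf confidence: correct predictions at the leaf divided
    by total predictions made at that leaf on the training data."""
    return correct / total

def categorize(name_pred, name_conf, misp_pred, misp_conf):
    """Combine the two detectors' outputs.  The confidence measures
    are only consulted when both modules claim the word."""
    if name_pred and misp_pred:
        return "name" if name_conf >= misp_conf else "misspelling"
    if name_pred:
        return "name"
    if misp_pred:
        return "misspelling"
    return "other"   # neither misspelling nor name
```

The >= in the tie-break is an assumption; the paper only says the higher confidence wins.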
Evaluation - Dataset • 7000 cases of UnkWrds • 2.6 million word corpus • Live business news captions • 70.4% manually ID'd as names • 21.3% as misspellings • Rest - other types of UnkWrds
Dataset (cont.) • 70% of Dataset randomly selected as training corpus • Remainder (2100) for test corpus • Test data - 10 samples, random selection with replacement • Total of 10 test datasets
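The test-set construction above (10 samples drawn with replacement from the held-out cases) is a bootstrap-style resampling; a minimal sketch, with the sample size and seeding as assumptions:

```python
import random

def make_test_samples(test_corpus, k=10, seed=0):
    """Draw k test datasets by random selection with replacement
    from the held-out test corpus.  Each sample is assumed here to
    be the same size as the test corpus itself."""
    rng = random.Random(seed)
    n = len(test_corpus)
    return [[test_corpus[rng.randrange(n)] for _ in range(n)]
            for _ in range(k)]
```

Evaluating on several resampled datasets gives a spread of scores rather than a single point estimate.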
Evaluation - Training • Train a misspelling DT on misspelling features only • Train a misspelling DT on misspelling + name features • Train a name DT on name features only • Train a name DT on name + misspelling features
Misspelling DT Results - Table 3 • baseline - no recall • 1st decision tree - 73.8% recall • 2nd decision tree - increase in precision, decrease in recall by a similar amount • name features not predictive for ID'ing misspellings in this domain • not surprising - 8 of 10 name features deal with information external to the word itself
Misspelling DT failures • 2 classes of omissions • Misidentifications • Foreign words
Omission type 1 • Words with typical characteristics of English words • Differ from intended word by addition or deletion of a syllable • creditability for credibility • coordinatored for coordinated • representives for representatives
Omission type 2 • Words differing from intended word by deletion of a blank • webpage • crewmembers • rainshower
Fixes • Fix for 2nd type • feature to specify whether UnkWrd can be split into 2 known words • Fix for 1st type more difficult • homophonic relationship • phonetic distance feature
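The proposed blank-deletion feature (can the unknown word be split into two known words?) is straightforward to sketch. The lexicon here is a toy set invented for illustration:

```python
def splits_into_known(word, lexicon):
    """Proposed feature for blank-deletion misspellings: True if the
    unknown word can be split at some position into two words that
    are both in the lexicon."""
    return any(word[:i] in lexicon and word[i:] in lexicon
               for i in range(1, len(word)))

# Toy lexicon for illustration only
lexicon = {"rain", "shower", "web", "page", "crew", "members"}
```

This fires on the slide's examples ("rainshower", "webpage", "crewmembers") but not on syllable-level errors like "creditability", which would need the phonetic-distance feature instead.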
Name DT Results - Table 4 • 1st tree • precision is large improvement • recall is excellent • 2nd tree • increased recall & precision • unlike 2nd misspelling DT - why?
Name DT failures • Not ID'd as a name - Names with determiners • the steelers, the pathfinder • Adept at individual people, places • trouble with names having similar distributions to common nouns
Name DT failures (cont.) • Incorrectly ID'd as name • Unusual character sequences: sxetion, fwlamg • Misspelling identifier correctly ID's as misspellings • Decision-making component needs to resolve these
Unknown Word Categorizer • Precision = # of correct misspelling or name categorizations / total number of times a word was identified as misspelling or name • Recall = # of times system correctly ID's misspelling or name / # of misspellings and names existing in data
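The two definitions above translate directly into code. The dict-based representation of system output and gold labels is an assumption for the sketch:

```python
def precision_recall(predicted, gold):
    """predicted: {word: class} only for words the categorizer
    labeled as 'name' or 'misspelling'.
    gold: {word: true_class} for every unknown word.
    Precision = correct name/misspelling labels / labels issued.
    Recall    = correct labels / names + misspellings in the data."""
    correct = sum(1 for w, c in predicted.items() if gold.get(w) == c)
    relevant = sum(1 for c in gold.values()
                   if c in ("name", "misspelling"))
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / relevant if relevant else 0.0
    return precision, recall
```

Note that a name mislabeled as a misspelling hurts both measures: it is an incorrect categorization (precision) and a missed name (recall).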
Confusion matrix of tie-breaker • Table 5 - good results • 5% of cases needed confidence measure • Majority of cases decision-maker rules in favor of name prediction
Confusion matrix (cont.) • Name DT has better results, so likely to have higher confidence measures • UknWrd labeled as name when it is a misspelling (37 cases) • Phonetic relation with intended word - temt, tempt; floyda, Florida
Encouraging Results • Productive approach • Future focus • Improve existing components • features sensitive to distinction between names & misspellings • Develop components to ID remaining types • abbr., morph variants, etc. • Alternative decision-making process
Portability • Few linguistic resources required • Corpus of new domain (language) • Spelling suggestions • ispell available for many languages • POS tagger
Possible portability problems • Edit distance • Words consist of alphabetic chars. having undergone subst/add/del • Less useful for Chinese, Japanese • General approach still transferable • consider means by which misspellings differ from intended words • identify features to capture differences
Related Research • Assume all UknWrds are misspellings • Rely on capitalization • Expectations from scripts • Rely on world knowledge of situation • e.g. naval ship-to-shore messages
Related Research (cont.) • (Baluja et al., 1999) - DT classifier to ID PNs in text • 3 features: word level, dictionary level, POS information • Highest F-score: 95.2% • slightly higher than name module
But... • Different tasks • ID all words & phrases that are PNs • vs. ID only those words which are UknWrds • Different data - case information • If word-level features (case) are excluded, F-score of 79.7%
Conclusion • UknWrd Categorizer to ID misspellings & names • Individual components, specializing in identifying a particular class of UknWrd • 2 Existing components use DTs • Encouraging results in a challenging domain (live CC transcripts)!