Lightly Supervised Learning of Text Normalization: Russian Number Names

Richard Sproat, OHSU & Google

Two components of text normalization
• Given a string of characters in a text, what is the (reasonable) set of possible actual words (or word sequences) that might correspond to it?
• Which of those is right for the particular context?
Sproat: Russian Number Names

A concrete example of finite-state methods in text normalization: digit to number name translation
• Factor the digit string:
  • 123 → 1 · 10^2 + 2 · 10^1 + 3
• Translate the factors into number names:
  • 10^2 → hundred
  • 2 · 10^1 → twenty
  • 1 · 10^1 + 3 → thirteen
• Languages vary in how extensive these lexicons are. Some (e.g. Chinese) have very regular (hence very simple) number name systems; others (e.g. Urdu/Hindi) have a large set of number names, with a name for almost every number from 1 to 100.
• Each of these steps can be accomplished with FSTs.
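The two steps on this slide can be sketched in plain Python. This is only an illustration for English numbers below 1,000: the slides implement both steps as finite-state transducers, whereas the function names `factor` and `name_below_1000` and the small lexicons here are assumptions made for the example.

```python
# Toy English lexicons (illustrative; not taken from the slides).
UNITS = {1: "one", 2: "two", 3: "three", 4: "four", 5: "five",
         6: "six", 7: "seven", 8: "eight", 9: "nine"}
TEENS = {10: "ten", 11: "eleven", 12: "twelve", 13: "thirteen",
         14: "fourteen", 15: "fifteen", 16: "sixteen",
         17: "seventeen", 18: "eighteen", 19: "nineteen"}
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty",
        6: "sixty", 7: "seventy", 8: "eighty", 9: "ninety"}


def factor(digits: str) -> list[tuple[int, int]]:
    """Step 1: factor a digit string into (digit, exponent) pairs.

    "123" becomes [(1, 2), (2, 1), (3, 0)], i.e. 1*10^2 + 2*10^1 + 3.
    Zero digits contribute nothing and are skipped.
    """
    n = len(digits)
    return [(int(d), n - 1 - i) for i, d in enumerate(digits) if d != "0"]


def name_below_1000(digits: str) -> str:
    """Step 2: translate the factors into number-name words."""
    if not (digits.isdigit() and 1 <= len(digits) <= 3):
        raise ValueError("expected one to three digits")
    by_exp = {exp: d for d, exp in factor(digits)}
    words = []
    if 2 in by_exp:
        words += [UNITS[by_exp[2]], "hundred"]
    tens, units = by_exp.get(1, 0), by_exp.get(0, 0)
    if tens == 1:
        # 1*10^1 + 3 is lexicalized as a single word: thirteen.
        words.append(TEENS[10 + units])
    else:
        if tens:
            words.append(TENS[tens])
        if units:
            words.append(UNITS[units])
    return " ".join(words)


print(factor("123"))
print(name_below_1000("123"))
```

The teens branch shows why the lexicon matters: the factors 1 · 10^1 and 3 have to be consumed together, which is the kind of language-specific irregularity the lexicon transducer has to encode.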

Digit string factoring transducer (fragment)

Russian number names
• Russian distinguishes
  • two numbers (singular, plural),
  • three genders (masculine, feminine, neuter),
  • six cases (nominative, accusative, genitive, dative, prepositional and instrumental).
• In general, numbers agree in gender with the nouns they modify.
  • Thus один город “one city” has “one” in the masculine nominative/accusative, but
  • одна собака “one dog” has “one” in the feminine;
  • likewise два города “two cities” versus
  • две собаки “two dogs”.
• In an oblique case, such as the prepositional, the numeral must agree with the noun in case:
  • в двух шагах “at two paces”.
• Complex numerals decline in their entirety:
  • к тремстам тридцати шести часам “to three hundred and thirty six hours” (dative case)
  • с пятью тысячами пятьюстами семьюдесятью четырьмя рублями “with five thousand five hundred and seventy four rubles” (instrumental case).
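The gender agreement described on this slide can be pictured as a lookup keyed by value and gender. The sketch below covers only the nominative forms of 1 and 2; the table layout and the name `nominative_numeral` are assumptions for illustration, and a real system would need all six cases.

```python
# Nominative forms only; oblique cases are deliberately left out of this toy table.
NOMINATIVE = {
    1: {"masc": "один", "fem": "одна", "neut": "одно"},
    2: {"masc": "два", "fem": "две", "neut": "два"},
}


def nominative_numeral(value: int, gender: str) -> str:
    """Pick the nominative numeral form that agrees with the noun's gender."""
    return NOMINATIVE[value][gender]


# город "city" is masculine, собака "dog" is feminine.
print(nominative_numeral(1, "masc"), "город")
print(nominative_numeral(2, "fem"), "собаки")
```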

Contextually appropriate renditions of 2, 3000, 25

Lightly-supervised procedure
• Provide a seed list L of all legal forms of single-word number terms.
• Mine web pages for sequences of terms from L, along with their contexts.
• Using a loose number-name grammar, filter the resulting list for combinations that fit the general properties expected of well-formed number names.
  • The grammar is implemented as a finite-state transducer, which accepts only reasonable-looking number names and maps them to their corresponding digit sequences.
• The result of the previous step is a large list of annotated digit-string/number-name pairs in context.
• We then use these data to train a model that produces a contextually appropriate number-name expansion for a given digit string.
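The mining step can be sketched as a scan for maximal runs of seed-list terms, keeping a few words of context on each side. This is a minimal illustration only: the seed set, the window size and the function name `mine` are assumptions, and the finite-state filter that follows in the procedure is not implemented here.

```python
import re

# Tiny illustrative seed list; the real list L holds every inflected form.
SEED = {"двадцать", "два", "две", "двух", "три", "пять", "сто", "тысяч"}


def mine(text: str, seed: set[str], window: int = 2):
    """Return (left context, number-term run, right context) triples.

    A run is a maximal sequence of consecutive tokens that all occur in the
    seed list. Whether the run is a well-formed number name is decided later
    by the grammar filter, not here.
    """
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    i = 0
    while i < len(tokens):
        if tokens[i] in seed:
            j = i
            while j < len(tokens) and tokens[j] in seed:
                j += 1
            left = tokens[max(0, i - window):i]
            right = tokens[j:j + window]
            hits.append((left, tokens[i:j], right))
            i = j
        else:
            i += 1
    return hits


print(mine("Он купил двадцать две книги вчера", SEED))
```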

Sample seed items
Ideally such examples could be mined from online grammars … in this work we entered these manually.

Finite-state filter
• Implements linguistic constraints on how number terms are combined into number names (Brandt Corstius, 1968; Hurford, 1975).
• Allows a set of possible factorizations. E.g. for 345,000:
  • Western: (3 × 10^2 + 4 × 10^1 + 5) × 10^3
  • East Asian: (3 × 10^1 + 4) × 10^4 + 5 × 10^3
  • South Asian: 3 × 10^5 + (4 × 10^1 + 5) × 10^3
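As a quick arithmetic check, the three factorizations of 345,000 listed above can be evaluated directly; the variable names are only labels for this example.

```python
# Each factorization groups the same value around a different base unit.
western = (3 * 10**2 + 4 * 10**1 + 5) * 10**3      # 345 thousands
east_asian = (3 * 10**1 + 4) * 10**4 + 5 * 10**3   # 34 ten-thousands + 5 thousands
south_asian = 3 * 10**5 + (4 * 10**1 + 5) * 10**3  # 3 hundred-thousands + 45 thousands

assert western == east_asian == south_asian == 345_000
print(western, east_asian, south_asian)
```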

Grammar overgenerates
• E.g., the following examples for “20”:
• Most of these will be eliminated, since they will not be found on the Web.

Mined number names in context

N-gram language model
• We selected from our web data 7.5 million examples of Russian number names in context, comprising 60 million words.
• From these data we constructed a trigram language model using Kneser-Ney smoothing.
• Two sets of test data:
  • Token-balanced: 1,000 examples randomly selected from a held-out portion of the web corpus.
    • двух can occur multiple times in this sample.
  • Type-balanced: exactly one instance of each number-name type, which resulted in a test set of 826 examples.
    • двух occurs just once.
• In both cases, the test set was processed by replacing the number names with their digit representation, and then using the number-name expansion FST described earlier to map back to all possible expansions of the number.
• This produces a lattice of possible number names, which is then scored with the language model.
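The scoring step can be illustrated by ranking candidate expansions in context with a language model. The sketch below deliberately swaps in a toy bigram model with add-one smoothing in place of the trigram model with Kneser-Ney smoothing described above, and a plain list of candidates in place of the lattice; the four-sentence corpus and all function names are assumptions for the example.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Collect history and bigram counts from whitespace-tokenized sentences."""
    history, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        vocab.update(toks[1:])
        history.update(toks[:-1])
        bigrams.update(zip(toks, toks[1:]))
    return history, bigrams, len(vocab)


def log_prob(sentence, model):
    """Add-one smoothed bigram log-probability (a stand-in for Kneser-Ney)."""
    history, bigrams, v = model
    toks = ["<s>"] + sentence.split() + ["</s>"]
    return sum(math.log((bigrams[(a, b)] + 1) / (history[a] + v))
               for a, b in zip(toks, toks[1:]))


def best_expansion(left, candidates, right, model):
    """Score each candidate in its context and return the most probable one."""
    return max(candidates, key=lambda c: log_prob(f"{left} {c} {right}", model))


corpus = ["в двух шагах от дома", "в двух словах",
          "два города рядом", "две собаки рядом"]
model = train_bigram(corpus)

# Candidate expansions of the digit "2" in the context "в _ шагах".
print(best_expansion("в", ["два", "две", "двух"], "шагах", model))
```

Even this toy model prefers двух after в, because the bigram в двух was observed in training while в два and в две were not.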

Results: token-balanced test
Number names, 2 or more words long
Number names, 5 or more words long
Expansions that are ill-formed in any context

Results: type-balanced test

Discriminative approaches for single-word number names
• Trained two discriminative methods:
  • Perceptron, using the SNoW toolkit (Carlson et al., 1999)
  • Decision lists (Yarowsky, 1996)
• Features:
  • The word to the immediate left/right of the number (L1, R1)
  • Each other word in the left/right context, tagged as being in the left/right context
  • The bigram to the immediate left of, spanning, and to the immediate right of the number
  • The two-character suffix of the word to the left, and of the word to the right, of the number
• Note that these are a superset of the features available to the n-gram language model.
• Test data were as many as 1,000 examples of each number.
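The feature set listed above might be extracted along the following lines. The feature-name prefixes, the function name `extract_features`, and the reading of the "spanning" bigram as the pair formed by the left and right neighbours of the number are all assumptions; the slides do not specify an encoding.

```python
def extract_features(left: list[str], right: list[str]) -> set[str]:
    """Build string-valued features from the tokens around a number.

    `left` and `right` are the context tokens before and after the number,
    in text order, so left[-1] is L1 and right[0] is R1.
    """
    feats = set()
    if left:
        feats.add(f"L1={left[-1]}")
        feats.add(f"LSUF={left[-1][-2:]}")   # two-character suffix of L1
    if right:
        feats.add(f"R1={right[0]}")
        feats.add(f"RSUF={right[0][-2:]}")   # two-character suffix of R1
    # Remaining context words, tagged only by side.
    feats.update(f"LCTX={w}" for w in left[:-1])
    feats.update(f"RCTX={w}" for w in right[1:])
    # Bigrams to the left of, spanning, and to the right of the number.
    if len(left) >= 2:
        feats.add(f"LBIGRAM={left[-2]}_{left[-1]}")
    if left and right:
        feats.add(f"SPAN={left[-1]}_{right[0]}")
    if len(right) >= 2:
        feats.add(f"RBIGRAM={right[0]}_{right[1]}")
    return feats


feats = extract_features(["он", "стоял", "в"], ["шагах", "от", "дома"])
print(sorted(feats))
```

The suffix features give the classifier a rough handle on the case ending of the neighbouring noun (here ах), which is what the numeral has to agree with.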

Comparison of methods

Comments
• In some cases the n-gram language model was shooting itself in the foot:
  • *десять четыре “ten four” for 14 (the correct form being четырнадцать)
• There is no clear advantage of the discriminative methods chosen.
• For the n-gram model, most of the errors involve number names that are well-formed but wrong in context, rather than ill-formed number names.
• The data we have worked with consist of number names that are written out as words, where the task was to reconstruct those words from a digit representation of the same numbers.
• But do people tend to write the same kinds of numbers as words as they write with digits?

Distribution of number names vs. digit strings

Further work
• Obviously other techniques should be tried…
• Other complexities in Russian normalization that relate to number names: how do you say “5%”?
• The finite-state filter is not general enough:
  • Literary Welsh: 99 → pedwar ar bymtheg a phedwar ugain (four on fifteen and four twenties)
