Building a Tagged Corpus of Russian: A Bazaar Approach

Chris Tessone
Modern Languages Department, Knox College
ctessone@knox.edu
Committee: Charles Mills; Don Blaheta; Jay Krumbholz; Steven Clancy, U. of Chicago

Corpora of Natural Language Texts
• Applications:
  • Linguistics: empirical word distribution data
  • Education: morphological help with new words
• Corpora now available in many languages:
  • Czech National Corpus (100,000,000 words)
  • NEGRA (350,000 words)
  • Kyoto University Corpus (1,000,000 words)

Problems with Traditional Corpus Development
• Annotating data requires many man-hours of skilled labor
• Resulting corpora are for non-commercial use only
• Features may not match community expectations
• Ethical problems with correcting errors

The Bazaar Model
• First used in software engineering
  • Fetchmail
  • Linux kernel
• Users develop the features they need
• Healthy projects require a critical mass of volunteers

Advantages of the Bazaar Model for Corpus Development
• Corpora can be developed quickly and inexpensively
• Most important tools and data emerge first
• Open licensing gives corporations incentive to help
  • IBM, SGI, and the Linux kernel
  • Netscape and Mozilla

A Tagged Corpus of Russian
• Texts taken from Russian LiveJournals
  • Varying registers
  • International community
• Software licensed under GPL, data under Creative Commons
• First such corpus of Russian texts in the world

XML Basics
• Similar to HTML
• Each unit of data is marked by start and end tags
• Start tags can include attributes
• Wide range of XML-aware software is available
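To make the points above concrete, here is a minimal sketch of reading token-level markup with Python's standard library. The element and attribute names (`s`, `w`, `pos`) and the tag labels are assumptions chosen for illustration; the slides do not show the corpus's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: <s> for a sentence, <w> for a word, "pos" for its tag.
# These names are illustrative assumptions, not the corpus's real schema.
SAMPLE = '<s id="1"><w pos="ADV">segodnja</w><w pos="V">idu</w></s>'

def read_tokens(xml_text):
    """Return (word, pos) pairs for every <w> element in the fragment."""
    root = ET.fromstring(xml_text)
    return [(w.text, w.get("pos")) for w in root.iter("w")]

print(read_tokens(SAMPLE))
```

Each `<w>` element is one unit of data with a start tag, an end tag, and one attribute, which is all a generic XML parser needs in order to recover the annotation.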

The Annotation Process
• Removal of HTML markup
• Sentence boundary annotation
• Tokenization
• Part of speech tagging
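The first step, removal of HTML markup, could be sketched as follows using Python's standard `html.parser` module. The class name `TextExtractor`, the helper `strip_html`, and the choice to turn `<br>` and `<p>` into line breaks are assumptions for illustration; the slides do not describe how the project's own tool handles these cases.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text content and drops all tags (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Assumption: <br> and <p> become line breaks so paragraphs stay apart.
        if tag in ("br", "p"):
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(html_text):
    """Return the plain text of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.parts).strip()

print(strip_html("<p>Privet, <b>mir</b>!<br>Kak dela?</p>"))
```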

Example Post: After HTML Removal

Sentence Boundary Detection
• Place sentence boundaries where . ? or ! precedes a capital letter
• Also place boundaries where an emoticon precedes a capital letter
• Disqualify a sentence boundary if the period is part of certain abbreviations
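The three rules above can be sketched with a regular expression. The abbreviation list and the emoticon pattern below are small hypothetical stand-ins (the slides do not give the actual inventories), and the examples are transliterated to match the slides.

```python
import re

# Assumed, illustrative inventories; the real lists are not given in the slides.
ABBREVIATIONS = {"t.e.", "t.d.", "ul."}
EMOTICON = r"[:;]-?[()DP]"

# Candidate boundary: . ? ! or an emoticon, followed by whitespace and an
# uppercase letter (Latin or Cyrillic).
BOUNDARY = re.compile(r"(?:[.?!]|" + EMOTICON + r")(?=\s+[A-ZА-ЯЁ])")

def split_sentences(text):
    """Split text into sentences using the capital-letter heuristic."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        end = match.end()
        # The whitespace-delimited word that ends at the candidate boundary.
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, not a sentence end
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("Ja zhivu na ul. Lenina. Eto daleko :) Poka!"))
```

In this sketch the period after `ul.` is followed by a capital letter but is rejected by the abbreviation check, while the emoticon before `Poka!` is accepted as a boundary.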

Example Post: After Sentence Boundary Detection

Tokenization
• Any string surrounded by whitespace is tentatively considered a token
• Most punctuation is also separated into its own token
• Periods in abbreviations are not separated
• Emoticons and ellipses are each considered a single token
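A minimal sketch of these tokenization rules follows. The abbreviation set and emoticon pattern are hypothetical, and keeping hyphenated words together is an additional assumption that the slides do not address.

```python
import re

# Assumed, illustrative abbreviation list (transliterated).
ABBREVIATIONS = {"t.e.", "t.d.", "ul."}

TOKEN = re.compile(
    r"[:;]-?[()DP]"      # emoticons (assumed inventory) stay whole
    r"|\.{2,}"           # an ellipsis is a single token
    r"|\w+(?:-\w+)*"     # words; hyphenated words kept together (assumption)
    r"|[^\w\s]"          # any other punctuation mark becomes its own token
)

def tokenize(sentence):
    """Split one sentence into tokens, starting from whitespace chunks."""
    tokens = []
    for chunk in sentence.split():
        if chunk.lower() in ABBREVIATIONS:
            tokens.append(chunk)  # keep the abbreviation's periods attached
        else:
            tokens.extend(TOKEN.findall(chunk))
    return tokens

print(tokenize("Nu... ja zhivu na ul. Lenina, t.e. daleko :)"))
```

One limitation of this sketch: an abbreviation directly followed by other punctuation (for example `ul.,`) is not recognized, because the lookup is done on the whole whitespace-delimited chunk.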

Example Post: After Tokenization

Part of Speech Tagging
• Many words are ambiguous, even in Russian
  • segodnja ("today")
  • chto ("what" / "that")
• Part of speech depends on surrounding words
• Suffix probabilities help in tagging previously unseen words
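One plausible way to use suffix probabilities for unseen words is sketched below: count how often each tag occurs with each word-final suffix in the training data, then back off from the longest matching suffix to shorter ones. This is an assumption about what a suffix function might look like, not the method behind the reported results; the function names, the `max_len` parameter, and the tag labels are all hypothetical.

```python
from collections import Counter, defaultdict

def train_suffix_model(tagged_words, max_len=3):
    """Count how often each tag occurs with each word-final suffix."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        w = word.lower()
        for n in range(1, min(max_len, len(w)) + 1):
            counts[w[-n:]][tag] += 1
    return counts

def guess_tag_probs(word, counts, max_len=3):
    """Estimate P(tag | suffix), backing off from long suffixes to short ones."""
    w = word.lower()
    for n in range(min(max_len, len(w)), 0, -1):
        seen = counts.get(w[-n:])
        if seen:
            total = sum(seen.values())
            return {tag: c / total for tag, c in seen.items()}
    return {}  # no suffix of this word was seen in training

# Toy training data with invented tag labels (V = verb, N = noun).
training = [("chitaet", "V"), ("delaet", "V"), ("znaet", "V"), ("paket", "N")]
model = train_suffix_model(training)
print(guess_tag_probs("dumaet", model))  # matches the 3-letter suffix "aet"
print(guess_tag_probs("bilet", model))   # backs off to the 2-letter suffix "et"
```

Fixing `max_len=2` would give a fixed 2-letter variant, while larger values give a variable-length backoff.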

The Viterbi Algorithm
• Discards tag combinations that cannot be part of the best sequence, saving on calculations
• Guaranteed to keep the most likely tag sequence
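The idea can be sketched for a bigram hidden Markov model: at each word, only the best path ending in each tag is kept, so the work grows with the number of words times the square of the number of tags instead of with every possible tag combination. The tag set and all probabilities below are invented for illustration and do not come from the corpus.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for a non-empty list of words."""
    # best[t] = (probability, path) of the best path so far that ends in tag t
    best = {
        t: (start_p.get(t, 0.0) * emit_p[t].get(words[0], 0.0), [t])
        for t in tags
    }
    for word in words[1:]:
        new_best = {}
        for t in tags:
            # Keep only the best predecessor for tag t; all others are dropped.
            prob, path = max(
                ((best[p][0] * trans_p[p].get(t, 0.0), best[p][1]) for p in tags),
                key=lambda item: item[0],
            )
            new_best[t] = (prob * emit_p[t].get(word, 0.0), path + [t])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

# Toy model with invented numbers: "chto" may be a pronoun or a conjunction.
TAGS = ["PRON", "CONJ", "V"]
START = {"PRON": 0.6, "CONJ": 0.3, "V": 0.1}
TRANS = {
    "PRON": {"PRON": 0.1, "CONJ": 0.1, "V": 0.8},
    "CONJ": {"PRON": 0.7, "CONJ": 0.1, "V": 0.2},
    "V": {"PRON": 0.3, "CONJ": 0.6, "V": 0.1},
}
EMIT = {
    "PRON": {"ja": 0.4, "on": 0.4, "chto": 0.2},
    "CONJ": {"chto": 0.9, "i": 0.1},
    "V": {"znaju": 0.5, "spit": 0.5},
}

sentence = ["ja", "znaju", "chto", "on", "spit"]  # "I know that he is sleeping"
print(viterbi(sentence, TAGS, START, TRANS, EMIT))
```

In this toy model, `chto` is resolved as a conjunction because it follows a verb and precedes a pronoun, which illustrates how the surrounding words decide an ambiguous tag.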

Example Post: After Part of Speech Tagging

Results
• Training 1,500 words, testing 500 words
  • Naïve method: 58.1%
  • With 2-letter suffix function: 72.6%
  • With variable-length suffix function: 74.0%
• Training 2,300 words, testing 500 words
  • Naïve method: 60.1%
  • With 2-letter suffix function: 76.7%
  • With variable-length suffix function: 78.3%

Conclusions
• Feedback is positive
• Results suggest automatic annotation is useful even with a small training set (1,500 words)
• New data at 450 words per hour
  • Sentence boundaries at 18,000 words/hour
  • Tokenization at 10,000 words/hour
  • Part of speech tagging at 500 words/hour

Chris Tessone
Candidate for Honors in Russian
Modern Languages Department, Knox College
ctessone@knox.edu
