EXTENDING A PERSIAN MORPHOLOGICAL ANALYZER TO BLOGS
Karine Megerdoomian
University of Maryland, College Park
karinem@umiacs.umd.edu
University of Tehran (دانشگاه تهران)
Second Research Workshop on Persian Language and Computers (دومین کارگاه پژوهشی زبان فارسی و رایانه)
Talk Outline
• Persian Weblogs
  • Persian is the 4th largest blog language in the world (~75,000 sites)
• Description of a finite-state morphological analyzer for Persian
  • System description
  • Language issues and implementation
• Computational issues in weblogs
Language of Blogs
• Blogs contain both formal and informal morphology
• Morphology
  • Informal text is very different from formal text. Formal: مرا گرفته است ؛ informal: گرفتهتم
  • Features that don't exist in formal text: فروشندهه ؛ رفتش
  • Shortened verbal stems and inflection. Formal: می گویند ؛ informal: میگن
Language of Blogs
• Morphology
  • Colloquial pronunciation: غلطای املایی ؛ این سایتو ؛ دوستامونم ؛ دردناکه ؛ مثل منن ؛ ازشون ؛ خودتون ؛ نگاههایشان ؛ همسایهاشون
  • Spelling errors and non-standard punctuation and spacing
  • Emoticons and hyperlinks
Language of Blogs
• Lexicon
  • Wordforms follow pronunciation: اوضاش ؛ برام ؛ نگامی کنم ؛ خونه ؛ تمبل ؛ همدیگه ؛ بش گفتم
  • Colloquial forms: تو دانشگاه ؛ واسه استادام
  • New words: لینکدونی ؛ دوستان کامنتگذار
Language of Blogs
• Lexicon
  • Loan words: چتروم ؛ آنلاین ؛ دانلود کنین
  • Interjections: آاااخ! ؛ والا ؛ وای ؛ اوووه!
  • More idiomatic expressions: دمشگرم آقا
Language of Blogs
• Huge amount of variation!
  • Need for flexible rules
  • Phonological rules to represent colloquial speech
  • Need to disambiguate (statistical component?)
• Formal blog text is also different from traditional formal text
Language of Blogs
خوابگرد | BBC
موافق‌اند | موافقند
بیننده‌گان | بینندگان
کتاب‌اش | کتابش
کم‌تر | کمتر
کافی‌ست | کافیست
حتا | حتی
Finite-State Transducers (FST)
• Two-level network or transducer
• Input = lower side of arc
• Output = upper side of arc
  Upper side: b i r d +Noun +Pl
  Lower side: b i r d s
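The two-level idea on this slide can be illustrated with a small Python sketch. It is not XFST and not the author's system: the arc list, the state numbers, and the extra `+Sg` arc are assumptions made for illustration. Each arc carries an upper (lexical) and a lower (surface) label, and analysis reads the lower side while collecting the upper side.

```python
# Toy two-level network: each arc is (source, upper, lower, target).
# EPS marks an arc that consumes nothing on the lower (surface) side.
EPS = ""
ARCS = [
    (0, "b", "b", 1),
    (1, "i", "i", 2),
    (2, "r", "r", 3),
    (3, "d", "d", 4),
    (4, "+Noun", EPS, 5),
    (5, "+Pl", "s", 6),
    (5, "+Sg", EPS, 6),  # hypothetical arc, not shown on the slide
]
FINAL = {6}


def analyze(surface, state=0, pos=0, out=()):
    """Read `surface` along the lower side and return all upper-side strings."""
    results = []
    if state in FINAL and pos == len(surface):
        results.append("".join(out))
    for src, upper, lower, dst in ARCS:
        if src != state:
            continue
        if lower == EPS:
            # Epsilon on the lower side: emit the tag without consuming input.
            results.extend(analyze(surface, dst, pos, out + (upper,)))
        elif surface.startswith(lower, pos):
            results.extend(
                analyze(surface, dst, pos + len(lower), out + (upper,))
            )
    return results


print(analyze("birds"))  # ['bird+Noun+Pl']
print(analyze("bird"))   # ['bird+Noun+Sg']
```

Swapping the roles of the upper and lower labels in `analyze` would give generation instead of analysis, which is why a single network serves both directions.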
MA: System Description
• Developed on Xerox Finite State Technology (XFST) [Karttunen & Beesley 1992]
• Components:
  • Lexicon and morphology rules (lexc)
  • Phonological rules (regular expressions)
• Compiled into an FST (finite-state transducer)
• An FST is created separately for each part of speech; these are then composed into the final FST for morphological analysis
MA: System Description
[Figure: the phonology rules and the Noun, Verb and Adverb FSTs are combined by COMPOSITION into the final FST for morphology, which maps an input string to an output string.]
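Composition, the operation at the center of the figure, can be sketched in Python by treating each transducer as a finite set of (upper, lower) string pairs. This is only a toy model: real FSTs are composed as networks, not as enumerated sets. The lexicon pairs and the rule that turns `_` into a zero-width non-joiner are assumptions for illustration.

```python
ZWNJ = "\u200c"  # zero-width non-joiner


def compose(upper_rel, lower_rel):
    """Relation composition: pair a with c whenever some b links them."""
    return {
        (a, c)
        for a, b in upper_rel
        for b2, c in lower_rel
        if b == b2
    }


# Hypothetical lexicon relation: analysis -> intermediate form.
lexicon = {("ktab+Pl", "ktab_ha"), ("ktab+Sg", "ktab")}

# Hypothetical rule relation: intermediate form -> surface form,
# here assuming "_" stands for a zero-width non-joiner.
rules = {(mid, mid.replace("_", ZWNJ)) for _upper, mid in lexicon}

analyzer = compose(lexicon, rules)
print(sorted(analyzer))
```

The composed relation maps analyses directly to surface forms, so the intermediate level disappears from the final analyzer.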
MA: System Description
• Coverage: formal Persian language
  • Full verbal conjugation
  • Nonverbal inflection: مسافرین ؛ فقرا
  • Productive derivational morphology: سرسامآور
  • ~20 phonological rules
  • Proper nouns of people, places, organizations
Inflectional Morphology
LEXICON Root
ktab Noun ;

LEXICON Noun
+Pl:ha # ;     ! کتابها
+Pl:_ha # ;    ! کتاب‌ها
+Sg:0 # ;      ! کتاب
+Pl:a # ;      ! کتابا
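To see what these two lexicons license, the Python sketch below walks the continuation classes and lists every upper:lower pair. The dictionary layout and the `expand` function are illustrative assumptions, not part of lexc. The `0` of the slide (empty lower side) is written as an empty string, and `#` marks the end of a word.

```python
# Each entry: (upper side, lower side, continuation class).
LEXICONS = {
    "Root": [("ktab", "ktab", "Noun")],
    "Noun": [
        ("+Pl", "ha", "#"),
        ("+Pl", "_ha", "#"),
        ("+Sg", "", "#"),   # "+Sg:0" on the slide: nothing on the lower side
        ("+Pl", "a", "#"),
    ],
}


def expand(lexicon="Root", upper="", lower=""):
    """Yield every (upper, lower) pair reachable from `lexicon`."""
    if lexicon == "#":
        yield upper, lower
        return
    for up, low, continuation in LEXICONS[lexicon]:
        yield from expand(continuation, upper + up, lower + low)


for pair in expand():
    print(pair)
```

Three different lower-side strings share the upper side `ktab+Pl`, which is how the lexicon lets formal and colloquial plural spellings receive the same analysis.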
Complex Tokens
• Two different POS categories in one token: بعقیدهشما ؛ اینکار ؛ بهترست ؛ دردفتر ؛ وگفت
  بعقیده : bh+Prep<eqydh+Noun+Sg
  دردفتر : dr+Prep<dftr+Noun+Sg
  کتابهایمان : ktab+Noun+Pl>av+Pron+Pers+Poss+1P+Pl
  برادرشه : bradr+Noun+Sg>av+Pron+Pers+Poss+3P+Sg>bvdn+Verb+Ind+Pres+3P+Sg
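A downstream consumer of these analyses would typically need the individual units back. The Python sketch below assumes that `<` and `>` mark the boundaries between attached elements and their host and that `+` separates a lemma from its tags; this reading comes from the examples on the slide, not from documentation of the analyzer, and `split_analysis` is a hypothetical helper.

```python
import re


def split_analysis(analysis):
    """Split a complex-token analysis into (lemma, tags) units."""
    units = []
    for part in re.split(r"[<>]", analysis):
        lemma, *tags = part.strip().split("+")
        units.append((lemma, tags))
    return units


print(split_analysis("bh+Prep<eqydh+Noun+Sg"))
print(split_analysis("ktab+Noun+Pl>av+Pron+Pers+Poss+1P+Pl"))
```

Each unit keeps its own part-of-speech tags, so a single token such as دردفتر yields both a preposition and a noun.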
Verbal Morphology • Two different stems
Verbal Morphology
LEXICON PastStem
tvanst Infl1 ;
rft Infl1 ;
xndyd Infl1 ;

LEXICON PresentStem
tvanst:tvan Infl2 ;
rft:rv Infl2 ;
xndyd:xnd Infl2 ;

LEXICON PstStemBlog
tvnst InflBlog1 ;

LEXICON PrStemBlog
tvanst:tvn Infl2 ;
rft:r Infl2 ;
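The stem lexicons pair a citation form on the upper side with a formal or colloquial stem on the lower side. The Python sketch below mirrors the four lexicons as a plain table and looks up which entries could have produced a given surface stem. The table layout and `lookup_stem` are illustrative assumptions, and the inflection classes (`Infl1`, `Infl2`, `InflBlog1`) are left out.

```python
# (upper side, lower side) pairs copied from the slide's stem lexicons.
STEM_LEXICONS = {
    "PastStem": [("tvanst", "tvanst"), ("rft", "rft"), ("xndyd", "xndyd")],
    "PresentStem": [("tvanst", "tvan"), ("rft", "rv"), ("xndyd", "xnd")],
    "PstStemBlog": [("tvnst", "tvnst")],
    "PrStemBlog": [("tvanst", "tvn"), ("rft", "r")],
}


def lookup_stem(surface):
    """Return (upper form, lexicon name) for every entry matching `surface`."""
    return [
        (upper, name)
        for name, entries in STEM_LEXICONS.items()
        for upper, lower in entries
        if lower == surface
    ]


print(lookup_stem("rv"))   # [('rft', 'PresentStem')]
print(lookup_stem("tvn"))  # [('tvanst', 'PrStemBlog')]
```

Keeping the blog stems in separate lexicons means the colloquial forms can be added without touching the entries for formal text.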
Long Distance Dependencies
• Some tenses of the verb can only be determined by taking into account the co-occurrence of the prefix with the person inflection or auxiliary: a problem for linear approaches
Long Distance Dependencies
• Leads to very complex paths and continuation classes in lexc
• Using filters greatly increases the size of the FST
• Use flag diacritics for unification (@U.Feature.Value@)
  - Keeps FST small
  - Can apply constraints between non-adjacent morphemes
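The unification test behind `@U.Feature.Value@` can be sketched in a few lines of Python. This is a simplified model of how a path is checked at lookup time, not XFST's implementation; the feature name `Form` and the placeholder morphemes are hypothetical. A path is rejected as soon as two flags set the same feature to different values, and the flags never appear in the output.

```python
import re

FLAG = re.compile(r"^@U\.([^.@]+)\.([^.@]+)@$")


def accept(path):
    """Return the path's output string, or None if unification fails."""
    features = {}
    output = []
    for symbol in path:
        match = FLAG.match(symbol)
        if match is None:
            output.append(symbol)  # ordinary symbol: goes to the output
            continue
        feature, value = match.groups()
        # First occurrence sets the feature; later ones must agree.
        if features.setdefault(feature, value) != value:
            return None
    return "".join(output)


# Hypothetical prefix and suffix that must agree on the feature "Form".
ok = ["@U.Form.A@", "pfx-", "stem", "@U.Form.A@", "-sfx"]
clash = ["@U.Form.A@", "pfx-", "stem", "@U.Form.B@", "-sfx"]
print(accept(ok))     # pfx-stem-sfx
print(accept(clash))  # None
```

Because the constraint lives in the flags rather than in duplicated paths, the prefix and the suffix can be checked against each other without multiplying states for every stem in between.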
Phonology Rules
• Form of affixes may change based on the ending character of the stem
  Formal: کتابش ؛ چشمهایش/صدایش ؛ همسایه‌اش
  Informal: کتابش ؛ چشماش/صداش ؛ همسایش
  define clitic1 [ ^NB -> 0 || Cons _ ] ;
  define clitic2 [ ^NB -> y || Vowel _ ] ;
  define clitic3 [ ^NB -> "\u200c" a || e _ ] ;
  Example inputs: ktab^NBš ؛ Sda^NBš ؛ hmsaye^NBš
  Callout on slide: optional in informal blog text
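The effect of the three clitic rules on the example inputs can be approximated with regular-expression substitutions in Python. This is a sketch under assumptions: the vowel set is hypothetical, `e` is handled before the other vowels, anything else counts as a consonant, and only the formal outputs are modeled. The real rules are compiled into transducers rather than applied as string rewrites.

```python
import re

ZWNJ = "\u200c"
NB = re.escape("^NB")   # boundary marker between stem and clitic
VOWELS = "aui"          # hypothetical vowel set for the transliteration


def apply_clitic_rules(form):
    """Rewrite the ^NB marker according to the stem-final character."""
    # clitic3: after e, insert a zero-width non-joiner followed by a
    form = re.sub(rf"(?<=e){NB}", ZWNJ + "a", form)
    # clitic2: after another vowel, insert y
    form = re.sub(rf"(?<=[{VOWELS}]){NB}", "y", form)
    # clitic1: after anything else (treated as a consonant), delete the marker
    form = re.sub(rf"(?<=[^{VOWELS}e]){NB}", "", form)
    return form


for intermediate in ["ktab^NBš", "Sda^NBš", "hmsaye^NBš"]:
    print(repr(apply_clitic_rules(intermediate)))
```

The three outputs correspond to the formal forms کتابش, صدایش and همسایه‌اش listed on the slide; the informal variants would need additional or optional rules that are not modeled here.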
Evaluation
• FST: 178,452 states; 928,982 arcs before optimization
• Speed: 20.84 seconds of CPU time for a 10 MB file on a Sun SPARCstation
• Coverage = 97.5%; Accuracy = 95%
• Unanalyzed tokens: proper nouns + missing lexicon words
• No weblog language rules included yet!
Conclusion
• Challenges in morphological analysis of formal Persian text: solutions in the XFST system
• New issues and variance due to blog language
• Need a robust system:
  • Lexicon updated with colloquial forms
  • Flexible morphological rules + derivational morphology rules
  • Transliteration component for loan words
  • Statistical approach to disambiguate and to deal with unknowns