Developing a Persian Part of Speech Tagger Karine Megerdoomian University of California, San Diego karinem@ling.ucsd.edu
Part of speech tagging applications • Machine translation • Information extraction • Parsing
Overview of issues • Encoding issues • Long-distance dependencies • Boundaries of words and phrases • Complex tokens and multiword expressions
Types of Taggers • Knowledge-based taggers: based on grammar rules; can analyze complex structures; cannot guess unknown words • Statistical taggers: trained on a pre-tagged corpus; can guess unknown words; saturation • Hybrid taggers: knowledge-based for tagging; statistical for guessing and disambiguating
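The hybrid idea can be sketched roughly as follows. This is a minimal illustration, not the tagger described in the talk: the class name `HybridTagger`, the suffix-frequency guesser, the toy tag labels, and the default tag are all assumptions made for the example, and the statistical disambiguation step is left out.

```python
from collections import Counter, defaultdict


class HybridTagger:
    """Toy hybrid tagger: lexicon lookup first, statistical guess for unknowns."""

    def __init__(self, lexicon, suffix_len=2):
        self.lexicon = lexicon                      # word -> tag (knowledge-based part)
        self.suffix_len = suffix_len                # hypothetical parameter
        self.suffix_counts = defaultdict(Counter)   # suffix -> tag frequencies

    def train(self, tagged_corpus):
        # Statistical part: count which tags co-occur with each word ending.
        for word, tag in tagged_corpus:
            self.suffix_counts[word[-self.suffix_len:]][tag] += 1

    def tag(self, word, default="N"):
        # Known words are tagged by the knowledge-based component.
        if word in self.lexicon:
            return self.lexicon[word]
        # Unknown words are guessed from the most frequent tag for their suffix.
        counts = self.suffix_counts.get(word[-self.suffix_len:])
        return counts.most_common(1)[0][0] if counts else default


tagger = HybridTagger({"the": "DET"})
tagger.train([("walked", "V"), ("talked", "V"), ("bed", "N")])
print(tagger.tag("the"), tagger.tag("jumped"))
```

A real system would replace the dictionary with a morphological analyzer and the suffix counts with a trained model; the point here is only the division of labor between the two components.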
Tagset Design • Tagset = set of annotation tags used to classify analyzed tokens. • The tagset provides relevant linguistic information about the syntactic properties of the word. • Tagset design depends on the final application of the system: part of speech; part of speech types (for information retrieval); boundaries (for parsing); semantic information (for word sense disambiguation) • Tagset size should be small for probabilistic machines
Script and Encoding • Diacritics are optional, e.g. پست • Multiple encodings: Persian Unicode (\u06a9 for ک and \u06cc for ی); Arabic Unicode (\u0643 for ک and \u064a or \u0649 for ی) • Control characters: Unicode \u200c (zero-width non-joiner) to mark the final form of letters
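One way to handle the encoding variation is to normalize text before tagging. The sketch below assumes that Arabic kaf (U+0643), Arabic yeh (U+064A), and alef maksura (U+0649) should all be folded into their Persian counterparts (U+06A9 and U+06CC) and that the optional diacritics in the range U+064B to U+0652 can simply be dropped; the function name `normalize` is illustrative.

```python
import re

# Fold Arabic code points into the Persian ones (assumed normalization policy).
ARABIC_TO_PERSIAN = str.maketrans({
    "\u0643": "\u06a9",  # Arabic kaf          -> Persian keheh
    "\u064a": "\u06cc",  # Arabic yeh          -> Persian (Farsi) yeh
    "\u0649": "\u06cc",  # Arabic alef maksura -> Persian (Farsi) yeh
})

# Optional short-vowel and related marks: fathatan (U+064B) through sukun (U+0652).
DIACRITICS = re.compile("[\u064b-\u0652]")


def normalize(text: str) -> str:
    """Map Arabic letter variants to Persian ones and strip diacritics."""
    return DIACRITICS.sub("", text.translate(ARABIC_TO_PERSIAN))


# Arabic kaf + Arabic yeh + fatha becomes Persian keheh + Persian yeh.
print(normalize("\u0643\u064a\u064e") == "\u06a9\u06cc")
```

The zero-width non-joiner (\u200c) is deliberately left untouched, since it carries word-internal boundary information.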
Word Boundaries • Optional whitespace • Can use a post-segmentation script to tokenize: رفتندمردم • Separated affixes: می دادند vs. میدادند; فلسطينی ها vs. فلسطينیها • Complex tokens • Two different POS categories: بشيوه – اينکار – بهترست – دردفتر
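As a rough illustration of normalizing separated affixes, the following sketch attaches a detached verbal prefix (می) and a detached plural suffix (ها) to their host word with a zero-width non-joiner. Restricting it to these two affixes is an assumption taken from the slide's examples, and the function name `attach_affixes` is hypothetical. A real tokenizer would need more context, because a free-standing می is not always the prefix.

```python
import re

ZWNJ = "\u200c"
MI = "\u0645\u06cc"   # verbal prefix "mi-" (meem + Persian yeh)
HA = "\u0647\u0627"   # plural suffix "-ha" (heh + alef)


def attach_affixes(text: str) -> str:
    """Join a detached 'mi-' prefix and '-ha' suffix to the host word with ZWNJ."""
    # Standalone prefix followed by whitespace and another word.
    text = re.sub(rf"(?<!\S){MI}[ {ZWNJ}]+(?=\S)", MI + ZWNJ, text)
    # Word followed by whitespace and a standalone plural suffix.
    text = re.sub(rf"(?<=\S)[ {ZWNJ}]+{HA}(?!\S)", ZWNJ + HA, text)
    return text


verb = "\u062f\u0627\u062f\u0646\u062f"  # the verb form from the slide's first example
print(attach_affixes(MI + " " + verb) == MI + ZWNJ + verb)
```

Fully fused forms written without any separator are not split by this sketch; that direction requires morphological analysis.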
Multiword Expressions • Lexical units: usually need to be listed in the lexicon, e.g. بنابراين • Morphological units: can be analyzed in the morphological analyzer, e.g. فروخته بوده اند • Phrasal verbs: have syntactic properties; analyzed at an intermediate level between morphology and syntax, e.g. اظهار تأسف کردند • ماشين لباسشوئی – خمرهای سرخ
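For lexical units that are listed in the lexicon, a simple greedy longest-match pass over the token stream is one possible way to merge them before tagging. The sketch below is an assumption about how such a lookup could work, not the method used in the talk; `merge_mwes`, `mwe_lexicon`, and `max_len` are illustrative names.

```python
def merge_mwes(tokens, mwe_lexicon, max_len=4):
    """Greedily merge the longest token sequence found in the MWE lexicon."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-token sequences.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in mwe_lexicon:
                out.append(" ".join(candidate))
                i += n
                break
        else:
            # No multiword entry starts here; keep the single token.
            out.append(tokens[i])
            i += 1
    return out


lexicon = {("washing", "machine"), ("Khmer", "Rouge")}
print(merge_mwes(["the", "washing", "machine", "broke"], lexicon))
```

The English glosses stand in for lexicon entries; phrasal verbs, which can be interrupted by other material, are not covered by this contiguous match.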
Phrasal Boundaries • Noun phrases are highly ambiguous • Short vowels are not written • Lack of overt boundary markers • No particles linking NP elements (اضافه, the ezafe, is often not written) • Subject-Object-Verb order • Very long sentences • Relatively free word order • Boundary markers • Proper names and pronouns: وزير خارجه آينده آمريکا • Morphemes: ی، تان/شان، را • Lack of boundary: اضافه, e.g. انفجارهای اخيرعراق
Long distance dependencies • Some tenses of the verb can only be determined by taking into account the co-occurrence of the prefix, the person inflection, and the auxiliary forms. This is a problem for linear approaches (e.g., two-level morphology). • Imperfect: میگريختند • Compound Imperfect: می گريخته است • Perfect: گريخته است
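The dependency among prefix, participle, and auxiliary can be made concrete with a small decision function. It assumes a morphological analyzer has already produced three boolean features per verb group; the `VerbFeatures` fields and the `classify_tense` function are hypothetical, and only the three forms listed on the slide are covered.

```python
from dataclasses import dataclass


@dataclass
class VerbFeatures:
    has_mi_prefix: bool   # the prefix "mi-" is present
    is_participle: bool   # the main verb is a past participle
    has_auxiliary: bool   # an auxiliary form follows


def classify_tense(v: VerbFeatures) -> str:
    """Combine non-adjacent cues; no single cue determines the tense."""
    if v.is_participle and v.has_auxiliary:
        # The prefix at the left edge decides between the two compound forms.
        return "compound imperfect" if v.has_mi_prefix else "perfect"
    if v.has_mi_prefix:
        return "imperfect"
    return "unknown"


print(classify_tense(VerbFeatures(True, True, True)))
```

The same participle-plus-auxiliary sequence yields a different tense depending on a prefix several morphemes away, which is why a strictly left-to-right analysis struggles here.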
Phonetics and Phonology • Consistent phonological patterns • Form of the affix varies based on the last character of the word: گدايان، بيگانگان، دانشجويي • Phonological rules apply across categories, so there is no need to list all possible forms if rules are used • Mismatch between orthography and phonetics: need to distinguish words based on their pronunciation • دانشجو vs. گاو
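A rule of this kind can be sketched for the plural suffix ان, used here purely as an illustration. The sketch assumes the lexicon supplies a flag saying whether the word ends in a vowel sound, since the final letter alone cannot decide that (final و is a vowel in دانشجو but a consonant in گاو). The function name `plural_an` and the simplified rule set are assumptions, and the inserted yeh is the Persian code point U+06CC.

```python
ALEF, VAV, HEH = "\u0627", "\u0648", "\u0647"
YEH, GAF, NOON = "\u06cc", "\u06af", "\u0646"


def plural_an(word: str, final_is_vowel: bool) -> str:
    """Attach the plural suffix, choosing its form from the word's final sound."""
    last = word[-1]
    if final_is_vowel and last == HEH:
        # Final silent heh is replaced and the suffix surfaces with gaf.
        return word[:-1] + GAF + ALEF + NOON
    if final_is_vowel and last in (ALEF, VAV):
        # A glide (yeh) is inserted between the final vowel and the suffix.
        return word + YEH + ALEF + NOON
    # Consonant-final words take the bare suffix.
    return word + ALEF + NOON


student = "\u062f\u0627\u0646\u0634\u062c\u0648"  # ends in vowel u
cow = "\u06af\u0627\u0648"                        # ends in consonant v
print(plural_an(student, True), plural_an(cow, False))
```

Both words end in the same letter, yet only the pronunciation flag selects the correct suffix form, which is the orthography-phonetics mismatch noted above.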
Conclusion • Overview of the main challenges encountered in the development of a POS tagger for Persian. • Introduced certain criteria to be considered in designing an annotation set for POS tagging. • Contrasted various approaches and proposed possible methods for resolving these computational and linguistic issues.