
Hindi Parsing




  1. Hindi Parsing Samar Husain LTRC, IIIT-Hyderabad, India.

  2. Outline • Introduction • Grammatical framework • Two stage parsing • Evaluation • Two stage constraint based parsing • Integrated data driven parsing • Two stage data driven parsing

  3. Introduction • Broad coverage parser for Hindi • Crucial for MT systems, IE, co-reference resolution, etc. • An attempt at a hybrid parser • Grammatical framework: Dependency

  4. Introduction • Levels of analysis before parsing • Morphological analysis (morph info) • Analysis in local context (POS tagging, chunking, computation of case markers/postpositions) • We parse after the above processing is done.

  5. Computational Paninian Grammar (CPG) • Based on Panini’s Grammar • Inspired by an inflectionally rich language (Sanskrit) • A dependency based analysis (Bharati et al., 1995a) • Earlier parsing approaches for Hindi (Bharati et al., 1993; 1995b; 2002)

  6. CPG (The Basic Framework) • Treats a sentence as a set of modifier-modified relations • A sentence has a primary modified, or root (which is generally a verb) • Gives us the framework to identify these relations • Relations between noun constituents and the verb are called ‘karaka’ • karakas are syntactico-semantic in nature • Syntactic cues help us in identifying the karakas

  7. karta and karma karakas ‘The boy opened the lock’ • k1 – karta • k2 – karma • karta and karma usually correspond to agent and theme respectively • But not always • karakas are direct participants in the activity denoted by the verb • For the complete list of dependency relations, see (Begum et al., 2008) • [Tree: ‘open’ as root, with k1 edge to ‘boy’ and k2 edge to ‘lock’]
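The k1/k2 analysis on this slide can be sketched as a small dependency structure. This is a hypothetical representation for illustration only (determiners dropped), not the parser's actual data format:

```python
# A minimal dependency-tree sketch for "The boy opened the lock".
# Each token stores its head index and a CPG relation label;
# the root ("opened") points to index 0, a dummy root.
from dataclasses import dataclass

@dataclass
class Token:
    idx: int    # 1-based position in the sentence
    form: str
    head: int   # index of the head token (0 = root)
    rel: str    # CPG dependency label, e.g. k1 (karta), k2 (karma)

sentence = [
    Token(1, "boy",    2, "k1"),   # karta (roughly, agent)
    Token(2, "opened", 0, "root"),
    Token(3, "lock",   2, "k2"),   # karma (roughly, theme)
]

# Children of the verb, grouped by karaka label:
children = {t.rel: t.form for t in sentence if t.head == 2}
```

Here `children` maps each karaka label to the dependent of the verb that carries it.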

  8. Hindi Parsing: Approaches tried • Two stage constraint based parsing • Data driven parsing • Integrated • Two stage

  9. Two stage parsing • Basic idea • There are two layers (stages) • The 1st stage handles intra-clausal relations, and the 2nd stage handles inter-clausal relations • The output of each stage is a linguistically sound partial parse that becomes the input to the next layer

  10. Stage 1 • Identify intra-clausal relations • the argument structure of the verb, • noun-noun genitive relation, • infinitive-verb relation, • infinitive-noun relation, • adjective-noun, • adverb-verb relations, • nominal coordination, etc.

  11. Stage 2 • Identify inter-clausal relations • subordinating conjuncts, • coordinating conjuncts, • relative clauses, etc.

  12. How do we do this? • Introduce a dummy __ROOT__ node as the root of the dependency tree • Helps in giving linguistically sound partial parses • Keeps the tree connected • Classify the dependency tags into two sets • Tags that function within a clause, • Tags that relate two clauses
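The tag classification described above can be sketched as two label sets driving a routing function. The label names beyond k1/k2 are illustrative placeholders, not the treebank's exact inventory (see Begum et al., 2008 for that):

```python
# Sketch of the dependency-tag split behind the two stages.
# Stage 1 attaches relations that function within a clause;
# stage 2 attaches relations that relate two clauses.
INTRA_CLAUSAL = {"k1", "k2", "r6", "nmod", "jjmod", "adv"}   # illustrative
INTER_CLAUSAL = {"ccof", "rel_clause", "sub_conj"}           # illustrative

def stage_for(label: str) -> int:
    """Return which parsing stage is responsible for this relation."""
    if label in INTRA_CLAUSAL:
        return 1
    if label in INTER_CLAUSAL:
        return 2
    raise ValueError(f"unknown dependency label: {label}")
```

Anything a stage is not responsible for stays attached to the dummy __ROOT__ node, which is what keeps each intermediate parse a connected tree.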

  13. An example
  mai   ghar    gayaa    kyomki      mai   bimaar   thaa
  ‘I’   ‘home’  ‘went’   ‘because’   ‘I’   ‘sick’   ‘was’
  ‘I went home because I was sick’

  14. The parses (a): 1st stage output, (b): 2nd stage final parse

  15. 2 stage parsing • 1st stage • All the clauses analyzed • Analyzed clauses become children of __ROOT__ • Conjuncts become children of __ROOT__ • 2nd stage • Does not modify the 1st stage analysis • Identifies relations between 1st stage parsed sub-trees
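The flow for the example sentence can be illustrated with head maps. The stage-2 attachments shown (the conjunct heading the second clause and modifying the first clause's verb) are one plausible final analysis assumed for illustration:

```python
# Hypothetical two-stage flow for "mai ghar gayaa kyomki mai bimaar thaa".
# Stage 1: each clause is fully parsed internally; the clause heads
# ("gayaa", "thaa") and the conjunct ("kyomki") hang off __ROOT__.
stage1_heads = {
    "gayaa":  "__ROOT__",   # head of clause 1 ("I went home")
    "mai_1":  "gayaa",
    "ghar":   "gayaa",
    "kyomki": "__ROOT__",   # subordinating conjunct
    "thaa":   "__ROOT__",   # head of clause 2 ("I was sick")
    "mai_2":  "thaa",
    "bimaar": "thaa",
}

# Stage 2: only the __ROOT__-attached nodes are re-attached;
# intra-clausal edges from stage 1 are left untouched.
stage2_heads = dict(stage1_heads)
stage2_heads.update({"kyomki": "gayaa", "thaa": "kyomki"})
```

This mirrors the slide's two properties: stage 2 never modifies the stage-1 analysis, and it only identifies relations between the stage-1 sub-trees.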

  16. Important linguistic cues that help Hindi parsing • Nominal postpositions • TAM classes • Morphological features • root of the lexical item, etc. • POS/Chunk tags • Agreement • Minimal semantics • Animate-inanimate • Human-nonhuman

  17. Nominal postpositions and TAM
  • rAma ø mohana ko KilOnA xewA hE
    ‘Ram’ — ‘Mohana’ DAT ‘toy’ ‘gives’ → ‘Ram gives a toy to Mohana’
  • rAma ne mohana ko KilOnA xiyA
    ‘Ram’ ERG ‘Mohana’ DAT ‘toy’ ‘gave’ → ‘Ram gave Mohan a toy’
  • rAma ko mohana ko KilOnA xenA padZA
    ‘Ram’ DAT ‘Mohana’ DAT ‘toy’ ‘had to give’ → ‘Ram had to give Mohan a toy’
  • The TAM dictates the postposition that appears on the noun ‘rAma’
  • Related concept in CPG • Verb frames and transformation rules (Bharati et al., 1995a)
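The TAM cue from these three examples can be sketched as a lookup from TAM class to the postposition demanded on the karta. The class names and table entries are illustrative assumptions, not the actual CPG verb-frame resource:

```python
# Sketch: the verb's tense-aspect-modality (TAM) class constrains which
# postposition the karta noun may carry (class names illustrative).
KARTA_PSP_BY_TAM = {
    "wA_hE":    "0",    # habitual present -> unmarked karta (rAma ø)
    "yA":       "ne",   # perfective       -> ergative karta (rAma ne)
    "nA_padZA": "ko",   # obligational     -> dative karta   (rAma ko)
}

def karta_candidates(nouns, tam_class):
    """Keep only (noun, postposition) pairs compatible with the TAM's demand."""
    wanted = KARTA_PSP_BY_TAM[tam_class]
    return [noun for noun, psp in nouns if psp == wanted]
```

For the perfective sentence, only the ne-marked noun survives as a karta candidate, which is exactly the disambiguation work the slide credits to this cue.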

  18. Agreement • rAma ø mohana ko KilOnA xewA hE | ‘Ram gives a toy to Mohana’ • kaviwA ø mohana ko KilOnA xewI hE | ‘Kavita gives a toy to Mohana’ • Verb agrees with ‘rAma’ and ‘kaviwA’ • Agreement helps in identifying ‘k1’ and ‘k2’ • But there are some exceptions to this.
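Agreement as a k1 cue can be sketched as a feature match between the unmarked noun and the verb: in the examples, the verb form varies with the noun's gender (‘xewA’ masc. with rAma, ‘xewI’ fem. with kaviwA). The feature values below are illustrative:

```python
# A minimal agreement check (gender/number), the cue the slide uses
# for identifying k1 when the noun is unmarked.
def agrees(noun_feats, verb_feats):
    """True if the noun's gender and number match the verb's."""
    return (noun_feats["gender"] == verb_feats["gender"]
            and noun_feats["number"] == verb_feats["number"])

rama   = {"gender": "m", "number": "sg"}   # rAma
kavita = {"gender": "f", "number": "sg"}   # kaviwA
xewA   = {"gender": "m", "number": "sg"}   # masc. verb form
xewI   = {"gender": "f", "number": "sg"}   # fem. verb form
```

As the slide notes, this cue has exceptions (e.g. the verb does not agree with an overtly case-marked karta), so it can only rank candidates, not decide alone.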

  19. Evaluation • Two stage constraint based parser • Data driven parsing • Integrated • 2 stage

  20. Constraint based hybrid parsing • Constraint satisfaction problem (Bharati et al. 2008a) • Hard constraints • Rule based • Soft constraints • ML • Selective resolution of demands • Repair • Partial Parses

  21. Overall performance • UA: unlabeled attachment accuracy • L: label accuracy • LA: labeled attachment accuracy
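The three scores can be computed from gold and predicted (head, label) pairs, one pair per token; this is the standard definition assumed here, not code from the evaluation itself:

```python
# UA: fraction of tokens with the correct head.
# L:  fraction of tokens with the correct dependency label.
# LA: fraction of tokens with both head and label correct.
def evaluate(gold, pred):
    """gold, pred: equal-length lists of (head_index, label) per token."""
    n = len(gold)
    ua = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n
    l  = sum(g[1] == p[1] for g, p in zip(gold, pred)) / n
    la = sum(g == p for g, p in zip(gold, pred)) / n
    return ua, l, la
```

Note LA is never higher than UA or L, since it requires both to be right, which is why the error analyses on the following slides focus on the lower LA numbers.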

  22. Error analysis • Reasons for low LA • Too few verb frames • Some phenomena not covered • Prioritization errors

  23. Data driven parsing (Integrated) • Tuning Malt and MST for the Hyderabad dependency treebank (Bharati et al., 2008b) • Experiments with different features • including minimal semantics and agreement

  24. Experimental Setup • Data • 1800 sentences, average length of 19.85 words, 6585 unique tokens. • training set: 1178 sentences • development and test set: 352 and 363 sentences

  25. Experimental Setup • Parsers • Malt-version 1.0.1 (Nivre et al., 2007) • arc eager • SVM • MST-version 0.4b (McDonald et al., 2005) • Non-projective • No. of highest scoring trees (k)=5 • Extended feature set for both parsers

  26. Consolidated results

  27. Error analysis • Reasons for low ‘LA’ • Difficulty in extracting relevant linguistic cues • Agreement • Similar contextual features: Label bias • Non-projectivity • Lack of explicit cues • Long distance dependencies • Complex linguistic phenomena • Small corpus size

  28. Observations • Features that proved crucial • TAM (classes) and nominal postpositions • Minimal semantics • Animate-inanimate • Human-nonhuman • Agreement • After making it explicit as a feature

  29. Data driven parsing: 2 stage (Bharati et al., 2009) • MST parser • Non-projective • FEATS: nominal and verbal inflections, morph info. • Data • 1492 sentences • Training, development and testing: 1200, 100 and 192 respectively.

  30. Modular parsing • Intra-clausal and Inter-clausal separately • Introduce a dummy __ROOT__ • Parse clauses in 1st stage • Then parse relations between clauses in 2nd stage
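The clause segmentation behind this modular setup can be roughly sketched as splitting the token stream at inter-clausal conjuncts before stage 1, then handing clause heads and conjuncts to stage 2. The conjunct list and the split heuristic are illustrative assumptions, much simpler than a real clause identifier:

```python
# Rough sketch: segment a sentence into clauses at conjunct boundaries.
# Each resulting chunk is parsed independently by the stage-1 parser.
CONJUNCTS = {"kyomki", "ki", "aur", "lekina"}   # illustrative list

def split_clauses(tokens):
    """Split a token list into clause chunks, dropping the conjuncts."""
    clauses, current = [], []
    for tok in tokens:
        if tok in CONJUNCTS:
            if current:
                clauses.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        clauses.append(current)
    return clauses
```

Because stage 1 then sees only short, local token spans, the search space shrinks and most long-distance, inter-clausal decisions are deferred to stage 2, matching the advantages listed on the next slides.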

  31. Comparison with integrated parser • There was a 2.05%, 1.62%, and 1.98% increase in LA, UA, and L respectively.

  32. Evaluation

  33. Advantages • Learning long distance dependencies becomes easy • Stage 2 specifically learns them efficiently • Few non-projective sentences • Only intra-clausal ones remain • Search space becomes local • Handling complex sentences becomes easy

  34. Error analysis • Reasons for low ‘LA’ (in 1st stage) • Unavailability of explicit cues • Combining modular parsing with minimal semantics should help • Difficulty in learning complex cues • Agreement • Similar contextual features: Label bias • Small corpus size

  35. References • R. Begum, S. Husain, A. Dhwaj, D. Sharma, L. Bai, and R. Sangal. 2008. Dependency annotation scheme for Indian languages. In Proceedings of IJCNLP-2008. • A. Bharati and R. Sangal. 1993. Parsing Free Word Order Languages in the Paninian Framework. In Proceedings of ACL 1993. • A. Bharati, V. Chaitanya, and R. Sangal. 1995a. Natural Language Processing: A Paninian Perspective. Prentice-Hall of India, New Delhi. • A. Bharati, A. Gupta, and R. Sangal. 1995b. Parsing with Nested Constraints. In Proceedings of the 3rd NLP Pacific Rim Symposium, Seoul. • A. Bharati, R. Sangal, and T. P. Reddy. 2002. A Constraint Based Parser Using Integer Programming. In Proceedings of ICON-2002. • A. Bharati, S. Husain, D. Sharma, and R. Sangal. 2008a. A two-stage constraint based dependency parser for free word order languages. In Proceedings of the COLIPS IALP, Chiang Mai, Thailand. • A. Bharati, S. Husain, B. Ambati, S. Jain, D. Sharma, and R. Sangal. 2008b. Two semantic features make all the difference in parsing accuracy. In Proceedings of the 6th ICON, Pune, India. • A. Bharati, S. Husain, S. P. K. Gadde, B. Ambati, and R. Sangal. 2009. A modular cascaded partial parsing approach to complete parsing. In submission. • R. McDonald, F. Pereira, K. Ribarov, and J. Hajic. 2005. Non-projective dependency parsing using spanning tree algorithms. In Proceedings of HLT/EMNLP, pp. 523–530. • J. Nivre, J. Hall, J. Nilsson, A. Chanev, G. Eryigit, S. Kübler, S. Marinov, and E. Marsi. 2007. MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(2), 95–135.

  36. Thanks!!
