
Overview of the Hindi-Urdu Treebank

Overview of the Hindi-Urdu Treebank. Fei Xia, University of Washington, 7/23/2011. (Syntactic) Treebank: sentences annotated with syntactic structure (dependency structure or phrase structure). 1960s: Brown Corpus. Early 1990s: The English Penn Treebank.


Presentation Transcript


  1. Overview of the Hindi-Urdu Treebank Fei Xia University of Washington 7/23/2011

  2. (Syntactic) Treebank • Sentences annotated with syntactic structure (dependency structure or phrase structure) • 1960s: Brown Corpus • Early 1990s: The English Penn Treebank • Late 1990s: Prague Dependency Treebank • 1990s – now: Arabic, Chinese, Dutch, Finnish, French, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Italian, Japanese, Korean, Latin, Norwegian, Polish, Spanish, Turkish, etc.

  3. PS and DS • Example sentence: John loves Mary . • Phrase structure (PS): (S (NP John/NNP) (VP loves/VBP (NP Mary/NNP)) ./.) • Dependency structure (DS): loves/VBP is the root, with dependents John/NNP, Mary/NNP, and ./.
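The two representations of "John loves Mary ." can be sketched as small data structures. This is only an illustration: the tuple and dictionary encodings, and the helper names `ps_leaves` and `ds_root`, are assumptions made here and are not the treebank's file format. The tree shapes are read off the slide's example.

```python
# Phrase structure (PS): nested tuples of the form (label, child, child, ...).
# Leaves are "word/TAG" strings. The encoding is illustrative, not the HUTB format.
ps_tree = (
    "S",
    ("NP", "John/NNP"),
    ("VP", "loves/VBP", ("NP", "Mary/NNP")),
    "./.",
)

# Dependency structure (DS): each token maps to its head token.
# The root of the sentence has no head (None).
ds_heads = {
    "John/NNP": "loves/VBP",
    "Mary/NNP": "loves/VBP",
    "./.": "loves/VBP",
    "loves/VBP": None,
}


def ps_leaves(node):
    """Return the leaves of a phrase-structure tree in left-to-right order."""
    if isinstance(node, str):
        return [node]
    leaves = []
    for child in node[1:]:  # node[0] is the phrase label
        leaves.extend(ps_leaves(child))
    return leaves


def ds_root(heads):
    """Return the single token that has no head."""
    roots = [token for token, head in heads.items() if head is None]
    if len(roots) != 1:
        raise ValueError("a dependency tree must have exactly one root")
    return roots[0]


if __name__ == "__main__":
    print(ps_leaves(ps_tree))
    print(ds_root(ds_heads))
```

Both structures cover the same four tokens. The PS tree groups them into phrases, while the DS tree records only word-to-word head relations.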

  4. Proposition Bank (PropBank) • Sentences annotated with predicate argument structure • Ex: John loves Mary • “loves” is the predicate • “John” is Arg0 (“Agent”) • “Mary” is Arg1 (“Theme”) • 2000s: The English PropBank, followed by the PropBanks for Chinese, Arabic, Hindi/Urdu, etc.
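As a rough sketch, the predicate-argument annotation of "John loves Mary" can be held in a small record. The `Proposition` class and the `describe` helper are hypothetical names introduced for illustration; they do not reflect the actual PropBank file format or frame files.

```python
from dataclasses import dataclass, field


@dataclass
class Proposition:
    """One predicate with its labeled arguments (illustrative structure only)."""
    predicate: str
    args: dict = field(default_factory=dict)


# "John loves Mary": "loves" is the predicate, John is Arg0, Mary is Arg1.
prop = Proposition(predicate="loves", args={"Arg0": "John", "Arg1": "Mary"})


def describe(p):
    """Render a proposition as predicate(Label=filler, ...), sorted by label."""
    parts = [f"{label}={filler}" for label, filler in sorted(p.args.items())]
    return f"{p.predicate}({', '.join(parts)})"


if __name__ == "__main__":
    print(describe(prop))  # prints: loves(Arg0=John, Arg1=Mary)
```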

  5. Why do we need treebanks? • Computational linguistics: • To build and evaluate NLP tools (e.g., word segmenters, part-of-speech taggers, parsers, semantic role labelers) • This has led to significant progress in the CL field • Theoretical linguistics: • Annotation guidelines are like a grammar book, with more detail and coverage • As a discovery tool • One can test linguistic theories and collect statistics by searching treebanks.

  6. The Hindi-Urdu Treebank (HUTB) • Traditional approach: • Syntactic treebank: PS or DS, but not both • Layers are added one-by-one • Our approach: • Syntactic treebank: both DS and PS • DS, PS, and PB are developed at the same time • Automatic conversion from DS+PB to PS

  7. Motivation 1: Two Representations • Both phrase-structure treebanks and dependency treebanks are used in NLP • Collins/Charniak/Bikel parsers for PS • CoNLL task on dependency parsing • Problem: currently few (no?) treebanks have PS and DS that are independently motivated • Our project: build a treebank for Hindi/Urdu in which PS and DS are linguistically motivated from the outset • Dependency: Paninian grammar (Panini 400 BC) • Phrase structure: variant of Minimalism (Chomsky 1995)

  8. Motivation 2: Two Content Levels • Everyone (?) wants syntax • Recent popularity of PropBank (Palmer et al. 2002): lexical predicate-argument structure; “semantics as surfacy as it gets” • Recent experience: PropBank may inform some treebanking decisions • Build a treebank with all levels from the outset • Annotating them together allows us to study the relation between DS/PB/PS and reduce annotation time

  9. Goals • Hindi/Urdu Treebank: • DS, PB, and PS for • 400K-word Hindi • 150K-word Urdu • Unified annotation guidelines • Frame files for PropBank • Better understanding of the relation between DS, PB, and PS.

  10. Where we are now • Guidelines are almost complete. • Annotation: • DS annotation: 354K-word Hindi, 60K-word Urdu • PB annotation: 40K-word Hindi • Automatic conversion from DS + PropBank to PS is in progress. • Preliminary releases in 2009 and 2010
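To put these counts in perspective, they can be compared with the targets on the Goals slide (400K words of Hindi, 150K words of Urdu). The sketch below assumes that those targets apply to every annotation layer; the variable and function names are illustrative.

```python
# Target corpus sizes in thousands of words, taken from the Goals slide.
goals_k = {"Hindi": 400, "Urdu": 150}

# Annotated so far, in thousands of words, taken from the status slide.
done_k = {
    ("DS", "Hindi"): 354,
    ("DS", "Urdu"): 60,
    ("PB", "Hindi"): 40,
}


def progress(layer, language):
    """Fraction of the target size already annotated for one layer and language."""
    return done_k[(layer, language)] / goals_k[language]


if __name__ == "__main__":
    for layer, language in done_k:
        print(f"{layer} {language}: {progress(layer, language):.1%}")
```

Under that assumption, DS annotation is about 88.5% complete for Hindi and 40% for Urdu, and PB annotation is about 10% complete for Hindi.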

  11. The HUTB team • IIIT, India (DS team): Dipti Sharma, Samar Husain, Rahul Aggarwal, etc. • Univ. of Colorado at Boulder (PB team): Martha Palmer, Bhuvana Narasimhan, Ashwini Vaidya, Archna Bhatia, etc. • UMass (PS team): Rajesh Bhatt, Annahita Farudi • Columbia Univ. (PS team): Owen Rambow • Univ. of Washington (Conversion): Fei Xia, Michael Tepper

  12. Some Sample Structures • Guideline Sentences • transitive (25), causatives (4), AP predicate (10), clausal extraposition + unaccusative (21), participial adjunct (35), complex predicate (1) • Corpus Sentences
