Overview of the Hindi-Urdu Treebank
Fei Xia, University of Washington
7/23/2011
(Syntactic) Treebank
• Sentences annotated with syntactic structure (dependency structure or phrase structure)
• 1960s: Brown Corpus
• Early 1990s: The English Penn Treebank
• Late 1990s: Prague Dependency Treebank
• 1990s to present: Arabic, Chinese, Dutch, Finnish, French, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Italian, Japanese, Korean, Latin, Norwegian, Polish, Spanish, Turkish, etc.
PS and DS
Example sentence: John loves Mary .
• Phrase structure (PS): (S (NP John/NNP) (VP loves/VBP (NP Mary/NNP)) ./.)
• Dependency structure (DS): loves/VBP is the root, with dependents John/NNP, Mary/NNP, and ./.
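To make the contrast concrete, the two structures for "John loves Mary ." can be written down as plain data. This is only a minimal sketch: the nested-tuple encoding, the arc list, and the helper names `leaves` and `dependents` are illustrative assumptions, not a format used by the treebank.

```python
# Phrase structure as nested tuples: (label, child, child, ...).
# Leaves are "word/TAG" strings. This encoding is an assumption for illustration.
ps_tree = (
    "S",
    ("NP", "John/NNP"),
    ("VP", "loves/VBP", ("NP", "Mary/NNP")),
    "./.",
)

# Dependency structure as (dependent, head) arcs; the root has head None.
ds_arcs = [
    ("loves/VBP", None),
    ("John/NNP", "loves/VBP"),
    ("Mary/NNP", "loves/VBP"),
    ("./.", "loves/VBP"),
]


def leaves(tree):
    """Return the leaf tokens of a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    result = []
    for child in tree[1:]:  # tree[0] is the phrase label
        result.extend(leaves(child))
    return result


def dependents(arcs, head):
    """Return the tokens whose head is `head`, in arc-list order."""
    return [dep for dep, h in arcs if h == head]


print(leaves(ps_tree))
print(dependents(ds_arcs, "loves/VBP"))
```

Both encodings cover the same four tokens; the PS groups them into nested phrases, while the DS links each token directly to its head.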
Proposition Bank (PropBank)
• Sentences annotated with predicate-argument structure
• Example: John loves Mary
  • “loves” is the predicate
  • “John” is Arg0 (“Agent”)
  • “Mary” is Arg1 (“Theme”)
• 2000s: The English PropBank, followed by the PropBanks for Chinese, Arabic, Hindi/Urdu, etc.
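The predicate-argument annotation in the example can be pictured as a small record. The sketch below is an assumption for illustration only: the `Proposition` class and the `describe` helper are hypothetical names, not part of any PropBank tool or file format.

```python
from dataclasses import dataclass, field


@dataclass
class Proposition:
    """One predicate with its numbered arguments (hypothetical container)."""
    predicate: str
    args: dict = field(default_factory=dict)


# Annotation for "John loves Mary", using the labels from the slide.
prop = Proposition("loves", {"Arg0": "John", "Arg1": "Mary"})


def describe(p):
    """Render a proposition as predicate(ArgN=filler, ...), sorted by label."""
    parts = [f"{label}={filler}" for label, filler in sorted(p.args.items())]
    return f"{p.predicate}({', '.join(parts)})"


print(describe(prop))
```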
Why do we need treebanks?
• Computational linguistics:
  • To build and evaluate NLP tools (e.g., word segmenters, part-of-speech taggers, parsers, semantic role labelers)
  • This has led to significant progress in the CL field
• Theoretical linguistics:
  • Annotation guidelines are like a grammar book, with more detail and coverage
  • As a discovery tool: one can test linguistic theories and collect statistics by searching treebanks
The Hindi-Urdu Treebank (HUTB)
• Traditional approach:
  • Syntactic treebank: PS or DS, but not both
  • Layers are added one by one
• Our approach:
  • Syntactic treebank: both DS and PS
  • DS, PS, and PB are developed at the same time
  • Automatic conversion from DS+PB to PS
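As a rough intuition for what converting a dependency structure into a phrase structure involves, the toy sketch below lets every word project one phrase containing its dependents in surface order. This is not the HUTB conversion procedure, which also uses PropBank information and is not described in these slides; the `PHRASE_LABEL` table, the token tuples, and the `project` function are all hypothetical.

```python
# Hypothetical mapping from POS tag to phrase label, for illustration only.
PHRASE_LABEL = {"NNP": "NP", "VBP": "VP"}

# Tokens in surface order: (index, word, tag, head_index).
# head_index None marks the root of the dependency tree.
tokens = [
    (0, "John", "NNP", 1),
    (1, "loves", "VBP", None),
    (2, "Mary", "NNP", 1),
]


def project(index, tokens):
    """Build a flat phrase for the word at `index`: the word itself plus the
    projections of its dependents, kept in surface order."""
    _, word, tag, _ = tokens[index]
    label = PHRASE_LABEL.get(tag, "XP")  # fall back to a generic label
    children = []
    for i, _, _, head in tokens:
        if head == index:
            children.append(project(i, tokens))  # a dependent's own phrase
        elif i == index:
            children.append(f"{word}/{tag}")  # the head word itself
    return (label, *children)


root = next(i for i, _, _, head in tokens if head is None)
print(project(root, tokens))
```

The output is a flat verb phrase rather than a tree with separate S and VP levels, which shows why a real conversion needs more than a one-phrase-per-head rule.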
Motivation 1: Two Representations
• Both phrase-structure treebanks and dependency treebanks are used in NLP
  • Collins/Charniak/Bikel parsers for PS
  • CoNLL shared task on dependency parsing
• Problem: few (if any) current treebanks have both PS and DS with each independently motivated
• Our project: build a treebank for Hindi/Urdu in which PS and DS are linguistically motivated from the outset
  • Dependency: Paninian grammar (Panini 400 BC)
  • Phrase structure: variant of Minimalism (Chomsky 1995)
Motivation 2: Two Content Levels
• Everyone (?) wants syntax
• Recent popularity of PropBank (Palmer et al. 2002): lexical predicate-argument structure; “semantics as surfacy as it gets”
• Recent experience: PropBank may inform some treebanking decisions
• Build the treebank with all levels from the outset
• Annotating them together allows us to study the relation between DS, PB, and PS and reduces annotation time
Goals
• Hindi/Urdu Treebank: DS, PB, and PS for
  • 400K-word Hindi
  • 150K-word Urdu
• Unified annotation guidelines
• Frame files for PropBank
• Better understanding of the relation between DS, PB, and PS
Where we are now
• Guidelines are almost complete.
• Annotation:
  • DS annotation: 354K-word Hindi, 60K-word Urdu
  • PB annotation: 40K-word Hindi
• Automatic conversion from DS + PropBank to PS is in progress.
• Preliminary releases in 2009 and 2010
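Set against the targets on the Goals slide (400K words of Hindi, 150K words of Urdu), these counts give a quick sense of progress. The sketch below only does that arithmetic; the dictionary and function names are illustrative, and it assumes the PB target equals the overall word target for each language.

```python
# Target sizes from the Goals slide and annotated sizes from this slide, in words.
targets = {"Hindi": 400_000, "Urdu": 150_000}
ds_done = {"Hindi": 354_000, "Urdu": 60_000}
pb_done = {"Hindi": 40_000}


def percent_complete(done, target):
    """Percentage of the target annotated so far, per language, to one decimal."""
    return {lang: round(100 * done[lang] / target[lang], 1) for lang in done}


print(percent_complete(ds_done, targets))  # DS annotation progress
print(percent_complete(pb_done, targets))  # PB annotation progress
```

Under these figures, DS annotation stands at 88.5% for Hindi and 40.0% for Urdu, and PB annotation at 10.0% for Hindi.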
The HUTB team
• IIIT, India (DS team): Dipti Sharma, Samar Husain, Rahul Aggarwal, etc.
• Univ. of Colorado at Boulder (PB team): Martha Palmer, Bhuvana Narasimhan, Ashwini Vaidya, Archna Bhatia, etc.
• UMass (PS team): Rajesh Bhatt, Annahita Farudi
• Columbia Univ. (PS team): Owen Rambow
• Univ. of Washington (Conversion): Fei Xia, Michael Tepper
Some Sample Structures
• Guideline Sentences
  • transitive (25), causatives (4), AP predicate (10), clausal extraposition + unaccusative (21), participial adjunct (35), complex predicate (1)
• Corpus Sentences