170 likes | 185 Views
Parsing & Language Acquisition: Parsing Child Language Data. CSMC 35100 Natural Language Processing February 7, 2006. Roadmap. Motivation Understanding language acquisition Child and child-directed Challenges Casual conversation; ungrammatical speech Robust parsing
E N D
Parsing & Language Acquisition:Parsing Child Language Data CSMC 35100 Natural Language Processing February 7, 2006
Roadmap • Motivation • Understanding language acquisition • Child and child-directed • Challenges • Casual conversation; ungrammatical speech • Robust parsing • Dependency structure & dependency labels • Child language assessment • Contrast w/human & auto • Error analysis • Conclusions
Motivation: Child-directed Speech • View child as learner • Child-directed speech • Known to differ significantly from adult-directed • E.g. acoustic-prosodic measures • Slower, increased duration, pause, pitch range, height • Children attend more carefully • Input to learner • Track vocabulary, morphology, phonology, syntax • Relate exposure to acquisition
Motivation: Child Speech • Insight into language acquisition process • Assess linguistic development • Evaluate hypotheses for acquisition • E.g. Rizzi’s “truncated structure hypothesis” • Phonology, morphology, lexicon, syntax,etc
Focus: Syntax • Prior research emphasizes: • Lexicon, phonology, morphology • (Relatively) Easy to extract from • Recordings, manual transcriptions • Syntactic development • Significant markers in linguistic development • Acquisition of inflection, agreement, aux-inversion,. • Rich source of information
Challenges • Analyzing syntactic input & development • Requires parsing of child- and child-directed speech • Manual analysis prohibitively costly for lots • Resources for training: Treebanks • Newspaper, adult conversational speech • Current speech: • Child-directed: very conversational, ellipsis, • Vocatives, onomatopoeia • Child: fragments, possibly ungrammatical
Resources • CHILDES • Corpus (100s MB) of child language data • Many languages • All manually transcribed • Marked for disfluency, repetition, retracing • Some manually morphologically analyzed, POS • Some audio/video available • Not syntactically analyzed • Specialized morphological analyzer, POS
Examples • *CHI: more cookie. • %mor: qn|more n|cookie. • *MOT: how about another graham cracker? • %mor: adv:wh|how prep|about^adv|about det|another n|graham n|cracker?
Parsing for Assessment • Syntactic analysis of child speech • Assign to particular developmental level • Based on presence, frequency of constructions • Measures: • MLU – Mean length of utterance • Reaches ceiling around age 3, uninformative • IPSyn – Explicitly measures syntactic structure • Scores 100 utts on 56 structures • NP,VP, questions, sentence structure • 0=absent; 1=found once; 2=found > once • Some identifiable from POS/morph and patterns • Others not: aux-inversion, conjunction, subord clauses, etc
Syntactic Analysis Approach • Extract grammatical relations (GRs) • Analyze sentence to labeled dependencies • Aux, Neg, Det, Inf, (ECX)Subj, Objs, Preds, Mods • (CX)Jct, (X)Comp, etc • Decomposition: • Text processing: • Removes tagged disfluencies, reps, retracings • Morphological analysis, POS tagging (special) • Unlabeled dependency analysis • Dependency labeling
Unlabeled Dependency Parsing • Identify unlabeled dependency • Parse with Charniak parser • Trained on Penn treebank • Convert to dependencies based on head table • Different domain: 90.1% vs 92% WSJ • Shorter! < 15 wds
Dependency Labeling • Assign labels to dependency structure • Easier than finding dep structure itself • Labels more separable • 30 way classification: • Train on 5K words w/manual dependency labels • TiMBL, features include: • Head & dep words, POS; order, separation, label of subsumer • 91% accuracy • Parse+labels: ~87% • Some trivial: Det, INF: 98%; (X)Comp: ~60% • Competitive
Automating Assessment • Prior work: Computerized Profiling (CP) • Exploits word/POS patterns • Limited for older children; more sophisticated • Generally used in semi-automatic approach • Syntactic analysis improves • Before, he told the man he was cold. • Before he told the story, he was cold • Same POS pattern, similar words; diff’t structure • GR with clausal type (e.g. COMP), dep left • Construct syntactically informed patterns
Evaluation • Point difference: • Unsigned difference in scores (man vs auto) • Point-to-point accuracy: • Number of language structure decisions correct • Divided by number of decisions • Test data: • A) 20 trans: ages 2-3; manual; MLU 2.9 • B) 25 trans: ages 8-9; semi-auto CP; MLU 7.0
Results • Contrast w/ human assessment; CP • Point difference: 3.3 total GR; 8.3 CP • CP worse on older children: 6.2 vs 10.2 • Less effective on more complex sentences • Point-to-point: 92% GR; ~85% CP • No pattern of miss vs false detect • GR automatic scoring: high agreement
Error Analysis • 4 of 56 IPSyn structures -> 50% of errors • Propositional complements, rel. clauses, bitransitive predicates • Result of syntactic analysis errors • Esp. COMP – least accurate • Emphasis/Ellipsis • Bad search patterns • More reliable than POS/word, but still hand-crafted
Conclusion • Automatic analysis of child language data • Syntax, beyond morphology & POS • Two-phrase dependency analysis • Unlabeled structure, followed by label assignment • Accurate even with out-of-domain training • Enables more nuanced assessment • Especially as learner syntax becomes complex