Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models. Elias Ponvert, Jason Baldridge and Katrin Erk, The University of Texas at Austin.
Introduction
• Grammar induction
  • Based on gold-standard POS tags
    • Fundamental model: Constituent Context Model (CCM)
  • Based on raw text
    • Common Cover Links parser (CCL)
    • This paper: cascaded chunking
Motivation of this paper
• CCL depends heavily on low-level constituents:
  • Simply extracting the non-hierarchical multiword constituents from CCL's output and placing a right-branching structure over them works better than CCL's own higher-level predictions.
• Suggestion: improvements to low-level constituent prediction will ultimately lead to further gains in overall constituent parsing.
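The right-branching baseline described above can be sketched in a few lines. This is only an illustration, not the authors' code: the function names, the half-open `(start, end)` span format, and the example sentence are all assumptions made here.

```python
def group_chunks(tokens, chunks):
    """Replace each chunk span (half-open, non-overlapping) with a tuple of its words."""
    ends = {start: end for start, end in chunks}
    items, i = [], 0
    while i < len(tokens):
        if i in ends:
            items.append(tuple(tokens[i:ends[i]]))  # the chunk stays flat inside
            i = ends[i]
        else:
            items.append(tokens[i])
            i += 1
    return items


def right_branching(items):
    """Nest a flat sequence as (x1, (x2, (x3, ...)))."""
    if not items:
        raise ValueError("cannot bracket an empty sequence")
    if len(items) == 1:
        return items[0]
    return (items[0], right_branching(items[1:]))


# Hypothetical example: two chunks predicted over a six-word sentence.
tokens = ["the", "cat", "sat", "on", "the", "mat"]
chunks = [(0, 2), (4, 6)]
tree = right_branching(group_chunks(tokens, chunks))
```

Each chunk is kept as a flat unit and everything above it is bracketed to the right, so the quality of the resulting tree depends entirely on the quality of the chunks.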
Two Investigations
• Unsupervised partial parsing, i.e. unsupervised chunking
• Full parsing via cascaded chunking (explained later)
Data of Unsupervised Chunking
• Two kinds of data:
  • Constituent chunks
    • Multiword
    • Non-hierarchical (do not contain sub-constituents)
  • Base NPs: NPs that do not contain nested NPs
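To make the two definitions concrete, here is a minimal sketch that pulls both kinds of chunk out of a bracketed tree. The tree encoding (`(label, [children])` with plain strings as words), the function names, and the example tree are assumptions for illustration; the treebank preprocessing used in the paper may differ.

```python
def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [w for child in children for w in leaves(child)]


def has_label(node, label):
    """True if the node or any descendant carries the given label."""
    if isinstance(node, str):
        return False
    node_label, children = node
    return node_label == label or any(has_label(c, label) for c in children)


def constituent_chunks(node):
    """Constituents that span more than one word and contain no sub-constituent."""
    if isinstance(node, str):
        return []
    _, children = node
    if all(isinstance(c, str) for c in children):
        return [list(children)] if len(children) > 1 else []
    return [chunk for c in children for chunk in constituent_chunks(c)]


def base_nps(node):
    """NPs that do not contain a nested NP."""
    if isinstance(node, str):
        return []
    label, children = node
    if label == "NP" and not any(has_label(c, "NP") for c in children):
        return [leaves(node)]
    return [np for c in children for np in base_nps(c)]


# Hypothetical tree for "the cat on the mat slept".
tree = ("S", [
    ("NP", [("NP", ["the", "cat"]),
            ("PP", ["on", ("NP", ["the", "mat"])])]),
    ("VP", ["slept"]),
])
```

In this example the outer NP is excluded from both sets because it contains nested constituents, and the one-word VP is excluded from the constituent chunks because it is not multiword.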
Method of Unsupervised Chunking
• BIO tagging, plus a STOP tag for sentence boundaries and phrasal punctuation
• Models:
  • HMM
  • PRLG (probabilistic right-linear grammar)
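The tagging scheme above can be illustrated with a small encoder that turns chunk spans into a BIO sequence with STOP markers. This is a sketch under assumptions: the punctuation set, the `#` boundary symbol, and the half-open span format are choices made here, not details taken from the slides.

```python
# Assumed set of phrasal punctuation; the actual inventory may differ.
PHRASAL_PUNCT = {",", ";", ":", ".", "?", "!"}
BOUNDARY = "#"  # hypothetical sentence-boundary symbol


def bio_encode(tokens, chunks):
    """Pair each token with B, I, O or STOP given half-open chunk spans."""
    tags = ["O"] * len(tokens)
    for start, end in chunks:
        tags[start] = "B"              # first word of a chunk
        for i in range(start + 1, end):
            tags[i] = "I"              # remaining words of the chunk
    for i, tok in enumerate(tokens):
        if tok in PHRASAL_PUNCT:
            tags[i] = "STOP"
    # Pad with boundary symbols, which are also tagged STOP.
    padded = [BOUNDARY] + list(tokens) + [BOUNDARY]
    return list(zip(padded, ["STOP"] + tags + ["STOP"]))


tokens = ["the", "cat", "sat", ",", "then", "the", "dog", "ran"]
chunks = [(0, 2), (5, 7)]
tagged = bio_encode(tokens, chunks)
tag_sequence = [tag for _, tag in tagged]
```

In the unsupervised setting the tags are hidden states to be inferred rather than given; the encoder only shows what a tag sequence for a known chunking looks like.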
Finite States
• State transitions
• Uniform initialization
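One way to read "uniform initialization" is that every permitted state transition starts with equal probability. The sketch below builds such a table. The set of permitted transitions is an assumption derived from the multiword BIO scheme (B must be followed by I; I cannot follow O or STOP); the slide itself does not list the constraints.

```python
TAGS = ["STOP", "B", "I", "O"]

# Assumed constraints: chunks are multiword, so B is always followed by I,
# and I can only continue a chunk that B or I has already opened.
ALLOWED = {
    "STOP": ["STOP", "B", "O"],
    "B": ["I"],
    "I": ["STOP", "B", "I", "O"],
    "O": ["STOP", "B", "O"],
}


def uniform_transitions(allowed, tags):
    """Give each permitted transition equal mass; forbidden ones get zero."""
    table = {}
    for src in tags:
        targets = allowed[src]
        table[src] = {
            dst: (1.0 / len(targets) if dst in targets else 0.0)
            for dst in tags
        }
    return table


transitions = uniform_transitions(ALLOWED, TAGS)
```

Zero entries stay zero under EM re-estimation, so the constraints encoded at initialization persist through training.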
Full parsing via cascaded chunking
• Pseudoword: the term in the chunk with the highest corpus frequency
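The pseudoword rule can be sketched as follows. The assumption here is that each chunk is collapsed to its pseudoword so that a further round of chunking can run over the shortened sequence; the function names, the toy corpus, and the tie-breaking (first word wins) are illustrative choices, not taken from the slides.

```python
from collections import Counter


def pseudoword(chunk, freq):
    """Pick the word in the chunk with the highest corpus frequency."""
    return max(chunk, key=lambda w: freq[w])  # ties go to the earliest word


def collapse(tokens, chunks, freq):
    """Replace each half-open chunk span with its pseudoword."""
    ends = {start: end for start, end in chunks}
    out, i = [], 0
    while i < len(tokens):
        if i in ends:
            out.append(pseudoword(tokens[i:ends[i]], freq))
            i = ends[i]
        else:
            out.append(tokens[i])
            i += 1
    return out


# Hypothetical two-sentence corpus used only to obtain frequencies.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
]
freq = Counter(w for sentence in corpus for w in sentence)

# One cascade step on the first sentence, with two predicted chunks.
next_level = collapse(corpus[0], [(0, 2), (4, 6)], freq)
```

Repeating chunk-then-collapse until no new chunks are found would yield nested brackets, with each level's chunks becoming constituents of the final tree.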
Full Parsing Results
• No length limit
• Sentences of at most 10 words