Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models

Elias Ponvert, Jason Baldridge and Katrin Erk • The University of Texas at Austin

Introduction • Grammar induction • Based on gold-standard POS tags • Foundational model: Constituent Context Model (CCM) • Based on raw text • Common Cover Links parser (CCL) • This paper: cascaded chunking.

Motivation of this paper • CCL depends heavily on low-level constituents: • Simply extracting the non-hierarchical multiword constituents from CCL's output and putting a right-branching structure over them works better than CCL's own higher-level predictions. • Suggestion: improvements to low-level constituent prediction will ultimately lead to further gains in overall constituent parsing.
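The right-branching baseline described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `right_branching`, the half-open `(start, end)` span format, and the nested-tuple tree representation are all assumptions made for the example. Chunks are kept flat, and every remaining word attaches to the structure on its right.

```python
def right_branching(tokens, chunks):
    """Build a right-branching tree over flat chunks.

    tokens: list of words.
    chunks: list of (start, end) half-open spans (assumed format),
            non-overlapping, each covering a multiword chunk.
    Returns a nested tuple; each chunk is a flat tuple of its words.
    """
    starts = dict(chunks)  # chunk start index -> chunk end index
    items, i = [], 0
    while i < len(tokens):
        if i in starts:
            # Keep the chunk as one flat (non-hierarchical) constituent.
            items.append(tuple(tokens[i:starts[i]]))
            i = starts[i]
        else:
            items.append(tokens[i])
            i += 1
    if not items:
        return ()
    # Fold from the right: each item becomes the left sibling of
    # everything that follows it.
    tree = items[-1]
    for item in reversed(items[:-1]):
        tree = (item, tree)
    return tree


example_tree = right_branching(
    ["the", "cat", "sat", "on", "the", "mat"], [(0, 2), (4, 6)]
)
print(example_tree)
```

Here the two chunks stay flat while "sat" and "on" each open a new right branch, which is the shape the slide credits with beating CCL's own higher-level brackets.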

Two Investigations • Unsupervised partial parsing, or unsupervised chunking • Full parsing via cascaded chunking (explained later)

Data for Unsupervised Chunking • Two kinds of target chunks: • Constituent chunks: multiword and non-hierarchical (they contain no sub-constituents) • Base NPs: NPs that contain no nested NPs

Method of Unsupervised Chunking • BIO tagging, plus a STOP tag for sentence boundaries and phrasal punctuation. • Models: • HMM • PRLG (probabilistic right-linear grammar)
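To make the tag scheme concrete, the sketch below converts chunk spans into a B/I/O/STOP tag sequence. It is only an illustration under stated assumptions: the punctuation set `STOP_TOKENS`, the function name `bio_tags`, and the half-open span format are invented for the example and are not taken from the slides.

```python
# Assumed set of phrasal punctuation; the slides do not list the symbols.
STOP_TOKENS = {".", ",", ";", ":", "?", "!"}


def bio_tags(tokens, chunks):
    """Tag each token as B (chunk start), I (inside), O (outside) or STOP.

    chunks: list of (start, end) half-open spans (assumed format).
    """
    tags = ["O"] * len(tokens)
    for start, end in chunks:
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    # Punctuation acts as a hard boundary and gets its own tag.
    for i, tok in enumerate(tokens):
        if tok in STOP_TOKENS:
            tags[i] = "STOP"
    return tags


example_tokens = ["the", "cat", "sat", ",", "on", "the", "mat", "."]
example_tags = bio_tags(example_tokens, [(0, 2), (5, 7)])
print(list(zip(example_tokens, example_tags)))
```

In the unsupervised setting these tags are the hidden states to be inferred, not given labels; the conversion only shows which tag sequence corresponds to which chunking.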

Finite States • State transitions • Uniform initialization
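A small sketch of what constrained state transitions with uniform initialization could look like. The `ALLOWED` table is an assumption chosen so that every chunk is multiword (B must be followed by I, and I can only follow B or I); the slides do not spell out the exact constraint set, and the names here are hypothetical.

```python
TAGS = ["B", "I", "O", "STOP"]

# Assumed transition constraints enforcing multiword chunks:
# a chunk must continue after B, and I cannot follow O or STOP.
ALLOWED = {
    "B": ["I"],
    "I": ["B", "I", "O", "STOP"],
    "O": ["B", "O", "STOP"],
    "STOP": ["B", "O", "STOP"],
}


def uniform_transitions(allowed):
    """Spread probability mass evenly over the permitted next states."""
    return {
        src: {dst: 1.0 / len(dsts) for dst in dsts}
        for src, dsts in allowed.items()
    }


trans = uniform_transitions(ALLOWED)
for src in TAGS:
    print(src, trans[src])
```

Disallowed transitions simply have no entry, so training that starts from this table (for example with EM) never assigns them probability.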

Chunking Results

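Full parsing via cascaded chunking • Each chunk found at one level is replaced by a single pseudoword, and the chunker is run again on the shortened sentence. • Pseudoword: the term in the chunk with the highest corpus frequency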
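The pseudoword substitution and the cascade loop can be sketched as below. This is a simplified illustration, not the authors' implementation: `pseudoword`, `collapse`, `cascade`, and the `chunker` callback are hypothetical names, ties in frequency are broken by first occurrence (an assumption), and any per-level model training is left entirely to the `chunker` callback.

```python
from collections import Counter


def pseudoword(chunk, freq):
    """Pick the chunk's term with the highest corpus frequency.

    Ties go to the earliest term in the chunk (an assumption).
    """
    return max(chunk, key=lambda w: freq[w])


def collapse(tokens, chunks, freq):
    """Replace every chunk span (start, end) by its pseudoword."""
    starts = dict(chunks)
    out, i = [], 0
    while i < len(tokens):
        if i in starts:
            out.append(pseudoword(tokens[i:starts[i]], freq))
            i = starts[i]
        else:
            out.append(tokens[i])
            i += 1
    return out


def cascade(tokens, chunker, freq, max_levels=10):
    """Chunk, collapse, and repeat until no chunks are found."""
    levels = [tokens]
    for _ in range(max_levels):
        chunks = chunker(levels[-1])
        if not chunks:
            break
        levels.append(collapse(levels[-1], chunks, freq))
    return levels


corpus = [["the", "cat", "sat", "on", "the", "mat"]]
freq = Counter(w for sent in corpus for w in sent)
collapsed = collapse(corpus[0], [(0, 2), (4, 6)], freq)
print(collapsed)

# Toy stand-in for a trained chunker: always chunks the first two tokens.
toy_chunker = lambda toks: [(0, 2)] if len(toks) > 2 else []
levels = cascade(corpus[0], toy_chunker, freq)
print(levels)
```

Stacking the chunks found at each level yields a hierarchical bracketing, which is how a sequence of flat chunkers can produce a full parse.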

Full Parsing Results • No length limit • <=10 words
