Language and Speech Technology: Parsing

Language and Speech Technology: Parsing Jan Odijk January 2011 LOT Winter School 2011

Overview • Grammars & Grammar Types • Parsing • Naïve Parsing • Earley Parser • Example (using handouts) • Earley Parser Extensions • Parsers & CLARIN

Grammars • Grammar G = (VT, VN, P, S) where • VT terminal vocabulary • VN nonterminal vocabulary • P set of rules α→β (lhs → rhs) • αЄ VN+ • βЄ (VN U VT)* • S Є VN (start symbol)

Grammars • Example Grammar G = (VT, VN, P, S) with • VT = {the, a, garden, book, in,} • VN = {NP, Det, N, P, PP} • P = {PP→P NP, NP→Det N, Det→the, Det→a, N→garden, N→book, P→in } • S = PP

Example Derivation • PP (start symbol) • P NP (PP →P NP) • in NP (P → in) • in Det N (NP →Det N) • in the N (Det → the) • in the garden ( N → garden)

Grammar Types • Finite State Grammars (Type 3) • A → aA, A → a. A Є VN, a Є VT • Too weak to deal with natural language in toto • Efficient processing techniques • Often used for applications where partial analyses of natural language are sufficient • Often used for morphology / phonology

Grammar Types • Context-Free Grammars (CFG, Type 2) • A →β. A Є VN • To weak to deal with natural language • Surely for strong generative adequacy • Also for weak generative adequacy • Reasonably efficient processing techniques • Generally taken as a basis for dealing with natural language, extended with other techniques

Grammar Types • Context-Sensitive Grammars (Type 1) • α→β, |α| <= |β| • Usually not considered in the context of NLP • Type-0 grammars • No restrictions • Usually not considered except in combination with CFG

Parsing • Parsing • Is an algorithm • It must finish! • For assigning syntactic structures • Ambiguity! • To a sequence of terminal symbols • In accordance with a given grammar • (If possible, efficient)

Parsing for CFGs • Focus here on • Parser for CFGs • for natural language • More specifically: Earley parser • Why? • Most NLP systems with a grammar use a parser for CFG as a basis • Basic techniques will also recur in parsers for different grammar types

Naïve Parsing • see handout • Problems for naïve parsing • A lot of re-parsing of subtrees • Bottom-up • Wastes time and space on trees that cannot lead to S • Top-down • Wastes time and space on trees that cannot match input string

Naïve parsing • Top-down • Recursion problem • Can be solved for right-recursion by matching with input tokens, but • Problem with left recursion remains: • NP → NP PP • Ambiguity • Temporary ambiguity • Real ambiguity

Naïve parsing • Naïve Parsing Complexity • Time needed to parse is exponential: • cn (c a constant, length input tokens) • (in the worst case) • Takes too much time • Is not practically feasible

Earley Parser • Top-down approach but • Predictor avoids wasting time and space on irrelevant trees • Does not build actual structures, but stores enough information to reconstruct structures • Uses dynamic programming technique to avoid recomputation of subtrees • Avoids problems with left recursion • Makes complexity cubic: n3

Earley Parser • Number positions in input string (0 .. N) • 0 book 1 that 2 flight 3 • Notation [i,j] stands for the string from position i to position j • [0,1] = “book” • [1,3] = “that flight” • [2,2]= “”

Earley Parser • Dotted Rules • is a grammar rule + indication of progress • ie. Which elements of the rhs have been seen yet and which ones not yet • Indicated by a dot (we use an asterisk) • Example • S → Aux NP * VP • Aux and NP have been dealt with but VP not yet

Earley Parser • Input: • Sequence of N words (words[1..N]), and • grammar • Output: • a Store = (agenda, chart) • (sometimes chart = N+1 chart entries: chart[0 .. N])

Earley Parser • Agenda, chart: sets of states • A state consists of • Dotted rule • Span relative to the input: [i,j] • Previous states: list of state identifiers • And gets a unique identifier • Example • S11: VP → V’ * NP; [0,1]; [S8]

Earley Parser • State • Is complete • iff dot is the last element in the dotted rule • E.g. state with VP → Verb NP * is complete • NextCat (state) • Only applies if state is not complete • Is the category immediately following the dot • VP → Verb * NP : NextCat(state)= NP

Earley Parser • 3 operations on states, • Predictor • Predicts which categories to expect • Scanner • if a terminal category C is expected, and a word of category C is encountered in this position, • Consumes the word and shifts the dot • Completer • Applies to a complete state s, and modifies all states that gave rise to this state

Earley Parser • Predictor • Applies to an incomplete state • ( A → α * B β, [i,j], _) • B is a nonterminal • For each (B → γ) in grammar • Make a new state s = (B → * γ, [j,j], []) • enqueue(s , store) • Enqueue (s,ce) = add s to ce unless ce already contains s

Earley Parser • Scanner • Applies to an incomplete state • ( A → α * b β, [i,j], _) • b is a terminal • Make a new state s = (b → words[j] * , [j,j+1], []) • enqueue(s , store)

Earley Parser • Completer • Applies to an complete state • ( B → γ *, [j,k], L1) • For each (A → α * B β, [i,j], L2) in chart[j] • Make new state s = (A → α B * β, [i,k], L2 ++ L1) • enqueue(s , store)

Earley Parser • Store = (agenda, chart) • Apply operations on states in the agenda until the agenda is empty • When applying an operation to a state s in the agenda • Move the state s from the agenda into the chart • Add the resulting states of the operation to the agenda

Earley Parser • Initial store = ([Г → *S], emptychart) • Where Г is a ‘fresh’ nonterminal start symbol • Input sentence accepted • Iff there is a state (Г → S *, [0,N], LS) in the chart and the agenda is empty • Parse tree(s) can be reconstructed via the list of earlier states (LS)

Earley Parser Extensions • Replace elements of V by feature sets (attribute-value matrices, AVMs) • Harmless if finitely valued • E.g. instead of NP [cat=N, bar=max, case=Nom] • Usually other relation than ‘=‘ used for comparison • E.g. ‘is compatible with’, ‘unifies with’, ‘subsumes’

Earley Parser Extensions • Replace rhs of rules by regular expressions over V (or AVMs) • E.g. VP → V NP? (AP | PP)* abbreviates • VP → V, VP → V NP, VP → V APorPP, VP → V NP APorPP, • APorPP → AP APorPP, APorPP → PP APorPP, APorPP → AP, APorPP → PP • Where APorPP is a ‘fresh’ virtual nonterminal • Virtual : is discarded when constructing the trees

Earley Parser Extensions • My grammatical formalism has no PS rules! • But only ‘lexical projection’ of syntactic selection properties (subcategorization list) • E.g. buy: [cat=V, subcat = [_ NP PP, _ NP]] •  create PS rules on the fly • If buy occurs in the input tokens, create rules • VP → buy NP PP and VP → buy NP • From the lexical entry • And use these rules to parse

Earley Parser Extensions • My grammar contains ε-rules: • NP → ε • Where ε stands for the empty string • (i.e. NP matches the empty string in the input token list) • Earley parser can deal with these! • But extensive use creates many ambiguities!

Earley Parser Extensions • My grammar contains empty categories • Independent • PRO as subject of non-finite verbs • PRO buying books is fun • pro as subject of finite verbs in pro-drop languages • pro no hablo Español • Pro as subject of imperatives • pro schaam je! • Epsilon rules can be used or represent this at other level

Earley Parser Extensions • My grammar contains empty categories • Dependent • trace of wh-movement • What did you buy t • Trace of Verb movement (e.g V2 in Dutch, German, Aux movement in English • Hij belt hem op t • Did you t buy a book? • Epsilon rules are not sufficient

Earley Parser Extensions • Other types (levels) of representation • LFG: (c-structure, f-structure) • HPSG: DAGs (special type of AVMs) • (constituent structure, semantic representation) • Use CFG as backbone grammar • Which accepts a superset of the language • For each rule specify how to construct other level of representation • Extend Earley parser to deal with this

Earley Parser Extensions • Other types (levels) of representation • f-structure, DAGs, semantic representations are not finitely valued • Thus it will affect efficiency • But allows dealing with e.g. • Non-context-free aspects of a language • Unbounded dependencies (e.g. by ‘gap-threading’)

Earley Parser in Practice • Parsers for natural language yield • Many many parse trees for an input sentence • Many more than you can imagine (thousands) • Even for relatively short, simple sentences • They are all syntactically correct • But make no sense semantically

Earley Parser in Practice • Additional constraining is required • To reduce the temporary ambiguities • To come up with the ‘best’ parse • Can be done by semantic constraints • But only feasible for very small domains • Is most often done using probabilities • Rule probabilities derived from frequencies in treebanks

Parsers: Some Examples • Dutch: Alpino parser • Stanford parsers • English, Arabic, Chinese • English: ACL Overview

Parsers & CLARIN • Parser allows one to automatically analyze large text corpora • Resulting in treebanks • Can be used for linguistic research • But with care!! • Example: Lassy Demo (Dutch) • Simple search interface to LASSY-small Treebank • Use an SVG compatible browser (e.g. Firefox)

Parsers & CLARIN • Example of linguistic research using a treebank: • Van Eynde 2009: A treebank-driven investigation of predicative complements in Dutch

Thanks for your attention!

Language and Speech Technology: Parsing