Corpora and Statistical Methods, Lecture 11. Albert Gatt
Part 2: Statistical parsing
Preliminary issues: how parsers are evaluated
Evaluation • The issue: what objective criterion are we trying to maximise? • i.e. under what objective function can I say that my parser does “well” (and how well)? • this requires a gold standard • Possibilities: • strict match of the candidate parse against the gold standard • match of components of the candidate parse against gold standard components
Evaluation • The classic evaluation metrics are the PARSEVAL measures: • an initiative to compare parsers on the same data • not initially concerned with stochastic parsers • evaluates parser output piece by piece • Main components: • compares the gold standard tree (typically the tree in a treebank) to the parser’s tree • computes: • precision • recall • crossing brackets
PARSEVAL: labelled recall • A correct node is a node in the candidate parse which: • has the same node label (labels were originally omitted from PARSEVAL to avoid theoretical conflicts) • spans the same words as a node in the gold standard tree • Labelled recall is the proportion of gold standard nodes that are correctly reproduced in the candidate parse.
PARSEVAL: labelled precision • The proportion of correctly labelled and correctly spanning nodes in the candidate parse.
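In symbols (standard definitions, with “correct” nodes as defined above):
\[ \text{labelled recall} = \frac{\#\ \text{correct nodes in candidate parse}}{\#\ \text{nodes in gold standard parse}} \qquad \text{labelled precision} = \frac{\#\ \text{correct nodes in candidate parse}}{\#\ \text{nodes in candidate parse}} \]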
Combining Precision and Recall • As usual, Precision and recall can be combined into a single F-measure:
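For the balanced case this is the harmonic mean of precision P and recall R; the general form weights recall against precision with a parameter β:
\[ F_\beta = \frac{(\beta^2 + 1)\,P\,R}{\beta^2\,P + R}, \qquad F_1 = \frac{2\,P\,R}{P + R} \]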
PARSEVAL: crossed brackets • number of brackets in the candidate parse which cross brackets in the treebank parse • e.g. treebank has ((X Y) Z) and candidate has (X (Y Z)) • Unlike precision/recall, this is an objective function to minimise
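A small illustration of the crossing-brackets count, representing each constituent as a (start, end) word span; the helper below and the span encoding are my own:

def crossing_brackets(candidate_spans, gold_spans):
    # Two spans cross if they overlap but neither contains the other.
    def crosses(a, b):
        return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]
    # Count candidate constituents that cross at least one gold constituent.
    return sum(1 for c in candidate_spans if any(crosses(c, g) for g in gold_spans))

# Treebank ((X Y) Z) has spans {(0,2), (0,3)}; candidate (X (Y Z)) has {(1,3), (0,3)}.
# The candidate span (1,3) crosses the gold span (0,2), so the count is 1.
print(crossing_brackets([(1, 3), (0, 3)], [(0, 2), (0, 3)]))  # -> 1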
Current performance • Current parsers achieve: • ca. 90% precision • >90% recall • 1% cross-bracketed constituents
Some issues with PARSEVAL • These measures evaluate parses at the level of individual decisions (nodes): • they ignore the difficulty of getting a globally correct solution by carrying out a correct sequence of decisions • Success on crossing brackets depends on the kind of parse trees used: • the Penn Treebank has very flat trees (little embedding), so the likelihood of crossed brackets decreases • In PARSEVAL, if a constituent is attached lower in the tree than in the gold standard, all its daughters are counted as wrong.
Probabilistic parsing with PCFGs The basic algorithm
The basic PCFG parsing algorithm • Many statistical parsers use a version of the CYK algorithm. • Assumptions: • CFG productions are in Chomsky Normal Form: • A → B C • A → a • Use indices between words: • Book the flight through Houston • (0) Book (1) the (2) flight (3) through (4) Houston (5) • Procedure (bottom-up): • traverse the input sentence left-to-right • use a chart to store constituents, their span, and their probability.
Probabilistic CYK: example PCFG • S → NP VP [.80] • NP → Det N [.30] • VP → V NP [.20] • V → includes [.05] • Det → the [.40] • Det → a [.40] • N → meal [.01] • N → flight [.02]
Probabilistic CYK: initialisation
//Lexical lookup:
for j = 1 to length(string) do:
    chart[j-1, j] := {X : X → word_j in G}
    //Syntactic lookup:
    for i = j-2 downto 0 do:
        chart[i, j] := {}
        for k = i+1 to j-1 do:
            for each A → B C in G do:
                if B in chart[i, k] and C in chart[k, j]:
                    chart[i, j] := chart[i, j] ∪ {A}
• Example sentence: The flight includes a meal.
Probabilistic CYK: worked example (lexical and syntactic steps) • For The flight includes a meal, the chart is filled in stages: • lexical steps fill the single-word cells: Det (the), N (flight), V (includes), Det (a), N (meal) • syntactic steps then combine adjacent spans bottom-up: NP over the flight, NP over a meal, VP over includes a meal, and finally S over the whole sentence, each with its probability.
Probabilistic CYK: summary • Cells in chart hold probabilities • Bottom-up procedure computes probability of a parse incrementally. • To obtain parse trees, cells need to be augmented with backpointers.
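As a concrete illustration, here is a minimal Python sketch of this procedure on the toy grammar above, storing in each chart cell the best probability for each non-terminal together with a backpointer (the data structures and function names are my own, not from the lecture):

from collections import defaultdict

# Toy PCFG in Chomsky Normal Form, taken from the example grammar above.
# Binary rules are keyed on their RHS; lexical rules on the word.
BINARY = {
    ("NP", "VP"): [("S", 0.80)],
    ("Det", "N"): [("NP", 0.30)],
    ("V", "NP"):  [("VP", 0.20)],
}
LEXICAL = {
    "includes": [("V", 0.05)],
    "the":      [("Det", 0.40)],
    "a":        [("Det", 0.40)],
    "meal":     [("N", 0.01)],
    "flight":   [("N", 0.02)],
}

def cyk_parse(words):
    # chart[(i, j)] maps a non-terminal to its best (probability, backpointer)
    # over the span of words i..j (0-based, j exclusive).
    chart = defaultdict(dict)
    n = len(words)

    # Lexical step: fill the cells spanning a single word.
    for j, w in enumerate(words, start=1):
        for A, p in LEXICAL.get(w, []):
            chart[(j - 1, j)][A] = (p, w)

    # Syntactic step: combine adjacent spans bottom-up.
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for B, (pb, _) in chart[(i, k)].items():
                    for C, (pc, _) in chart[(k, j)].items():
                        for A, p_rule in BINARY.get((B, C), []):
                            p = p_rule * pb * pc
                            if p > chart[(i, j)].get(A, (0.0, None))[0]:
                                chart[(i, j)][A] = (p, (k, B, C))
    return chart

def build_tree(chart, A, i, j):
    # Follow backpointers to recover the best tree for A over span (i, j).
    _, bp = chart[(i, j)][A]
    if isinstance(bp, str):          # lexical entry: bp is the word itself
        return (A, bp)
    k, B, C = bp
    return (A, build_tree(chart, B, i, k), build_tree(chart, C, k, j))

words = "the flight includes a meal".split()
chart = cyk_parse(words)
prob, _ = chart[(0, len(words))]["S"]
print(prob)                                   # ~2.3e-08 with this toy grammar
print(build_tree(chart, "S", 0, len(words)))

Keeping only the best (Viterbi) entry per non-terminal and cell is what makes recovering the most probable parse via backpointers straightforward.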
Probabilistic parsing with lexicalised PCFGs • Main approaches, with a focus on Collins (1997, 1999) • see also: Charniak (1997)
Unlexicalised PCFG estimation • Charniak (1996) used Penn Treebank POS tags and phrasal categories to induce a maximum likelihood PCFG: • only used the relative frequency of local trees as the estimates for rule probabilities • did not apply smoothing or any other techniques • Works surprisingly well: • 80.4% recall; 78.8% precision (crossed brackets not estimated) • Suggests that most parsing decisions are mundane and can be handled well by an unlexicalised PCFG.
Probabilistic lexicalised PCFGs • Standard format of lexicalised rules: • associate the head word with each non-terminal, e.g. for a VP headed by dumped: VP(dumped) → VBD(dumped) NP(sacks) PP(into) • associate the head tag with each non-terminal as well: VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN) • Types of rules: • lexical rules expand pre-terminals to words, e.g. NNS(sacks,NNS) → sacks; their probability is always 1 • internal rules expand non-terminals, e.g. VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN)
Estimating probabilities • Non-generative model: • take an MLE estimate of the probability of an entire rule • non-generative models suffer from serious data sparseness problems • Generative model: • estimate the probability of a rule by breaking it up into sub-rules.
Collins Model 1 • Main idea: represent CFG rules as expansions into Head + left modifiers + right modifiers • each Li/Ri is of the form L(word,tag) or R(word,tag), e.g. NP(sacks,NNS) • STOP: a special symbol marking the left/right boundary • Parsing: given the LHS, generate the head of the rule, then the left modifiers (until STOP) and the right modifiers (until STOP), working outwards from the head • Each step has a probability.
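Schematically, this gives the Model 1 decomposition (before the distance features introduced below), where P is the parent non-terminal, h its head (word, tag) pair, and L_{n+1} = R_{m+1} = STOP:
\[ P(\mathit{RHS} \mid \mathit{LHS}) = P_H(H \mid P, h) \times \prod_{i=1}^{n+1} P_L(L_i \mid P, H, h) \times \prod_{j=1}^{m+1} P_R(R_j \mid P, H, h) \]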
Collins Model 1: example • VP(dumped,VBD) → VBD(dumped,VBD) NP(sacks,NNS) PP(into,IN) • (1) Head H(hw,ht): generate VBD(dumped,VBD) • (2) Left modifiers: generate STOP (there are none) • (3) Right modifiers: generate NP(sacks,NNS), PP(into,IN), then STOP • Total probability: the product of (1)–(3)
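Spelling out the product for this rule, conditioning throughout on the parent VP and its head (dumped, VBD):
\[ \begin{aligned} P \approx\; & P_H(\mathrm{VBD} \mid \mathrm{VP}, \text{dumped}, \mathrm{VBD}) \\ \times\; & P_L(\mathit{STOP} \mid \mathrm{VP}, \mathrm{VBD}, \text{dumped}) \\ \times\; & P_R(\mathrm{NP}(\text{sacks},\mathrm{NNS}) \mid \mathrm{VP}, \mathrm{VBD}, \text{dumped}) \\ \times\; & P_R(\mathrm{PP}(\text{into},\mathrm{IN}) \mid \mathrm{VP}, \mathrm{VBD}, \text{dumped}) \\ \times\; & P_R(\mathit{STOP} \mid \mathrm{VP}, \mathrm{VBD}, \text{dumped}) \end{aligned} \]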
Variations on Model 1: distance • Collins proposed extending the model by conditioning each modifier on its distance from the head: • the distance is a function of the yield of the modifiers already generated, e.g. the distance used for the R2 probability is the string of words spanned by R1.
Using a distance function • The simplest kind of distance function is a tuple of binary features: • is the string of length 0? • does the string contain a verb? • … • Example uses: • if the string has length 0, P_R should be higher: English is right-branching and most right modifiers are adjacent to the head verb • if the string contains a verb, P_R should be higher: this accounts for the preference to attach dependencies to the main verb
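A toy sketch of such a feature tuple, computed over the (word, tag) pairs already generated between the head and the current modifier (a simplification for illustration, not Collins' exact definition):

def distance_features(intervening):
    # intervening: list of (word, tag) pairs between the head and the modifier
    return (
        len(intervening) == 0,                                # is the string of length 0?
        any(tag.startswith("VB") for _, tag in intervening),  # does it contain a verb?
    )

print(distance_features([]))                                  # (True, False): adjacent to the head
print(distance_features([("sacks", "NNS"), ("into", "IN")]))  # (False, False)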
Further additions • Collins Model 2: • subcategorisation preferences • a distinction between complements and adjuncts • Model 3 is further augmented to deal with long-distance (wh-) dependencies.
Smoothing and backoff • Rules may condition on words that never occur in the training data. • Collins used a 3-level backoff model, with the levels combined using linear interpolation: • (1) use the head word • (2) use the head tag • (3) use the parent non-terminal only
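Roughly (the exact λ estimation varies across Collins' models), the three estimates e_1 (head word), e_2 (head tag) and e_3 (parent only) are chained as:
\[ \tilde{e} = \lambda_1 e_1 + (1 - \lambda_1)\bigl(\lambda_2 e_2 + (1 - \lambda_2)\, e_3\bigr) \]
where the λs depend on how often each conditioning context was seen in training.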
Data-oriented parsing • An alternative to “grammar-based” models: • does not attempt to derive a grammar from a treebank • instead, the treebank data is stored as fragments of trees • the parser uses whichever tree fragments seem useful
Data-oriented parsing • Suppose we want to parse Sue heard Jim. • The corpus contains potentially useful tree fragments, which the parser can combine to give a complete parse of the sentence.
Data-oriented parsing • A single tree can have multiple, fundamentally distinct derivations. • Parsing uses Monte Carlo simulation methods: • randomly produce a large sample of derivations • use the sample to find the most probable parse • disadvantage: very large samples are needed for accurate parses, so parsing is potentially slow
Data-oriented parsing vs. PCFGs • Possible advantages: • using partial trees directly accounts for lexical dependencies • also accounts for multi-word expressions and idioms (e.g. take advantage of) • while PCFG rules only represent trees of depth 1, DOP fragments can represent trees of arbitrary depth • Similarities to PCFGs: • tree fragments of depth 1 are equivalent to PCFG rules • probabilities are estimated for tree fragments in the same way as for grammar rules
History-Based Grammars (HBG) • General idea: any derivational step can be influenced by any earlier derivational step (Black et al. 1993) • the probability of expanding the current node is conditioned on all previous nodes along the path from the root
History-Based Grammars (HBG) • Black et al. lexicalise their grammar: • every phrasal node inherits two words: • its lexical head H1 • a secondary head H2, deemed to be useful • e.g. the PP in the bank might have H1 = in and H2 = bank • Every non-terminal is also assigned: • a syntactic category (Syn), e.g. PP • a semantic category (Sem), e.g. with-Data • An index I indicates which child of the parent node is being expanded
History Based Grammars (HBG) • Estimation of the probability of a rule R: • probability of: • the current rule R to be applied • its Syn and Sem category • its heads H1 and H2 • conditioned on: • Syn and Sem of parent node • the rule that gave rise to the parent • the index of this child relative to the parent • the heads H1 and H2 of the parent
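Written as a single conditional distribution (a schematic rendering of the Black et al. formulation; the subscript p marks the parent node and I_{pc} the index of this child relative to the parent):
\[ P\bigl(\mathit{Syn}, \mathit{Sem}, R, H_1, H_2 \;\mid\; \mathit{Syn}_p, \mathit{Sem}_p, R_p, I_{pc}, H_{1p}, H_{2p}\bigr) \]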
Summary • This concludes our overview of statistical parsing • We’ve looked at three important models • Also considered basic search techniques and algorithms