Parsing
See:
• R Garside, G Leech & A McEnery (eds), Corpus Annotation, London: Longman (1997), chapters 11 (Bateman et al.) and 12 (Garside & Rayson)
• G Kennedy, An Introduction to Corpus Linguistics, London: Longman (1998), pp. 231-244
• C F Meyer, English Corpus Linguistics, Cambridge: CUP (2002), pp. 91-96
• R Mitkov (ed), The Oxford Handbook of Computational Linguistics, Oxford: OUP (2003), chapter 4 (Kaplan)
• J Allen, Natural Language Understanding (2nd ed), Addison Wesley (1994)
Parsing
• POS tags give information about the individual words and their internal form (e.g. singular vs plural, tense of verb)
• An additional level of information concerns the way the words relate to each other:
  • the overall structure of each sentence
  • the relationships between the words
• This can be achieved by parsing the corpus
Parsing – overview
• What sort of information does parsing add?
• What are the difficulties relating to parsing?
• How is parsing done?
• Parsing and corpora:
  • partial parsing, chunking
  • stochastic parsing
  • treebanks
Structural information
• Parsing adds information about sentence structure and constituents
• Allows us to see what constructions words enter into
  • e.g. transitivity, passivization, argument structure for verbs
• Allows us to see how words function relative to each other
  • e.g. what words can modify / be modified by other words
An example (the original slide shows the structure both as a tree diagram and as labelled bracketing over the POS-tagged words):

Nemo , the killer whale , who ’d grown too big for his pool on Clacton Pier , has arrived safely at his new home in Windsor safari park .

[S[N Nemo_NP1 ,_, [N the_AT killer_NN1 whale_NN1 N] ,_, [Fr[N who_PNQS N][V 'd_VHD grown_VVN [J too_RG big_JJ [P for_IF [N his_APP$ pool_NN1 [P on_II [N Clacton_NP1 Pier_NNL1 N]P]N]P]J]V]Fr]N] ,_, [V has_VHZ arrived_VVN safely_RR [P at_II [N his_APP$ new_JJ home_NN1 [P in_II [N Windsor_NP1 [ safari_NN1 park_NNL1 ]N]P]N]P]V] ._. S]
The same parse lets us ask questions such as:
• given this verb, what kinds of things can be subject?
• verb with adjective complement: what verbs can participate in this construction? with what adjectives? any other constraints?
• verb with PP complement: what verbs with what prepositions? any constraints on the noun?
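A rough sketch of answering the first of these questions against a parsed corpus, using the Penn-style bracketed trees in NLTK's small treebank sample (note that this bracketing scheme differs from the Lancaster-style one shown above). The function name subjects_of and the structural heuristic (take the NP sister immediately preceding a VP containing the verb) are my own illustrative assumptions, not a standard tool:

```python
import nltk
from nltk.corpus import treebank   # 10% Penn Treebank sample; run nltk.download('treebank') once

def subjects_of(verb):
    """Collect the NP immediately preceding a VP that contains `verb`, within S clauses --
    a crude stand-in for "given this verb, what kinds of things can be subject?"."""
    hits = []
    for tree in treebank.parsed_sents():
        for s in tree.subtrees(lambda t: t.label() == 'S'):
            kids = list(s)
            for i in range(1, len(kids)):
                if (kids[i].label().startswith('VP')
                        and verb in kids[i].leaves()
                        and kids[i - 1].label().startswith('NP')):
                    hits.append(' '.join(kids[i - 1].leaves()))
    return hits

print(subjects_of('rose')[:10])   # e.g. noun phrases appearing as subjects of "rose"
```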
Parsing: difficulties
• Besides lexical ambiguities (usually resolved by the tagger), language can be structurally ambiguous:
  • global ambiguities due to ambiguous words and/or alternative possible combinations
  • local ambiguities, especially due to attachment ambiguities, and other combinatorial possibilities
• Sheer weight of alternatives available in the absence of (much) knowledge
Global ambiguities
• Individual words can be ambiguous as to category
• In combination with each other this can lead to ambiguity:
  • Time flies like an arrow
  • Gas pump prices rose last time oil stocks fell
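To see the structural ambiguity concretely, here is a toy grammar of my own (not from the slides) in which "time" and "flies" can each be nouns and "flies" and "like" can each be verbs; NLTK's chart parser then finds two analyses of the first example:

```python
import nltk

# Toy grammar: "time" and "flies" may be nouns; "flies" and "like" may be verbs.
grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> N | N N | Det N
    VP -> V NP | V PP
    PP -> P NP
    Det -> 'an'
    N  -> 'time' | 'flies' | 'arrow'
    V  -> 'flies' | 'like'
    P  -> 'like'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("time flies like an arrow".split()):
    print(tree)   # one tree per reading: "flies" as verb vs "time flies" as noun-noun compound
```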
Local ambiguities
• Structure of individual constituents may be given, but how they fit together can be in doubt
• Classic example is PP attachment (how quickly the possibilities multiply is sketched below):
  • The man saw the girl with the telescope
  • The man saw the girl in the park with a statue of the general on a horse with a telescope in the morning on a stand with a red dress with a sword
• Many other attachments are potentially ambiguous:
  • relative clauses, adverbs, parentheticals, etc.
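How quickly such attachment possibilities multiply can be shown with a little arithmetic: the number of structurally distinct ways to attach a chain of n prepositional phrases grows with the Catalan numbers (the classic observation of Church & Patil), so even a modest sentence admits thousands of analyses. A minimal sketch:

```python
from math import comb

def catalan(n):
    """n-th Catalan number: the number of distinct binary bracketings of n + 1 items."""
    return comb(2 * n, n) // (n + 1)

# A verb phrase followed by n prepositional phrases has roughly catalan(n + 1)
# structurally distinct attachment patterns: 2 for one PP, 5 for two, 14 for three, ...
for n in range(1, 9):
    print(f"{n} PPs -> {catalan(n + 1)} analyses")
```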
Difficulties
• Broad coverage is necessary for parsing corpora of real text
• Long sentences:
  • structures are very complex
  • ambiguities proliferate
• It is difficult (even for a human) to verify whether a parse is correct:
  • because it is complex
  • because it may be genuinely ambiguous
How to parse
• Traditionally (in linguistics):
  • hand-written grammar
  • usually narrow coverage
  • linguists are interested in theoretical issues regarding syntax
• Even in computational linguistics:
  • interest is (was?) in parsing algorithms
• In either case, grammars typically used a small set of categories (N, V, Adj, etc.)
Lack of knowledge
• Humans are very good at disambiguating
  • in fact, they rarely even notice the ambiguity
  • usually, only one reading "makes sense"
• They use a combination of:
  • linguistic knowledge
  • common-sense (real-world) knowledge
  • contextual knowledge
• Only the first is available to computers, and then only in a limited way
Parsing corpora
• Using a tagger as a front-end changes things:
  • richer set of grammatical categories which reflect some morphological information
  • hand-written grammars are more difficult, though, because many generalisations are lost (e.g. we now need many more rules for NP)
  • disambiguation done by the tagger in some sense pre-empts work that you might have expected the parser to do
Parsing corpora
• Impact of the broad-coverage requirement:
  • broad coverage means that many more constructions are covered by the grammar
  • this increases ambiguity massively
  • partial parsing may be sufficient for some needs
• Availability of corpora permits (and encourages) a stochastic approach
Partial parsing
• Identification of constituents (noun phrases, verb groups, PPs) is often quite robust …
• … only fitting them together can be difficult
• Although some information is lost, identifying "chunks" can be useful (a minimal chunking sketch follows)
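A minimal chunking sketch using NLTK's RegexpParser over POS-tagged input. The chunk patterns and the toy tagged sentence are my own illustrative assumptions, not the scheme used in any of the corpora discussed here:

```python
import nltk

# Cascaded chunk grammar: build NP chunks first, then PP and VP chunks on top of them.
chunker = nltk.RegexpParser(r"""
    NP: {<DT|PP\$>?<JJ>*<NN.*>+}   # optional determiner/possessive, adjectives, noun(s)
    PP: {<IN><NP>}                 # preposition followed by an NP chunk
    VP: {<MD>?<VB.*>+}             # optional modal plus verb group
""")

tagged = [("the", "DT"), ("killer", "NN"), ("whale", "NN"),
          ("has", "VBZ"), ("arrived", "VBN"),
          ("at", "IN"), ("its", "PP$"), ("new", "JJ"), ("home", "NN")]

print(chunker.parse(tagged))   # flat tree with NP, VP and PP chunks marked
```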
Stochastic parsing
• Like ordinary parsing, but competing rules are assigned a probability score
• Scores can be used to compare (and favour) alternative parses
• Where do the probabilities come from?

Example rule probabilities:
S  → NP VP          .80
S  → aux NP VP      .15
S  → VP             .05
NP → det n          .20
NP → det adj n      .35
NP → n              .20
NP → adj n          .15
NP → pro            .10
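A minimal sketch of this idea in NLTK: the S and NP probabilities below come from the table above, while the VP and lexical rules are invented purely so that the grammar is complete enough to run; the Viterbi parser then returns the most probable parse of a toy sentence:

```python
import nltk

# S and NP probabilities as in the table; VP and lexical rules are illustrative assumptions.
grammar = nltk.PCFG.fromstring("""
    S   -> NP VP      [0.80]
    S   -> Aux NP VP  [0.15]
    S   -> VP         [0.05]
    NP  -> Det N      [0.20]
    NP  -> Det Adj N  [0.35]
    NP  -> N          [0.20]
    NP  -> Adj N      [0.15]
    NP  -> Pro        [0.10]
    VP  -> V NP [0.6] | V [0.4]
    Det -> 'the' [1.0]
    Adj -> 'killer' [1.0]
    N   -> 'whale' [0.5] | 'home' [0.5]
    Pro -> 'it' [1.0]
    Aux -> 'has' [1.0]
    V   -> 'arrived' [1.0]
""")

parser = nltk.ViterbiParser(grammar)
for tree in parser.parse("the killer whale arrived".split()):
    print(tree)   # most probable parse, annotated with its probability
```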
Where do the probabilities come from?
• Use a corpus of already parsed sentences: a "treebank"
• Best known example is the Penn Treebank
  • Marcus et al. 1993
  • available from the Linguistic Data Consortium
  • based on the Brown corpus + 1m words of Wall Street Journal text + the Switchboard corpus
• Count all occurrences of each variant of a rule and divide by the total number of rules with the same left-hand side (e.g. each NP expansion divided by the total number of NP rules)
• Very laborious, so of course it is done automatically (sketched below)
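The count-and-divide step can be sketched as follows, using the 10% Penn Treebank sample distributed with NLTK as a stand-in for a full treebank; nltk.induce_pcfg performs the relative-frequency estimation:

```python
import nltk
from nltk.corpus import treebank   # 10% Penn Treebank sample; run nltk.download('treebank') once

# Collect every rule application (production) occurring in the parsed sentences.
productions = []
for tree in treebank.parsed_sents():
    productions += tree.productions()

# Relative-frequency estimation: count(A -> alpha) / count(all rules with left-hand side A).
grammar = nltk.induce_pcfg(nltk.Nonterminal('S'), productions)

# Inspect the estimated probabilities of some NP rules.
for production in grammar.productions(lhs=nltk.Nonterminal('NP'))[:5]:
    print(production)
```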
Where do the probabilities come from?
• Create your own treebank from your own corpus
• Easy if all sentences are unambiguous: just count the (successful) rule applications
• When there are ambiguities, rules which contribute to the ambiguity have to be counted separately and weighted
Where do the probabilities come from?
• Learn them as you go along
  • again, this assumes some way of identifying the correct parse in case of ambiguity
• Each time a rule is successfully used, its probability is adjusted
• You have to start with some estimated probabilities, e.g. all equal
• Does need human intervention, otherwise rules become self-fulfilling prophecies
Bootstrapping the grammar
• Start with a basic grammar, possibly written by hand, with all rules equally probable
• Parse a small amount of text, then correct it manually
  • this may involve correcting the trees and/or changing the grammar
• Learn new probabilities from this small treebank
• Parse another (similar) amount of text, then correct it manually
• Adjust the probabilities based on the old and new trees combined
• Repeat until the grammar stabilizes (a sketch of the loop follows)
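A hedged sketch of that loop, reusing the relative-frequency estimation from the earlier sketch; the batches of text and the correct_manually step (the human annotator fixing the parser's output) are placeholders of my own, not part of any toolkit:

```python
import nltk

def relearn_probabilities(corrected_trees):
    """Re-estimate rule probabilities from all manually corrected trees so far."""
    productions = [p for tree in corrected_trees for p in tree.productions()]
    return nltk.induce_pcfg(nltk.Nonterminal('S'), productions)

def bootstrap(grammar, batches, correct_manually):
    """Bootstrapping loop: parse a batch, have a human correct it, re-estimate, repeat.
    `batches` yields lists of tokenised sentences; `correct_manually` returns the
    corrected trees for a batch (both are assumed interfaces, for illustration only)."""
    treebank_so_far = []                              # grows round by round
    for batch in batches:
        parser = nltk.ViterbiParser(grammar)          # parser using the current probabilities
        proposed = [next(parser.parse(sentence), None) for sentence in batch]
        treebank_so_far += correct_manually(batch, proposed)
        grammar = relearn_probabilities(treebank_so_far)   # old + new trees combined
    return grammar
```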
Treebanks – some examples (with links)
• Penn Treebank – perhaps best known
  • Wall Street Journal + Brown corpus; >1m words
• International Corpus of English (ICE)
• Lancaster Parsed Corpus and Lancaster-Leeds treebank
  • parsed excerpts from LOB; 140k and 45k words respectively
• Susanne Corpus, Christine Corpus, Lucy Corpus
  • related to the Lancaster corpora; developed by Geoffrey Sampson
• Verbmobil treebanks
  • parallel treebanks (Eng, Ger, Jap) used in a speech MT project
• LinGO Redwoods
  • HPSG-based parsing of Verbmobil data
• Multi-Treebank
  • parses of 60 sentences in various frameworks
• The PARC 700 Dependency Bank
  • LFG parses of 700 sentences also found in the Penn Treebank
• CHILDES
  • Brown Eve corpus of children's speech samples with dependency annotation