Parsing • See: • R Garside, G Leech & A McEnery (eds), Corpus Annotation, London: Longman (1997), chapters 11 (Bateman et al.) and 12 (Garside & Rayson) • G Kennedy, An Introduction to Corpus Linguistics, London: Longman (1998), pp. 231-244 • CF Meyer, English Corpus Linguistics, Cambridge: CUP (2002), pp. 91-96 • R Mitkov (ed), The Oxford Handbook of Computational Linguistics, Oxford: OUP (2003), chapter 4 (Kaplan) • J Allen, Natural Language Understanding (2nd ed), Addison Wesley (1994)
Parsing • POS tags give information about individual words and their internal form (e.g. singular vs plural, tense of a verb) • An additional level of information concerns the way the words relate to each other • the overall structure of each sentence • the relationships between the words • This can be achieved by parsing the corpus
Parsing – overview • What sort of information does parsing add? • What are the difficulties relating to parsing? • How is parsing done? • Parsing and corpora • partial parsing, chunking • stochastic parsing • treebanks
Structural information • Parsing adds information about sentence structure and constituents • Allows us to see what constructions words enter into • e.g. transitivity, passivization, argument structure for verbs • Allows us to see how words function relative to each other • e.g. what words can modify / be modified by other words
A parsed example • Nemo, the killer whale, who'd grown too big for his pool on Clacton Pier, has arrived safely at his new home in Windsor safari park. • [S[N Nemo_NP1 ,_, [N the_AT killer_NN1 whale_NN1 N] ,_, [Fr[N who_PNQS N][V 'd_VHD grown_VVN [J too_RG big_JJ [P for_IF [N his_APP$ pool_NN1 [P on_II [N Clacton_NP1 Pier_NNL1 N]P]N]P]J]V]Fr]N] ,_, [V has_VHZ arrived_VVN safely_RR [P at_II [N his_APP$ new_JJ home_NN1 [P in_II [N Windsor_NP1 [ safari_NN1 park_NNL1 ]N]P]N]P]V] ._. S] • Such a parse lets us ask, for example: • given this verb, what kinds of things can be the subject? • verb with adjective complement: which verbs can participate in this construction? with what adjectives? any other constraints? • verb with PP complement: which verbs with which prepositions? any constraints on the noun?
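To query a parsed corpus for patterns like these, the bracketed annotation first has to be read back into a tree. The following is a minimal sketch in Python, based only on the conventions visible in the example above (not on any standard tool): "[X" opens a constituent labelled X, "X]" closes it, bare brackets are unlabelled, tagged words look like word_TAG, and words are separated from closing brackets by spaces.

import re

# Token types assumed for the skeleton-parse notation shown above.
TOKEN = re.compile(r"\[([A-Za-z$]*)|([A-Za-z$]*)\]|(\S+?)_(\S+)")

def read_skeleton(text):
    """Return a nested [label, child, child, ...] list for one parsed sentence."""
    root = ["TOP"]
    stack = [root]
    for m in TOKEN.finditer(text):
        open_lab, close_lab, word, tag = m.groups()
        if open_lab is not None:              # "[S", "[N", or a bare "["
            node = [open_lab or "?"]
            stack[-1].append(node)
            stack.append(node)
        elif close_lab is not None:           # "S]", "N]", or a bare "]"
            stack.pop()
        else:                                 # "word_TAG"
            stack[-1].append((word, tag))
    return root

# A shortened version of the example sentence, for illustration:
parsed = "[S[N Nemo_NP1 ,_, [N the_AT killer_NN1 whale_NN1 N] N] [V has_VHZ arrived_VVN V] ._. S]"
print(read_skeleton(parsed))
# roughly: ['TOP', ['S', ['N', ('Nemo', 'NP1'), (',', ','), ['N', ...]], ['V', ...], ('.', '.')]]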
Parsing: difficulties • Besides lexical ambiguities (usually resolved by the tagger), language can be structurally ambiguous • global ambiguities due to ambiguous words and/or alternative possible combinations • local ambiguities, especially due to attachment ambiguities, and other combinatorial possibilities • the sheer weight of alternatives available in the absence of (much) knowledge
Global ambiguities • Individual words can be ambiguous as to category • In combination with each other this can lead to ambiguity: • Time flies like an arrow • Gas pump prices rose last time oil stocks fell
Local ambiguities • Structure of individual constituents may be given, but how they fit together can be in doubt • Classic example of PP attachment: • The man saw the girl with the telescope • The man saw the girl in the park with a statue of the general on a horse with a telescope in the morning on a stand with a red dress with a sword • Many other attachments are potentially ambiguous • relative clauses, adverbs, parentheticals, etc.
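The number of ways to attach a chain of PPs grows combinatorially with the Catalan numbers, which is why the extended example above has so many readings. A quick way to see the growth (a small illustrative calculation, not from the lecture):

from math import comb

def catalan(n):
    """n-th Catalan number: the number of binary bracketings of n+1 items."""
    return comb(2 * n, n) // (n + 1)

# Attachment possibilities for a chain of n PPs after a verb and its object:
# 1 PP -> 2 readings, 2 PPs -> 5, 3 PPs -> 14, ... 8 PPs -> 4862.
for n_pps in range(1, 9):
    print(n_pps, catalan(n_pps + 1))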
Difficulties • Broad coverage is necessary for parsing corpora of real text • Long sentences: • structures are very complex • ambiguities proliferate • It is difficult (even for a human) to verify whether a parse is correct • because it is complex • because it may be genuinely ambiguous
How to parse • Traditionally (in linguistics): • hand-written grammar • usually narrow coverage • linguists are interested in theoretical issues regarding syntax • Even in computational linguistics • interest is (or was?) in parsing algorithms • In either case, grammars typically used a small set of categories (N, V, Adj, etc.)
Lack of knowledge • Humans are very good at disambiguating • In fact they rarely even notice the ambiguity • Usually, only one reading “makes sense” • They use a combination of • linguistic knowledge • common-sense (real-world) knowledge • contextual knowledge • Only the first is available to computers, and then only in a limited way
Parsing corpora • Using a tagger as a front-end changes things: • Richer set of grammatical categories, which reflect some morphological information • Hand-written grammars are more difficult, though, because many generalisations are lost (e.g. many more rules are now needed for NP) • Disambiguation done by the tagger in some sense pre-empts work that you might have expected the parser to do
Parsing corpora • Impact of the broad-coverage requirement • Broad coverage means that many more constructions are covered by the grammar • This increases ambiguity massively • Partial parsing may be sufficient for some needs • The availability of corpora permits (and encourages) a stochastic approach
Partial parsing • Identification of constituents (noun phrases, verb groups, PPs) is often quite robust … • Only fitting them together can be difficult • Although some information is lost, identifying “chunks” can be useful
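One common way to identify such chunks is with a regular-expression chunker over POS tags. The sketch below uses NLTK's RegexpParser with a made-up noun-phrase pattern over Penn-style tags; it is one possible tool for illustration, not necessarily the approach assumed in the lecture.

import nltk

# A simple NP chunk pattern: optional determiner, any adjectives, one or more nouns.
chunker = nltk.RegexpParser(r"NP: {<DT>?<JJ>*<NN.*>+}")

tagged = [("the", "DT"), ("killer", "NN"), ("whale", "NN"),
          ("arrived", "VBD"), ("at", "IN"),
          ("his", "PRP$"), ("new", "JJ"), ("home", "NN")]

print(chunker.parse(tagged))
# roughly: (S (NP the/DT killer/NN whale/NN) arrived/VBD at/IN his/PRP$ (NP new/JJ home/NN))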
Stochastic parsing • Like ordinary parsing, but competing rules are assigned a probability score • Scores can be used to compare (and favour) alternative parses • Where do the probabilities come from? • Example rule probabilities: S → NP VP (.80), S → aux NP VP (.15), S → VP (.05), NP → det n (.20), NP → det adj n (.35), NP → n (.20), NP → adj n (.15), NP → pro (.10)
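A minimal sketch of how such scores can be used to compare parses, using the toy rule probabilities listed above (the rule set and the two candidate parses are invented for illustration): the score of a parse is the product of the probabilities of the rules used to build it.

# Toy PCFG-style rule probabilities, keyed by (left-hand side, right-hand side).
RULE_PROB = {
    ("S",  ("NP", "VP")):        0.80,
    ("S",  ("aux", "NP", "VP")): 0.15,
    ("S",  ("VP",)):             0.05,
    ("NP", ("det", "n")):        0.20,
    ("NP", ("det", "adj", "n")): 0.35,
    ("NP", ("n",)):              0.20,
    ("NP", ("adj", "n")):        0.15,
    ("NP", ("pro",)):            0.10,
}

def parse_score(rules_used):
    """Multiply the probabilities of all rules used in one candidate parse."""
    score = 1.0
    for rule in rules_used:
        score *= RULE_PROB[rule]
    return score

# Two competing analyses would be compared (and the higher-scoring one favoured) like this:
parse_a = [("S", ("NP", "VP")), ("NP", ("det", "adj", "n"))]
parse_b = [("S", ("NP", "VP")), ("NP", ("det", "n"))]
print(parse_score(parse_a), parse_score(parse_b))   # 0.28 vs 0.16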
Where do the probabilities come from? • Use a corpus of already parsed sentences: a “treebank” • Best-known example is the Penn Treebank • Marcus et al. 1993 • Available from the Linguistic Data Consortium • Based on the Brown corpus + 1m words of Wall Street Journal + the Switchboard corpus • Count all occurrences of each rule variant (e.g. each NP rule) and divide by the total number of NP rule instances • Very laborious, so of course it is done automatically
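The counting step can be sketched as follows. This is a toy illustration of relative-frequency estimation from rule instances extracted from a treebank, not the actual Penn Treebank tooling: each rule's count is divided by the count of all rules sharing its left-hand side.

from collections import Counter

# Rule instances as they might be read off treebank trees, one per constituent.
observed_rules = [
    ("NP", ("det", "n")),
    ("NP", ("det", "adj", "n")),
    ("NP", ("pro",)),
    ("NP", ("det", "n")),
    ("S",  ("NP", "VP")),
]

rule_counts = Counter(observed_rules)
lhs_counts  = Counter(lhs for lhs, _ in observed_rules)

# P(rule) = count(rule) / count(all rules with the same left-hand side).
rule_probs = {
    (lhs, rhs): count / lhs_counts[lhs]
    for (lhs, rhs), count in rule_counts.items()
}
print(rule_probs[("NP", ("det", "n"))])   # 2 / 4 = 0.5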
Where do the probabilities come from? • Create your own treebank from your own corpus • Easy if all sentences are unambiguous: just count the (successful) rule applications • When there are ambiguities, rules which contribute to the ambiguity have to be counted separately and weighted
Where do the probabilities come from? • Learn them as you go along • Again, assumes some way of identifying the correct parse in case of ambiguity • Each time a rule is successfully used, its probability is adjusted • You have to start with some estimated probabilities, e.g. all equal • Does need human intervention, otherwise rules become self-fulfilling prophecies
Bootstrapping the grammar • Start with a basic grammar, possibly written by hand, with all rules equally probable • Parse a small amount of text, then correct it manually • this may involve correcting the trees and/or changing the grammar • Learn new probabilities from this small treebank • Parse another (similar) amount of text, then correct it manually • Adjust the probabilities based on the old and new trees combined • Repeat until the grammar stabilizes
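Schematically, the loop described above (and the "learn as you go" idea before it) might look like the sketch below. All function names are placeholders for the manual and automatic steps, not real library calls; the point is only the alternation of parsing, hand correction, and re-estimation.

def bootstrap(grammar, corpus_batches, parse, correct_by_hand, estimate_probs):
    """Iteratively grow a treebank and re-estimate rule probabilities from it."""
    treebank = []
    for batch in corpus_batches:
        trees = [parse(grammar, sentence) for sentence in batch]  # parse a small amount of text
        treebank.extend(correct_by_hand(trees))                   # fix the trees (and maybe the grammar)
        grammar = estimate_probs(grammar, treebank)               # re-learn probabilities from all trees so far
    return grammar, treebank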
Treebanks – some examples • Penn Treebank: perhaps the best known • Wall Street Journal corpus, Brown corpus; >1m words • International Corpus of English (ICE) • Lancaster Parsed Corpus and Lancaster-Leeds Treebank • parsed excerpts from LOB; 140k and 45k words resp. • Susanne Corpus, Christine Corpus, Lucy Corpus • related to the Lancaster corpora; developed by Geoffrey Sampson • Verbmobil treebanks • parallel treebanks (Eng, Ger, Jap) used in a speech MT project • LinGO Redwoods: HPSG-based parsing of Verbmobil data • Multi-Treebank • parses in various frameworks of 60 sentences • The PARC 700 Dependency Bank • LFG parses of 700 sentences also found in the Penn Treebank • CHILDES • Brown Eve corpus of children's speech samples with dependency annotation