LING/C SC/PSYC 438/538 Lecture 27 Sandiway Fong
Administrivia • 2nd Reminder • 538 Presentations • Send me your choices if you haven’t already
english26.pl [Background: chapter 12 of JM contains many grammar rules] • Subject of passive in by-phrase • the sandwich was eaten by John • Questions • Did John eat the sandwich? • Is the sandwich eaten by John? • Was John eating the sandwich? • Who ate the sandwich? • What did John eat? (do-support) • Which sandwich did John eat? • Why did John eat the sandwich?
english26.pl
• Displacement rule (subject-aux inversion):
• was John eating the sandwich ?
• was John was eating the sandwich (the struck-out inner was marks the auxiliary's original VP-internal position)
• Aux_x [NP … ] [VP Aux_x … ]
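The displacement can be encoded in a DCG by passing the fronted auxiliary down as an argument. A minimal sketch under made-up rule names (not the actual english26.pl rules):

    % Toy grammar: the sentence-initial auxiliary is threaded down to
    % the VP as an argument, standing in for its original (gap) position.
    q --> aux(A), np, vp(A).

    aux(was) --> [was].

    np --> [john].
    np --> [the, sandwich].

    % vp(A): A is the displaced auxiliary; 'was' licenses the -ing form
    vp(was) --> [eating], np.

Here ?- phrase(q, [was, john, eating, the, sandwich]). succeeds, with the auxiliary recognized at the front but interpreted inside the VP.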
english26.pl • Yes-no question (without aux inversion): the parse is blocked by the {Ending=root} constraint
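A sketch of how such a feature constraint can block a parse (hypothetical encoding; the actual english26.pl feature system is richer). The verb's inflection is carried as an argument, and do-support demands the root form:

    % v(Ending): lexical entries carry an inflection feature
    v(root) --> [eat].
    v(past) --> [ate].

    np --> [john].
    np --> [the, sandwich].

    % do-support: the verb after 'did' must satisfy Ending=root
    q --> [did], np, v(root), np.

phrase(q, [did, john, eat, the, sandwich]) succeeds, while phrase(q, [did, john, ate, the, sandwich]) fails: ate has Ending=past, so the {Ending=root} requirement blocks it.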
english26.pl
• Passives and progressives: nested constructions
• [Progressive [Passive] ] is grammatical: the sandwich was being eaten
• [Passive [Progressive] ] is not: *the sandwich was been eating
english26.pl
• Nesting order forced by rule chaining
• progressive → passive → VP_nptrace
• passive → VP_nptrace
• there is no passive → progressive … chain, so the bad nesting order is never derived
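A sketch of why chaining fixes the order (toy rules under assumed names, not the english26.pl originals): the progressive rule can call into the passive rule, but no rule chains from passive back to progressive:

    s --> np, vp_prog.

    np --> [john].
    np --> [the, sandwich].

    % progressive: 'was' + -ing VP, or 'was being' + a passive VP
    vp_prog --> [was, eating], np.
    vp_prog --> [was, being], vp_passive.   % Progressive over Passive

    % passive: nothing here calls vp_prog, so Passive over Progressive
    % nestings like *'was been eating' are underivable
    vp_passive --> [eaten].

?- phrase(s, [the, sandwich, was, being, eaten]). succeeds, and no derivation exists for *the sandwich was been eating.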
Homework 5 English grammar programming
• use english26.pl
• Add rules to handle the following sentences
• Part 1: Raising verbs
• John seems to be happy
• It seems that John is happy
• *It seems that John be happy (should be rejected)
• *John seems is happy (should be rejected)
• Part 2: PP attachment ambiguity
• I saw the boy with a telescope is ambiguous between two readings
• Your grammar should produce both parses (a toy illustration of the mechanism follows this slide)
• Part 3: recursion and relative clauses
• I recognized the man
• I recognized the man who recognized you
• I recognized the man who recognized the woman who you recognized
• Explain your parse trees
• Submit your grammar and runs.
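For Part 2, the key point is that a DCG enumerates alternative analyses on backtracking. A toy sketch of the mechanism (not a homework solution; rule names and tree encoding are made up):

    % Parse trees are built in an extra argument; both attachments parse.
    vp(vp(V, NP, PP)) --> v(V), np(NP), pp(PP).    % high: [saw [the boy] [with a telescope]]
    vp(vp(V, NP))     --> v(V), np(NP).            % low:  [saw [the boy with a telescope]]

    np(np(D, N))      --> det(D), n(N).
    np(np(D, N, PP))  --> det(D), n(N), pp(PP).

    pp(pp(P, NP))     --> p(P), np(NP).

    v(saw) --> [saw].
    p(with) --> [with].
    det(the) --> [the].   det(a) --> [a].
    n(boy) --> [boy].     n(telescope) --> [telescope].

?- phrase(vp(T), [saw, the, boy, with, a, telescope]). returns one tree, and typing ; makes Prolog backtrack to the other.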
Why can’t computers use English?
• Context
• a linguist’s view: a list of examples that are hard for computers to do
• a computational linguist’s view (mine): these actually aren’t very hard at all... armed with some DCG technology, we can easily write a grammar that makes the distinctions outlined in the pamphlet
• You could easily write a grammar for these examples
• Online parsers: Berkeley parser/Stanford parser, trained on the Penn Treebank
If computers are so smart, why can't they use simple English?
• Consider, for instance, the four letters read; they can be pronounced as either reed or red. How does the machine know in each case which is the correct pronunciation? Suppose it comes across the following sentences:
• (1) The girls will read the paper. (reed)
• (2) The girls have read the paper. (red)
• We might program the machine to pronounce read as reed if it comes right after will, and red if it comes right after have. But then sentences (3) through (5) would cause trouble.
• (3) Will the girls read the paper? (reed)
• (4) Have any men of good will read the paper? (red)
• (5) Have the executors of the will read the paper? (red)
• How can we program the machine to make this come out right?
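A sketch of the grammar-based answer (toy rules, hypothetical names): instead of looking at the adjacent word, parse the sentence and read the pronunciation off the auxiliary that read is actually paired with:

    % will + read = base form       -> pronounced 'reed'
    % have + read = past participle -> pronounced 'red'
    pronunciation(will, reed).
    pronunciation(have, red).

    s(P) --> np, aux(A), [read], np, { pronunciation(A, P) }.

    aux(will) --> [will].
    aux(have) --> [have].

    np --> [the, girls].
    np --> [the, paper].

?- phrase(s(P), [the, girls, will, read, the, paper]). gives P = reed; with have, P = red. Once the grammar also covers inversion and NPs like the executors of the will, sentences (3) through (5) come out right for free, because the auxiliary is found by the parse, not by adjacency.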
If computers are so smart, why can't they use simple English?
• (6) Have the girls who will be on vacation next week read the paper yet? (red)
• (7) Please have the girls read the paper. (reed)
• (8) Have the girls read the paper? (red)
• Sentence (6) contains both have and will before read, and both of them are auxiliary verbs. But will modifies be, and have modifies read. In order to match up the verbs with their auxiliaries, the machine needs to know that the girls who will be on vacation next week is a separate phrase inside the sentence.
• In sentence (7), have is not an auxiliary verb at all, but a main verb that means something like 'cause' or 'bring about'. To get the pronunciation right, the machine would have to be able to recognize the difference between a command like (7) and the very similar question in (8), which requires the pronunciation red.
Example • (5) Have the executors of the will read the paper? (red)
Treebanks
• Treebank: a corpus of sentences, each of which has been parsed
• POS tags assigned
• also labels for phrases
• A treebank is also a grammar: we can extract the rules, and also frequency counts
• A consistently labeled treebank might be called a “grammatical theory”
• Most popular treebank: the Penn Treebank
• Available on CD from the UA Library (search the catalog)
• particularly the Wall Street Journal (WSJ) section: 50,000 sentences
• used for training probabilistic context-free grammars (PCFGs)
• Results: around the 90% mark on bracketed precision/recall
• also contains traces and indices (typically not used for PCFGs)
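Since a treebank is also a grammar, rule extraction is mechanical. A sketch in Prolog under an assumed term encoding of trees (not an actual treebank reader):

    % Assume trees are terms: node(Label, Children) or leaf(POS, Word).
    % rules/2 collects one context-free rule per internal node.
    rules(leaf(_, _), []).
    rules(node(L, Kids), [rule(L, RHS)|Rest]) :-
        maplist(label, Kids, RHS),
        maplist(rules, Kids, KidRules),
        append(KidRules, Rest).

    label(node(L, _), L).
    label(leaf(POS, _), POS).

For example, ?- rules(node(s, [node(np, [leaf(nnp, vinken)]), node(vp, [leaf(vb, join)])]), R). gives R = [rule(s, [np, vp]), rule(np, [nnp]), rule(vp, [vb])]. Counting duplicate rules over the whole corpus (e.g. after msort/2) yields the frequencies a PCFG needs.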
Penn Treebank
What is in it? (v3)
• Four parsed sections:
• One million words of 1989 Wall Street Journal (WSJ) material
• ATIS-3 sample
• Switchboard
• Brown Corpus
• In the NLP literature, “Penn Treebank” usually refers to the WSJ section only
• Example: wsj_001.mrg
• Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.

    ( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) )
                 (, ,)
                 (ADJP (NP (CD 61) (NNS years) )
                       (JJ old) )
                 (, ,) )
         (VP (MD will)
             (VP (VB join)
                 (NP (DT the) (NN board) )
                 (PP-CLR (IN as)
                         (NP (DT a) (JJ nonexecutive) (NN director) ))
                 (NP-TMP (NNP Nov.) (CD 29) )))
         (. .) ))
Penn Treebank
What is in it?
• Part-of-speech (POS) labels on words, numbers and punctuation, using the 48-tag Penn tagset (a simplification of the 1982 Francis & Kucera Brown corpus tagset), e.g. NN, VB, IN, JJ
• Constituents identified and labeled with syntactic categories, e.g. S, NP, VP, PP
• Additional sublabels to facilitate predicate-argument extraction, e.g. -SBJ, -CLR, -TMP
Penn Treebank
• The WSJ section of the Penn Treebank has become the standard training corpus and testbed for statistical NLP
• Other Penn treebanks: Arabic, Chinese and Korean
• Other formalisms:
• CCG (Combinatory Categorial Grammar) treebank
• Dependency grammar
• http://en.wikipedia.org/wiki/Treebank lists about 50 treebanks in 29 languages
Penn Treebank
The formalism chosen (sorta) matters:
• Penn Treebank includes empty categories (ECs), including traces
• CCG has slash categories
• Dependency grammar-based treebanks don’t have ECs; they also don’t have node labels
• Example: wsj_100.mrg (the full tree is on the next slide)
Penn Treebank
Example: wsj_100.mrg

    ( (S (NP-SBJ (NNP Nekoosa) )
         (VP (VBZ has)
             (VP (VBN given)
                 (NP (DT the) (NN offer) )
                 (NP (NP (DT a) (JJ public) (JJ cold) (NN shoulder) )
                     (, ,)
                     (NP (NP (DT a) (NN reaction) )
                         (SBAR (WHNP-2 (-NONE- 0) )
                               (S (NP-SBJ (NNP Mr.) (NNP Hahn) )
                                  (VP (VBZ has) (RB n't)
                                      (VP (VBN faced)
                                          (NP (-NONE- *T*-2) )
                                          (PP-LOC (IN in)
                                              (NP (NP (PRP$ his) (CD 18) (JJR earlier) (NNS acquisitions) )
                                                  (, ,)
                                                  (SBAR (WHNP-3 (DT all) (WHPP (IN of) (WHNP (WDT which) )))
                                                        (S (NP-SBJ-1 (-NONE- *T*-3) )
                                                           (VP (VBD were)
                                                               (VP (VBN negotiated)
                                                                   (NP (-NONE- *-1) )
                                                                   (PP-LOC (IN behind)
                                                                           (NP (DT the) (NNS scenes) ))))))))))))))))
      (. .) ))
Penn Treebank
• The formalism chosen (sorta) matters
• Penn Treebank includes empty categories, including traces
• It is standard in the statistical NLP literature to first discard all the empty category information, both for training and evaluation
• some exceptions: Collins Model 3; post-processing to re-insert ECs
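Discarding the EC information is a small tree transformation. A sketch using the same hypothetical node/leaf term encoding as above:

    % strip/2 removes -NONE- leaves and any node left with no children.
    strip(leaf(POS, W), leaf(POS, W)) :-
        POS \== '-NONE-'.
    strip(node(L, Kids), node(L, Kept)) :-
        strip_kids(Kids, Kept),
        Kept \== [].            % drop nodes dominating only empty material

    strip_kids([], []).
    strip_kids([K|Ks], Out) :-
        (   strip(K, K2)
        ->  Out = [K2|Rest]
        ;   Out = Rest
        ),
        strip_kids(Ks, Rest).

For instance, strip(node(np, [leaf('-NONE-', '*T*-2')]), T) fails outright, so a trace NP like the one in wsj_100.mrg disappears from its parent VP.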
Penn Treebank
How is it used?
• One million words of 1989 Wall Street Journal (WSJ) material
• nearly 50,000 sentences (49,208), divided into 25 sections (0–24)
• sections 2–21 contain 39,832 sentences
• section 23 (2,416 sentences) is held out for evaluation
• Standard practice (across sections 0–24): train on sections 2–21 (about 90% of the data); evaluate on section 23
Treebank Software
• Tgrep2
• by Doug Rohde
• http://tedlab.mit.edu/~dr/TGrep2/
• Download and install for Linux (pre-compiled; if you're lucky it runs on your Linux without recompilation)
• For Mac OS X, just re-compile (you will also need the DRUtils library)
• described in the textbook
• works on the command line
• Java package
• Tregex from Stanford
• broadly compatible with Tgrep2
• http://nlp.stanford.edu/software/tregex.shtml
• Jar file (should run on all platforms)
• has a graphical user interface
• file run-tregex-gui.bat (batch file for Windows)
• edit this file to set max memory to 500m (or larger) to use the GUI with the entire treebank
• Also TIGERsearch
• http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERSearch/
• Windows explicitly supported
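The usual session shape, as I recall it from the TGrep2 manual (the flags and the hypothetical file names should be checked against the documentation): first compile the trees into a binary corpus, then search it with a dominance pattern. The same pattern syntax works in Tregex.

    tgrep2 -p wsj-combined.mrg wsj.t2c   # compile bracketed trees into a .t2c corpus
    tgrep2 -c wsj.t2c 'NP < PP'          # find NPs immediately dominating a PP

Here < means "immediately dominates"; << would mean "dominates (at any depth)".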