LING 581: Advanced Computational Linguistics. Lecture Notes January 26th. Penn Treebank. Bracketing guidelines. Ungraded Homework Exercise. Search for NP trace relative clauses as defined below. Be ready to compare search pattern and number found next time in class.
LING 581: Advanced Computational Linguistics Lecture Notes January 26th
Penn Treebank Bracketing guidelines
Ungraded Homework Exercise • Search for NP trace relative clauses as defined below: • Be ready to compare search patterns and the number of matches found next time in class
Ungraded Homework Exercise • Pattern: @NP < @NP < @SBAR • Matches: 12038
Ungraded Homework Exercise • Pattern: @NP < @NP < @SBAR, plus WH indices • Matches: 10956 (down from 12038)
Ungraded Homework Exercise • Pattern: @NP < @NP < (@SBAR < /^-NONE-/) • Matches: 529 • Note: -NONE- < *ICH*
Ungraded Homework Exercise • Not all matches of @NP < @NP < (@SBAR < /^-NONE-/) are relative clauses
Ungraded Homework Exercise • Pattern: @NP < @NP < (@SBAR < /^-NONE-/), plus *ICH* • Count drops from 529 to 166
Ungraded Homework Exercise • Pattern: @NP < @NP < (@SBAR < /^-NONE-/), plus *ICH* • Is 166 too low? • How about other -NONE- nodes?
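One way to approach the question about other -NONE- nodes is to tally which empty-category labels occur in the treebank files at all. The following is a minimal shell sketch, not part of the original exercise. It assumes the usual .mrg bracketing, in which an empty element appears as a leaf such as (-NONE- *ICH*-1), and the function name none_tally is made up for illustration. The trailing coindexation number is stripped so that, for example, *T*-1 and *T*-2 are counted together.

```shell
#!/bin/sh
# none_tally FILE...: count empty-category leaves by type.
# Assumes leaves of the form "(-NONE- *T*-1)"; the name none_tally is hypothetical.
none_tally() {
    # 1. grep -o prints each empty leaf on its own line.
    # 2. sed removes the "(-NONE- " prefix, the closing ")", and any "-N" index.
    # 3. sort | uniq -c counts each type; sort -rn lists the most frequent first.
    cat "$@" |
        grep -o '(-NONE- [^)]*)' |
        sed -e 's/^(-NONE- //' -e 's/)$//' -e 's/-[0-9][0-9]*$//' |
        sort | uniq -c | sort -rn
}
```

A call such as none_tally TREEBANK_3/parsed/mrg/wsj/*/*.mrg would then list each empty-category type with its frequency. It is only a plain-text count and does not replace a structural Tregex search.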
Ungraded Homework Exercise • Final tally
Homework Exercise • Use the bracketing guidelines to choose three “interesting” constructions • Find all occurrences in the WSJ PTB
Homework Exercise • 581 Homework rules • Due next lecture • Present your findings in class (slides)
Parsing … from Treebank search to stochastic parsers trained on the WSJ Penn Treebank
Bikel Collins • Java re-implementation of Collins’ parser • Paper • Daniel M. Bikel. 2004. Intricacies of Collins’ Parsing Model. Computational Linguistics, 30(4), pp. 479-511. • http://www.cis.upenn.edu/~dbikel/papers/collins-intricacies.pdf • Software • http://www.cis.upenn.edu/~dbikel/
Bikel Collins • Download and install Dan Bikel’s parser • File: install.sh • Java code • At this point I think installation won’t work on Windows because of the shell script (.sh) • It may work once the files have been extracted
Bikel Collins • Download and install the POS tagger MXPOST • The parser doesn’t actually need a separate tagger…
Bikel Collins • Training the parser with the WSJ PTB • See guide • http://www.cis.upenn.edu/~dbikel/download/dbparser/guide.pdf • Directory: TREEBANK_3/parsed/mrg/wsj • Sections 02-21: create one single .mrg file • Events: wsj-02-21.obj.gz
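The step of creating one single .mrg file from sections 02-21 can be sketched in the shell as follows. This is only an illustration. It assumes the standard layout in which TREEBANK_3/parsed/mrg/wsj contains one two-digit subdirectory per section, and the function name build_training_file is made up here.

```shell
#!/bin/sh
# build_training_file WSJ_DIR OUT: concatenate WSJ sections 02-21 into OUT.
# The function name is hypothetical; WSJ_DIR is assumed to hold 00, 01, ... subdirectories.
build_training_file() {
    wsj_dir=$1
    out=$2
    : > "$out"    # start from an empty output file
    for sec in 02 03 04 05 06 07 08 09 10 11 12 13 14 15 16 17 18 19 20 21; do
        for f in "$wsj_dir/$sec"/*.mrg; do
            # An unmatched glob stays literal, so skip anything that is not a file.
            if [ -f "$f" ]; then
                cat "$f" >> "$out"
            fi
        done
    done
}
```

For example, build_training_file TREEBANK_3/parsed/mrg/wsj wsj-02-21.mrg would produce the training file name used with bin/train in these notes. The output file should live outside the section directories so it is not picked up by the glob.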
Bikel Collins • Settings:
Bikel Collins • Parsing • Command • Input file format (sentences)
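As an illustration of what a sentence input file might look like, the following sketch relies on two assumptions that should be checked against the parser guide: that MXPOST prints one sentence per line as word_TAG tokens, and that the parser accepts pre-tagged sentences of the form ((word (TAG)) (word (TAG)) ...). Under those assumptions, a small shell function (the name mxpost_to_bikel is made up for illustration) can turn tagger output into parser input.

```shell
#!/bin/sh
# mxpost_to_bikel: read word_TAG lines on stdin, write ((word (TAG)) ...) lines.
# Both formats are assumptions; the function name is hypothetical.
mxpost_to_bikel() {
    awk 'NF > 0 {
        line = "("
        for (i = 1; i <= NF; i++) {
            # Split at the last underscore so words containing "_" stay whole.
            # Tokens without any underscore are not handled by this sketch.
            match($i, /_[^_]*$/)
            word = substr($i, 1, RSTART - 1)
            tag = substr($i, RSTART + 1)
            line = line (i > 1 ? " " : "") "(" word " (" tag "))"
        }
        print line ")"
    }'
}
```

For instance, the line The_DT dog_NN barks_VBZ ._. would be rewritten as ((The (DT)) (dog (NN)) (barks (VBZ)) (. (.))). Blank lines are dropped, so each output line corresponds to one sentence.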
Bikel Collins • Verify the trainer and parser work on your machine
Bikel Collins • File: bin/parse is a shell script that sets up program parameters and calls java
Bikel Collins • File: bin/train is another shell script
Bikel Collins • Relevant WSJ PTB files
Bikel Collins • If you have Tcl/Tk installed, you can use the wrapper I wrote to call Dan Bikel’s code • It makes it easy to work with the parser without memorizing the command-line options
Bikel Collins • For tree viewing, you can use tregex • For demos, I use my own viewer
Bikel Collins
POS tagging (MXPOST, in directory jmx)
• tagger_input
• $prefix/jmx/mxpost $prefix/jmx/tagger.project < /tmp/test.txt 2> /tmp/err.txt
Parsing
• set ddf "wsj-02-21.obj.gz"
• set properties "collins.properties"
• parser_input
• $dbprefix/bin/parse 400 $dbprefix/settings/$properties $dbprefix/bin/$ddf /tmp/test2.txt 2>@ stdout
Training
• set mrg "wsj-02-21.mrg"
• set properties "collins.properties"
• $dbprefix/bin/train 800 $dbprefix/settings/$properties $dbprefix/bin/$mrg 2>@ stdout
Unix file descriptors
• 0 Standard input (stdin)
• 1 Standard output (stdout)
• 2 Standard error (stderr)
GUI components
    # Input pane: text widget with an attached vertical scrollbar
    frame .input
    text .input.t -height 4 -yscrollcommand {.input.s set}
    scrollbar .input.s -command {.input.t yview}
    # Tagged pane: same layout, taller text widget
    frame .tagged
    text .tagged.t -height 9 -yscrollcommand {.tagged.s set}
    scrollbar .tagged.s -command {.tagged.t yview}
Code
    # Write the contents of the input pane to the tagger's input file
    proc tagger_input {} {
        set lines [.input.t get 1.0 end]
        set infile [open "/tmp/test.txt" w]
        puts -nonewline $infile [string trimright $lines]
        close $infile
    }
    # Write the contents of the tagged pane to the parser's input file
    proc parser_input {} {
        set lines [.tagged.t get 1.0 end]
        set infile [open "/tmp/test2.txt" w]
        puts -nonewline $infile [string trimright $lines]
        close $infile
    }
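The redirections in the commands above rely on these descriptor numbers. In the mxpost call, 2> sends standard error to a file. In the Tcl calls, 2>@ stdout hands the child's standard error to the wrapper's own standard output channel. The comparable operation in a plain shell is 2>&1. A minimal shell sketch, using a made-up function noisy_tool as a stand-in for the tagger or parser:

```shell
#!/bin/sh
# noisy_tool is a hypothetical stand-in: results on fd 1, diagnostics on fd 2.
noisy_tool() {
    echo "tagged output"
    echo "progress message" >&2
}

demo_dir=$(mktemp -d)

# 2> FILE: capture stdout only; stderr goes to a file (as in the mxpost call).
out_only=$(noisy_tool 2> "$demo_dir/err.txt")
err_only=$(cat "$demo_dir/err.txt")

# 2>&1: merge stderr into stdout so both streams are captured together.
merged=$(noisy_tool 2>&1)

printf 'stdout only: %s\n' "$out_only"
printf 'stderr file: %s\n' "$err_only"
printf 'merged:\n%s\n' "$merged"

rm -r "$demo_dir"
```

Keeping diagnostics on descriptor 2 is what lets the tagger's output be saved or piped cleanly while progress messages go elsewhere.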