LING 581: Advanced Computational Linguistics Lecture Notes January 19th
Course • Webpage • http://dingo.sbs.arizona.edu/~sandiway/ling581-11/ • Enrollment
Course Objectives • Gain meaningful project experience • dealing with natural language software packages • installation • input data formatting • operation • project exercises • useful “real-world” computational experience • write small programs • abilities gained will be of value to employers
Computational Facilities • Advise using your own laptop/desktop • we can also make use of this computer lab • but you don’t have installation rights on these computers • Platforms • You need to run some variant of Unix… (your task #1 for this week) e.g. • Linux • de facto standard for advanced/research software • Cygwin on Windows • http://www.cygwin.com/ • Linux-like environment for Windows making it possible to port software running on POSIX systems (such as Linux, BSD, and Unix systems) to Windows. • MacOS X • Not quite Linux, some porting issues, especially with C programs
Theme • Language Understanding
Project Topics • PTB (Penn Treebank) search/lookup software (tgrep2) • Part-of-speech taggers • The use and modification of statistical parsers trained on Treebanks (Bikel-Collins and others) • Ontologies and Semantic Networks: WordNet etc. • Question-Answering (QA) • Sentence Parsing using contemporary linguistic theory: Minimalist Program
Grading • Completion of all homework tasks will result in a satisfactory grade (A)
In the News recently… www.ibmwatson.com
Project 1: PTB • You will be exposed to • Perl • Java • Lisp s-exps • Bikel-Collins Parser • You will need to review concepts from LING 538 • regexp use • Penn POS tags
PTB • Availability • Linguistic Data Consortium (LDC) • U. of Arizona is a (fee-paying) member of this consortium • Resources are made available to the community through the main library • URL • http://sabio.library.arizona.edu/search/X
PTB (V3) • Call Record
Task 1 • Install Cygwin or Ubuntu • Install the PTB • Borrow it from the library • Or use the CD I’ve brought with me • Familiarize yourself with the organization and layout of the files • e.g. the difference between mrg and prd formats • As is standard in the literature, we’ll be using the WSJ (Wall Street Journal) section of the PTB
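One way to get a first feel for the file layout is to count the files in each section directory. The following is only a minimal sketch: it assumes the WSJ files sit in two-digit section subdirectories (as the example name 00/wsj_0001.mrg suggests), and the function name `count_files_by_section` and the root directory you pass in are hypothetical, so adjust them to wherever your copy of the PTB is installed.

```python
from collections import Counter
from pathlib import Path


def count_files_by_section(wsj_root, suffix=".mrg"):
    """Count files ending in `suffix` inside each section subdirectory.

    Assumes a layout like <wsj_root>/00/wsj_0001.mrg (an assumption about
    your local copy, not something this function verifies).
    """
    counts = Counter()
    # "*/*.mrg" matches files exactly one directory level below wsj_root.
    for path in Path(wsj_root).glob(f"*/*{suffix}"):
        counts[path.parent.name] += 1
    # Return the sections in sorted order, e.g. "00", "01", ...
    return dict(sorted(counts.items()))
```

Calling `count_files_by_section("path/to/wsj")` returns a dictionary mapping section names to file counts; passing `suffix=".prd"` on the corresponding prd directory gives the same overview for that format.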
TreeBank Browsing • 00/wsj_0001.mrg
( (S (NP-SBJ (NNP Mr.) (NNP Vinken) ) (VP (VBZ is) (NP-PRD (NP (NN chairman) ) (PP (IN of) (NP (NP (NNP Elsevier) (NNP N.V.) ) (, ,) (NP (DT the) (NNP Dutch) (VBG publishing) (NN group) ))))) (. .) ))
00/wsj_0001.mrg
( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (, ,) (ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (, ,) ) (VP (MD will) (VP (VB join) (NP (DT the) (NN board) ) (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) )) (NP-TMP (NNP Nov.) (CD 29) ))) (. .) ))
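The bracketed trees above are plain s-expressions, so a few lines of code are enough to read them. The following is a minimal sketch, assuming well-formed input with balanced parentheses; the names `parse_trees`, `tagged_words` and `SAMPLE` are made up for this illustration and are not part of any PTB tool.

```python
import re

# A token is an opening paren, a closing paren, or a run of other characters.
TOKEN = re.compile(r"\(|\)|[^\s()]+")

# One of the wsj_0001.mrg trees shown above.
SAMPLE = """( (S (NP-SBJ (NNP Mr.) (NNP Vinken) ) (VP (VBZ is) (NP-PRD
(NP (NN chairman) ) (PP (IN of) (NP (NP (NNP Elsevier) (NNP N.V.) ) (, ,)
(NP (DT the) (NNP Dutch) (VBG publishing) (NN group) ))))) (. .) ))"""


def parse_trees(text):
    """Parse bracketed text into nested lists: [label, child, ...].

    Words are plain strings. The unlabeled outer parentheses that wrap each
    sentence become a list whose only element is the S node.
    """
    tokens = TOKEN.findall(text)
    pos = 0

    def parse_node():
        nonlocal pos
        pos += 1  # skip "("
        node = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse_node())
            else:
                node.append(tokens[pos])
                pos += 1
        pos += 1  # skip ")"
        return node

    trees = []
    while pos < len(tokens):
        trees.append(parse_node())
    return trees


def tagged_words(node):
    """Yield (word, tag) pairs from preterminals such as ['NNP', 'Vinken']."""
    if len(node) == 2 and isinstance(node[0], str) and isinstance(node[1], str):
        yield node[1], node[0]
    else:
        for child in node:
            if isinstance(child, list):
                yield from tagged_words(child)
```

For example, `list(tagged_words(parse_trees(SAMPLE)[0]))` gives the sentence as (word, Penn POS tag) pairs, starting with `('Mr.', 'NNP')`.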
TreeBank Browsing • My out-dated tool (treebank viewer) • URL • http://dingo.sbs.arizona.edu/~sandiway/treebankviewer/
PTB Search Tools: Looking ahead • Google and install • tgrep2 • http://tedlab.mit.edu/~dr/Tgrep2/ • a fast command line search tool for parse trees • C program (source, Makefile) • Tregex • http://nlp.stanford.edu/software/tregex.shtml • Graphical Java version • Penn Treebank Online (tgrep interface) • http://www.ldc.upenn.edu/ldc/online/treebank/ • doesn’t seem to be working: tgrep search currently unavailable • tgrep • VP << /^believe/ < (S < (/^NP/ !<< /[*]/ !< (-NONE- < T)) < (VP|AUX << to)) • approximation to finding Verb Phrases headed by "believe" that have an infinitival complement with a non-null subject
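To see what the pattern asks for, it may help to spell out its operators by hand: `<` is "immediately dominates", `<<` is "dominates", and `!` negates a relation. The following is a rough sketch, not how tgrep or tgrep2 is implemented. Trees are written directly as nested Python lists `[label, child, ...]` with words as strings, and the two example trees and all function names are invented for illustration.

```python
import re


def label(node):
    """A phrase's label is its first element; a word is its own label."""
    return node[0] if isinstance(node, list) else node


def children(node):
    return node[1:] if isinstance(node, list) else []


def descendants(node):
    for child in children(node):
        yield child
        yield from descendants(child)


def has_child(node, test):   # tgrep "<"
    return any(test(c) for c in children(node))


def dominates(node, test):   # tgrep "<<"
    return any(test(d) for d in descendants(node))


def named(pattern):
    """Build a test that checks a node's label against a regular expression."""
    rx = re.compile(pattern)
    return lambda node: rx.search(label(node)) is not None


def overt_subject(np):       # /^NP/ !<< /[*]/ !< (-NONE- < T)
    is_t_trace = lambda n: named(r"^-NONE-$")(n) and has_child(n, named(r"^T$"))
    return (named(r"^NP")(np)
            and not dominates(np, named(r"[*]"))
            and not has_child(np, is_t_trace))


def infinitival_clause(s):   # S < (...NP...) < (VP|AUX << to)
    to_phrase = lambda v: named(r"^(VP|AUX)$")(v) and dominates(v, named(r"^to$"))
    return (named(r"^S$")(s)
            and has_child(s, overt_subject)
            and has_child(s, to_phrase))


def believe_vp(vp):          # VP << /^believe/ < (S ...)
    return (named(r"^VP$")(vp)
            and dominates(vp, named(r"^believe"))
            and has_child(vp, infinitival_clause))


def find(tree, test):
    """Return every node of the tree (root included) that passes the test."""
    return [n for n in [tree, *descendants(tree)] if test(n)]


# Invented trees: an overt embedded subject vs. an empty (*) one.
EXAMPLE_OVERT = ["S", ["NP-SBJ", ["PRP", "They"]],
                 ["VP", ["VBP", "believe"],
                  ["S", ["NP-SBJ", ["PRP", "him"]],
                   ["VP", ["TO", "to"], ["VP", ["VB", "win"]]]]]]
EXAMPLE_NULL = ["S", ["NP-SBJ", ["PRP", "They"]],
                ["VP", ["VBP", "believe"],
                 ["S", ["NP-SBJ", ["-NONE-", "*"]],
                  ["VP", ["TO", "to"], ["VP", ["VB", "win"]]]]]]
```

Here `find(EXAMPLE_OVERT, believe_vp)` returns the single outer VP, while `find(EXAMPLE_NULL, believe_vp)` returns nothing because the embedded subject dominates an empty element.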