Statistical Natural Language Processing

Lecture 1 19/4/2011 Statistical Natural Language Processing

Outline • Syllabus • Why take this class • Introduction to NLP

Course web page • http://www.staff.zu.edu.eg/hmabobakr/userdownloads/post/CSE_620_ST_NLP.html • Reference site: • www.u.arizona.edu/~echan3/539.html

Syllabus • Instructor • Instructor: Hitham M. Abo Bkr • Office hours: Wednesday 11:30 – 12:00 • E-mail: hithamab@yahoo.com

Schedule office hours • I will create a doodle poll and e-mail the link to you • I am teaching 3 classes so there may be students from other classes in office hours


Course description • 3 Units. This course will introduce students to the computational methods used in modern Statistical Natural Language Processing: corpora, principles of machine learning, statistical models of linguistic structure, and evaluation of system performance. Many applications will be presented: parsing, language modeling, part of speech tagging, sentiment analysis, machine translation, word sense disambiguation, information extraction, and others.

Computational linguistics concepts that you should have seen • Corpus • Part of speech tag • N-gram • Regular expression • Word sense • Syntactic constituency • Phrase structure tree • Context-free grammar • Parsing • Elementary probability theory • Smoothing • These topics will be reviewed in this class as needed, but you should also read about them in the Jurafsky & Martin textbook

Coursework • Assignments: there will be up to six assignments, involving short answer questions and programming questions. Some assignments will involve using existing software, and others will require coding from scratch. Assignments will be reduced in size for students enrolled in 439. Students without substantial programming experience may work together in pairs, with the consent of the instructor. • There will not be any exams.

Grading • Each assignment will be given between 0 and 20 points. Late assignments will not be accepted. • The overall score for the course will be weighted according to these criteria: assignments 70%, attendance 30%. The course grade will be A, B, C, D, or E. Incompletes will not be offered. Grades for students enrolled in 439 and 539 will be calculated separately.

Programming • Proficiency in programming is assumed for this course. Some of the lectures and scripts for assignments may use the Python language and Numerical Python (contained within the numpy module). Numerical Python will be covered in class. Lectures on the Python language are available here: http://www.u.arizona.edu/~echan3/508.html • For assignments, students may use any programming language. It is recommended that a “mainstream” language be used.

“Required” textbooks • Daniel Jurafsky and James H. Martin. 2008. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Second Edition. Pearson/Prentice-Hall. • Stephen Marsland. 2009. Machine Learning: An Algorithmic Perspective. CRC Press. • http://www-ist.massey.ac.nz/smarsland/MLbook.html

Other books: machine learning • Christopher M. Bishop. 2006. Pattern Recognition and Machine Learning. Springer. • Richard O. Duda, Peter E. Hart, and David G. Stork. 2001. Pattern Classification, Second Edition. Wiley-Interscience. • Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2001. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer. • Stuart Russell and Peter Norvig. 2003. Artificial Intelligence: a Modern Approach, Second Edition. Pearson / Prentice-Hall. • Tom Mitchell. 1997. Machine Learning. WCB / McGraw-Hill. • Christopher Manning, Prabhakar Raghavan, Hinrich Schütze. 2008. Introduction to Information Retrieval, Cambridge University Press. • Thomas M. Cover and Joy A. Thomas. 1991. Elements of Information Theory. John Wiley & Sons.

An older NLP textbook • Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. 6th printing with corrections, 2003. The MIT Press. • Available for free through the UA library (see link on course web page)

Additional readings • Many lectures will refer to conference and journal papers, which will be linked to on the course web page. Students may find these papers useful as supplemental reading material.

For linguists • More linguists are using computational or data-based methods these days • Model the mind as a computer • But most linguists were not educated as computer scientists • In this class you’ll learn the principles behind NLP technologies, and how to use them appropriately

Get a job • Computational linguistics jobs • http://linguistlist.org/jobs/browse-job-current-rs-1.html • http://languagelog.ldc.upenn.edu/nll/?p=1067 • Software engineering jobs, looking for “natural language processing” • http://jobsearch.monster.com/PowerSearch.aspx?q=natural%20language%20processing&tm=60&sort=dt.rv&rad=20&rad_units=miles

NLP in the news • Commercially important area of research: sentiment analysis • http://www.nytimes.com/2009/08/24/technology/internet/24emotion.html?hpw • Expertise in the structure of other languages is needed as search engines adapt to other languages besides English • http://www.nytimes.com/2008/12/31/technology/internet/31hindi.html?_r=1&ref=business

NLP in the news • Automated trading programs that monitor the news, blog posts, twitter, etc. • http://www.nytimes.com/2010/12/23/business/23trading.html?hpw • Many of the computational techniques in statistical natural language processing can be applied to other domains, such as biotechnology and finance • http://www.nytimes.com/2009/08/06/technology/06stats.html

JANUARY 14, 2011 • Computer Conquers 'Jeopardy!' • http://online.wsj.com/article/SB10001424052748704307404576080333201294262.html

Name of field • Terms can be used interchangeably: • Computational linguistics • Natural Language Processing • Linguists say “computational linguistics” • Computer scientists and engineers like “Natural Language Processing”

Some NLP applications • Machine translation • Understanding text • e.g. persons, their relationships, and their actions • Search engines, answering users’ questions • Speech recognition • Modeling human language learning and processing

Low-level NLP tasks • Tokenizing a document • Sentence segmentation • Morphological analysis • Part of speech tagging • Syntactic parsing • Semantic analysis

Developing NLP systems • Before 1990: create systems by hand, by writing symbolic rules and grammars • Chomsky hierarchy • Regular: • Regular expressions, regular grammars, finite-state automata • Rewrite rules, Finite-state transducers • Context-free: • Context-free grammars, pushdown automata • Computational complexity of recognition

Example: recognition of names • Hillary Clinton • Hillary Diane Clinton • Hillary D. Clinton • Secretary of State Clinton • Hillary • H D Clinton • Hillary Rodham Clinton • H D R C • Mrs. Clinton • A name consists of an optional title followed by one of the following: (1) a first name or initial, an optional middle name (a first name, last name, or initial), and any number of last names; or (2) an optional first name or initial, an optional middle name, and at least one last name.
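The rule above can be written as a regular expression. The following is only a minimal sketch: the word lists TITLES, FIRST_NAMES, and LAST_NAMES are tiny hypothetical lexicons invented for illustration, and the helper names (alt, is_name) are not from the lecture.

```python
import re

# Hypothetical toy lexicons (assumptions); a real system would need far larger lists.
TITLES = ["Secretary of State", "Mrs.", "Mr.", "Dr."]
FIRST_NAMES = ["Hillary", "Diane"]
LAST_NAMES = ["Clinton", "Rodham"]

def alt(words):
    """Build a non-capturing alternation that matches any one of the given strings."""
    return "(?:" + "|".join(re.escape(w) for w in words) + ")"

INITIAL = r"(?:[A-Z]\.?)"  # a single capital letter, optionally followed by a period
FIRST = "(?:" + alt(FIRST_NAMES) + "|" + INITIAL + ")"
MIDDLE = "(?:" + alt(FIRST_NAMES) + "|" + alt(LAST_NAMES) + "|" + INITIAL + ")"
LAST = alt(LAST_NAMES)
TITLE = alt(TITLES)

# Option 1: first name or initial, optional middle name, any number of last names.
OPTION_ONE = FIRST + "(?: " + MIDDLE + ")?(?: " + LAST + ")*"
# Option 2: optional first name or initial, optional middle name, at least one last name.
OPTION_TWO = "(?:" + FIRST + " )?(?:" + MIDDLE + " )?" + LAST + "(?: " + LAST + ")*"

NAME_RE = re.compile("(?:" + TITLE + " )?(?:" + OPTION_ONE + "|" + OPTION_TWO + ")")

def is_name(text):
    """Return True if the whole string matches the hand-written name rule."""
    return NAME_RE.fullmatch(text) is not None

for candidate in ["Hillary D. Clinton", "Secretary of State Clinton", "H D R C", "rich baker"]:
    print(candidate, "->", is_name(candidate))
```

Under these assumptions the sketch accepts most of the variants listed above but rejects "H D R C", because the rule as stated requires last names to be full words. Gaps of this kind are typical of hand-written rules.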

Difficult cases • Foreign names • Wen Jiabao • Mikhail Khodorkovsky • Abdul Aziz bin Abdur Rahman Al Saud • Capitalization • hillaryclinton • HILLARY CLINTON • Ambiguities • Clinton, S.C. = location • rich baker = person or phrase? • john = person or toilet?

Problems with hand-written rules • Large number of rare names • Very difficult to provide coverage of all cases • Rules too general • Need information about how string appears in sentence • She became a rich baker by selling cupcakes. • Very difficult to specify exact combination of conditions for precise recognition • Hard to maintain system • Rule-based systems can get very, very large

1990s onward: statistical NLP • Availability of large annotated corpora • Corpus: electronic file of language usage • Annotations: linguistic structure is indicated in the corpus • Note: corpus = singular, corpora = plural • Not “corpuses” • Benefits of corpora • Catalogs actual language usage • Use to train machine learning algorithms • Quantitative evaluation

Example of a corpus: Brown corpus • Henry Kucera and W. Nelson Francis, 1967, Computational Analysis of Present-Day American English. • 500 texts, about 1.2 million words • “Balanced” corpus: texts from 15 different genres • Newspapers, editorials, literature, science, government documents, cookbooks, etc. • http://en.wikipedia.org/wiki/Brown_Corpus

Brown corpus: each word is followed by a slash and a code indicating its part of speech tag
Miami/np-hl ,/,-hl Fla./np-hl ,/,-hl March/np-hl 17/cd-hl --/-- The/at Orioles/nps tonight/nr retained/vbd the/at distinction/nn of/in being/beg the/at only/ap winless/jj team/nn among/in the/at eighteen/cd Major-League/nn-tl clubs/nns as/cs they/ppss dropped/vbd their/pp$ sixth/od straight/jj spring/nn exhibition/nn decision/nn ,/, this/dt one/cd to/in the/at Kansas/np-tl City/nn-tl Athletics/nns-tl by/in a/at score/nn of/in 5/cd to/in 3/cd ./.
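As a rough illustration of how the word/tag format can be read by a program, the sketch below splits each token at its last slash and counts the tags. The function name parse_tagged is hypothetical, and the sample is a fragment of the sentence above.

```python
from collections import Counter

def parse_tagged(line):
    """Split a line of word/tag tokens into (word, tag) pairs.

    Splitting at the LAST slash keeps tokens such as './.' and '--/--' intact.
    """
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

SAMPLE = ("The/at Orioles/nps tonight/nr retained/vbd the/at distinction/nn "
          "of/in being/beg the/at only/ap winless/jj team/nn")

pairs = parse_tagged(SAMPLE)
tag_counts = Counter(tag for _, tag in pairs)
print(pairs[:3])
print(tag_counts.most_common(1))  # the article tag 'at' occurs three times in this fragment
```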

Example of a corpus: Penn Treebank • 1.3 million words of Wall Street Journal articles • Manually annotated for syntactic structure • http://www.cis.upenn.edu/~treebank/ • Example sentence: • Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.

Format of sentences in Treebank file
( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))

Can draw a phrase structure tree

Build systems through machine learning • Annotated corpus → Machine learning algorithm → NLP system

General approach to (supervised) machine learning • Formulate problem in terms of the output labels to be predicted given input data • Have an annotated corpus containing labels for its data • Use the corpus to train a classifier • Apply the classifier to predict labels for new data
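The four steps above can be sketched with a very simple classifier: a part of speech tagger that assigns each word the tag it received most often in the annotated data. This baseline is chosen here only because it is short; the training pairs are a hypothetical toy corpus, and the names train and predict are illustrative.

```python
from collections import Counter, defaultdict

# Step 2: a (hypothetical, tiny) annotated corpus of (word, tag) pairs.
TRAINING_DATA = [
    ("the", "at"), ("team", "nn"), ("dropped", "vbd"),
    ("the", "at"), ("score", "nn"), ("score", "vb"), ("score", "nn"),
]

def train(tagged_words):
    """Step 3: learn the most frequent tag for each word, plus a fallback tag."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
        all_tags[tag] += 1
    default_tag = all_tags.most_common(1)[0][0]  # used for unseen words
    model = {word: c.most_common(1)[0][0] for word, c in counts.items()}
    return model, default_tag

def predict(model, default_tag, words):
    """Step 4: predict a tag (the output label) for each word of new data."""
    return [model.get(word.lower(), default_tag) for word in words]

model, default_tag = train(TRAINING_DATA)
print(predict(model, default_tag, ["The", "score", "clubs"]))
```

The unseen word "clubs" falls back to the most common tag overall, which shows in miniature why coverage of the training corpus matters.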

Algorithms for supervised classification • Many algorithms to predict the label of an item based on its features: • Perceptron • Decision Trees • Naïve Bayes • Neural Networks • Maximum Entropy • Support Vector Machines • Memory-Based Learning • Margin Infused Relaxed Algorithm • etc.
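As one concrete instance from this list, here is a minimal sketch of a binary perceptron over sparse features. The task (deciding whether a token is part of a person name), the feature strings, and the training examples are all hypothetical; the update rule is the standard one, adding or subtracting the features of each misclassified example.

```python
from collections import defaultdict

# Hypothetical training examples: (features of a token, label), label +1 = name, -1 = not a name.
EXAMPLES = [
    (["prev=Mrs.", "capitalized"], 1),
    (["prev=a", "lowercase"], -1),
    (["prev=Secretary", "capitalized"], 1),
    (["prev=the", "lowercase"], -1),
]

def train_perceptron(examples, epochs=10):
    """Learn one weight per feature by correcting mistakes on the training data."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in examples:
            score = sum(weights[f] for f in features)
            predicted = 1 if score > 0 else -1
            if predicted != label:
                # Move the weights of the active features toward the correct label.
                for f in features:
                    weights[f] += label
    return weights

def classify(weights, features):
    """Predict +1 or -1 from the sum of the weights of the active features."""
    score = sum(weights.get(f, 0.0) for f in features)
    return 1 if score > 0 else -1

WEIGHTS = train_perceptron(EXAMPLES)
print(classify(WEIGHTS, ["prev=Dr.", "capitalized"]))
print(classify(WEIGHTS, ["prev=of", "lowercase"]))
```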

Often not specific to language • Most common machine learning algorithms can be applied to any domain: • Handwriting recognition • Interpreting visual scenes • Predicting stock prices • Identifying proteins in DNA sequences • Though some algorithms were designed specifically for language • Utilize CFGs, finite automata, and other formalisms

Benefits of ML over hand-written rules • Fast system development (assuming a corpus) • Higher performance • Higher coverage of linguistic possibilities: rare constructions are encountered in corpora • Can handle ambiguity in language

Example of ambiguity: “I saw the man on the hill with a telescope” • Possible readings: • “I saw a man who was on a hill and who had a telescope.” • “I was on the hill that has a telescope when I saw a man.” • “I saw a man who was on the hill that has a telescope on it.” • “Using a telescope, I saw a man who was on a hill.” • “I was on the hill when I used the telescope to see a man.” • [Figure: diagram relating “me”, “see”, “a man”, “the telescope”, and “the hill”]

Dealing with ambiguity • Real-life sentences can easily have millions of parses • Not a solution to output all possible parses • Statistical NLP: • Count frequencies of different constructions in a corpus • Assign a probability to a parse • Output most likely parse
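The three steps under “Statistical NLP” can be illustrated with relative-frequency estimates of grammar rules. In the sketch below, all rule counts are hypothetical numbers invented for the example, the two candidate parses are simplified analyses of “saw the man with a telescope”, and math.prod requires Python 3.8 or later.

```python
from collections import Counter
from math import prod

# Step 1: counts of constructions (grammar rules) in a corpus. These numbers are made up.
RULE_COUNTS = Counter({
    ("VP", "V NP"): 60,
    ("VP", "V NP PP"): 40,
    ("NP", "Det N"): 70,
    ("NP", "NP PP"): 30,
    ("PP", "P NP"): 100,
})

def rule_probability(lhs, rhs):
    """Relative frequency of a rule among all rules with the same left-hand side."""
    total = sum(count for (left, _), count in RULE_COUNTS.items() if left == lhs)
    return RULE_COUNTS[(lhs, rhs)] / total

def parse_probability(rules):
    """Step 2: the probability of a parse is the product of its rule probabilities."""
    return prod(rule_probability(lhs, rhs) for lhs, rhs in rules)

# Two candidate parses, each listed as the rules it uses.
PARSES = {
    "verb attachment (saw ... with a telescope)": [
        ("VP", "V NP PP"), ("NP", "Det N"), ("PP", "P NP"), ("NP", "Det N"),
    ],
    "noun attachment (the man with a telescope)": [
        ("VP", "V NP"), ("NP", "NP PP"), ("NP", "Det N"), ("PP", "P NP"), ("NP", "Det N"),
    ],
}

# Step 3: output the most likely parse.
best = max(PARSES, key=lambda name: parse_probability(PARSES[name]))
for name, rules in PARSES.items():
    print(name, round(parse_probability(rules), 4))
print("most likely:", best)
```

With different counts the other reading would win; the point is only that the choice is driven by corpus frequencies rather than by a hand-written preference.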

Real examples of ambiguous sentences • KIDS MAKE NUTRITIOUS SNACKS • STOLEN PAINTING FOUND BY TREE • LUNG CANCER IN WOMEN MUSHROOMS • QUEEN MARY HAVING BOTTOM SCRAPED • DEALERS WILL HEAR CAR TALK AT NOON • MINERS REFUSE TO WORK AFTER DEATH • MILK DRINKERS ARE TURNING TO POWDER • DRUNK GETS NINE MONTHS IN VIOLIN CASE • JUVENILE COURT TO TRY SHOOTING DEFENDANT • COMPLAINTS ABOUT NBA REFEREES GROWING UGLY

Statistical NLP has its challenges, too • Expense of annotating corpora • Penn Treebank: 1.2 million words, $1 per word in 1990 • Languages of lesser commercial importance • Difficult to obtain data • Hard to get funding to pay for annotation • Domain specificity: a classifier trained on a particular corpus will not necessarily work well on another

Machine learning with less annotated data • We’ll also look at unsupervised and semi-supervised learning algorithms • Utilize corpora without annotations, or with only a small amount of annotation • Linguistic structure is built into the learner instead of explicitly indicated in the data
