Introduction to NLP • CSCI 544: Applied Natural Language Processing • Nanyun (Violet) Peng • Course website: https://violetpeng.github.io/cs544_fa19.html (syllabus, announcements, slides, homeworks)
Who are you? • Department • Stage • Goal
Goals of the field Computers would be a lot more useful if they could handle our email, do our library research, talk to us … But they are fazed by natural human language. How can we tell computers about language? (Or help them learn it as kids do?)
NLP is a pretty old topic! ACL founded: 1962
Launched 2006 App 2011 WordLens 2015
Where else have you seen NLP in your life, in the news, or elsewhere?
Actually, NLP of the present (as of May 2019)! https://www.youtube.com/watch?v=G_v5B_gYceM
NLP in action • Word Sense Disambiguation • Ambiguity Resolution • Information Retrieval • Grounding • Generation • Summarization • Coreference Resolution • Named Entity Recognition • Spelling correction (h/t Semantic Machines)
A few applications of NLP • Spelling correction, grammar checking … • Better search engines • Information extraction • Psychotherapy; Storytelling; etc. • New interfaces: • Speech recognition (and text-to-speech) • Dialogue systems (USS Enterprise onboard computer) • Machine translation (the Babel fish)
Goals of the course • Introduce you to NLP problems & solutions • Relation to linguistics & statistics • At the end you should: • Agree that language is subtle & interesting • Feel some ownership over the formal & statistical models • Understand research papers in the field
Ambiguity: Favorite Headlines • Iraqi Head Seeks Arms • Juvenile Court to Try Shooting Defendant • Teacher Strikes Idle Kids • Stolen Painting Found by Tree • Kids Make Nutritious Snacks • Local High School Dropouts Cut in Half • Obesity Study Looks for Larger Test Group
What’s NLP? NL = {Mandarin Chinese, English, Spanish, Hindi, ..., Uyghur, ..., Urdu ...} Anybody speak a rare language? • Automation of • Understanding (NL -> R) • Generation (R -> NL) • Acquisition of R from knowledge and data. • What’s R?
Levels of Language • Phonetics/phonology/morphology: what words (or subwords) are we dealing with? • Syntax: What phrases are we dealing with? Which words modify one another? • Semantics: What’s the literal meaning? • Pragmatics: What should you conclude from the fact that I said something? How should you react?
Levels of Language • Phonetics/phonology/morphology: what words (or subwords) are we dealing with? • cat, cats, dog, dogs(z), box, boxes. • Syntax: What phrases are we dealing with? Which words modify one another? How do words fit together? [Figure: parse tree of "the blue boat sailed home": S splits into NP and VP; the NP is the/DT blue/JJ boat/NN; the VP is "sailed home".] Note similarity to programming languages!
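To make the "similarity to programming languages" concrete, here is a minimal sketch that stores the slide's parse tree as nested tuples and prints it in bracket notation. The tuple encoding and the function names `leaves` and `bracketed` are illustrative assumptions, not part of the course materials; the VP is left untagged because the slide gives no tags for it.

```python
# Hypothetical nested-tuple encoding of the parse tree on the slide.
# Each node is (label, child, child, ...); a leaf is a plain string.
TREE = (
    "S",
    ("NP", ("DT", "the"), ("JJ", "blue"), ("NN", "boat")),
    ("VP", "sailed", "home"),  # the slide shows no tags inside the VP
)


def leaves(node):
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words


def bracketed(node):
    """Render the tree in bracket notation, much like nested code blocks."""
    if isinstance(node, str):
        return node
    label, children = node[0], node[1:]
    return "(" + label + " " + " ".join(bracketed(c) for c in children) + ")"


print(" ".join(leaves(TREE)))  # the blue boat sailed home
print(bracketed(TREE))         # (S (NP (DT the) (JJ blue) (NN boat)) (VP sailed home))
```

The recursion mirrors the tree: a phrase is a label plus sub-phrases, just as an expression in a programming language is an operator plus sub-expressions.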
Levels of Language • Semantics: What’s the literal meaning? What does a sentence mean? Papa eats the caviar with a spoon. [Figure: semantic graph centered on eat-01, with agent papa-01, caviar-01, and tool "with a spoon".] • papa-01: an informal term for a father • eat-01: take in solid food • caviar-01: salted roe of sturgeon or other large fish
Levels of Language • Pragmatics: What should you conclude from the fact that I said something? How should you react? What does the speaker mean? (context: in a stuffy room) "Can you crack the window a little bit?" • No, I do not have a hammer. • Yes, I have the ability to crack the window. • Sure (opens the window), does this feel better?
What’s hard about this story? John stopped at the donut store on his way home from work. He thought a coffee was good every few hours. But it turned out to be too expensive there.
What’s hard about this story? John stopped at the donut store on his way home from work. He thought a coffee was good every few hours. But it turned out to be too expensive there. To get a donut (spare tire) for his car?
What’s hard about this story? John stopped at the donut store on his way home from work. He thought a coffee was good every few hours. But it turned out to be too expensive there. store where donuts shop? or is run by donuts? or looks like a big donut? or made of donut?
What’s hard about this story? I stopped smoking freshman year, but John stopped at the donut store on his way home from work. He thought a coffee was good every few hours. But it turned out to be too expensive there.
What’s hard about this story? John stopped at the donut store on his way home from work. He thought a coffee was good every few hours. But it turned out to be too expensive there. Describes where the store is? Or when he stopped?
What’s hard about this story? John stopped at the donut store on his way home from work. He thought a coffee was good every few hours. But it turned out to be too expensive there. Well, actually, he stopped there from hunger and exhaustion, not just from work.
What’s hard about this story? John stopped at the donut store on his way home from work. He thought a coffee was good every few hours. But it turned out to be too expensive there. At that moment, or habitually? (Similarly: Mozart composed music.)
What’s hard about this story? John stopped at the donut store on his way home from work. He thought a coffee was good every few hours. But it turned out to be too expensive there. That’s how often he thought it?
What’s hard about this story? John stopped at the donut store on his way home from work. He thought a coffee was good every few hours. But it turned out to be too expensive there. But actually, a coffee only stays good for about 10 minutes before it gets cold.
What’s hard about this story? John stopped at the donut store on his way home from work. He thought a coffee was good every few hours. But it turned out to be too expensive there. Similarly: In America a woman has a baby every 15 minutes. Our job is to find that woman and stop her.
What’s hard about this story? John stopped at the donut store on his way home from work. He thought a coffee was good every few hours. But it turned out to be too expensive there. the particular coffee that was good every few hours? the donut store? the situation?
What’s hard about this story? John stopped at the donut store on his way home from work. He thought a coffee was good every few hours. But it turned out to be too expensive there. too expensive for what? what are we supposed to conclude about what John did? how do we connect “it” to “expensive”?
How do we (humans) do these tasks? • Spelling Correction • Named Entity Extraction • Question Answering • Coreference Resolution • Grounding • Ambiguity Resolution • Summarization • Translation Long story short, you know languages!
What You Will Learn In This Course (I hope) • How the NLP you interact with on a daily basis works • The various models, algorithms, and tools that are out there to solve language-related problems you want to tackle • How to design and evaluate your own task/model/algorithm/tool • Using data and statistics • Using your own creativity • Using the latest advances • The currently open research questions, so you can pursue further study
Questions about Intro? Next is Boring Course Stuff That Is Very Important
Why You May Not Want To Take This Course • It's not an easy course • Last year, 52% received an A or A-. But 10% received a B-! • This is after scaling. The unscaled mean grade was 81.3%. • If you "need" a particular grade in the course, see me now so you can be advised about how realistic your request is. • Be prepared to work • You should have worked with probability and statistics before • You should have taken calculus and linear algebra • You should know how to program at the level of a CS undergrad senior or better (Python in particular) • We have a research/exploration-oriented project • The exam questions will require original thought, not just memorization
Why You May Not Want To Take This Course (Cont.) • It's not necessarily "comfortable" • Late in the evening and no DEN (not under my control, you should email the department!) • Some of the assignments are competitive: your score is based on how well you do relative to your classmates • If we discover you have cheated, you'll probably get a C- or worse in the course
Cheating • You MAY • talk with other students, friends, or others about your homework assignments IF you acknowledge such discussion in your submission • ask questions about the homework and subject material in the forums • You MAY NOT • copy code or answers from any source including friends, homework/test services, NLP or other software libraries. This includes making slight changes to previously written code • find solutions to these problems online • share solutions to these problems online • hack the scoring servers, Kobayashi Maru-style • allow your code to be copied, even if unintentionally • attempt to communicate with or read from any other person or device while taking exams
Cheating • Unfortunately, about 10% of you will be caught cheating, based on previous experience. • Suspected cheats (including those who were plagiarized from) will be reported to the University. Punishment includes but is not limited to: • zero on assignment, exam, or class • Loss of career services privileges • Loss of CPT rights • Uncomfortable meeting with Lizsl
How do we catch cheating? MOSS! http://theory.stanford.edu/~aiken/moss/
The Team • Nanyun (instructor) • Specialized in Machine Learning, Information Extraction, Creative Language Generation, Morphology/Phonology • Rujun Han (TA) • Sachin Balakrishna (Grader) • Kamya Batra (Grader)
Support • DO NOT EMAIL US (sorry) • Piazza Discussion Forums • Sign up at piazza.com/usc/fall2019/csci544, access code: 1544 • TAs and Instructor will monitor regularly, answer questions, engage in discussion • You should answer each other's questions and help your fellow students out! • Outstanding student contributors will receive a grade bump-up as extra credit (e.g. B+ to A-, A- to A) [see extra credit details in 8 slides] • Blackboard • Location of homework assignment pdfs • Ultimately, location of grades
Support (Cont.) • Vocareum • Homework test beds and code submission • Crowdmark • Homework written submission and graded exam review • Office Hours • Nanyun: Wed. 4:30-5:30pm (i.e. before class), RTH 512; possibly Friday the same time if there is demand • Rujun: Fri. 1-3pm, SAL lab
Syllabus and Schedule • https://violetpeng.github.io/cs544_fa19.html (bookmark this page!) • Schedule will probably change depending on our speed; check back frequently • No official textbook; readings will be posted on the class website • Unofficial "textbook" = Jurafsky and Martin, Speech and Language Processing 3rd edition; only available (in incomplete form) here: https://web.stanford.edu/~jurafsky/slp3/ • If you have specific issues preventing you from attending see/message me immediately • If you have medical/other accommodation needs see/message me ASAP (well before the midterm)
Lecture and Notes • You don't have to come to class if you don't want to/can't • If you do, please pay attention and participate! • However, you are responsible for everything covered in class, and • It won't be recorded • Slides may not cover everything I discuss • Slides will be posted soon after class (possibly before, but they might be updated) on the course website. • I use a lot of slides from other classes and note this; feel free to self-educate.
Prerequisites • I expect you to program at the level of a CS undergrad senior or better • Most of the assignments will be in Python • There will be basic probability and statistics, which will be reviewed as needed
Homeworks • 4 homeworks, 2-3 weeks to do each one, they do not overlap (i.e. you will not receive hw2 before hw1 is due). 10% of your grade each. • No homework will be assigned or due right before midterm/final • Mostly programming assignments submitted to Vocareum (you should have received an email opening your account, let us know if not!) • Some written assignments submitted electronically • You can have 6 late days total over the whole course... • ...but no more than 2 per assignment • Late homeworks thereafter are penalized by 50% for the next 24 hours, then not accepted
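The late-day rules above can be read as a small calculation. The sketch below is one possible reading, not the official policy: it assumes lateness is counted in whole days, that free late days are applied first (at most 2 per assignment, 6 in total), that one further day costs 50%, and that a submission later than that is not accepted and spends no free days. The function name `late_multiplier` and the constants are hypothetical.

```python
MAX_FREE_TOTAL = 6   # free late days over the whole course (from the slide)
MAX_FREE_PER_HW = 2  # free late days usable on a single homework (from the slide)


def late_multiplier(days_late, free_days_left=MAX_FREE_TOTAL):
    """Return (score multiplier, free late days left afterwards).

    Assumed reading: free days cover the first days of lateness, the next
    24 hours are penalized by 50%, and anything later is not accepted.
    """
    free_used = min(days_late, MAX_FREE_PER_HW, free_days_left)
    uncovered = days_late - free_used
    if uncovered == 0:
        return 1.0, free_days_left - free_used
    if uncovered == 1:
        return 0.5, free_days_left - free_used
    # Not accepted; assume no free days are spent on a rejected submission.
    return 0.0, free_days_left


print(late_multiplier(2))  # (1.0, 4): covered entirely by free late days
print(late_multiplier(3))  # (0.5, 4): two free days, then the 50% window
print(late_multiplier(4))  # (0.0, 6): too late to be accepted
```

When in doubt about an edge case (for example, partial days), ask on the forum rather than relying on this reading.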
Project • There will be a carefully guided project! • Get a team (up to 4); everyone gets the same grade • Write up what you will do • Submit preliminaries by the "Project proposal due" date • Main project will likely consist of • Yelp data analysis • Event (relation) extraction • Counts for 20% of the grade! • Research oriented!
Exams • Midterm: October 4 in class • short answers, multiple choice, some derivations, some pencil & paper calculations • one double-sided page of notes allowed (may be prepared with other current 544 students) • Final: TBD • like the midterm, but can cover the whole class (emphasis will be on second half) • one double-sided page of notes allowed (may be prepared with other current 544 students) • Whichever one of these you score better on (as a percentage) will count toward 25% of your grade; the other will count toward 15% of your grade
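Putting the stated weights together (4 homeworks at 10% each, project 20%, better exam 25%, other exam 15%) gives 100%. The following is a minimal sketch of that arithmetic, assuming every component is a percentage from 0 to 100; the function name `course_score` is hypothetical, and the scaling and extra-credit bumps mentioned elsewhere in the slides are not modeled.

```python
def course_score(homeworks, project, midterm, final):
    """Weighted course percentage before any scaling or extra credit."""
    if len(homeworks) != 4:
        raise ValueError("the course has exactly 4 homeworks")
    better = max(midterm, final)  # counts toward 25% of the grade
    worse = min(midterm, final)   # counts toward 15% of the grade
    return (
        0.10 * sum(homeworks)     # 4 homeworks x 10% each
        + 0.20 * project
        + 0.25 * better
        + 0.15 * worse
    )


# Example: a weak midterm hurts less if the final goes well.
score = course_score([90, 80, 85, 95], project=88, midterm=70, final=90)
print(round(score, 1))  # 85.6
```

Because the better exam always gets the larger weight, swapping the midterm and final scores leaves the result unchanged.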
Extra Credit • Outstanding forum contributors may have their grade bumped a category (e.g. B- to B, B+ to A-). • The number of and determination of such contributors is up to the staff and is not eligible for regrade. • Occasionally there may be extra credit points in a homework. These will offset other point losses in that homework (i.e. they do not affect other homeworks/exams and cannot result in a >100% score).
Grading/Regrading Policy • No changes are allowed to submitted homework (after the deadline) • If something is clearly wrong, you may request a regrade of a specific question/part via a Google form the TAs will send out. • WARNING: If you are just 'fishing' for points you may LOSE additional points. No grubs!
Next Time... • Probability concepts • n-grams, language models