CSC 439/539 Statistical Natural Language Processing Lecture 1: Introduction. Mihai Surdeanu Fall 2017. Take-away. Why you should take this course Admin issues First homework due in 1 week! What topics will be covered in this class?. Language is hard …. “Beating up” other languages.
CSC 439/539 Statistical Natural Language Processing Lecture 1: Introduction Mihai Surdeanu Fall 2017
Take-away • Why you should take this course • Admin issues • First homework due in 1 week! • What topics will be covered in this class?
“Beating up” other languages • Why do we eat “pork” and “beef” but we raise “pigs” and “cows”? • What is the percentage of cognates with French in English?
Who did what to whom? “Our company is training workers.” Correct: “is training” as a verb group
Who did what to whom? Incorrect: “training” as a gerund, as in: “Our problem is training workers.”
Who did what to whom? Incorrect: “training” modifies “workers”, as in: “Those are training wheels.”
Ambiguity and selectional preferences I swallowed a bug while running. What selectional preferences would you add for the verb “swallow”? I swallowed his story, hook, line, and sinker. The supernova swallowed the planet.
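One way to make the “swallow” question concrete is a toy check of selectional preferences. The sketch below is only an illustration: the semantic classes, the preference table, and the function name `violations` are all hypothetical and hand-written here, not part of the course material. A real system would learn such preferences from data or take them from a lexical resource.

```python
# Hypothetical, hand-written semantic classes for a few nouns (assumption).
SEMANTIC_CLASS = {
    "I": "animate",
    "bug": "animate",
    "pill": "edible",
    "story": "abstract",
    "supernova": "celestial",
    "planet": "celestial",
}

# Assumed preferences for the literal sense of "swallow":
# an animate subject and an edible or animate (small, physical) object.
PREFERENCES = {
    "swallow": {"subject": {"animate"}, "object": {"edible", "animate"}},
}


def violations(verb, subject, obj):
    """Return the argument slots whose filler breaks the verb's preferences."""
    prefs = PREFERENCES[verb]
    broken = []
    if SEMANTIC_CLASS.get(subject) not in prefs["subject"]:
        broken.append("subject")
    if SEMANTIC_CLASS.get(obj) not in prefs["object"]:
        broken.append("object")
    return broken


# A non-empty result hints at a figurative reading rather than an error.
for subject, obj in [("I", "bug"), ("I", "story"), ("supernova", "planet")]:
    print(subject, "swallow", obj, "->", violations("swallow", subject, obj))
```

Under these assumed preferences, the literal sentence passes, while the “story” and “supernova” sentences are flagged, which is the signal that a metaphorical sense of the verb is in play.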
Variability Slide by Yoav Goldberg
NLP in a nutshell Slide by Yoav Goldberg
NLP Applications: Question Answering • When athletes begin to exercise, their heart rates and respiration rates increase. At what level of organization does the human body coordinate these functions? • A: at the tissue level • B: at the organ level • C: at the system level • D: at the cellular level • Unsolved problem! • Needs inference • Very little training data
Machine reading/Information extraction Human domain experts are around here
And many others… • Can you suggest a few other NLP applications?
Overview • Administration • First homework • Course overview
Instructor information • Instructor: Mihai Surdeanu • Email: msurdeanu@email.arizona.edu • Office: Gould-Simpson 746 • Office hours: Tue 12:30 - 2 • TAs:
Websites • Website/syllabus: • http://surdeanu.info/mihai/teaching/ling4539-fall17/index.php • But all material will be in D2L • Discussions on Piazza: • https://piazza.com/arizona/fall2017/ling439539/home
Prerequisites • Know how to program and have a decent understanding of data structures such as hash maps and trees. Have a basic understanding of computational linguistics: • Ling 438/538 or CSC 483/583 • Ideally, Math 129 (Calc 2). However, we will cover the necessary math in class.
The options • Python • “Official” language in this course • Java • Scala
Python • Pros: • Clean syntax • Popular: many NLP/ML libraries exist • Clean exception handling • Easy access to GPUs (for deep learning) • Cons: • Slow (when not on GPU) • Dynamically typed • No great IDE
Java • Pros: • Pretty fast • Probably the most common language for serious NLP • Clean exception handling • Statically typed • Garbage collection • Several great IDEs • Cons: • Syntax too verbose • Inconsistent semantics due to enforced backwards compatibility (primitive types vs. objects, equality, etc.)
Scala • Pros: • Pretty fast • “Hot” language for IR, NLP, ML, distributed computing, web development • Clean, transparent exception handling • Clean, minimalist syntax • Consistent semantics • Statically typed • Garbage collection • At least one great IDE (IntelliJ) • Fully compatible with Java (use all Java libraries) • Cons: • It has some “dark corners” • Backwards compatibility not guaranteed • No deep learning library native to Scala
Performance comparison More benchmarks: http://benchmarksgame.alioth.debian.org/u64/benchmark.php?test=all&lang=all&data=u64
Textbook http://nlp.stanford.edu/fsnlp I will provide all the other additional materials.
Final project • Implement a complete solution of a relevant NLP application or component. • You can choose your own, but each must be validated by the instructor. • For example:
Late work + attendance policy • Late work is not accepted, except in the case of a documented emergency approved by the instructor • Attendance is required • Students who miss class due to illness or emergency are required to bring documentation
Cooperation and cheating • Students are encouraged to share intellectual views and discuss freely the principles and applications of course materials. However, graded work/exercises must be the product of independent effort unless otherwise instructed. • We will use methods for plagiarism detection! • Students who violate the code of academic integrity should expect a penalty that is greater than the value of the work in question up to and including failing the course. • A record of the incident will be sent to the Dean of Students office. If you have been involved in other Code violations, the Dean of Students may impose additional sanctions.
Undergraduate vs. graduate requirements • This course will be co-convened. To differentiate between graduate and undergraduate students, the instructor will require graduate students to implement more complex algorithms for the programming project. Similarly, assignments and exams will have additional requirements/questions for graduate students. • The overall grading scheme will be the same between graduate and undergraduate students.
Overview • Administration • First homework • Course overview
First homework • Due Sunday night (8/27)! • Let’s take a look
Overview • Administration • First homework • Course overview
Part 1: Text categorization and a crash course in machine learning
Algorithms for classification: from kNN to feed-forward neural networks
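As a preview of the simplest algorithm on this list, here is a minimal sketch of k-nearest-neighbor (kNN) text categorization. It assumes bag-of-words vectors, cosine similarity, and a majority vote among the k closest training documents. The tiny training set and the function names are made up for illustration and are not taken from the course homework.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count the tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(text, training, k=3):
    """Label `text` by majority vote among its k most similar training documents."""
    query = bag_of_words(text)
    ranked = sorted(
        training,
        key=lambda example: cosine(query, bag_of_words(example[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training data, invented for this example (assumption).
TRAIN = [
    ("the team won the game last night", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the coach praised the team defense", "sports"),
    ("the senate passed the budget bill", "politics"),
    ("the president signed the new bill", "politics"),
    ("voters backed the senate candidate", "politics"),
]

print(knn_classify("the team scored a goal", TRAIN))
print(knn_classify("the senate signed the bill", TRAIN))
```

kNN has no training step: all the work happens at prediction time, which makes it easy to implement but slow on large collections. Raw counts also let frequent words such as “the” dominate the similarity, so a weighting scheme such as tf-idf is a common refinement.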