Introduction & Information Theory Ling572 Advanced Statistical Methods in NLP January 3, 2012
Roadmap • Course Overview • Information theory
Course Information • Course web page: • http://courses.washington.edu/ling572 • Syllabus: • Schedule and readings • Links to other readings, slides, links to class recordings • Slides posted before class, but may be revised • Catalyst tools: • GoPost discussion board for class issues • CollectIt Dropbox for homework submission and TA comments • Gradebook for viewing all grades
GoPost Discussion Board • Main venue for course-related questions, discussion • What not to post: • Personal, confidential questions; Homework solutions • What to post: • Almost anything else course-related • Can someone explain…? • Is this really supposed to take this long to run? • Key location for class participation • Post questions or answers • Your discussion space: Michael & I will not jump in often
GoPost • Emily’s 5-minute rule: • If you’ve been stuck on a problem for more than 5 minutes, post to the GoPost! • Mechanics: • Please use your UW NetID as your user id • Please post early and often! • Don’t wait until the last minute • Keep up with the GoPost: it is hard to use retrospectively • Notifications: • Decide how you want to receive GoPost postings
Email • Should be used only for personal or confidential issues • Grading issues, extended absences, other problems • General questions/comments go on GoPost • Please send email from your UW account • Include Ling572 in the subject • If you don’t receive a reply in 24 hours (48 on weekends), please follow up
Homework Submission • All homework should be submitted through CollectIt • tar cvf hw1.tar hw1_dir • Homework due 11:45 Thursdays • Late homework receives 10%/day penalty (incremental) • Most major programming languages accepted • C/C++/C#, Java, Python, Perl, Ruby • If you want to use something else, please check first • Please follow naming, organization guidelines in HW • All programming assignments should run on the CL cluster under Condor
Homework Assignments • (Mostly) Implementation tasks designed to get hands-on understanding of ML approaches • Focus on core concepts, not minute optimizations • If gold standard achieves 90.7%, 89.8% is okay • Not scored directly on efficiency, but… • If it’s too slow, it’s hard to debug, test, etc. • Not scored on optimal software design either • Try to avoid hardcoding, but don’t need complex design
Grading • Homework assignments: 80% • Reading assignments: 10% • Class participation: 10% • No midterm or final exams • One homework assignment may be dropped
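The weighting above can be sketched as a short calculation. This is only an illustration: the 80/10/10 weights come from the slide, but the function name `final_grade`, the 0-100 scale, and the choice to drop the lowest homework score are assumptions (the slide only says one homework may be dropped).

```python
def final_grade(homework, reading, participation):
    """Weighted course grade on a 0-100 scale.

    Weights (80% homework, 10% reading, 10% participation) follow the slide.
    Dropping the *lowest* homework score is an assumption made for this sketch.
    """
    if len(homework) < 2:
        raise ValueError("need at least two homework scores to drop one")
    kept = sorted(homework)[1:]        # discard the single lowest score
    hw_avg = sum(kept) / len(kept)     # average of the remaining homework
    return 0.80 * hw_avg + 0.10 * reading + 0.10 * participation


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    print(final_grade([90, 80, 100, 70], reading=100, participation=50))
```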
Grades • Grades in Catalyst Gradebook • TA feedback returned through CollectIt • Extensions: only for extreme circumstances • Illness, family emergencies • Incomplete: only if all work is completed up to the last two weeks • UW policy
Workload • CLMS courses carry a heavy workload • Ling572 is no exception • Estimates (per week): • ~3 hours: Lecture • 10-12 hours: Homework assignments • Highly variable, depending on prior programming experience • 1-3 hours: Reading + reading assignments • Tracking: • GoPost thread for each assignment: please post • Consider an automatic time tracker (e.g. ‘hamster’ for Linux)
Recordings • All classes will be recorded • Links to recordings appear in syllabus • Available to all students, DL and in class • Please remind me to: • Record the meeting (look for the red dot) • Repeat in-class questions • Note: Instructor’s screen is projected in class • Assume that chat window is always public
Contact Info • Gina: Email: levow@uw.edu • Office hour: • Fridays: 12:30-1:30 (after Treehouse meeting) • Location: Padelford B-201 • Or by arrangement • Available by Skype or Adobe Connect • TA: Michael Wayne Goodman: • Email: goodmami@uw.edu • Office hour: Time: TBD, see GoPost • Location: Treehouse
Online Option • Please check you are registered for the correct section • CLMS in-class: Section A • State-funded: Section B • CLMS online: Section C • Online attendance for in-class students • Not more than 2 times per term (e.g. missed bus, ice) • Please enter the meeting room 5-10 minutes before start of class • Try to stay online throughout class
Online Tip • If you see: • You are not logged into Connect. The problem is one of the following: • The permissions on the resource you are trying to access are incorrectly set. Please contact your instructor/Meeting Host/etc. • You do not have a Connect account but need to have one. For UWEO students: • If you have just created your UW NetID or just enrolled in a course • ….. • Fix: clear your cache, then close and restart your browser
Course Prerequisites • Programming Languages: • Java/C++/Python/Perl/.. • Operating Systems: Basic Unix/Linux • CS 326 (Data structures) or equivalent • Lists, trees, queues, stacks, hash tables, … • Sorting, searching, dynamic programming, … • Stat 391 (Probability and statistics): random variables, conditional probability, Bayes’ rule, … • Ling 570 (or similar) • If you haven’t taken Ling 570 or Ling 472, please email me.
Textbook • No textbook • Online readings • Reference / Background: • Jurafsky and Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 2nd edition, 2008 • Available from UW Bookstore, Amazon, etc. • Manning and Schütze, Foundations of Statistical Natural Language Processing • Early edition available online through UW library
Course Goals • Understand the basis of machine learning algorithms that achieve state-of-the-art results • Focus on classification and sequence labeling • Concentrate on basic concepts of machine learning techniques and application to NLP tasks • Not a computational learning theory class • Won’t focus on proofs
Model Questions • Machine learning algorithms • Decision trees and Naïve Bayes • MaxEnt and Support Vector Machines • …. • Key questions • What is the model? • What assumptions does the model make? • How many parameters does the model have?
Model Questions • Training: How are the parameters learned? • Decoding: How does the model assign values? • Pros and Cons: • How does the model handle… • outliers? missing data? noisy data? • Is it scalable? • How long does it take to train? decode? • How much training data is needed? Labeled? Unlabeled?
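As one worked instance of the question "How many parameters does the model have?", the sketch below counts the free parameters of a Bernoulli Naïve Bayes classifier. The model variant, the function name `bernoulli_nb_param_count`, and the example sizes are assumptions chosen for illustration; the slides do not specify them.

```python
def bernoulli_nb_param_count(num_classes, num_features):
    """Free parameters of a Bernoulli Naive Bayes model (illustrative sketch).

    Class priors P(c): num_classes - 1 free values, since they sum to 1.
    Likelihoods P(f = 1 | c): one value per (class, binary feature) pair.
    """
    priors = num_classes - 1
    likelihoods = num_classes * num_features
    return priors + likelihoods


if __name__ == "__main__":
    # Hypothetical task: 3 classes, 1000 binary features.
    print(bernoulli_nb_param_count(3, 1000))
```

The count grows linearly in both the number of classes and the number of features, which is a direct consequence of the model's conditional independence assumption.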
Tentative Outline for Ling572 • Unit #0 (0.5 weeks): Basics • Introduction • Information Theory • Classification review • Unit #1 (3 weeks): Classic Machine Learning • K Nearest Neighbors • Decision Trees • Naïve Bayes • Perceptrons (?)
Outline for Ling572 • Unit #3: (4 weeks): Discriminative Classifiers • Feature Selection • Maximum Entropy Models • Support Vector Machines • Unit #4: (1.5 weeks): Sequence Learning • Conditional Random Fields • Transformation Based Learning • Unit #5: (1 week): Other Topics • Semi-supervised learning,…
Outline for Ling572 • Topics: • Feature selection approaches • Beam search • Toolkits: • Mallet, libSVM • Using binary classifiers for multiclass classification
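One common way to use binary classifiers for multiclass classification is a one-vs-rest reduction, sketched below. This is not necessarily the approach the course covers. The function names and the toy difference-of-centroids binary learner are hypothetical stand-ins for whatever binary classifier is actually used (e.g. an SVM).

```python
def train_centroid_classifier(examples, binary_labels):
    """Toy binary learner (hypothetical): a linear scorer built from class centroids.

    Returns a function x -> score; positive scores favor the positive class.
    """
    pos = [x for x, b in zip(examples, binary_labels) if b == 1]
    neg = [x for x, b in zip(examples, binary_labels) if b == 0]
    dim = len(examples[0])
    mu_pos = [sum(x[i] for x in pos) / len(pos) for i in range(dim)]
    mu_neg = [sum(x[i] for x in neg) / len(neg) for i in range(dim)]
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    # Bias places the decision boundary at the midpoint between the centroids.
    bias = -sum(wi * (p + n) / 2 for wi, p, n in zip(w, mu_pos, mu_neg))
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + bias


def train_one_vs_rest(examples, labels, train_binary):
    """Train one binary scorer per class: that class (1) versus all others (0)."""
    models = {}
    for cls in sorted(set(labels)):
        binary = [1 if y == cls else 0 for y in labels]
        models[cls] = train_binary(examples, binary)
    return models


def predict_one_vs_rest(models, x):
    """Predict the class whose binary scorer gives x the highest score."""
    return max(models, key=lambda cls: models[cls](x))


if __name__ == "__main__":
    # Made-up 2-D data with three well-separated classes.
    X = [(4, 0), (5, 1), (0, 4), (1, 5), (-4, -4), (-5, -5)]
    y = ["A", "A", "B", "B", "C", "C"]
    models = train_one_vs_rest(X, y, train_centroid_classifier)
    print(predict_one_vs_rest(models, (5, 0)))
```

Taking the argmax over raw scores assumes the per-class scores are comparable, which holds only loosely for independently trained classifiers.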
Early NLP • Early approaches to Natural Language Processing • Similar to classic approaches to Artificial Intelligence • Reasoning, knowledge-intensive approaches