Shallow Processing: Recap and Domain Adaptation. Shallow Processing Techniques for NLP, Ling570, Day 21 - December 6, 2012
Roadmap • MT & Domain Adapt • Looking back: • Topics covered • Tools and Data • Looking forward • Upcoming courses
Domain Adaptation Slides 4-19 adapted from Barry Haddow’s slides
The Battle of Word Senses and Domains • Most words have multiple senses • Cross-lingual mapping difficult for all contexts • Senses are often “domain” specific. Fun with Tables and Chairs (into German): • Table → Tisch (general usage), Tabelle (tech usage) • Chair → Stuhl (general usage), Vorsitzende (governmental usage)
The Battle of Word Senses and Domains: Contexts • Table: The food is on the table. → Das Essen ist auf dem Tisch. The results are in the table. → Die Ergebnisse sind in der Tabelle. • Chair: He sat on the chair. → Er saß auf dem Stuhl. He is chair of the committee. → Er ist Vorsitzender des Ausschusses.
Domain? • Not a well-defined concept • Should be based on some notion of textual similarity • Lexical choice • Grammar • What level of granularity? • News → Sports → Football
Examples of Domains • Europarl (ep) • European parliamentary proceedings • News-commentary (nc) • Analysis of current affairs • Subtitles (st) • Film subtitles
Europarl Resumption of the session I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period. Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful. You have requested a debate on this subject in the course of the next few days, during this part-session.
News Commentary Musharraf's Last Act? Desperate to hold onto power, Pervez Musharraf has discarded Pakistan's constitutional framework and declared a state of emergency. His goal? To stifle the independent judiciary and free media. Artfully, though shamelessly, he has tried to sell this action as an effort to bring about stability and help fight the war on terror more effectively.
Subtitles I'll call in 30 minutes to check Oh, hello Fujio Is your mother here, too? Why are you outside? It's no fun listening to women's talk Well, why don't we go in together
Translation Performance • Measure effect of domain on performance • Train two systems • One using in-domain data (nc or st) • One using out-of-domain data (ep) • Test on in-domain data (nc or st) • Only vary translation model
NC vs. EP: Example • Source: • Veamos el mercado accionario de los Estados Unidos, el mayor del mundo por mucho. • Translation - ep • Consider the US accionario market, the world's largest by much. • Translation - nc • Consider the US stock market, the largest by far.
Domain Adaptation Techniques • Data selection • Filtering • Data weighting • Corpus weighting • Model interpolation • Topic-based clustering and weighting • Phrase or sentence weighting • Enlarging the data set • Web crawling • Extraction from monolingual data • Self-training
Data Selection • Suppose we have: • A small in-domain corpus I • A large out-of-domain corpus O • Select data from O which is similar to I • Equivalent to weighting the sentences with a 1-0 weighting • Would be better not to discard data...
Data Selection for Translation Models • Modified Moore-Lewis (Axelrod et al., EMNLP 2011; Axelrod et al., IWSLT 2012) • Moore-Lewis selects by the perplexity difference between I and O • For the TM, apply this to both the source and target sentences • Selects sentences most like I and least like O (see the sketch below)
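To make the selection criterion concrete, here is a minimal Python sketch of the bilingual (modified) Moore-Lewis score, using add-alpha-smoothed unigram language models as stand-ins for real LMs; the function and variable names (unigram_lm, mml_score, the four LM handles) are illustrative, not from the papers.

```python
import math
from collections import Counter

def unigram_lm(corpus_tokens, vocab_size, alpha=1.0):
    """Return an add-alpha-smoothed unigram log-probability function."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    def logprob(token):
        return math.log((counts[token] + alpha) / (total + alpha * vocab_size))
    return logprob

def cross_entropy(sentence, logprob):
    """Per-token cross-entropy of a sentence under a unigram LM."""
    tokens = sentence.split()
    return -sum(logprob(t) for t in tokens) / max(len(tokens), 1)

def mml_score(src, tgt, lm_I_src, lm_O_src, lm_I_tgt, lm_O_tgt):
    """Bilingual Moore-Lewis score: lower = more like I, less like O.
    Sums the source-side and target-side cross-entropy differences."""
    return ((cross_entropy(src, lm_I_src) - cross_entropy(src, lm_O_src))
            + (cross_entropy(tgt, lm_I_tgt) - cross_entropy(tgt, lm_O_tgt)))

# Rank the out-of-domain bitext by score and keep the lowest-scoring pairs:
# selected = sorted(bitext, key=lambda p: mml_score(*p, lm_I_s, lm_O_s,
#                                                   lm_I_t, lm_O_t))[:k]
```

Keeping only the top-ranked pairs is exactly the 1-0 weighting from the previous slide; in practice the LMs would be higher-order and trained with a proper toolkit.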
Does it work? [Results figure from Axelrod et al. 2011 not reproduced here.] See also Axelrod et al. 2012 for comparisons between domain-adapted systems (trained on 10% of the data) and full-data (100%) counterparts
Unit #0 • Unit #0 (0.5 weeks): • HW #1 • Introduction to NLP & shallow processing • Tokenization
Unit #1 • Unit #1 (0.5 weeks): • Formal Languages and Automata • Formal languages • Finite-state Automata (see the sketch below) • Finite-state Transducers • Morphological analysis • Transition from PFAs to Markov models
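As a concrete reminder of the finite-state material, here is a minimal sketch of a DFA recognizer in Python; the transition-table encoding and the /baa+!/ "sheep language" example are illustrative choices, not the course's own implementation.

```python
def make_dfa(transitions, start, accept):
    """Build a DFA recognizer from a transition table.
    transitions: dict mapping (state, symbol) -> next state."""
    def accepts(string):
        state = start
        for symbol in string:
            if (state, symbol) not in transitions:
                return False              # no transition: reject
            state = transitions[(state, symbol)]
        return state in accept
    return accepts

# Toy DFA for the "sheep language" /baa+!/ (b, a, a, more a's, !)
sheep = make_dfa(
    {(0, "b"): 1, (1, "a"): 2, (2, "a"): 3, (3, "a"): 3, (3, "!"): 4},
    start=0, accept={4})
print(sheep("baaa!"))  # True
print(sheep("ba!"))    # False (needs at least two a's)
```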
Unit #2 • Unit #2 (2 weeks): • HW #2, #3 • HW #2 – Building Markov models for English • HW #3 – Building a Korean POS HMM tagger • Markov Chains and HMMs • Building and applying Markov models (see the sketch below) • Part-of-speech (POS) tagging: • N-gram taggers • Hidden Markov Models
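A minimal sketch of the relative-frequency estimation behind HW #2's transition and emission matrices, assuming the corpus is given as lists of (word, tag) pairs; the function names and the <s>/</s> boundary pseudo-tags are illustrative.

```python
from collections import Counter, defaultdict

def normalize(counter):
    """Turn raw counts into a relative-frequency distribution."""
    total = sum(counter.values())
    return {k: v / total for k, v in counter.items()}

def train_hmm(tagged_sentences):
    """Estimate HMM transition and emission probabilities from
    sentences given as lists of (word, tag) pairs."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sentences:
        prev = "<s>"                       # sentence-initial pseudo-tag
        for word, tag in sent:
            trans[prev][tag] += 1          # count(tag_{i-1}, tag_i)
            emit[tag][word] += 1           # count(tag, word)
            prev = tag
        trans[prev]["</s>"] += 1           # sentence-final transition
    return ({t: normalize(c) for t, c in trans.items()},
            {t: normalize(c) for t, c in emit.items()})

trans, emit = train_hmm([[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]])
print(trans["<s>"]["DT"], emit["NN"]["dog"])   # 1.0 1.0
```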
Unit #3 • Unit #3: Intro to Classification (1.5 weeks) • Project #1 – News blog classification • Classification & Machine Learning • Intro to classification & Mallet • Intro to feature engineering • Document classification with classifiers
Unit #4 • Unit #4: Language Models & Smoothing (1.5 weeks) • HW #4 – Building LMs and applying smoothing • HW #5 – Building LMs, applying KL divergence to compare models • Intro to Language Models • Intro to Smoothing techniques • Laplace (add-1) (see the sketch below) • Good-Turing
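A minimal sketch of Laplace (add-1) smoothing for bigram probabilities, as covered in the unit; the helper name and toy corpus are illustrative.

```python
from collections import Counter

def laplace_bigram_prob(bigram_counts, unigram_counts, vocab_size):
    """Return P(w2 | w1) with add-1 smoothing:
    (count(w1, w2) + 1) / (count(w1) + V)."""
    def prob(w1, w2):
        return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + vocab_size)
    return prob

tokens = "the cat sat on the mat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
p = laplace_bigram_prob(bigrams, unigrams, vocab_size=len(unigrams))
print(p("the", "cat"))   # (1 + 1) / (2 + 5) ≈ 0.286
```

Note how the add-1 counts reserve probability mass for unseen bigrams such as P("mat" | "cat"), which would otherwise be zero.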
Unit #5 • Unit #5: QA & IR (1 week) • Introduction to QA & IR • Applying NLP methods to QA & IR • Reviewing data “pipelines” for NLP and related tasks
Unit #6 • Unit #6: Discriminative sequence modeling (1.5 weeks) • Project #2 – Applying discriminative models to POS tagging • POS tagging with classifiers • Chunking • Named Entity (NE) recognition
Unit #7 • Unit #7: Misc topics in Stat NLP (2 weeks) • Introduction to IE • Application of IE to “linguistics” • Introduction to MT • NLP models and techniques as applied to MT • Word Alignment • Intro to the EM algorithm (see the sketch below) • Domain Adaptation as applied to MT
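Of the MT topics listed, word alignment with EM is the most algorithmic; here is a minimal sketch of IBM Model 1 EM (ignoring the NULL word for brevity, and not the course's own implementation), with a toy bitext to show the expected behavior.

```python
from collections import defaultdict

def ibm_model1(bitext, iterations=10):
    """Estimate word-translation probabilities t(f|e) with EM (IBM Model 1).
    bitext: list of (foreign_sentence, english_sentence) token-list pairs."""
    t = defaultdict(lambda: 1.0)          # uniform-ish initialization
    for _ in range(iterations):
        count = defaultdict(float)        # expected counts c(f, e)
        total = defaultdict(float)        # expected counts c(e)
        for src, tgt in bitext:
            for f in src:
                z = sum(t[(f, e)] for e in tgt)   # E-step normalizer
                for e in tgt:
                    c = t[(f, e)] / z             # P(f aligns to e)
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():           # M-step: renormalize
            t[(f, e)] = c / total[e]
    return t

bitext = [("das haus".split(), "the house".split()),
          ("das buch".split(), "the book".split())]
t = ibm_model1(bitext)
print(round(t[("das", "the")], 3))   # climbs toward 1.0 across iterations
```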
Tools Developed • English tokenizer: HW#1 • Markov Models from corpora: HW#2 • Building a transition matrix • Building an emission matrix • Korean POS tagger (using an HMM): HW#3 • Apply #2 to Korean data • Simple smoothing • Text classifier: Project #1 • Classifier of blog/news data, right vs. left • Language Modeler: HW#4 • Tool to build and smooth an LM • Applied to Portuguese data • Tools for calculating Entropy and KL Divergence: HW#5 (see the sketch below) • Building and smoothing multilingual LMs • How to compare LMs and distributions • Discriminative POS Tagger: Project #2 • Korean POS Tagger, part 2 • ML applied to Sequence Labeling problems
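For the HW #5 style of model comparison, here is a minimal sketch of KL divergence between two smoothed unigram distributions over a shared vocabulary; the toy distributions are illustrative.

```python
import math

def kl_divergence(p, q):
    """D(P || Q) = sum_x P(x) * log(P(x) / Q(x)).
    Assumes both distributions are smoothed, so q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

# Two toy unigram distributions over the same (smoothed) vocabulary:
p = {"the": 0.5, "cat": 0.3, "mat": 0.2}
q = {"the": 0.4, "cat": 0.4, "mat": 0.2}
print(kl_divergence(p, q))  # small positive value; 0 iff P == Q
```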
Corpora & Systems • Data: • Penn Treebank • Wall Street Journal • Air Travel Information System (ATIS) • Korean Treebank • Portuguese Newswire Text Corpus • LM Training files from Cavnar & Trenkle (multiple languages) • Online Blogs from various media sites • Systems: • Mallet Machine Learning Package • Porter Stemmer
Winter Courses • Ling 571: Deep Processing Techniques for NLP • Parsing, Semantics (Lambda Calculus), Generation • Ling 572: Advanced Statistical Methods in NLP • Roughly, machine learning for CompLing • Decision Trees, Naïve Bayes, MaxEnt, SVM, CRF,… • Ling 567: Knowledge Engineering for Deep NLP • HPSG and MRS for novel languages • Ling 575: • (Xia) Domain adaptation • Dealing with system degradation when training and test data are from different domains • (Tjalve) Speech Technologies • (Bender) Semantic Representations • (Levow) Spoken Dialog Systems (?), in Spring
Tentative Outline for Ling 572 • Unit #0 (1 week): Basics • Introduction • Feature representations • Classification review • Unit #1 (2.5 weeks): Classic Machine Learning • K Nearest Neighbors • Decision Trees • Naïve Bayes
Tentative Outline for Ling 572 • Unit #3: (4 weeks): Discriminative Classifiers • Feature Selection • Maximum Entropy Models • Support Vector Machines • Unit #4: (1.5 weeks): Sequence Learning • Conditional Random Fields • Transformation Based Learning • Unit #5: (1 week): Other Topics • Semi-supervised learning,…
Ling 572 Information • No required textbook: • Online readings and articles • More math/stat content than 570 • Probability, Information Theory, Optimization • Please try to register at least 2 weeks in advance
Beyond Ling 572 • Machine learning: • Graphical models • Bayesian approaches • Online learning • Reinforcement learning • …. • Applications: • Information Retrieval • Question Answering • Generation • Machine translation • ….
Ling 575: Domain Adaptation • Handling system degradation when training and test data are from different domains • Focus on improving performance of POS taggers and parsers • Time: Thurs 3:30-5:50pm