Shallow Processing: Recap & Domain Adaptation



  1. Shallow Processing: Recap & Domain Adaptation • Shallow Processing Techniques for NLP • Ling570, Day 21 • December 6, 2012

  2. Roadmap • MT & Domain Adaptation • Looking back: • Topics covered • Tools and Data • Looking forward • Upcoming courses

  3. Domain Adaptation Slides 4-19 adapted from Barry Haddow’s slides

  4–5. The Battle of Word Senses and Domains • Most words have multiple senses • Cross-lingual mapping is difficult across all contexts • Senses are often “domain” specific • Fun with Tables and Chairs (into German): • Table → Tisch (general usage) or Tabelle (tech usage) • Chair → Stuhl (general usage) or Vorsitzende (governmental usage)

  6. The Battle of Word Senses and Domains: Contexts • Table: The food is on the table. → Das Essen ist auf dem Tisch. / The results are in the table. → Die Ergebnisse sind in der Tabelle. • Chair: He sat on the chair. → Er saß auf dem Stuhl. / He is chair of the committee. → Er ist Vorsitzender des Ausschusses.

  7. Domain? • Not a well-defined concept • Should be based on some notion of textual similarity • Lexical choice • Grammar • What level of granularity? News → Sports → Football?

  8. Examples of Domains • Europarl (ep) • European parliamentary proceedings • News-commentary (nc) • Analysis of current affairs • Subtitles (st) • Film subtitles

  9. Europarl Resumption of the session I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period. Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful. You have requested a debate on this subject in the course of the next few days, during this part-session.

  10. News Commentary Musharraf's Last Act? Desperate to hold onto power, Pervez Musharraf has discarded Pakistan's constitutional framework and declared a state of emergency. His goal? To stifle the independent judiciary and free media. Artfully, though shamelessly, he has tried to sell this action as an effort to bring about stability and help fight the war on terror more effectively.

  11. Subtitles I'll call in 30 minutes to check Oh, hello Fujio Is your mother here, too? Why are you outside? It's no fun listening to women's talk Well, why don't we go in together

  12. Translation Performance • Measure effect of domain on performance • Train two systems • One using in-domain data (nc or st) • One using out-of-domain data (ep) • Test on in-domain data (nc or st) • Only vary translation model
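The slide doesn't name the evaluation metric, but BLEU is the standard choice for this kind of comparison. Below is a minimal sketch of scoring the two systems' outputs against the shared in-domain test set with NLTK's corpus_bleu; the file names are hypothetical placeholders, not the course's actual data.

```python
# Hedged sketch: compare an in-domain-trained and an out-of-domain-trained
# system on the same in-domain test set. File names are hypothetical.
from nltk.translate.bleu_score import corpus_bleu

def read_tokenized(path):
    """Read one pre-tokenized sentence per line."""
    with open(path, encoding="utf-8") as f:
        return [line.split() for line in f]

refs = [[r] for r in read_tokenized("test.nc.en")]   # one reference per segment
hyp_nc = read_tokenized("output.nc-trained.en")      # in-domain TM
hyp_ep = read_tokenized("output.ep-trained.en")      # out-of-domain TM

print("in-domain TM (nc):    ", corpus_bleu(refs, hyp_nc))
print("out-of-domain TM (ep):", corpus_bleu(refs, hyp_ep))
```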

  13. Test on nc (results chart not preserved in the transcript)

  14. Test on st (results chart not preserved in the transcript)

  15. NC vs. EP: Example • Source: • Veamos el mercado accionario de los Estados Unidos, el mayor del mundo por mucho. • Translation - ep • Consider the US accionario market, the world's largest by much. • Translation - nc • Consider the US stock market, the largest by far.

  16. Translations of “stock market” (comparison table not preserved in the transcript)

  17. Domain Adaptation Techniques • Data selection • Filtering • Data weighting • Corpus weighting • Model interpolation • Topic-based clustering and weighting • Phrase or sentence weighting • Enlarge data-set • Web crawling • Extract from monolingual • Self-training

  18. Data Selection • Suppose we have: • A small in-domain corpus I • A large out-of-domain corpus O • Select data from O which is similar to I • Equivalent to weighting the sentences with a 1-0 weighting • Would be better not to discard data...

  19. Data Selection for Translation Models • Modified Moore-Lewis (Axelrod et al., EMNLP 2011; Axelrod et al., IWSLT 2012) • Moore-Lewis scores sentences by the perplexity difference between I and O • For the TM, apply this to both the source and the target sentence • Selects sentences most like I and least like O (see the sketch below)
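A minimal sketch of that scoring, computed here via cross-entropy (which ranks sentences identically to perplexity). Add-one-smoothed unigram LMs stand in for the higher-order n-gram LMs real implementations use, just to keep the example self-contained; all function names are illustrative.

```python
import math
from collections import Counter

def train_unigram(corpus):
    """corpus: iterable of token lists -> (counts, total tokens, vocab size + 1)."""
    counts = Counter(tok for sent in corpus for tok in sent)
    return counts, sum(counts.values()), len(counts) + 1  # +1 reserves mass for unseen words

def cross_entropy(model, sent):
    """Per-word cross-entropy of `sent` under an add-one-smoothed unigram LM."""
    counts, total, vocab = model
    logp = sum(math.log2((counts[t] + 1) / (total + vocab)) for t in sent)
    return -logp / max(len(sent), 1)

def mml_score(src_lms, tgt_lms, src, tgt):
    """Modified Moore-Lewis: in/out cross-entropy difference, summed over both sides."""
    (src_in, src_out), (tgt_in, tgt_out) = src_lms, tgt_lms
    return (cross_entropy(src_in, src) - cross_entropy(src_out, src)) + \
           (cross_entropy(tgt_in, tgt) - cross_entropy(tgt_out, tgt))

def select(bitext, src_lms, tgt_lms, keep):
    """Keep the `keep` lowest-scoring (most in-domain-like) sentence pairs from O."""
    return sorted(bitext, key=lambda pair: mml_score(src_lms, tgt_lms, *pair))[:keep]
```

Plain Moore-Lewis is the same computation restricted to the source side.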

  20. Does it work? See Axelrod et al. 2011 (results chart not preserved in the transcript); see also Axelrod et al. 2012 for comparisons between domain-adapted systems trained on 10% of the data and systems trained on 100%.

  21. Course Recap

  22. Unit #0 • Unit #0 (0.5 weeks): • HW #1 • Introduction to NLP & shallow processing • Tokenization (a toy tokenizer is sketched below)
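As an illustration only (not the actual assignment solution), a toy regex tokenizer in the spirit of HW #1:

```python
import re

# Keep contractions and digit-grouped numbers as single tokens;
# everything else non-space becomes a token of its own.
TOKEN_RE = re.compile(r"""
      [A-Za-z]+(?:'[a-z]+)?   # words, with an optional clitic: don't, it's
    | \d+(?:[.,]\d+)*         # numbers: 1,000  3.14
    | \S                      # any other non-space character on its own
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_RE.findall(text)

print(tokenize("Mr. Smith didn't pay $1,000."))
# ['Mr', '.', 'Smith', "didn't", 'pay', '$', '1,000', '.']
```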

  23. Unit #1 • Unit #1 (0.5 weeks): • Formal Languages and Automata (1 week) • Formal languages • Finite-state Automata • Finite-state Transducers • Morphological analysis • Transition from PFAs to MMs

  24. Unit #2 • Unit #2 (2 weeks): • HW #2, #3 • HW #2 – Building Markov models for English • HW #3 – Building a Korean POS HMM tagger • Markov Chains and HMMs • Building and applying Markov models • Part-of-speech (POS) tagging: • Ngram • Hidden Markov Models (a minimal Viterbi decoder is sketched below)
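For reference, a compact Viterbi decoder of the kind an HMM tagger like HW #3's needs. This is a generic sketch, not the course solution; it assumes transition and emission probabilities already estimated (and smoothed) from a tagged corpus, stored as nested dicts.

```python
import math

def viterbi(words, states, start, trans, emit, floor=1e-10):
    """Most likely tag sequence; trans[s][t] = P(t|s), emit[s][w] = P(w|s)."""
    def lp(p):                                  # log-prob with a floor for unseen events
        return math.log(max(p, floor))
    # initialization: start probabilities times emission of the first word
    V = [{s: lp(start.get(s, 0)) + lp(emit[s].get(words[0], 0)) for s in states}]
    back = [{}]
    for i, w in enumerate(words[1:], start=1):  # recursion over positions 1..n-1
        V.append({}); back.append({})
        for t in states:
            best = max(states, key=lambda s: V[i - 1][s] + lp(trans[s].get(t, 0)))
            V[i][t] = V[i - 1][best] + lp(trans[best].get(t, 0)) + lp(emit[t].get(w, 0))
            back[i][t] = best
    last = max(states, key=lambda s: V[-1][s])  # backtrace from the best final state
    tags = [last]
    for i in range(len(words) - 1, 0, -1):
        tags.append(back[i][tags[-1]])
    return list(reversed(tags))
```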

  25. Unit #3 • Unit #3: Intro to Classification (1.5 weeks) • Project #1 – News blog classification • Classification & Machine Learning • Intro to classification & Mallet • Intro to feature engineering • Document classification with classifiers

  26. Unit #4 • Unit #4: Language Models & Smoothing (1.5 weeks) • HW #4 – Building LMs and applying smoothing • HW #5 – Building LMs, applying KL divergence to compare models • Intro to Language Models • Intro to Smoothing techniques • Laplace (add-one) • Good-Turing (an add-one-smoothed bigram LM is sketched below)
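A minimal add-one (Laplace) smoothed bigram LM in the spirit of HW #4; Good-Turing is omitted to keep the sketch short.

```python
from collections import Counter

def train_bigram(corpus):
    """corpus: iterable of token lists; adds <s>/</s> sentence-boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent + ["</s>"]
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return unigrams, bigrams, len(unigrams)    # vocabulary size V for smoothing

def laplace_prob(model, prev, word):
    """P(word | prev) = (c(prev, word) + 1) / (c(prev) + V)."""
    unigrams, bigrams, V = model
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

model = train_bigram([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(laplace_prob(model, "the", "cat"))   # seen bigram
print(laplace_prob(model, "cat", "dog"))   # unseen bigram still gets probability mass
```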

  27. Unit #5 • Unit #5: QA & IR (1 week) • Introduction to QA & IR • Applying NLP methods to QA & IR • Reviewing data “pipelines” for NLP and related tasks

  28. Unit #6 • Unit #6: Discriminative sequence modeling (1.5 weeks) • Project #2 – Applying discriminative models to POS tagging • POS tagging with classifiers • Chunking • Named Entity (NE) recognition

  29. Unit #7 • Unit #7: Misc topics in Stat NLP (2 weeks) • Introduction to IE • Application of IE to “linguistics” • Introduction to MT • NLP models and techniques as applied to MT • Word Alignment • Intro to EM algorithm • Domain Adaptation as applied to MT

  30. Tools & Data

  31. Tools Developed • English tokenizer: HW#1 • Markov Models from corpora: HW#2 • Building a transition matrix • Building an emission matrix • Korean POS tagger (using an HMM): HW#3 • Apply #2 to Korean data • Simple smoothing • Text classifier: Project #1 • Classifier of blog/news data, right vs. left • Language Modeler: HW#4 • Tool to build and smooth an LM • Applied to Portuguese data • Tools for calculating Entropy and KL Divergence: HW#5 • Building and smoothing multilingual LMs • How to compare LMs and distributions (sketched below) • Discriminative POS Tagger: Project #2 • Korean POS Tagger, part 2 • ML applied to Sequence Labeling problems
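Entropy and KL divergence as used in HW #5 to compare distributions; a sketch over explicit probability dicts rather than the course's actual LM files.

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) * log2 p(x), in bits."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log2(p(x) / q(x)).
    q must be smoothed so q(x) > 0 wherever p(x) > 0, else D is infinite."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 0.4, "b": 0.3, "c": 0.3}
print(entropy(p))            # 1.5 bits
print(kl_divergence(p, q))   # small positive number; 0 iff p == q
```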

  32–34. Corpora & Systems • Data: • Penn Treebank • Wall Street Journal • Air Travel Information System (ATIS) • Korean Treebank • Portuguese Newswire Text Corpus • LM training files from Cavnar & Trenkle (multiple languages) • Online blogs from various media sites • Systems: • Mallet Machine Learning Package • Porter Stemmer

  35. Looking Forward

  36–39. Winter Courses • Ling 571: Deep Processing Techniques for NLP • Parsing, Semantics (Lambda Calculus), Generation • Ling 572: Advanced Statistical Methods in NLP • Roughly, machine learning for CompLing • Decision Trees, Naïve Bayes, MaxEnt, SVM, CRF, … • Ling 567: Knowledge Engineering for Deep NLP • HPSG and MRS for novel languages • Ling 575 seminars: • (Xia) Domain Adaptation: dealing with system degradation when training and test data are from different domains • (Tjalve) Speech Technologies • (Bender) Semantic Representations • (Levow) Spoken Dialog Systems (?), in Spring

  40–41. Tentative Outline for Ling 572 • Unit #0 (1 week): Basics • Introduction • Feature representations • Classification review • Unit #1 (2.5 weeks): Classic Machine Learning • K Nearest Neighbors • Decision Trees • Naïve Bayes

  42–44. Tentative Outline for Ling 572 (cont.) • Unit #3 (4 weeks): Discriminative Classifiers • Feature Selection • Maximum Entropy Models • Support Vector Machines • Unit #4 (1.5 weeks): Sequence Learning • Conditional Random Fields • Transformation-Based Learning • Unit #5 (1 week): Other Topics • Semi-supervised learning, …

  45–47. Ling 572 Information • No required textbook: • Online readings and articles • More math/stat content than 570 • Probability, Information Theory, Optimization • Please try to register at least 2 weeks in advance

  48–49. Beyond Ling 572 • Machine learning: • Graphical models • Bayesian approaches • Online learning • Reinforcement learning • … • Applications: • Information Retrieval • Question Answering • Generation • Machine translation • …

  50. Ling 575: Domain Adaptation • Handling system degradation when training and test data are from different domains • Focus on improving performance of POS taggers and parsers • Time: Thurs 3:30-5:50pm
