
Tools for Natural Language Processing Applications


  1. Tools for Natural Language Processing Applications Guruprasad Saikumar & Kham Nguyen

  2. OUTLINE • Natural Language Toolkit • Part of Speech Taggers • Parsers • Language Modeling • Back-off N-gram Language Model

  3. Natural Language Toolkit • Created in 2001 as part of a computational linguistics course in the Department of Computer and Information Science, University of Pennsylvania • The Natural Language Toolkit (NLTK) can be used as • A teaching tool • An individual study tool • A platform for prototyping and building research systems • NLTK is organized as a flat hierarchy of packages and modules

  4. NLTK contents: • Python modules • Tutorials • Problem sets • Reference documentation • Technical documentation • Current NLTK modules: • Basic operations such as tokenization, tree structures, etc. • Tagging • Parsing • Visualization Useful link: http://nltk.sourceforge.net/
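As a flavor of the "basic operations" the toolkit's modules cover, here is a toy tokenizer in plain Python (an illustrative sketch only, not NLTK's own API):

```python
import re

def tokenize(text):
    # Separate alphanumeric tokens from punctuation, roughly what a
    # rule-based tokenizer module does
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("NLTK is organized as packages, modules, and tutorials."))
```

NLTK's actual tokenizer modules offer far richer options (regexp, whitespace, and treebank-style tokenizers), but the underlying idea is this kind of pattern-based splitting.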

  5. Part-of-Speech Taggers

  6. Stanford Log-linear Part-of-Speech Tagger • By Kristina Toutanova • Maximum-entropy-based POS tagger • Java implementation • Two trained tagger models for English, using the Penn Treebank tag set Link to download the software: http://nlp.stanford.edu/software/tagger.shtml References: • Kristina Toutanova and Christopher D. Manning. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000), Hong Kong. • Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003, pages 252-259.

  7. TreeTagger • Developed at the Institute for Computational Linguistics of the University of Stuttgart • Based on a modified version of the ID3 decision-tree algorithm, which estimates probabilities such as p(NN | DET, ADJ) from the preceding tags • Taggers available for several languages: English, German, French, Italian, Spanish, and Greek Download link: http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/DecisionTreeTagger.html
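The p(NN | DET, ADJ) example above can be pictured as a tiny decision tree over the two preceding tags. The sketch below is purely illustrative; the probabilities and branching structure are invented, not TreeTagger's actual model:

```python
# Toy decision tree over the two preceding tags, in the spirit of
# TreeTagger's ID3-style estimation of p(tag | tag-2, tag-1).
# All probability values here are invented for illustration.
def p_tag(prev2, prev1):
    if prev1 == "ADJ":
        if prev2 == "DET":
            # After a determiner and an adjective, a noun is very likely
            return {"NN": 0.70, "ADJ": 0.10, "VB": 0.02}
        return {"NN": 0.55, "ADJ": 0.15, "VB": 0.05}
    if prev1 == "DET":
        return {"NN": 0.45, "ADJ": 0.40, "VB": 0.01}
    return {"NN": 0.20, "ADJ": 0.10, "VB": 0.25}

print(p_tag("DET", "ADJ")["NN"])  # p(NN | DET, ADJ)
```

In the real system the tree is induced from tagged training data, with splits chosen by an information-gain criterion as in ID3.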

  8. Brill Tagger • Developed by Eric Brill • Transformation-based POS tagger • The rule-based part-of-speech tagger works by first assigning each word its most likely tag • Rules are then learned that use contextual cues to improve tagging accuracy Download link: http://www.cs.jhu.edu/~brill/RBT1_14.tar.Z Reference: • Eric Brill. 1994. Some Advances in Transformation-Based Part of Speech Tagging. In Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94), Seattle, WA.
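The two-stage process described above (most-likely tag first, then contextual rules) can be sketched in a few lines. The lexicon and rules below are invented for illustration, not Brill's learned rule set:

```python
# Minimal sketch of Brill-style transformation-based tagging: assign each
# word its most frequent tag from a lexicon, then apply learned contextual
# rules in order. Lexicon and rules here are hypothetical examples.
LEXICON = {"they": "PRP", "the": "DET", "can": "NN", "fish": "NN", "dog": "NN"}

# Each rule: (old_tag, new_tag, required_previous_tag)
RULES = [
    ("NN", "MD", "PRP"),  # a noun right after a pronoun becomes a modal
    ("NN", "VB", "MD"),   # a noun right after a modal becomes a verb
]

def brill_tag(words):
    tags = [LEXICON.get(w, "NN") for w in words]  # initial most-likely tags
    for old, new, prev in RULES:
        for i in range(1, len(tags)):
            if tags[i] == old and tags[i - 1] == prev:
                tags[i] = new
    return tags

print(brill_tag(["they", "can", "fish"]))  # → ['PRP', 'MD', 'VB']
```

In the real tagger the rules themselves are learned automatically, by repeatedly picking the transformation that most reduces tagging errors on a training corpus.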

  9. TnT – Trigrams'n'Tags • A statistical POS tagger • Developed by Thorsten Brants, Saarland University • An implementation of the Viterbi algorithm for second-order Markov models • Language models available for German and English • The tagger can be adapted to new languages Useful link: http://www.coli.uni-saarland.de/~thorsten/tnt/
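To make the Viterbi step concrete, here is a simplified first-order decoder (TnT itself uses second-order, i.e. trigram, transitions, but the recurrence is the same shape). All model parameters below are toy values:

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    # Dynamic-programming search for the most likely tag sequence.
    # V[i][t] = (best probability of any path ending in tag t at word i, path)
    V = [{t: (start_p[t] * emit_p[t].get(words[0], 1e-6), [t]) for t in tags}]
    for w in words[1:]:
        col = {}
        for t in tags:
            prob, path = max(
                (V[-1][pt][0] * trans_p[pt][t] * emit_p[t].get(w, 1e-6),
                 V[-1][pt][1] + [t])
                for pt in tags)
            col[t] = (prob, path)
        V.append(col)
    return max(V[-1].values())[1]

# Toy two-tag model: determiners and nouns
tags = ["DET", "NN"]
start_p = {"DET": 0.8, "NN": 0.2}
trans_p = {"DET": {"DET": 0.1, "NN": 0.9}, "NN": {"DET": 0.4, "NN": 0.6}}
emit_p = {"DET": {"the": 0.9}, "NN": {"dog": 0.8}}
print(viterbi(["the", "dog"], tags, start_p, trans_p, emit_p))  # → ['DET', 'NN']
```

A second-order model conditions each transition on the two previous tags rather than one, which is what makes TnT a trigram tagger.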

  10. Parsers

  11. Stanford Parser • Contributions mainly by Dan Klein, with support code and linguistic grammar development by Christopher Manning • Java implementation of a probabilistic natural language parser • Online parser: http://josie.stanford.edu:8080/parser/ Download link: http://www-nlp.stanford.edu/downloads/StanfordParser-2005-07-21.tar.gz Reference: Dan Klein and Christopher D. Manning. 2002. Fast Exact Inference with a Factored Model for Natural Language Parsing. In Advances in Neural Information Processing Systems 15 (NIPS 2002), December 2002.

  12. Language Modeling

  13. SRI Language Modeling Toolkit • SRILM was developed mainly by Andreas Stolcke at the Speech Technology and Research Laboratory, SRI International, CA • SRILM is a collection of C++ libraries, executable programs, and helper scripts • Main application: statistical language modeling for speech recognition • LMs are based on N-gram statistics Reference: Andreas Stolcke. 2002. "SRILM - An Extensible Language Modeling Toolkit". In Proc. Intl. Conf. on Spoken Language Processing, Denver, Colorado, September 2002. Download link: http://www.speech.sri.com/projects/srilm/download.html

  14. Text Analysis and Summarization Tools System Quirk • Text analysis • Generates word lists • Indexing • Tracker Download link: http://www.mcs.surrey.ac.uk/SystemQ/ MEAD • A summarization and evaluation tool • Some features of the tool: • Multiple-document summarization • Query-based summarization • Various evaluation methods Download link: http://tangra.si.umich.edu/clair/mead/download/MEAD-3.07.tar.gz

  15. Back-off N-gram Language Model Kham Nguyen (kham@ccs.neu.edu) Spring 2006 Northeastern University

  16. Outline • Statistical language modeling • What is an N-gram language model • Back-off N-gram

  17. What is “statistical language modeling”? • A statistical language model (LM) provides a mechanism for computing: P(<s>, w1, w2, …, wn, </s>) • Used in speech recognition • Also used in machine translation, language identification, etc.

  18. N-gram Language Model • The simplest form of an N-gram probability is just the relative frequency of the N-gram
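The relative-frequency (maximum-likelihood) estimate is P(wn | w1…wn-1) = C(w1…wn) / C(w1…wn-1). A small sketch of computing it from raw counts:

```python
from collections import Counter

def ngram_mle(tokens, n):
    # Maximum-likelihood estimate: count(w1..wn) / count(w1..wn-1)
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    hist = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    return {g: c / hist[g[:-1]] for g, c in grams.items()}

probs = ngram_mle("the dog saw the cat".split(), 2)
print(probs[("the", "dog")])  # count(the dog) / count(the) = 1 / 2
```

The weakness motivating the next slides: any N-gram never seen in training gets probability zero under this estimate.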

  19. Back-off N-gram Language Model • Handles unseen N-grams by “backing off” to lower-order N-grams • For simplicity, abbreviate the history (w1, w2, …, wn-1) as h, and (w2, …, wn-1) as h’ • n=1: unigram • n=2: bigram • n=3: trigram

  20. Back-off weight • A back-off N-gram probability is defined recursively in terms of lower-order estimates • The probability axiom requires that the probabilities sum to one over all wi • Define two disjoint sets of wi: • ¬BO(wi|h): the set of all wi such that (h, wi) was seen in the training data • BO(wi|h): the set of all wi such that (h, wi) was unseen in the training data
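In standard Katz back-off form, using the h and h’ notation from the previous slide (a reconstruction of the usual formulation, with P* denoting a discounted seen-N-gram estimate):

```latex
P_{\mathrm{BO}}(w_i \mid h) =
\begin{cases}
  P^{*}(w_i \mid h) & \text{if } C(h, w_i) > 0 \quad (w_i \in \neg\mathrm{BO}) \\
  \alpha(h)\, P_{\mathrm{BO}}(w_i \mid h') & \text{otherwise} \quad (w_i \in \mathrm{BO})
\end{cases}
```

The probability axiom then forces the back-off weight α(h) to redistribute exactly the probability mass left over by discounting:

```latex
\alpha(h) = \frac{1 - \sum_{w_i \in \neg\mathrm{BO}} P^{*}(w_i \mid h)}
                 {\sum_{w_i \in \mathrm{BO}} P_{\mathrm{BO}}(w_i \mid h')}
```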

  21. Back-off weight (cont.)

  22. Perplexity • The “quality” of an LM is typically measured by its perplexity, or “branching” factor • Intuitively, the perplexity is the average number of words that can follow a given history
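Concretely, perplexity is 2 raised to the average negative log2 probability the model assigns to each word of a test text. A minimal sketch:

```python
import math

def perplexity(probs):
    # probs: the model probability assigned to each word of the test text
    # PP = 2 ** (-(1/N) * sum(log2 p)) -- the average "branching factor"
    n = len(probs)
    return 2 ** (-sum(math.log2(p) for p in probs) / n)

# A model that always spreads its mass uniformly over 4 choices
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # → 4.0
```

The uniform example shows the "branching" intuition: assigning each word probability 1/4 is exactly as hard as choosing among 4 equally likely continuations.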

  23. N-gram LM for Speech Recognition • The language model is one of the knowledge sources used in automatic speech recognition (ASR) • Almost all state-of-the-art ASR systems use a back-off N-gram LM, with N typically 3 (a trigram)
