
CSA2050: Introduction to Computational Linguistics


binah

Presentation Transcript


  1. CSA2050: Introduction to Computational Linguistics NLTK CSA2050:NLTK

  2. NLTK • A software package for manipulating linguistic data and performing NLP tasks • Advanced tasks are possible from an early stage • Permits projects at various levels • Consistent interfaces • Facilitates reusability of modules • Implemented in Python

  3. Chart Parsing with NLTK

  4. Why Python • Popular languages for NLP courses • Prolog (clean, learning curve, slow) • Perl (quick, syntax) • Why Python is better suited • Easy to learn, clean syntax • Interpreted, supporting rapid prototyping • Object-oriented • Powerful

  5. NLTK Structure • NLTK is implemented as a set of minimally interdependent modules. • Core modules • Basic data types • Task modules • Tokenising • Parsing • Other NLP tasks

  6. Token Class • The Token class is used to encode information about natural language texts. • Each Token instance represents a unit of text such as a word, a sentence, or a document. • A given instance is defined by a partial mapping from property names to property values.
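The idea of a token as a partial mapping can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the `Token` class below is a hypothetical stand-in written for this example, not the class from `nltk.token`, and its display format simply imitates the `<text/TAG>` output shown on the slides.

```python
class Token(dict):
    """Hypothetical stand-in for the Token class described above.

    A token is a partial mapping from property names (e.g. TEXT, TAG)
    to property values; properties that were never set are simply absent.
    """

    def __repr__(self):
        text = self.get("TEXT", "")
        tag = self.get("TAG")
        # Imitate the <text/TAG> display used on the slides.
        return f"<{text}/{tag}>" if tag is not None else f"<{text}>"


word = Token(TEXT="python", TAG="NN")
document = Token(TEXT="Hello World!")

print(word)
print(document)
print("TAG" in document)  # the mapping is partial: no TAG property here
```

Subclassing `dict` keeps the sketch short; any mapping type would serve equally well.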

  7. The TEXT Property • The TEXT property is used to encode a token’s text content.
     >>> from nltk.token import *
     >>> Token(TEXT="Hello World!")
     <Hello World!>

  8. TAG • The TAG property is used to encode a token’s part-of-speech tag:
     >>> Token(TEXT="python", TAG="NN")
     <python/NN>

  9. SUBTOKENS • The SUBTOKENS property is used to store a tokenized text:
     >>> from nltk.tokenizer import *
     >>> tok = Token(TEXT="Hello World!")
     >>> WhitespaceTokenizer().tokenize(tok)
     >>> print tok['SUBTOKENS']
     [<Hello>, <World!>]
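The same in-place behaviour can be imitated without the toolkit. The sketch below assumes tokens are plain dictionaries and uses a hypothetical helper, `whitespace_tokenize`, that is not part of NLTK; like the call on the slide, it returns nothing and records its result on the token itself.

```python
def whitespace_tokenize(token):
    """Hypothetical helper: store whitespace-separated pieces under SUBTOKENS.

    The token is annotated in place; its TEXT property is left untouched.
    """
    token["SUBTOKENS"] = [{"TEXT": piece} for piece in token["TEXT"].split()]


tok = {"TEXT": "Hello World!"}
whitespace_tokenize(tok)

print([sub["TEXT"] for sub in tok["SUBTOKENS"]])
print(tok["TEXT"])  # the original text is still available
```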

  10. Augmenting the Token with Information • Language processing tasks are formulated as annotations and transformations of tokens, which add properties to the Token data structure. Examples: • word-sense disambiguation • chunking • parsing

  11. Blackboard Architecture • Typically these modifications are monotonic – they add information but do not delete it. • Tokens serve as a blackboard where information about a piece of text is collated. • This architecture contrasts with the more typical pipeline architecture where each stage destructively modifies the input information. • This approach was chosen because it gives greater flexibility when combining tasks into a single system.
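A minimal sketch of the blackboard idea, assuming tokens are plain dictionaries: each stage only adds properties, so later stages and the caller can still see everything earlier stages wrote. The stage functions, the three-word lexicon, and the `CHUNKS` property name are all invented for illustration and do not come from NLTK.

```python
def tokenize(token):
    # Stage 1: add SUBTOKENS, keep TEXT.
    token["SUBTOKENS"] = [{"TEXT": w} for w in token["TEXT"].split()]


def tag(token):
    # Stage 2: toy tagger backed by a hypothetical lookup table.
    lexicon = {"the": "AT", "dog": "NN", "barks": "VBZ"}
    for sub in token["SUBTOKENS"]:
        sub["TAG"] = lexicon.get(sub["TEXT"].lower(), "UNK")


def chunk(token):
    # Stage 3: toy chunker marking determiner + noun spans as (start, end).
    subs = token["SUBTOKENS"]
    token["CHUNKS"] = [
        (i, i + 2)
        for i in range(len(subs) - 1)
        if subs[i]["TAG"] == "AT" and subs[i + 1]["TAG"] == "NN"
    ]


sentence = {"TEXT": "The dog barks"}
for stage in (tokenize, tag, chunk):
    before = set(sentence)
    stage(sentence)
    # Monotonic: every property present before the stage is still present.
    assert before <= set(sentence)

print(sorted(sentence))
print(sentence["CHUNKS"])
```

In a destructive pipeline, by contrast, the chunker would receive only the tagger's output and the raw text would no longer be reachable from it.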

  12. Other Core Modules • The probability module defines classes for probability distributions and statistical smoothing techniques. • The cfg module defines classes for encoding context-free grammars (normal and probabilistic). • The corpus module defines classes for reading and processing different corpora.
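To make "statistical smoothing" concrete, here is a small sketch of add-one (Laplace) smoothing over observed tags. The class name `LaplaceProbDist` and its constructor arguments are chosen for this example and should not be read as the probability module's actual interface; the four-tag tagset is likewise an assumption.

```python
from collections import Counter


class LaplaceProbDist:
    """Illustrative add-one smoothed distribution over a fixed number of bins."""

    def __init__(self, samples, bins):
        self.counts = Counter(samples)
        self.total = sum(self.counts.values())
        self.bins = bins  # number of possible outcomes, seen or unseen

    def prob(self, sample):
        # Every outcome receives one extra pseudo-count, so unseen
        # outcomes get a small non-zero probability.
        return (self.counts[sample] + 1) / (self.total + self.bins)


observed = ["NN", "NN", "VB", "AT"]
dist = LaplaceProbDist(observed, bins=4)  # assumed tagset: NN, VB, AT, JJ

print(dist.prob("NN"))  # seen twice
print(dist.prob("JJ"))  # never seen, still non-zero
```

With four bins and four observations, the smoothed probabilities of the four tags still sum to one.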

  13. Using Brown Corpus
     >>> from nltk.corpus import brown
     >>> brown.groups()
     ['skill and hobbies', 'popular lore', 'humor', 'fiction: mystery', ...]
     >>> brown.items('humor')
     ('cr01', 'cr02', 'cr03', 'cr04', 'cr05', 'cr06', 'cr07', 'cr08', 'cr09')
     >>> brown.tokenize('cr01')
     <[<It/pps>, <was/bedz>, <among/in>, <these/dts>, <that/cs>, <Hinkle/np>, <identified/vbd>, <a/at>, ...]>
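The `word/tag` notation in the output above is easy to process directly. The following sketch assumes the input is a whitespace-separated string in that notation; `parse_tagged` is a hypothetical helper for this example, not a corpus-reader method, and the sample string is just the words shown on the slide.

```python
def parse_tagged(text):
    """Split 'word/tag' items into (word, tag) pairs (hypothetical helper).

    The tag is taken after the last slash, so a word that itself
    contains a slash is kept whole. Items with no slash get a tag of None.
    """
    pairs = []
    for item in text.split():
        word, sep, tag = item.rpartition("/")
        pairs.append((word, tag) if sep else (item, None))
    return pairs


sample = "It/pps was/bedz among/in these/dts that/cs Hinkle/np identified/vbd a/at"
pairs = parse_tagged(sample)

print(pairs[:3])
print([word for word, tag in pairs if tag == "np"])  # proper nouns
```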

  14. Penn Treebank
     >>> from nltk.corpus import treebank
     >>> treebank.groups()
     ('raw', 'tagged', 'parsed', 'merged')
     >>> treebank.items('parsed')
     ['wsj_0001.prd', 'wsj_0002.prd', ...]
     >>> item = 'parsed/wsj_0001.prd'
     >>> sentences = treebank.tokenize(item)
     >>> for sent in sentences['SUBTOKENS']:
     ...     print sent.pp()  # pretty-print
     (S: (NP-SBJ: (NP: <Pierre> <Vinken>) (ADJP: (NP: <61> <years>) <old> ) ...
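Parsed treebank files store each sentence as a bracketed tree. As a rough sketch of how such a string can be turned into a data structure, the code below reads the common `(LABEL child child ...)` notation into nested lists. It assumes every opening bracket is followed by a label, and the sample sentence is a shortened fragment written for this example rather than the contents of any corpus file.

```python
import re


def parse_bracketed(text):
    """Parse a bracketed tree string into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok  # a leaf (word)
        label = tokens[pos]  # assumes a label follows every "("
        pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(read())
        pos += 1  # consume the closing ")"
        return [label, *children]

    return read()


def leaves(tree):
    """Return the words of a parsed tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree[1:] for leaf in leaves(child)]


tree = parse_bracketed(
    "(S (NP-SBJ (NP (NNP Pierre) (NNP Vinken))) (VP (MD will) (VB join)))"
)
print(tree[0])
print(leaves(tree))
```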

  15. Processing Modules • Each language processing algorithm is implemented as a class. • For example, the ChartParser and RecursiveDescentParser classes each define a single algorithm for parsing a text. • Each processing module defines an interface. • Interface classes are named with a trailing capital I, e.g. ParserI. • Such interface classes define one or more action methods that perform the task the module is designed for.

  16. • parse method • parse_n method
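The interface convention can be sketched with Python's abstract base classes. The slides only name the `parse` and `parse_n` action methods; the signatures, return values, and the toy `FlatParser` below are assumptions made for this illustration and may differ from the toolkit's own definitions.

```python
from abc import ABC, abstractmethod


class ParserI(ABC):
    """Illustrative parser interface (trailing capital I marks an interface)."""

    @abstractmethod
    def parse_n(self, words, n=None):
        """Return up to n analyses of words (all of them if n is None)."""

    def parse(self, words):
        """Return a single analysis, or None if there is none."""
        results = self.parse_n(words, n=1)
        return results[0] if results else None


class FlatParser(ParserI):
    """Toy algorithm: attach every word directly under a single S node."""

    def parse_n(self, words, n=None):
        analyses = [("S", tuple(words))]
        return analyses[:n]


parser = FlatParser()
print(parser.parse(["Hello", "World!"]))
print(len(parser.parse_n(["Hello", "World!"])))
```

Because client code depends only on the interface, one parsing algorithm can be swapped for another without changing the caller.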

  17. What is Python • Python is an interpreted, object-oriented programming language with dynamic semantics. • Attractive for Rapid Application Development. • Its easy-to-learn syntax emphasizes readability and therefore reduces the cost of program maintenance. • Python supports modules and packages, which encourages program modularity and code reuse. • Developed by Guido van Rossum in the early 1990s. • Named after Monty Python. • Open source and free. • Download from www.python.org

  18. Why Python • Prolog • clean, learning curve, slow • Lisp • old, syntax, big • Perl • quick, • C#
