CSA2050: Introduction to Computational Linguistics NLTK CSA2050:NLTK
NLTK
• A software package for manipulating linguistic data and performing NLP tasks
• Advanced tasks are possible from an early stage
• Permits projects at various levels
• Consistent interfaces
• Facilitates reusability of modules
• Implemented in Python
Chart Parsing with NLTK
Why Python
• Popular languages for NLP courses
  • Prolog (clean, learning curve, slow)
  • Perl (quick, syntax)
• Why Python is better suited
  • Easy to learn, clean syntax
  • Interpreted, supporting rapid prototyping
  • Object oriented
  • Powerful
NLTK Structure
• NLTK is implemented as a set of minimally interdependent modules.
• Core modules
  • Basic data types
• Task modules
  • Tokenising
  • Parsing
  • Other NLP tasks
Token Class
• The Token class is used to encode information about natural language texts.
• Each Token instance represents a unit of text such as a word, a sentence, or a document.
• A given instance is defined by a partial mapping from property names to property values.
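The idea of a token as a partial mapping from property names to values can be sketched with a plain dictionary subclass. This is a minimal illustration written for these notes, not NLTK's own implementation; the class body and its display format are assumptions modelled on the examples in the slides.

```python
class Token(dict):
    """Hypothetical stand-in for a token: a mapping from property names to values."""

    def __repr__(self):
        # Show the text, followed by the tag when one has been assigned.
        text = self.get("TEXT", "")
        tag = self.get("TAG")
        if tag is not None:
            return "<%s/%s>" % (text, tag)
        return "<%s>" % text


word = Token(TEXT="python")

# The mapping is partial: a property that was never set is simply absent.
has_tag_before = "TAG" in word

# Later processing steps add properties to the same object.
word["TAG"] = "NN"
print(repr(word))
```

Because unset properties are absent rather than empty, a processing step can check which annotations are already available before relying on them.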
The TEXT Property
• The TEXT property is used to encode a token’s text content.
>>> from nltk.token import *
>>> Token(TEXT="Hello World!")
<Hello World!>
TAG
• The TAG property is used to encode a token’s part-of-speech tag:
>>> Token(TEXT="python", TAG="NN")
<python/NN>
SUBTOKENS
• The SUBTOKENS property is used to store a tokenized text:
>>> from nltk.tokenizer import *
>>> tok = Token(TEXT="Hello World!")
>>> WhitespaceTokenizer().tokenize(tok)
>>> print tok['SUBTOKENS']
[<Hello>, <World!>]
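What a whitespace tokenizer does to a token can be approximated without NLTK installed. The sketch below uses plain dictionaries in place of Token objects; the function name `whitespace_tokenize` is hypothetical and only mirrors the behaviour shown on the slide.

```python
def whitespace_tokenize(token):
    """Split token['TEXT'] on whitespace and store the pieces under 'SUBTOKENS'."""
    token["SUBTOKENS"] = [{"TEXT": piece} for piece in token["TEXT"].split()]
    return token


tok = {"TEXT": "Hello World!"}
whitespace_tokenize(tok)

# The original text is still present; the subtokens were added alongside it.
subtoken_texts = [sub["TEXT"] for sub in tok["SUBTOKENS"]]
print(subtoken_texts)
```

Note that the tokenizer annotates the token in place instead of returning a separate list of words, which is why the slide's call to `tokenize` prints nothing.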
Augmenting the Token with Information
• Language processing tasks are formulated as annotations and transformations involving tokens, which add properties to the Token data structure. Examples:
  • word-sense disambiguation
  • chunking
  • parsing
Blackboard Architecture
• Typically these modifications are monotonic: they add information but do not delete it.
• Tokens serve as a blackboard where information about a piece of text is collated.
• This architecture contrasts with the more typical pipeline architecture, where each stage destructively modifies the input information.
• This approach was chosen because it gives greater flexibility when combining tasks into a single system.
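The blackboard idea can be illustrated with a few lines of plain Python. In this sketch every stage writes new properties onto a shared token, and a small driver checks that no existing top-level property disappears. The stage functions, the toy tagging rule, and the `annotate` driver are all assumptions made for illustration; none of them is part of NLTK.

```python
def tokenize(token):
    # Add a SUBTOKENS property; the original TEXT is left untouched.
    token["SUBTOKENS"] = [{"TEXT": w} for w in token["TEXT"].split()]


def tag(token):
    # Toy rule (illustration only): capitalised words get NP, all others NN.
    for sub in token["SUBTOKENS"]:
        sub["TAG"] = "NP" if sub["TEXT"][:1].isupper() else "NN"


def annotate(token, stages):
    """Run each stage on the shared token and check that it is monotonic."""
    for stage in stages:
        before = set(token)
        stage(token)
        if not before <= set(token):
            raise ValueError("stage %s deleted a property" % stage.__name__)
    return token


sentence = annotate({"TEXT": "Hinkle identified python"}, [tokenize, tag])
tags = [(s["TEXT"], s["TAG"]) for s in sentence["SUBTOKENS"]]
print(tags)
```

Since earlier annotations are never removed, a later stage can consult any of them, and stages can be recombined without agreeing on a fixed intermediate format.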
Other Core Modules
• The probability module defines classes for probability distributions and statistical smoothing techniques.
• The cfg module defines classes for encoding context-free grammars (normal and probabilistic).
• The corpus module defines classes for reading and processing different corpora.
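As one concrete example of a smoothing technique, the sketch below computes add-one (Laplace) estimates from raw counts using only the standard library. The choice of Laplace smoothing and the helper name `laplace_prob` are assumptions for illustration; the slides do not say which estimators the probability module provides.

```python
from collections import Counter


def laplace_prob(counts, sample, bins):
    """Add-one estimate: (count + 1) / (N + bins), where N is the total count."""
    total = sum(counts.values())
    return (counts[sample] + 1) / (total + bins)


counts = Counter(["the", "cat", "the", "mat"])
vocabulary = {"the", "cat", "mat", "dog"}  # "dog" never occurs in the data

p_the = laplace_prob(counts, "the", len(vocabulary))
p_dog = laplace_prob(counts, "dog", len(vocabulary))
print(p_the, p_dog)
```

The unseen word receives a small non-zero probability, and the estimates over the whole vocabulary still sum to one.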
Using Brown Corpus
>>> from nltk.corpus import brown
>>> brown.groups()
['skill and hobbies', 'popular lore', 'humor', 'fiction: mystery', ...]
>>> brown.items('humor')
('cr01', 'cr02', 'cr03', 'cr04', 'cr05', 'cr06', 'cr07', 'cr08', 'cr09')
>>> brown.tokenize('cr01')
<[<It/pps>, <was/bedz>, <among/in>, <these/dts>, <that/cs>, <Hinkle/np>, <identified/vbd>, <a/at>, ...]>
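The tagged tokens in the output above pair each word with a part-of-speech tag. A corpus reader for text stored as whitespace-separated `word/tag` items could be sketched as follows; the function name `parse_tagged` and the assumed input format are illustrative, and plain dictionaries stand in for Token objects.

```python
def parse_tagged(line):
    """Turn 'word/tag word/tag ...' into a list of tokens with TEXT and TAG."""
    tokens = []
    for item in line.split():
        # Split on the last slash so that a word containing '/' stays whole.
        text, tag = item.rsplit("/", 1)
        tokens.append({"TEXT": text, "TAG": tag})
    return tokens


tokens = parse_tagged("It/pps was/bedz among/in these/dts")
pairs = [(t["TEXT"], t["TAG"]) for t in tokens]
print(pairs)
```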
Penn Treebank
>>> from nltk.corpus import treebank
>>> treebank.groups()
('raw', 'tagged', 'parsed', 'merged')
>>> treebank.items('parsed')
['wsj_0001.prd', 'wsj_0002.prd', ...]
>>> item = 'parsed/wsj_0001.prd'
>>> sentences = treebank.tokenize(item)
>>> for sent in sentences['SUBTOKENS']:
...     print sent.pp() # pretty-print
(S:
  (NP-SBJ:
    (NP: <Pierre> <Vinken>)
    (ADJP:
      (NP: <61> <years>)
      <old>
    )
    ...
Processing Modules
• Each language processing algorithm is implemented as a class.
• For example, the ChartParser and RecursiveDescentParser classes each define a single algorithm for parsing a text.
• Each processing module defines an interface.
• Interface classes are named with a trailing capital I, e.g. ParserI.
• Such interface classes define one or more action methods that perform the task the module is supposed to perform.
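The split between an interface class and the algorithm classes that implement it can be sketched in modern Python with an abstract base class. Everything below is an assumption made for illustration: the method signatures, the tuple-based tree format, and the toy grammar are not taken from NLTK, and the parser is a simplified top-down parser that only handles grammars without left recursion.

```python
from abc import ABC, abstractmethod


class ParserI(ABC):
    """Interface sketch: the action methods every parser must offer."""

    @abstractmethod
    def parse_n(self, words, n=None):
        """Return up to n parse trees for words (all of them if n is None)."""

    def parse(self, words):
        """Return the first parse tree, or None if the words cannot be parsed."""
        trees = self.parse_n(words, 1)
        return trees[0] if trees else None


class ToyRecursiveDescentParser(ParserI):
    """One algorithm behind the interface: top-down, depth-first expansion."""

    def __init__(self, grammar, start):
        self.grammar = grammar  # nonterminal -> list of right-hand sides
        self.start = start

    def _expand(self, symbol, words, pos):
        """Yield (tree, next_pos) for each way symbol can match words[pos:]."""
        if symbol not in self.grammar:  # terminal: must equal the next word
            if pos < len(words) and words[pos] == symbol:
                yield symbol, pos + 1
            return
        for rhs in self.grammar[symbol]:
            for children, end in self._expand_seq(rhs, words, pos):
                yield (symbol, children), end

    def _expand_seq(self, symbols, words, pos):
        """Yield (list_of_trees, next_pos) for a sequence of symbols."""
        if not symbols:
            yield [], pos
            return
        for tree, mid in self._expand(symbols[0], words, pos):
            for rest, end in self._expand_seq(symbols[1:], words, mid):
                yield [tree] + rest, end

    def parse_n(self, words, n=None):
        trees = []
        for tree, end in self._expand(self.start, words, 0):
            if end == len(words):  # keep only parses that cover every word
                trees.append(tree)
                if n is not None and len(trees) >= n:
                    break
        return trees


grammar = {
    "S": [["NP", "VP"]],
    "NP": [["the", "N"]],
    "VP": [["V", "NP"]],
    "N": [["dog"], ["cat"]],
    "V": [["saw"]],
}
parser = ToyRecursiveDescentParser(grammar, "S")
tree = parser.parse("the dog saw the cat".split())
print(tree)
```

Code that depends only on `ParserI` can swap in a different algorithm, such as a chart parser, without any other change, which is the reusability the consistent interfaces are meant to provide.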
parse method
parse_n method
What is Python
• Python is an interpreted, object-oriented programming language with dynamic semantics.
• Attractive for rapid application development
• Its easy-to-learn syntax emphasizes readability and therefore reduces the cost of program maintenance.
• Python supports modules and packages, which encourages program modularity and code reuse.
• Developed by Guido van Rossum in the early 1990s
• Named after Monty Python
• Open source and free
• Download from www.python.org
Why Python
• Prolog
  • clean, learning curve, slow
• Lisp
  • old, syntax, big
• Perl
  • quick, syntax
• C#