Natural Language Processing

Overview of this unit • Week 1: Natural Language Processing (work in partners on lab with NLTK; brainstorm and start projects using NLP, speech recognition, or both) • Week 2: Speech Recognition (speech lab; finish projects and short critical reading) • Week 3: Present projects; discuss reading

Natural Language Processing • What is “Natural Language”?

Components of Language • Phonetics – the sounds which make up a word, e.g. “cat” – k a t • Morphology – the rules by which words are composed, e.g. run + ing • Syntax – rules for the formation of grammatical sentences, e.g. “Colorless green ideas sleep furiously.”, not “Colorless ideas green sleep furiously.” • Semantics – meaning, e.g. “rose” • Pragmatics – relationship of meaning to the context, goals and intent of the speaker, e.g. “Duck!” • Discourse – “beyond the sentence boundary”

Natural Language Processing • Truly interdisciplinary • Probabilistic methods • APIs

NLTK • Natural Language Toolkit for Python • Text, not speech • Corpora, tokenizers, stemmers, taggers, chunkers, parsers, classifiers, clusterers…
>>> import nltk
>>> words = book.words()                     # `book` is assumed to be an already-loaded corpus
>>> bigrams = nltk.bigrams(words)            # adjacent word pairs
>>> cfd = nltk.ConditionalFreqDist(bigrams)  # counts of each next word, conditioned on the previous word
>>> pos = nltk.pos_tag(words)                # part-of-speech tag for each word

Terminology

Terminology • Token – an instance of a symbol or linguistic unit, commonly a word

Terminology • Tokenize – to break a sequence of characters into constituent parts • Often uses delimiters such as whitespace, special characters, and newlines • e.g. “The quick brown fox jumped over the log.” • Harder: “Mr. Brown, we’re confused by your article in the newspaper regarding widely-used words.” (abbreviation, contraction, hyphenated compound)
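To see why the second sentence is harder than the first, here is a minimal sketch comparing a whitespace split with a regex-based split. The pattern is an illustrative assumption, not NLTK's tokenizer.

```python
import re

sentence = ("Mr. Brown, we're confused by your article in the newspaper "
            "regarding widely-used words.")

# Naive approach: split on whitespace only; punctuation stays glued to words.
whitespace_tokens = sentence.split()

# Regex approach (illustrative pattern, an assumption of this sketch):
# keep words with internal apostrophes or hyphens together and
# emit every other non-space character as its own token.
TOKEN_PATTERN = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")
regex_tokens = TOKEN_PATTERN.findall(sentence)

print(whitespace_tokens)
print(regex_tokens)
```

Neither result is clearly right: the whitespace split leaves "Brown," with its comma attached, while the regex split separates the abbreviation "Mr." into two tokens.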

Terminology • Lexeme – the set of forms taken by a single word; the main entries in a dictionary • e.g. run [ruhn]: verb – run, ran, runs, running; noun – run; adjective – runny

Terminology • Morpheme – the smallest meaningful unit in the grammar of a language • Unladylike (un + lady + like) • Dogs (dog + s) • Technique (a single morpheme)
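A toy sketch of splitting the three example words into morphemes, assuming they are meant to contain three, two, and one morpheme respectively. The affix lists and stem lexicon are hypothetical and cover only these examples; real morphological analysis is far more involved.

```python
# Hypothetical mini-lexicon for this sketch only.
PREFIXES = ("un",)
SUFFIXES = ("like", "s")
STEMS = {"lady", "dog", "technique"}

def segment(word):
    """Split a word into prefix + stem + suffix when the stem is known."""
    word = word.lower()
    for prefix in ("",) + PREFIXES:
        for suffix in ("",) + SUFFIXES:
            if not word.startswith(prefix) or not word.endswith(suffix):
                continue
            stem = word[len(prefix):len(word) - len(suffix)]
            # Accept a split only if what remains is a known stem.
            if stem in STEMS:
                return [part for part in (prefix, stem, suffix) if part]
    return [word]  # no analysis found: treat the word as one unit

for example in ("Unladylike", "Dogs", "Technique"):
    print(example, "->", segment(example))
```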

Terminology • Sememe – a unit of meaning attached to a morpheme • dog – a domesticated carnivorous mammal • -s – a plural marker on nouns

Terminology • Phoneme - the smallest contrastive unit in the sound system of a language • /k/ sound in the words kit and skill • /e/ in peg and bread • International Phonetic Alphabet (IPA)

Terminology • Lexicon – a vocabulary; the set of a language’s lexemes

Terminology • Lexical Ambiguity – multiple alternative linguistic structures can be built for the input • e.g. “I made her duck” • We use POS tagging and word sense disambiguation to ATTEMPT to resolve these issues
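A small sketch of where the ambiguity comes from: if each word can take more than one part of speech, the number of candidate taggings multiplies. The mini-lexicon and tag names below are hypothetical.

```python
from itertools import product

# Hypothetical mini-lexicon: each word maps to the POS tags it can take.
LEXICON = {
    "i": ["PRON"],
    "made": ["VERB"],
    "her": ["PRON", "DET"],    # object pronoun vs. possessive
    "duck": ["NOUN", "VERB"],  # the bird vs. the action
}

def candidate_taggings(sentence):
    """Enumerate every combination of tags the lexicon allows."""
    words = sentence.lower().split()
    options = [LEXICON[word] for word in words]
    return [list(zip(words, tags)) for tags in product(*options)]

readings = candidate_taggings("I made her duck")
for reading in readings:
    print(reading)
```

A tagger's job is to pick one of these candidates; the sketch only lists them.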

Terminology • Part of Speech - how a word is used in a sentence

Terminology • Grammar – the syntax and morphology of a natural language

Terminology • Corpus (plural: corpora) – a body of text which may or may not include meta-information such as POS, syntactic structure, and semantics

Terminology • Concordance – list of the usages of a word in its immediate context from a specific text • >>> text1.concordance("monstrous") (text1 is loaded with: from nltk.book import *)
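The idea behind a concordance can be sketched in plain Python. This is a simplified stand-in, not NLTK's implementation, and the sample sentence is invented for illustration.

```python
def concordance(tokens, target, width=3):
    """Return each occurrence of `target` with `width` tokens of context per side."""
    target = target.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Invented sample text, not taken from any corpus.
tokens = "the monstrous whale rose and the crew feared the monstrous sea".split()
for line in concordance(tokens, "monstrous"):
    print(line)
```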

Terminology • Collocation – a sequence of words that occur together unusually often • e.g. “red wine” • >>> text4.collocations()
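One way to make "unusually often" concrete is pointwise mutual information (PMI), which compares how often a pair occurs with how often it would occur if the two words were independent. PMI is a swapped-in measure for this sketch; it is not claimed to be what text4.collocations() uses, and the sample text is invented.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """Score adjacent word pairs by PMI, skipping pairs seen fewer than min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    # Highest-scoring pairs first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Invented sample text.
tokens = ("the host poured red wine the guest praised the red wine "
          "and the cook bought red wine for the stew").split()
print(pmi_scores(tokens))
```

The frequency cutoff matters: with very small counts, PMI tends to reward pairs that occur only once.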

Terminology • Hapax – a word that appears once in a corpus • >>> fdist.hapaxes()
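A minimal stand-in for the hapax lookup, using collections.Counter rather than NLTK; the sample sentence is invented.

```python
from collections import Counter

def hapaxes(tokens):
    """Return the words that occur exactly once, in order of first appearance."""
    counts = Counter(tokens)
    return [word for word, count in counts.items() if count == 1]

tokens = "the quick brown fox saw the lazy fox".split()
print(hapaxes(tokens))
```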

Terminology • Bigram – sequential pair of words • From the sentence fragment “The quick brown fox…” • (“The”, “quick”), (“quick”, “brown”), (“brown”, “fox…”)
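Bigrams are the n = 2 case of n-grams. A short sketch in plain Python (the function name is this sketch's own):

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["The", "quick", "brown", "fox"]
bigram_list = ngrams(tokens, 2)
trigram_list = ngrams(tokens, 3)
print(bigram_list)
print(trigram_list)
```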

Terminology • Frequency Distribution – tabulation of values according to how often a value occurs in a sample • e.g. word frequency in a corpus, or word length in a corpus • >>> fdist = FreqDist(samples)
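Both examples, word frequency and word length, can be sketched with collections.Counter as a stand-in for FreqDist:

```python
from collections import Counter

tokens = "the quick brown fox jumped over the log".split()

# How often each word occurs.
word_freq = Counter(tokens)

# How often each word length occurs.
length_freq = Counter(len(token) for token in tokens)

print(word_freq.most_common(3))
print(length_freq.most_common(3))
```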

Terminology • Conditional Frequency Distribution – tabulation of values according to how often a value occurs in a sample given a condition • e.g. how often a word is tagged as a noun compared to a verb • >>> cfd = nltk.ConditionalFreqDist(tagged_corpus)
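A conditional frequency distribution is essentially one frequency distribution per condition. The sketch below builds one from (condition, sample) pairs, here (word, tag); the tagged data is invented.

```python
from collections import Counter, defaultdict

def conditional_freq(pairs):
    """Map each condition to a Counter of the samples seen with it."""
    cfd = defaultdict(Counter)
    for condition, sample in pairs:
        cfd[condition][sample] += 1
    return cfd

# Invented (word, tag) pairs; the word is the condition, the tag is the sample.
tagged_corpus = [
    ("I", "PRON"), ("run", "VERB"),
    ("a", "DET"), ("run", "NOUN"),
    ("they", "PRON"), ("run", "VERB"),
]
cfd = conditional_freq(tagged_corpus)
print(cfd["run"])
```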

Tagging • POS tagging

Types of taggers – Default • Default – assigns the same tag (here, noun) to every token • Accuracy = 0.13
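A sketch of a default tagger and of how accuracy is measured against hand-tagged data. The tag names and the five-word gold sample are invented, so the score here is not comparable to the figure quoted on the slide.

```python
def default_tag(tokens, tag="NN"):
    """Assign the same tag to every token."""
    return [(token, tag) for token in tokens]

def accuracy(predicted, gold):
    """Fraction of tokens whose predicted (word, tag) pair matches the gold one."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)

# Invented gold-standard sample.
gold = [("the", "DT"), ("dog", "NN"), ("chased", "VBD"), ("the", "DT"), ("cat", "NN")]
predicted = default_tag([word for word, _ in gold])
print(accuracy(predicted, gold))
```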

Types of taggers – RE • Regular Expression – uses a set of regexes to tag based on word patterns • Accuracy = 0.2
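A sketch of the regex approach: try the patterns in order and take the first match. The patterns and tags are illustrative assumptions, not a recommended rule set.

```python
import re

# Illustrative patterns only; order matters because the first match wins.
PATTERNS = [
    (re.compile(r".*ing"), "VBG"),               # gerunds: running
    (re.compile(r".*ed"), "VBD"),                # simple past: jumped
    (re.compile(r".*s"), "NNS"),                 # plural nouns: dogs
    (re.compile(r"-?[0-9]+(\.[0-9]+)?"), "CD"),  # cardinal numbers: 42
    (re.compile(r".*"), "NN"),                   # fallback: noun
]

def regex_tag(tokens):
    """Tag each token with the tag of the first pattern that matches it fully."""
    tagged = []
    for token in tokens:
        for pattern, tag in PATTERNS:
            if pattern.fullmatch(token):
                tagged.append((token, tag))
                break
    return tagged

print(regex_tag(["running", "jumped", "dogs", "42", "fox"]))
```

Rules like these misfire easily (for example, "bus" would be tagged as a plural noun), which is consistent with the low accuracy on the slide.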

Types of taggers – Unigram • Unigram – learns the most likely tag for each individual word, regardless of context • i.e. a lookup table • NLTK example accuracy = 0.46 • Supervised learning

Types of taggers – Unigram • Based on conditional frequency analysis of a corpus • P (tag | word) • e.g. what is the probability of the word “run” having the tag “verb”?
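A sketch of the lookup-table idea: count how often each word carries each tag in training data, then always pick the most frequent tag for that word. The training sentences and tag names are invented.

```python
from collections import Counter, defaultdict

def count_tags(tagged_sentences):
    """For each word, count how often it appears with each tag."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return counts

def tag_probability(counts, word, tag):
    """Estimate the probability that `word` carries `tag`, from raw counts."""
    tags = counts.get(word.lower(), Counter())
    total = sum(tags.values())
    return tags[tag] / total if total else 0.0

def unigram_tag(tokens, counts, default=None):
    """Tag each token with its most frequent training tag, ignoring context."""
    tagged = []
    for token in tokens:
        tags = counts.get(token.lower())
        best = tags.most_common(1)[0][0] if tags else default
        tagged.append((token, best))
    return tagged

# Invented training data.
train_sents = [
    [("they", "PRON"), ("run", "VERB")],
    [("we", "PRON"), ("run", "VERB")],
    [("a", "DET"), ("run", "NOUN")],
]
counts = count_tags(train_sents)
print(tag_probability(counts, "run", "VERB"))
print(unigram_tag(["a", "run"], counts))
```

Because context is ignored, "run" is tagged as a verb even after "a", where the training data had it as a noun.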

Types of taggers – N-gram • N-gram tagger – expands the unigram tagger concept to include the context of the N−1 previous tokens • Including 1 previous token is a bigram tagger • Including 2 previous tokens is a trigram tagger
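A sketch of a bigram tagger that looks up the current word together with the previous tag and backs off to a unigram lookup, then to a default tag, when that context was never seen. The backoff chain, training data, and tag names are assumptions of this sketch.

```python
from collections import Counter, defaultdict

START = "<s>"  # stand-in for "no previous tag" at the start of a sentence

def train(tagged_sentences):
    """Collect tag counts per word and per (previous tag, word) context."""
    unigram = defaultdict(Counter)
    bigram = defaultdict(Counter)
    for sentence in tagged_sentences:
        prev_tag = START
        for word, tag in sentence:
            unigram[word][tag] += 1
            bigram[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return unigram, bigram

def bigram_tag(tokens, unigram, bigram, default="NN"):
    """Tag left to right, preferring the bigram context when it was seen in training."""
    tagged = []
    prev_tag = START
    for word in tokens:
        if (prev_tag, word) in bigram:
            best = bigram[(prev_tag, word)].most_common(1)[0][0]
        elif word in unigram:
            best = unigram[word].most_common(1)[0][0]  # back off: ignore context
        else:
            best = default                             # back off: unknown word
        tagged.append((word, best))
        prev_tag = best
    return tagged

# Invented training data.
train_sents = [
    [("they", "PRON"), ("run", "VERB")],
    [("we", "PRON"), ("run", "VERB")],
    [("a", "DET"), ("run", "NOUN")],
]
unigram, bigram = train(train_sents)
print(bigram_tag(["a", "run"], unigram, bigram))
print(bigram_tag(["the", "run"], unigram, bigram))
```

With the previous tag available, "run" after "a" is tagged as a noun, which a context-free lookup would miss.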

Types of taggers – N-gram • N-gram tagging can be modeled with Hidden Markov Models • P (word | tag) * P (tag | previous N−1 tags) • e.g. the probability of seeing the word “run” given the tag “verb” * the probability of the tag “verb” given that the previous tag was “noun”
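A sketch of the product above for the bigram case: estimate P(word | tag) and P(tag | previous tag) from counts, then multiply them along a tagged sentence. There is no smoothing and no search over tag sequences (a full HMM tagger would add both); the training data is invented.

```python
from collections import Counter

START = "<s>"  # stand-in for the start of a sentence

def train_hmm(tagged_sentences):
    """Count emissions (tag, word) and transitions (previous tag, tag)."""
    emission = Counter()
    transition = Counter()
    tag_counts = Counter()   # how often each tag emitted a word
    prev_counts = Counter()  # how often each tag (or START) preceded a tag
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            emission[(tag, word)] += 1
            transition[(prev, tag)] += 1
            tag_counts[tag] += 1
            prev_counts[prev] += 1
            prev = tag
    return emission, transition, tag_counts, prev_counts

def score(tagged_sentence, model):
    """Multiply P(word | tag) * P(tag | previous tag) over the sentence."""
    emission, transition, tag_counts, prev_counts = model
    prob = 1.0
    prev = START
    for word, tag in tagged_sentence:
        if tag_counts[tag] == 0 or prev_counts[prev] == 0:
            return 0.0  # unseen tag or context; no smoothing in this sketch
        p_word_given_tag = emission[(tag, word)] / tag_counts[tag]
        p_tag_given_prev = transition[(prev, tag)] / prev_counts[prev]
        prob *= p_word_given_tag * p_tag_given_prev
        prev = tag
    return prob

# Invented training data.
train_sents = [
    [("dogs", "NOUN"), ("run", "VERB")],
    [("cats", "NOUN"), ("run", "VERB")],
    [("a", "DET"), ("run", "NOUN")],
]
model = train_hmm(train_sents)
verb_score = score([("dogs", "NOUN"), ("run", "VERB")], model)
noun_score = score([("dogs", "NOUN"), ("run", "NOUN")], model)
print(verb_score, noun_score)
```

The tagging with the higher score is preferred; here "run" as a verb after a noun beats "run" as a noun after a noun.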
