Natural Language Processing
Overview of this unit • Week 1, Natural Language Processing: work in partners on a lab with NLTK; brainstorm and start projects using NLP, speech recognition, or both • Week 2, Speech Recognition: speech lab; finish projects and a short critical reading • Week 3: present projects; discuss the reading
Natural Language Processing • What is “Natural Language”?
Components of Language • Phonetics – the sounds that make up a word, e.g. “cat”: k a t • Morphology – the rules by which words are composed, e.g. run + ing • Syntax – the rules for forming grammatical sentences, e.g. “Colorless green ideas sleep furiously.” but not “Colorless ideas green sleep furiously.” • Semantics – meaning, e.g. “rose” • Pragmatics – the relationship of meaning to the context, goals, and intent of the speaker, e.g. “Duck!” • Discourse – language “beyond the sentence boundary”
Natural Language Processing • Truly interdisciplinary • Probabilistic methods • APIs
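NLTK • Natural Language Toolkit for Python • Text, not speech • Corpora, tokenizers, stemmers, taggers, chunkers, parsers, classifiers, clusterers…
>>> import nltk
>>> words = book.words()                     # book: an NLTK corpus reader (not defined on the slide)
>>> bigrams = nltk.bigrams(words)            # adjacent word pairs
>>> cfd = nltk.ConditionalFreqDist(bigrams)  # counts of each next word, given the previous word
>>> pos = nltk.pos_tag(words)                # part-of-speech tag for each word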
Terminology • Token - An instance of a symbol, commonly a word, a linguistic unit
Terminology • Tokenize – to break a sequence of characters into constituent parts • Often splits on delimiters such as whitespace, special characters, or newlines • Easy case: “The quick brown fox jumped over the log.” • Harder case: “Mr. Brown, we’re confused by your article in the newspaper regarding widely-used words.”
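The difference between the two example sentences can be seen with a small sketch. This is not NLTK's tokenizer; it is a minimal illustration using only Python's standard `re` module, and the function names `whitespace_tokenize` and `regex_tokenize` are made up for this example.

```python
import re

SENTENCE = ("Mr. Brown, we're confused by your article in the newspaper "
            "regarding widely-used words.")

def whitespace_tokenize(text):
    """Split on runs of whitespace only; punctuation stays glued to words."""
    return text.split()

def regex_tokenize(text):
    """Keep words (allowing internal hyphens and apostrophes) and
    emit every other non-space character as its own token."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

print(whitespace_tokenize(SENTENCE))
print(regex_tokenize(SENTENCE))
```

Neither result is obviously right: splitting on whitespace leaves "Brown," and "words." with punctuation attached, while the regular expression splits the abbreviation "Mr." into "Mr" and ".". Deciding what counts as a token is a design choice.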
Terminology • Lexeme – the set of forms taken by a single word; the main entries in a dictionary • e.g. run [ruhn]: verb (run, ran, runs, running); noun (run); adjective (runny)
Terminology • Morpheme – the smallest meaningful unit in the grammar of a language • Unladylike (un + lady + like) • Dogs (dog + s) • Technique (a single morpheme)
Terminology • Sememe – a unit of meaning attached to a morpheme • dog – a domesticated carnivorous mammal • -s – a plural marker on nouns
Terminology • Phoneme - the smallest contrastive unit in the sound system of a language • /k/ sound in the words kit and skill • /e/ in peg and bread • International Phonetic Alphabet (IPA)
Terminology • Lexicon – a vocabulary; the set of a language’s lexemes
Terminology • Lexical Ambiguity – multiple alternative linguistic structures can be built for the input • e.g. “I made her duck” (cooked a duck for her, or caused her to lower her head) • POS tagging and word sense disambiguation are used to attempt to resolve these ambiguities
Terminology • Part of Speech - how a word is used in a sentence
Terminology • Grammar – the syntax and morphology of a natural language
Terminology • Corpus/Corpora - a body of text which may or may not include meta-information such as POS, syntactic structure, and semantics
Terminology • Concordance – a list of the usages of a word in its immediate context from a specific text • >>> text1.concordance("monstrous")
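To make the idea concrete without loading an NLTK text, here is a minimal keyword-in-context sketch in plain Python. The helper name `concordance`, the `width` parameter, and the toy sentence are all invented for illustration; NLTK's own method formats its output differently.

```python
def concordance(tokens, target, width=3):
    """Return each occurrence of `target` with up to `width` tokens
    of context on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Toy text, made up for this example.
tokens = "the quick brown fox saw the lazy dog and the fox ran".split()
for line in concordance(tokens, "fox"):
    print(line)
```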
Terminology • Collocation – a sequence of words that occur together unusually often • e.g. red wine • >>> text4.collocations()
Terminology • Hapax – a word that appears once in a corpus • >>> fdist.hapaxes()
Terminology • Bigram – sequential pair of words • From the sentence fragment “The quick brown fox…” • (“The”, “quick”), (“quick”, “brown”), (“brown”, “fox…”)
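A bigram list can be built by pairing each token with its successor. The sketch below uses plain Python rather than `nltk.bigrams`, and the function name `bigrams` is simply chosen to mirror it.

```python
def bigrams(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))

tokens = ["The", "quick", "brown", "fox"]
print(bigrams(tokens))
```

A sequence of n tokens yields n - 1 bigrams, so a single-token input yields none.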
Terminology • Frequency Distribution – a tabulation of values according to how often each value occurs in a sample • e.g. word frequency or word length in a corpus • >>> fdist = FreqDist(samples)
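The same tabulation can be sketched with `collections.Counter` from the standard library, which behaves much like a frequency distribution for this purpose. The toy sentence is made up; the last line also recovers the hapaxes (words seen exactly once).

```python
from collections import Counter

# Toy sample, made up for this example.
tokens = "the cat sat on the mat and the dog sat too".split()

word_freq = Counter(tokens)                      # word -> count
length_freq = Counter(len(t) for t in tokens)    # word length -> count
hapaxes = sorted(w for w, n in word_freq.items() if n == 1)

print(word_freq.most_common(2))
print(length_freq)
print(hapaxes)
```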
Terminology • Conditional Frequency Distribution – a tabulation of values according to how often each value occurs in a sample, given a condition • e.g. how often a word is tagged as a noun compared to a verb • >>> cfd = nltk.ConditionalFreqDist(tagged_corpus)
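A conditional frequency distribution is one frequency distribution per condition. As a rough sketch in plain Python, a `defaultdict` of `Counter` objects gives the same shape; here the condition is the word and the counted value is its tag. The tagged tokens and tag names below are invented for illustration.

```python
from collections import Counter, defaultdict

# Toy (word, tag) pairs, made up for this example.
tagged = [
    ("I", "PRON"), ("run", "VERB"), ("every", "DET"), ("day", "NOUN"),
    ("a", "DET"), ("run", "NOUN"), ("helps", "VERB"),
    ("they", "PRON"), ("run", "VERB"),
]

cfd = defaultdict(Counter)          # condition (word) -> Counter of tags
for word, tag in tagged:
    cfd[word.lower()][tag] += 1

print(cfd["run"])                   # how often "run" is a verb vs. a noun
print(cfd["run"].most_common(1))    # its most frequent tag
```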
Tagging • POS tagging
Types of taggers – Default • Default – tags every token as a noun • Accuracy: 0.13
Types of taggers – RE • Regular Expression – uses a set of regexes to tag based on word patterns • Accuracy: 0.2
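The two baseline taggers can be sketched in a few lines of plain Python. This is an illustration, not NLTK's `DefaultTagger` or `RegexpTagger`: the patterns, tag names, and the six-word gold sample are assumptions made up for the example, so the accuracies it prints are not comparable to the figures on the slides.

```python
import re

# Hypothetical suffix patterns, tried in order; first match wins.
PATTERNS = [
    (r".*ing$", "VBG"),                 # gerunds
    (r".*ed$", "VBD"),                  # simple past
    (r".*s$", "NNS"),                   # plural nouns
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),   # numbers
]

def default_tag(tokens, tag="NN"):
    """Tag every token with the same tag."""
    return [(tok, tag) for tok in tokens]

def regex_tag(tokens, patterns=PATTERNS, fallback="NN"):
    """Tag each token with the first matching pattern, else the fallback."""
    tagged = []
    for tok in tokens:
        for pattern, tag in patterns:
            if re.match(pattern, tok):
                tagged.append((tok, tag))
                break
        else:
            tagged.append((tok, fallback))
    return tagged

def accuracy(predicted, gold):
    """Fraction of (token, tag) pairs that match the gold standard."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Toy gold standard, made up for this example.
gold = [("dogs", "NNS"), ("jumped", "VBD"), ("running", "VBG"),
        ("42", "CD"), ("fox", "NN"), ("the", "DT")]
tokens = [word for word, _ in gold]

print(accuracy(default_tag(tokens), gold))
print(accuracy(regex_tag(tokens), gold))
```

On this toy sample the regex tagger beats the default tagger, but it still misses "the", since no spelling pattern identifies a determiner.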
Types of taggers – Unigram • Unigram – learns the most likely tag for each individual word, regardless of context • i.e. a lookup table • NLTK example accuracy: 0.46 • Supervised learning
Types of taggers – Unigram • Based on conditional frequency analysis of a corpus • P(tag | word) • e.g. what is the probability that the word “run” has the tag “verb”?
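A unigram tagger reduces to a lookup table that maps each word to its most frequent tag in the training data. The sketch below is a plain-Python approximation, not NLTK's `UnigramTagger`; the function names and the tiny training set are hypothetical.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_tokens):
    """Build a word -> most frequent tag lookup table from (word, tag) pairs."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def unigram_tag(tokens, model, fallback=None):
    """Tag each token from the table; unseen words get the fallback tag."""
    return [(tok, model.get(tok, fallback)) for tok in tokens]

# Toy training data, made up for this example.
train = [("they", "PRON"), ("run", "VERB"), ("a", "DET"),
         ("run", "NOUN"), ("we", "PRON"), ("run", "VERB")]
model = train_unigram(train)

print(unigram_tag(["we", "run", "fast"], model))
```

Because "run" is tagged as a verb twice and a noun once in the training data, it is always tagged as a verb, whatever the context. Words never seen in training, such as "fast" here, get no tag at all unless a fallback is supplied.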
Types of taggers – N-gram • N-gram tagger – extends the unigram tagger concept to include the context of the N−1 previous tokens • A bigram tagger includes 1 previous token • A trigram tagger includes 2 previous tokens
Types of taggers – N-gram • N-gram taggers based on Hidden Markov Models score each candidate tag as P(word | tag) × P(tag | previous N−1 tags) • e.g. for a bigram model, the probability of the word “run” given the tag “verb” × the probability of the tag “verb” given that the previous tag was “noun”
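The product above can be estimated from counts in a tagged corpus. The following is a minimal sketch for the bigram case only, using unsmoothed maximum-likelihood estimates on three made-up sentences; it scores a single word and does not implement full HMM decoding (for example Viterbi). All names and data are assumptions for illustration.

```python
from collections import Counter

# Toy tagged sentences, made up for this example.
sentences = [
    [("dogs", "NOUN"), ("run", "VERB")],
    [("cats", "NOUN"), ("run", "VERB")],
    [("the", "DET"), ("run", "NOUN"), ("ended", "VERB")],
]

emission = Counter()     # (tag, word) counts
transition = Counter()   # (previous tag, tag) counts
tag_count = Counter()    # how often each tag occurs
prev_count = Counter()   # how often each tag (or "<s>") precedes another tag

for sent in sentences:
    prev = "<s>"         # sentence-start marker
    for word, tag in sent:
        emission[(tag, word)] += 1
        transition[(prev, tag)] += 1
        tag_count[tag] += 1
        prev_count[prev] += 1
        prev = tag

def score(word, tag, prev_tag):
    """P(word | tag) * P(tag | prev_tag), estimated from raw counts."""
    if tag_count[tag] == 0 or prev_count[prev_tag] == 0:
        return 0.0
    p_word_given_tag = emission[(tag, word)] / tag_count[tag]
    p_tag_given_prev = transition[(prev_tag, tag)] / prev_count[prev_tag]
    return p_word_given_tag * p_tag_given_prev

print(score("run", "VERB", "NOUN"))   # "run" after a noun, read as a verb
print(score("run", "NOUN", "DET"))    # "run" after a determiner, read as a noun
print(score("run", "VERB", "DET"))    # "run" after a determiner, read as a verb
```

Unlike the unigram lookup, the preferred tag for "run" now depends on the previous tag: verb after a noun, noun after a determiner. Without smoothing, any tag sequence unseen in training scores zero, which is one reason practical taggers back off to simpler models.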