LING / C SC 439/539Statistical Natural Language Processing • Lecture 23 • 4/10/2013
Recommended reading • Zellig Harris. 1954. From phoneme to morpheme. • Jenny R. Saffran, Richard N. Aslin, and Elissa L. Newport. 1996. Statistical learning by 8-month-old infants. Science, 274, 1926-1928. • Timothy Gambell and Charles Yang. 2005. Word segmentation: quick but not dirty. Manuscript. • Daniel Hewlett and Paul Cohen. 2011. Word segmentation as general chunking. Proceedings of CoNLL. • Rie K. Ando and Lillian Lee. 2003. Mostly-unsupervised statistical segmentation of Japanese kanji sequences. Natural Language Engineering, 9(2).
Outline • Introduction to unsupervised learning • Word segmentation by letter successors • Word segmentation by transitional probability • Word segmentation using vocabulary bootstrapping and stress information • Word segmentation as general chunking • Word segmentation in Chinese and Japanese
Types of machine learning • Supervised learning • Data is annotated for the label to be predicted • Learn a mapping from features to labels • Semi-supervised learning • Partially annotated data • Unsupervised learning • Data is not annotated for the concept to be learned • Use features to group similar data together
Why unsupervised learning? • Annotated data is expensive • Unsupervised learning can be used to discover structure in new data sets • Categories learned in an unsupervised manner may be useful as features in supervised learning • Gold-standard annotations do not necessarily directly reflect statistical properties of data • e.g. Nonterminal rewriting in parsing • Model child language acquisition • Children learn in an unsupervised manner
Applications of unsupervised learning in NLP • Unsupervised induction of: • Word segmentation • Morphology • POS categories • Word collocations • Semantic categories of words • Paraphrase discovery • etc. • “Induction” = discover from scratch
Computational approaches to unsupervised learning • Algorithms: k-means clustering, agglomerative clustering, mutual information clustering, singular value decomposition, probabilistic models • Computational issues: representation, search space, minimum description length, data sparsity, Zipf's law
Linguistic issues: learning bias • Unsupervised learning is interesting from a linguistic point of view because it involves both rationalist and empiricist approaches to language • Empiricist • Knowledge is obtained from experience (=data) • Rationalist • Knowledge results from the capacities of the mind • Learning bias: learner is predisposed to acquire certain end results due to how it was programmed • Learning cannot be entirely “knowledge-free”
Linguistic issues: language specificity • Empiricist perspective: opposed to building linguistic theory into NLP systems merely for the sake of adhering to the theory • View language as just one of many kinds of data • Apply general-purpose learning algorithms that are applicable to other kinds of data • Language-specific learning algorithms are not necessary • If such methods succeed, that strengthens the claim that linguistic theory isn't needed
First application of unsupervised learning: word segmentation • Word segmentation problem: howdoyousegmentacontinuousstreamofwords? • Use statistical regularities between items in sequence to discover word boundaries • Look at 5 different approaches, from different fields • Old-school Linguistics • Psychology • Computational Linguistics • Artificial Intelligence • Applied NLP
Applications of word segmentation • Speech recognition • Break acoustic signal (which is continuous) into phonemes / morphemes / words • Languages written without spaces • Asian languages • Decipher ancient texts • Language acquisition • How children identify words from continuous speech • Identify morphemes in sign language
Outline • Introduction to unsupervised learning • Word segmentation by letter successors • Word segmentation by transitional probability • Word segmentation using vocabulary bootstrapping and stress information • Word segmentation as general chunking • Word segmentation in Chinese and Japanese
Zellig Harris • “Structuralist” linguist • Pre-Chomsky; was Chomsky’s advisor • Proposed automatic methods for a linguist to discover the structure of a language • Theories are based on, and account for observed data only • Do not propose abstract representations • Don’t use introspection
Harris 1954: Letter successors • Have a sequence of phonemes, don’t know where the boundaries are • Idea: morpheme/word boundaries occur where there are many possible letter successors • Resembles entropy, but is more primitive • Example: successors of he’s: • he’s crazy • he’s quiet • he’s careless
(from A. Albright) • Segment he's quicker: hIyzkwIker • Number of letter successors at each position: • hI: 14 • hIy: 29 • hIyz: 29 • hIyzk: 11 • Propose a boundary at the local maximum, i.e. after hIy / hIyz
Backtracking • When the successor count drops to zero, go back to the previous peak and treat it as the start of a chunk
Segmentation algorithm • Calculate successor counts, with backtracking at zero successors • Segment at local maxima in successor counts • Results (figure): lines mark the segmentation choices; solid lines = true word boundaries, dotted lines = morpheme boundaries
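To make the procedure concrete, here is a minimal Python sketch of successor-count segmentation; the toy phonemic corpus, the exact peak test, and the omission of backtracking are simplifying assumptions, not Harris's original procedure.

```python
def successor_counts(corpus, utterance):
    """For each prefix length k, count how many distinct symbols follow
    utterance[:k] anywhere in the corpus (Harris's successor count)."""
    counts = {}
    for k in range(1, len(utterance)):
        prefix = utterance[:k]
        followers = {u[k] for u in corpus if u.startswith(prefix) and len(u) > k}
        counts[k] = len(followers)
    return counts

def segment_at_peaks(utterance, counts):
    """Cut after prefix length k when the successor count is a local
    maximum (backtracking at zero successors is omitted here)."""
    cuts = [k for k in range(2, len(utterance) - 1)
            if counts[k] > counts[k - 1] and counts[k] >= counts[k + 1]]
    pieces, prev = [], 0
    for k in cuts:
        pieces.append(utterance[prev:k])
        prev = k
    return pieces + [utterance[prev:]]

# toy phonemic corpus: "he's crazy", "he's quiet", "he's laughing", "he's tall", "he's quicker"
corpus = ["hIyzkreyziy", "hIyzkwayet", "hIyzlaefIng", "hIyztol", "hIyzkwIker"]
utt = "hIyzkwIker"                      # "he's quicker"
counts = successor_counts(corpus, utt)
print(counts)                           # peak of 3 successors after 'hIyz'
print(segment_at_peaks(utt, counts))    # ['hIyz', 'kwIker']
```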
Outline • Introduction to unsupervised learning • Word segmentation by letter successors • Word segmentation by transitional probability • Word segmentation using vocabulary bootstrapping and stress information • Word segmentation as general chunking • Word segmentation in Chinese and Japanese
Children figure out word segmentation • Normal conversation: continuous flow of speech, no pauses between words • One task in language acquisition is to figure out the word boundaries • How do children do it? • Bootstrap through isolated words • Phonetic/phonological constraints • Statistical approach: transitional probability
1. Bootstrap through isolated words • Old idea: • Children hear isolated words: doggy • Use these words to segment more speech: baddoggy • Problems: • How do children recognize single-word utterances? • In English, only 9% of utterances are single words • Counting syllables doesn't work: spaghetti is a single word with three syllables
2. Phonetic/phonological constraints • Phonotactics • Some sound combinations not allowable in English • zl, mb, tk • Could hypothesize word boundary here • However, could occur word-internally: embed • Articulatory cues • Aspirated vs. unaspirated t • tab vs. cat • Could use this knowledge to mark word boundaries • Problems: • This is from the adult point of view; how do children acquire this knowledge?
3. Use statistics: transitional probability • Transitional probability (a conditional probability): TP(AB) = P(AB) / P(A) • (Figure: A may be followed by B, C, or D, with probabilities TP(AB), TP(AC), TP(AD)) • Idea: the TP between syllables signals word boundaries • High TP within words, low TP across words • Example: pre.tty ba.by • TP(pre, tty) > TP(tty, ba) • A child could use TP statistics to segment words
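A minimal sketch of how a learner could estimate syllable TPs from a stream and cut at TP dips; the toy stream (built from pre.tty, ba.by, dog.gy) and the strict local-minimum criterion are illustrative assumptions rather than the procedure of any particular study.

```python
from collections import Counter

def transitional_probs(syllables):
    """Estimate TP(A -> B) = count(A B) / count(A) from a syllable stream."""
    unigrams = Counter(syllables[:-1])                 # A's that have a successor
    bigrams = Counter(zip(syllables, syllables[1:]))
    return {(a, b): c / unigrams[a] for (a, b), c in bigrams.items()}

def segment_at_dips(syllables, tp):
    """Place a word boundary wherever TP dips below both neighbouring TPs."""
    tps = [tp[(a, b)] for a, b in zip(syllables, syllables[1:])]
    words, start = [], 0
    for i in range(1, len(tps) - 1):
        if tps[i] < tps[i - 1] and tps[i] < tps[i + 1]:
            words.append(syllables[start:i + 1])       # boundary after syllable i
            start = i + 1
    return words + [syllables[start:]]

# toy stream built from three two-syllable words: pre.tty, ba.by, dog.gy
stream = "pre tty ba by dog gy ba by pre tty dog gy pre tty ba by".split()
tp = transitional_probs(stream)
print(segment_at_dips(stream, tp))
# [['pre', 'tty'], ['ba', 'by'], ['dog', 'gy'], ...] : every dip falls at a word boundary here
```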
Saffran, Aslin, & Newport 1996 • Test whether infants can track transitional probability statistics • 8-month-olds • Artificial language • 4 consonants (p, t, b, d), 3 vowels (a, i, u) • 12 syllables (pa, ti, bu, da, etc.) • 6 words: babupu, bupada, dutaba, patubi, pidabu, tutibu • TP: 1.0 within words, 0.33 across words • No co-articulation, stress, or other cues to word boundaries • Stimuli • 2 minutes of a continuous stream of words • monotone voice, synthesized speech • bidakupadotigolabubidaku...
ba.bu.pu | bu.pa.da | du.ta.ba • High TP within words (ba→bu, bu→pu, ...), low TP across word boundaries (pu→bu, da→du) • Word boundaries occur at dips in transitional probability
Testing children • Test stimuli • Same syllables • Novel words whose TPs differ from the training stimuli • Test for a preference between highly frequent (training) words and rare (novel) words • Results: mean listening times (seconds): Familiar = 6.77, Novel = 7.60; matched-pairs t test: t(23) = 2.3, p < 0.03 • Conclusion: • Children are sensitive to the transitional probabilities of syllables, because they show a preference for the novel stimuli
Head-turn preference procedure • (Figure: child, light, and loudspeakers)
Conclusions • Supports idea that children learn to segment words through transitional probability statistics • Frequently used as an argument against innate knowledge in language acquisition • “Results raise the possibility that infants possess experience dependent mechanisms that may be powerful enough to support not only word segmentation but also the acquisition of other aspects of language.” • Other research shows that it’s not unique to humans • Monkeys can do this, too
Outline • Introduction to unsupervised learning • Word segmentation by letter successors • Word segmentation by transitional probability • Word segmentation using vocabulary bootstrapping and stress information • Word segmentation as general chunking • Word segmentation in Chinese and Japanese
Computational model of acquisition of word segmentation • Saffran et al. showed that children are sensitive to transitional probabilities • But does that mean this is how children actually do it? • Test with a computational model • Precisely defined input/output and algorithm • Apply it to a corpus: discover words from a continuous sequence of phoneme symbols
Data • Portion of English CHILDES corpus • Transcriptions of adult / child speech • 226,178 words • 263,660 syllables • Corpus preparation • Take adult speech • Look up words in CMU pronunciation dictionary, which has stress indicated • cat K AE1 T • catapult K AE1 T AH0 P AH0 L T • Apply syllabification heuristics • Remove spaces between words
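A hedged sketch of the stress-lookup step using NLTK's copy of the CMU Pronouncing Dictionary, which marks stress with digits on vowel phones; taking only the first pronunciation and collapsing secondary stress to weak are simplifying assumptions, and the syllabification heuristics are omitted.

```python
from nltk.corpus import cmudict      # requires a one-time nltk.download('cmudict')

pron = cmudict.dict()                # word -> list of ARPABET pronunciations

def stress_pattern(word):
    """Map a word to one S/W symbol per syllable, using the stress digits
    (0/1/2) that the CMU dictionary attaches to vowel phones.
    Uses the first pronunciation only; secondary stress (2) counts as weak."""
    phones = pron[word.lower()][0]   # no handling of out-of-dictionary words
    digits = [ph[-1] for ph in phones if ph[-1].isdigit()]
    return "".join("S" if d == "1" else "W" for d in digits)

print(stress_pattern("cat"))         # 'S'    (K AE1 T)
print(stress_pattern("catapult"))    # 'SWW'  (K AE1 T AH0 P AH0 L T)
```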
Gambell & Yang 2005: test TP • Test syllable TP, without stress • Propose word boundaries at local minima in TP • i.e., in a syllable sequence A B C D, propose a word boundary between B and C if TP(AB) > TP(BC) < TP(CD) • Results • Precision: 41.6% • Recall: 23.3% • TP doesn't work!
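The slide does not say whether these scores are computed over words or over boundary positions, so here is an illustrative sketch of one common scoring scheme, boundary-level precision and recall; the example boundaries are hypothetical.

```python
def boundary_prf(gold, predicted):
    """Precision / recall / F1 over boundary positions, where a boundary
    at index i falls between syllable i-1 and syllable i."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# hypothetical example: gold segmentation "the | ki tty | is | pre tty"
gold_boundaries = {1, 3, 4}        # boundaries after syllables 0, 2, 3
predicted_boundaries = {1, 2, 4}   # two hits, one false alarm, one miss
print(boundary_prf(gold_boundaries, predicted_boundaries))   # (0.667, 0.667, 0.667), approximately
```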
Problems with the Saffran et al. study • The artificial language is too artificial • Very small vocabulary • All words are 3 syllables long • The only TPs in Saffran et al.'s stimuli are 1.0 and 0.33 • Why TP doesn't work on real speech • Sparse data: 54,448 different syllable pairs • TP requires multisyllable words! • Single-syllable words have no within-word TP • In the corpus, a single-syllable word is followed by another single-syllable word 85% of the time
TP is weak as a cognitive model • Computationally complex • A huge number of TPs to keep track of • Can't be psychologically plausible • TP doesn't work on the corpus, so kids can't be using just TP • Not linguistically motivated
Gambell & Yang 2005: use stress for segmentation • Unique Stress Constraint (USC): • A word can bear at most one primary stress • Assumed innate, part of Universal Grammar • Darth Va-der: S S W (S = strong/stressed syllable, W = weak) • Two stressed syllables cannot belong to the same word, so segment between them: [Darth] [Va-der]
Use stress for segmentation • Automatically identifies single-syllable stressed words • But what about Chew-ba-cca? (W S W) • And in a sequence with one or more weak syllables between two strong syllables, where is the word boundary? S W ... W S
Model 1: SL + USC (SL = statistical learning = TP) • Input: transcribed speech with stress • Training: calculate transitional probabilities • Testing • Scan the sequence of syllables • If two strong syllables are adjacent, propose a word boundary between them • If one or more weak syllables lie between two strong ones, propose a word boundary at the link where TP is lowest • Performance • Precision = 73.5%, Recall = 71.2%
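A minimal sketch of the Model 1 decision rule as described above; the input representation (parallel lists of syllables and S/W marks plus a TP table) is an assumption, and weak syllables before the first or after the last stress are left unhandled.

```python
def segment_sl_usc(sylls, stresses, tp):
    """sylls: list of syllables; stresses: parallel list of 'S' / 'W';
    tp: dict mapping (syllable_a, syllable_b) -> transitional probability.
    Returns boundary positions; a boundary at i falls between sylls[i-1]
    and sylls[i]."""
    boundaries = set()
    strong = [i for i, s in enumerate(stresses) if s == "S"]
    for a, b in zip(strong, strong[1:]):
        if b == a + 1:
            boundaries.add(b)        # two adjacent stressed syllables: USC forces a cut
        else:
            # weak syllables in between: cut at the link with the lowest TP
            links = range(a, b)      # link i sits between sylls[i] and sylls[i+1]
            weakest = min(links, key=lambda i: tp.get((sylls[i], sylls[i + 1]), 0.0))
            boundaries.add(weakest + 1)
    return boundaries

sylls    = ["big", "ba", "by", "sleeps"]
stresses = ["S",   "S",  "W",  "S"]
tp = {("ba", "by"): 0.9, ("by", "sleeps"): 0.1}
print(segment_sl_usc(sylls, stresses, tp))   # {1, 3} -> [big] [ba by] [sleeps]
```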
Models 2 and 3: vocabulary bootstrapping • Bootstrapping • Use known words to segment unknown ones • An iterative process that builds up a vocabulary • 3 cases for segmenting S W S: • [ S W ] S, where [ S W ] is a known word • S [ W S ], where [ W S ] is a known word • S W S, with no known word • No transitional probabilities needed!
Models 2 and 3 • Problem: • S W ... W S with no known words • Model 2: algebraic, agnostic • Just skip these cases • They may get segmented later, once a word is identified in a later iteration • Model 3: algebraic, random • Randomly choose a word boundary
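A minimal sketch of the bootstrapping step for the three-syllable S W S case, with Model 2's 'agnostic' behaviour when nothing matches; representing the lexicon as a set of syllable tuples is an assumption.

```python
def segment_sws(sylls, lexicon):
    """sylls: a three-syllable S W S stretch, as a list of syllables.
    lexicon: a set of known words, each a tuple of syllables.
    If [S W] or [W S] is already a known word, split it off and add the
    leftover syllable to the lexicon; otherwise skip (Model 2, 'agnostic')
    and hope a later pass, with a bigger lexicon, can resolve it."""
    left, right = tuple(sylls[:2]), tuple(sylls[1:])
    if left in lexicon:
        lexicon.add((sylls[2],))
        return [list(left), [sylls[2]]]
    if right in lexicon:
        lexicon.add((sylls[0],))
        return [[sylls[0]], list(right)]
    return [sylls]                        # unresolved for now

lexicon = {("ki", "tty")}                 # a previously learned word
print(segment_sws(["ki", "tty", "sleeps"], lexicon))  # [['ki', 'tty'], ['sleeps']]
print(("sleeps",) in lexicon)             # True: the leftover word was learned
```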
Conclusion • Uses linguistic knowledge from UG: the Unique Stress Constraint • Result: no massive storage of TPs is necessary • Problems • How does the child identify stress in the first place? • What about unstressed words? • Function words are often reduced
Outline • Introduction to unsupervised learning • Word segmentation by letter successors • Word segmentation by transitional probability • Word segmentation using vocabulary bootstrapping and stress information • Word segmentation as general chunking • Word segmentation in Chinese and Japanese
Word segmentation as a general chunking problem • Algorithms for segmentation can also be applied to non-linguistic data • Voting Experts algorithm (Paul Cohen, U of A) • Word segmentation can be accomplished by algorithms that are not specific to language • Don’t need to utilize language-specific information such as “stress”
Segmentation of robot behavior (non-linguistic data) • A robot wandered around a room for 30 minutes, examining objects • The robot had 8 different actions: • MOVE-FORWARD • TURN • COLLISION-AVOIDANCE • VIEW-INTERESTING-OBJECT • RELOCATE-INTERESTING-OBJECT • SEEK-INTERESTING-OBJECT • CENTER-CHASSIS-ON-OBJECT • CENTER-CAMERA-ON-OBJECT • Segment the action stream into 5 different episodes, based on the actions at each time step: • FLEEING • WANDERING • AVOIDING • ORBITING-OBJECT • APPROACHING-OBJECT
Characteristics of temporal chunks • Sequences are highly predictable within chunks, and unpredictable between chunks
Expert #1: segment according to the frequency of substrings • If you split a sequence so as to maximize the empirical frequencies of the resulting subsequences, a higher proportion of the splits will fall at word boundaries than an equal number of random splits would • Example: split off the frequent chunks 'THE' and 'AT': THECATSATONTHEMATTOEATHERFOOD
Expert #2: segment according to boundary entropy • If you split a sequence where the uncertainty (entropy) about the next symbol is highest, a higher proportion of the splits will fall at word boundaries than an equal number of random splits would • Example: after 'AT', the entropy of the next symbol is high: THECATSATONTHEMATTOEATHERFOOD
Count # of following letters with a trie • (from P. Cohen)
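A minimal sketch of the counting and boundary-entropy computation, using a flat dictionary of substring counts instead of an explicit trie; the context length and example string come from the slides, the rest is assumed.

```python
import math
from collections import Counter

def ngram_counts(text, max_n):
    """Count every substring of length 1..max_n (a flat stand-in for the trie)."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def boundary_entropy(context, counts, alphabet):
    """Entropy (in bits) of the next symbol given a context string."""
    followers = {c: counts[context + c] for c in alphabet if counts[context + c] > 0}
    total = sum(followers.values())
    return -sum((f / total) * math.log2(f / total) for f in followers.values())

text = "THECATSATONTHEMATTOEATHERFOOD"
counts = ngram_counts(text, 4)
alphabet = set(text)
print(boundary_entropy("AT", counts, alphabet))   # 2.0: S, O, T, H all follow 'AT'
print(boundary_entropy("TH", counts, alphabet))   # 0.0 (may print as -0.0): 'TH' is always followed by 'E' here
```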
Voting • Each expert casts a vote at each point in the sequence • Segment where the number of votes is highest • (Figure: THECATSATONTHEMATTOEATHERFOOD, with the positions that receive 2 votes, one from each expert, marked as proposed boundaries)
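A much-simplified sketch of the voting step, reusing ngram_counts and boundary_entropy from the previous sketch: in each sliding window both experts vote for a split point, and positions that collect enough votes become proposed boundaries. The window size, the frequency score, and the vote threshold are assumptions; the full Voting Experts algorithm also standardizes the experts' scores before voting.

```python
# reuses text, counts, alphabet, and boundary_entropy from the previous sketch

def frequency_score(left, right, counts):
    """Expert 1: prefer splits that create frequent chunks on both sides."""
    return counts[left] + counts[right]

def vote(text, counts, alphabet, window=3):
    """Slide a window over the text; inside each window, both experts vote
    for their preferred split point, and votes are accumulated per position."""
    votes = [0] * (len(text) + 1)
    for start in range(len(text) - window + 1):
        chunk = text[start:start + window]
        splits = range(1, window)
        best_freq = max(splits, key=lambda k: frequency_score(chunk[:k], chunk[k:], counts))
        best_ent = max(splits, key=lambda k: boundary_entropy(chunk[:k], counts, alphabet))
        votes[start + best_freq] += 1
        votes[start + best_ent] += 1
    return votes

votes = vote(text, counts, alphabet)
threshold = 2                              # assumed cutoff; the real algorithm tunes this
print([i for i, v in enumerate(votes) if v >= threshold])   # proposed boundary positions
```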