LING / C SC 439/539 Statistical Natural Language Processing • Lecture 26 • 4/22/2013
Recommended reading • http://en.wikipedia.org/wiki/Cluster_analysis • Martin Redington, Nick Chater, and Steven Finch. 1998. Distributional information: a powerful cue for acquiring syntactic categories. Cognitive Science, 22(4), 425-469. • Toben H. Mintz, Elissa L. Newport, and Thomas Bever. 2002. The distributional structure of grammatical categories in speech to young children. Cognitive Science, 26, 393-425. • Marie Labelle. 2005. The acquisition of grammatical categories: a state of the art. In Henri Cohen & Claire Lefebvre (eds.), Handbook of Categorization in Cognitive Science, Elsevier, 433-457. • E. Chan dissertation, Chapter 6
Outline • POS induction and language acquisition • Agglomerative clustering • Results of agglomerative clustering on POS induction • Failure of previous work to induce lexical categories • Algorithm for induction of lexical categories
POS classes and language acquisition • POS categories such as Noun, Verb, Adjective, etc. • Two theories about their source: • Rationalist: categories are hard-wired in the brain • Counterevidence: categories are not universal across languages • Empiricist: categories are learned • Can be learned by distributional clustering algorithms • Accounts for variability in POS categories across languages
The linking problem • Assume there is a “language of thought” • We have a predisposition to view the world in terms of actions, objects, properties, etc. • The linking problem • If POS (grammatical) categories are learned, they must be mapped onto the semantic representation in order to be used • If POS categories are innate, words in the external language must be mapped onto them, through experience • Combines rationalist and empiricist points of view
Distributional POS induction • Without any knowledge of categories, the (initial) process for learning categories must be distributional, even under a nativist view • Intuition: neighboring words • the ___ of • to ____ • This is the “distributional learning” hypothesis
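As a concrete illustration of the distributional intuition, here is a minimal Python sketch, on a made-up toy corpus and with a hypothetical function name, that counts which words occur in the immediate left and right context of each word:

```python
from collections import Counter, defaultdict

def context_vectors(sentences, positions=(-1, 1)):
    """For each word, count which words occur at the given relative positions."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            for offset in positions:
                j = i + offset
                if 0 <= j < len(sent):
                    vectors[word][(offset, sent[j])] += 1
    return vectors

# Toy, made-up corpus (not CHILDES data)
corpus = [["the", "dog", "of", "mine"],
          ["the", "cat", "of", "yours"],
          ["to", "run"],
          ["to", "jump"]]

vecs = context_vectors(corpus)
print(vecs["dog"])   # Counter({(-1, 'the'): 1, (1, 'of'): 1})
print(vecs["cat"])   # same contexts as "dog"
print(vecs["run"])   # Counter({(-1, 'to'): 1})
```

Words like "dog" and "cat" end up with the same context counts (the ___ of), while "run" and "jump" share the to ___ context; clustering over such vectors is the basis of distributional POS induction.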
What result are we trying to obtain? • POS induction: word classes or lexical categories? • "Word classes" • Discover categories according to distributional context • If the number of classes is high, they indicate fine-grained syntactic and/or semantic classes • Lexical categories • "Nouns", "Verbs", "Adjectives" • The set of categories used by linguists (for English and English-like languages)
Can we use K-means clustering for POS induction? • Yes, if you believe that POS categories are innate: • Nouns, Verbs, Adjectives == 3 separate clusters (won’t work that well… see later sections) • No, if you believe that POS categories are derived from experience • K-means is not the proper algorithm, since the number of clusters is hard-coded • # of open-class POS categories varies across languages • Want an algorithm that produces a variable number of clusters • Agglomerative / hierarchical clustering
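The contrast can be sketched with scikit-learn, assuming it is available; the data and parameter values below are purely illustrative, not a model of acquisition:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

X = np.random.rand(100, 4)          # stand-in for word context vectors

# K-means: the number of clusters k must be fixed in advance
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Agglomerative clustering: no k; a distance threshold determines
# how many clusters emerge, and the count can vary with the data
agg = AgglomerativeClustering(n_clusters=None, distance_threshold=0.8,
                              linkage="average").fit(X)

print("k-means clusters:      ", len(set(kmeans.labels_)))
print("agglomerative clusters:", len(set(agg.labels_)))
```

With AgglomerativeClustering, setting n_clusters=None and supplying a distance_threshold lets the number of clusters fall out of the data, which is the property we want if categories are derived from experience.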
Outline • POS induction and language acquisition • Agglomerative clustering • Results of agglomerative clustering on POS induction • Failure of previous work to induce lexical categories • Algorithm for induction of lexical categories
Agglomerative clustering (also called hierarchical clustering) • Pre-compute similarity matrix between every pair of points being clustered. • Algorithm: • Each data point begins in its own cluster • Successively merge least distant / most similar clusters together • Using some definition of distance between clusters • Produces a dendrogram describing clustering history • Similarity between merged clusters decreases through iterations of clustering • Obtain a discrete set of clusters by choosing a cutoff level for similarity • Number of clusters is not fixed in advance (unlike k-means)
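A minimal from-scratch sketch of the merge loop (average-link, with a similarity cutoff); this is an illustration of the general procedure, not the exact implementation of any particular paper:

```python
import numpy as np

def agglomerative(points, sim, cutoff):
    """Merge the two most similar clusters until no pair exceeds `cutoff`.

    `sim(a, b)` is a similarity between two points; cluster similarity is
    the average over all cross-cluster pairs (average link).  Returns the
    clusters and the merge history (a flat record of the dendrogram).
    """
    clusters = [[p] for p in points]          # each point starts in its own cluster
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = np.mean([sim(a, b) for a in clusters[i] for b in clusters[j]])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        if s < cutoff:                         # remaining clusters are too dissimilar
            break
        history.append((clusters[i], clusters[j], s))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, history
```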
Agglomerative clustering • Initially, each item in its own cluster • At each iteration, merge 2 most similar (least distant) clusters
Quantifying the distance between 2 clusters • Single-link clustering: • The distance between 2 clusters is the shortest distance between any 2 points in the two clusters • Complete-link clustering: • The distance between 2 clusters is the longest distance between any 2 points in the two clusters • Average-link clustering: • The distance between 2 clusters is the average distance between all pairs of points in the two clusters • Produces different results
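The three criteria can be written as small functions over a precomputed pairwise distance matrix; the matrix and cluster indices below are illustrative:

```python
import numpy as np

def single_link(D, c1, c2):
    """Shortest distance between any point in c1 and any point in c2."""
    return min(D[i][j] for i in c1 for j in c2)

def complete_link(D, c1, c2):
    """Longest distance between any point in c1 and any point in c2."""
    return max(D[i][j] for i in c1 for j in c2)

def average_link(D, c1, c2):
    """Average distance over all cross-cluster pairs."""
    return np.mean([D[i][j] for i in c1 for j in c2])

# D is a precomputed pairwise distance matrix; c1, c2 are lists of point indices
D = np.array([[0.0, 0.2, 0.9],
              [0.2, 0.0, 0.7],
              [0.9, 0.7, 0.0]])
print(single_link(D, [0, 1], [2]),    # 0.7  (shortest)
      complete_link(D, [0, 1], [2]),  # 0.9  (longest)
      average_link(D, [0, 1], [2]))   # 0.8  (average)
```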
Single-link clustering • Distance between 2 clusters is the shortest distance between any 2 points in each cluster http://www.solver.com/xlminer/help/HClst/HClst_intro.html
Complete-link clustering • Distance between 2 clusters is the longest distance between any 2 points in each cluster
Average-link clustering • Distance between 2 clusters is the average of the distances between all pairs of points in each cluster
Produce a discrete set of clusters • A dendrogram shows the clustering process, where the end result is a single cluster containing all the data points • To produce a discrete set of clusters, we need to pick a cutoff value for the similarity • 2 ways to use threshold value for similarity: • We could grow the entire dendrogram and then “prune” it to produce a discrete set of clusters, • Or we could stop merging clusters once they reach a certain level of similarity
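Both strategies can be tried with SciPy; note that SciPy's hierarchy routines work with distances rather than similarities, so the direction of the threshold is flipped (a larger distance cutoff merges more and yields fewer clusters). The data and cutoff values below are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(50, 4)                 # stand-in for word context vectors

# Grow the full dendrogram (average-link on Euclidean distances)
Z = linkage(X, method="average", metric="euclidean")

# "Prune" it at different distance cutoffs; looser cutoffs give fewer clusters
for cutoff in (0.3, 0.6, 0.9, 1.2):
    labels = fcluster(Z, t=cutoff, criterion="distance")
    print(f"cutoff={cutoff}: {labels.max()} clusters")
```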
# of clusters for different similarity thresholds • Sim = 5%: 1 cluster • Sim = 20%: 4 clusters • Sim = 50%: 6 clusters • Sim = 80%: 8 clusters
Model selection in k-means and agglomerative clustering • K-means: number of clusters is determined by the choice of the constant k • Agglomerative: number of clusters is not explicitly specified • However, the # of clusters is indirectly determined by the choice of similarity threshold (a learning bias) • There is no "magic formula" for the best similarity threshold
Outline • POS induction and language acquisition • Agglomerative clustering • Results of agglomerative clustering on POS induction • Failure of previous work to induce lexical categories • Algorithm for induction of lexical categories
NLP research in psycholinguistics • Psycholinguistics • Typically involves experiments on human subjects • But there is also some research on algorithmic models that are tested on corpora • Use the CHILDES corpus • Child Language Data Exchange System • http://childes.psy.cmu.edu/ • Transcripts of adult-child conversations, for many languages
Redington, Chater, & Finch (1998) • Applied to English CHILDES corpus: • Child-directed speech, 2.5 million words • Words that are being clustered: • 1,000 most-freq words in corpus • Contextual features: • w-2, w-1, w+1, w+2 for the 150 most-freq words in corpus • Similarity function: • Rank correlation, rescaled from [-1, 1] to [0, 1] • Algorithm: • Average-link agglomerative clustering
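A compact sketch of this kind of pipeline under the stated settings (most-frequent target and context words, w-2/w-1/w+1/w+2 count features, rank correlation rescaled to [0, 1], average-link clustering). The function name and the cutoff value are assumptions for illustration; this is not Redington et al.'s code:

```python
import numpy as np
from collections import Counter
from scipy.stats import spearmanr
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def redington_style_clusters(sentences, n_targets=1000, n_context=150, cutoff=0.8):
    freqs = Counter(w for s in sentences for w in s)
    targets = [w for w, _ in freqs.most_common(n_targets)]
    context = [w for w, _ in freqs.most_common(n_context)]
    c_index = {w: i for i, w in enumerate(context)}
    t_index = {w: i for i, w in enumerate(targets)}

    # One count vector per target word: context-word counts at w-2, w-1, w+1, w+2
    offsets = (-2, -1, 1, 2)
    V = np.zeros((len(targets), len(offsets) * len(context)))
    for s in sentences:
        for i, w in enumerate(s):
            if w not in t_index:
                continue
            for k, off in enumerate(offsets):
                j = i + off
                if 0 <= j < len(s) and s[j] in c_index:
                    V[t_index[w], k * len(context) + c_index[s[j]]] += 1

    # Similarity = Spearman rank correlation, rescaled from [-1, 1] to [0, 1]
    n = len(targets)
    sim = np.ones((n, n))
    for a in range(n):
        for b in range(a + 1, n):
            rho, _ = spearmanr(V[a], V[b])
            if np.isnan(rho):          # constant vectors: correlation undefined
                rho = 0.0
            sim[a, b] = sim[b, a] = (rho + 1) / 2

    # Average-link clustering on distances, cut at a similarity threshold
    dist = 1 - sim
    Z = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(Z, t=1 - cutoff, criterion="distance")
    return {w: int(labels[i]) for i, w in enumerate(targets)}
```

The pairwise correlation step is quadratic in the number of target words, so over 1,000 words it is the bottleneck, but the structure of the computation stays the same at the scale of the 2.5-million-word corpus.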
Outline • POS induction and language acquisition • Agglomerative clustering • Results of agglomerative clustering on POS induction • Failure of previous work to induce lexical categories • Algorithm for induction of lexical categories
Lexical category induction • "Lexical" = has meaningful content, does not perform a grammatical function, is open-class • Nouns, Verbs, Adjectives • Assumed in traditional grammars and many linguistic theories • (Ignore typological problems…) • What kind of learning procedure could acquire these categories from data? • How could a child acquire these categories?
Standard distributional clustering doesn't exactly find lexical categories • Next slides: pick a similarity threshold for producing a discrete set of clusters • Based on Redington et al. (1998) • No threshold produces clusters in exact one-to-one correspondence with Nouns, Adjectives, and Verbs
(Dendrogram, first cutoff: all Verbs form a single cluster, but Nouns and Adjectives are conflated)
(Dendrogram, second cutoff with more clusters: Nouns and Adjectives are separate, but the Verbs split into three clusters)
Mintz, Newport, & Bever (2002) • Similar to Redington et al. • Next slide: horizontal lines show low similarity thresholds and the resulting clusters • Same problems as Redington et al. (1998)
• At a high similarity threshold, Nouns are grouped with Adjectives • At a lower similarity threshold, we have 4 clusters, but Nouns are still grouped with Adjectives • If we want Nouns and Adjectives to be separate, there would be 6 clusters
Distributional theory of POS categories doesn't work • Derived from experience • Form classes: • Define word class by context • Examples: • class 1: the ___ of • class 2: to ___ • Firth 1957: "You shall know a word by the company it keeps" • Doesn't work for finding the lexical categories: • Nouns not separated from Adjectives, unless there are too many clusters • No one-to-one correspondence between cluster and open-class category
Interpret as a procedure that a child is using • If a child is using distributional context to learn POS categories, • Then, based on experimental results on corpora, • The theory does not predict an exact correspondence between induced categories and (psycho)linguists’ standard lexical categories of “Noun”, “Verb”, and “Adjective” • Still an open problem
Some limitations of distributional context • Contextual feature does not always determine the class of a word • the ___ (noun, adjective, adverb) • Contextual feature does not predict an entire class of words • a ___ (noun beginning with consonant) • an ___ (noun beginning with vowel) • Masculine / Feminine words, vs. nouns in general
Some limitations of distributional context • Example: unable to represent the generalization "Adjectives appear to the left of Nouns" • How is "adjective" defined? As the words in the "adjective" cluster • But that cluster is defined by the presence of specific words to its left/right, rather than by the presence of a particular category • The best it can represent: "Adjectives appear to the left of 'cat'" • First-order distributional context is linguistically inadequate • These features limit what any clustering algorithm can do
Outline • POS induction and language acquisition • Agglomerative clustering • Results of agglomerative clustering on POS induction • Failure of previous work to induce lexical categories • Algorithm for induction of lexical categories