An Information Theoretic Approach to Bilingual Word Clustering
Manaal Faruqui & Chris Dyer
Language Technologies Institute, SCS, CMU
Word Clustering
Grouping of words capturing syntactic, semantic and distributional regularities.
Example clusters: {11, 13.4, 22,000, 100} | {Iran, London, USA, India, Paris} | {good, nice, better, awesome, cool} | {play, laugh, eat, run, fight}
Bilingual Word Clustering
• What?
  • Clustering words of two languages simultaneously
  • Inducing a dependence between the two clusterings
• Why?
  • To obtain better clusterings (hypothesis)
• How?
  • By using cross-lingual information
Bilingual Word Clustering
Assumption: aligned words convey information about their respective clusters.
Bilingual Word Clustering
Existing: monolingual models. Proposed: monolingual + bilingual hints.
Related Work
• Bilingual word clustering (Och, 1999)
  • Language-model-based objective for the monolingual component
  • Word-alignment count-based similarity function for the bilingual component
• Linguistic structure transfer (Täckström et al., 2012)
  • Maximize the correspondence between clusters of aligned words
  • Alternate optimization of the mono and bi objectives
  • Clustering of only the top 1 million words
• POS tagging (Snyder & Barzilay, 2010)
• Word sense disambiguation (Diab, 2003)
• Bilingual graph-based projections (Das and Petrov, 2011)
Monolingual Objective (Brown et al., 1992)
[Diagram: cluster sequence C = c1 → c2 → c3 → c4 generating word sequence S = w1, w2, w3, w4]
P(S; C) = P(c1) · P(w1|c1) · P(c2|c1) · P(w2|c2) · …
Maximize the likelihood of the word sequence given the clustering; equivalently, minimize the entropy (surprisal) of the word sequence given the clustering:
H(S; C) = E[ −log P(S; C) ]
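To make the objective concrete, here is a minimal sketch (not the authors' implementation) of the class-based likelihood: given a word sequence and a hard clustering (a word → cluster-id dict, a hypothetical input format), it computes the average surprisal −(1/N) log P(S; C) using maximum-likelihood estimates of P(c′|c) and P(w|c).

```python
import math
from collections import Counter

def monolingual_entropy(words, cluster_of):
    """Average surprisal of the sequence under the Brown-style class LM:
    P(S;C) = P(c1) P(w1|c1) * prod_i P(c_i|c_{i-1}) P(w_i|c_i)."""
    clusters = [cluster_of[w] for w in words]
    emit = Counter(zip(clusters, words))          # counts of (cluster, word)
    trans = Counter(zip(clusters, clusters[1:]))  # counts of (cluster, next cluster)
    c_count = Counter(clusters)

    logp = math.log(c_count[clusters[0]] / len(words))              # P(c1)
    logp += math.log(emit[(clusters[0], words[0])] / c_count[clusters[0]])
    for i in range(1, len(words)):
        c_prev, c, w = clusters[i - 1], clusters[i], words[i]
        logp += math.log(trans[(c_prev, c)] / c_count[c_prev])      # P(c_i | c_{i-1})
        logp += math.log(emit[(c, w)] / c_count[c])                 # P(w_i | c_i)
    return -logp / len(words)
```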
Bilingual Objective
Maximize the information we know about one clustering given the other.
[Diagram: clusters 1–3 of Language 1 linked to clusters 1–3 of Language 2 by word alignments]
Bilingual Objective
Equivalently, minimize the entropy of one clustering given the other.
[Same diagram: the two clusterings connected by word alignments]
Bilingual Objective
For aligned words x in clustering C and y in clustering D, the association between Cx and Dy is measured through the conditionals p(Cx|Dy) and p(Dy|Cx) (combined as −log p(Cx|Dy) − log p(Dy|Cx) in the objective on the next slide).
[Diagram: a alignment edges between clusters Cx and Dy, b between Cx and Dz, c between Cw and Dz]
e.g. p(Dy|Cx) = a / (a + b)
Bilingual Objective
Thus, for the two clusterings:
AVI(C, D) = E(i, j)[ −log p(Ci|Dj) − log p(Dj|Ci) ]
Aligned Variation of Information:
• Captures the mutual information content of the two clusterings
• Has distance metric properties:
  • Non-negative: AVI(C, D) ≥ 0
  • Symmetric: AVI(C, D) = AVI(D, C)
  • Triangle inequality: AVI(C, E) ≤ AVI(C, D) + AVI(D, E)
  • Identity of indiscernibles: AVI(C, D) = 0 iff C ≅ D
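A hedged sketch of computing AVI from alignment edges: `alignments` is assumed to be a list of aligned word pairs (x, y), and the conditionals p(Ci|Dj) and p(Dj|Ci) are relative frequencies over cluster-level edge counts, as in the p(Dy|Cx) = a / (a + b) example above.

```python
import math
from collections import Counter

def avi(alignments, cluster_c, cluster_d):
    """AVI(C, D) = E_(i,j)[ -log p(C_i|D_j) - log p(D_j|C_i) ], with the
    expectation taken over alignment edges."""
    edge = Counter((cluster_c[x], cluster_d[y]) for x, y in alignments)
    c_deg, d_deg = Counter(), Counter()            # edges incident to each cluster
    for (c, d), n in edge.items():
        c_deg[c] += n
        d_deg[d] += n
    total = 0.0
    for (c, d), n in edge.items():
        total += n * (-math.log(n / d_deg[d])      # -log p(C_i | D_j)
                      - math.log(n / c_deg[c]))    # -log p(D_j | C_i)
    return total / len(alignments)
```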
Joint Objective
α [H(C) + H(D)] + β AVI(C, D)
Monolingual term: word sequence information. Bilingual term: cross-lingual information.
α, β are the weights of the monolingual and bilingual objectives respectively.
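The joint objective then reduces to a weighted sum; a sketch reusing the two functions above (the default α and β values are placeholders, not prescribed here):

```python
def joint_objective(words_c, words_d, alignments, cluster_c, cluster_d,
                    alpha=1.0, beta=0.1):
    """alpha * [H(C) + H(D)] + beta * AVI(C, D), approximating each H with
    the average surprisal of that language's training sequence."""
    mono = (monolingual_entropy(words_c, cluster_c)
            + monolingual_entropy(words_d, cluster_d))
    return alpha * mono + beta * avi(alignments, cluster_c, cluster_d)
```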
Inference
[Factor graph: a monolingual factor over each language's clustering and a bilingual factor connecting the two — monolingual & bilingual word clustering]
We want to do MAP inference on the factor graph.
Inference
• Optimization
  • Finding the optimal solution is a hard combinatorial problem (Och, 1995)
  • Greedy hill-climbing word exchange (Martin et al., 1995): transfer a word to the cluster with the maximum improvement (see the sketch below)
• Initialization
  • Round-robin based on frequency
• Termination
  • No. of words exchanged < 0.1% of (vocab1 + vocab2)
  • At least 5 complete iterations
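A rough illustration of the search loop (a sketch, not the authors' optimized code: a real implementation would update the objective incrementally rather than rescoring the whole clustering for each candidate move). `objective` is assumed to be a closure over the data that scores the full clustering, e.g. wrapping `joint_objective` above.

```python
def word_exchange(vocab, clusters, objective, k, min_iters=5, tol=0.001):
    """Greedy hill-climbing word exchange. `clusters` maps word -> cluster
    id in [0, k); `objective` scores a clustering (lower is better)."""
    iteration, moved = 0, len(vocab)
    while iteration < min_iters or moved >= tol * len(vocab):
        moved = 0
        for w in vocab:                      # one round-robin pass
            start = clusters[w]
            best_c, best_val = start, objective(clusters)
            for c in range(k):               # try moving w to every cluster
                if c == start:
                    continue
                clusters[w] = c
                val = objective(clusters)
                if val < best_val:
                    best_c, best_val = c, val
            clusters[w] = best_c             # keep the best move for w
            if best_c != start:
                moved += 1
        iteration += 1
    return clusters
```

Here `tol * len(vocab)` plays the role of the 0.1% of (vocab1 + vocab2) threshold from the slide, with `vocab` covering both languages.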
Evaluation: Named Entity Recognition (NER)
• Core information extraction task
• Very sensitive to word representations
• Word clusters are useful for downstream tasks (Turian et al., 2010)
• Can be directly used as features for NER: English (Finkel & Manning, 2009), German (Faruqui & Padó, 2010)
Data and Tools
• German NER
  • Training & test data: CoNLL 2003 (220,000 and 55,000 tokens resp.)
  • Corpora for clustering: WIT-3 (Cettolo et al., 2012), a collection of TED talks
  • Language pairs: {Arabic, English, French, Korean, Turkish}–German
  • Around 1.5 million German tokens for each pair
• Stanford NER for training (Finkel and Manning, 2009): in-built functionality to use word clusters for generalization
• cdec for unsupervised word alignments (Dyer et al., 2013)
Experiments
α [H(C) + H(D)] + β AVI(C, D)
• Baseline: no clusters
• Bilingual information only
  • α = 0, β = 1
  • Objective: AVI(C, D)
• Monolingual information only
  • α = 1, β = 0
  • Objective: H(C) + H(D)
• Monolingual + bilingual information
  • α = 1, β = 0.1
  • Objective: H(C) + H(D) + 0.1 AVI(C, D)
Alignment Edge Filtering
• Word alignments are not perfect
• We filter out the alignment edge between two words (x, y) if:
  2·b / ((a + b + c) + (b + d)) ≤ η
  where b is the number of alignment links between x and y, a + b + c is the total number of links involving x, and b + d the total involving y
• η is tuned separately for each language pair
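The grouping of counts in the formula reads like a Dice coefficient over alignment links; under that reading (an assumption about the slide's diagram), the filter looks like this:

```python
from collections import Counter

def filter_edges(alignments, eta):
    """Keep the alignment edge (x, y) only if the Dice-style score
    2*links(x, y) / (links(x) + links(y)) exceeds eta."""
    pair = Counter(alignments)                  # links(x, y), i.e. b
    x_deg = Counter(x for x, _ in alignments)   # links(x), i.e. a + b + c
    y_deg = Counter(y for _, y in alignments)   # links(y), i.e. b + d
    return [(x, y) for (x, y) in alignments
            if 2 * pair[(x, y)] / (x_deg[x] + y_deg[y]) > eta]
```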
Results F1 scores of German NER trained using different word clusters on the Training set
Results F1 scores of German NER trained using different word clusters on the Test set
Ongoing Work
Multilingual word clustering: extending the monolingual + bilingual factor graph beyond two languages
[Diagram: the bilingual factor graph replicated across several languages]
Ongoing Work
• Current work: parallel data only
• Mono1 + parallel data
• Mono1 + parallel data + Mono2
Conclusion
• Novel information-theoretic model for bilingual clustering
• The bilingual objective has an intuitive meaning
• Joint optimization of the mono + bi objective
• Improvement in clustering quality over monolingual clustering
• Extendable to any number of languages, incorporating both monolingual and parallel data