This paper presents the results from the SMART project, which investigates methods for cross-language information retrieval using dictionary adaptation and latent semantic methods based on Canonical Correlation Analysis.
Machine Learning for Textual Information Access: Results from the SMART project • Nicola Cancedda, Xerox Research Centre Europe • First Forum for Information Retrieval Evaluation • Kolkata, India, December 12th-14th, 2008
The SMART Project • Statistical Multilingual Analysis for Retrieval and Translation (SMART) • Information Society Technologies Programme • Sixth Framework Programme, “Specific Targeted Research Project” (STReP) • Start date: October 1, 2006 • Duration: 3 years • Objective: bring Machine Learning researchers to work on Machine Translation and CLIR
Premise and Outline • Two classes of methods for CLIR investigated in SMART • Methods based on dictionary adaptation for the cross-language extension of the LM approach in IR • Latent semantic methods based on Canonical Correlation Analysis • Initial plan (reflected in abstract): to present both • ...but it would take too long, so: • Outline: • (Longish) introduction to state of the art in Canonical Correlation Analysis • A number of advances obtained by the SMART project • For lexicon adaptation methods: check out deliverable D 5.1 from the project website!
Canonical Correlation Analysis • Abstract view: • Word-vector representations of documents (or queries, or whatever text span) are only superficial manifestations of a deeper vector representation based on concepts. • Since they cannot be observed directly, these concepts are latent • If two spans are the translation of one another, their deep representation in terms of concepts is the same. • Can we recover (at least approximately) the latent concept space? Can we learn to map text spans from their superficial word appearance into their deep representation? • CCA: • Assume mapping from deep to superficial representation is linear • Estimate mapping from empirical data
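Five documents in the world of concepts • [Figure: five documents, numbered 1 to 5, in the concept space]
The same five documents in two languages • [Figure: documents 1 to 5 shown in each of the two language spaces]
Finding the first Canonical Variates • [Figure: documents 1 to 5 in the two language spaces, with their projections 1' to 5' and 1'' to 5'']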
Finding the first Canonical Variates • Find the two directions, one for each language, such that projections of documents are maximally correlated. • Assuming data matrices X and Y are (row-wise) centered: • Maximal covariance, to work back the rotation • Normalization by the variances, to adjust for “stretched” dimensions • C1 expressed in the basis of X and Y respectively
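The formula on this slide did not survive in the transcript. As a hedged reconstruction, the textbook form of the CCA objective matches the annotations (covariance in the numerator, variances in the denominator). It assumes X and Y hold one document per column, which is consistent with the later remark that each direction lies in the span of the columns of X (resp. Y); the symbol names w_x, w_y and ρ are chosen here for illustration.

```latex
\rho \;=\; \max_{w_x,\, w_y}\;
\frac{w_x^\top X Y^\top w_y}
     {\sqrt{w_x^\top X X^\top w_x}\;\sqrt{w_y^\top Y Y^\top w_y}}
```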
Complexity: Finding the first Canonical Variate • Find the two directions, one for each language, such that projections of documents are maximally correlated • Turns out equivalent to finding the largest eigen-pair in a Generalized Eigenvalue Problem (GEP):
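The GEP itself is not shown in the transcript. The sketch below uses the standard textbook block form, A w = ρ B w with A holding the cross-covariances and B the within-view covariances, and should be read as an illustration under that assumption rather than as the project's code. The function name, the small `ridge` term (added only so that the Cholesky factorisation is safe) and the toy data are all hypothetical.

```python
import numpy as np

def first_canonical_variates(X, Y, ridge=1e-8):
    """X: (n_x, m), Y: (n_y, m), one document per column.
    Returns (rho, w_x, w_y) for the first pair of canonical variates."""
    X = X - X.mean(axis=1, keepdims=True)   # row-wise centering
    Y = Y - Y.mean(axis=1, keepdims=True)
    n_x, n_y = X.shape[0], Y.shape[0]
    Cxx = X @ X.T + ridge * np.eye(n_x)
    Cyy = Y @ Y.T + ridge * np.eye(n_y)
    Cxy = X @ Y.T
    # Generalized eigenvalue problem A w = rho B w, with w = [w_x; w_y]
    A = np.block([[np.zeros((n_x, n_x)), Cxy],
                  [Cxy.T, np.zeros((n_y, n_y))]])
    B = np.block([[Cxx, np.zeros((n_x, n_y))],
                  [np.zeros((n_y, n_x)), Cyy]])
    # Reduce to a symmetric standard problem with B = L L^T:
    # (L^-1 A L^-T) u = rho u, then w = L^-T u
    L = np.linalg.cholesky(B)
    M = np.linalg.solve(L, np.linalg.solve(L, A).T).T
    M = (M + M.T) / 2                        # remove rounding asymmetry
    eigvals, eigvecs = np.linalg.eigh(M)     # cubic in n_x + n_y
    rho = eigvals[-1]                        # largest eigenvalue
    w = np.linalg.solve(L.T, eigvecs[:, -1])
    return rho, w[:n_x], w[n_x:]

# Toy data (assumption): the second view is a noisy linear image of the first.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 200))
Y = rng.standard_normal((4, 3)) @ X + 0.01 * rng.standard_normal((4, 200))
rho, w_x, w_y = first_canonical_variates(X, Y)
```

The dense eigendecomposition is cubic in the total number of dimensions, which is the cost that motivates the dual formulation for text.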
Finding further Canonical Variates • Assume we already found i-1 pairs of Canonical Variates: • Turns out equivalent to finding the other eigen-pairs in the same GEP
Kernel CCA • Cubic complexity in the number of dimensions soon becomes intractable, especially with text • Also, it could be better to use similarity measures other than the inner product of (possibly weighted) document vectors • Kernel CCA: move from the primal to the dual formulation, since it can be proved that each w_x,i (resp. w_y,i) is in the span of the columns of X (resp. Y)
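The step from primal to dual can be written out as follows. This is the standard derivation, given here as a sketch: α and β are the dual coefficient vectors and K_x, K_y the kernel (Gram) matrices, names that are assumptions since the slide's own notation is not preserved.

```latex
\begin{align*}
w_x &= X\alpha, \qquad w_y = Y\beta, \qquad K_x = X^\top X, \qquad K_y = Y^\top Y \\
\rho &= \max_{\alpha,\,\beta}\;
\frac{\alpha^\top K_x K_y\, \beta}
     {\sqrt{\alpha^\top K_x^2\, \alpha}\;\sqrt{\beta^\top K_y^2\, \beta}}
\end{align*}
```

Only inner products between documents appear, so K_x and K_y can be replaced by any valid kernel, and the problem size depends on the number of documents m rather than on the number of features.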
Complexity: Kernel CCA • The computation is again done by solving a GEP:
Overfitting • Problem: if m ≤ n_x and m ≤ n_y, then there are infinitely many trivial solutions with perfect correlation: OVERFITTING • E.g. two (centered) points in R^2: given an arbitrary direction in the first space, we can find one with perfect correlation in the second • Perfect correlation... no matter what direction! • [Figure: two centered points, 1 and 2, in each space, annotated with unit variances and unit covariance]
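The effect is easy to reproduce numerically. The sketch below uses made-up data (two statistically independent random views with fewer documents than dimensions), picks an arbitrary direction in the first space, and solves for a direction in the second space that reproduces the same projections exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_x, n_y = 5, 20, 20                      # fewer documents than dimensions

# Two independent views, one document per column, row-wise centered.
X = rng.standard_normal((n_x, m))
X -= X.mean(axis=1, keepdims=True)
Y = rng.standard_normal((n_y, m))
Y -= Y.mean(axis=1, keepdims=True)

w_x = rng.standard_normal(n_x)               # arbitrary direction in the first space
a = X.T @ w_x                                # projections of the m documents

# Least squares finds w_y with Y.T @ w_y = a; the fit is exact because a is
# centered and the centered Y.T spans all centered vectors of length m.
w_y, *_ = np.linalg.lstsq(Y.T, a, rcond=None)
b = Y.T @ w_y

corr = np.corrcoef(a, b)[0, 1]
print(f"correlation between unrelated views: {corr:.6f}")
```

The views share no information, yet the correlation is perfect for any choice of `w_x`, which is why the unregularized objective cannot be trusted in this regime.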
Regularized Kernel CCA • We can regularize the objective function by trading correlation against good account of variance in the two spaces:
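The regularized objective is not preserved in the transcript. The sketch below assumes one common form from the KCCA literature, in which each denominator term K^2 is replaced by (K + κI)^2; the slide's exact regularizer may differ. Under that assumption the first regularized correlation is the largest singular value of (K_x + κI)^-1 K_x K_y (K_y + κI)^-1. The function name, κ values and data are illustrative.

```python
import numpy as np

def regularized_kcca_first(Kx, Ky, kappa):
    """First regularized canonical correlation and dual vectors (alpha, beta).
    Kx, Ky: centered (m, m) kernel matrices; kappa > 0 is the regularizer."""
    m = Kx.shape[0]
    Dx = Kx + kappa * np.eye(m)
    Dy = Ky + kappa * np.eye(m)
    # T = Dx^-1 Kx Ky Dy^-1 (Dy is symmetric, so solve on the transpose)
    T = np.linalg.solve(Dx, Kx @ Ky)
    T = np.linalg.solve(Dy, T.T).T
    U, s, Vt = np.linalg.svd(T)
    alpha = np.linalg.solve(Dx, U[:, 0])
    beta = np.linalg.solve(Dy, Vt[0])
    return s[0], alpha, beta

# Two independent views with fewer documents (m) than features (n).
rng = np.random.default_rng(0)
m, n = 20, 50
X = rng.standard_normal((n, m))
X -= X.mean(axis=1, keepdims=True)
Y = rng.standard_normal((n, m))
Y -= Y.mean(axis=1, keepdims=True)
Kx, Ky = X.T @ X, Y.T @ Y                    # linear kernels

rho_unreg, _, _ = regularized_kcca_first(Kx, Ky, kappa=1e-6)
rho_reg, _, _ = regularized_kcca_first(Kx, Ky, kappa=1000.0)
print(rho_unreg, rho_reg)
```

With almost no regularization the score is essentially 1 even though the views are unrelated; a large κ pushes it towards 0, since directions that explain little variance are penalized.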
Multiview CCA • (K)CCA can take advantage of the “mutual information” between two languages... • ...but what if we have more than two? Can we benefit from multiple views? • Also known as Generalised CCA. • [Figure: the five documents shown in several language spaces]
Multiview CCA • There are many possible ways to combine pairwise correlations between views (e.g. sum, product, min, ...). • Chosen approach: SUMCOR [Horst-61]. With a slightly different regularization than above, this leads to a Multivariate Eigenvalue Problem.
Multiview CCA • Multivariate Eigenvalue Problems (MEP) are much harder to solve than GEPs: • [Horst-61] introduced an extension to MEPs of the standard power method for EPs, for finding the set of first canonical variates only • Naïve implementations would be quadratic in the number of documents, and scale up to no more than a few thousand
Innovations from SMART • Extensions of the Horst algorithm [Rupnik and Shawe-Taylor] • Efficient implementation linear in the number of documents • Version for finding many sets of canonical variates • New regression-CCA framework for CLIR [Rupnik and Shawe-Taylor] • Sparse KCCA [Hussain and Shawe-Taylor]
Efficient Implementation of Horst algorithm • Horst algorithm starts with a random set of vectors: • then iteratively multiplies and renormalizes until convergence: • Inner loop: k^2 matrix-vector multiplications, each O(m^2) • Extension (1): exploiting the structure of the MEP matrix, one can refactor the computation and save an O(k) factor in the inner loop. • Extension (2): exploiting sparseness of the document vectors, one can replace each (vector) multiplication with a kernel matrix (O(m^2)) with two multiplications with the document matrix (O(ms) each, where s is the max number of non-zero components in document vectors). • Together, the inner loop can be made O(kms) instead of O(k^2 m^2) • Leveraging this same sparsity, kernel inversions can be replaced by cheaper numerical solutions of linear systems.
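The refactoring idea can be illustrated on a simplified variant. The sketch below is not the project's dual, regularized algorithm: it is a primal version that assumes unit-norm constraints on each direction and no regularization, with blocks C_ij = X_i X_j^T. It shows how the k^2 block products collapse into one shared m-vector per iteration. All names and the toy data are hypothetical.

```python
import numpy as np

def horst_naive(views, W, n_iter=20):
    """Power-style iteration with explicit blocks C_ij = X_i X_j^T (k^2 products)."""
    k = len(views)
    for _ in range(n_iter):
        new_W = []
        for i in range(k):
            acc = np.zeros(views[i].shape[0])
            for j in range(k):
                acc += (views[i] @ views[j].T) @ W[j]
            new_W.append(acc / np.linalg.norm(acc))
        W = new_W
    return W

def horst_factored(views, W, n_iter=20):
    """Same update, using sum_j X_i X_j^T w_j = X_i (sum_j X_j^T w_j)."""
    for _ in range(n_iter):
        z = sum(X.T @ w for X, w in zip(views, W))   # shared vector, k products
        W = [X @ z for X in views]                    # k more products
        W = [w / np.linalg.norm(w) for w in W]
    return W

# Toy data (assumption): k = 3 views of m = 40 documents, one per column.
rng = np.random.default_rng(0)
views = [rng.standard_normal((n_i, 40)) for n_i in (6, 7, 8)]
W0 = [rng.standard_normal(X.shape[0]) for X in views]
W0 = [w / np.linalg.norm(w) for w in W0]

W_naive = horst_naive(views, W0)
W_fast = horst_factored(views, W0)
```

Both functions follow the same trajectory; only the cost differs. With sparse document matrices each product in the factored version touches at most s non-zeros per document, which is where a bound of the O(kms) kind comes from.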
Extended Horst algorithm for finding many sets of canonical variates • Horst algorithm only finds the first set of k canonical variates • Extension (3): maintain projection matrices P_i^t that project the β_k,t's at each iteration onto the subspace orthogonal to all previous canonical variates for space i. • Finding d sets of canonical variates can be done in O(d^2 mks). This scales up!
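The projection step can be sketched on the same kind of simplified primal iteration (unit-norm constraints, no regularization, which is an assumption and not the project's dual formulation). Instead of storing a projection matrix, the code keeps an orthonormal basis F_i of the directions already found for view i and applies w - F_i (F_i^T w). The function name and toy data are hypothetical.

```python
import numpy as np

def horst_many(views, d, n_iter=100, seed=0):
    """Return, per view, an (n_i, d) matrix whose columns are d successive
    directions, each orthogonal to the earlier ones for that view."""
    rng = np.random.default_rng(seed)
    found = [np.zeros((X.shape[0], 0)) for X in views]
    for _ in range(d):
        W = [rng.standard_normal(X.shape[0]) for X in views]
        for _ in range(n_iter):
            z = sum(X.T @ w for X, w in zip(views, W))
            W = [X @ z for X in views]
            # Project onto the complement of the previously found directions.
            W = [w - F @ (F.T @ w) for w, F in zip(W, found)]
            W = [w / np.linalg.norm(w) for w in W]
        found = [np.column_stack([F, w]) for F, w in zip(found, W)]
    return found

# Toy data (assumption): three views of 30 documents, one per column.
rng = np.random.default_rng(1)
views = [rng.standard_normal((n_i, 30)) for n_i in (6, 7, 8)]
found = horst_many(views, d=3)
```

Each projection costs time proportional to the number of directions already found, which is one source of the quadratic dependence on d.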
MCCA: Experiments • Experiments: mate retrieval with Europarl • 10 languages • 100,000 10-way aligned sentences for training • 7873 10-way aligned sentences for testing • Document vectors: uni-, bi- and tri-grams (~200k features for each language). TF*IDF weighting and length normalization. • MCCA used to extract d = 100-dimensional subspaces • Baseline alternatives for selecting new basis: • k-means clustering centroids on concatenated multi-lingual document vectors • CL-LSI, i.e. LSI on concatenated vectors
MCCA experiment results • Measure: recall in Top 10, averaged over 9 languages
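For reference, recall in the top 10 for mate retrieval can be computed as in the sketch below. It assumes that row i of `targets` is the mate (aligned translation) of row i of `queries`, that both are already projected into the shared subspace, and that cosine similarity is used for ranking; the function name and the random vectors are illustrative only.

```python
import numpy as np

def recall_at_k(queries, targets, k=10):
    """Fraction of queries whose mate ranks among the k most similar targets."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    sims = q @ t.T                           # cosine similarities
    mate_sim = np.diag(sims)                 # similarity to the true mate
    # Rank of the mate = number of targets strictly more similar than it.
    rank = (sims > mate_sim[:, None]).sum(axis=1)
    return float((rank < k).mean())

rng = np.random.default_rng(0)
Q = rng.standard_normal((50, 8))
perfect = recall_at_k(Q, Q)                  # every mate is the top hit
shuffled = recall_at_k(Q, Q[::-1], k=1)      # mates deliberately misaligned
```

Averaging this value over the language pairs of interest gives a figure comparable in form to the one reported.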
MCCA experiment results • More realistic experiment: now pseudo-queries formed with top 5 TF*IDF scoring components in each sentence
Extension (4): Regression - CCA • Given a query q in one language, find the target language vector w which is maximally correlated to it: • Solution: • Given this “query translation” we can then find the closest target documents using the standard cosine measure • Promising initial results on CLEF/GIRT dataset: better than standard CCA, but cannot take thesaurus into account, so MAP still not competitive with the best
Extension (5): Sparse - KCCA • Seeking sparsity in dual solution: first canonical variates expressed as linear combinations of only relatively few documents • Improved efficiency • Alternative regularization • Same set of indices i
Sparse - KCCA • For a fixed set of indices i: • But how do we select i ?
Sparse – KCCA: Algorithms • Algorithm 1 • initialize • for i = 1 to d do • Deflate kernel matrices • end for • Solve GEP for index set i • Algorithm 2 • Set i to the indices of the top d values of [...] • Solve GEP for index set i • Deflation consists in transforming the matrices to reflect a projection onto the space orthogonal to the current basis in feature space
Sparse – KCCA: Mate retrieval experiments • Europarl, English-Spanish • KCCA: train 24693 sec., test 27733 sec. • SKCCA (1): train 5242 sec., test 698 sec. • SKCCA (2): train 1873 sec., test 695 sec.
SMART - Website • Project presentation and deliverables • D 5.1 on lexicon-based methods and D 5.2 on CCA • http://www.smart-project.eu
SMART - Dissemination and Exploitation • Platforms for showcasing developed tools:
Shameless plug • Cyril Goutte, Nicola Cancedda, Marc Dymetman and George Foster, eds: Learning Machine Translation, MIT Press, to appear in 2009.
Self-introduction Machine Learning (kernels for text) Text Categorization Grammar Learning (Statistical) Machine Translation ca. 2004 Natural Language Generation