ICA of Text Documents
Jaakko Peltonen, jaakko.peltonen@hut.fi, 26 October 2000
Based on "Unsupervised Topic Separation and Keyword Identification in Document Collections: A Projection Approach" by Ata Kabán and Mark Girolami
1 Introduction
• ICA: proposed as a useful technique for finding meaningful directions in multivariate data
• The objective function affects the form of potential structure discovered
• Here, the problem is partitioning and analysis of sparse multivariate data
• Prior knowledge is used to derive a computationally inexpensive ICA
2 Introduction, continued
• Two complementary architectures:
  – separate observed documents into document prototypes
  – separate observed words into topic-features
• Skewness (asymmetry) is the right objective to optimize
• The two tasks will be unified in a single algorithm
• Result: fast convergence; computational cost linear in the number of training points
3 Data Representation
• Vector space representation: document → [t1, t2, …, tT]^T
• T = number of words in the dictionary (tens of thousands)
• elements are binary indicators or frequencies → sparse representation
• D = term-document matrix (T × N, N = number of documents)
[Figure: the transposed matrix D^T, with rows doc 1 … doc N and columns term 1 … term T]
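To make this representation concrete, here is a minimal Python sketch of building a term-document matrix; the toy corpus, the vocabulary construction, and the use of raw frequency counts are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Toy corpus (illustrative only); real dictionaries have tens of thousands of terms.
docs = ["space shuttle launch orbit", "church faith god bibl", "orbit nasa launch space"]
vocab = sorted({w for d in docs for w in d.split()})
T, N = len(vocab), len(docs)

# D: T x N term-document matrix; elements here are frequency counts
# (binary indicators would work as well). In practice D is stored sparsely.
D = np.zeros((T, N))
for n, doc in enumerate(docs):
    for w in doc.split():
        D[vocab.index(w), n] += 1.0

print(T, N)
print(D.T)   # D^T has documents as rows, terms as columns
```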
4 Preprocessing
• Assumption: observations = noisy expansion of some denser group of latent topics
• Number of clusters or topics set a priori
• K-dimensional LSA space used as topic-concepts subspace
• PCA may lose important data components: with sparse data, infrequent but meaningful correlations receive less weight
• Reconstruction: D ≈ D_K = U E V^T (see the sketch below)
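A small sketch of the rank-K reconstruction, assuming a toy random matrix and numpy's dense SVD in place of the sparse (Lanczos-style) decomposition used in practice:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.poisson(0.3, size=(100, 40)).astype(float)   # toy sparse T x N matrix
K = 4                                                # number of topics, set a priori

# D ~ D_K = U E V^T, retaining the K largest singular triplets
U_f, e, Vt_f = np.linalg.svd(D, full_matrices=False)
U, E, V = U_f[:, :K], np.diag(e[:K]), Vt_f[:K, :].T

D_K = U @ E @ V.T
print(np.linalg.norm(D - D_K) / np.linalg.norm(D))   # relative reconstruction error
```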
5 Prototype Documents from a Corpus
• Assumption: documents = noisy linear mixture of (~independent) document prototypes
• Number of prototypes = number of topics → prototypes reside in LSA space (K dimensions)
• Data projection onto right eigenvectors + variance normalization: X^(1) := E^{-1} V^T D^T = U^T (a K × T matrix)
• Task: find mixing matrix W^(1) and source documents S^(1) so that X^(1) = (W^(1))^T S^(1) (S^(1): K × T matrix)
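A quick numerical check of this projection on assumed toy data: with D = U E V^T, the normalized projection E^{-1} V^T D^T collapses exactly to U^T.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.poisson(0.3, size=(100, 40)).astype(float)   # toy T x N matrix
K = 4
U_f, e, Vt_f = np.linalg.svd(D, full_matrices=False)
U, E, V = U_f[:, :K], np.diag(e[:K]), Vt_f[:K, :].T

X1 = np.linalg.inv(E) @ V.T @ D.T     # projection + variance normalization, K x T
print(np.allclose(X1, U.T))           # True: X^(1) is exactly U^T
```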
6 Prototype Documents from a Corpus, continued
• Basis vectors of the topic space assumed different → to separate prototypes, find independent components
• Words in documents are distributed in a positively skewed way → search restricted to skewed (asymmetric) distributions
• LSA → unmixing matrix must be orthogonal ((W^(1))^{-1} = (W^(1))^T)
[Figure: D^T (rows doc 1 … doc N, columns term 1 … term T) is mapped by W^(1) E^{-1} V^T to S^(1) (rows topic 1 … topic K, columns term 1 … term T)]
7 Prototype Documents from a Corpus, continued
• Objective: skewness; measured by the Fisher skewness E[(s − E[s])^3] / Var(s)^(3/2)
• Prior knowledge: small component mean, projection variance restricted to unity → simplified objective G(s) = E[s^3] (3rd-order moment)
• Prevent degenerate solutions → restrict w^T w = 1 for stationary points
• Solve with gradient methods or iteratively
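A small sketch of these objectives on assumed toy data, comparing the Fisher skewness with the simplified third-moment objective G(s) once the mean is removed and the variance fixed to one:

```python
import numpy as np

def fisher_skewness(s):
    # E[(s - E[s])^3] / Var(s)^(3/2)
    c = s - s.mean()
    return np.mean(c**3) / np.mean(c**2)**1.5

def G(s):
    # simplified objective: 3rd-order moment, valid once the component
    # mean is small and the projection variance is fixed to one
    return np.mean(s**3)

rng = np.random.default_rng(0)
s = rng.exponential(size=100_000)     # positively skewed source
s = (s - s.mean()) / s.std()          # zero mean, unit variance
print(fisher_skewness(s), G(s))       # identical here; both close to 2, the skewness of Exp(1)
```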
8 Prototype Documents from a Corpus, continued
• Sources positive → skewness is positive (output sign is relevant!)
• K orthonormal projection directions → matrix iteration
• Similar to approximate Newton-Raphson optimization (FastICA-type derivation → small additional term)
• Computational complexity: O(2K²T + KT + 4K³)
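A minimal FastICA-style sketch of such a matrix iteration, assuming the plain fixed-point update w ← E[x (w^T x)^2] with symmetric orthonormalization; the paper's exact Newton-derived update (with its small additional term) is not reproduced here, and the demo data are toy assumptions.

```python
import numpy as np

def skew_ica(X, K, n_iter=30, seed=0):
    """Find K orthonormal rows of W maximizing the skewness of W @ X.

    X is assumed whitened, K x M (rows = dimensions, columns = samples).
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((K, K))
    for _ in range(n_iter):
        S = W @ X
        W = (S**2) @ X.T / X.shape[1]                 # w <- E[x (w^T x)^2], rowwise
        vals, vecs = np.linalg.eigh(W @ W.T)          # symmetric orthonormalization:
        W = vecs @ np.diag(vals**-0.5) @ vecs.T @ W   # W <- (W W^T)^(-1/2) W
        W *= np.sign(np.mean((W @ X)**3, axis=1))[:, None]  # keep outputs positively skewed
    return W

# demo: unmix positively skewed sources from a random mixture
rng = np.random.default_rng(1)
S_true = rng.exponential(size=(3, 20_000))            # skewed sources
X = rng.standard_normal((3, 3)) @ S_true              # mixed observations
X -= X.mean(axis=1, keepdims=True)
vals, vecs = np.linalg.eigh(np.cov(X))                # whiten before the iteration
Xw = vecs @ np.diag(vals**-0.5) @ vecs.T @ X
S_est = skew_ica(Xw, K=3) @ Xw                        # recovered skewed sources
```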
9 Topic Features from Word Features
• Assumption: terms = noisy linear expansion of (~independent) concepts (topics)
• Data compression: X^(2) := E^{-1} U^T D = V^T (a K × N matrix)
• Task: find unmixing matrix W^(2) and topic features S^(2) so that X^(2) = (W^(2))^T S^(2) (S^(2): K × N matrix)
• This time, use a clustering criterion
10 Topic Features from Word Features, continued
• Objective function: a clustering criterion with indicators z_kn giving the class of x_n
• Stochastic minimization → EM-type algorithm
[Figure: D (rows term 1 … term T, columns doc 1 … doc N) is mapped by W^(2) E^{-1} U^T to S^(2) (rows topic 1 … topic K, columns doc 1 … doc N)]
11 Topic Features from Word Features, continued
• Comparison approach: a set of binary classifiers → iterative algorithm
• The maximized quantity is a skewed, monotonically increasing function of the topic output s_k → a skewed prior is appropriate
• Variance normalized after LSA, independent topics → source components aligned to orthonormal axes
• Similar to the previous architecture
12 Combining the Tasks
• Joint optimization problem
• Information from the linear outputs and from the weights is complementary:
  – Topic clustering: weight peaks → representative words; projections → clustering information
  – Document prototype search: weight peaks → clustering information; projections → index terms
• Review the separating weights on D: (W^(2))^T E^{-1} U^T
13 Combining the Tasks, continued
• Whitening allows inspection but isn't practical → normalize variance along the K principal directions: D' := U E^{-1} U^T D
• Find a new unmixing matrix W^(2') to maximize G((W^(2'))^T U^T D') = G((W^(2'))^T E^{-1} U^T D) = G((W^(2'))^T X^(2)) → W^(2') = W^(2)
• Solve the relation (W^(2))^T U^T = S^(1); since also (W^(1))^T U^T = S^(1), together these give W^(1) = W^(2) = W
• Rewrite the objective → concatenate the data: X = [U^T, V^T]
14 Combining the Tasks, continued
• Resultant algorithm: O(2K²(T + N) + K(T + N) + 4K³)
Inputs: D, K
1. Decompose D with the Lanczos algorithm; retain the K first singular values; obtain U, E, V.
2. Let X = [U^T, V^T]
3. Iterate until convergence (see the sketch below).
Outputs: S ∈ ℝ^(K × (T+N)), W ∈ ℝ^(K × K)
• S: [T document prototypes | N topic-features]; W: structure information of the identified topics in the corpus
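A hedged end-to-end sketch of this algorithm, assuming numpy's dense SVD in place of Lanczos and the FastICA-style skewness update from the earlier sketch in place of the paper's exact step-3 iteration:

```python
import numpy as np

def ica_of_text(D, K, n_iter=30, seed=0):
    # 1. Decompose D, retaining the K first singular triplets (Lanczos in the paper)
    U_f, e, Vt_f = np.linalg.svd(D, full_matrices=False)
    Ut, Vt = U_f[:, :K].T, Vt_f[:K, :]

    # 2. Concatenate the two projected views: X = [U^T, V^T], a K x (T+N) matrix
    X = np.concatenate([Ut, Vt], axis=1)

    # 3. Iterate a shared unmixing matrix W until convergence
    #    (FastICA-style skewness update; the paper's exact rule differs slightly)
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((K, K))
    for _ in range(n_iter):
        S = W @ X
        W = (S**2) @ X.T / X.shape[1]                 # w <- E[x (w^T x)^2]
        vals, vecs = np.linalg.eigh(W @ W.T)
        W = vecs @ np.diag(vals**-0.5) @ vecs.T @ W   # W <- (W W^T)^(-1/2) W
        W *= np.sign(np.mean((W @ X)**3, axis=1))[:, None]

    S = W @ X                      # K x (T+N)
    T = D.shape[0]
    return S[:, :T], S[:, T:], W   # document prototypes, topic-features, W

# demo on a toy term-document matrix
rng = np.random.default_rng(0)
D = rng.poisson(0.3, size=(100, 40)).astype(float)
prototypes, topic_features, W = ica_of_text(D, K=4)
```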
15 Simulations
Simulation 1: newsgroup data ('sci.crypt', 'sci.med', 'sci.space', 'soc.religion.christian')
10 most frequent words per class:
  sci.crypt:              kei, encrypt, system, chip, secur, govern, clipper, public, peopl, escrow, comput
  sci.med:                effect, year, call, peopl, medic, question, ve, doctor, find, patient, studi
  sci.space:              space, nasa, orbit, dai, year, system, high, launch, man, scienc, engin
  soc.religion.christian: peopl, christian, god, rutger, thing, church, bibl, question, part, find, christ
10 most representative words selected by the algorithm, conformal with human labeling:
  sci.med:                medic, patient, year, effect, diseas, doctor, studi, health, call, test, physician
  soc.religion.christian: god, christian, peopl, rutger, thing, bibl, christ, understand, church, point, question
  sci.space:              space, nasa, orbit, launch, dai, mission, flight, engin, shuttl, system, scienc
  sci.crypt:              kei, encrypt, secur, govern, system, clipper, chip, public, escrow, de, law
Simulation 2: 10 most representative words, using 5 topics and 2 document classes ('sci.space', 'soc.religion.christian'):
  I:   people, church, group, thing, year, find, question, bibl, read, faith, issu
  II:  god, man, life, love, christian, live, jesu, christ, rutger, human, save
  III: dai, year, nasa, moon, jpl, earth, orbit, part, gov, ron, venu
  IV:  space, system, shuttl, design, research, cost, human, discuss, launch, dr, station
  V:   sex, issu, term, sexual, basi, respons, homosexu, refer, fornic, intercours, law
16 Conclusions
Dependency structure of the splitting in simulation 2:
  sci.space → space shuttle design (IV), space shuttle mission (III)
  soc.religion.christian → christian church (I), christian religion (II), christian morality (V)
• Clustering and keyword identification by an ICA variant that maximizes skewness
• Key assumption: asymmetrical latent prior
• Joint problem solved (D and D^T) → 'spatio-temporal' ICA
• Algorithm is linear in the number of documents, O(K²N)
• Fast convergence (3 - 8 steps)
• Potential number of topics can be greater than indicated by a human labeler → discover subtopics
• Hierarchical partitioning possible (recursive binary splits; see the sketch below)
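A brief sketch of that recursive binary splitting, assuming the ica_of_text sketch from slide 14 plus an invented depth/size stopping rule:

```python
import numpy as np

def hierarchical_split(D, depth, doc_ids=None):
    # Recursively split the corpus with a K=2 topic model, assigning each
    # document to its dominant topic-feature; returns a nested list of ids.
    if doc_ids is None:
        doc_ids = np.arange(D.shape[1])
    if depth == 0 or len(doc_ids) < 4:          # assumed stopping rule
        return doc_ids.tolist()
    _, topic_features, _ = ica_of_text(D[:, doc_ids], K=2)
    labels = np.argmax(topic_features, axis=0)  # dominant topic per document
    return [hierarchical_split(D, depth - 1, doc_ids[labels == k])
            for k in (0, 1)]
```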
17 Further Work
[Figure: scatter plots of the documents after splits 1-3; classes marked as x = 'sci.crypt', o = 'sci.space', · = 'soc.religion.christian', with a fourth marker for 'sci.med']
• Study links with other methods → improve flexibility
• Or develop a mechanism to allow a more structured representation, in a mixed or hierarchical manner
• For example: build model estimation into the algorithm
• Relax the equal w_k norm assumption