330 likes | 841 Views
Latent Semantic Analysis A Gentle Tutorial Introduction Tutorial Resources http://cis.paisley.ac.uk/giro-ci0/GU_LSA_TUT. M.A. Girolami. Contents. Latent Semantic Analysis Motivation Singular Value Decomposition Term Document Matrix Structure Query and Document Similarity in Latent Space
E N D
Latent Semantic AnalysisA Gentle Tutorial IntroductionTutorial Resourceshttp://cis.paisley.ac.uk/giro-ci0/GU_LSA_TUT M.A. Girolami University of Glasgow DCS Tutorial
Contents • Latent Semantic Analysis • Motivation • Singular Value Decomposition • Term Document Matrix Structure • Query and Document Similarity in Latent Space • Probabilistic Views on LSA • Factor Analytic Model • Generative Model Representation • Alternate Basis to the Principal Directions • Latent Semantic & Document Clustering (In the Bar later) • Principal Direction Clustering • Hierarchic Clustering with LSA University of Glasgow DCS Tutorial
Latent Semantic Analysis • Motivation • Lexical matching at term level inaccurate (claimed) • Polysemy – words with number of ‘meanings’ – term matching returns irrelevant documents – impacts precision • Synonomy – number of words with same ‘meaning’ – term matching misses relevant documents – impacts recall • LSA assumes that there exists a LATENT structure in word usage – obscured by variability in word choice • Analogous to signal + additive noise model in signal processing University of Glasgow DCS Tutorial
Latent Semantic Analysis • Word usage defined by term and document co-occurrence – matrix structure • Latent structure / semantics in word usage • Clustering documents or words – no shared space • Two mode factor analysis – dyadic decomposition into ‘latent semantic’ factor space - employing - Singular Value Decomposition • Cubic Computational Scaling – reasonable ! University of Glasgow DCS Tutorial
Singular Value Decomposition • M× N, Term × Document matrix (M >> N) D = [d1, d2, …, dN] and d= [t1, t2, …, tM]T Consider linear combination of terms u1t1+ u2t2+ … + uMtM = uTd which maximises E{(uTd)2} = E{uTddTu} = uT E{ddT}u ≈ uTDDTu Subject touTu = 1 University of Glasgow DCS Tutorial
Singular Value Decomposition Maximise uTDDTu s.tuTu = 1 Construct Langrangian uTDDTu–λuTu Vector of partial derivatives set to zero DDTu –λu =(DDT –λI) u = 0 As u ≠ 0 then DDT –λI must be singular i.e |DDT –λI|= 0 This is a polynomial in λ of degree M with characteristic roots – called the eigenvalues (German eigen = own, unique to, particular to) University of Glasgow DCS Tutorial
Singular Value Decomposition The first root is called the prinicipal eigenvalue which has an associated orthonormal (uTu = 1) eigenvectoru Subsequent roots are ordered such that λ1> λ2 >… > λM with rank(D) non-zero values. Eigenvectors form an orthonormal basis i.e. uiTuj = δij The eigenvalue decomposition of DDT = UΣUT whereU = [u1, u2, …, uM] and Σ= diag[λ1, λ2, …, λM] Similarly the eigenvalue decomposition ofDTD = VΣVT The SVD is closely related to the above D=U Σ1/2 VT The left eigenvectors U, right eigenvectors V, singular values = square root of eigenvalues. University of Glasgow DCS Tutorial
SVD Properties • D=U S VT= ∑i=1..NσiuiviT and DK=∑i=1..KσiuiviT = UK SK VKTandK<N : UK TUK = IK = VK TVK • ThenDKis best rank K approximation to D,inF norm sense • K-dim orthonormal projections S-1K UK TD=VKTpreserve the maximum amount of variability • Under the assumption that columns of D are multivariate Gaussian then V defines principal axes of ellipse of constant varianceλi in original space University of Glasgow DCS Tutorial
D -- 10 x 2 U -- 10 x 2 S -- 2 x 2 V T -- 2 x 2 2.9002 3.6790 4.0860 5.2366 1.9954 3.3687 3.5069 1.6748 4.4620 2.7684 -2.9444 -4.6447 -4.1132 -4.7043 -3.6208 -5.0181 -3.0558 -4.1821 -6.1204 -2.4790 -0.2750 -0.1242 -0.3896 -0.1846 -0.2247 -0.2369 -0.2150 0.3514 -0.3005 0.3318 0.3177 0.2906 0.3682 0.0833 0.3613 0.2319 0.3027 0.1861 0.3563 -0.6935 -0.6960 -0.7181 0.7181 -0.6960 16.9491 0 0 3.8491 SVD Example University of Glasgow DCS Tutorial
SVD Properties • There is an implicit assumption that the observed data distribution is multivariate Gaussian • Can consider as a probabilistic generative model – latent variables are Gaussian – sub-optimal in likelihood terms for non-Gaussian distribution • Employed in signal processing for noise filtering – dominant subspace contains majority of information bearing part of signal • Similar rationale when applying SVD to LSI University of Glasgow DCS Tutorial
Computing SVD • Power Method one numerical approach Random initialisation of vector u0 Set u1u = DDTu0 and u1 = u1u / √ (u1u)T u1u then u2u = DDTu1 and u2 = u2u / √ (u2u)T u2u Then uiu = DDTui-1 and ui = uiu / √ (uiu)T uiu As i ∞, ui u1, √ (uiu)T uiuλ1 • Subsequent EV’s use deflation u1u = (DDT - λ1u1u1T)u0 • Note for term document matrix computation of u1 Inexpensive – subsequent ev’s require matrix-vector operations on dense matrix. University of Glasgow DCS Tutorial
Term Document Matrix Structure • Create artificially heterogeneous collection • 100 documents from 3 distinct newsgroups • Indexed using standard stop word list • 12418 distinct terms • Term × Document Matrix (12418 × 300) • 8% fill of sparse matrix • Sort terms by rank – structure apparent • Matrix of cosine similarity between documents • Clear structure apparent University of Glasgow DCS Tutorial
Term Document Matrix Structure University of Glasgow DCS Tutorial
Query and Document Similarity in Latent Space • Rank 3 D3 = σ1u1v1T+ σ2u2 v2T+ σ3u3 v3T • Projection into 3-d Latent Semantic Space • of all documents achieved by S3-1U3TD • A query q in theLSA space S3-1U3Tq • Similarity in LSA space • (S3-1U3Tq)T S3-1U3TD • = qTU3S3-1S3-1U3TD • = qTU3∑3-1U3TD • = qT expD =qT Θ D • LSA similarity metric Θ – term expansion University of Glasgow DCS Tutorial
Query and Document Similarity in Latent Space • Project documents into 3-D latent space • Project query University of Glasgow DCS Tutorial
Random Projections • Important theoretical result • Random projection from M - dim to L - dim space • Where L << M then • Euclidean distance and angles (norms and inner products) are preserved with high probability • LSA can then be performed using SVD on the reduced dimensional L × N matrix (less costly) University of Glasgow DCS Tutorial
LSA Performance • LSA consistently improves recall on standard test collections (precision/recall generally improved) • Variable performance on larger TREC collections • Dimensionality of Latent Space – a magic number – 300 – 1000 seems to work fine – no satisfactory way of assessing value. • Computational cost – at present – prohibitive University of Glasgow DCS Tutorial
Probabilistic Views on LSA • Factor Analytic Model • Generative Model Representation • Alternate Basis to the Principal Directions University of Glasgow DCS Tutorial
Factor Analytic Model • d = Af + n • p(d) = ∑f p(d|f)p(f) • This probabilistic representation underlies LSA where prior and likelihood are both multivariate Gaussian. University of Glasgow DCS Tutorial
Generative ModelRepresentation • Generate a document d with probability p(d) • Having observed d generate a semantic factor with probability p(f|d) • Having observed a semantic factor generate a word with probability p(w|f) University of Glasgow DCS Tutorial
P(d) Factor 3 Factor 2 Documents P(w|f) P(f|d) Factor 1 Generative ModelRepresentation The cat sat on the mat and the quick brown fox jumped… spider University of Glasgow DCS Tutorial
Generative ModelRepresentation • Model representation as joint probability p(d,w) = p(d)p(w|d) = p(d)∑f p(w|f)p(f|d) w and d conditionally independent given f • p(d,w) = ∑f p(w|f)p(f)p(d|f) • Note similarity with DK=∑i=1..KσiuiviT University of Glasgow DCS Tutorial
P(w=spider|f4)=0.6 P(w=spider|f4)=0.02 P(w=spider|f4)=0.01 P(w=spider|f4)=0.1 p(d,w) = p(d)∑f p(w|f)p(f|d) = 0.001 p(f=4|d)=0.05 p(f=1|d)=0.6 p(f=2|d)=0.1 p(f=3|d)=0.25 The cat sat on the mat and the quick brown fox jumped… Documents P(d) = 0.003 University of Glasgow DCS Tutorial
Generative ModelRepresentation • Distributions of p(f|d) and p(w|f) are multinomial – counts in successive trials • More appropriate than Gaussian • Note that Term × Document matrix is a sample from the true distribution pt(d, w) • ∑ijD(i,j) log p(dj, wi) – cross-entropy between model and realisation – maximise likelihood that the model p(dj, wi) generated the realisation D – subject to conditions on p(f|d) and p(w|f) University of Glasgow DCS Tutorial
Generative ModelRepresentation • Estimation of p(f|d) and p(w|f) requires use of a standard EM algorithm. • Expectation Maximisation • General iterative method for ML parameter estimation • Ideal for ‘missing variable’ problems • Estimate p(f|d,w) using current estimates of p(w|f) and p(f|d) • Estimate new values of p(w|f) and p(f|d) using current estimate of p(f|d,w) University of Glasgow DCS Tutorial
Generative ModelRepresentation • Once parameters estimated • p(f|d) gives posterior probability that Semantic factor ‘f’ is associated with d • p(w|f) gives the probability of word ‘w’ being generated from Semantic factor ‘f’ • Nice clear interpretation unlike U and V terms in SVD • ‘Sparse’ representation – unlike SVD University of Glasgow DCS Tutorial
Generative ModelRepresentation • Take the toy collection generated – estimate p(f|d) and p(w|f) • Graphical representation of p(f|d) University of Glasgow DCS Tutorial
Generative ModelRepresentation • Ordered representation of p(w|f) University of Glasgow DCS Tutorial
Alternate Basis to the Principal Directions • Similarity between query and documents can be assessed in ‘factor’ space – vis. LSA • Sim = ∑f p(f|q) p(f|D) averaged product of query and doc posterior probabilities over all ‘factors’ – latent space • Alternately note that D and q are sample instances from an unknown distribution • All probabilities – word counts – estimated from D ‘noisy’ • Employ p(dj, wi) as ‘smoothed’ version of tf and use ‘cosine’ measure ∑i p(D, wi) × qi ‘query expansion’ University of Glasgow DCS Tutorial
Alternate Basis to the Principal Directions • Both forms of matching shown to improve on LSA (MED,CRAN,CACM) • Elegant statistically principled approach – can employ (in theory) Bayesian model assessment techniques. • Likelihood nonlinear function of parameters p(f|d) and p(w|f) – Huge parameter space – small number of relative samples – high bias and variance expected • Assessment of correlation with likelihood and P/R – yet to be studied in depth University of Glasgow DCS Tutorial
Conclusions • SVD defined basis provide P/R improvements over term matching • Interpretation difficult • Optimal dimension – open question • Variable performance on LARGE coll’s • Supercomputing muscle required • Probabilistic approaches provide improvements over SVD • Clear interpretation of decomposition • Optimal dimension – open question • High variability of results due to nonlinear optimisation over HUGE parameter space • Improvements marginal in relation to cost University of Glasgow DCS Tutorial
Latent Semantic & Hierarchic Document Clustering • Had enough ? …. • ….. To the Bar… University of Glasgow DCS Tutorial