Dimensionality reduction by random projection and latent semantic indexing
Ângelo Cardoso, IST/UTL, December 2009
Paper by Jessica Lin and Dimitrios Gunopulos
Outline
• Introduction
• Latent Semantic Indexing (LSI)
• Random Projection (RP)
• Combining LSI and Random Projection
• Experiments
  • Dataset and pre-processing
  • Document Similarity
  • Document Clustering
Introduction: Latent Semantic Indexing
• Vector-space model
  • Term-to-document matrix where each entry is the relative frequency of a term in the document
• Find a k-dimensional subspace onto which to project the original term-to-document matrix (sketched below)
  • SVD gives the optimal solution in the mean-squared-error sense
• Speed up queries
• Address synonymy
• Find the intrinsic dimensionality of the data
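As a concrete illustration, here is a minimal NumPy sketch of LSI as a rank-k truncated SVD of the term-to-document matrix; the function name and the dense-matrix assumption are illustrative, not from the paper:

```python
import numpy as np

def lsi_project(A, k):
    """Reduce a terms x documents matrix A to k dimensions via LSI,
    i.e. keep only the top-k singular triplets of A."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)  # A = U @ diag(S) @ Vt
    # Documents expressed in the k-dimensional latent space (k x n)
    return np.diag(S[:k]) @ Vt[:k, :]
```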
Introduction: Random Projection
• What if we construct the subspace to project onto randomly? (sketched below)
• Johnson-Lindenstrauss lemma
  • If points in a vector space are projected onto a randomly selected subspace of suitably high dimension, then the distances between the points are approximately preserved
• Making the subspace orthogonal is computationally expensive
• However, we can rely on a result by Hecht-Nielsen:
  • In a high-dimensional space there are many more almost-orthogonal directions than orthogonal directions
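A minimal sketch of such a projection, assuming the common choice of i.i.d. Gaussian entries scaled by 1/sqrt(k); this is one instantiation satisfying the lemma, the slide does not fix the distribution:

```python
import numpy as np

def random_project(A, k, seed=0):
    """Reduce a terms x documents matrix A to k dimensions by
    multiplying with a random matrix, with no orthogonalization."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((k, A.shape[0])) / np.sqrt(k)  # k x m
    return R @ A  # k x n projected documents
```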
Combining LSI and Random Projection: Motivation
• LSI
  • Captures the underlying semantics
  • Highly accurate; can improve retrieval performance
  • Computationally expensive: O(cmn), where m is the number of terms, c is the average number of terms per document, and n is the number of documents
• Random Projection
  • Efficient in terms of computational time
  • Does not preserve as much information as LSI
Combining LSI and Random Projection: Algorithm
• Proposed in:
  • C. H. Papadimitriou, P. Raghavan, H. Tamaki, and S. Vempala. "Latent Semantic Indexing: A Probabilistic Analysis." Journal of Computer and System Sciences, 2000.
• Idea
  • Improve Random Projection's accuracy
  • Improve LSI's computational time
• Step 1: reduce the data to an intermediate dimension k1 using Random Projection
• Step 2: apply LSI to the reduced lower-dimensional data to reach the desired dimension k2 (see the sketch below)
• Complexity is O(ml(l + c)), with l the intermediate dimension k1:
  • RP on the original data: O(mcl)
  • LSI on the reduced lower-dimensional data: O(ml²)
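Reusing the random_project and lsi_project sketches above, the combined algorithm is only a few lines; k1 and k2 follow the description on this slide:

```python
def rp_lsi(A, k1, k2, seed=0):
    """Two-step reduction: random projection to an intermediate
    dimension k1, then LSI on the small k1 x n matrix down to k2."""
    B = random_project(A, k1, seed)  # cheap pass over the sparse data
    return lsi_project(B, k2)        # exact SVD on the reduced matrix
```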
Experiments – Similarity: Dataset and Pre-processing
• Two subsets of the Reuters text categorization collection
• Common and rare words are removed
• Porter stemming is applied
• Term-document matrix representation, with documents normalized to unit length (a hypothetical pipeline is sketched below)
• Larger subset
  • 10,377 documents
  • 12,113 terms
  • term-document matrix density is 0.4%
• Smaller subset
  • 1,831 documents
  • 5,414 terms
  • term-document matrix density is 0.8%
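A hypothetical preprocessing pipeline along these lines, using scikit-learn and NLTK; the docs variable and the max_df/min_df cut-offs for common and rare words are illustrative assumptions, not values from the paper:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
base_analyzer = CountVectorizer().build_analyzer()  # default tokenization

def stemmed(doc):
    # Tokenize, then Porter-stem every token
    return [stemmer.stem(tok) for tok in base_analyzer(doc)]

# docs: a list of document strings (assumed to exist)
# max_df / min_df drop "common" and "rare" words, respectively
vectorizer = CountVectorizer(analyzer=stemmed, max_df=0.5, min_df=5)
X = vectorizer.fit_transform(docs)   # documents x terms, raw counts
X = normalize(X, norm='l2', axis=1)  # each document to unit length
```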
Experiments – Similarity: Layout
• Three dimensionality-reduction techniques are compared:
  • Latent Semantic Indexing (LSI)
  • Random Projection (RP)
  • Random Projection followed by LSI (RP_LSI)
• The dimensionality of the original data is reduced to k dimensions
  • k = 50, 100, 200, 300, 400, 500, 600
Experiments – Similarity: Metrics
• Euclidean distance
• Cosine of the angle between documents
• Determining the error (sketched below)
  • Randomly select 100 document pairs and calculate their distances before and after dimensionality reduction
  • Compute the correlation between the distance vectors before (x) and after (y) dimensionality reduction
  • The error is defined as error = 1 − corr(x, y)
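A sketch of this procedure, assuming the error is one minus the Pearson correlation of the two distance vectors, consistent with the definition above; columns are documents and the function name is mine:

```python
import numpy as np

def similarity_error(X, Y, n_pairs=100, seed=0):
    """1 - correlation between pairwise Euclidean distances measured
    before (X) and after (Y) dimensionality reduction."""
    rng = np.random.default_rng(seed)
    pairs = rng.choice(X.shape[1], size=(n_pairs, 2))  # random doc pairs
    before = np.array([np.linalg.norm(X[:, i] - X[:, j]) for i, j in pairs])
    after = np.array([np.linalg.norm(Y[:, i] - Y[:, j]) for i, j in pairs])
    return 1.0 - np.corrcoef(before, after)[0, 1]
```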
Experiments – Similarity: Distance before and after dimensionality reduction
• As expected, LSI is the best technique in terms of error
• RP_LSI improves on the accuracy of RP for both Euclidean distance and dot product
(* RP_LSI: k1 = 600)
Experiments – Similarity: RP_LSI k1 and k2 parameters
• The amount of the second reduction (the final dimension k2) matters more for achieving a small error than the amount of the first reduction (k1)
• This suggests that LSI plays a more important role in preserving similarity than RP
Experiments – Similarity: Running Time
• RP_LSI performs slightly worse than LSI on the larger dataset (more sparse)
• RP_LSI achieves a significant improvement over LSI on the smaller dataset (less sparse)
(* RP_LSI: k1 = 600)
Experiments – Clustering: Layout
• Clustering is applied to the data before and after dimensionality reduction
• Experiments are performed on the smaller dataset
• The clustering algorithm chosen is classic k-means
  • Effective
  • Low computational cost
• Document vectors are normalized to unit length before clustering
• Centroids are normalized to unit length after clustering
Experiments – Clustering: k-means
• The k-means objective is to minimize the sum of intra-cluster errors
• The quality of each dimensionality reduction is evaluated using this criterion
• Since the dimensionality of the data is reduced, the criterion is computed in the original space to make the comparison possible (see the sketch below)
• The number of clusters is set to 5, roughly the number of main topics in the dataset
• Initialization is random
• k-means is repeated 20 times for each experiment and the average is taken
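A sketch of this evaluation using scikit-learn's k-means with random initialization; rows are documents here, and the function and variable names are mine:

```python
import numpy as np
from sklearn.cluster import KMeans

def intra_cluster_error(X_orig, X_reduced, k=5, runs=20, seed=0):
    """Cluster the reduced data, then score each partition by the sum
    of intra-cluster squared errors computed in the ORIGINAL space;
    return the average score over the runs."""
    errors = []
    for r in range(runs):
        km = KMeans(n_clusters=k, init='random', n_init=1,
                    random_state=seed + r).fit(X_reduced)
        sse = 0.0
        for c in range(k):
            members = X_orig[km.labels_ == c]
            if len(members):
                sse += ((members - members.mean(axis=0)) ** 2).sum()
        errors.append(sse)
    return float(np.mean(errors))
```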
Experiments – Clustering: Results
• LSI and RP_LSI give results similar to the original data, even at smaller dimensions
• RP performs significantly worse at smaller dimensions and closer to the others at larger dimensions
• LSI shows slightly better results than RP_LSI
• Clustering results using Euclidean distance are similar
Conclusion
• LSI and Random Projection were compared
• The combination of Random Projection and LSI was analyzed
• The sparseness of the data seems to play a central role in the effectiveness of this technique
  • The technique appears to be more effective the less sparse the original data is
  • SVD complexity is linear in the number of nonzero entries, i.e., in the sparseness of the data
  • Random Projection makes the data completely dense
  • The gain from first reducing the dimensionality competes with the extra cost that the now-dense data adds to the SVD computation
Conclusion
• Additional experiments are necessary to confirm that it is indeed the sparseness of the data that causes the discrepancy between the observed running times and what was expected
• Other dimensionality-reduction algorithms that preserve the sparseness of the data might be useful in improving the running time of LSI