A Latent Semantic Indexing-based approach to multilingual document clustering. Chih-Ping Wei, Christopher C. Yang, Chia-Min Lin. Decision Support Systems 45 (2008) 606-620. Reporter: Yi Ru, Lee
Outline • Introduction • Latent Semantic Indexing (LSI) • LSI-based multilingual document clustering technique • Empirical evaluation • Conclusion
Introduction • Translation-based • Synonymy • Polysemy • Vocabulary • Multilingual space • Latent Semantic Indexing (LSI) • Lexical matching • Reduce the dimensions
Latent Semantic Indexing (cont.) Singular Value Decomposition (SVD)
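The slide only names the decomposition, so here is a brief reminder of the standard SVD as it is normally used in LSI. The symbols are assumptions, not taken from the slides: A is the term-document matrix (t terms by d documents), r is its rank, and k is the number of LSI dimensions kept.

```latex
% Thin SVD of the t x d term-document matrix A (U is t x r, V is d x r)
A = U \Sigma V^{\top}, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \quad
\sigma_1 \ge \dots \ge \sigma_r > 0

% Keeping only the k largest singular values gives the rank-k approximation
A_k = U_k \Sigma_k V_k^{\top}, \qquad k \le r
```

The rows of V_k then serve as the k-dimensional representations of the documents.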
LSI-based multilingual document clustering technique (cont.) Multilingual semantic space analysis
LSI-based multilingual document clustering technique (cont.) Document folding-in
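A minimal sketch of document folding-in, assuming the standard LSI formula d_hat = d^T U_k S_k^{-1}; the slides do not give the formula, so this may differ from the paper's exact procedure. The toy matrix `A`, the value `k = 2`, and the helper name `fold_in` are all hypothetical.

```python
import numpy as np

# Toy term-document matrix (hypothetical data): rows are terms,
# columns are the documents used to build the semantic space.
A = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
], dtype=float)

# Thin SVD, then keep the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T


def fold_in(term_vector, U_k, s_k):
    """Project a term vector into the k-dimensional LSI space.

    Computes d^T U_k S_k^{-1}; dividing by s_k is the same as
    multiplying by the inverse of the diagonal matrix S_k.
    """
    return (term_vector @ U_k) / s_k


# A new document expressed over the same five terms.
new_doc = np.array([1, 1, 0, 0, 1], dtype=float)
new_doc_lsi = fold_in(new_doc, U_k, s_k)
print(new_doc_lsi)
```

Folding in a column of `A` itself returns the matching row of `V_k`, which is a quick sanity check that the projection is consistent with the space it was built from.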
LSI-based multilingual document clustering technique (cont.) Dimension selection • Dj denotes LSI dimension j • Wji is the weight of document i in Dj
LSI-based multilingual document clustering technique (cont.) • Clustering • Hierarchical clustering algorithm
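The clustering step can be sketched as a small agglomerative procedure over document vectors in the LSI space. The slides only say "hierarchical clustering algorithm", so the cosine similarity, the average-link criterion, the stopping threshold, and every name below are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def average_link(c1, c2, vectors):
    """Mean pairwise similarity between the members of two clusters."""
    sims = [cosine(vectors[i], vectors[j]) for i in c1 for j in c2]
    return sum(sims) / len(sims)


def hac(vectors, threshold):
    """Merge the most similar pair of clusters until none reaches threshold."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best_sim, best_pair = -math.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = average_link(clusters[a], clusters[b], vectors)
                if sim > best_sim:
                    best_sim, best_pair = sim, (a, b)
        if best_sim < threshold:
            break
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical 2-dimensional LSI vectors for four documents.
docs = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9)]
clusters = hac(docs, threshold=0.8)
print(clusters)
```

Lowering the threshold yields fewer, larger clusters and raising it yields more, smaller ones, so sweeping it is one way to trace a precision/recall trade-off.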
Empirical evaluation (cont.) • TA is the set of associations in the true categories. • GA is the set of associations in the clusters generated by the document clustering technique. • CA is the set of correct associations that exist in both the clusters and the true categories.
Empirical evaluation (cont.) Example
TA = {(e1−e2), (c1−c2), (e1−c1), (e1−c2), (e2−c1), (e2−c2), (e3−e4), (c3−c4), (c3−c5), (c4−c5), (e3−c3), (e3−c4), (e3−c5), (e4−c3), (e4−c4), (e4−c5)}
GA = {(e1−e2), (c1−c3), (e1−c1), (e1−c3), (e2−c1), (e2−c3), (e3−e4), (e3−c2), (e4−c2), (c4−c5)}
CA = {(e1−e2), (e1−c1), (e2−c1), (e3−e4), (c4−c5)}
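The example sets can be reproduced with a short script. Two things here are assumptions rather than content of the slides: the grouping of documents into categories and clusters is inferred from the listed pairs, and cluster recall and cluster precision are taken to be |CA|/|TA| and |CA|/|GA|, which the slides do not state.

```python
from itertools import combinations


def associations(groups):
    """Return every unordered document pair that shares a group."""
    pairs = set()
    for group in groups:
        for a, b in combinations(sorted(group), 2):
            pairs.add(frozenset((a, b)))
    return pairs


# Groupings inferred from the pairs listed in the example (assumption).
true_categories = [
    {"e1", "e2", "c1", "c2"},
    {"e3", "e4", "c3", "c4", "c5"},
]
generated_clusters = [
    {"e1", "e2", "c1", "c3"},
    {"e3", "e4", "c2"},
    {"c4", "c5"},
]

TA = associations(true_categories)      # associations in the true categories
GA = associations(generated_clusters)   # associations in the generated clusters
CA = TA & GA                            # correct associations

# Assumed definitions of the two measures.
cluster_recall = len(CA) / len(TA)
cluster_precision = len(CA) / len(GA)

print(len(TA), len(GA), len(CA))          # 16 10 5
print(cluster_recall, cluster_precision)  # 0.3125 0.5
```

The counts match the sets on the slide: 16 true associations, 10 generated associations, and 5 correct ones.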
Empirical evaluation (cont.) PRT curves of the LSI-based MLDC technique
Empirical evaluation (cont.) Comparisons of different representation schemes
Empirical evaluation (cont.) Effect of dimension selection (h=5 for MLDC with dimension selection; k=5 for MLDC without dimension selection)
Empirical evaluation (cont.) Effect of dimension selection (h=20 for MLDC with dimension selection; k=20 for MLDC without dimension selection)
Empirical evaluation (cont.) Best scenario versus best scenario comparison
Empirical evaluation (cont.) PRT curves of overall, monolingual, and cross-lingual performance
Conclusion • Monolingual PRT curve > overall PRT curve > cross-lingual PRT curve • Specific domain