Independent Components in Text. Paper by Thomas Kolenda, Lars Kai Hansen and Sigurdur Sigurdsson. Presented by Yuan Zhijian
Vector Space Representations • Indexing: forming the term set of all words occurring in the database. -- Form the term set -- Represent each document as a term histogram -- Collect the histograms into a term-document matrix
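The indexing step above can be sketched in a few lines; the toy corpus is hypothetical and serves only to illustrate forming the term set and the term-document matrix:

```python
# Hypothetical toy corpus of three short "documents".
docs = [
    "heart attack risk factors",
    "heart disease treatment",
    "aircraft wing design",
]

# Indexing: form the term set of all words occurring in the database.
terms = sorted({w for d in docs for w in d.split()})

# Term-document matrix: entry [i][j] counts occurrences of term i in document j.
X = [[d.split().count(t) for d in docs] for t in terms]
```

Each column of `X` is the term histogram of one document, which is the representation the rest of the pipeline operates on.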
Vector Space Representations • Weighting: determine the values of the term weights • Similarity measure: based on the inner product of weight vectors, or other metrics
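A minimal sketch of the inner-product-based similarity measure mentioned above, using the standard cosine similarity (the function name `cosine` is my own, not from the paper):

```python
import math

def cosine(u, v):
    # Inner product of the weight vectors, normalized by their lengths,
    # so identical directions score 1.0 and orthogonal vectors score 0.0.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)
```

Applied to two term-histogram columns, this gives a document-document similarity independent of document length.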
LSI-PCA Model • The main objective is to uncover hidden linear relations between term histograms by rotating the vector space basis. • Simplify by keeping only the k largest singular values
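Keeping the k largest singular values amounts to a truncated SVD of the term-document matrix; a minimal numpy sketch on random stand-in data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 10))  # stand-in for a term-document matrix

# Full SVD, then keep only the k largest singular values (LSI approximation).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 3
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k reconstruction
```

By the Eckart-Young theorem, the squared Frobenius error of this truncation equals the sum of the squared discarded singular values, which is why keeping the largest ones is optimal.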
ICA—Noisy Separation • Model: X = AS + U • Assumptions: -- i.i.d. sources -- i.i.d. Gaussian noise with variance σ² -- a prescribed source prior distribution
ICA—Noisy Separation (cont.) • Known mixing parameters, e.g. A and σ² -- Bayes' formula: P(S|X) ∝ P(X|S)P(S) -- Maximize w.r.t. S (MAP estimate) -- Solution: the MAP source estimate -- For low noise levels this reduces to approximately inverting the mixing
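A minimal sketch of the low-noise regime, under the assumption (mine, for illustration) that the known mixing matrix A is square and well-conditioned, so the MAP estimate is well approximated by the pseudo-inverse:

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.laplace(size=(2, 500))            # sparse i.i.d. sources
A = np.array([[1.0, 0.5], [0.3, 1.0]])    # known mixing matrix (hypothetical values)
X = A @ S + 0.001 * rng.standard_normal((2, 500))  # X = AS + U, low noise

# Low-noise limit: inverting the known mixing recovers the sources.
S_hat = np.linalg.pinv(A) @ X
```

With higher noise levels the pseudo-inverse is no longer adequate and the full MAP estimate, which weighs the likelihood against the source prior, must be used.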
ICA (cont.) • Text representations in the LSI space • Document classification • Key words -- back projection of documents to the original term-histogram space
ICA (cont.) • Generalisation error -- principal tool for model selection • Bias-variance dilemma: -- too few components lead to high error (bias) -- too many components lead to overfitting
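The bias-variance trade-off above can be illustrated with a held-out reconstruction error on synthetic low-rank data (rank 3 plus noise, my own toy setup); the error should be large for too few components and grow again as extra components fit the training noise:

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic data with true rank 3, observed with independent noise twice.
signal = rng.standard_normal((40, 3)) @ rng.standard_normal((3, 25))
train = signal + 0.1 * rng.standard_normal(signal.shape)
test = signal + 0.1 * rng.standard_normal(signal.shape)

# Fit on train, score each model order k on held-out data.
U, s, Vt = np.linalg.svd(train, full_matrices=False)
errors = []
for k in range(1, 10):
    recon = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    errors.append(np.linalg.norm(test - recon))  # generalization-error proxy
best_k = 1 + int(np.argmin(errors))
```

The minimum of the held-out error curve is the model-selection criterion: components beyond the true rank only reproduce training noise, so the test error turns back up.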
Examples • MED data set -- 124 abstracts, 5 groups, 1159 terms • Results: -- ICA is successful in recognizing and "explaining" the group structure.
Examples • CRAN data set -- 5 classes, 138 documents, 1115 terms • Results: -- ICA identified some group structure but not as convincingly as in the MED data
Conclusion • ICA is an effective unsupervised tool for finding group structure in text • Independence of the sources may or may not be well aligned with a manual labeling