
Independent Components in Text

Paper by Thomas Kolenda, Lars Kai Hansen and Sigurdur Sigurdsson. Presentation by Yuan Zhijian.


Presentation Transcript


  1. Independent Components in Text • Paper by Thomas Kolenda, Lars Kai Hansen and Sigurdur Sigurdsson • Presentation by Yuan Zhijian

  2. Vector Space Representations • Indexing: form a term set of all words occurring in the database. -- Form the term set -- Represent each document as a histogram over the terms -- Collect the histograms into a term-document matrix
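
The indexing step can be sketched in a few lines. This is a minimal illustration on a made-up three-document corpus, not the paper's preprocessing; the names `docs`, `terms` and `term_doc` are chosen here for the example.

```python
from collections import Counter

# Toy corpus (hypothetical documents, not taken from the MED or CRAN data).
docs = [
    "heart disease patients",
    "wing flow pressure",
    "heart patients pressure",
]

# Indexing: the term set is every distinct word occurring in the database.
terms = sorted({word for doc in docs for word in doc.split()})

# Each document becomes a histogram of term counts.
histograms = [Counter(doc.split()) for doc in docs]

# Term-document matrix: one row per term, one column per document.
# Counter returns 0 for terms that do not occur in a document.
term_doc = [[hist[t] for hist in histograms] for t in terms]

for t, row in zip(terms, term_doc):
    print(f"{t:10s} {row}")
```

Real collections would usually also need stop-word removal and stemming before the term set is formed; those steps are left out here.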

  3. Vector Space Representations • Weighting: determine the values of the weights given to each term in each document • Similarity measure: based on the inner product of weight vectors, or on other metrics
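
As a rough illustration of weighting and similarity: the slide does not say which weighting scheme is used, so the sketch below assumes a plain TF-IDF weight and a cosine (normalised inner product) similarity. The count matrix and the function names `tfidf` and `cosine` are hypothetical.

```python
import math

# Toy term-document counts (rows = terms, columns = documents); made-up data.
counts = [
    [2, 0, 1],  # term 0 occurs in documents 0 and 2
    [0, 3, 0],  # term 1 occurs only in document 1
    [1, 1, 1],  # term 2 occurs everywhere, so it carries no weight
]

def tfidf(counts):
    """Weight each count by log(N / document frequency): one common choice."""
    weighted = []
    for row in counts:
        df = sum(1 for c in row if c > 0)
        idf = math.log(len(row) / df) if df else 0.0
        weighted.append([c * idf for c in row])
    return weighted

def cosine(u, v):
    """Inner product of two weight vectors, normalised by their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

weights = tfidf(counts)
doc_vectors = [list(col) for col in zip(*weights)]  # one weight vector per document

sim_0_2 = cosine(doc_vectors[0], doc_vectors[2])  # documents sharing term 0
sim_0_1 = cosine(doc_vectors[0], doc_vectors[1])  # documents with no weighted term in common
```

Here documents 0 and 2 come out as maximally similar and documents 0 and 1 as orthogonal, because the only term all three share receives zero weight.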

  4. LSI-PCA Model • The main objective is to uncover hidden linear relations between histograms by rotating the vector space basis. • Simplify by keeping only the k largest singular values of the term-document matrix
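
A minimal sketch of the truncation step with NumPy, assuming a small made-up term-document matrix `X`. The factor names `T`, `L` and `Dt` and the choice `k = 2` are illustrative, not taken from the slides.

```python
import numpy as np

# Hypothetical term-document matrix: 5 terms x 4 documents.
X = np.array([
    [2., 1., 0., 0.],
    [1., 2., 0., 0.],
    [0., 0., 3., 1.],
    [0., 0., 1., 2.],
    [1., 1., 1., 1.],
])

# SVD: X = T @ diag(L) @ Dt, with singular values L in descending order.
T, L, Dt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of components kept (an arbitrary choice for this example)
T_k, L_k, Dt_k = T[:, :k], L[:k], Dt[:k, :]

# Documents expressed in the k-dimensional LSI space (one column per document).
docs_lsi = np.diag(L_k) @ Dt_k

# Rank-k approximation of the original matrix and its relative error.
X_k = T_k @ docs_lsi
rel_error = np.linalg.norm(X - X_k) / np.linalg.norm(X)
print(f"relative reconstruction error with k={k}: {rel_error:.3f}")
```

The columns of `docs_lsi` are the coordinates of each document in the rotated basis; they equal `T_k.T @ X`, the projection of the histograms onto the leading left singular vectors.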

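  5. ICA: Noisy Separation • Model: X = AS + U, with observations X, mixing matrix A, sources S and noise U • Assumptions: -- i.i.d. sources -- i.i.d. Gaussian noise with variance [symbol not preserved in the transcript] -- Source distribution: [formula not preserved in the transcript]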
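
To make the generative model concrete, the sketch below simulates X = AS + U. The transcript does not preserve the source distribution, so a Laplace distribution is used purely as a stand-in for a non-Gaussian source; the dimensions, `noise_std` and the random mixing matrix are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_sources, n_samples = 4, 2, 5000
noise_std = 0.1  # hypothetical noise level

# Mixing matrix A (random values chosen for illustration).
A = rng.normal(size=(n_features, n_sources))

# I.i.d. non-Gaussian sources S; Laplace is a stand-in, not the slide's distribution.
S = rng.laplace(size=(n_sources, n_samples))

# I.i.d. Gaussian noise U and the observed mixtures X = A S + U.
U = noise_std * rng.normal(size=(n_features, n_samples))
X = A @ S + U

# Sanity checks: the sources are nearly uncorrelated, and the residual
# X - A S has the variance that was chosen for the noise.
source_corr = np.corrcoef(S)[0, 1]
residual_var = np.var(X - A @ S)
```

In the text application the columns of X would be documents (in the LSI space) rather than synthetic samples.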

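  6. ICA: Noisy Separation (cont.) • Known mixing parameters, e.g. A: -- Bayes' formula: P(S|X) ∝ P(X|S) P(S) -- Maximize the posterior with respect to S -- Solution: [formula not preserved in the transcript] -- For low noise level: [formula not preserved in the transcript]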
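
The steps in this slide can be written out as follows. This is a sketch of the standard MAP argument for Gaussian noise, not a reconstruction of the slide's own formulas: the noise variance symbol $\sigma^2$, the factorised source density $p(\cdot)$ and the assumption that $A$ has full column rank are all notation and assumptions introduced here.

```latex
\begin{align}
P(X \mid S) &\propto \exp\!\left(-\frac{1}{2\sigma^2}\,\lVert X - AS \rVert_F^2\right), \\
\log P(S \mid X) &= -\frac{1}{2\sigma^2}\,\lVert X - AS \rVert_F^2
                    + \sum_{k,n} \log p(S_{kn}) + \text{const}, \\
\hat{S} &= \arg\max_{S}\, \log P(S \mid X), \\
\hat{S} &\xrightarrow{\;\sigma^2 \to 0\;} \arg\min_{S}\, \lVert X - AS \rVert_F^2
         = (A^\top A)^{-1} A^\top X .
\end{align}
```

As the noise variance shrinks, the likelihood term dominates the prior, so the estimate approaches the least-squares solution; for a square, invertible $A$ this reduces to $\hat{S} = A^{-1} X$.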

  7. ICA (cont.) • Text representations in the LSI space • Document classification • Keywords -- Back-projection of documents to the original histogram vector space
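
A sketch of how keywords and class labels could be read off once a mixing matrix is available in the LSI space. The matrix `A` below is a hand-picked stand-in for an ICA estimate, the vocabulary is made up, and labelling each document by its largest-magnitude source is one simple rule assumed here rather than the paper's stated procedure.

```python
import numpy as np

terms = ["heart", "patients", "wing", "flow", "pressure"]  # hypothetical vocabulary
X = np.array([
    [2., 1., 0., 0.],
    [1., 2., 0., 0.],
    [0., 0., 3., 1.],
    [0., 0., 1., 2.],
    [1., 1., 1., 1.],
])

# LSI step: keep the k leading left singular vectors.
T, L, Dt = np.linalg.svd(X, full_matrices=False)
k = 2
T_k = T[:, :k]
docs_lsi = T_k.T @ X  # documents in the LSI space

# Stand-in for a k x k mixing matrix estimated by ICA in the LSI space.
A = np.array([[1.0, 1.0],
              [1.0, -1.0]]) / np.sqrt(2.0)

def keywords(component, n_top=2):
    """Back-project one component to term space and return its heaviest terms."""
    term_weights = T_k @ A[:, component]
    order = np.argsort(-np.abs(term_weights))[:n_top]
    return [terms[i] for i in order]

# Low-noise source estimate and a simple labelling rule (an assumption):
# each document gets the component with the largest absolute source value.
S = np.linalg.inv(A) @ docs_lsi
labels = np.argmax(np.abs(S), axis=0)

for j in range(k):
    print(f"component {j}: keywords {keywords(j)}")
print("document labels:", labels)
```

Because ICA components are only defined up to sign and scale, ranking terms by absolute weight avoids depending on an arbitrary sign.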

  8. ICA (cont.) • Generalisation error -- Principal tool for model selection • Bias-variance dilemma: -- Too few components lead to high error -- Too many components lead to overfitting

  9. Examples • MED data set -- 124 abstracts, 5 groups, 1159 terms • Results: -- ICA is successful in recognizing and "explaining" the group structure.

  10. Examples • CRAN data set -- 5 classes, 138 documents, 1115 terms • Results: -- ICA identified some group structure but not as convincingly as in the MED data

  11. Conclusion • ICA recovers group structure in text: clearly on the MED data, less convincingly on the CRAN data • Independence of the sources may or may not align well with a manual labeling
