
Visualization of Document Corpus



Presentation Transcript


  1. Visualization of Document Corpus Blaž Fortuna Marko Grobelnik Dunja Mladenič

  2. Motivation We have a larger collection of text documents, • what are the main topics in the set? • which documents are related and how? • which topics are related and how? • how to enable the user to explore document space?

  3. Document representation • Bag-of-words • Documents are encoded as vectors • Each element of a vector corresponds to the frequency of one word • These vectors live in a very high-dimensional space (dimensionality == number of distinct words in the collection) • Example document: "Computers are used in increasingly diverse ways in Mathematics and the Physical and Life Sciences. This workshop aims to bring together researchers in Mathematics, Computer Science, and Sciences to explore the links between their disciplines and to encourage new collaborations."
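The bag-of-words encoding on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `bag_of_words`, the lowercase letter-run tokenisation, and the two sample sentences are assumptions made for the example.

```python
from collections import Counter
import re

def bag_of_words(documents):
    """Return (vocabulary, vectors): one word-frequency vector per document."""
    # Assumption: tokens are runs of lowercase letters; real systems often
    # also remove stop words and apply stemming.
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    # One vector dimension per distinct word in the whole collection.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = [
    "Computers are used in Mathematics and the Sciences.",
    "Mathematics and Computer Science explore new links.",
]
vocabulary, vectors = bag_of_words(docs)
```

Even for these two short sentences the vectors already have one dimension per distinct word, which is why real collections end up with many thousands of dimensions.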

  4. Problem • Documents in bag-of-words representation live in a very high-dimensional space, usually >10,000 dimensions! • For visualisation, the number of dimensions must be reduced to just 2!

  5. The Big Picture • >10,000 dimensions → Latent Semantic Indexing (LSI) → >100 dimensions

  6. Latent Semantic Indexing What is LSI? • A linear technique for finding words with similar meaning based on their co-occurrences in the documents • Similar words are grouped into latent variables (concepts); one word can appear in several concepts • Documents are described by these concepts instead of words (== much lower dimension). Background • Uses Singular Value Decomposition (SVD) to find the best low-dimensional approximation of the documents. • Latent variables are the basis vectors of this low-dimensional subspace
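A rough sketch of the SVD step, assuming NumPy and a small hand-made document-term matrix. The function name `lsi`, the toy matrix, and the choice `k=2` are illustrative assumptions; the slides do not say which SVD implementation or which term weighting was used.

```python
import numpy as np

def lsi(doc_term, k):
    """Project documents onto the top-k latent concepts via truncated SVD."""
    # doc_term has shape (n_documents, n_words); rows are bag-of-words vectors.
    U, s, Vt = np.linalg.svd(doc_term, full_matrices=False)
    concepts = Vt[:k]               # each row is one concept, given as word weights
    doc_coords = U[:, :k] * s[:k]   # each row is one document in concept space
    return doc_coords, concepts

# Toy collection: documents 0 and 1 share words, as do documents 2 and 3.
doc_term = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
], dtype=float)
doc_coords, concepts = lsi(doc_term, k=2)
```

In this toy case the two retained concepts correspond to the two groups of co-occurring words, so documents 0 and 1 land on the same spot in concept space while documents 2 and 3 land elsewhere.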

  7. The Big Picture • >10,000 dimensions → Latent Semantic Indexing (LSI) → >100 dimensions → Multidimensional scaling (MS) → 2 dimensions

  8. Multidimensional scaling • Non-linear technique for dimensionality reduction • Finds positions of points in a lower-dimensional space so that the Euclidean distances best match the original distances • Iterative gradient descent algorithm • We use it to position documents in a two-dimensional plane
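One way to read the gradient descent description is as minimisation of the raw stress, the sum of squared differences between layout distances and original distances. The sketch below assumes NumPy; the function names, step size, iteration count, random initialisation, and sample points are all assumptions for illustration and are not taken from the slides.

```python
import numpy as np

def pairwise_distances(x):
    """Euclidean distance between every pair of rows of x."""
    return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

def stress(layout, target):
    """Sum over pairs i<j of (layout distance - target distance)^2."""
    return ((pairwise_distances(layout) - target) ** 2).sum() / 2.0

def mds(points, n_iter=1000, learning_rate=0.01, seed=0):
    """Place points in 2-D by plain gradient descent on the raw stress."""
    target = pairwise_distances(points)
    rng = np.random.default_rng(seed)
    layout = rng.normal(size=(len(points), 2))        # random starting positions
    for _ in range(n_iter):
        diff = layout[:, None, :] - layout[None, :, :]   # shape (n, n, 2)
        dist = np.maximum(np.linalg.norm(diff, axis=-1), 1e-9)  # avoid 0-division
        coeff = (dist - target) / dist
        # d(stress)/d(layout_i) = 2 * sum_j coeff_ij * (layout_i - layout_j)
        gradient = 2.0 * (coeff[:, :, None] * diff).sum(axis=1)
        layout -= learning_rate * gradient
    return layout

# Hypothetical input: five points in a 3-dimensional concept space.
points = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [5.0, 5.0, 1.0],
    [6.0, 5.0, 1.0],
])
layout = mds(points)
```

A fixed step size keeps the sketch short; a practical implementation would more likely use a line search or a dedicated stress-minimisation method.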

  9. The Big Picture • >10,000 dimensions → Latent Semantic Indexing → >100 dimensions → Multidimensional scaling → 2 dimensions

  10. Landscape generation • Density of points is used to generate a landscape. • Landscape is used as a background; lighter is higher. • Clusters of high density can be emphasized by drawing contour lines.
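The slide does not say how density is estimated. A minimal sketch, assuming a Gaussian kernel summed over a regular grid and document positions already scaled to the unit square (the function name `landscape`, the grid size, and the bandwidth are all assumptions):

```python
import numpy as np

def landscape(layout, grid_size=50, bandwidth=0.1):
    """Estimate point density on a regular grid over the unit square."""
    axis = np.linspace(0.0, 1.0, grid_size)
    gx, gy = np.meshgrid(axis, axis)            # gx[i, j] = axis[j], gy[i, j] = axis[i]
    grid = np.stack([gx, gy], axis=-1)          # shape (grid_size, grid_size, 2)
    # Squared distance from every grid cell to every document position.
    sq_dist = ((grid[:, :, None, :] - layout[None, None, :, :]) ** 2).sum(axis=-1)
    # Assumption: each document contributes one Gaussian bump to the height.
    height = np.exp(-sq_dist / (2.0 * bandwidth ** 2)).sum(axis=-1)
    return height / height.max()                # 1.0 marks the highest (lightest) cell

# Hypothetical 2-D document positions: a cluster of three and one outlier.
layout = np.array([[0.20, 0.20], [0.25, 0.20], [0.20, 0.25], [0.80, 0.80]])
height = landscape(layout)
```

The resulting `height` grid can be rendered as a grayscale background, and contour lines can be drawn at chosen height levels with any plotting library that supports contour plots.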

  11. Keywords • Each point in the plane can be assigned a set of keywords by averaging the TF-IDF vectors of documents close to the point.
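The averaging step might look like the sketch below. It assumes NumPy, treats "close" as "within a fixed radius", and uses a made-up vocabulary, TF-IDF matrix, and layout; the function name `keywords_at` and its parameters are hypothetical.

```python
import numpy as np

def keywords_at(point, layout, tfidf, vocabulary, radius=0.2, top_n=3):
    """Return the top-weighted words of the mean TF-IDF vector near a point."""
    # Assumption: a document is "close" if it lies within `radius` of the point.
    near = np.linalg.norm(layout - point, axis=1) <= radius
    if not near.any():
        return []
    mean_vector = tfidf[near].mean(axis=0)
    best = np.argsort(mean_vector)[::-1][:top_n]    # indices by descending weight
    return [vocabulary[i] for i in best if mean_vector[i] > 0]

# Hypothetical data: three documents, four words.
vocabulary = ["kernel", "svm", "speech", "audio"]
tfidf = np.array([
    [0.9, 0.4, 0.0, 0.0],
    [0.7, 0.6, 0.0, 0.0],
    [0.0, 0.0, 0.8, 0.5],
])
layout = np.array([[0.10, 0.10], [0.15, 0.10], [0.90, 0.90]])

near_cluster = keywords_at(np.array([0.1, 0.1]), layout, tfidf, vocabulary)
near_outlier = keywords_at(np.array([0.9, 0.9]), layout, tfidf, vocabulary)
```

A distance-weighted average would be a natural alternative to the hard radius cut-off used here.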

  12. Keywords • The user can also zoom in and check keywords for a specific area.

  13. Demo on two document collections • Documents == Scientific papers from PASCAL network, only abstract text is used • Documents == Researchers from PASCAL network, each researcher is described by abstracts of papers he/she co-authored.

  14. Trip into the third dimension

  15. Thank you for listening! • Questions?
