Text and Documents

CS 4460 - Information Visualization. Jim Foley; some material courtesy of John Stasko. Some examples from Marti Hearst, Search User Interfaces, Cambridge University Press, 2009.

Text is Everywhere • We (may) use documents as the primary information artifact in our lives • Our access to documents/information has grown tremendously in recent years • Internet infrastructure • WWW • Google, Yahoo, Bing • Digital libraries • And the amount of information has grown!

The Key Question for InfoVis • How can InfoVis help users gather, understand, and use information from • Document collections (macro-level)? • Such as everything on the web • Individual documents (micro-level)? • Such as a thesaurus, or a book or speech • Shakespeare, Bible, Koran, Torah, …

Example Macro-level Tasks • Which documents contain text on topic XYZ? • Are there other documents that might be close enough to be worthwhile? • How do documents fit into a larger context? • What documents might be of interest to me? • Which documents have a negative/angry tone?

Example Micro-level Tasks • What are the main themes of a document? • How are certain words or themes distributed through a document? • How does one document compare to or relate to other documents? • In what contexts is the word “inflation” used with the word “spending”?

Related Topic – IR • Information Retrieval • The search process that locates particular entities based on selection criteria • Google search algorithms • Library catalog search • We will not discuss IR algorithms • We will discuss how InfoVis can help • Understand what can be retrieved • Understand what has been retrieved • Browse • Formulate more precise queries • Etc.

Related Topic – Sensemaking • Making ‘sense’ of a collection of docs • Relate facts/info in a document collection to create an understanding of a topic, or to ‘tell a story’ based on the facts • Discussed more in the visual analytics lecture • InfoVis can make sensemaking faster than it would be without visualization

Challenge • Text is nominal data with a huge (infinite) cardinality • Does not map to geometric representations as easily as ordinal and quantitative data • The “Raw data --> Data Table” mapping step now becomes very important – indeed, becomes central

Process for Text/Doc InfoVis • Raw data (documents) --> Analysis algorithms --> Data tables for InfoVis --> Visualization • Analysis algorithms: decomposition, statistics, similarity, clustering, relevance, thesaurus, word count, KWIC, etc. • Data tables for InfoVis: vectors, keywords, etc. • Visualization: 2D, 3D display

Challenge (Cont’d) • Unstructured text does NOT have any explicit meta-data. • Just that infinitely big collection of nominal data • Meta-data is sometimes extracted from raw text • What Jigsaw calls “entity extraction” • Google News extracts dates • Contrast with the structured text of an on-line library, which has explicit meta-data such as • Author name • Year of publication • Title • ISBN • Library of Congress number • Publisher name • Etc. • This meta-information is itself mostly nominal but has much lower cardinality than Google-style free text search, which simplifies and structures the retrieval process. • We will look at a few examples in the structured meta-data space

Document Collections • Problem or challenge is how to present the contents/semantics/themes/etc of the documents to someone who does not have time to read them all • Who are the users? • How often do YOU use Google/Yahoo/Bing?? • Students, researchers, news people, everyday people, CIA/FBI

Outline • Macro-level – searching larger document collections • Unstructured – no meta-data • Structured – explicit meta-data • Search history • Micro-level • Inter-document methods for smaller document collections • How do retrieved documents relate to a query? • How do retrieved documents relate to one another? • Intra-document methods • Word usage, grammatical style, … • With the caveat that some methods can be used in multiple ways

Macro-Level: Large Unstructured • LARGE does not mean entire WWW!! • A number of systems endeavor to give a “big picture view” – the “gist” of a large collection of documents • Themescape • WebThemes • Galaxies • Feature Maps/WEBSOM • (Kohonen SOM-Self Organizing Maps) • ThemeRiver

PNNL • Group has developed a number of visualization techniques for document collections • Galaxies • ThemeScapes • ThemeRiver • WebTheme Wise et al, InfoVis ‘95 www.pnl.gov/infoviz

Themescape • Height/color encode document density

ThemeRiver

ThemeRiver Video

WebTheme

Galaxies • Presentation of documents where similar ones cluster together

Geo-like Maps • But not very useful; no longer offered as a product

Feature Maps (SOMs) • SOMs = Self Organizing Maps • Developed by Teuvo Kohonen • Thus sometimes called Kohonen Maps • Maps complex, non-linear relationships between high-dimensional data items onto simple geometric relationships on a 2-D display • Creates clusters of like things • Uses neural network techniques Lin, Visualization ‘92

WEBSOM • Self-organizing map of Net newsgroups and postings • Think of it as a top view of a ThemeScape, but organized with a different method http://websom.hut.fi/websom/milliondemo/html/root.html (dead link)

Another SOM ai2.bpa.arizona.edu/ent/ (dead link)

Another SOM • Xia Lin, faculty.cis.drexel.edu/Sitemap/ (dead link)

Another SOM

ThemeScapes vs. SOMs • Self-organizing maps (Kohonen) don’t reflect the density of regions all that well • ThemeScapes use a 3D representation • Height represents density or number of documents in a region • Could think of a SOM as the top view of a ThemeScape

Basic Idea to Create Maps • Break each document into its words • Two documents are “similar” if they share many words • See later aside on Vector Space Analysis • Use a mass-spring (graph layout) algorithm to pull similar documents together and push dissimilar documents far apart
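A minimal sketch of the mass-spring idea, assuming pairwise similarities in [0, 1] are already available. The function name `spring_layout`, the rest length `1 - similarity`, the step size, and the iteration count are illustrative assumptions, not details of ThemeScape or any other system named in these slides.

```python
import math
import random

def spring_layout(sim, iterations=500, step=0.05, seed=0):
    """Place n items in 2D so that similar items end up close together.

    Every pair is joined by a spring whose rest length is 1 - similarity
    (an assumed mapping): similar pairs want to be near, dissimilar far.
    """
    rng = random.Random(seed)
    n = len(sim)
    pos = [[rng.random(), rng.random()] for _ in range(n)]
    for _ in range(iterations):
        for i in range(n):
            fx = fy = 0.0
            for j in range(n):
                if i == j:
                    continue
                dx = pos[j][0] - pos[i][0]
                dy = pos[j][1] - pos[i][1]
                dist = math.hypot(dx, dy) or 1e-9  # avoid division by zero
                rest = 1.0 - sim[i][j]
                stretch = dist - rest  # > 0 pulls i toward j, < 0 pushes away
                fx += stretch * dx / dist
                fy += stretch * dy / dist
            pos[i][0] += step * fx
            pos[i][1] += step * fy
    return pos

# Documents 0 and 1 are similar to each other; document 2 is unlike both.
sim = [[1.0, 0.9, 0.1],
       [0.9, 1.0, 0.1],
       [0.1, 0.1, 1.0]]
layout = spring_layout(sim)
```

In the resulting layout, documents 0 and 1 should sit close together with document 2 pushed away from both. A production layout would add cooling and a faster neighbor search.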

Map Attributes – What Have We Seen? • Colored areas correspond to different concepts in the collection • Size of an area corresponds to the importance of the concept relative to other concepts • Neighboring regions indicate commonalities in concepts • Adjacencies and sizes are computed from the documents themselves • Dots in regions can be used to represent documents • ResultMaps, which we will see later, share some of these properties, but their structure is predefined

Map Review • Are these techniques useful? • Strengths/weaknesses? • Useful for entire set of docs on WWW? • So how large is large? • What determines viable size for each system/method?

Aside - How to Characterize Documents – Vector Space Analysis • How do we compare the similarity of two documents? Here’s one way: • Step 1, for each document • Make a list of each unique word in the document • Throw out common words (a, an, the, …) • Make different forms the same (bake, bakes, baked) • Store a count of how many times each word appeared • Alphabetize, make into a vector • One per document
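Step 1 can be sketched as follows. This is an illustration under stated assumptions: the stop-word list is a tiny sample, and `crude_stem` is a hypothetical stand-in for a real stemmer (such as Porter's) that only strips a few suffixes.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "and", "of", "to", "in", "is"}

def crude_stem(word):
    """Hypothetical stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_counts(text):
    """Count stemmed, non-stop words in one document."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(w) for w in words if w not in STOP_WORDS)

def to_vectors(docs):
    """Return the alphabetized vocabulary and one count vector per document."""
    counts = [term_counts(d) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[t] for t in vocab] for c in counts]

vocab, vectors = to_vectors(["The baker bakes a cake", "She baked the cakes"])
# vocab   -> ['bak', 'baker', 'cak', 'she']
# vectors -> [[1, 1, 1, 0], [1, 0, 1, 1]]
```

Note how "bakes" and "baked" collapse to the same vector slot, while "baker" does not; how aggressively to merge word forms is a design choice of the stemmer.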

Aside - Vector Space Analysis • To compare two docs, determine how closely their two vectors point in the same direction • Step 2 - form the inner (dot) product of each doc’s vector with every other vector • Gives the similarity of each document to every other one • Step 3 - use a mass-spring layout algorithm to position representations of each document • Dot product => closeness • Themescape makes mountains from clusters • Some similarities to how search engines work
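Step 2 might look like the sketch below. The slide says only "inner (dot) product"; dividing by the vector lengths (cosine similarity) is an added assumption here, so that the score measures direction rather than document length. Function names are illustrative.

```python
import math

def dot(u, v):
    """Inner (dot) product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Dot product normalized by vector lengths: 1.0 means same direction."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

def similarity_matrix(vectors):
    """Similarity of each document vector to every other one."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

# Two count vectors that share two of their three terms.
sims = similarity_matrix([[1, 1, 1, 0], [1, 0, 1, 1]])
# sims[0][1] is about 0.667
```

The resulting matrix is the kind of input a mass-spring layout (Step 3) would use to decide which documents to pull together.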

Aside – But Not All Words Are Equal • Not all terms or words are equally useful • Often apply TFIDF • Term Frequency, Inverse Document Frequency • Weight of a word goes up if it appears often in a document, but not often in the collection
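One common form of this weighting is sketched below, assuming raw term counts and the formula count × log(N / document frequency). Many TFIDF variants exist; the slides do not say which one the systems here use, so treat the exact formula as an assumption.

```python
import math

def tfidf(count_vectors):
    """Reweight term-count vectors: tf * log(N / df) for each term."""
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0]) if count_vectors else 0
    # Document frequency: in how many documents does each term appear?
    df = [sum(1 for vec in count_vectors if vec[t] > 0) for t in range(n_terms)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[tf * w for tf, w in zip(vec, idf)] for vec in count_vectors]

# Term 0 appears in both documents; terms 1 and 2 appear in one each.
weights = tfidf([[2, 1, 0],
                 [1, 0, 3]])
# Term 0 gets weight 0 everywhere because it does not discriminate.
```

A word found in every document ends up with zero weight, while a word that is frequent in one document and rare elsewhere gets the largest weight.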

Understanding Small Information Spaces • SMART – System for the Mechanical Analysis and Retrieval of Text • VIBE • Text Themes • SQWID

SMART System • Uses vector space model for documents • May break a document into chapters and sections and deal with those as atoms • Plot document atoms on the circumference of a circle • Atom - document, or section, or paragraph • Draw a line between items if their similarity exceeds some threshold value • Source: Salton et al, Automatic Analysis, Theme Generation, and Summarization of Machine-Readable Texts, Science June 1994

SMART System • Four documents shown • Lines give similarity between documents, if above .20 • Items evenly spaced • Doesn’t give the viewer an idea of how big each section/document is • Very early system by Gerard (Gerry) Salton, the father of Information Retrieval
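The relation-map construction described above (atoms evenly spaced on a circle, a line wherever similarity exceeds a threshold) can be sketched as follows. The function name and the sample similarity values are made up for illustration; only the 0.20 threshold comes from the slide.

```python
import math

def relation_map(similarity, threshold=0.20):
    """Return circle positions for each atom and the edges to draw.

    Atoms are spaced evenly on the unit circle; an edge (i, j, weight)
    is kept only if the pairwise similarity is above the threshold.
    """
    n = len(similarity)
    positions = [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
                 for i in range(n)]
    edges = [(i, j, similarity[i][j])
             for i in range(n) for j in range(i + 1, n)
             if similarity[i][j] > threshold]
    return positions, edges

# Illustrative similarities for four documents (values are invented).
sim = [[1.00, 0.57, 0.09, 0.38],
       [0.57, 1.00, 0.12, 0.21],
       [0.09, 0.12, 1.00, 0.05],
       [0.38, 0.21, 0.05, 1.00]]
positions, edges = relation_map(sim)
# edges -> [(0, 1, 0.57), (0, 3, 0.38), (1, 3, 0.21)]
```

Raising or lowering `threshold` is the natural hook for a dynamic query on edge weights, which would also help with clutter in dense maps.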

SMART – Another Example • Connections between paragraphs in a single document • No weights shown • Clutter problem • How about dynamic query on weights?

SMART – Another Example Quoting Salton: “The convex graph structure reflects a homogeneous treatment of the topic; in this case, the "Smoking" article emphasizes the health problems connected with smoking and the difficulties that arise when people attempt to quit smoking. For a homogeneous map such as this, it should be easy to determine the basic text content by looking at only a few carefully chosen paragraphs.”

SMART – Another Example Again quoting Salton: “In contrast, consider … (this graph with) … the same similarity threshold of 0.30. This map is much less dense; there are many outliers consisting of a single node only, and there is a disconnected component that includes paragraphs 2 and 3 of section 5. Clearly, the "Symphony" topic does not receive the same homogeneous treatment in the encyclopedia as "Smoking," and a determination of text content by selectively looking at particular text excerpts is much more problematic in this case.”

SMART – Refined Design • Four documents depicted by arcs • Arc length => document length • Paragraph-level similarities indicated by lines • Paragraph position shown within document arc • Arc length proportional to document length • Links drawn at the correct position in the document

SMART – Text Themes • Look for sets of regions in a document (or sets of documents) that all have a common theme • Closely related to each other, but different from the rest • Need to run a clustering process

Algorithm • Recognize triangles in relation maps • Group of 3 atoms, each related, with edges above threshold • Make a new vector that is the average of the 3 • Triangles are merged whenever their averaged vectors are sufficiently similar (i.e., heading in the same direction)
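A rough sketch of the triangle-merging idea, under several assumptions not stated in the slides: similarity between averaged vectors is measured by cosine, the merge threshold is 0.5, merging is greedy, and a merged theme's vector is the average of all its atoms. All function names are illustrative.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    d = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return d / norm if norm else 0.0

def find_triangles(n, edges):
    """All groups of 3 atoms in which every pair is linked by an edge."""
    linked = {frozenset(e) for e in edges}
    return [t for t in combinations(range(n), 3)
            if all(frozenset(p) in linked for p in combinations(t, 2))]

def centroid(vectors):
    """Component-wise average of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def text_themes(vectors, edges, merge_threshold=0.5):
    """Greedily merge triangles whose averaged vectors are similar enough."""
    themes = [set(t) for t in find_triangles(len(vectors), edges)]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(range(len(themes)), 2):
            ca = centroid([vectors[i] for i in themes[a]])
            cb = centroid([vectors[i] for i in themes[b]])
            if cosine(ca, cb) >= merge_threshold:
                themes[a] |= themes[b]
                del themes[b]
                merged = True
                break  # theme list changed; restart the pair scan
    return themes

# Atoms 0-3 lean toward one topic, atoms 4-6 toward another (invented data).
vectors = [[1, 0], [1, 0.1], [0.9, 0], [1, 0.2], [0, 1], [0.1, 1], [0, 0.9]]
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6)]
themes = text_themes(vectors, edges)
# themes -> [{0, 1, 2, 3}, {4, 5, 6}]
```

Here two overlapping triangles merge into one theme while the third stays separate, which mirrors how shaded regions in the theme figure can span multiple triangles.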

SMART – Text Theme Example • Using the preceding example, four themes emerge • Shown as four differently-shaded regions of (in some cases multiple) triangles • Figure includes a key to document names

Helpful? • What do you think? • Ways to improve?

VIBE System • Smaller sets of documents than whole library • Example: Set of 100 documents retrieved from a web search • Idea is to understand how contents of documents relate to each other • Source: Olsen et al, Info Process & Mgmt ‘93

Visualize Keywords and Doc’s • Show relation of each Doc to Keywords • “Similar” Doc’s cluster together

Algorithm • Example: 2 Keywords, K1 at point P1 and K2 at point P2 • Document 1 vector: D1(K1, K2) = (0.4, 0.8) • Relative weight of K1 = 0.4 / (0.4 + 0.8) = 0.333 • So D1 is placed 1/3 of the way from K2 to K1
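The placement rule in this example amounts to a weighted average of the keyword positions. The sketch below generalizes it to any number of keywords; the function name and the keyword coordinates are assumptions chosen for illustration.

```python
def vibe_position(weights, keyword_points):
    """Place a document at the weighted average of the keyword positions.

    weights        -- the document's score for each keyword
    keyword_points -- (x, y) screen position of each keyword
    Returns None if the document matches no keyword at all.
    """
    total = sum(weights)
    if total == 0:
        return None
    x = sum(w * px for w, (px, _) in zip(weights, keyword_points)) / total
    y = sum(w * py for w, (_, py) in zip(weights, keyword_points)) / total
    return (x, y)

# Slide example: D1 = (0.4, 0.8), with K1 at x=0 and K2 at x=3 (assumed positions).
d1 = vibe_position([0.4, 0.8], [(0.0, 0.0), (3.0, 0.0)])
# d1 is approximately (2.0, 0.0): one third of the way from K2 back toward K1.
```

Because the position is a convex combination, documents placed this way always fall inside the convex hull of the keywords.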

A VIBE Visualization

VIBE Pros and Cons • Pros • Effectively communicates relationships • Straightforward methodology and vis are easy to follow • Can show relatively large collections • Cons • Not showing much about a document (could encode info in doc marks) • Single items lose “detail” in the presentation • Starts to break down with a large number of terms

SQWID: Search Query Weighted Info Display (VIBE-like) • Keywords “pull” Doc’s • (University, Visualization, Tools) • Doc’s can go outside convex hull of keywords (unlike some other approaches) • Source: McCrickard and Kehoe, Visualizing Search Results using SQWID, Poster paper in Proceedings of the 6th World Wide Web Conference (WWW6), Santa Clara CA, April 1997
