STREAMIT: Dynamic Visualization and Interactive Exploration of Text Streams

Text Stream • Textual Data Explosion • Emails, news, messages, broadcasts, … • Daily, hourly, minutely • Urgent need for efficient processing and analysis • Visualization is an effective approach • Text stream • Text collections constantly evolve as new documents continuously arrive • Keywords/topics not known in advance

Challenges to Visual Exploration • Temporal evolution • Existing topics • Emerging topics • Their relations • Clusters and outliers • No collection pre-scanning or presumed a priori knowledge • Live processing required • In contrast to a traditional text database • Flexible user interaction for changing and adjusting the information-seeking focus/preference • Process large volumes of text in real time

STREAMIT System • Dynamic force-directed simulation • Naturally handles continuously inserted documents • Continual evolution • Continuous depiction and analysis of growing document collections • Automatic grouping and separating • No time window used • No abrupt change • Dynamic processing • Keyword vectors dynamically updated • No prerecorded scan

STREAMIT System (continued) • Interactive exploration • Live adjustment of visualization parameters • Dynamic keyword importance • Presents the significance of a keyword at a certain time • Reflects changing user demand and interest • Scalable optimization • Fast computing • GPU acceleration • Animation and interaction • Easy user control and interaction tools

Related Work • Multidimensional scaling (MDS) & projection • IN-SPIRE 99, InfoSky 02, Hipp 08, Exemplar-based 09 • Temporal data trends • ThemeRiver 02, LensRiver 07, T-scroll 07, Meme-tracking 09, Themail 06, Topic-based 09 • Text streams • TextPool 05, Moving time window Wong 03, EventRiver 10, Text pipe 05 • Force-based placement • Graph drawing 91, Chalmers 96, Morrison 02, etc.

System Overview

Potential and Similarity • Potential energy between pairs of document particles • α is a control parameter • l_i and l_j are the locations of particles i and j • l_ij is the ideal distance between them • Ideal distance computed from document similarity • Cosine similarity • Larger similarity leads to a smaller ideal distance, moving documents closer to form clusters
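
The potential formula on this slide did not survive in the transcript, so the following is only a minimal sketch of the idea, not the authors' exact definition. It assumes a spring-like pair energy α(|l_i − l_j| − l_ij)² and a linear mapping from cosine similarity to ideal distance; both choices, and the names `ideal_distance`, `pair_potential` and `max_distance`, are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse keyword vectors (dict: keyword -> weight)."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

def ideal_distance(similarity, max_distance=1.0):
    """ASSUMED mapping: larger similarity gives a smaller ideal distance l_ij."""
    return max_distance * (1.0 - similarity)

def pair_potential(pos_i, pos_j, l_ij, alpha=1.0):
    """ASSUMED spring-like energy: zero when the particles sit at distance l_ij."""
    d = math.dist(pos_i, pos_j)
    return alpha * (d - l_ij) ** 2
```

Under these assumptions, two documents with identical keyword profiles get an ideal distance of zero, and the pair energy grows as their actual distance departs from the ideal one in either direction.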

Force-directed Model • Global potential function • Forces computed from minimization • Attract or repulse document particles
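
How forces follow from minimizing a global potential can be sketched as below. This is an illustration under stated assumptions, not the system's actual solver: the global potential is assumed to be the sum of pair energies α(d_ij − l_ij)², and the minimization is plain gradient descent with a fixed step. The names `pair_forces`, `relax`, `step` and `iterations` are hypothetical.

```python
import math

def pair_forces(positions, ideal, alpha=1.0):
    """Force on each 2D particle, i.e. the negative gradient of the
    ASSUMED global energy sum over pairs of alpha * (d_ij - l_ij)^2."""
    n = len(positions)
    forces = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dx = positions[j][0] - positions[i][0]
            dy = positions[j][1] - positions[i][1]
            d = math.hypot(dx, dy) or 1e-9  # avoid division by zero
            # Positive magnitude attracts (too far apart), negative repulses.
            mag = 2.0 * alpha * (d - ideal[i][j])
            fx, fy = mag * dx / d, mag * dy / d
            forces[i][0] += fx
            forces[i][1] += fy
            forces[j][0] -= fx
            forces[j][1] -= fy
    return forces

def relax(positions, ideal, alpha=1.0, step=0.1, iterations=100):
    """Move particles along the forces with a fixed step (gradient descent)."""
    pos = [list(p) for p in positions]
    for _ in range(iterations):
        for p, (fx, fy) in zip(pos, pair_forces(pos, ideal, alpha)):
            p[0] += step * fx
            p[1] += step * fy
    return pos
```

For example, `relax([(0.0, 0.0), (4.0, 0.0)], [[0.0, 1.0], [1.0, 0.0]])` pulls the two particles together until they sit at their ideal distance of 1. Because each step only nudges particles, a newly inserted document changes the layout gradually rather than abruptly.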

Dynamic Keyword Importance • Cosine similarity can be improved by introducing keyword importance • Importance I_k can be freely modified by users at any time • According to interest/preference • According to knowledge discovered in a prior period • A powerful tool for users to manipulate the layout and analyze data • Importance may start from an automatic scheme and then be changed • E.g., for keyword k: • O_k: occurrence • te_k: last time it appears; ts_k: first time it appears • n_k: the number of documents that contain the keyword
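
One plausible way to fold importance into cosine similarity is to scale each keyword weight by its importance before comparing documents. The slide's own formula is not preserved in the transcript, so this weighting is an assumption, as are the function name `weighted_cosine` and the default importance of 1.0 for keywords the user has not touched.

```python
import math

def weighted_cosine(a, b, importance, default=1.0):
    """Cosine similarity after scaling each keyword weight by its importance I_k.

    a, b: sparse keyword vectors (dict: keyword -> weight).
    importance: dict keyword -> I_k; missing keywords use `default` (assumed).
    """
    def scaled(vec):
        return {k: w * importance.get(k, default) for k, w in vec.items()}

    sa, sb = scaled(a), scaled(b)
    dot = sum(w * sb.get(k, 0.0) for k, w in sa.items())
    norm_a = math.sqrt(sum(w * w for w in sa.values()))
    norm_b = math.sqrt(sum(w * w for w in sb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

doc_a = {"terrorism": 1.0, "economy": 1.0}
doc_b = {"terrorism": 1.0, "sports": 1.0}
plain = weighted_cosine(doc_a, doc_b, {})                     # equal importance
boosted = weighted_cosine(doc_a, doc_b, {"terrorism": 3.0})   # user raises I_k
```

In this sketch, raising the importance of a shared keyword increases the similarity of the documents that contain it, which in a force-directed layout would shorten their ideal distance and pull them into a tighter group.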

Visualization Interface

Visualization Tools • Main window • Major layout • Animation Control Panel • Play, pause, stop • Drag by mouse • Keyword table • Dynamic update • Change importance • Document table • Text information

Labeling • Use text document titles • Reduce cluttering • Recent semantic titles • User controlled clutter levels • Group title label • Use color and opacity to display clear layout

User Interaction • Adjusting Keyword Importance • Grouping and Tracking Documents • Halo for topics of interest • Browsing and Tracking Keywords • Selection • Manual, example-based, keyword-based • Integrated shoebox for details

Case Study: New York Times News • Total number of articles: 230 • Time period: Jul. 19 to Sep. 18, 2010 • About Barack Obama • Articles are continuously injected, new keywords are added to the keyword table, and their frequencies are updated on the fly • Keyword importance automatically assigned
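
The on-the-fly keyword table update can be pictured with a small sketch. It tracks the per-keyword quantities named in the talk (O_k, ts_k, te_k, n_k); the class names `KeywordStats` and `KeywordTable`, the `ingest` method, and the assumption that documents arrive in time order are all illustrative choices, not the system's actual data structures.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class KeywordStats:
    occurrences: int = 0                # O_k: total occurrences seen so far
    first_seen: Optional[float] = None  # ts_k: first time the keyword appears
    last_seen: Optional[float] = None   # te_k: last time the keyword appears
    doc_count: int = 0                  # n_k: documents containing the keyword

class KeywordTable:
    """Hypothetical keyword table updated as each document arrives."""

    def __init__(self) -> None:
        self.stats: Dict[str, KeywordStats] = {}

    def ingest(self, timestamp: float, keyword_counts: Dict[str, int]) -> None:
        # Assumes documents are ingested in non-decreasing time order.
        for keyword, count in keyword_counts.items():
            entry = self.stats.setdefault(keyword, KeywordStats())
            if entry.first_seen is None:
                entry.first_seen = timestamp
            entry.last_seen = timestamp
            entry.occurrences += count
            entry.doc_count += 1

table = KeywordTable()
table.ingest(1.0, {"obama": 2, "economy": 1})
table.ingest(5.0, {"obama": 1, "afghanistan": 3})
```

No pre-scan of the collection is needed in this scheme: a keyword simply gets a new row the first time it is seen.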

Case Study: New York Times News • 136 news articles • High-frequency keywords: “Politics and Government”, “International Relations”, “Terrorism” • Increase the importance of “International Relations” • All documents are shown • “Terrorism” becomes larger, and one item (outlier) lies between “Afghanistan War” and “Terrorism” • Highlight the group with “Afghanistan War” in pink halo (2) • “Terrorism” in orange halo (3)

Case Study: US NSF Award Abstracts • 1000 National Science Foundation (NSF) IIS award abstracts • Funded between Mar. 2000 and Aug. 2003 • Each document characterized by a set of keywords • Size of a document circle represents funding amount

Case Study: US NSF Award Abstracts • Mar. 15, 2002: 672 projects; many large projects started • Highlight “Sensor” with halo • (2) is an outlier far away from the other projects with halo; it is about “just-in-time information retrieval on wearable computers” • Aug. 1, 2000: 95 projects • Sep. 1, 2000: 172 projects; many large projects started • Highlight “Management” in red and “Database” in green • Increase their importance

Case Study: Video on NSF Dataset

Performance Optimization • Initial positions of document particles affect computational steps and cost • Similarity Grid • New documents roughly inserted within the proximity of similar documents • Each grid cell has a special keyword vector consisting of the average keyword weights from the documents inside the cell • Data set of 7100 documents
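
The similarity grid idea can be sketched as follows: each occupied cell keeps the average keyword vector of its documents, and a new document is placed in the cell whose average it resembles most. This is a simplified illustration, assuming cosine similarity for the comparison and ignoring that documents later move between cells; the class `SimilarityGrid` and its methods are hypothetical names.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse keyword vectors (dict: keyword -> weight)."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SimilarityGrid:
    """Hypothetical grid: cell (row, col) -> running sum of keyword vectors."""

    def __init__(self):
        self.sums = {}    # cell -> dict keyword -> summed weight
        self.counts = {}  # cell -> number of documents in the cell

    def add(self, cell, vector):
        total = self.sums.setdefault(cell, {})
        for keyword, weight in vector.items():
            total[keyword] = total.get(keyword, 0.0) + weight
        self.counts[cell] = self.counts.get(cell, 0) + 1

    def average(self, cell):
        """The cell's keyword vector: average weights of its documents."""
        n = self.counts[cell]
        return {k: w / n for k, w in self.sums[cell].items()}

    def best_cell(self, vector):
        """Most similar occupied cell, or None if nothing is similar at all."""
        best, best_sim = None, 0.0
        for cell in self.sums:
            sim = cosine(vector, self.average(cell))
            if sim > best_sim:
                best, best_sim = cell, sim
        return best

grid = SimilarityGrid()
grid.add((0, 0), {"sensor": 1.0})
grid.add((0, 0), {"sensor": 3.0, "network": 2.0})
grid.add((3, 4), {"database": 1.0})
target = grid.best_cell({"database": 2.0, "query": 1.0})
```

Comparing a new document against one vector per cell, rather than against every document, is what makes the initial placement cheap; the force simulation then only has to refine a position that is already roughly right.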

Performance Optimization • GPU acceleration • CUDA implementation of the N-body problem • Good performance achieved • NVIDIA Quadro NVS 295 GPU with 2GB texture memory • Intel Core2 1.8GHz CPU with 2GB RAM

GPU Performance • Experiments with 50 by 50 grid • Achieve good average speed • More importantly, maximum simulation time after document insertion on the GPU was less than a second • Fast for human perception and analysis

Discussion • The system can handle live text streams with a document arrival interval of around 1 second • On a consumer PC and graphics card • E.g., New York Times news averages 3 documents per hour, with a maximum of 8 documents per hour at peak time • A very large number of documents in the system will inevitably introduce visual clutter and hinder analysts' ability to take in the information • Natural perception limit and device limit • Clutter reduction and simplification algorithms needed • Further increase the power • Advanced hardware • Hierarchical or multiple-resolution simulation

Conclusion • STREAMIT: An efficient visual exploration system for live text streams • Dynamic physical system • Keyword manipulation with importance • Visual tools • Acknowledgment: • National Science Foundation IIS-0915528, IIS-0916131 and NSFDACS10P1309.

Thanks! Questions!
