STREAMIT: Dynamic Visualization and Interactive Exploration of Text Streams
Text Stream • Textual data explosion • Emails, news, messages, broadcasts, … • Arriving daily, hourly, minutely • Urgent need for efficient processing and analysis • Visualization is an effective approach • Text stream • Text collections constantly evolve as new documents continuously arrive • Keywords/topics not known in advance
Challenges to Visual Exploration • Temporal evolution • Existing topics • Emerging topics • Their relations • Clusters and outliers • No pre-scanning of the collection and no presumed a priori knowledge • Live processing required • In contrast to traditional text databases • Flexible user interaction for changing and adjusting the information-seeking focus/preference • Process large volumes of text in real time
STREAMIT System • Dynamic force-directed simulation • Naturally handles continuously inserted documents • Continual evolution • Continuous depiction and analysis of growing document collections • Automatic grouping and separating • No time window used • No abrupt changes • Dynamic processing • Keyword vectors dynamically updated • No pre-scan of the collection
STREAMIT System (continued) • Interactive exploration • Live adjustment of visualization parameters • Dynamic keyword importance • Represents the significance of a keyword at a given time • Reflects changing user demand and interest • Scalable optimization • Fast computing • GPU acceleration • Animation and interaction • Easy user control and interaction tools
Related Work • Multidimensional scaling (MDS) & projection: • IN-SPIRE 99, InfoSky 02, Hipp 08, Exemplar-based 09 • Temporal data trends • ThemeRiver 02, LensRiver 07, T-scroll 07, Meme-tracking 09, Themail 06, Topic-based 09 • Text streams • TextPool 05, Moving time window Wong 03, EventRiver 10, Text pipe 05 • Force-based placement • Graph drawing 91, Chalmers 96, Morrison 02, etc.
Potential and Similarity • Potential energy between pairs of document particles • α is a control parameter • l_i and l_j are the locations of particles i and j • l_ij is the ideal distance between them • Ideal distance computed from document similarity • Cosine similarity • Larger similarity gives a smaller ideal distance, moving documents closer to form clusters
Force-directed Model • Global potential function • Forces computed from minimization • Attract or repulse document particles
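The slide equations did not survive extraction, so the following Python sketch only illustrates the idea. It assumes a spring-style pair potential E_ij = α(|l_i − l_j| − l_ij)² and an ideal distance l_ij = 1 − similarity; both forms, the value of `ALPHA`, and all function names are illustrative assumptions, not STREAMIT's confirmed formulas.

```python
import math

ALPHA = 0.5  # control parameter alpha; the value is an assumption


def cosine_similarity(a, b):
    """Cosine similarity of two sparse keyword vectors (dict: keyword -> weight)."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def ideal_distance(similarity):
    """Assumed mapping: higher similarity gives a smaller ideal distance."""
    return 1.0 - similarity


def pair_potential(li, lj, lij, alpha=ALPHA):
    """Assumed spring-style potential; zero when the pair sits at its ideal distance."""
    return alpha * (math.dist(li, lj) - lij) ** 2


def pair_force(li, lj, lij, alpha=ALPHA):
    """Force on particle i: the negative gradient of the pair potential w.r.t. l_i."""
    dx, dy = li[0] - lj[0], li[1] - lj[1]
    d = math.hypot(dx, dy)
    if d == 0.0:
        return (0.0, 0.0)  # coincident particles: direction undefined, apply no force
    mag = -2.0 * alpha * (d - lij)  # negative: attract, positive: repulse
    return (mag * dx / d, mag * dy / d)


def step(positions, vectors, dt=0.1):
    """One explicit step that moves every particle along its net force."""
    n = len(positions)
    moved = []
    for i in range(n):
        fx = fy = 0.0
        for j in range(n):
            if i == j:
                continue
            lij = ideal_distance(cosine_similarity(vectors[i], vectors[j]))
            px, py = pair_force(positions[i], positions[j], lij)
            fx += px
            fy += py
        moved.append((positions[i][0] + dt * fx, positions[i][1] + dt * fy))
    return moved
```

Under these assumptions, two documents with identical keyword vectors drift together over repeated calls to `step`, while unrelated documents placed too close are pushed apart. A newly arrived document is simply appended to `positions` and `vectors` before the next step.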
Dynamic Keyword Importance • Cosine similarity can be improved by introducing keyword importance • Importance I_k can be freely modified by users at any time • According to interest/preference • According to knowledge discovered in a prior period • A powerful tool for users to manipulate the layout and analyze data • Importance can also be set by an automatic scheme • E.g., for keyword k: • O_k: number of occurrences • te_k: last time it appears; ts_k: first time it appears • n_k: the number of documents that contain the keyword
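One plausible way to fold importance into cosine similarity is to scale each keyword weight by its importance I_k before taking the cosine. The Python sketch below assumes exactly that; the function name `weighted_cosine` and the default importance of 1.0 are hypothetical, and the automatic importance formula from the slide is not reproduced because it did not survive extraction.

```python
import math


def weighted_cosine(a, b, importance):
    """Cosine similarity with each keyword k scaled by its importance I_k.

    a, b: sparse keyword vectors (dict: keyword -> weight).
    importance: dict keyword -> I_k; keywords not listed default to 1.0 (assumption).
    """
    def scaled(vec):
        return {k: w * importance.get(k, 1.0) for k, w in vec.items()}

    sa, sb = scaled(a), scaled(b)
    dot = sum(w * sb.get(k, 0.0) for k, w in sa.items())
    na = math.sqrt(sum(w * w for w in sa.values()))
    nb = math.sqrt(sum(w * w for w in sb.values()))
    return dot / (na * nb) if na and nb else 0.0


doc1 = {"terrorism": 1.0, "economy": 1.0}
doc2 = {"terrorism": 1.0, "sports": 1.0}
base = weighted_cosine(doc1, doc2, {})                     # all importances 1.0
boosted = weighted_cosine(doc1, doc2, {"terrorism": 3.0})  # user raises one keyword
```

In this sketch, raising the importance of a shared keyword increases the similarity of the documents that contain it, which shortens their ideal distance and pulls them into a tighter group in the layout.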
Visualization Tools • Main window • Major layout • Animation Control Panel • Play, pause, stop • Drag by mouse • Keyword table • Dynamic update • Change importance • Document table • Text information
Labeling • Use text document titles • Reduce clutter • Recent semantic titles • User-controlled clutter levels • Group title label • Use color and opacity to keep the layout clear
User Interaction • Adjusting keyword importance • Grouping and tracking documents • Halo for topics of interest • Browsing and tracking keywords • Selection • Manual, example-based, keyword-based • Integrated shoebox for details
Case Study: New York Times News • Total number of articles: 230 • Time period: Jul. 19 to Sep. 18, 2010 • About Barack Obama • Articles are continuously injected, new keywords are added to the keyword table, and their frequencies are updated on the fly • Keyword importance automatically assigned
Case Study: New York Times News • 136 news articles • High-frequency keywords: “Politics and Government”, “International Relations”, “Terrorism” • Increase the importance of “International Relations” • All documents are shown • “Terrorism” becomes larger, and one item (an outlier) lies between “Afghanistan War” and “Terrorism” • Highlight the “Afghanistan War” group with a pink halo (2) and “Terrorism” with an orange halo (3)
Case Study: US NSF Award Abstracts • 1000 National Science Foundation (NSF) IIS award abstracts • Funded between Mar. 2000 and Aug. 2003 • Each document characterized by a set of keywords • Size of a document circle represents funding amount
Case Study: US NSF Award Abstracts • Mar. 15, 2002: 672 projects; many large projects started • Highlight “Sensor” with halo • (2) is an outlier far away from the other projects with halo • It is about “just-in-time information retrieval on wearable computers” • Aug. 1, 2000: 95 projects • Sep. 1, 2000: 172 projects; many large projects started • Highlight “Management” in red and “Database” in green • Increase their importance
Performance Optimization • Initial positions of document particles affect the number of computational steps and the cost • Similarity Grid • New documents are inserted roughly in the proximity of similar documents • Each grid cell has a keyword vector consisting of the average keyword weights of the documents inside the cell • Data set of 7100 documents
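The similarity-grid idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the class name `SimilarityGrid`, its methods, the unit-square layout area, and the choice of a random point inside the best-matching cell are all hypothetical, and the sketch does not update cells when the simulation later moves a document.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two sparse keyword vectors (dict: keyword -> weight)."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class SimilarityGrid:
    """Grid over the layout area; each non-empty cell keeps an average keyword vector."""

    def __init__(self, rows, cols, width=1.0, height=1.0):
        self.rows, self.cols = rows, cols
        self.cw, self.ch = width / cols, height / rows
        self.sums = {}    # (row, col) -> dict keyword -> summed weight
        self.counts = {}  # (row, col) -> number of documents in the cell

    def cell_of(self, pos):
        """Map a position to its (row, col) cell, clamped to the grid bounds."""
        c = min(max(int(pos[0] / self.cw), 0), self.cols - 1)
        r = min(max(int(pos[1] / self.ch), 0), self.rows - 1)
        return (r, c)

    def add(self, pos, vec):
        """Register a document's keyword vector in the cell containing pos."""
        cell = self.cell_of(pos)
        sums = self.sums.setdefault(cell, {})
        for k, w in vec.items():
            sums[k] = sums.get(k, 0.0) + w
        self.counts[cell] = self.counts.get(cell, 0) + 1

    def cell_vector(self, cell):
        """Average keyword weights of the documents inside a non-empty cell."""
        n = self.counts[cell]
        return {k: s / n for k, s in self.sums[cell].items()}

    def initial_position(self, vec, rng=random):
        """Return a point inside the non-empty cell most similar to vec."""
        if not self.counts:  # empty grid: fall back to a random position
            return (rng.random() * self.cols * self.cw,
                    rng.random() * self.rows * self.ch)
        r, c = max(self.counts, key=lambda cell: cosine(vec, self.cell_vector(cell)))
        return ((c + rng.random()) * self.cw, (r + rng.random()) * self.ch)
```

Comparing a new document against one averaged vector per cell, rather than against every existing document, keeps insertion cost tied to the grid size. The force simulation then only has to refine a starting position that is already near similar documents.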
Performance Optimization • GPU acceleration • CUDA implementation of the N-body problem • Good performance achieved • NVIDIA Quadro NVS 295 GPU with 2GB texture memory • Intel Core2 1.8GHz CPU with 2GB RAM
GPU Performance • Experiments with 50 by 50 grid • Achieve good average speed • More importantly, maximum simulation time after document insertion on the GPU was less than a second • Fast for human perception and analysis
Discussion • The system can handle live text streams with a document arrival interval of around 1 second • On a consumer PC and graphics card • E.g., New York Times news averages 3 documents per hour, with a maximum of 8 documents per hour at peak time • A very large number of documents inside the system will inevitably introduce visual clutter and hinder analysts' comprehension • Natural perception limit and device limit • Clutter reduction and simplification algorithms needed • Further increase the power • Advanced hardware • Hierarchical or multiple-resolution simulation
Conclusion • STREAMIT: An efficient visual exploration system for live text streams • Dynamic physical system • Keyword manipulation with importance • Visual tools • Acknowledgment: • National Science Foundation IIS-0915528, IIS-0916131 and NSFDACS10P1309.
Thanks! Questions!