T-Scroll: Visualizing Trends in a Time-Series of Documents for Interactive User Exploration Yoshiharu Ishikawa and Mikine Hasegawa Nagoya University, Japan ishikawa@itc.nagoya-u.ac.jp
Outline • Background and objective • Related work • Novelty-based document clustering • Overview of T-Scroll system • Evaluation • Conclusions and future work
Background • Time-series of documents • Example: news articles delivered on the Internet, online academic journals • Continually delivered every day • Problems • A large number of documents: appropriate summarization is required • Topics change over time: topic detection/tracking and trend extraction are useful
Objectives • Development and evaluation of T-Scroll (Trend/Topic-Scroll) • User interface for visualizing the transition of topics extracted from a time-series of documents • System features • Built on a document clustering system that outputs new clustering results periodically • Clusters are displayed along the time axis like a scroll • Links are shown between related clusters to represent topic transition • Several features for interactive exploratory analysis
Visualization of a time-series of documents • A few systems visualize trends in a time-series of documents • ThemeRiver (Havre et al., IEEE TVCG, 2002) [4] • Visualizes topic streams like a river • Focuses on visual impact • No features for analysis and browsing • TimeMine (Swan and Allan, SIGIR’00) [5] • Extracts topics from a time-series of documents • Displays timelines to represent topics on the screen
ThemeRiver • [Figure: analysis of the articles related to Cuba (1960-1961)]
TimeMine • Swan & Allan (U. of Massachusetts)
Analysis of time-dependent clusters • Mei & Zhai (KDD’05) [6] • Statistical approach for discovering major topics from a time-series of documents • Probabilistic modeling • MONIC (Spiliopoulou et al., KDD’06) [7] • Detects various types of patterns from cluster transitions • Examples: splitting/merging of clusters, cluster size changes • Based on the analysis of historical snapshots of clusters
Novelty-based document clustering (1) • Developed by our group (ECDL’01 [8], WWW Journal 2007 [10] etc.) • Clusters documents incrementally based on their similarity and novelty • Features • Similarity considers novelty • Assign high weights to recent documents, low weights to old ones • Document weights decay as time passes: Based on the concept of obsolescence (aging) • Delete old documents whose weights are smaller than the threshold • Incremental processing: low update cost
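The deletion step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Doc` class, the function name, and the values chosen for the forgetting factor and the threshold are all assumptions, and the weight is taken to decay exponentially with document age.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    acquired_day: float  # acquisition time, measured in days (assumed unit)


def prune_obsolete(docs, now, forgetting_factor=0.9, threshold=0.1):
    """Keep only documents whose decayed weight is still >= threshold."""
    kept = []
    for doc in docs:
        # Weight is 1 at acquisition time and shrinks as the document ages.
        weight = forgetting_factor ** (now - doc.acquired_day)
        if weight >= threshold:
            kept.append(doc)
    return kept


docs = [Doc("old", 0.0), Doc("recent", 28.0)]
active = prune_obsolete(docs, now=30.0)
active_ids = [doc.doc_id for doc in active]
```

Because obsolete documents leave the working set, each periodic clustering run only has to handle the documents that are still active, which is consistent with the low update cost noted above.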
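Novelty-based document clustering (2) • Periodical clustering processes are performed on a time-series of documents • [Figure: articles placed along a time axis ("Yeltsin’s Death", "New President Sarkozy", "Blair to Resign", other articles); at the latest clustering time, "Yeltsin’s Death" and other old documents are obsolete]
Document similarity (1) • Assumption: each delivered document gradually loses its value as time passes • $dw_i$: the weight of a document $d_i$ at the current time $\tau$, where $T_i$ is the acquisition time of $d_i$: $dw_i = \lambda^{\tau - T_i}$ • $\lambda$ ($0 < \lambda < 1$): forgetting factor, which determines the forgetting speed • The weight of a document is 1 at its acquisition time and decreases exponentially as time passes • [Figure: $dw_i$ plotted against time $t$, starting at 1 at $t = T_i$ and decaying toward the current time $\tau$]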
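A small sketch of the exponential decay described on this slide. The function names are hypothetical, and the half-life helper is an added convenience (not from the slides) that follows directly from the exponential form: a weight that halves every h time units corresponds to a forgetting factor of 0.5^(1/h).

```python
def document_weight(current_time, acquisition_time, forgetting_factor):
    """Weight of a document: 1 at acquisition, decaying exponentially after."""
    if not 0.0 < forgetting_factor < 1.0:
        raise ValueError("forgetting factor must lie strictly between 0 and 1")
    if current_time < acquisition_time:
        raise ValueError("current time must not precede acquisition time")
    return forgetting_factor ** (current_time - acquisition_time)


def forgetting_factor_from_half_life(half_life):
    """Forgetting factor for which the weight halves every `half_life` units."""
    return 0.5 ** (1.0 / half_life)


factor = forgetting_factor_from_half_life(7.0)        # assumed: one-week half-life
weight_at_arrival = document_weight(10.0, 10.0, factor)
weight_after_week = document_weight(17.0, 10.0, factor)
```

Choosing the forgetting factor through a half-life makes the forgetting speed easier to reason about: a smaller half-life means that old articles lose their influence sooner.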
Document similarity (2) • Similarity score of documents di and dj • Based on novelty of documents and word occurrence patterns in the documents. • Extension of the tf-idf method • New documents have high impact on the clustering result • Document clustering: k-means method
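The slide does not reproduce the similarity formula, so the following is only one plausible shape for a novelty-aware extension of tf-idf: cosine similarity over tf-idf vectors, scaled by the weights of both documents. The scaling rule, the smoothed idf, and all function names are assumptions for illustration, not the authors' definition.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Build sparse tf-idf vectors (dicts) using a smoothed idf."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in tokenized_docs:
        term_freq = Counter(tokens)
        vectors.append({
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, tf in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * v[term] for term, value in u.items() if term in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def novelty_similarity(u, v, weight_u, weight_v):
    """Assumed form: content similarity damped by both document weights."""
    return weight_u * weight_v * cosine(u, v)


docs = [["blair", "resign"], ["blair", "resign", "uk"], ["yeltsin", "death"]]
vecs = tfidf_vectors(docs)
sim_related = novelty_similarity(vecs[0], vecs[1], 1.0, 0.9)
sim_unrelated = novelty_similarity(vecs[0], vecs[2], 1.0, 0.2)
sim_stale = novelty_similarity(vecs[0], vecs[1], 1.0, 0.2)
```

Under this assumed form, two documents with the same wording score lower once either of them has aged, so recent documents dominate the clustering result, as the slide states.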
T-Scroll: Idea • Periodical clustering results are displayed like a scroll • Links represent related cluster pairs
System functionalities (1) • Cluster labels: selected based on the formula • Pr(di): document weight, tfij: term frequency count • Cluster sizes: ellipse size roughly corresponds to the number of documents • Links: If the score is greater than the threshold, links are shown
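The label-selection formula itself did not survive on this slide. One natural reading of its two ingredients, the document weight Pr(d_i) and the term frequency tf_ij, is to score each term by its weighted frequency summed over the documents of the cluster. The sketch below follows that reading; the scoring rule, the function name, and the sample data are assumptions.

```python
from collections import Counter


def select_labels(cluster_docs, top_n=2):
    """Pick label terms for a cluster.

    cluster_docs: list of (document_weight, tokens) pairs.
    Assumed score of a term: sum over documents of weight * term frequency.
    """
    scores = Counter()
    for weight, tokens in cluster_docs:
        for term, tf in Counter(tokens).items():
            scores[term] += weight * tf
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:top_n]]


cluster = [
    (1.0, ["sarkozy", "president", "election"]),
    (0.8, ["sarkozy", "france", "president"]),
    (0.1, ["election", "poll", "poll", "poll"]),
]
labels = select_labels(cluster)
```

In this example "poll" occurs most often in raw counts, but it appears only in a nearly obsolete document, so it is not chosen as a label.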
System functionalities (2) • Cluster quality: visualized using different colors for the cluster border lines • From red (good) to purple (bad) • A high score is achieved if (1) the cluster size is large, and (2) the documents contained in the cluster are similar
System functionalities (3) • Drill-down/roll-up: the user can interactively specify the interval between two consecutive clusterings (e.g., one day, one week) • Displaying keyword list: the user can browse the keyword list for a specified cluster • Access to original documents • Keyword-based emphasis: clusters that contain a user-specified keyword are emphasized
System implementation • T-Scroll module • Written in Perl: generates an SVG file • Browser displays the generated SVG file • SVG file includes scripts (JavaScript) • Used for interactive manipulation • Clustering module • Written in Ruby • Novelty-based incremental document clustering
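To illustrate what generating the SVG file involves, here is a minimal sketch. It is written in Python rather than the Perl of the original system, and the layout constants, function names, and input format are all hypothetical. Each clustering result becomes one column along the time axis, each cluster an ellipse whose size grows with its document count, and each link a line between two cluster centers.

```python
import math
import xml.etree.ElementTree as ET

COL_WIDTH = 160   # horizontal space per clustering time (assumed)
ROW_HEIGHT = 80   # vertical space per cluster (assumed)


def cluster_center(time_index, row_index):
    """Pixel coordinates of a cluster's center in the scroll layout."""
    return (COL_WIDTH * time_index + COL_WIDTH / 2,
            ROW_HEIGHT * row_index + ROW_HEIGHT / 2)


def render_scroll(snapshots, links):
    """Return SVG text.

    snapshots: one list per clustering time, each of (label, doc_count).
    links: pairs ((time, row), (time, row)) of related clusters.
    """
    width = COL_WIDTH * len(snapshots)
    height = ROW_HEIGHT * max(len(clusters) for clusters in snapshots)
    svg = ET.Element("svg", {"xmlns": "http://www.w3.org/2000/svg",
                             "width": str(width), "height": str(height)})
    # Draw links first so that ellipses are painted on top of them.
    for (t1, r1), (t2, r2) in links:
        x1, y1 = cluster_center(t1, r1)
        x2, y2 = cluster_center(t2, r2)
        ET.SubElement(svg, "line", {"x1": str(x1), "y1": str(y1),
                                    "x2": str(x2), "y2": str(y2),
                                    "stroke": "gray"})
    for t, clusters in enumerate(snapshots):
        for r, (label, doc_count) in enumerate(clusters):
            cx, cy = cluster_center(t, r)
            rx = 10 + 6 * math.sqrt(doc_count)  # size grows with doc count
            ET.SubElement(svg, "ellipse", {"cx": str(cx), "cy": str(cy),
                                           "rx": f"{rx:.1f}",
                                           "ry": f"{rx / 2:.1f}",
                                           "fill": "white", "stroke": "red"})
            text = ET.SubElement(svg, "text", {"x": str(cx), "y": str(cy),
                                               "text-anchor": "middle"})
            text.text = label
    return ET.tostring(svg, encoding="unicode")


snapshots = [[("Yeltsin", 12), ("Sarkozy", 30)], [("Sarkozy", 25)]]
links = [((0, 1), (1, 0))]
svg_text = render_scroll(snapshots, links)
```

Interactive behavior such as keyword emphasis would be added as scripts inside the generated file, which this sketch leaves out.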
System architecture • [Figure: system architecture] • RSS Feed Module: inputs news articles to the Clustering Module • Clustering Module: outputs the clustering result, which is input to T-Scroll • T-Scroll (Perl): T-Scroll Main Module receives command inputs from the user; SVG Output Module outputs an SVG file (includes JavaScript) • Browser: displays the clusters through an SVG plug-in; SVG Control Module (JavaScript) handles interactive manipulation by the user
Evaluation • 10 users • Data set • Japanese news articles collected from news web sites from Sept. 2006 to Feb. 2007 • 100 articles per day • Clustering was performed at six-hour intervals • Evaluation criteria • Overall impressions • Evaluation of each function • Observability of topics • Comparison with ThemeRiver
Overall impression • Each user gives a score from 0 to 5
Observability of topics (1) • Can users observe the major topics of Nov. 2006? • We specified five major topics; each user scores how clearly he or she can observe each topic
Observability of topics (2) • 10 users (different from the former experiments) • Users report the topics they observed and their scores without being given any prior information • Topics 1 to 5 are the major topics used in the previous experiment • Topic 2 (big hurricane) was regarded as a normal weather topic
Comparison with ThemeRiver (1) • A ThemeRiver-like display figure was manually created for news articles in Dec. 2006 • 11 users (different from the previous experiments) • Questions to users • Overall impressions • Observability of topics
Comparison with ThemeRiver (2) • Overall impression
Comparison with ThemeRiver (3) • Can users observe the five major topics that we selected?
Summary of experiments • Overall impressions • Good, but usability needs improvement • Some users commented on the response speed • System functionalities • Several features (quality info, article lists, etc.) are useful in practice • Appropriate labels are necessary: label selection should be improved • Comparison with ThemeRiver • ThemeRiver has visual impact, but its display tends to become complicated when there are many topics
Conclusions and future work • Development and evaluation of T-Scroll system • Based on novelty-based incremental clustering method • Scroll-like display for showing changing trends • Several features for interactive analysis • Evaluation • Overall impression • Observability of topics • Comparison with ThemeRiver • Future work • Sophisticated keyword (label) selection • Improvement of interactive speed