TIARA (Text Insight via Automated Responsive Analytics) is a visual exploratory text analytic system that combines text analytics and interactive visualization to help users explore and analyze large collections of text documents. It integrates unsupervised topic analysis, topic ranking, keyword-based topic summarization, and time-sensitive keyword extraction. The keyword extraction is evaluated for completeness, distinctiveness, and response time. Future work includes adding sentence-based summaries, supporting other languages, and improving performance.
TIARA: A Visual Exploratory Text Analytic System Presenter: Wei-Hao Huang Authors: Furu Wei, Shixia Liu, Yangqiu Song, Shimei Pan, Michelle X. Zhou, Weihong Qian, Lei Shi, Li Tan, Qiang Zhang SIGKDD 2010
Outline • Motivation • Objectives • Methodology • Experiments • Conclusions • Comments
Motivation • Sifting through a large collection of text to locate needed information, or simply to make a decision, is very costly and time-consuming. • Although a number of text analysis technologies exist, their results are often abstract and complex and may not be consumable by users.
Objectives • To present an exploratory visual analytic system called TIARA (Text Insight via Automated Responsive Analytics). • To combine text analytics and interactive visualization to help users explore and analyze large collections of text. (Figure: Documents → TIARA System)
Methodology • TIARA • Topic Analysis • Topic Ranking • Keyword based Topic Summarization • Time-sensitive Keyword Extraction
TIARA • System architecture (Figure: architecture diagram; labeled components include Database and File system)
Topic Analysis • Uses unsupervised learning methods. • Notation: the number of documents; the words of each document; the vocabulary and its size; K, the number of topics; the document-topic distribution matrix; the topic-word distribution matrix. • Term frequencies in each cluster
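The document-topic and topic-word matrices named on this slide can be illustrated with a small sketch. It assumes an LDA-style topic model fitted by collapsed Gibbs sampling, which the slide does not name explicitly. The function name `gibbs_lda`, the hyperparameters `alpha` and `beta`, and the toy documents are all illustrative assumptions, not taken from TIARA.

```python
import random


def gibbs_lda(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit an LDA-style model with collapsed Gibbs sampling (illustrative only).

    Returns (vocab, theta, phi):
      theta[d][k] is the document-topic distribution (one row per document),
      phi[k][v]   is the topic-word distribution (one row per topic).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V, K = len(vocab), num_topics

    doc_topic = [[0] * K for _ in docs]        # words in doc d assigned to topic k
    topic_word = [[0] * V for _ in range(K)]   # times word v is assigned to topic k
    topic_total = [0] * K                      # total words assigned to topic k
    assign = []                                # current topic of every token

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(K)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assign.append(zs)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed estimates of the two distribution matrices.
    theta = [[(n + alpha) / (len(doc) + K * alpha) for n in row]
             for row, doc in zip(doc_topic, docs)]
    phi = [[(n + beta) / (topic_total[k] + V * beta) for n in topic_word[k]]
           for k in range(K)]
    return vocab, theta, phi


# Toy corpus (hypothetical): two email-like and two patient-record-like documents.
docs = [
    "meeting schedule project deadline meeting".split(),
    "project deadline budget meeting".split(),
    "fever cough patient fever".split(),
    "patient cough headache fever".split(),
]
vocab, theta, phi = gibbs_lda(docs, num_topics=2)
```

Each row of `theta` and each row of `phi` is a probability distribution, so a topic can be summarized by the highest-probability words in its `phi` row.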
Topic Ranking Topic rank is measured by a combination of both topic content coverage and topic variance.
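One way to combine coverage and variance into a single rank score is sketched below. The slide does not give the exact formula, so the length-weighted mean as "coverage", the length-weighted standard deviation as "variance", and the product form with exponents `w_cov` and `w_var` are all assumptions made for illustration.

```python
import math


def rank_topics(theta, doc_lengths, w_cov=1.0, w_var=1.0):
    """Rank topics by coverage**w_cov * spread**w_var (assumed scoring form).

    theta[d][k] is the weight of topic k in document d;
    doc_lengths[d] is the number of words in document d.
    Returns (order, scores), with order listing topic indices from best to worst.
    """
    total = sum(doc_lengths)
    num_topics = len(theta[0])
    scores = []
    for k in range(num_topics):
        # Coverage: length-weighted average share of the corpus taken by topic k.
        coverage = sum(n * row[k] for row, n in zip(theta, doc_lengths)) / total
        # Variance: how unevenly topic k is spread across documents.
        variance = sum(n * (row[k] - coverage) ** 2
                       for row, n in zip(theta, doc_lengths)) / total
        scores.append(coverage ** w_cov * math.sqrt(variance) ** w_var)
    order = sorted(range(num_topics), key=lambda k: scores[k], reverse=True)
    return order, scores


# Hypothetical document-topic weights: topic 2 is spread evenly over all documents.
theta = [
    [0.50, 0.25, 0.25],
    [0.25, 0.50, 0.25],
    [0.50, 0.25, 0.25],
]
order, scores = rank_topics(theta, doc_lengths=[10, 10, 10])
```

Under this scoring, a topic that appears with the same weight in every document (such as a background topic) has zero variance and falls to the bottom of the ranking, while a topic that is both widely covered and unevenly distributed rises to the top.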
Experiments • Time-sensitive keyword extraction procedure • Completeness • Distinctiveness • Response Time • Data sets: • A personal email collection with 8,326 email messages. • Emergency room data set containing 23,501 patient records.
Completeness Defined as whether we can recover the original keywords of a topic by combining the keywords associated with each time segment.
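The definition above can be made concrete as a recall-style ratio. This is a minimal sketch under the assumption that completeness is the fraction of a topic's original keywords that reappear in the union of its time-segment keywords; the slide does not state the exact metric, and the function name and sample keywords are hypothetical.

```python
def completeness(topic_keywords, segment_keywords):
    """Fraction of the topic's keywords recovered from all time segments combined."""
    topic = set(topic_keywords)
    if not topic:
        return 1.0  # nothing to recover
    # Combine the keywords of every time segment, then keep those in the topic.
    recovered = set().union(*segment_keywords) & topic
    return len(recovered) / len(topic)


topic = ["fever", "cough", "headache", "nausea"]
segments = [["fever", "cough"], ["cough", "headache"]]
score = completeness(topic, segments)  # 3 of 4 topic keywords recovered -> 0.75
```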
Distinctiveness Defined as whether we can distinguish one topic segment from another based on their associated keywords to avoid redundancy.
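A possible way to quantify this is the average pairwise Jaccard distance between the keyword sets of a topic's time segments. The choice of Jaccard distance is an assumption for illustration, not the metric confirmed by the slides, and the function name and sample keywords are hypothetical.

```python
from itertools import combinations


def distinctiveness(segment_keywords):
    """Mean pairwise Jaccard distance between segment keyword sets.

    1.0 means no two segments share a keyword; 0.0 means all segments are identical.
    """
    sets = [set(s) for s in segment_keywords]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 1.0  # a single segment cannot be redundant with another

    def jaccard_distance(a, b):
        union = a | b
        return 1.0 - len(a & b) / len(union) if union else 0.0

    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


segments = [["fever", "cough"], ["cough", "headache"], ["rash", "itch"]]
score = distinctiveness(segments)
```

Here the first two segments share "cough", so their distance is 2/3, while the third segment is disjoint from both; the mean over the three pairs is 8/9.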
Conclusions • TIARA tightly integrates text analytics with interactive visualization to support effective exploratory text analysis. • Future work • Add sentence-based summaries • Support other languages • Improve performance
Comments • Advantages • To explore and analyze large text collections with interactive visualization • Applications • Text mining