This agreement allows for the use of slides on the CD-Rom for teaching and conference presentations. Please read carefully for the terms and conditions.
Fair Use Agreement • This agreement covers the use of all slides on this CD-Rom; please read carefully. • You may freely use these slides for teaching, if • You send me an email telling me the class number/university in advance. • My name and email address appear on the first slide (if you are using all or most of the slides), or on each slide (if you are just taking a few slides). • You may freely use these slides for a conference presentation, if • You send me an email telling me the conference name in advance. • My name appears on each slide you use. • You may not use these slides for tutorials, or in a published work (tech report/conference paper/thesis/journal etc). If you wish to do this, email me first; it is highly likely I will grant you permission. • (c) Eamonn Keogh, eamonn@cs.ucr.edu KDD ‘04
Visually Mining and Monitoring Massive Time Series Jessica Lin* Eamonn Keogh Stefano Lonardi (UC Riverside) Jeffrey Lankford Donna Nystrom (The Aerospace Corp) KDD ‘04
Motivation • Before the launch of any unmanned space vehicle, a critical “go/no go” decision must be made. • Data from past launches is available to assist in the decision-making. • Streaming telemetry must be constantly monitored to detect any potential problems. • A single framework is needed to perform these two tasks. • Existing tools are inadequate for such tasks.
Introduction • We introduce VizTree • Mining archival data • Pattern discovery • repeated pattern discovery (motif discovery), • anomaly detection, • query-by-content • Monitoring incoming streaming data • Why visualization? • The human eye is often advocated as the ultimate data-mining tool • User interaction allows visual data exploration and hypothesis testing
Outline • Introduction • Related Work • VizTree Motivation • VizTree Implementation • Time Series Discretization • Experimental Evaluation • Diff Tree • Discussion/Conclusion
Related Work 1: Time Series Spirals [Figure: spiral plot of one year of power demand data, with spokes labeled Monday through Sunday] • Spiral Axis = serial attributes are encoded as line thickness • Radii = periodic attributes • Carlis & Konstan, UIST-98. Independently rediscovered by Weber, Alexa & Müller, InfoVis-01. But dates back to 1888!
Related Work 2: TimeSearcher (Hochheiser and Shneiderman) • Comments • Simple and intuitive • Highly dynamic exploration • Query power may be limited and simplistic • Limited scalability
Related Work 3 – Calendar-based The cluster and calendar-based visualization on employee working hours data. It shows six clusters, representing different working-day patterns.
Motivation of VizTree 10001000101001000101010100001010100010101110111101011010010111010010101001110101010100101001010101110101010010101010110101010010110010111011110100011100001010000100111010100011100001010101100101110101 01011001011110011010010000100010100110110101110000101010111011111000110110110111111010011001001000110100011110011011010001011110001011010011011001101000000100110001001110000011101001100101100001010010 Here are two sets of bit strings. Which set is generated by a human and which one is generated by a computer?
VizTree 10001000101001000101010100001010100010101110111101011010010111010010101001110101010100101001010101110101010010101010110101010010110010111011110100011100001010000100111010100011100001010101100101110101 01011001011110011010010000100010100110110101110000101010111011111000110110110111111010011001001000110100011110011011010001011110001011010011011001101000000100110001001110000011101001100101100001010010 [Figure: depth-limited binary tree with branches labeled 0 and 1] Let's put the sequences into a depth-limited tree, such that the frequencies of all triplets are encoded in the thickness of branches… “Humans usually try to fake randomness by alternating patterns.”
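The triplet counting behind this slide can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: the function name `triplet_frequencies` is hypothetical, and it assumes the triplets are taken with a sliding window of step 1, which the slide does not state.

```python
from collections import Counter


def triplet_frequencies(bits: str, depth: int = 3) -> dict:
    """Relative frequency of every length-`depth` substring of `bits`.

    Assumption: substrings are extracted with a sliding window of step 1.
    """
    windows = [bits[i:i + depth] for i in range(len(bits) - depth + 1)]
    counts = Counter(windows)
    total = len(windows)
    # Each value is the share of windows that spell out this pattern; in the
    # tree, it would set the thickness of the branch reached by the pattern.
    return {pattern: counts[pattern] / total for pattern in sorted(counts)}
```

Running it on each of the two bit strings gives one frequency table per string. A string that over-uses alternating patterns such as `010` and `101` would show up as a few thick branches next to many thin ones.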
VizTree • The “trick” on the previous slide only works for discrete data, but time series are real valued. • But we can SAX up a time series to make it discrete! • VizTree • Convert the time series to SAX • Push the data into a depth-limited suffix tree • Encode the frequencies as the line thickness [Figure: VizTree screenshot with panels labeled Overview, Zoom in, Details 1, and Details 2] Overview, zoom & filter, details on demand
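The last two steps, pushing words into a depth-limited tree and mapping frequencies to line thickness, might look like the sketch below. It assumes the time series has already been turned into a list of equal-length symbolic words. The names `build_tree` and `branch_thickness`, the nested-dict node layout, and the linear mapping to a `max_width` are illustrative choices, not details taken from the slides.

```python
def build_tree(words):
    """Insert each word into a tree whose depth equals the word length.

    Every node stores how many words pass through it.
    """
    root = {"count": 0, "children": {}}
    for word in words:
        node = root
        node["count"] += 1
        for symbol in word:
            node = node["children"].setdefault(
                symbol, {"count": 0, "children": {}}
            )
            node["count"] += 1
    return root


def branch_thickness(root, prefix, max_width=10.0):
    """Line width for the branch reached by `prefix`.

    Assumption: width grows linearly with the share of words on the branch.
    """
    if root["count"] == 0:
        return 0.0
    node = root
    for symbol in prefix:
        node = node["children"].get(symbol)
        if node is None:
            return 0.0  # no word starts with this prefix
    return max_width * node["count"] / root["count"]
```

For example, with the words `["abc", "abd", "acc", "bab"]`, three of the four words start with `a`, so `branch_thickness(tree, "a")` is 7.5 out of a maximum of 10.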
SAX: Symbolic Aggregate ApproXimation • Example SAX word: baabccbc
How do we obtain SAX? • First convert the time series to PAA representation, then convert the PAA to symbols • It takes linear time [Figure: time series C (x-axis 0 to 120), its PAA approximation, and the resulting SAX word “bccbaaba” over the symbols a, b, c]
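The two steps, PAA followed by symbol assignment, can be sketched with the standard library alone. The slide does not give the breakpoints; this sketch assumes the common SAX convention of z-normalizing the series and cutting the standard normal distribution into equiprobable regions. It also assumes the series length is a multiple of the number of segments. The function names are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def znorm(series):
    """Rescale to zero mean and unit standard deviation."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0] * len(series)  # constant series: nothing to rescale
    return [(x - mu) / sigma for x in series]


def paa(series, n_segments):
    """Piecewise Aggregate Approximation: the mean of each equal-length segment.

    Assumption: len(series) is a multiple of n_segments.
    """
    seg = len(series) // n_segments
    return [mean(series[i * seg:(i + 1) * seg]) for i in range(n_segments)]


def sax(series, n_segments, alphabet_size):
    """Map each PAA value to a letter; 'a' is the lowest region."""
    # Assumed breakpoints: equiprobable regions under a standard normal curve.
    breakpoints = [
        NormalDist().inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)
    ]
    values = paa(znorm(series), n_segments)
    return "".join(chr(ord("a") + bisect_right(breakpoints, v)) for v in values)
```

Each step touches every point a constant number of times, which is consistent with the linear-time claim on the slide. A rising series such as `[1, 1, 5, 5, 9, 9]` with 3 segments and 3 symbols maps to the word `abc`.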
Visual Comparison [Figure: the same raw time series approximated by DFT, PLA, Haar, and APCA, alongside its SAX discretization with symbols a to f on a y-axis from -3 to 3] • A raw time series of length 128 is transformed into the word “aaaaaabbbccdeffdcbbdcdefffffdccbb.” • We can use more symbols to represent the time series since each symbol requires fewer bits than real numbers (float, double)
Subsequence Matching/Motif Discovery [Figure: VizTree screenshot, zoomed in] Overview, zoom & filter, details on demand (Ben Shneiderman) This example demonstrates subsequence matching and motif discovery. We want to find a U-shaped pattern, so we’d try something that starts high, descends, and then ascends again. Clicking on “abdb” shows such patterns.
Motif Discovery Clicking on “abxx” shows these repeated patterns
Anomaly Detection 1 Clicking on the branch “acxx” shows the anomalous heartbeat
Anomaly Detection 2 Clicking on “bab” shows the anomalous week (Christmas). Instead of the normal 5 working days, the Christmas week has 3 working days.
Diff Tree • DiffTree • Convert the two time series to SAX • Push the data into a depth-limited suffix tree • Encode the difference of frequencies as the line thickness • Encode the significance of difference as the line color intensity • Rank the surprising patterns • Blue lines: pattern is more common in A • Green lines: pattern is more common in B • Red lines: surprising patterns
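The frequency comparison at the heart of the Diff Tree can be sketched as follows, assuming both series have already been converted to lists of symbolic words. The slides do not define the significance or surprise measure, so this sketch simply ranks patterns by the absolute difference in relative frequency as a stand-in; it does not attempt the red "surprising" class. All names are hypothetical.

```python
from collections import Counter


def relative_frequencies(words):
    """Share of the word list taken up by each distinct word."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def diff_patterns(words_a, words_b):
    """Rows of (pattern, freq_in_A - freq_in_B, color), largest gap first.

    Assumption: ranking by absolute frequency difference stands in for the
    surprise measure, which the slides do not specify.
    """
    freq_a = relative_frequencies(words_a)
    freq_b = relative_frequencies(words_b)
    rows = []
    for pattern in sorted(set(freq_a) | set(freq_b)):
        diff = freq_a.get(pattern, 0.0) - freq_b.get(pattern, 0.0)
        # Color follows the slide's legend: blue for A, green for B.
        color = "blue" if diff > 0 else "green" if diff < 0 else "none"
        rows.append((pattern, diff, color))
    return sorted(rows, key=lambda row: abs(row[1]), reverse=True)
```

The magnitude of the difference would drive the line thickness and the sign would pick the color.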
Diff Tree 2
Scalability • The pixel space of the tree is determined solely by the number of segments and alphabet size. • Constant and independent of the size of time series • Size of the dataset plays a role in memory space, since each node in the tree stores the offsets of its subsequences. However, SAX allows efficient numerosity reduction to reduce the number of subsequences included in the tree • Large amounts of dimensionality reduction do not greatly affect the accuracy of our results (for the power dataset, the dimensionality is reduced from 672 to 3, a compression ratio of 224-to-1).
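The slide does not spell out how numerosity reduction works. One plausible reading, assumed here, is that when consecutive sliding windows produce the same SAX word, only the first occurrence and its offset are kept. The sketch below shows that rule together with the compression-ratio arithmetic from the slide; both function names are hypothetical.

```python
def numerosity_reduce(words):
    """Keep (offset, word) only when the word differs from the previous one.

    Assumption: `words` holds one SAX word per sliding-window position,
    in order, so the list index is the subsequence offset.
    """
    kept = []
    for offset, word in enumerate(words):
        if not kept or kept[-1][1] != word:
            kept.append((offset, word))
    return kept


def compression_ratio(original_length, reduced_length):
    """Dimensionality before reduction divided by dimensionality after."""
    return original_length / reduced_length
```

For the power dataset figures quoted on the slide, `compression_ratio(672, 3)` gives 224.0, matching the stated 224-to-1.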
Conclusion • We propose VizTree, a novel time series visualization tool for pattern discovery. • Frequently occurring patterns • Anomaly detection • Query-by-content • A single framework that allows both mining of the archival data and monitoring of streaming data. • Highly scalable.
Future Work • Researchers from other sectors of the industry can greatly benefit from our system as well. • It could potentially be used for indexing and editing video sequences. • Problems that can be indirectly solved: • Subsequence Clustering • Time Series Rule Discovery • While we mainly focus on the “mining” aspect in this paper, we will extend VizTree to accept online streaming data for monitoring purposes.