Disk Aware Discord Discovery: Finding Unusual Time Series in Terabyte Sized Datasets

Disk Aware Discord Discovery: Finding Unusual Time Series in Terabyte Sized Datasets Dragomir Yankov, Eamonn Keogh, Computer Science & Eng. Dept. University of California, Riverside Umaa Rebbapragada Dept. of Computer Science Tufts University Best paper winner: ICDM 2007

Outline • What inspired the current work • The time series discord detection problem • An efficient algorithm for mining disk resident discords • Detecting range-based discords • Detecting the top k discords • Experimental results • Evaluating the effectiveness of the discord definition • Scalability of the discord detection algorithm

A motivating example • Myriads of telescopes around the world constantly record valuable astronomical data, e.g. star light-curves Click Image to Play • A light-curve is a real-valued time series • of light magnitude measurements • derived from telescopic images Eclipsed binary: Sirius A&B Movie: By kind permissions of Prof. Richard W. Pogge, OSU Image: Chandra X-ray observatory

A motivating example (cont) • The American Association of Variable Star Observers has a database of over 10.5 million variable star brightness measurements going back over ninety years • Over 400,000 new variable star brightness measurements are added to the database every year • Many of the observations are noisy or are preprocessed inaccurately prior to storing • Efficient, unsupervised methods for cleaning the data are required

A motivating example (cont) • Data are inherently non-convex and hard to model probabilistically. • Anomalies should be • defined with respect to • the non-linear manifolds • defined by the light- • curve time series (true • for many time series • datasets)

Definitions and assumptions • Notation • time series: • subseqence: • time series database: • Function (may not be a metric) defines an ordering for the elements in Nasdaq Composite (Oct06-Oct07)

Time series discords • Most-significant discord – the subsequence with maximal distance to its nearest neighbor

Generalized discord definitions • Most-significant k-th NN discord – the subsequence with maximal distance to its k-th nearest neighbor

Generalized discord definitions • Most-significant k-NN discord – the subsequence with maximal distance to its k nearest neighbors in The algorithm utilizes the first of these discord definitions for its computational efficiency and intuitive interpretation

Disk aware discord detection • Detecting discords is harder than finding similar patterns • anytime algorithms can quickly detect similarities • anomalies require computation time • Indexing is not a solution • time series are high dimensional • dimensionality reduction is often inadequate • linear scan is faster than 10% random disk accesses We are looking for an algorithm that performs two disk scans and “approximately linear” number of computations

Discord detection algorithm • Phase 1 – candidates selection phase … … - discord range

Discord detection algorithm • Phase 2 – candidates refinement phase … … ? … … - discord range

Discord detection algorithm • Phase 2 – candidates refinement phase … … … … - discord range

Discord detection algorithm • Phase 2 – candidates refinement phase … … Upon completion sort the candidates list C … …

Correctness of the algorithm • The candidates set C contains all discords at distance at least r from their NN, plus some other elements • The refinement phase removes from C all false positives, and no real discord is pruned • Correctness: the range discord algorithm detects all discords and only the discords with respect to the specified range r

Finding a good range parameter • Selecting large r may result in an empty discord set, while too small r can render the algorithm inefficient • Computing the nearest neighbor distance distribution (NNDD) is expensive • NNDD depends on the number of examples in the data

Approximating NNDD • Intuition – though the relative volume in the upper tail decreases, the absolute number of discords cut by r remains sufficient when adding more data • Detecting the top k discords • Select a uniformly random sample • Compute the top k discords in • Order their NN distances as: • Set • Run the disk aware algorithm with range parameter

Experimental evaluation We performed two sets of experiments • Experiments showing the utility of the time series discord definition • Experiments showing the scalability of the disk aware discord detection algorithm

Experimental evaluation - utility of the discord definition • Star light-curve data from the Optical Gravitational Lensing Experiment (OGLE) • Three classes of light-curves • Eclipsed binaries • Cepheids • RR Lyrae variables typical examples top two discords in each class

Experimental evaluation -utility of the discord definition • MSN web queries made in 2002 • The most significant discord using rotation invariant Euclidean distance patterns dominated by a weekly cycle anticipated bursts periodicity 29.5 days – the length of a synodic month

Experimental evaluation -utility of the discord definition • Anomaly detection in video sequences (multivariate data) • Adapting the method as a data cleaning procedure the top one discord shown with only one of the existing clusters our method achieves 100% accuracy on the planted anomalous trajectories

Experimental evaluation -utility of the discord definition • Population growth data – we studied the growth rate of 206 countries for the last 25 years, looking for the most dramatic 5 year event the top 2 discords with a set of 10 representative countries for contrast

Experimental evaluation –scalability of the disk aware algorithm • We generated 3 data sets of size up to 0.35Tb of random walk time series • Six non-random walk time series were planted, we looked for the top 10 discords • Time efficiency on the three random walk data sets: two of the planted series (top) were among the top 10 discords

Experimental evaluation –scalability of the disk aware algorithm • Time efficiency (Heterogeneous data): • Main memory requirement for different thresholds

Experimental evaluation –scalability of the disk aware algorithm • Parallelizing the algorithm (m computers): … … Candidate selection phase Candidate refinement phase

Experimental evaluation –scalability of the disk aware algorithm • Parallelizing the algorithm (dataset: one million random walks ): The runtime overhead for 8 computers is approximately 30%. This is due to the increased candidate set size |C| at the end of phase 1

Conclusion • Discords provide for an effective definition of rare time series patterns. • The presented disk aware algorithm has all requirements of a good off-the-shelf data mining tool: • The results are interpretable • It is extremely efficient and largely scalable • Very easy to implement (“8 lines in Matlab”) • Allows for straight-forward parallel and online extensions

Acknowledgements • We would like to thank to: • Dr. Pavlos Protopapas (Harvard University) – light-curve dataset • Dr. Michail Vlachos (IBM Watson) – MSN web query data • Dr. Longin Jan Latecki (Temple University) – Trajectory dataset1 • Dr. Andrew Naftel (University of Manchester) - Trajectory dataset2 also • Dr. Jessica Lin (George Mason University) and • Dr. Ada Fu (Chinese University of Hong Kong) – for useful discussions

All datasets and the code can be downloaded from: http://www.cs.ucr.edu/~dyankov/projects/ THANK YOU!

Disk Aware Discord Discovery: Finding Unusual Time Series in Terabyte Sized Datasets