Finding Time Series Motifs on Disk-Resident Data
Abdullah Mueen, Dr. Eamonn Keogh, UC Riverside
Nima Bigdely-Shamlo, Swartz Center for Computational Neuroscience, UCSD
Outline
• Motivation
• Time Series Motif
• DAME: Disk-Aware Motif Enumeration
• Performance Evaluation
  • Speedup and Efficiency
• Case Studies
  • Motifs in Brain-Computer Interfaces
  • Motifs in Image Database
• Conclusion
Sequence Motif
• Repeated pattern in a sequence.
• A pattern can be approximately similar.
  • Mismatch is allowed.
• A pattern can be overlapping.
[Figure: example motifs. Sequence motif in the string GACATAATAACCAGCTATCTGCTCGCATCGCCGCGACATAGCT; panels labeled Motion Motif, Structural Motif, and Time Series Motif.]
Time Series Motif
• Repeated pattern in a time series.
• Exact motif.
  • The most similar pair under Euclidean distance.
  • Non-overlapping.
• Euclidean distance (between normalized segments).
  • Beats most similarity measures on large datasets.
  • Early abandoning.
  • Triangle inequality.
    • d(P,Q) ≥ |d(P,R) - d(Q,R)|
[Figure: an example time series motif pair.]
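The two speedups named on this slide, early abandoning and the triangle-inequality bound, can be illustrated with a minimal Python sketch. The function names `znorm`, `early_abandon_dist`, and `lower_bound` are hypothetical and not taken from the slides; mapping a constant segment to all zeros is also an assumption.

```python
import math

def znorm(x):
    """Z-normalize a sequence to zero mean and unit standard deviation."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    if std == 0:
        return [0.0] * n  # assumption: a constant segment maps to all zeros
    return [(v - mean) / std for v in x]

def early_abandon_dist(p, q, best_so_far=math.inf):
    """Euclidean distance that stops once it cannot beat best_so_far.

    Returns math.inf when the computation is abandoned.
    """
    limit = best_so_far ** 2
    total = 0.0
    for a, b in zip(p, q):
        total += (a - b) ** 2
        if total >= limit:
            return math.inf  # partial sum already too large: abandon
    return math.sqrt(total)

def lower_bound(d_pr, d_qr):
    """Triangle-inequality lower bound on d(P,Q) from distances to R."""
    return abs(d_pr - d_qr)
```

If `lower_bound(d_pr, d_qr)` is already at least the best distance found so far, the pair (P, Q) can be skipped without computing `early_abandon_dist` at all.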
Motif Discovery in Disk-Resident Datasets
• Large datasets
  • Light curves of stars.
  • Performance counters of data centers.
• Pseudo time series datasets
  • "80 million Tiny Images"
• Databases of normalized subsequences
  • An hour-long trace of EEG generates over one million normalized subsequences.
DAME: Geometric View and Disk View
[Figure: a set of 2D points (geometric view) and the same points stored in numbered blocks on disk (disk view).]
DAME: Geometric View, Projected View, and Disk View
• Linear representation of the points in sorted order.
• 0 is the reference point.
[Figure: the 2D points, their projection onto a line in sorted order, and the resulting arrangement of blocks on disk.]
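The linear representation can be sketched in Python as a sort by distance to the reference point. This is an in-memory illustration only, not DAME itself, which works on disk blocks; the names `order_by_reference` and `closest_pair_with_ordering` are hypothetical. Because |d(P,R) - d(Q,R)| is a lower bound on d(P,Q), the inner scan can stop as soon as the gap in the ordering reaches the best distance so far.

```python
import math

def order_by_reference(points, ref):
    """Return (distance_to_ref, index) pairs sorted by distance to ref."""
    return sorted((math.dist(p, ref), i) for i, p in enumerate(points))

def closest_pair_with_ordering(points, ref):
    """Find the closest pair, pruning with the reference ordering.

    Returns (best_distance, (i, j), number_of_distance_computations).
    """
    order = order_by_reference(points, ref)
    best, best_pair = math.inf, None
    comparisons = 0
    for a in range(len(order)):
        da, i = order[a]
        for b in range(a + 1, len(order)):
            db, j = order[b]
            if db - da >= best:
                break  # later points are even farther apart in the ordering
            comparisons += 1
            d = math.dist(points[i], points[j])
            if d < best:
                best, best_pair = d, (min(i, j), max(i, j))
    return best, best_pair, comparisons
```

A call such as `closest_pair_with_ordering(points, ref)` returns the same pair a brute-force search would, usually after far fewer distance computations when the reference distances are well spread out.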
DAME: Divide and Conquer
• Divide the point set into two partitions and solve each subproblem.
[Figure: geometric, projected, and disk views; the best pair found in each partition is labeled Best 1 and Best 2.]
DAME: Blocks of Interest
• The inner ring is the region for blocks 5 and 6.
• The outer ring is the region for blocks 3 and 4.
[Figure: geometric, projected, and disk views with the best-so-far distance (Bsf) and the blocks of interest marked.]
DAME: Comparing Block-Pairs
• Block-pairs considered: (3,5), (3,6), (4,5), (4,6).
• Block 3 and block 6 do not overlap: no comparison.
• The other three block-pairs need 9, 1, and 1 comparisons.
• 11 comparisons are made instead of 9*16 = 144.
[Figure: one panel per block-pair showing the loaded blocks and the best-so-far distance (bsf).]
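The block-pair pruning on this slide can be sketched in Python as follows. This is a simplified illustration under stated assumptions: each block is an in-memory list of `(ref_dist, point)` tuples sorted by `ref_dist`, whereas DAME loads blocks from disk, and the names `block_gap` and `compare_block_pair` are hypothetical.

```python
import math

def block_gap(block_a, block_b):
    """Smallest possible |d(p,r) - d(q,r)| for p in block_a, q in block_b.

    Each block is a list of (ref_dist, point) sorted by ref_dist.
    """
    lo_a, hi_a = block_a[0][0], block_a[-1][0]
    lo_b, hi_b = block_b[0][0], block_b[-1][0]
    return max(0.0, lo_b - hi_a, lo_a - hi_b)

def compare_block_pair(block_a, block_b, bsf):
    """Update the best-so-far distance using one pair of blocks.

    Returns (new_bsf, number_of_distance_computations).
    """
    if block_gap(block_a, block_b) >= bsf:
        return bsf, 0  # the blocks do not overlap: no comparison
    comparisons = 0
    for da, p in block_a:
        for db, q in block_b:
            if abs(da - db) >= bsf:
                continue  # lower bound already rules this pair out
            comparisons += 1
            bsf = min(bsf, math.dist(p, q))
    return bsf, comparisons
```

Two blocks whose ranges of reference distances are at least `bsf` apart are skipped entirely, and within overlapping blocks only the point pairs whose reference distances differ by less than `bsf` are actually compared.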
Speedup
[Table with columns Memory and Disk: X √ X √ √ √ √ X]
Performance Evaluation
[Figure: running time of DAME_Motif in seconds (×10³), with Total, CPU, and I/O curves, plotted against motif length, number of time series, and number of blocks.]
Case Study 1: Brain-Computer Interfaces
[Figure labels: Biosemi, Inc.; Target; Non-Target.]
Case Study 1: Brain-Computer Interfaces
[Figure: IC 17, Motif 1. Panels: epochs versus latency, before and after target presentation; target and non-target trials versus time (ms) after spatial filtering (ICA); normalized IC activity of Segment 1 and Segment 2 of Motif 1 versus time (ms); distance to Motif 1.]
Case Study 2: Image Motifs
• The concatenated color histograms of an image are treated as a pseudo time series.
• Each time series is of length 256*3 = 768.
• 80 million tiny images of 32×32 resolution.
[Figure: an example pseudo time series of length 768.]
80 million tiny images: collected by Antonio Torralba, Rob Fergus, and William T. Freeman at MIT.
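As a rough illustration of how an image becomes a pseudo time series of length 768, here is a minimal Python sketch. The pixel representation (a list of `(r, g, b)` tuples with 8-bit channels), the red, green, blue concatenation order, and the name `color_histogram_series` are assumptions; the slides do not specify them, and any normalization step is left out.

```python
def color_histogram_series(pixels):
    """Concatenate per-channel 256-bin histograms into one sequence.

    pixels: iterable of (r, g, b) tuples with integer values in 0..255.
    Returns a list of 256 * 3 = 768 counts (red, then green, then blue).
    """
    hist = [0] * (256 * 3)
    for r, g, b in pixels:
        hist[r] += 1        # red bins occupy positions 0..255
        hist[256 + g] += 1  # green bins occupy positions 256..511
        hist[512 + b] += 1  # blue bins occupy positions 512..767
    return hist
```

For a 32×32 image the input holds 1,024 pixels, so the counts in each 256-bin segment sum to 1,024.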
Case Study 2: Image Motifs
• DAME processed the first 40 million time series in ~6.5 days.
• DAME found 3,836,902 images that have at least one duplicate.
• 1,719,443 unique images.
• 542,603 images have near duplicates with distance less than 0.1.
[Figure: example pairs of duplicate images and near-duplicate images, labeled with numeric identifiers.]
Conclusion
• DAME: the first exact motif discovery algorithm that finds motifs in disk-resident data.
• DAME scales to massive datasets on the order of millions of time series.
• DAME successfully finds motifs in EEG traces and image databases.
Example of Multidimensional Motif
• Motion motif.
• Dance motions are taken from the CMU Motion Capture Database.
[Figure: top view of the dance floor and the trajectories of the dancers.]
Example of Worst Case Scenario
[Figure: points numbered 1 to 18 and a reference point r.]
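Multiple References for Ordering
[Figure: lower bounds compared with actual distances, with curves for planar bounds and linear bounds and annotations marking the larger gap and the smaller gap; a second panel shows points x and y, reference points r1 and r2, and the rotational axis.]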