
Time Series Epenthesis: Clustering Time Series Streams Requires Ignoring Some Data


Presentation Transcript


  1. Time Series Epenthesis: Clustering Time Series Streams Requires Ignoring Some Data Thanawin Rakthanmanon, Eamonn Keogh, Stefano Lonardi, Scott Evans

  2. Subsequence Clustering Problem • Given a time series, individual subsequences are extracted with a sliding window. • The main task is to cluster those subsequences. (Figure: a sliding window moving along the time series.)
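The extraction step described above can be sketched as follows; this is a minimal illustration (the toy sine series and window length are assumptions, not from the slides):

```python
import numpy as np

def sliding_window_subsequences(ts, w):
    """Extract every length-w subsequence of ts with a sliding window."""
    return np.array([ts[i:i + w] for i in range(len(ts) - w + 1)])

ts = np.sin(np.linspace(0, 4 * np.pi, 100))   # a toy time series
subs = sliding_window_subsequences(ts, 20)
print(subs.shape)  # (81, 20)
```

A series of length m yields m - w + 1 heavily overlapping subsequences, which is exactly what makes naive subsequence clustering problematic.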

  3. Subsequence Clustering Problem Keogh and Lin (ICDM 2003) showed that subsequence clustering is meaningless: the centers of the clusters converge to the average subsequence, regardless of the input. (Figure: all subsequences from the sliding window; the centers of 3 clusters vs. the average subsequence.)

  4. All data also contains .. Transitions (the connections between words) • Some transitions have meaning and are worth discovering • the connection inside a group of words • Some transitions have no meaning/structure • ASL: hand movement between two words • Speech: (un)expected sounds like um.., ah.., er.. • Motion capture: unexpected movement • Handwriting: the size of the space between words

  5. How to Deal with Them? Possible approaches: • Learn it! • Separate noise/unexpected data from the dataset. • Use a very clean dataset • a dataset that contains only atomic words. • Simple approach (our choice) • Just ignore some data • and hope the data we ignore is unimportant.

  6. Concepts in Our Algorithm Our clustering algorithm .. • is a hierarchical clustering • is parameter-lite • the approximate length of a subsequence (the size of the sliding window) • ignores some data • the algorithm considers only non-overlapping data • uses an MDL-based distance, bitsave • terminates if .. • no choice can save any bits (bitsave ≤ 0), or • all data has been used
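The termination rule above implies a greedy outer loop: keep applying whichever operator saves the most bits, and stop when nothing saves any. A control-flow-only sketch, with toy operators that are pure assumptions for illustration:

```python
def cluster_time_series(operators, state):
    """Greedy MDL loop: at every step, apply whichever operator
    (create / add / merge) yields the largest bitsave; terminate when
    no operator saves any bits (bitsave <= 0)."""
    while True:
        proposals = [op(state) for op in operators]      # each: (bitsave, new_state)
        bitsave, new_state = max(proposals, key=lambda p: p[0])
        if bitsave <= 0:
            return state
        state = new_state

# Toy operators on an integer "state", only to exercise the control flow:
# the first pretends each successive step saves fewer bits, the second never helps.
ops = [lambda s: (3 - s, s + 1),
       lambda s: (-1, s)]
print(cluster_time_series(ops, 0))  # → 3 (stops when the next step would save 0 bits)
```

In the real algorithm, `state` would be the current set of clusters and each operator would return the bitsave formula given on the later slides.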

  7. Minimum Description Length (MDL) • The shortest code to output the data, introduced by Jorma Rissanen in 1978 • Computing it exactly is intractable (Kolmogorov complexity) Basic concepts of MDL which we use: • The better choice uses the smaller number of bits to represent the data • Compare between different operators • Compare between different lengths

  8. How to Use Description Length? H is a hypothesis; A' is A given H: A' = A|H = A - H. If DL(A) > DL(A') + DL(H), we store A as A' and H. DL(A) is the number of bits needed to store A. (Figure: subsequences A and H, and the mostly-zero difference A'.)
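The comparison on this slide can be made concrete with an entropy-based description length. The toy sequences below are assumptions chosen so that A and the hypothesis H differ in a single sample:

```python
import numpy as np
from collections import Counter

def description_length(x):
    """DL of a discrete sequence: its length times its sample entropy (bits)."""
    n = len(x)
    probs = np.array([c / n for c in Counter(x).values()])
    return -n * float(np.sum(probs * np.log2(probs)))

A = [1, 2, 3, 4, 4, 3, 2, 1]
H = [1, 2, 3, 4, 5, 3, 2, 1]                 # hypothesis: a similar subsequence
A_prime = [a - h for a, h in zip(A, H)]      # A' = A - H, mostly zeros

# Storing A as (A', H) pays off whenever DL(A') + DL(H) < DL(A)
print(description_length(A), description_length(A_prime))  # 16.0 vs ~4.35
```

Because A' is mostly zeros, its entropy (and hence its description length) is much smaller than that of A, which is exactly why encoding against a good hypothesis saves bits.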

  9. Clustering Algorithm Three operators on the current clusters: • Create a new cluster from the 2 subsequences that are the most similar. • Add the closest subsequence to an existing cluster. • Merge the 2 closest clusters.

  10. What Is the Best Choice? bitsave = DL(Before) - DL(After) • Create: bitsave = DL(A) + DL(B) - DLC(C'), creating a new cluster C' from subsequences A and B • Add: bitsave = DL(A) + DLC(C) - DLC(C'), adding a subsequence A to an existing cluster C; C' is the cluster C after including A • Merge: bitsave = DLC(C1) + DLC(C2) - DLC(C'), merging clusters C1 and C2 into a new cluster C' The bigger the save, the better the choice.

  11. Clustering Algorithm (repeated) Three operators on the current clusters: • Create a new cluster from the 2 subsequences that are the most similar. • Add the closest subsequence to an existing cluster. • Merge the 2 closest clusters.

  12. Bird Calls Two interwoven calls from the Elf Owl and the Pied-billed Grebe. (Figure: a time series extracted from the audio using the MFCC technique.)

  13. Clustering trace: Input, Motif Discovery (create from the nearest-neighbor pair), Add, Create, Add, Add, Merge, Final Clusters. (The current clusters are shown after each step.)

  14. Bird Calls: Clustering Result Step 1: Create, Step 2: Create, Step 3: Add, Step 4: Merge. (Figure: the center of each cluster, or hypothesis, with its subsequences, and the bitsave per unit at each of the 4 steps; once no operator has positive bitsave, the clustering stops.)

  15. Poem The Bells In a sort of Runic rhyme, To the throbbing of the bells-- Of the bells, bells, bells, To the sobbing of the bells; Keeping time, time, time, As he knells, knells, knells, In a happy Runic rhyme, To the rolling of the bells,-- Of the bells, bells, bells-- To the tolling of the bells, Of the bells, bells, bells, bells, Bells, bells, bells,-- To the moaning and the groan- ing of the bells. Edgar Allan Poe 1809-1849 (Wikipedia)

  16. The Bells: Clustering Result == Group by Clusters == bells, bells, bells, Bells, bells, bells, Of the bells, bells, bells, Of the bells, bells, bells— To the throbbing of the bells-- To the sobbing of the bells; To the tolling of the bells, To the rolling of the bells,-- To the moaning and the groan- time, time, time, knells, knells, knells, sort of Runic rhyme, groaning of the bells. == Original Order == In a sort of Runic rhyme, To the throbbing of the bells-- Of the bells, bells, bells, To the sobbing of the bells; Keeping time, time, time, As he knells, knells, knells, In a happy Runic rhyme, To the rolling of the bells,-- Of the bells, bells, bells-- To the tolling of the bells, Of the bells, bells, bells, bells, Bells, bells, bells,-- To the moaning and the groan- ing of the bells.

  17. Summary • A time-series clustering algorithm using MDL. • Some data must be ignored, i.e., it does not appear in any cluster. • MDL is used to .. • select the best choice among different operators. • select the best choice among different lengths. • Final clusters can contain subsequences of different lengths. • For speed, Euclidean distance is used instead of MDL in core modules, e.g., motif discovery.

  18. Thank you for your attention. Questions?

  19. Supplementary

  20. How to Calculate DL? A is a subsequence. • DL(A) = entropy(A) • Similar results if Shannon-Fano or Huffman coding is used. H is a hypothesis, which can be any subsequence. • DL(A) = DL(H) + DL(A - H) • This is a compression idea; it is never used directly in the algorithm. Cluster C contains subsequences A and B. • DLC(C) = DL(center) + min(DL(A - center), DL(B - center))
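The claim that entropy and Huffman coding give similar results can be checked directly. The toy sequence below is an assumption, chosen with a dyadic symbol distribution so the two counts agree exactly (in general Huffman is within one bit per symbol of the entropy):

```python
import heapq
import math
from collections import Counter

def entropy_bits(seq):
    """DL(A) = entropy(A): total bits, i.e. length times per-symbol entropy."""
    n = len(seq)
    return -n * sum((c / n) * math.log2(c / n) for c in Counter(seq).values())

def huffman_bits(seq):
    """Total bits to code seq with a Huffman code built from its own counts
    (the sum of merged weights equals the total code length)."""
    counts = list(Counter(seq).values())
    if len(counts) == 1:
        return len(seq)          # degenerate case: one symbol, one bit each
    heapq.heapify(counts)
    total = 0
    while len(counts) > 1:
        a, b = heapq.heappop(counts), heapq.heappop(counts)
        total += a + b
        heapq.heappush(counts, a + b)
    return total

A = [1, 1, 1, 1, 2, 2, 3, 4]
print(entropy_bits(A), huffman_bits(A))  # 14.0 14
```

This is why the slides can use entropy as the description length without building an actual code.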

  21. Running Time Koski-ECG time series; motif length s = 350; running time O(m^3/s). (Figure: the resulting clusters plotted, stacked and dithered.)

  22. ED vs. MDL on Random Walk Data • ED is calculated in the original continuous space • MDL is calculated in discrete space (cardinality 64)
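The discrete space mentioned above requires quantizing the real-valued series. A minimal sketch, assuming simple min-max scaling onto 64 integer levels (the slides do not specify the normalization scheme):

```python
import numpy as np

def discretize(ts, cardinality=64):
    """Map a real-valued series onto `cardinality` evenly spaced integer
    levels; MDL bit counts are computed in this discrete space."""
    lo, hi = ts.min(), ts.max()
    return np.round((ts - lo) / (hi - lo) * (cardinality - 1)).astype(int)

rng = np.random.default_rng(0)
walk = np.cumsum(rng.standard_normal(200))   # a random walk, as on the slide
d = discretize(walk)
print(d.min(), d.max())  # 0 63
```

Entropy is only defined over a finite alphabet, so some such quantization is a prerequisite for computing DL at all; the next slide argues the information loss is small.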

  23. Discretization vs. Accuracy • Classification accuracy on 18 data sets. • The reduction from the original continuous space to different discretizations does not hurt much, at least in classification accuracy.
