380 likes | 606 Views
Beyond Streams and Graphs: Dynamic Tensor Analysis. Dacheng Tao. Christos Faloutsos. Jimeng Sun. Speaker: Jimeng Sun. Motivation. Goal: incremental pattern discovery on streaming applications Streams: E1: Environmental sensor networks E2: Cluster/data center monitoring Graphs:
E N D
Beyond Streams and Graphs: Dynamic Tensor Analysis Dacheng Tao Christos Faloutsos Jimeng Sun Speaker: Jimeng Sun
Motivation • Goal: incremental pattern discovery on streaming applications • Streams: • E1: Environmental sensor networks • E2: Cluster/data center monitoring • Graphs: • E3: Social network analysis • Tensors: • E4: Network forensics • E5: Financial auditing • E6: fMRI: Brain image analysis • How to summarize streaming data effectively and incrementally?
E3: Social network analysis • Traditionally, people focus on static networks and find community structures • We plan to monitor the change of the community structure over time and identify abnormal individuals
destination source E4: Network forensics • Directional network flows • A large ISP with 100 POPs, each POP 10Gbps link capacity [Hotnets2004] • 450 GB/hour with compression • Task: Identify abnormal traffic pattern and find out the cause abnormal traffic normal traffic destination source Collaboration with Prof. Hui Zhang and Dr. Yinglian Xie
Static Data model • For a timestamp, the stream measurements can be modeled using a tensor • Dimension: a single stream • E.g, <Christos, “graph”> • Mode: a group of dimensions of the same kind. • E.g., Source, Destination, Port Time = 0 Source Destination
Static Data model (cont.) • Tensor • Formally, • Generalization of matrices • Represented as multi-array, data cube.
Dynamic Data model (our focus) • Streams come with structure • (time, source, destination, port) • (time, author, keyword) Source Destination time
where n is increasing over time time … Dynamic Data model (cont.) • Tensor Streams • A sequence of Mth order tensor keyword time author … …
Dynamic tensor analysis Old Tensors New Tensor Source Destination UDestination Old cores USource
Roadmap • Motivation and main ideas • Background and related work • Dynamic and streaming tensor analysis • Experiments • Conclusion
R R R Background – Singular value decomposition (SVD) • SVD • Best rank k approximation in L2 • PCA is an important application of SVD n n k k k VT A U UT m m Y
Latent semantic indexing (LSI) • Singular vectors are useful for clustering or correlation detection cluster cache frequent query pattern concept-association DM x x = DB document-concept concept-term
5 7 6 8 1 32 4 Tensor Operation: Matricize X(d) • Unfold a tensor into a matrix Acknowledge to Tammy Kolda for this slide
Tensor Operation: Mode-product • Multiply a tensor with a matrix port port source source destination destination group group source
Related work Our Work
Roadmap • Motivation and main ideas • Background and related work • Dynamic and streaming tensor analysis • Experiments • Conclusion
Tensor analysis Given a sequence of tensors find the projection matrices such that the reconstruction error e is minimized: … t … Note that this is a generalization of PCA when n is a constant
Why do we care? • Anomaly detection • Reconstruction error driven • Multiple resolution • Multiway latent semantic indexing (LSI) Philip Yu time Michael Stonebreaker Pattern Query
1st order DTA - problem Given x1…xn where each xi RN, find URNR such that the error e is small: N Y UT x1 R ? Sensors …. n time xn indoor Note that Y = XU Sensors outdoor
1st order DTA Input: new data vector x RN, old variance matrix C RN N Output: new projection matrix U RN R Algorithm: 1. update variance matrix Cnew = xTx + C 2. Diagonalize UUT =Cnew 3. Determine the rank R and return U Old X time x x UT Cnew U C xT Diagonalization has to be done for every new x!
1st order STA • Adjust U smoothly when new data arrive without diagonalization [VLDB05] • For each new pointx • Project onto current line • Estimate error • Rotate line in the direction of the error and in proportion to its magnitude For each new point x and for i = 1, …, k : • yi := UiTx (proj. onto Ui) • didi + yi2 (energy i-th eigenval.) • ei := x – yiUi (error) • UiUi + (1/di) yiei (update estimate) • xx – yiUi (repeat with remainder) error Sensor 2 U Sensor 1
Mth order DTA – complexity Storage: O( Ni), i.e., size of an input tensor at a single timestamp Computation: Ni3 (or Ni2) diagonalization of C + Ni Nimatrix multiplication X (d)T X(d) For low order tensor(<3), diagonalization is the main cost For high order tensor, matrix multiplicationis the main cost
x U1 updated e1 U1 y1 Mth order STA • Run 1st order STA along each mode • Complexity: • Storage: O( Ni) • Computation: Ri Niwhich is smaller than DTA
Roadmap • Motivation and main ideas • Background and related work • Dynamic and streaming tensor analysis • Experiments • Conclusion
Experiment • Objectives • Computational efficiency • Accurate approximation • Real applications • Anomaly detection • Clustering
Data set 1: Network data • TCP flows collected at CMU backbone • Raw data 500GB with compression • Construct 3rd order tensors with hourly windows with <source, destination, port> • Each tensor: 500500100 • 1200 timestamps (hours) value Sparse data Power-law distribution 10AM to 11AM on 01/06/2005
Data set 2: Bibliographic data (DBLP) • Papers from VLDB and KDD conferences • Construct 2nd order tensors with yearly windows with <author, keywords> • Each tensor: 45843741 • 11 timestamps (years)
Computational cost • OTA is the offline tensor analysis • Performance metric: CPU time (sec) • Observations: • DTA and STA are orders of magnitude faster than OTA • The slight upward trend in DBLP is due to the increasing number of papers each year (data become denser over time) 3rd order network tensor 2nd order DBLP tensor
Accuracy comparison • Performance metric: the ratio of reconstruction error between DTA/STA and OTA; fixing the error of OTA to 20% • Observation: DTA performs very close to OTA in both datasets, STA performs worse in DBLP due to the bigger changes. 3rd order network tensor 2nd order DBLP tensor
Reconstruction error over time Normal traffic Abnormal traffic Network anomaly detection • Reconstruction error gives indication of anomalies. • Prominent difference between normal and abnormal ones is mainly due to unusual scanning activity (confirmed by the campus admin).
Multiway LSI DB DM • Two groups are correctly identified: Databases and Data mining • People and concepts are drifting over time
Conclusion • Tensor stream is a general data model • DTA/STA incrementally decompose tensors into core tensors and projection matrices • The result of DTA/STA can be used in other applications • Anomaly detection • Multiway LSI
Final word:Think structurally! • The world is not flat, neither should data mining be. Contact: Jimeng Sun jimeng@cs.cmu.edu