Ph.D. Thesis Proposal: Incremental Pattern Discovery on Streams, Graphs and Tensors
Jimeng Sun
May 15, 2006
Thesis Committee • Christos Faloutsos (Chair) • Tom Mitchell • Hui Zhang • David Steier, PricewaterhouseCoopers • Philip Yu, IBM Watson Research Center
Thesis Proposal • Goal: incremental pattern discovery on streaming applications • Streams: • E1: Environmental sensor networks • E2: Cluster/data center monitoring • Graphs: • E3: Social network analysis • Tensors: • E4: Network forensics • E5: Financial auditing • E6: fMRI: Brain image analysis • How to summarize streaming data efficiently and incrementally?
E1: Environmental Sensor Monitoring
(with Prof. Jeanne M. VanBriesen, CMU Civil Engineering Department)
[Plot: chlorine concentrations across a water distribution network over three phases, comparing sensors near a leak with sensors away from the leak, during normal operation and during a major leak]
May have hundreds of measurements, and they are often related!
E1: Environmental Sensor Monitoring with SPIRIT
[Plot: the n actual measurement streams (chlorine concentrations) summarized by k = 1-2 hidden variables]
We would like to discover a few "hidden (latent) variables" that summarize the key trends
E3: Social network analysis
• Traditionally, people focus on static networks and finding community structures
• We plan to monitor how the community structure changes over time and to identify abnormal individuals
E4: Network forensics
• Directional network flows (source × destination)
• A large ISP with 100 POPs, each POP with 10 Gbps link capacity [Hotnets2004]
• 450 GB/hour even with compression
• Task: identify abnormal traffic patterns (vs. normal traffic) and find out the cause
Collaboration with Prof. Hui Zhang and Dr. Yinglian Xie
Commonality of all • Data: continuously arriving • Large volume • Multi-dimensional • Unlabeled • Task: incremental pattern discovery • Main trends • Anomalies
Thesis statement • Incremental and efficient summarization of heterogeneous streaming data through a general and concise representation enables many real applications in different domains.
Outline • Motivating examples • Data model and mining framework • Related work • Current work • Proposed work • Conclusion
Static Data Model
• Tensor
• Formally, an Mth-order tensor is X ∈ R^{N1 × N2 × … × NM}
• Generalization of matrices
• Represented as a multi-array or data cube
Dynamic Data Model (our focus)
• Tensor streams
• A sequence of Mth-order tensors X1, …, Xn, where n increases over time
• Example: one author × keyword tensor per time step
Our framework for incremental pattern discovery
Mining flow: Data Streams → Preprocessing → Tensor Streams → Tensor Analysis → Projections and Core tensors → Application Modules (Anomaly Detection, Clustering, Prediction)
Outline • Motivating examples • Data model and mining framework • Related work • Current work • Proposed work • Conclusion
Related work
[Diagram: positioning our work relative to prior approaches]
Background: Singular Value Decomposition (SVD)
• SVD: A ≈ U Σ V^T, where A is m × n, U is m × k, Σ is k × k, and V is n × k
• Best rank-k approximation in L2
• PCA is an important application of SVD
• Note that U and V are dense and may have negative entries (see the sketch below)
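To make the decomposition concrete, here is a minimal NumPy sketch of the rank-k approximation (a hedged illustration; the matrix A and rank k are made-up examples, not data from this work):

```python
# Best rank-k approximation of A via truncated SVD (illustrative data).
import numpy as np

A = np.random.rand(100, 50)                 # m x n data matrix
k = 5                                       # target rank

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] * s[:k] @ Vt[:k, :]          # best rank-k approximation in L2

# U and V are typically dense and may contain negative entries,
# even when A itself is sparse and non-negative.
print(np.linalg.norm(A - A_k), (Vt < 0).any())
```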
Background – Latent Semantic Indexing (LSI)
• Singular vectors are useful for clustering
• A document-term matrix factors as (document-concept) × (concept-association) × (concept-term)
• Example: DB and DM documents map to separate concepts, each tied to its own terms (e.g., "cluster", "cache" vs. "frequent", "query pattern")
Background: Tensor Operations
• Matricizing: unfold a tensor into a matrix
• Example: a source × destination × port tensor unfolds along the source mode into a source × (destination·port) matrix (see the sketch below)
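A minimal NumPy sketch of matricizing (a hedged illustration; the small source × destination × port tensor is hypothetical):

```python
# Unfold (matricize) a tensor along a chosen mode.
import numpy as np

X = np.random.rand(4, 3, 2)        # source x destination x port

def matricize(X, mode):
    """Rows index `mode`; columns index the remaining modes combined."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

X0 = matricize(X, 0)               # source x (destination*port)
print(X0.shape)                    # (4, 6)
```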
Background: Tensor Operations
• Mode-product: multiply a tensor with a matrix along one mode
• Example: multiplying a source × destination × port tensor by a source-to-"group" matrix aggregates the sources into groups (see the sketch below)
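A minimal NumPy sketch of the mode-product (a hedged illustration; the source-to-"group" matrix U is a made-up example):

```python
# Multiply tensor X by matrix U along a chosen mode (mode-product).
import numpy as np

X = np.random.rand(4, 3, 2)    # source x destination x port
U = np.random.rand(2, 4)       # maps 4 sources to 2 "groups"

def mode_product(X, U, mode):
    out = np.tensordot(U, X, axes=(1, mode))  # new dimension lands at axis 0
    return np.moveaxis(out, 0, mode)          # move it back into place

Y = mode_product(X, U, 0)
print(Y.shape)                 # (2, 3, 2): sources collapsed into groups
```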
Outline • Data model • Framework • Related work • Current work • Dynamic and Streaming tensor analysis (DTA/STA) • Compact matrix decomposition (CMD) • Proposed work • Conclusion
Methodology map
[Diagram: methods organized by data type and tensor order]
Tensor analysis
Given a sequence of tensors X1, …, Xt, find the projection matrices U1, …, UM such that the reconstruction error e is minimized (one plausible formulation is sketched below).
Note that this is a generalization of PCA when n is a constant (the first-order case is exactly PCA).
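One plausible way to write this objective, assuming (as a reconstruction, not a verbatim formula from the slides) that the projections are applied as mode-products with Ud Ud^T:

```latex
e = \sum_{t=1}^{n} \left\| \mathcal{X}_t - \mathcal{X}_t
      \times_1 (U_1 U_1^{T}) \times_2 (U_2 U_2^{T})
      \cdots \times_M (U_M U_M^{T}) \right\|_F^{2}
```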
Why do we care?
• Anomaly detection: reconstruction-error driven, at multiple resolutions
• Multiway latent semantic indexing (LSI): e.g., tracking authors (Philip Yu, Michael Stonebraker) and keywords ("Pattern", "Query") over time
1st-order DTA: problem
Given x1, …, xn, where each xi ∈ R^N, find U ∈ R^{N×R} such that the error e is small.
Note that Y = XU.
[Illustration: N sensor streams (indoor, outdoor) over n time steps projected onto R dimensions]
1st-order DTA
Input: new data vector x ∈ R^N, old variance matrix C ∈ R^{N×N}
Output: new projection matrix U ∈ R^{N×R}
Algorithm:
1. Update the variance matrix: C_new = x^T x + C (with x a row vector)
2. Diagonalize: C_new = U Λ U^T
3. Determine the rank R and return U
Diagonalization has to be done for every new x! (See the sketch below.)
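A minimal NumPy sketch of this update (a hedged illustration; x is treated as a column vector, so step 1 becomes an outer product):

```python
# One 1st-order DTA step: update variance, re-diagonalize, keep top R.
import numpy as np

def dta_update(x, C, R):
    C_new = C + np.outer(x, x)                # step 1: update variance matrix
    eigvals, eigvecs = np.linalg.eigh(C_new)  # step 2: full diagonalization
    top = np.argsort(eigvals)[::-1][:R]       # largest eigenvalues first
    return eigvecs[:, top], C_new             # step 3: new U, updated C
```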
1st-order STA: SPIRIT
• Adjust U smoothly when new data arrive, without diagonalization
For each new point x and for i = 1, …, k:
1. y_i := U_i^T x (projection onto U_i)
2. d_i ← λ d_i + y_i^2 (energy ∝ i-th eigenvalue)
3. e_i := x − y_i U_i (error)
4. U_i ← U_i + (1/d_i) y_i e_i (update estimate)
5. x ← x − y_i U_i (repeat with the remainder)
(A NumPy sketch follows below.)
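A minimal NumPy sketch of one SPIRIT update following the pseudocode above (a hedged illustration; the forgetting factor lambda_ and its default value are assumptions, not values from this slide):

```python
# One SPIRIT step: rotate the k tracked directions toward the new point.
import numpy as np

def spirit_update(x, U, d, lambda_=0.96):
    x = x.astype(float).copy()
    for i in range(U.shape[1]):        # for i = 1, ..., k
        y = U[:, i] @ x                # projection onto U_i
        d[i] = lambda_ * d[i] + y**2   # energy ~ i-th eigenvalue
        e = x - y * U[:, i]            # error
        U[:, i] += (y / d[i]) * e      # update estimate (no diagonalization)
        x -= y * U[:, i]               # repeat with the remainder
    return U, d
```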
Mth-order DTA: complexity
• Storage: O(∏ Ni), i.e., the size of one input tensor at a single timestamp
• Computation: Σ Ni^3 (or Ni^2) for the diagonalizations of the Cd, plus Σ Ni ∏ Ni for the matrix multiplications X(d)^T X(d)
• For low-order tensors (order < 3), diagonalization is the main cost
• For high-order tensors, matrix multiplication is the main cost
Streaming Tensor Analysis (STA)
• Run SPIRIT along each mode
• Complexity:
• Storage: O(∏ Ni)
• Computation: Σ Ri ∏ Ni, which is smaller than DTA
Experiment
• Goals: computational efficiency; accurate approximation
• Real applications: anomaly detection; clustering
Data set 1: Network data
• TCP flows collected at the CMU backbone
• Raw data: 500 GB with compression
• Construct 2nd- or 3rd-order tensors over hourly windows from <source, destination, value> or <source, destination, port, value> tuples
• Each tensor: 500×500 or 500×500×100, biased-sampled from over 22k hosts
• 1200 timestamps (hours)
• Sparse data with a power-law distribution (e.g., the window from 10AM to 11AM on 01/06/2005)
(See the construction sketch below.)
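For concreteness, a minimal sketch of assembling one hourly tensor from flow records, kept sparse as the data demand (a hedged illustration; the records and index values are hypothetical):

```python
# Aggregate <source, destination, port, value> records into a sparse tensor.
from collections import defaultdict

records = [(0, 3, 80, 1.0), (2, 1, 22, 4.0), (0, 3, 80, 2.5)]

X = defaultdict(float)                 # (src, dst, port) -> summed value
for src, dst, port, value in records:
    X[(src, dst, port)] += value

print(X[(0, 3, 80)])                   # 3.5: flows within the hour are summed
```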
Data set 2: Bibliographic data (DBLP)
• Papers from VLDB and KDD conferences
• Construct 2nd-order tensors over yearly windows from <author, keyword, num> tuples
• Each tensor: 4584×3741
• 11 timestamps (years)
Computational cost
• OTA is the offline tensor analysis
• Performance metric: CPU time (sec)
• Observations:
• DTA and STA are orders of magnitude faster than OTA
• The slight upward trend on DBLP is due to the increasing number of papers each year (the data become denser over time)
[Plots: 3rd-order network tensor; 2nd-order DBLP tensor]
Accuracy comparison
• Performance metric: the ratio of reconstruction error between DTA/STA and OTA, fixing the error of OTA at 20%
• Observation: DTA performs very close to OTA on both datasets; STA performs worse on DBLP due to the bigger changes between timestamps
[Plots: 3rd-order network tensor; 2nd-order DBLP tensor]
Network anomaly detection
• Reconstruction error over time gives an indication of anomalies (normal vs. abnormal traffic)
• The prominent difference between normal and abnormal periods is mainly due to unusual scanning activity (confirmed by the campus network admin)
Multiway LSI
• Two groups are correctly identified: databases (DB) and data mining (DM)
• People and concepts are drifting over time
Quick summary of DTA/STA • Tensor stream is a general data model • DTA/STA incrementally decompose tensors into core tensors and projection matrices • The result of DTA/STA can be used in other applications • Anomaly detection • Multiway LSI Incremental computation!
Outline • Data model • Framework • Related work • Current work • Dynamic and Streaming tensor analysis (DTA/STA) • Compact matrix decomposition (CMD) • Proposed work • Conclusion
Methodology map
[Diagram: methods organized by data type and tensor order]
Disadvantage of orthogonal projection on sparse data
• Real data are often (very) sparse
• Orthogonal projection does not preserve the sparsity in the data, which means:
• more space than the original data
• large computational cost
Interpretability problem of orthogonal projection
• Each column of the projection matrix Ui is a linear combination of all dimensions along a given mode, e.g., Ui(:,1) = [0.5; -0.5; 0.5; 0.5]
• All the data are projected onto the span of Ui
• It is hard to interpret the projections
Compact Matrix Decomposition (CMD)
• Example-based projection: use actual rows and columns of the data to specify the subspace (in contrast to orthogonal projection)
• Given a matrix A ∈ R^{m×n}, find three matrices C ∈ R^{m×c}, U ∈ R^{c×r}, R ∈ R^{r×n} such that ||A - CUR|| is small
• U is the pseudo-inverse of the intersection of C and R
CMD algorithm (high level): "CMU from 4K feet"
• Biased sampling, with replacement, of columns and rows from A
• Remove duplicates, with proper scaling (Cd → C, Rd → R)
• Construct U from C and R (the pseudo-inverse of their intersection)
CMD algorithm (low level): "CMU from 3 feet"
• Remove duplicates with proper scaling: C'i = ui^{1/2} · Ci and R'i = vi · Ri, where ui and vi are the numbers of occurrences of column Ci and row Ri
• Theorem: the deduplicated matrix C and the matrix Cd (with duplicates) have the same singular values and left singular vectors
• Proof: see [Sun06]
(A sketch of the full pipeline follows below.)
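A minimal NumPy sketch of the whole CMD pipeline as described above (a hedged approximation: the sampling probabilities and the exact construction of U follow [Sun06] only loosely):

```python
# CMD: biased sampling, duplicate removal with scaling, U from pinv.
import numpy as np

def cmd(A, c, r, rng=np.random.default_rng(0)):
    # biased sampling with replacement, probability ~ squared L2 norm
    col_p = (A**2).sum(axis=0)
    row_p = (A**2).sum(axis=1)
    cols = rng.choice(A.shape[1], size=c, replace=True, p=col_p / col_p.sum())
    rows = rng.choice(A.shape[0], size=r, replace=True, p=row_p / row_p.sum())

    # remove duplicates with scaling: C'_i = sqrt(u_i) C_i, R'_i = v_i R_i
    col_idx, u = np.unique(cols, return_counts=True)
    row_idx, v = np.unique(rows, return_counts=True)
    C = A[:, col_idx] * np.sqrt(u)
    R = A[row_idx, :] * v[:, None]

    # U: pseudo-inverse of the intersection of C and R
    U = np.linalg.pinv(C[row_idx, :])
    return C, U, R
```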
Experiment • Datasets • Performance metrics • Space ratio to the original data • CPU time (sec) • Accuracy = 1 – reconstruction error
Space efficiency
• CMD uses much less space to achieve the same accuracy
• CUR limitation: duplicate columns and rows
• SVD limitation: orthogonal projection densifies the data
[Plots: Network; DBLP]
Computational efficiency
• CMD is the fastest of the three
• CMD and CUR require an SVD only on the sampled columns
• CUR is much slower than CMD due to duplicate columns
• SVD is the slowest since it operates on the entire data
[Plots: DBLP; Network]