Ph.D. Thesis Proposal: Incremental Pattern Discovery on Streams, Graphs and Tensors
Jimeng Sun
May 15, 2006
Thesis Committee • Christos Faloutsos (Chair) • Tom Mitchell • Hui Zhang • David Steier, PricewaterhouseCoopers • Philip Yu, IBM Watson Research Center
Thesis Proposal • Goal: incremental pattern discovery on streaming applications • Streams: • E1: Environmental sensor networks • E2: Cluster/data center monitoring • Graphs: • E3: Social network analysis • Tensors: • E4: Network forensics • E5: Financial auditing • E6: fMRI: Brain image analysis • How to summarize streaming data efficiently and incrementally?
E1: Environmental Sensor Monitoring
(with Prof. Jeanne M. VanBriesen, CMU Civil Engineering Department)
[Plot: chlorine concentrations across a water distribution network over three phases, comparing sensors near a leak with sensors away from the leak, during normal operation and during a major leak]
May have hundreds of measurements, and they are often related!
E1: Environmental Sensor Monitoring with SPIRIT
[Plot: the n actual measurement streams (chlorine concentrations) summarized by k = 1-2 hidden variables]
We would like to discover a few "hidden (latent) variables" that summarize the key trends
E3: Social network analysis
• Traditionally, people focus on static networks and finding community structures
• We plan to monitor how the community structure changes over time and to identify abnormal individuals
E4: Network forensics
• Directional network flows (source × destination)
• A large ISP with 100 POPs, each POP with 10 Gbps link capacity [Hotnets2004]
• 450 GB/hour even with compression
• Task: identify abnormal traffic patterns (vs. normal traffic) and find out the cause
Collaboration with Prof. Hui Zhang and Dr. Yinglian Xie
Commonality of all • Data: continuously arriving • Large volume • Multi-dimensional • Unlabeled • Task: incremental pattern discovery • Main trends • Anomalies
Thesis statement • Incremental and efficient summarization of heterogeneous streaming data through a general and concise representation enables many real applications in different domains.
Outline • Motivating examples • Data model and mining framework • Related work • Current work • Proposed work • Conclusion
Static Data Model
• Tensor
• Formally, an Mth-order tensor is X ∈ R^{N1 × N2 × … × NM}
• Generalization of matrices
• Represented as a multi-array or data cube
Dynamic Data Model (our focus)
• Tensor streams
• A sequence of Mth-order tensors X1, …, Xn, where n increases over time
• Example: one author × keyword tensor per time step
Our framework for incremental pattern discovery
Mining flow: Data Streams → Preprocessing → Tensor Streams → Tensor Analysis → Projections and Core tensors → Application Modules (Anomaly Detection, Clustering, Prediction)
Outline • Motivating examples • Data model and mining framework • Related work • Current work • Proposed work • Conclusion
Related work
[Diagram: positioning our work relative to prior approaches]
Background: Singular Value Decomposition (SVD)
• SVD: A ≈ U Σ V^T, where A is m × n, U is m × k, Σ is k × k, and V is n × k
• Best rank-k approximation in L2
• PCA is an important application of SVD
• Note that U and V are dense and may have negative entries (see the sketch below)
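To make the decomposition concrete, here is a minimal NumPy sketch of the rank-k approximation (a hedged illustration; the matrix A and rank k are made-up examples, not data from this work):

```python
# Best rank-k approximation of A via truncated SVD (illustrative data).
import numpy as np

A = np.random.rand(100, 50)                 # m x n data matrix
k = 5                                       # target rank

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] * s[:k] @ Vt[:k, :]          # best rank-k approximation in L2

# U and V are typically dense and may contain negative entries,
# even when A itself is sparse and non-negative.
print(np.linalg.norm(A - A_k), (Vt < 0).any())
```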
Background – Latent Semantic Indexing (LSI)
• Singular vectors are useful for clustering
• A document-term matrix factors as (document-concept) × (concept-association) × (concept-term)
• Example: DB and DM documents map to separate concepts, each tied to its own terms (e.g., "cluster", "cache" vs. "frequent", "query pattern")
Background: Tensor Operations
• Matricizing: unfold a tensor into a matrix
• Example: a source × destination × port tensor unfolds along the source mode into a source × (destination·port) matrix (see the sketch below)
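A minimal NumPy sketch of matricizing (a hedged illustration; the small source × destination × port tensor is hypothetical):

```python
# Unfold (matricize) a tensor along a chosen mode.
import numpy as np

X = np.random.rand(4, 3, 2)        # source x destination x port

def matricize(X, mode):
    """Rows index `mode`; columns index the remaining modes combined."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

X0 = matricize(X, 0)               # source x (destination*port)
print(X0.shape)                    # (4, 6)
```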
Background: Tensor Operations
• Mode-product: multiply a tensor with a matrix along one mode
• Example: multiplying a source × destination × port tensor by a source-to-"group" matrix aggregates the sources into groups (see the sketch below)
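A minimal NumPy sketch of the mode-product (a hedged illustration; the source-to-"group" matrix U is a made-up example):

```python
# Multiply tensor X by matrix U along a chosen mode (mode-product).
import numpy as np

X = np.random.rand(4, 3, 2)    # source x destination x port
U = np.random.rand(2, 4)       # maps 4 sources to 2 "groups"

def mode_product(X, U, mode):
    out = np.tensordot(U, X, axes=(1, mode))  # new dimension lands at axis 0
    return np.moveaxis(out, 0, mode)          # move it back into place

Y = mode_product(X, U, 0)
print(Y.shape)                 # (2, 3, 2): sources collapsed into groups
```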
Outline • Data model • Framework • Related work • Current work • Dynamic and Streaming tensor analysis (DTA/STA) • Compact matrix decomposition (CMD) • Proposed work • Conclusion
Methodology map
[Diagram: methods organized by data type and tensor order]
Tensor analysis
Given a sequence of tensors X1, …, Xt, find the projection matrices U1, …, UM such that the reconstruction error e is minimized (one plausible formulation is sketched below).
Note that this is a generalization of PCA when n is a constant (the first-order case is exactly PCA).
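One plausible way to write this objective, assuming (as a reconstruction, not a verbatim formula from the slides) that the projections are applied as mode-products with Ud Ud^T:

```latex
e = \sum_{t=1}^{n} \left\| \mathcal{X}_t - \mathcal{X}_t
      \times_1 (U_1 U_1^{T}) \times_2 (U_2 U_2^{T})
      \cdots \times_M (U_M U_M^{T}) \right\|_F^{2}
```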
Why do we care?
• Anomaly detection: reconstruction-error driven, at multiple resolutions
• Multiway latent semantic indexing (LSI): e.g., tracking authors (Philip Yu, Michael Stonebraker) and keywords ("Pattern", "Query") over time
1st-order DTA: problem
Given x1, …, xn, where each xi ∈ R^N, find U ∈ R^{N×R} such that the error e is small.
Note that Y = XU.
[Illustration: N sensor streams (indoor, outdoor) over n time steps projected onto R dimensions]
1st-order DTA
Input: new data vector x ∈ R^N, old variance matrix C ∈ R^{N×N}
Output: new projection matrix U ∈ R^{N×R}
Algorithm:
1. Update the variance matrix: C_new = x^T x + C (with x a row vector)
2. Diagonalize: C_new = U Λ U^T
3. Determine the rank R and return U
Diagonalization has to be done for every new x! (See the sketch below.)
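A minimal NumPy sketch of this update (a hedged illustration; x is treated as a column vector, so step 1 becomes an outer product):

```python
# One 1st-order DTA step: update variance, re-diagonalize, keep top R.
import numpy as np

def dta_update(x, C, R):
    C_new = C + np.outer(x, x)                # step 1: update variance matrix
    eigvals, eigvecs = np.linalg.eigh(C_new)  # step 2: full diagonalization
    top = np.argsort(eigvals)[::-1][:R]       # largest eigenvalues first
    return eigvecs[:, top], C_new             # step 3: new U, updated C
```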
1st-order STA: SPIRIT
• Adjust U smoothly when new data arrive, without diagonalization
For each new point x and for i = 1, …, k:
1. y_i := U_i^T x (projection onto U_i)
2. d_i ← λ d_i + y_i^2 (energy ∝ i-th eigenvalue)
3. e_i := x − y_i U_i (error)
4. U_i ← U_i + (1/d_i) y_i e_i (update estimate)
5. x ← x − y_i U_i (repeat with the remainder)
(A NumPy sketch follows below.)
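A minimal NumPy sketch of one SPIRIT update following the pseudocode above (a hedged illustration; the forgetting factor lambda_ and its default value are assumptions, not values from this slide):

```python
# One SPIRIT step: rotate the k tracked directions toward the new point.
import numpy as np

def spirit_update(x, U, d, lambda_=0.96):
    x = x.astype(float).copy()
    for i in range(U.shape[1]):        # for i = 1, ..., k
        y = U[:, i] @ x                # projection onto U_i
        d[i] = lambda_ * d[i] + y**2   # energy ~ i-th eigenvalue
        e = x - y * U[:, i]            # error
        U[:, i] += (y / d[i]) * e      # update estimate (no diagonalization)
        x -= y * U[:, i]               # repeat with the remainder
    return U, d
```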
Mth-order DTA: complexity
• Storage: O(∏ Ni), i.e., the size of one input tensor at a single timestamp
• Computation: Σ Ni^3 (or Ni^2) for the diagonalizations of the Cd, plus Σ Ni ∏ Ni for the matrix multiplications X(d)^T X(d)
• For low-order tensors (order < 3), diagonalization is the main cost
• For high-order tensors, matrix multiplication is the main cost
Streaming Tensor Analysis (STA)
• Run SPIRIT along each mode
• Complexity:
• Storage: O(∏ Ni)
• Computation: Σ Ri ∏ Ni, which is smaller than DTA
Experiment
• Goals: computational efficiency; accurate approximation
• Real applications: anomaly detection; clustering
Data set 1: Network data
• TCP flows collected at the CMU backbone
• Raw data: 500 GB with compression
• Construct 2nd- or 3rd-order tensors over hourly windows from <source, destination, value> or <source, destination, port, value> tuples
• Each tensor: 500×500 or 500×500×100, biased-sampled from over 22k hosts
• 1200 timestamps (hours)
• Sparse data with a power-law distribution (e.g., the window from 10AM to 11AM on 01/06/2005)
(See the construction sketch below.)
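For concreteness, a minimal sketch of assembling one hourly tensor from flow records, kept sparse as the data demand (a hedged illustration; the records and index values are hypothetical):

```python
# Aggregate <source, destination, port, value> records into a sparse tensor.
from collections import defaultdict

records = [(0, 3, 80, 1.0), (2, 1, 22, 4.0), (0, 3, 80, 2.5)]

X = defaultdict(float)                 # (src, dst, port) -> summed value
for src, dst, port, value in records:
    X[(src, dst, port)] += value

print(X[(0, 3, 80)])                   # 3.5: flows within the hour are summed
```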
Data set 2: Bibliographic data (DBLP)
• Papers from VLDB and KDD conferences
• Construct 2nd-order tensors over yearly windows from <author, keyword, num> tuples
• Each tensor: 4584×3741
• 11 timestamps (years)
Computational cost
• OTA is the offline tensor analysis
• Performance metric: CPU time (sec)
• Observations:
• DTA and STA are orders of magnitude faster than OTA
• The slight upward trend on DBLP is due to the increasing number of papers each year (the data become denser over time)
[Plots: 3rd-order network tensor; 2nd-order DBLP tensor]
Accuracy comparison
• Performance metric: the ratio of reconstruction error between DTA/STA and OTA, fixing the error of OTA at 20%
• Observation: DTA performs very close to OTA on both datasets; STA performs worse on DBLP due to the bigger changes between timestamps
[Plots: 3rd-order network tensor; 2nd-order DBLP tensor]
Network anomaly detection
• Reconstruction error over time gives an indication of anomalies (normal vs. abnormal traffic)
• The prominent difference between normal and abnormal periods is mainly due to unusual scanning activity (confirmed by the campus network admin)
Multiway LSI
• Two groups are correctly identified: databases (DB) and data mining (DM)
• People and concepts are drifting over time
Quick summary of DTA/STA • Tensor stream is a general data model • DTA/STA incrementally decompose tensors into core tensors and projection matrices • The result of DTA/STA can be used in other applications • Anomaly detection • Multiway LSI Incremental computation!
Outline • Data model • Framework • Related work • Current work • Dynamic and Streaming tensor analysis (DTA/STA) • Compact matrix decomposition (CMD) • Proposed work • Conclusion
Methodology map
[Diagram: methods organized by data type and tensor order]
Disadvantage of orthogonal projection on sparse data
• Real data are often (very) sparse
• Orthogonal projection does not preserve the sparsity in the data, which means:
• more space than the original data
• large computational cost
Interpretability problem of orthogonal projection
• Each column of the projection matrix Ui is a linear combination of all dimensions along a given mode, e.g., Ui(:,1) = [0.5; -0.5; 0.5; 0.5]
• All the data are projected onto the span of Ui
• It is hard to interpret the projections
Compact Matrix Decomposition (CMD)
• Example-based projection: use actual rows and columns of the data to specify the subspace (in contrast to orthogonal projection)
• Given a matrix A ∈ R^{m×n}, find three matrices C ∈ R^{m×c}, U ∈ R^{c×r}, R ∈ R^{r×n} such that ||A - CUR|| is small
• U is the pseudo-inverse of the intersection of C and R
CMD algorithm (high level): "CMU from 4K feet"
• Biased sampling, with replacement, of columns and rows from A
• Remove duplicates, with proper scaling (Cd → C, Rd → R)
• Construct U from C and R (the pseudo-inverse of their intersection)
CMD algorithm (low level): "CMU from 3 feet"
• Remove duplicates with proper scaling: C'i = ui^{1/2} · Ci and R'i = vi · Ri, where ui and vi are the numbers of occurrences of column Ci and row Ri
• Theorem: the deduplicated matrix C and the matrix Cd (with duplicates) have the same singular values and left singular vectors
• Proof: see [Sun06]
(A sketch of the full pipeline follows below.)
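A minimal NumPy sketch of the whole CMD pipeline as described above (a hedged approximation: the sampling probabilities and the exact construction of U follow [Sun06] only loosely):

```python
# CMD: biased sampling, duplicate removal with scaling, U from pinv.
import numpy as np

def cmd(A, c, r, rng=np.random.default_rng(0)):
    # biased sampling with replacement, probability ~ squared L2 norm
    col_p = (A**2).sum(axis=0)
    row_p = (A**2).sum(axis=1)
    cols = rng.choice(A.shape[1], size=c, replace=True, p=col_p / col_p.sum())
    rows = rng.choice(A.shape[0], size=r, replace=True, p=row_p / row_p.sum())

    # remove duplicates with scaling: C'_i = sqrt(u_i) C_i, R'_i = v_i R_i
    col_idx, u = np.unique(cols, return_counts=True)
    row_idx, v = np.unique(rows, return_counts=True)
    C = A[:, col_idx] * np.sqrt(u)
    R = A[row_idx, :] * v[:, None]

    # U: pseudo-inverse of the intersection of C and R
    U = np.linalg.pinv(C[row_idx, :])
    return C, U, R
```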
Experiment • Datasets • Performance metrics • Space ratio to the original data • CPU time (sec) • Accuracy = 1 – reconstruction error
Space efficiency
• CMD uses much less space to achieve the same accuracy
• CUR limitation: duplicate columns and rows
• SVD limitation: orthogonal projection densifies the data
[Plots: Network; DBLP]
Computational efficiency
• CMD is the fastest of the three
• CMD and CUR require an SVD only on the sampled columns
• CUR is much slower than CMD due to duplicate columns
• SVD is the slowest since it operates on the entire data
[Plots: DBLP; Network]