Data Stream Mining Assignment 2
Enhanced Situation Space Mining for Data Streams
Rishabh Upadhyay, Fr. Conceicao Rodrigues College of Engineering, Mumbai, India, uhrishabh@gmail.com
Sivan Toledo, Tel-Aviv University, Tel-Aviv, Israel, stoledo@tau.ac.il
Yisroel Mirsky, Ben-Gurion University, Beer-Sheva, Israel, yisroel@post.bgu.ac.il
Tal Halpern, Tel-Aviv University, Tel-Aviv, Israel, talhalpern10@gmail.com
Yuval Elovici, Ben-Gurion University, Beer-Sheva, Israel, elovici@bgu.ac.il
Presented by Pooja Joshi, Waikato University, Hamilton, New Zealand, pj60@students.Waikato.ac.nz
Abstract
• pcStream
  • An algorithm that extracts knowledge of the present situation from a data stream.
  • A machine learning algorithm that finds contexts (concepts) in a numerical stream in an unsupervised manner.
• Drawbacks of pcStream
  • Complexity due to Principal Component Analysis (PCA)
  • Situation overlap
• pcStream2: a variant of pcStream
  • Incremental PCA (IPCA) to reduce complexity and memory requirements
  • Just-In-Time PCA (JIT-PCA): the algorithm used to implement IPCA
Introduction
• Context Space Theory (CST)
  • CST is applied to infer the actor's situation from a given data stream.
  • CST is used in many context-aware applications.
• What is a context?
  • A point in an n-dimensional space (the context space)
• Drawbacks of CST
  • Situation spaces must be defined manually.
  • Situation space mining is difficult because:
    • data streams are unbounded
    • data streams are subject to concept drift
Figure 1 - An illustration of a context domain for activity recognition consisting of two situation spaces: walking and running (c1 and c2).
pcStream
Pseudocode for pcStream
• Step 1: When an instance X arrives, compute the statistical similarity (Mahalanobis distance) between X and each known context.
• Step 2: If X is within the distribution of a known context, assign X to that context.
  • Otherwise, if X does not fit any context, place X in buffer B, which spans tmin observations.
  • If an observation arriving after X is not placed in B before tmin is reached, B is labeled as noise and is emptied.
  • Otherwise, X is assigned to the current distribution.
  • If B fills up, a new situation space has been found and B is emptied.
Algorithm 1 - pcStream algorithm
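The steps above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: each context is summarized by a per-feature mean and variance (a diagonal-covariance Mahalanobis distance) instead of a PCA model, a new context is fitted from the buffer contents when the buffer fills, and the names `Context`, `pcstream_sketch`, `phi` (the fit threshold), and `seed_points` (used to bootstrap the first context) are assumptions introduced here.

```python
import math
from statistics import fmean, pvariance


class Context:
    """A context summarized by per-feature mean and variance (simplification)."""

    def __init__(self, points):
        self.points = list(points)
        self._refit()

    def _refit(self):
        cols = list(zip(*self.points))
        self.mean = [fmean(c) for c in cols]
        # A small floor keeps the distance finite for constant features.
        self.var = [max(pvariance(c), 1e-9) for c in cols]

    def distance(self, x):
        # Mahalanobis distance under a diagonal-covariance assumption.
        return math.sqrt(sum((xi - m) ** 2 / v
                             for xi, m, v in zip(x, self.mean, self.var)))

    def add(self, x):
        # Refits from scratch; fine for a sketch, too slow for a real stream.
        self.points.append(x)
        self._refit()


def pcstream_sketch(stream, seed_points, phi=3.0, t_min=5):
    contexts = [Context(seed_points)]
    buffer = []   # (stream index, instance) pairs that fit no known context
    labels = []   # context index per instance, None for noise
    for i, x in enumerate(stream):
        dists = [c.distance(x) for c in contexts]
        best = min(range(len(contexts)), key=dists.__getitem__)
        if dists[best] <= phi:
            # Step 2: X fits a known context, so assign it there.
            contexts[best].add(x)
            labels.append(best)
            buffer.clear()            # an interrupted run is treated as noise
        else:
            labels.append(None)
            buffer.append((i, x))
            if len(buffer) == t_min:  # buffer full: a new situation space
                contexts.append(Context([p for _, p in buffer]))
                for j, _ in buffer:
                    labels[j] = len(contexts) - 1
                buffer.clear()
    return contexts, labels
```

Calling `pcstream_sketch(stream, seed_points)` returns the list of contexts and one label per instance; instances that were buffered and later discarded keep the label `None`.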
Drawbacks of the pcStream algorithm
• Detecting new overlapping situation spaces
  • Similarity score: Mahalanobis distance
• Algorithm complexity
  • Uses Principal Component Analysis (PCA)
  • Algorithm complexity is O(n^3)
• Issues overcome by pcStream2
  • Windowing overcomes overlapping situation spaces.
  • Incremental PCA (JIT-PCA) is used instead of PCA.
Figure 3 - An illustration of the issues with detecting overlapping situation spaces from a data stream generated from a smartphone accelerometer. Here the ground truth is activity recognition.
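For reference, the similarity score named on this slide can be written out as follows. The symbols (mean \mu, covariance \Sigma, eigenvectors p_i, eigenvalues \lambda_i) are notation chosen here for illustration and do not come from the slides; the second form assumes that all n principal components are kept and that every \lambda_i is positive.

```latex
d(x) = \sqrt{(x - \mu)^{\top} \Sigma^{-1} (x - \mu)}
     = \sqrt{\sum_{i=1}^{n} \frac{\left(p_i^{\top}(x - \mu)\right)^{2}}{\lambda_i}},
\qquad \Sigma = P \Lambda P^{\top}
```

Obtaining P and \Lambda through a full eigendecomposition of the n × n covariance matrix takes O(n^3) time, which is presumably the cost the slide refers to if n is read as the number of features.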
pcStream2
• Two changes
  • Persistence
    • Before assigning X to its closest context, first check whether d(X) < threshold.
  • Windowing
    • Consider only the latest tmin observations.
• Three stages
  • Push
    • When an instance X arrives, it is pushed into buffer B.
  • Process
    • When X pops out of B (because |B| > tmin), process X.
  • Detect
    • When B is full, check for any newly evolving situation.
Algorithm 2 – pcStream2 algorithm
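One possible reading of the push, process, and detect stages is a sliding window of tmin instances, sketched below. The function name `run_windowed` and the `process` and `detect` callbacks are hypothetical placeholders; the slides do not specify what either stage computes, so the sketch only shows when each stage would fire.

```python
from collections import deque


def run_windowed(stream, t_min, process, detect):
    """Drive hypothetical process/detect callbacks over a sliding window."""
    window = deque()
    for x in stream:
        window.append(x)                 # Push: every arrival enters the buffer
        if len(window) > t_min:
            process(window.popleft())    # Process: the instance leaving the buffer
        if len(window) == t_min:
            detect(list(window))         # Detect: inspect the full buffer
```

With this arrangement each instance is processed only after tmin newer instances have arrived, so the detect stage always sees the most recent tmin observations.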
pcStream & IPCA
• Using IPCA instead of PCA has the following advantages:
  • Reduced complexity
  • Reduced memory consumption
  • Damped window
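The damped-window and memory points can be illustrated with an exponentially damped running mean and variance, sketched below. This is not IPCA itself (it tracks per-feature statistics only and does not update principal components), and the class name `DampedStats` and the forgetting factor `alpha` are assumptions introduced for this example.

```python
class DampedStats:
    """Exponentially damped per-feature mean and variance (illustration only)."""

    def __init__(self, dim, alpha=0.05):
        self.alpha = alpha          # hypothetical forgetting factor in (0, 1)
        self.mean = [0.0] * dim
        self.var = [0.0] * dim
        self.seen = False

    def update(self, x):
        if not self.seen:
            # The first observation initializes the mean; variance stays 0.
            self.mean = list(x)
            self.seen = True
            return
        a = self.alpha
        for i, xi in enumerate(x):
            delta = xi - self.mean[i]
            self.mean[i] += a * delta
            # Older observations are down-weighted by (1 - a) at every step.
            self.var[i] = (1 - a) * (self.var[i] + a * delta * delta)
```

The state is a fixed number of values per feature regardless of how many observations have been seen, which is the kind of memory saving an incremental method provides over storing the observations and recomputing from scratch.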
JIT-PCA
• JIT-PCA is a heuristic, randomized incremental PCA.
• It combines ideas from previous literature to build an effective and fast algorithm:
  • QR-based update formula
    • Least updating cost when delta = 0
  • Randomized sketching algorithms
    • Compute the total mass and average
    • Compute the probability used to decide whether to use the orthogonal part in the update model
  • Relaxation mode: wait until the update has stabilized.
Algorithm 3 – JIT-PCA algorithm
Experimental Results
Datasets used for evaluation
Table 1: Summaries of the three datasets used for the evaluations.
Parameters used for evaluation
Table 2: The parameters used in the evaluations over each dataset.
Experimental Results (cont'd)
1. Adjusted Rand Index (ARI)
Figure 3: The best ARI achieved by pcStream and pcStream2 (left), and by pcStream with PCA and JIT-PCA (right), for each dataset.
2. Parameter selection robustness
Figure 4: The resulting ARI for every parameter selection for both pcStream and pcStream2 over the SherLock dataset.
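For readers unfamiliar with the metric, the ARI compares two labelings of the same instances and corrects the pair-agreement count for chance. The sketch below follows the standard contingency-table definition using only the Python standard library; the function name is chosen here, and the handling of degenerate inputs (returning 1.0) is a convention assumed for this example rather than something stated in the slides.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, predicted):
    """ARI between two labelings of the same instances."""
    if len(truth) != len(predicted):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(truth), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two instances: nothing to disagree on
    # Pairs that share a cell of the contingency table agree in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, predicted)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(predicted).values())
    expected = sum_rows * sum_cols / total_pairs   # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings form a single cluster
    return (sum_cells - expected) / (max_index - expected)
```

A score of 1 means the detected contexts match the ground truth up to a renaming of labels, values near 0 indicate chance-level agreement, and negative values indicate agreement below chance.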
Experimental Results (cont'd)
1. Accuracy and runtime (PCA vs. IPCA)
Figure 5: A comparison of accuracy (top row) and runtime (bottom row) when using PCA and IPCA with different pcStream parameters over the SCA dataset.
2. Runtime evaluation (PCA vs. JIT-PCA)
Figure 6: The runtimes of pcStream with PCA and JIT-PCA for each dataset.
Experimental Results (cont'd)
1. Feature evaluation (PCA vs. JIT-PCA) on the KDD dataset
Figure 7: The effect that the number of dimensions has on pcStream's runtime on the KDD dataset with PCA and JIT-PCA, respectively. The bars represent the standard deviation.