Data mining methods in RAVEN network Nikolai Gagunashvili (University of Akureyri, Iceland) nikolai@unak.is RAVEN Workshop Heidelberg
Data are collected and stored at enormous speeds
• Traditional techniques are infeasible for raw data
• Data mining methods combined with visualization techniques may help scientists
  • in classifying and segmenting data
  • in hypothesis formation
Data processing that can be carried out on the RAVEN network
• Unsupervised classification (cluster analysis)
• Detection of anomalies (outliers)
• Supervised classification for the selection of rare events
Algorithm 1_RAVEN: Basic K-means algorithm
• Select K points as the initial centroids at the user node and share this information across the network.
• Repeat
  • At each node, form K clusters by assigning every point to its closest centroid.
  • At each node, recompute the centroid of each cluster. The output of each node is the positions of its centroids together with the number of points assigned to each centroid.
  • Recalculate the positions of the centroids for the whole network and share this information across the network.
• Until the centroid positions no longer change.
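The steps of Algorithm 1_RAVEN can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the network is simulated as a list of per-node point lists, distances are Euclidean, and the names `node_step`, `merge` and `distributed_kmeans` are hypothetical, not part of any RAVEN software.

```python
import math

def node_step(points, centroids):
    """Work done at one node: assign local points to the closest centroid and
    return the local centroid positions with the number of points in each."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    # A cluster with no local points has no local centroid (None).
    local = [tuple(s / counts[j] for s in sums[j]) if counts[j] else None
             for j in range(k)]
    return local, counts

def merge(node_outputs, old_centroids):
    """Network-wide centroids: count-weighted average of the node centroids."""
    dim = len(old_centroids[0])
    new = []
    for j, old in enumerate(old_centroids):
        total = sum(counts[j] for _, counts in node_outputs)
        if total == 0:
            new.append(old)  # assumption: an empty cluster keeps its centroid
            continue
        new.append(tuple(
            sum(local[j][d] * counts[j]
                for local, counts in node_outputs if counts[j]) / total
            for d in range(dim)))
    return new

def distributed_kmeans(nodes, centroids, max_iter=100, tol=1e-9):
    """Repeat node steps and merges until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        outputs = [node_step(points, centroids) for points in nodes]
        new = merge(outputs, centroids)
        if all(math.dist(a, b) <= tol for a, b in zip(new, centroids)):
            return new
        centroids = new
    return centroids

# Two simulated nodes, two clusters, initial centroids chosen by the "user node".
nodes = [[(0, 0), (0, 1), (10, 10)],
         [(1, 0), (1, 1), (10, 11), (11, 10), (11, 11)]]
print(distributed_kmeans(nodes, [(0, 0), (10, 10)]))
```

Only centroid positions and counts leave a node in this sketch, so the raw points never have to be moved across the network.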
Anomaly/Outlier Detection
• What are anomalies/outliers?
  • The set of data points that are considerably different from the remainder of the data
• Variants of anomaly/outlier detection problems
  • Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t
  • Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x)
  • Given a database D containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
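The first two problem variants differ only in how the scores are used. A minimal sketch, assuming the anomaly score f(x) is supplied by the caller as a Python function (the helper names `above_threshold` and `top_n` and the toy score are hypothetical):

```python
import heapq

def above_threshold(data, score, t):
    """Variant 1: all points whose anomaly score exceeds the threshold t."""
    return [x for x in data if score(x) > t]

def top_n(data, score, n):
    """Variant 2: the n points with the largest anomaly scores."""
    return heapq.nlargest(n, data, key=score)

# Toy example: the score is the distance from a reference value of 1.0.
data = [1.0, 1.1, 0.9, 1.2, 5.0]
score = lambda x: abs(x - 1.0)
print(above_threshold(data, score, 1.0))
print(top_n(data, score, 2))
```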
Clustering-based algorithm of anomaly detection
The outlier score is defined as a relative distance: the ratio of the distance of a point from its closest centroid to the median distance of all points in that cluster from the centroid.
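In symbols, the relative distance can be written as follows. The notation is an assumption and does not come from the slides: c(x) is the centroid closest to x, C(x) is the set of points assigned to that centroid, and d is the distance measure.

```latex
\[
  \mathrm{score}(x) \;=\;
  \frac{d\bigl(x,\, c(x)\bigr)}
       {\mathrm{median}_{\,y \in C(x)}\; d\bigl(y,\, c(x)\bigr)}
\]
```

A point at the typical distance from its centroid has a score close to 1, and larger scores indicate stronger outlier candidates.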
Clustering-based algorithm of anomaly detection
Algorithm 2: Basic clustering-based algorithm of anomaly detection
• Find the positions of the K centroids (Algorithm 1)
• Find the median distance for each centroid
• Calculate the score of each point: (distance of the point to its nearest centroid)/(median distance for that nearest centroid)
• Order the scores to define outliers.
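Algorithm 2 can be sketched as follows for data held in one place. The sketch assumes Euclidean distance and centroids that have already been found; the function names are hypothetical, and the handling of a zero median (score 0 for a point on the centroid, infinity otherwise) is a choice made here, not something the slides specify.

```python
import math
from statistics import median

def outlier_scores(points, centroids):
    """Score = distance to nearest centroid / median distance in that cluster."""
    nearest = []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        j = min(range(len(dists)), key=dists.__getitem__)
        nearest.append((j, dists[j]))
    # Median distance of the points assigned to each centroid.
    med = {}
    for j in range(len(centroids)):
        cluster = [d for c, d in nearest if c == j]
        if cluster:
            med[j] = median(cluster)
    scores = []
    for j, d in nearest:
        if med[j] > 0:
            scores.append(d / med[j])
        else:  # assumption: degenerate cluster with zero median distance
            scores.append(0.0 if d == 0 else math.inf)
    return scores

def rank_outliers(points, centroids):
    """Points ordered by decreasing score; the first entries are the outliers."""
    return sorted(zip(outlier_scores(points, centroids), points), reverse=True)

points = [(0, 0), (0, 2), (2, 0), (2, 2), (1, 9)]
for score, point in rank_outliers(points, [(1, 1)]):
    print(point, round(score, 3))
```

Where to cut the ordered list (a threshold on the score or a fixed number of points) is left to the user, as in the problem variants above.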
Clustering-based algorithm of anomaly detection
Algorithm 2_RAVEN: Basic clustering-based algorithm of anomaly detection
• Find the positions of the K centroids (Algorithm 1_RAVEN)
• Find the median distance for each centroid over the whole network. A histogram-based method can be used for that.
• Calculate the score of each point: (distance of the point to its nearest centroid)/(median distance for that nearest centroid)
• Order the scores at each node.
• Take the highest scores from the nodes and find the points with the highest scores for the whole network
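One possible reading of the histogram-based step is sketched below: every node fills a distance histogram per centroid with the same binning, the histograms are added up, and the median is read off the merged histogram. The binning (`n_bins`, `max_dist`), the linear interpolation inside the median bin and all function names are assumptions made for illustration; the network is again simulated as a list of per-node point lists.

```python
import heapq
import math

def nearest(p, centroids):
    """Index of the closest centroid and the (Euclidean) distance to it."""
    dists = [math.dist(p, c) for c in centroids]
    j = min(range(len(dists)), key=dists.__getitem__)
    return j, dists[j]

def node_histograms(points, centroids, n_bins, max_dist):
    """One node: per-centroid histogram of distances, same binning everywhere."""
    hists = [[0] * n_bins for _ in centroids]
    width = max_dist / n_bins
    for p in points:
        j, d = nearest(p, centroids)
        hists[j][min(int(d / width), n_bins - 1)] += 1  # overflow -> last bin
    return hists

def histogram_median(hist, max_dist):
    """Approximate median from a histogram, interpolating inside the bin."""
    total = sum(hist)
    if total == 0:
        return None
    width, half, seen = max_dist / len(hist), total / 2, 0
    for b, n in enumerate(hist):
        if n and seen + n >= half:
            return (b + (half - seen) / n) * width
        seen += n

def network_top_outliers(nodes, centroids, top_n, n_bins=100, max_dist=10.0):
    # Merge the node histograms and estimate one median per centroid.
    merged = [[0] * n_bins for _ in centroids]
    for points in nodes:
        hists = node_histograms(points, centroids, n_bins, max_dist)
        for j, hist in enumerate(hists):
            for b, n in enumerate(hist):
                merged[j][b] += n
    medians = [histogram_median(h, max_dist) for h in merged]
    # Each node scores its own points and reports only its top_n candidates.
    candidates = []
    for points in nodes:
        scored = []
        for p in points:
            j, d = nearest(p, centroids)
            scored.append((d / medians[j], p))
        candidates.extend(heapq.nlargest(top_n, scored))
    # The user node keeps the overall top_n.
    return heapq.nlargest(top_n, candidates)

nodes = [[(0, 0), (0, 2), (2, 0)], [(2, 2), (1, 9)]]
print(network_top_outliers(nodes, [(1, 1)], top_n=1))
```

Because each node reports only its own highest-scoring points, the network-wide top-n list can be assembled without shipping all scores to the user node. The accuracy of the median is limited by the bin width.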
Clustering-based detection: strengths and weaknesses
• Some clustering techniques, such as K-means, have linear or close to linear time and space complexity.
• It is possible to find both clusters and outliers at the same time.
• The set of outliers produced and their scores can depend heavily on the number of clusters as well as on the presence of outliers in the data.
• The quality of the outliers produced by a clustering is heavily affected by the quality of the clusters produced by the algorithm.
• The clustering algorithm needs to be chosen carefully.
Supervised classification for the selection of rare events
In its analysis, LHCb focuses on very specific decay modes of B mesons that are sensitive to quantum effects caused by as yet undiscovered heavy particles (“New Physics”).
▪ Relative frequencies of B-decays: 10^-4 to 10^-6.
▪ For every B-decay inside an event there are about 5-10 times as many tracks from non-B-decays.
The success of LHCb and the other LHC experiments therefore depends critically on the availability of sufficiently powerful analysis tools. The main research must address the problem of finding signals in a huge background.
Typical imbalanced problems are credit card fraud detection and AIDS tests. In these problems it is important to detect all rare (signal) instances if possible. In particle selection, on the other hand, the performance criterion is the signal-to-noise ratio or the significance, as is typical for detector devices. The second difference is that data mining uses real cases, whereas particle physics uses simulated training and test samples. Classification algorithms must therefore be robust with respect to differences in properties between simulated and real data.
Example of selection algorithm
• IPpi, DoCA, IP, IPp, ptpi: attributes of the event
• D0, BG: class of the event
Background/signal ~ 3000
• The traditional method of selection in data analysis is cut-based
• The selection algorithms used methods of supervised classification
Britsch, XVII International Workshop on Deep-Inelastic Scattering and Related Subjects, 2009, Madrid
The precision of measurements can be improved
• The new method can help find particles that cannot be found by the old method.
• The trigger can be organized to register only events of interest for a particular physics analysis.
• The new method can find application in the detection of complex events, for example in radar, sonar and lidar techniques.