10 likes | 126 Views
Event Metadata Records as a Testbed for Scalable Data Mining. David Malon, Peter van Gemmeren (Argonne National Laboratory). 1. Abstract.
E N D
Event Metadata Records as a Testbed for Scalable Data Mining David Malon, Peter van Gemmeren (Argonne National Laboratory) 1. Abstract At a data rate of 200 hertz, event metadata records ("TAGs," in ATLAS parlance) provide fertile grounds for development and evaluation of tools for scalable data mining. It is easy, of course, to apply HEP-specific selection or classification rules to event records and to label such an exercise "data mining," but our interest is different. Advanced statistical methods and tools such as classification, association rule mining, and cluster analysis are common outside the high energy physics community. These tools can prove useful, not for discovery physics, but for learning about our data, our detector, and our software. A fixed and relatively simple schema makes TAG export to other storage technologies such as HDF5 straightforward. This simplifies the task of exploiting very-large-scale parallel platforms such as Argonne National Laboratory's BlueGene/P, currently the largest supercomputer in the world for open science, in the development of scalable tools for data mining. Using a domain-neutral scientific data format may also enable us to take advantage of existing data mining components from other communities. There is, further, a substantial literature on the topic of one-pass algorithms and stream mining techniques, and such tools may be inserted naturally at various points in the event data processing and distribution chain. This paper describes early experience with event metadata records from ATLAS simulation and commissioning as a testbed for scalable data mining tool development and evaluation. 2. Details Reconstruction AOD merging TAG production Primary DPD making DPD making TAG scale is attractive as a data mining testbed First steps: neutral data format (HDF5) DNPD Data D1PD Data Event Filter Raw Data ESD Data AOD Data • ATLAS TAG format is relatively simple, readily mapped to other technologies • a few TAG attributes depend upon run-dependent encoding, though, so using _every_ attribute for potential mining poses challenges • but mining every attribute is not essential for most purposes • current TAG storage technology in ATLAS is ROOT via POOL Collections, Oracle via POOL, MySQL via POOL (limited) • not well suited to use of off-the-shelf mining tools, or to porting to high-performance hardware • HDF5 is a unique technology suite that makes possible the management of extremely large and complex data collections. • some data mining tools, both open-source and proprietary, can read HDF5 (e.g. rattle, mathworks) TAG TAG • 2 x 109 events per year (real data) • equivalent amount of Monte Carlo • have more than 0.5 x 109 tags coming from Monte Carlo already in the next months • also millions of TAGs from cosmic ray commissioning • content simpler, smaller than that of physics TAGs TAG extract/copy Tier 0 Tier 1 Tier 2 Figure 1: Data flow at ATLAS. Standard TAG processing provides opportunities for mining TAG content and mining potential • Example: association rules • physics data contain known associationsbehavior • e.g., between trigger and physics content • testbed and tool validation • What associations are found by statistical tools without domain-specific knowledge? • Example: clustering • we already cluster data when we do streaming • e.g., by trigger • testbed and tool validation: • if we treat, e.g., a subset of event attributes (say, N physics attributes, no trigger information) as points in N-dimensional space, what clusters are found by statistical tools? • Example: one-pass algorithms • learn from data as they go by--no iterative processing • distributional characteristics, patterns • can detect outliers and anomalies, shifts in machine or detector behavior • not for discovery physics, but we have already detected software bugs by anomaly detection in TAG data • this is how network traffic is mined • tag processing provides several loci for nonintrusive insertion of stream mining tools • e.g., tag upload or transfer TAG mining and BlueGene/P • With TAGs in HDF5, we do not need to port LCG or ATLAS software to BlueGenein order to begin • First steps: statistical tools for distributional characteristics, outliers, anomalies run on each node, processing disjoint subsets of data • no inter-process communication until an aggregation phase at the end • aggregation stage (can be done hierarchically, e.g., binary tree) • Next steps: use large-scale parallelism for mining problems that have a large search space • e.g., O(N2) possible pair-wise associations in N attributes • each node may process the entire sample • parallel I/O challenges are different than those for partitioned samples • Later steps: Address mining problems that require inter-process communication in parallel implementations • iterative process, exchange, process, exchange models • e.g., clustering and classification • We are only beginning. Figure 3: BlueGene/P. Figure 2: K-Means Clustering example