Searching Large Scientific Data

John Wu
Scientific Data Management, Lawrence Berkeley National Laboratory

Outline
• Highlights of Accomplishments
  • Grid Collector (accelerate others’ work)
  • Query-Driven Visualization (enabling a new way of knowledge discovery)
  • Molecular docking (enabling others to accomplish great things)
• Outlook
  • More complex searches
  • Parallelization
  • Supporting more data formats
  • Integration with large frameworks

FastBit In a Nutshell
• FastBit is designed to search multi-dimensional append-only data
  • Conceptually in table format: rows are objects, columns are attributes
• FastBit uses vertical (column-oriented) organization for the data
  • Efficient for searching
• FastBit uses bitmap indices with our compression method
  • Proven in analysis to be optimal for one-dimensional queries
  • Faster than other optimal indexes for multi-dimensional queries
[Wu, Otoo, Shoshani 2006]
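The bitmap-index idea can be sketched in a few lines of Python. This is a minimal illustration only: it uses uncompressed Python integers as bitmaps, it is not the FastBit API, and it leaves out FastBit's compression entirely. The column names, values, and helper functions are made up for the example.

```python
from collections import defaultdict

def build_bitmap_index(column):
    """Map each distinct value to a bitmap; bit r is set when row r holds that value."""
    index = defaultdict(int)
    for row, value in enumerate(column):
        index[value] |= 1 << row
    return dict(index)

def range_query(index, low, high):
    """OR together the bitmaps of all values v with low <= v <= high."""
    hits = 0
    for value, bitmap in index.items():
        if low <= value <= high:
            hits |= bitmap
    return hits

def rows_of(bitmap):
    """Decode a bitmap into a sorted list of row numbers."""
    return [r for r in range(bitmap.bit_length()) if (bitmap >> r) & 1]

# Two hypothetical columns of the same table (one entry per row).
temperature = [300, 850, 920, 1200, 850, 700]
pressure = [1, 3, 2, 3, 1, 3]

t_index = build_bitmap_index(temperature)
p_index = build_bitmap_index(pressure)

# A two-dimensional query is the bitwise AND of two one-dimensional answers.
hits = range_query(t_index, 800, 1000) & range_query(p_index, 2, 3)
print(rows_of(hits))
```

Because each column is indexed separately, a multi-dimensional query only combines per-column answers with bitwise operations, which is what makes the column-oriented layout convenient here.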

Motivation
• Scientific datasets are growing quickly
• Most data analysis algorithms cannot handle a whole dataset
• Therefore, most data analysis tasks are performed on a subset of the data
• Some examples of searches
  • Find the collision events with the most distinct features of Quark-Gluon Plasma in a high-energy physics experiment
  • Find and track ignition in a combustion simulation
  • Identify the puppet master behind a distributed denial-of-service attack on a computer network

Highlight 1 – Grid Collector
• Searching over billions of objects with hundreds of attributes each
  • Distributed analysis over the Grid
  • Make petabytes of raw data available for worldwide analyses
• Benefits of the Grid Collector
  • Transparent object access: select objects based on their attributes
  • Improved throughput of the analysis system
• Best Paper Award (ISC’05)
[Wu, Gu, Lauret, Poskanzer, Shoshani, Sim and Zhang 2005]

Grid Collector Speeds up Analyses
• Test machine: 2.8 GHz Xeon, 27 MB/s read speed
• When searching for rare events, say, selecting one event out of 1000, using GC is 20 to 50 times faster
• Using GC to read 1/2 of the events gives a speedup > 1.5; reading 1/10 of the events gives a speedup > 2
• Bottom line: improve the throughput of data analyses!

Highlight 2 – Visualization
• Query-Driven Visualization: collaboration between SDM and VACET
• Use FastBit indexes to efficiently select the most interesting data for visualization
• Example: laser wakefield accelerator simulation
  • VORPAL produces 2D and 3D simulations of particles in a laser wakefield
  • Finding and tracking particles with large momentum is key to designing the accelerator
  • The brute-force algorithm is quadratic (taking 5 minutes on 0.5 million particles); FastBit time is linear in the number of results (taking 0.3 s, a 1000 X speedup)
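The select-then-track pattern can be sketched as follows. This is an assumption-laden illustration, not the VORPAL or FastBit code: a Python dictionary keyed by particle ID stands in for an index on the ID column, and the data layout, function names, and momentum values are hypothetical. The point is that the work done in tracking grows with the number of selected particles rather than with the full particle count.

```python
def select_ids(timestep, min_momentum):
    """Return the IDs of particles whose momentum exceeds the threshold."""
    return {pid for pid, momentum in timestep.items() if momentum > min_momentum}

def track(timesteps, ids):
    """Follow the selected IDs through every time step via direct lookups."""
    return {pid: [step.get(pid) for step in timesteps] for pid in sorted(ids)}

# Hypothetical data: one dict per time step, mapping particle ID -> momentum.
timesteps = [
    {1: 0.2, 2: 0.9, 3: 0.1},
    {1: 0.5, 2: 1.8, 3: 0.2},
    {1: 0.7, 2: 3.5, 3: 4.1},
]

selected = select_ids(timesteps[-1], 3.0)  # energetic particles in the last step
paths = track(timesteps, selected)         # their momentum history
print(paths)
```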

Bin-Based Parallel Coordinate Display
• Integrate FastBit with H5Part, an HDF5 package for particle physics data
• Use FastBit to compute histograms efficiently
• Bin-based parallel coordinate display reduces the number of lines displayed on screen, reduces visual clutter, and reduces response time
• FastBit further reduces the response time

FastBit Speeds up Histogramming
[Figure: time needed to compute the desired histograms; lower is better; plot annotation: ~ 104 X]
• Time needed to compute desired histograms
• Compared against custom code that uses the raw data directly
• FastBit can be 1000 X faster than the custom code (left)
• FastBit maintains the performance advantage on a parallel system
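One way to see why an index helps with histograms: if there is one bitmap per bin, each bin count is just the number of set bits, and a conditional histogram is the bit count after an AND with the selection bitmap. The sketch below assumes uncompressed Python integers as bitmaps and made-up data; it is not how FastBit is implemented internally.

```python
def build_binned_index(values, edges):
    """One bitmap per bin [edges[i], edges[i+1]); bit r is set when row r falls in the bin."""
    bitmaps = [0] * (len(edges) - 1)
    for row, v in enumerate(values):
        for i in range(len(bitmaps)):
            if edges[i] <= v < edges[i + 1]:
                bitmaps[i] |= 1 << row
                break
    return bitmaps

def histogram(bitmaps, mask=None):
    """Count set bits per bin, optionally restricted to the rows selected by mask."""
    return [bin(b if mask is None else b & mask).count("1") for b in bitmaps]

# Hypothetical particle data: position x and momentum px, one entry per row.
x = [0.5, 1.5, 2.5, 1.2, 0.1, 2.9]
px = [5, 1, 7, 9, 2, 8]

x_bins = build_binned_index(x, edges=[0, 1, 2, 3])
fast = sum(1 << row for row, p in enumerate(px) if p > 4)  # rows with px > 4

print(histogram(x_bins))        # histogram of x over all rows
print(histogram(x_bins, fast))  # histogram of x over rows with px > 4
```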

Highlight 3 – Molecular Docking
• Jochen Schlosser [schlosser@zbh.uni-hamburg.de], Center for Bioinformatics, University of Hamburg
• Application: structure-based virtual screening (ACS Fall 2007)
  • Standard approach: match every ligand with every target protein
  • New approach: use FastBit indexes to avoid brute-force matching
[Figure: n ligands and one target protein go through n docking runs, matching each ligand with the cavity, to produce a hit list]
Hit list:
Name   Score
1bef   -16.4
4dab   -12.3
4d2a   -11.6
…      …

Use of FastBit for Molecular Docking
Method
• Specification of the descriptor as triangle geometry
  • Types of interaction centers
  • Triangle side lengths
  • Interaction directions
  • 80 bulk dimensions
• Receptors
  • Receptor descriptors are generated similarly
  • Using complementary information where necessary
• Use of pharmacophore constraints on receptor triangles
  • Reduces number of queries
  • Improved query selectivity because the pharmacophore tends to be inside the protein cavity

Use of FastBit for Molecular Docking
Method
• Indexing system
  • Properties of the problem
    • Billions of descriptors (~ 1,000 for each ligand)
    • High-dimensional query
  • Properties of bitmap indexes
    • Well suited for these kinds of queries
    • Can be run standalone
    • Further compression possible
    • FastBit uses compression
• Results
  • TrixX-BMI is an efficient tool for virtual screening, with average runtime in the sub-second range
  • Screens libraries of ligands 12 times faster than FlexX without pharmacophore constraints
  • With pharmacophore constraints, the speedup is 140 – 250
Bitmap index for attribute(i):
         [0]  ...  …    …    [n]
desc1     0    1    0    0    0
desc2     0    0    0    1    0
desc3     0    1    0    0    0
desc4     0    0    0    0    1
desc5     1    0    0    0    0
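A high-dimensional descriptor lookup of this kind can be sketched with one bitmap per (attribute, value) pair. Everything below is hypothetical: the descriptor layout, the ligand names, the type strings, and the tolerance handling are invented for illustration and are not taken from TrixX-BMI, and the sketch ignores the complementarity between receptor and ligand types.

```python
from collections import defaultdict

# Hypothetical descriptors: (ligand name, interaction-center types, binned side lengths).
descriptors = [
    ("ligA", "DAH", (3, 4, 5)),
    ("ligB", "DDA", (2, 2, 3)),
    ("ligC", "DAH", (3, 5, 5)),
    ("ligD", "AAH", (3, 4, 5)),
]

# One bitmap per (attribute, value); bit r is set when descriptor r has that value.
index = defaultdict(int)
for row, (_, types, sides) in enumerate(descriptors):
    index[("types", types)] |= 1 << row
    for k, side in enumerate(sides):
        index[(f"side{k}", side)] |= 1 << row

def query(types, sides, tol=1):
    """Ligands with matching types and every side length within +/- tol bins."""
    hits = index.get(("types", types), 0)
    for k, side in enumerate(sides):
        side_hits = 0
        for b in range(side - tol, side + tol + 1):
            side_hits |= index.get((f"side{k}", b), 0)  # OR within one attribute
        hits &= side_hits                               # AND across attributes
    return sorted({descriptors[r][0] for r in range(len(descriptors)) if (hits >> r) & 1})

print(query("DAH", (3, 4, 5), tol=1))
print(query("DAH", (3, 4, 5), tol=0))
```

Each extra attribute adds one more AND, so the query cost depends on the number of attributes and bins touched rather than on comparing against every descriptor one by one.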

Outline
• Highlights of Accomplishments
  • Grid Collector
  • Query-Driven Visualization
  • Molecular docking
• Outlook
  • More complex searches
  • Parallelization
  • Supporting more data formats
  • Integration with large frameworks

Complex Searches
• So far, FastBit software primarily handles range queries of the form “pressure > 105 and temperature between 800 and 1000”
• Need to support more complex types of searches
  • GTC data analysis: find all particles with a certain energy level that have passed through a region with specified properties of the electric field
  • Network security: find the hosts that have contacted all identified drones within an hour of the start of an attack
  • Protein sequences: identify known proteins with a specified molecular weight
  • Catalog matching: match records of stars and galaxies from one survey or simulation to another
  • Subqueries: search the results of previous searches
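The network security example shows why these searches go beyond a single range query: the condition is on a group of records per host, not on one record. A plain-Python sketch, assuming connection records of the hypothetical form (host, peer, timestamp in seconds) and reading "within an hour" as within 3600 seconds before or after the attack start:

```python
from collections import defaultdict

def hosts_contacting_all(connections, drones, attack_start, window=3600):
    """Hosts that contacted every drone within `window` seconds of the attack start."""
    drones = set(drones)
    contacted = defaultdict(set)
    for host, peer, ts in connections:
        # Step 1 is an ordinary filter (range condition plus membership test).
        if peer in drones and abs(ts - attack_start) <= window:
            contacted[host].add(peer)
    # Step 2 is the "for all" part: keep hosts whose contacts cover every drone.
    return sorted(h for h, seen in contacted.items() if seen == drones)

# Hypothetical connection log.
connections = [
    ("a", "d1", 100), ("a", "d2", 200),
    ("b", "d1", 150), ("b", "d2", 9000),
    ("c", "x", 100), ("c", "d1", 50), ("c", "d2", 3000),
]

suspects = hosts_contacting_all(connections, {"d1", "d2"}, attack_start=0)
print(suspects)
```

Only the first step maps directly onto the range queries an index already answers; the second step is the part that needs new query support.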

Complex Searches
• Extend the histogramming functionality: group by, top-k, automatic computation of derived fields
• Implement join algorithms
  • Existing bitmap indexes are efficient for selecting the desired records for common join algorithms such as sort-merge join
  • Existing bitmap-index-based join algorithms appear promising from back-of-the-envelope calculations
• A* algorithm: for problems such as neighborhood expansion, formulating them as joins may not be as efficient as using alternative search algorithms such as A*
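The filter-then-join combination can be sketched as below. The filter is a plain list comprehension standing in for an index lookup, and the join is a textbook sort-merge join; the catalog records, key values, and the magnitude cut are hypothetical.

```python
def sort_merge_join(left, right):
    """Join two lists of (key, payload) pairs on key; duplicate keys are handled."""
    left = sorted(left, key=lambda kv: kv[0])
    right = sorted(right, key=lambda kv: kv[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            key = left[i][0]
            i_end, j_end = i, j
            while i_end < len(left) and left[i_end][0] == key:
                i_end += 1
            while j_end < len(right) and right[j_end][0] == key:
                j_end += 1
            # Emit every pairing of the two runs that share this key.
            for a in left[i:i_end]:
                for b in right[j:j_end]:
                    out.append((key, a[1], b[1]))
            i, j = i_end, j_end
    return out

# Hypothetical catalogs: (object id, magnitude) and (object id, redshift).
survey_a = [(7, 18.2), (3, 21.5), (5, 19.0), (9, 17.1)]
survey_b = [(5, 0.12), (7, 0.40), (7, 0.41), (8, 0.05)]

bright = [(k, m) for k, m in survey_a if m < 20.0]  # filter first, then join
matches = sort_merge_join(bright, survey_b)
print(matches)
```

Filtering before the join shrinks the inputs that have to be sorted and merged, which is where an index on the filter column would pay off.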

Parallelization
• For I/O-dominated tasks
  • Take advantage of a parallel I/O system such as PVFS
  • Better data layout to effectively utilize the I/O hardware
  • Active Storage, in-situ data processing
• For CPU-dominated tasks
  • Devise new algorithms, e.g., parallel join algorithms, new join indexes
  • Algorithms for GPU, Cell processor, and many-core architectures
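The basic partition, search, and merge pattern behind a parallel search can be sketched with Python's standard `concurrent.futures` module. This only illustrates the structure: the function names and data are hypothetical, and Python threads will not speed up CPU-bound work, so a real system would rely on processes, MPI, or a parallel file system instead.

```python
from concurrent.futures import ThreadPoolExecutor

def search_partition(task):
    """Return global row numbers of the rows in one partition that satisfy the predicate."""
    offset, rows, predicate = task
    return [offset + i for i, row in enumerate(rows) if predicate(row)]

def parallel_search(rows, predicate, n_parts=4):
    """Split rows into contiguous partitions, search each one, and merge the hits in order."""
    size = max(1, -(-len(rows) // n_parts))  # ceiling division
    tasks = [(start, rows[start:start + size], predicate)
             for start in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        parts = list(pool.map(search_partition, tasks))  # map preserves task order
    return [hit for part in parts for hit in part]

rows = list(range(100))
hits = parallel_search(rows, lambda value: value % 10 == 0)
print(hits)
```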

More Data Formats
• Work with application specialists to integrate FastBit with their data libraries
  • H5Part: HDF5
  • ROOT (?)
  • ADIOS
• Restructure FastBit to make it easier to work with different data formats
• Virtualize data sources

Integrated Data Analysis Framework
• Iterator for coarse-grain data
  • Examples: ROOT and Map-Reduce
  • Indexing provides a way to implement a “smart iterator”, e.g., Grid Collector for the STAR data analysis framework (using ROOT)
• Framework for fine-grain data
  • Tighter integration with programmatic API
  • Provide scripting support for the productivity layer (end user)
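A "smart iterator" can be sketched as a generator that consults index results before touching any file, so that files without hits are never opened and only the selected events are handed to the analysis code. The file names, the `hits_by_file` mapping, and the `open_file` callback are hypothetical stand-ins; this is not the Grid Collector or ROOT interface.

```python
def smart_iterator(file_names, hits_by_file, open_file):
    """Yield only the selected events, skipping files that contain no hits."""
    for name in file_names:
        positions = hits_by_file.get(name)
        if not positions:
            continue  # no hit in this file: it is never opened
        events = open_file(name)
        for pos in positions:
            yield events[pos]

# Hypothetical storage and index results (file name -> positions of matching events).
storage = {"f1": ["e0", "e1", "e2"], "f2": ["e3", "e4"], "f3": ["e5", "e6"]}
hits_by_file = {"f1": [1], "f3": [0, 1]}

opened = []  # records which files were actually read

def open_file(name):
    opened.append(name)
    return storage[name]

selected = list(smart_iterator(storage, hits_by_file, open_file))
print(selected, opened)
```

The analysis loop stays a plain `for event in ...` loop, which is why this style fits frameworks that already iterate over coarse-grain data.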

Indexes Facilitate Smart Analysis
Or: How to make your system smarter!
[Figure callout: “Indexes go here!”]
