Enabling Knowledge Discovery in a Virtual Universe: Harnessing the Power of Parallel Grid Resources for Astrophysical Data Analysis • Jeffrey P. Gardner, Andrew Connolly, Cameron McBride • Pittsburgh Supercomputing Center, University of Pittsburgh, Carnegie Mellon University
How to turn simulation output into scientific knowledge, using 300 processors (circa 1995): • Step 1: Run simulation • Step 2: Analyze simulation on workstation • Step 3: Extract meaningful scientific knowledge (happy scientist)
How to turn simulation output into scientific knowledge, using 1000 processors (circa 2000): • Step 1: Run simulation • Step 2: Analyze simulation on server (in serial) • Step 3: Extract meaningful scientific knowledge (happy scientist)
How to turn simulation output into scientific knowledge, using 4000+ processors (circa 2006): • Step 1: Run simulation • Step 2: Analyze simulation on ??? (unhappy scientist)
Mining the Universe can be (Computationally) Expensive • The size of simulations is no longer limited by computational power • It is limited by the parallelizability of data analysis tools • This situation will only get worse in the future.
How to turn simulation output into scientific knowledge, using 100,000 processors (circa 2012)? • Step 1: Run simulation • Step 2: Analyze simulation on ??? • By 2012, we will have machines with many hundreds of thousands of cores!
The Challenge of Data Analysis in a Multiprocessor Universe • Parallel programs are difficult to write! • Steep learning curve to learn parallel programming • Parallel programs are expensive to write! • Lengthy development time • Parallel world is dominated by simulations: • Code is often reused for many years by many people • Therefore, you can afford to invest lots of time writing the code. • Example: GASOLINE (a cosmology N-body code) • Required 10 FTE-years of development
The Challenge of Data Analysis in a Multiprocessor Universe • Data Analysis does not work this way: • Rapidly changing scientific inquiries • Less code reuse • Simulation groups do not even write their analysis code in parallel! • Data Mining paradigm mandates rapid software development!
How to turn observational data into scientific knowledge: • Step 1: Collect data • Step 2: Analyze data on workstation • Step 3: Extract meaningful scientific knowledge (happy astronomer)
The Era of Massive Sky Surveys • Paradigm shift in astronomy: Sky Surveys • Available data is growing at a much faster rate than computational power.
Good News for “Data Parallel” Operations • Data Parallel (or “Embarrassingly Parallel”): • Example: • 1,000,000 QSO spectra • Each spectrum takes ~1 hour to reduce • Each spectrum is computationally independent from the others • There are many workflow management tools that will distribute your computations across many machines.
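For data-parallel work like this, the distribution can be as simple as statically splitting the task list across processors. Below is a minimal sketch of that idea in C with MPI; reduce_spectrum() is a hypothetical stand-in for whatever serial pipeline processes one spectrum, and no particular workflow tool is assumed.

    /* Illustrative sketch: statically dividing N independent spectrum
       reductions among MPI ranks.  Each task is computationally
       independent, so no communication is needed. */
    #include <mpi.h>
    #include <stdio.h>

    #define N_SPECTRA 1000000

    static void reduce_spectrum(long id)
    {
        /* Placeholder for the real (serial) per-spectrum pipeline. */
        printf("reducing spectrum %ld\n", id);
    }

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each rank takes every size-th spectrum. */
        for (long i = rank; i < N_SPECTRA; i += size)
            reduce_spectrum(i);

        MPI_Finalize();
        return 0;
    }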
Tightly-Coupled Parallelism (what this talk is about) • Data and computational domains overlap • Computational elements must communicate with one another • Examples: • Group finding • N-Point correlation functions • New object classification • Density estimation
The Challenge of Astrophysics Data Analysis in a Multiprocessor Universe • Build a library that is: • Sophisticated enough to take care of all of the nasty parallel bits for you. • Flexible enough to be used for your own particular astrophysics data analysis application. • Scalable: scales well to thousands of processors.
The Challenge of Astrophysics Data Analysis in a Multiprocessor Universe • Astrophysics uses dynamic, irregular data structures: • Astronomy deals with point-like data in an N-dimensional parameter space • Most efficient methods on these kinds of data use space-partitioning trees. • The most common data structure is a kd-tree.
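For illustration, a kd-tree node for point-like data in an N-dimensional parameter space might look like the sketch below. The field names and layout are assumptions made for this example, not N tropy's internal representation.

    /* Illustrative kd-tree node for point-like data in NDIM dimensions. */
    #define NDIM 3

    typedef struct kd_node {
        double bnd_min[NDIM];          /* bounding box of this cell          */
        double bnd_max[NDIM];
        int    split_dim;              /* dimension the cell is split along  */
        double split_val;              /* coordinate of the splitting plane  */
        int    first, last;            /* index range of particles in cell   */
        struct kd_node *left, *right;  /* children (NULL for a leaf)         */
    } KDNode;

Such a tree lets range searches, nearest-neighbor queries, and tree walks skip most of the data, which is why it underlies operations like group finding, N-point correlation functions, and density estimation.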
Challenges for scalable parallel application development: • Things that make parallel programs difficult to write • Thread orchestration • Data management • Things that inhibit scalability: • Granularity (synchronization) • Load balancing • Data locality
Overview of existing paradigms: GSA • There are existing globally shared address space (GSA) compilers and libraries: • Co-Array Fortran • UPC • ZPL • Global Arrays • The Good: These are quite simple to use. • The Good: Can manage data locality well. • The Bad: Existing GSA approaches tend not to scale very well because of fine granularity. • The Ugly: None of these support irregular data structures.
Overview of existing paradigms: GSA • There are other GSA approaches that do lend themselves to irregular data structures: • e.g. Linda (tuple-space) • The Good: Almost universally flexible • The Bad: These tend to scale even worse than the previous GSA approaches. • Granularity is too fine
Challenges for scalable parallel application development: • Things that make parallel programs difficult to write • Thread orchestration • Data management • Things that inhibit scalability: • Granularity • Load balancing • Data locality GSA
Overview of existing paradigms: RMI ("Remote Method Invocation") • [Diagram: a master thread holding the computational agenda calls rmi_broadcast(…, (*myFunction)); the RMI layer on each of Proc. 0–3 then invokes myFunction()] • myFunction() is coarsely grained
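The pattern in the diagram can be sketched in plain C: the master hands whole items from the computational agenda to a user-supplied, coarsely grained function. The serial dispatch loop below is purely illustrative; an actual RMI layer would invoke the function on remote processors rather than in-process, and rmi_broadcast itself is only the slide's shorthand, not a library call assumed here.

    /* Conceptual sketch of coarse-grained RMI dispatch (runs in-process;
       a real RMI layer would ship each call to a remote processor). */
    #include <stdio.h>

    typedef void (*work_func)(int work_item);

    static void my_function(int work_item)
    {
        /* Coarsely grained: each invocation does a large chunk of work. */
        printf("processing work item %d\n", work_item);
    }

    static void broadcast_sketch(int n_items, work_func f)
    {
        for (int item = 0; item < n_items; item++)
            f(item);
    }

    int main(void)
    {
        broadcast_sketch(4, my_function);
        return 0;
    }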
Challenges for scalable parallel application development: • Things that make parallel programs difficult to write • Thread orchestration • Data management • Things that inhibit scalability: • Granularity • Load balancing • Data locality RMI
N tropy: A Library for Rapid Development of kd-tree Applications • No existing paradigm gives us everything we need. • Can we combine existing paradigms beneath a simple, yet flexible API?
N tropy: A Library for Rapid Development of kd-tree Applications • Use RMI for orchestration • Use GSA for data management
A Simple N tropy Example: N-body Gravity Calculation • Cosmological "N-body" simulation: 100,000,000 particles, 1 TB of RAM • [Figure: a 100-million-light-year simulation volume spatially decomposed across Proc 0–Proc 8]
A Simple N tropy Example: N-body Gravity Calculation • [Diagram: the master thread calls ntropy_Dynamic(…, (*myGravityFunc)); the computational agenda is the list of particles P1 … Pn on which to calculate the gravitational force; the N tropy thread service layer on each of Proc. 0–3 invokes myGravityFunc()]
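To make the user's side of this concrete, here is a sketch of the kind of serial-looking, coarsely grained callback one might hand to the library for the gravity calculation: a Barnes-Hut style walk of a mass tree for a single particle. The node structure, function signature, opening criterion, and G = 1 units are assumptions for the sketch, not the actual N tropy API.

    /* Illustrative gravity callback: Barnes-Hut style tree walk that
       accumulates the acceleration on one particle at position pos[]. */
    #include <math.h>

    #define NDIM 3

    typedef struct node {
        double com[NDIM];           /* center of mass of the cell    */
        double mass;                /* total mass in the cell        */
        double size;                /* cell edge length              */
        struct node *left, *right;  /* children (NULL for a leaf)    */
    } Node;

    static void my_gravity_func(const double pos[NDIM], const Node *cell,
                                double theta, double acc[NDIM])
    {
        double dr[NDIM], r2 = 0.0;
        if (cell == NULL)
            return;
        for (int d = 0; d < NDIM; d++) {
            dr[d] = cell->com[d] - pos[d];
            r2   += dr[d] * dr[d];
        }
        if (cell->left != NULL && cell->size * cell->size > theta * theta * r2) {
            /* Cell subtends too large an angle: open it and recurse. */
            my_gravity_func(pos, cell->left,  theta, acc);
            my_gravity_func(pos, cell->right, theta, acc);
        } else if (r2 > 0.0) {
            /* Treat the cell as a single point mass (monopole, G = 1). */
            double r3 = r2 * sqrt(r2);
            for (int d = 0; d < NDIM; d++)
                acc[d] += cell->mass * dr[d] / r3;
        }
    }

The point of the library is that code like this reads as if the whole tree were local; fetching cells that live on other processors is handled underneath by the GSA layer.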
A Simple N tropy Example: N-body Gravity Calculation • Cosmological "N-body" simulation: 100,000,000 particles, 1 TB of RAM • To resolve the gravitational force on any single particle requires the entire dataset • [Figure: the same 100-million-light-year volume decomposed across Proc 0–Proc 8]
A Simple N tropy Example: N-body Gravity Calculation • [Diagram: a global kd-tree with cells numbered 0–14 distributed across Proc. 0–3; on each processor, myGravityFunc() runs on top of the N tropy thread service layer, which reaches remote tree cells through the N tropy GSA layer]
N tropy Performance Features • GSA allows performance features to be provided "under the hood": • Interprocessor data caching • < 1 in 100,000 off-PE requests actually result in communication. • RMI allows further performance features • Dynamic load balancing • Workload can be dynamically reallocated as computation progresses.
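Conceptually, the interprocessor data cache sits between the tree walk and the communication layer: a request for an off-processor tree node first checks a local table, and only a miss triggers an actual fetch. The sketch below is a simplified illustration of that idea; the names, the direct-mapped table, and fetch_remote_node() are assumptions, not the real N tropy implementation.

    /* Simplified sketch of a software cache for off-processor tree nodes. */
    #include <stdlib.h>

    #define CACHE_SLOTS 4096

    typedef struct {
        long  global_id;   /* which remote node this slot holds */
        int   valid;       /* 0 until the slot has been filled  */
        void *data;        /* locally cached copy of the node   */
    } CacheSlot;

    static CacheSlot cache[CACHE_SLOTS];

    /* Placeholder for the communication call that fetches a node from
       the processor that owns it. */
    static void *fetch_remote_node(long global_id)
    {
        (void)global_id;
        return malloc(64);   /* stand-in for the real node payload */
    }

    void *get_tree_node(long global_id)
    {
        CacheSlot *slot = &cache[global_id % CACHE_SLOTS];
        if (!slot->valid || slot->global_id != global_id) {
            if (slot->valid)
                free(slot->data);                            /* evict old entry    */
            slot->data      = fetch_remote_node(global_id);  /* miss: communicate  */
            slot->global_id = global_id;
            slot->valid     = 1;
        }
        return slot->data;                                   /* hit: no communication */
    }

Because tree walks revisit the same remote cells over and over, a cache like this is what reduces the off-PE requests that actually communicate to fewer than 1 in 100,000.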
N tropy Performance • [Scaling plot: spatial 3-point correlation function, 3→4 Mpc, 10 million particles; curves compare interprocessor data cache with load balancing, interprocessor data cache without load balancing, and no interprocessor data cache with no load balancing]
Why does the data cache make such a huge difference? • [Diagram: a single myGravityFunc() call on Proc. 0 traverses much of the global kd-tree (cells 0–14), most of which lives on other processors]
N tropy “Meaningful” Benchmarks • The purpose of this library is to minimize development time! • Development time for: • Parallel N-point correlation function calculator • 2 years -> 3 months • Parallel Friends-of-Friends group finder • 8 months -> 3 weeks
Conclusions • Most approaches for parallel application development rely on a single paradigm • Inhibits scalability • Inhibits generality • Almost all current HPC programs are written in MPI (“paradigm-less”): • MPI is a “lowest common denominator” upon which any paradigm can be imposed.
Conclusions • Many "real-world" problems, especially those involving irregular data structures, demand a combination of paradigms • N tropy provides: • Remote Method Invocation (RMI) • Globally Shared Addressing (GSA)
Conclusions • Tools that selectively deploy several parallel paradigms (rather than just one) may be what is needed to parallelize applications that use irregular/adaptive/dynamic data structures. • More Information: • Go to Wikipedia and search "Ntropy"