Introduction to Research 2007 Ashok Srinivasan Florida State University www.cs.fsu.edu/~asriniva • Recent collaborators • V. Aggarwal, J. Kolhe, L. Ji, M. Mascagni, H. Nymeyer, and Y. Yu • Florida State University • S. Kapoor • IBM Austin • S. Namilae • Oak Ridge National Lab • M. Krishna, A. Kumar, N. Jayam, G. Senthilkumar, P. K. Baruah, and R. Sharma • Sri Sathya Sai University, India • N. Chandra • University of Nebraska at Lincoln • Research support • Funding • DoD, FSU, NSF • Computer time • IBM, NCSA, NERSC, ORNL
Outline • Research Areas • Computational Nanotechnology • Computational Biology • High Performance Computing on Multicore Processors • Potential Research Topics • Graduate Courses
Research Areas • High Performance Computing, Applications in Computational Sciences, Scalable Algorithms, Mathematical Software • Current topics: Computational Nanotechnology, Computational Biology, HPC on Multicore Processors • New Topics: Dynamic Data Driven Applications • Old Topics: Computational Finance, Parallel Random Number Generation, Monte Carlo Linear Algebra, Computational Fluid Dynamics, Image Compression
Importance of Parallel Computing • Makes feasible products based on more fundamental understanding of science • Example: Nanotechnology, Medicine • Increasing relevance to industry • In 1993, fewer than 30% of top 500 supercomputers were commercial • Now, over 50% are commercial • Finance and insurance • Medicine • Aerospace and Automobiles • Telecom • Oil exploration • Shoes! (Nike) • Potato chips! • Toys!
Architectural Trends • Massive parallelism • 10K processor systems will be commonplace • The high end already has over 100K processors • Single chip multiprocessing • All processors will be multicore • Heterogeneous multicore processors • Cell used in the PS3 • 80-core processor from Intel • Processors with hundreds of cores are already commercially available • Distributed environments, such as the Grid • But it is hard to get good performance on these systems
Computational Nanotechnology • Example application • Carbon Nanotube • Can span 23,000 miles without failing under its own weight • 100 times stronger than steel • Lighter than a feather • Conducts heat better than diamond • Computations are used to understand materials at the atomic scale, so that better materials can be designed • Easier than experimentation at the nanometer scale
CNT Tensile Test • Pull the CNT at constant speed • Determine material properties from force-displacement response • Computational difficulties • Time step size ~ 10^-15 s • Desired time range is much larger • A million time steps are required to reach 10^-9 s • ~ 500 hours of computing for ~ 40K atoms using GROMACS • MD uses unrealistically large pulling speeds • 1 to 10 m/s instead of 10^-7 to 10^-5 m/s • Results at unrealistic speeds are unrealistic!
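The figures on this slide can be checked with a little arithmetic. The sketch below is only an illustration: the variable names are made up here, and the per-step cost is derived from the slide's approximate "~500 hours" figure rather than measured.

```python
# Worked arithmetic for the CNT tensile test numbers quoted on the slide.
STEP_FS = 1            # time step of ~10^-15 s, i.e. 1 femtosecond
TARGET_FS = 10**6      # 10^-9 s (1 ns) expressed in femtoseconds
REPORTED_HOURS = 500   # ~500 hours for ~40K atoms (approximate, from the slide)

# Number of MD steps needed to reach 1 ns.
steps = TARGET_FS // STEP_FS

# Implied wall-clock cost of a single step, in seconds.
seconds_per_step = REPORTED_HOURS * 3600 / steps

# How far the MD pulling speeds are from the realistic ones.
md_speeds = (1.0, 10.0)            # m/s, used in MD
realistic_speeds = (1e-7, 1e-5)    # m/s, realistic range
min_ratio = md_speeds[0] / realistic_speeds[1]
max_ratio = md_speeds[1] / realistic_speeds[0]

print(f"steps to reach 1 ns: {steps:,}")
print(f"implied cost per step: {seconds_per_step:.2f} s")
print(f"MD pulls {min_ratio:.0e} to {max_ratio:.0e} times too fast")
```

Under these numbers MD pulls the nanotube five to eight orders of magnitude faster than the realistic range, which is why simply running longer on one processor is not a practical fix.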
Difficulty with Parallelization • Results on scalable code • Does not scale efficiently beyond 10 ms/iteration • If we want to simulate to a ms • Time step 1 fs → 10^12 iterations → 10^10 s ≈ 300 years • If we scaled to 10 μs per iteration • 4 months computing time • [Figure: scaling results for NAMD, 327K atom ATPase, PME, Blue Gene, IPDPS 2006; NAMD, 92K atom ApoA1, PME, Blue Gene, IPDPS 2006; IBM Blue Matter, 43K atom Rhodopsin, Blue Gene, Tech Report 2005; Desmond, 92K atom ApoA1, SC 2006]
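The estimates on this slide follow from multiplying the iteration count by the time per iteration. The sketch below reproduces them; it assumes the second per-iteration figure is 10 microseconds (the value that yields roughly 4 months), and the function and constant names are illustrative only.

```python
# Wall-clock time needed to simulate 1 ms with a 1 fs time step.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
SECONDS_PER_MONTH = SECONDS_PER_YEAR / 12

def wall_clock_seconds(simulated_fs, step_fs, seconds_per_iteration):
    """Total run time = number of iterations x cost of one iteration."""
    iterations = simulated_fs // step_fs
    return iterations * seconds_per_iteration

ONE_MS_IN_FS = 10**12  # 1 ms = 10^12 fs, so 10^12 iterations at 1 fs each

# Current scaling limit: about 10 ms per iteration.
slow = wall_clock_seconds(ONE_MS_IN_FS, 1, 10e-3)
years = slow / SECONDS_PER_YEAR

# Assumed improved rate: 10 microseconds per iteration.
fast = wall_clock_seconds(ONE_MS_IN_FS, 1, 10e-6)
months = fast / SECONDS_PER_MONTH

print(f"10 ms/iteration: {slow:.1e} s, about {years:.0f} years")
print(f"10 us/iteration: {fast:.1e} s, about {months:.1f} months")
```

Even a thousandfold improvement in time per iteration still leaves months of computing, which motivates parallelizing along the time axis instead.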
Data Driven Time Parallelization • Each processor simulates a different time interval • Initial state is obtained by prediction, using prior data (except for processor 0) • Verify if prediction for end state is close to that computed by MD • Prediction is based on dynamically determining a relationship between the current simulation and those in a database of prior results • If the time interval is sufficiently large, then communication overhead is small
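The predict-then-verify loop described above can be sketched on a toy problem. This is not the MD implementation from the talk: the dynamics (exponential decay integrated with explicit Euler), the predictor (a closed-form "prior run" at a possibly different rate), and all names are assumptions chosen to keep the example small. The intervals are computed in a list comprehension here; in a real run each would be on its own processor.

```python
import math

DT = 1e-3    # integrator step of the toy system (assumed)
RATE = 1.0   # decay rate of the "current" simulation (assumed)

def simulate(x0, n_steps, rate=RATE, dt=DT):
    """Stand-in for the MD engine: explicit Euler on dx/dt = -rate * x."""
    x = x0
    for _ in range(n_steps):
        x += dt * (-rate * x)
    return x

def predict(t, prior_rate):
    """Hypothetical predictor: the state at time t taken from a prior run."""
    return math.exp(-prior_rate * t)

def time_parallel(n_intervals, steps_per_interval, prior_rate, tol):
    """Return (final state, number of parallel rounds needed)."""
    length = steps_per_interval * DT
    state = 1.0   # verified state at the start of the first pending interval
    done = 0      # intervals whose results have been accepted
    rounds = 0
    while done < n_intervals:
        rounds += 1
        pending = n_intervals - done
        # The first interval starts from a verified state; the others start
        # from predicted states based on prior data.
        starts = [state] + [predict((done + k) * length, prior_rate)
                            for k in range(1, pending)]
        # Conceptually, each entry is simulated on a different processor.
        ends = [simulate(s, steps_per_interval) for s in starts]
        # Accept intervals while the computed end state of one interval is
        # close to the predicted start state of the next.
        accepted = 1
        while (accepted < pending
               and abs(ends[accepted - 1] - starts[accepted]) <= tol):
            accepted += 1
        state = ends[accepted - 1]
        done += accepted
    return state, rounds

sequential = simulate(1.0, 4000)
good_state, good_rounds = time_parallel(4, 1000, prior_rate=1.0, tol=1e-2)
bad_state, bad_rounds = time_parallel(4, 1000, prior_rate=2.0, tol=1e-2)
print(good_rounds, bad_rounds)
```

When the prior data matches the current run, all intervals are accepted in a single round; when it does not, the scheme degrades to one interval per round, which is the sequential cost plus the prediction overhead. Accepted results differ from the sequential answer by an amount governed by the tolerance.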
Results • Speedup result • Red line: Ideal speedup • Blue: v = 0.1 m/s • Green: A different predictor • Experimental parameters • v = 1 m/s, using v = 10 m/s • CNT with 1000 atoms • Xeon/Myrinet cluster • Validation • Compare stress-strain response • Blue: Exact results • Red: Time parallel results • Green: Direct prediction
Computational Biology • Data driven time parallelization in the AFM simulation of proteins • An order of magnitude improvement in performance by combining conventional and data driven time parallelization with the protein Titin
High Performance Computing on Multicore Processors • Cell Architecture • A PowerPC core (PPE), with 8 co-processors (SPEs) with 256 KB local store each • Shared 512 MB - 2 GB main memory, which the SPEs can access through DMA • Peak speeds of 204.8 Gflops in single precision and 14.64 Gflops in double precision for SPEs • 204.8 GB/s EIB bandwidth, 25.6 GB/s for memory • Two Cell processors can be combined to form a Cell blade with global shared memory • [Figure: DMA put times; memory to memory copy using SPE local store vs. memcpy by PPE]
Cell MPI Results • PE: Consider SPUs to be a logical hypercube; in each step, each SPU exchanges messages with its neighbor along one dimension • DIS: In step i, SPU j sends to SPU j + 2^i and receives from SPU j - 2^i • [Figures: comparison of MPI_Barrier on different hardware; MPI_Barrier timing; broadcast bandwidth]
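The two communication patterns above can be written down as partner schedules. The sketch below reads PE as a pairwise exchange and DIS as a dissemination barrier, which is an interpretation of the abbreviations. It also assumes that ranks wrap around modulo the number of SPUs in the DIS pattern, which the slide does not state. Function names are illustrative and this is not the Cell MPI code itself.

```python
def pairwise_exchange_partners(rank, n):
    """Hypercube pattern: in step i, exchange with the neighbor whose rank
    differs in bit i. Requires n to be a power of two."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    return [rank ^ (1 << i) for i in range(n.bit_length() - 1)]

def dissemination_partners(rank, n):
    """In step i, send to rank + 2^i and receive from rank - 2^i.
    Wrap-around modulo n is an assumption made here."""
    steps = (n - 1).bit_length()  # ceil(log2(n)) steps
    return [((rank + (1 << i)) % n, (rank - (1 << i)) % n)
            for i in range(steps)]

def barrier_completes(n):
    """Check that after all dissemination steps every rank has (transitively)
    heard from every other rank, which is what a barrier needs."""
    known = [{r} for r in range(n)]
    for i in range((n - 1).bit_length()):
        incoming = [known[(r - (1 << i)) % n] for r in range(n)]
        known = [known[r] | incoming[r] for r in range(n)]
    return all(len(k) == n for k in known)

print(pairwise_exchange_partners(5, 8))   # partners of SPU 5 among 8 SPUs
print(dissemination_partners(6, 8))       # (send to, receive from) per step
print(barrier_completes(8), barrier_completes(6))
```

Both patterns finish in a logarithmic number of steps. The dissemination pattern also works when the number of SPUs is not a power of two, as the check with 6 ranks illustrates.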
Potential Research Topics • Computational Biology • Data Driven Time Parallelization • Markov State Modeling • Other topics • Dynamic Data Driven Applications • Combining simulations and experiments in superplastic forming • High Performance Computing on Multicore Processors • Algorithms and libraries on the Cell processor • Example: Sorting, linear algebra, etc • Good software cache/code overlaying implementations • Other possible new directions • Applications in history, linguistics, medicine, etc
Graduate Courses • Parallel Computing, Spring 2008 • MPI and OpenMP programming on traditional parallel machines • Threaded programming on multicore processors • Parallel algorithms • Advanced Algorithms, Fall 2008 • Approximation algorithms for NP hard problems • Randomized algorithms • Cache aware algorithms