Protein Explorer: A Petaflops Special Purpose Computer System for Molecular Dynamics Simulations David Gobaud Computational Drug Discovery Stanford University 7 March 2006
Outline • Overview • Background • Delft Molecular Dynamics Processor • GRAPE • Protein Explorer Summary • MDGRAPE-3 Chip • Force Calculation Pipeline • J-Particle Memory and Control Units • System Architecture • Software • Cost • Questions
Overview • Protein Explorer • Petaflop special-purpose computer system for molecular dynamics simulations • High-precision screening for drug design • Large-scale simulations of huge proteins/complexes • PC cluster with special-purpose engines to perform the most time-consuming calculations • Dedicated LSI MDGRAPE-3 chip performs force calculations at 165 Gflops or higher • ETA 2006
Background • PCs are universal machines • Various applications • Hardware can be designed independent of applications • Obstacles to high-performance • Memory bandwidth bottleneck • Heat dissipation problem • Can be overcome by developing specialized architectures
Delft Molecular Dynamics Processor (DMDP) • Pioneered high-performance special-purpose systems • Not able to achieve effective cost-performance • Demanded too much time and money in the development stage • Speed of development is a crucial factor affecting cost-performance because electronic device technology continues to develop rapidly • Almost all calculations were performed by the DMDP, making the hardware very complex
GRAPE (GRAvity PipE) • One of the most successful attempts to develop high-performance special-purpose systems • Specialized for simulations of classical particles • Most time spent on calculation of long-range forces (gravitational, Coulomb, and van der Waals) • Thus special hardware only performs these calculations • Hardware very simple and cost-effective
GRAPE (GRAvity PipE) • In 1995, first machine to break the teraflops barrier in nominal peak performance • Since 2001 the performance leader has been the Molecular Dynamics Machine at RIKEN at 78 Tflops • In 2002 a 64-Tflops GRAPE-6 was completed at the University of Tokyo • Protein Explorer launched based on the 2002 University of Tokyo success
Protein Explorer Summary • Host PC cluster with special-purpose boards attached • Boards calculate only non-bonded forces • Very simple hardware and software • No detailed knowledge of hardware needed to write programs • Communication time between host and boards is proportional to the number of particles N • Calculation time is proportional to: • N^2 for direct summation of long-range forces • N*Nc for short-range forces, where Nc is the average number of particles within the cutoff radius • Host-board communication: about 0.25 bytes per 1000 operations
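The two scaling rules on this slide can be compared with a small sketch. The particle count, number density, and cutoff radius below are hypothetical values chosen for illustration, and estimating Nc as density times the volume of the cutoff sphere is an assumption, not something the slides state.

```python
import math

def direct_pair_count(n):
    """Pair interactions for direct summation: every particle against every particle."""
    return n * n

def cutoff_pair_count(n, density, cutoff):
    """Pair interactions with a cutoff: N * Nc.

    Nc is estimated (assumption) as the number density times the volume
    of a sphere with the cutoff radius.
    """
    nc = density * (4.0 / 3.0) * math.pi * cutoff ** 3
    return n * nc

# Hypothetical system, for illustration only.
n = 100_000        # particles
density = 0.1      # particles per cubic angstrom
cutoff = 10.0      # angstroms

print("direct summation pairs:", direct_pair_count(n))
print("cutoff pairs          :", round(cutoff_pair_count(n, density, cutoff)))
```

Under these assumed values Nc is a few hundred, so the cutoff calculation is far cheaper than the N^2 direct sum, while the host-to-board traffic grows only linearly with N in both cases.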
MDGRAPE-3 Chip - Force Calculation Pipeline • 3 subtractor units • 6 adder units • 8 multiplier units • 1 function-evaluation unit • Can perform ~33 equivalent operations per clock cycle when it calculates the Coulomb force
MDGRAPE-3 Chip - Force Calculation Pipeline • Most operations done in 32-bit single precision floating point format • Force accumulation is 80-bit fixed point format • Can be converted to 64-bit double precision floating point • Coordinates stored in 40-bit fixed-point format • Makes implementation of periodic boundary condition easy
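Why fixed-point coordinates make periodic boundaries easy can be sketched as follows. The mapping of the box length onto the full 40-bit integer range is an assumption made for illustration (the slides give only the bit width), and the function names are hypothetical. With that mapping, integer subtraction that wraps around at 2^40 yields the minimum-image displacement with no branching on box edges.

```python
BITS = 40
MOD = 1 << BITS          # 2**40 distinct coordinate values
HALF = 1 << (BITS - 1)

def to_fixed(x, box):
    """Map a coordinate in a periodic box of length `box` to a 40-bit unsigned integer."""
    return int(round((x % box) / box * MOD)) % MOD

def min_image_diff(a, b):
    """Wrapped difference a - b, reinterpreted as a signed 40-bit integer."""
    d = (a - b) & (MOD - 1)
    return d - MOD if d >= HALF else d

def to_length(d, box):
    """Convert a fixed-point displacement back to length units."""
    return d / MOD * box

box = 50.0                      # hypothetical box length
xi = to_fixed(1.0, box)         # particle just inside the left edge
xj = to_fixed(49.0, box)        # particle just inside the right edge
dx = to_length(min_image_diff(xi, xj), box)
print(dx)                       # approximately 2.0, not -48.0
```

The two particles are 48 units apart inside the box but only 2 units apart through the periodic boundary, and the wrapped subtraction returns the shorter displacement directly.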
MDGRAPE-3 Chip - Force Calculation Pipeline • Function Evaluator • Most important part of pipeline • Allows calculation of an arbitrary smooth function • Has a memory unit which contains a table of polynomial coefficients and exponents, and a hardwired pipeline for fourth-order polynomial evaluation • Interpolates an arbitrary smooth function g(x) using segmented fourth-order polynomials evaluated by Horner's method
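A minimal software sketch of segmented fourth-order interpolation with Horner's method is shown below. It is not the chip's actual table layout: the uniform segments, the use of Taylor coefficients about each segment center, and the example function g(x) = x^(-3/2) (the Coulomb force kernel as a function of squared distance) are all assumptions chosen to keep the example self-contained.

```python
def horner(coeffs, h):
    """Evaluate a polynomial in h; coeffs are ordered highest degree first."""
    acc = 0.0
    for c in coeffs:
        acc = acc * h + c
    return acc

def build_table(p, lo, hi, segments, order=4):
    """Build per-segment coefficients for g(x) = x**p on [lo, hi].

    Each segment stores the Taylor coefficients of g about its center
    (an illustrative choice; real tables would use fitted coefficients).
    """
    width = (hi - lo) / segments
    table = []
    for s in range(segments):
        center = lo + (s + 0.5) * width
        coeffs = []
        binom = 1.0  # generalized binomial coefficient C(p, k)
        for k in range(order + 1):
            coeffs.append(binom * center ** (p - k))
            binom *= (p - k) / (k + 1)
        table.append((center, coeffs[::-1]))  # store highest degree first
    return lo, width, table

def evaluate(x, lo, width, table):
    """Look up the segment containing x, then run Horner's method on the offset."""
    s = min(int((x - lo) / width), len(table) - 1)
    center, coeffs = table[s]
    return horner(coeffs, x - center)

lo, width, table = build_table(p=-1.5, lo=1.0, hi=4.0, segments=64)
x = 2.0
print(evaluate(x, lo, width, table), x ** -1.5)
```

Horner's method needs only four multiply-add steps per fourth-order evaluation, which is what makes a fixed, hardwired pipeline practical; changing the table contents changes the function without changing the hardware.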
MDGRAPE-3 Chip - J-Particle Memory and Control Units • 20 Force Calculation Pipelines • j-Particle Memory Unit • Holds 32,768 bodies • Acts as the "main memory" • 6.6 Mbits of static RAM • Cell-Index Controller • Controls j-Particle memory and generates addresses • Force Summation Unit • Master Controller • Manages timings and inputs/outputs of the chip
MDGRAPE-3 Chip • 2 virtual pipelines/physical pipeline • Physical bandwidth of j-particle unit 2.5 Gbytes/sec but virtual bandwidth will reach 100 Gbytes/sec • 340 arithmetic units • 20 function-evaluator units which work simultaneously • 165 Gflops at 250MHz
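The headline numbers on this slide can be cross-checked from the per-pipeline figures. The sketch assumes that the ~33 equivalent operations are counted per clock cycle, which is the reading consistent with 165 Gflops at 250 MHz, and that the virtual bandwidth is the physical bandwidth multiplied by the number of virtual pipelines; the second point is an interpretation, not something the slides spell out.

```python
PIPELINES = 20
SUBTRACTORS, ADDERS, MULTIPLIERS = 3, 6, 8   # per pipeline
OPS_PER_CYCLE = 33                            # equivalent operations, Coulomb force
CLOCK_MHZ = 250
VIRTUAL_PER_PHYSICAL = 2
PHYSICAL_BANDWIDTH_GB_S = 2.5

# Arithmetic units: subtractors, adders, and multipliers across all pipelines.
arithmetic_units = PIPELINES * (SUBTRACTORS + ADDERS + MULTIPLIERS)

# Peak performance: pipelines * operations per cycle * cycles per second.
peak_gflops = PIPELINES * OPS_PER_CYCLE * CLOCK_MHZ / 1000

# Virtual bandwidth (assumed reading): one memory read feeds every virtual pipeline.
virtual_bandwidth_gb_s = PHYSICAL_BANDWIDTH_GB_S * PIPELINES * VIRTUAL_PER_PHYSICAL

print("arithmetic units :", arithmetic_units)
print("peak Gflops      :", peak_gflops)
print("virtual GB/s     :", virtual_bandwidth_gb_s)
```

All three results match the figures quoted on the slide (340 units, 165 Gflops, 100 Gbytes/sec).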
MDGRAPE-3 Chip • Chip made by Hitachi • 6M gates • 10M bits of memory • Chip size is ~220 mm^2 • Dissipates 20 watts at a core voltage of +1.2 V • 0.12 W/Gflops, much better than a 3 GHz Pentium 4 at 14 W/Gflops
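The power-efficiency figure follows from the chip's power draw and its 165 Gflops peak. The short check below uses only numbers from the slides; the ratio against the Pentium 4 is a derived value, not one the slides quote.

```python
CHIP_WATTS = 20.0
CHIP_GFLOPS = 165.0
P4_WATTS_PER_GFLOP = 14.0   # figure quoted on the slide for a 3 GHz P4

chip_watts_per_gflop = CHIP_WATTS / CHIP_GFLOPS
advantage = P4_WATTS_PER_GFLOP / chip_watts_per_gflop

print(f"MDGRAPE-3: {chip_watts_per_gflop:.2f} W/Gflops")
print(f"Roughly {advantage:.0f}x less power per Gflops than the P4 figure")
```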
System Architecture • Host PC cluster will use Itanium or Opteron CPUs • 256 nodes with 512 CPUs in total (2 per node) • Performance of each node is 3.96 Tflops • Total reaches a petaflop • Requires a 10 Gbit/sec network • InfiniBand, 10G Ethernet, or future Myrinet • Network topology will be a 2D hyper-crossbar • Each node has 24 MDGRAPE-3 chips • MDGRAPE-3 chips connected via 2 PCI-X buses at 133 MHz • A 19" rack can house 6 nodes • 43 racks total • Power dissipation ~150 kW • Occupies 100 m^2
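The system-level totals can be reproduced from the per-chip and per-node figures. This is a back-of-the-envelope check using only numbers from the slides; the variable names are illustrative.

```python
import math

NODES = 256
CHIPS_PER_NODE = 24
CHIP_GFLOPS = 165
NODES_PER_RACK = 6

total_chips = NODES * CHIPS_PER_NODE
node_tflops = CHIPS_PER_NODE * CHIP_GFLOPS / 1000
total_pflops = total_chips * CHIP_GFLOPS / 1_000_000
racks = math.ceil(NODES / NODES_PER_RACK)

print("chips        :", total_chips)
print("node Tflops  :", node_tflops)
print("total Pflops :", total_pflops)
print("racks        :", racks)
```

The node figure comes out at 3.96 Tflops, the total just over one petaflop, and the rack count at 43, all in agreement with the slide.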
Software • Very easy to create programs for • All computational abilities provided in a library • No special knowledge of device needed
Cost • $20 million including labor • Less than $10/Gflop • At least ten times better than general-purpose computers even when compared with relatively cheap BlueGene/L ($140/Gflop)
Questions • What is Myrinet? • What is a two-dimensional hyper-crossbar network topology? • How does this compare to massive distributed computing such as Folding@Home? • Advantages? • Disadvantages?