Building a Petaflop Supercomputer: Examples and Ranking

This article discusses examples of Petaflop supercomputers from 2008 and the ranking system used to measure their performance. It also explores the possibility of matching the computational power of the human brain.

Presentation Transcript


  1. Advanced Computer Architecture 5MD00 / 5Z033 • TOP500 supercomputers • Henk Corporaal • www.ics.ele.tue.nl/~heco/courses/aca • h.corporaal@tue.nl • TU Eindhoven, 2011

  2. Topics • How to cross the Petaflop boundary • Ranking • Nov 2008 • Nov 2009 / Nov 2010: what has changed • Examples • Roadrunner (IBM) • Jaguar (Cray) • SGI Altix • BlueGene ACA H.Corporaal

  3. How to build a Petaflop supercomputer? Some examples from 2008: • Opteron cluster (e.g. ~2X Ranger/TACC) • 32,000 quad-core Opterons (130K cores) • Cray XT3/4 (e.g. Baker/ORNL sooner) • 32,000 quad-core Opterons (130K cores) • IBM BlueGene/P (bigger sooner) • 80,000 BG/P PPC processors (320K cores) • IBM Cell-accelerated Roadrunner cluster • 10,000 Cells (80K Cell SPUs)

  4. Supercomputer Ranking • Started in 1993 • Jack Dongarra, University of Tennessee • Based on the LINPACK benchmark • Linear algebra (LU factorization) • The LINPACK library itself has been superseded by LAPACK • Based on BLAS (Basic Linear Algebra Subprograms) • Exploits caches • Measures floating-point performance • Fortran code • See http://www.top500.org
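
To make the benchmark idea on this slide concrete, here is a minimal pure-Python sketch of solving a dense linear system by Gaussian elimination with partial pivoting, which performs the same arithmetic as an LU factorization followed by triangular solves. The function names `solve_dense`, `approx_flops` and `gflops` are illustrative and not part of LINPACK or the TOP500 tooling, and the operation count keeps only the leading (2/3) n^3 term.

```python
def solve_dense(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    a = [row[:] for row in a]  # work on copies, leave the inputs untouched
    b = b[:]
    for k in range(n):
        # Partial pivoting: bring the largest remaining entry of column k up.
        pivot = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[pivot][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[pivot] = a[pivot], a[k]
        b[k], b[pivot] = b[pivot], b[k]
        # Eliminate column k from all rows below the pivot row.
        for i in range(k + 1, n):
            factor = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= factor * a[k][j]
            b[i] -= factor * b[k]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - tail) / a[i][i]
    return x


def approx_flops(n):
    """Leading-order operation count of the elimination, about (2/3) n^3."""
    return 2.0 * n ** 3 / 3.0


def gflops(n, seconds):
    """Convert a measured solve time for an n x n system into GFlop/s."""
    return approx_flops(n) / seconds / 1e9


x = solve_dense([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])
```

A ranking run times a solve like this on a very large matrix and reports the operation count divided by the elapsed time; real benchmark codes use blocked, cache-aware BLAS routines rather than plain loops.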

  5. Single-Chip GPU vs. Fastest Supercomputers • ref: http://www.llnl.gov/str/JanFeb05/Seager.html

  6. Performance Ranking Nov. 2008

  7. Performance Ranking 2008: we crossed the Petaflop boundary

  8. Update November 2009

  9. Update November 2010

  10. Alternative ranking: Green500 • Most power-efficient supercomputers • 2008: best result = 536 MFlops/Watt => 1.87 nJ / floating-point operation • 2009: best result = 723 MFlops/Watt => 1.38 nJ / floating-point operation • Cell cluster, ranked 110 in the TOP500 • 2010: best result = 1684 MFlops/Watt => 594 pJ / floating-point operation • IBM BlueGene/Q • See www.green500.org
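
The conversion from MFlops/Watt to energy per operation used on this slide is a plain reciprocal, since 1 Watt = 1 Joule/s. A short sketch (the function name `energy_per_flop_joules` is illustrative):

```python
def energy_per_flop_joules(mflops_per_watt):
    """Energy per floating-point operation, given MFlops per Watt.

    1 W = 1 J/s, so (flops/s) / W = flops/J and the reciprocal is J/flop.
    """
    return 1.0 / (mflops_per_watt * 1e6)


# Green500 best results quoted on the slide.
for year, efficiency in [(2008, 536), (2009, 723), (2010, 1684)]:
    nanojoules = energy_per_flop_joules(efficiency) * 1e9
    print(year, f"{nanojoules:.2f} nJ per operation")
```

The 2010 value comes out just under 0.6 nJ, which is the 594 pJ quoted above.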

  11. Nr1 (2008): Roadrunner • IBM cluster • 6480 nodes with • Dual-core Opteron 1.8 GHz • 2 x PowerXCell 8i 3.2 GHz (12.8 GFlops) • InfiniBand connection fabric (16 Gbit/s per link) • Fat-tree interconnect • 100 TByte DRAM memory • 216 I/O nodes • MPI programming • 2.35 MW power! • Size: 296 racks, 5500 ft² • This is huge!

  12. Cell/B.E. – the architecture • 1 x PPE 64-bit PowerPC • L1: 32 KB I$ + 32 KB D$ • L2: 512 KB • 8 x SPE cores: • Local store: 256 KB • 128 x 128-bit vector registers • Hybrid memory model: • PPE: Rd/Wr • SPEs: Asynchronous DMA • EIB: 205 GB/s sustained aggregate bandwidth • Processor-to-memory bandwidth: 25.6 GB/s • Processor-to-processor: 20 GB/s in each direction

  14. Roadrunner: TriBlade = 2 nodes • For more details: presentation slides of Ken Koch, March 2008

  15. Nr2 (2008): Jaguar Cray XT5 QC • I guess 5 times • 7832 quad-core 2.1 GHz AMD Opteron • 62 TB memory (= 2 GB / core) • 600 TB file system • 250 TFlops • In total 150,152 cores • SeaStar2+ interconnect (from Cray) • Note 2009: quad-cores replaced by six-cores • now nr 1 • 224,256 cores • peak 1.75 PetaFlops • Paper: Bland A.S., Kendall R.A., Kothe D.B., Rogers J.H., Shipman G.M. Jaguar: The World’s Most Powerful Computer

  16. Jaguar

  17. Nr3 (2008): SGI Altix ICE8200 • 92 racks of Altix ICE 8200EX with 3.0 GHz Intel Xeon quad-core processors • 47,104 cores • 8 racks of Altix ICE 8200 with 2.66 GHz Intel quad-core processors • 4096 cores • 51 TB main memory • DDR InfiniBand

  18. Nr4 (2008): BlueGene/L (IBM) • Based on ASIC with PowerPC 440, 700 MHz, each 2.8 GFlops • 105,496 nodes • 3D torus interconnect for p2p communication + collective network • Figure labels: 3D torus, complete system, rack

  19. BlueGene/L ASIC node

  20. BlueGene/L Node board • 16 cards with 2 ASICs each • 8 GB • 180 GFlops

  21. 2009: BlueGene/P • ASIC: 13.6 GFlops, 8 MB EDRAM • Processor card: one 4-processor chip, 13.6 GFlops, 2-4 GB • Node card: 32 processor cards, 435 GFlops, 64-128 GB • Rack: 32 node cards, 13.9 TFlops, 2-4 TB • System: 256 racks, 3.56 PFlops, up to 1 PB
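
The figures on this slide follow from multiplying up the packaging hierarchy. The sketch below reproduces that roll-up from the per-chip numbers; the function name `bluegene_p_rollup` and its parameter names are illustrative, and memory is counted in binary units (1 TB = 1024 GB), which is an assumption that matches the "up to 1 PB" figure.

```python
def bluegene_p_rollup(asic_gflops=13.6, mem_gb=(2, 4),
                      cards_per_node_card=32,
                      node_cards_per_rack=32,
                      racks=256):
    """Return {level: (peak GFlops, min memory GB, max memory GB)}."""
    gflops = asic_gflops
    low, high = mem_gb
    levels = {"processor card": (gflops, low, high)}
    steps = [("node card", cards_per_node_card),
             ("rack", node_cards_per_rack),
             ("system", racks)]
    for name, factor in steps:
        # Each level simply contains `factor` copies of the level below it.
        gflops *= factor
        low *= factor
        high *= factor
        levels[name] = (gflops, low, high)
    return levels


for name, (gflops, low, high) in bluegene_p_rollup().items():
    print(f"{name}: {gflops:,.1f} GFlops, {low:,}-{high:,} GB")
```

The system level comes out at roughly 3.565 million GFlops and 524,288 to 1,048,576 GB, in line with the 3.56 PFlops and "up to 1 PB" on the slide.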

  22. BlueGene/P ASIC

  23. PPC450: Exploiting SIMD • Two FPUs • 2 x 32 64-bit registers • SIMD • Datapath width = 16 bytes • Feeds two FPUs with 8 bytes each every cycle • Two FP multiply-add operations per cycle • 3.4 GFLOP/s peak performance
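
The 3.4 GFLOP/s peak can be reconstructed from the bullets above, under two assumptions that are not stated on this slide: a multiply-add counts as two floating-point operations, and the clock is the 850 MHz listed for the BlueGene/P ASIC. The helper name `peak_gflops` is illustrative.

```python
def peak_gflops(fpus, flops_per_fma, clock_ghz, cores=1):
    """Peak rate = FPUs x flops per multiply-add x clock x number of cores."""
    return fpus * flops_per_fma * clock_ghz * cores


# One PPC450 core: two FPUs, each issuing one multiply-add per cycle.
core_peak = peak_gflops(fpus=2, flops_per_fma=2, clock_ghz=0.85)

# One BlueGene/P chip holds four such cores.
chip_peak = peak_gflops(fpus=2, flops_per_fma=2, clock_ghz=0.85, cores=4)

print(f"core: {core_peak:.1f} GFLOP/s, chip: {chip_peak:.1f} GFLOP/s")
```

The chip value is the 13.6 GFlops quoted per ASIC in the BlueGene/P system overview.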

  24. BlueGene/P ASIC • 208M transistors • 850 MHz • 16 W • 90 nm

  25. BlueGene/P node card

  26. Next: BlueGene/Q • 10 PFlops in 2011-2012 • see www.research.ibm.com/bluegene

  27. Can we match the human brain? • Performance = 100 billion (10^11) neurons * 1000 (10^3) connections/neuron * 200 (2 * 10^2) calculations per second per connection = 2 * 10^16 calculations per second • Memory = 100 billion (10^11) neurons * 1000 (10^3) connections/neuron * 10 bytes (information about connection strength, address of output neuron, type of synapse) = 10^15 bytes = 1 PB = 1000 TB • How far off are we?
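
A rough answer to "how far off are we?" can be computed from the slide's own estimates. The sketch below assumes, purely for illustration, that one "calculation" can be equated with one floating-point operation, and compares against the 1 PFlops boundary crossed in 2008 and the 3.56 PFlops quoted for a full BlueGene/P system.

```python
NEURONS = 10 ** 11               # 100 billion neurons
CONNECTIONS_PER_NEURON = 10 ** 3
CALCS_PER_SEC_PER_CONNECTION = 200
BYTES_PER_CONNECTION = 10

brain_ops_per_sec = NEURONS * CONNECTIONS_PER_NEURON * CALCS_PER_SEC_PER_CONNECTION
brain_bytes = NEURONS * CONNECTIONS_PER_NEURON * BYTES_PER_CONNECTION

PETAFLOPS = 10 ** 15
# Assumption: one "calculation" is treated as one floating-point operation.
gap_vs_one_petaflops = brain_ops_per_sec / PETAFLOPS
gap_vs_bluegene_p = brain_ops_per_sec / (3.56 * PETAFLOPS)

print(f"brain estimate: {brain_ops_per_sec:.0e} calculations/s, {brain_bytes:.0e} bytes")
print(f"factor over 1 PFlops: {gap_vs_one_petaflops:.0f}")
print(f"factor over 3.56 PFlops: {gap_vs_bluegene_p:.1f}")
```

Under these assumptions the compute estimate is 20 times the Petaflop boundary, while the 1 PB memory estimate matches the maximum memory quoted for a full BlueGene/P system.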

  28. Blue Brain research • Software replica of one column of the neocortex • Cortex: 85% of the brain's total mass, required for language, learning, memory and complex thought • The essential first step to simulating the whole brain • Next: include circuitry from other brain regions and eventually the whole brain

  29. Latest news: factorization of RSA768 • RSA is used to encipher text using both a public and a private key • EPFL, CWI and others have broken RSA768 • This means: factorizing a 768-bit number into 2 primes • Using 1700 AMD 2.2 GHz cores for 1 year => 15 Mh (single core) compute time • Current RSA standard uses 1024 bits • Still safe for some years
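
The "15 Mh" figure is the number of cores times the number of hours in a year. A minimal check, assuming a 365-day year (the function name `core_hours` is illustrative):

```python
def core_hours(cores, years, hours_per_year=365 * 24):
    """Total single-core compute time in hours."""
    return cores * years * hours_per_year


total_hours = core_hours(cores=1700, years=1)
print(f"{total_hours:,} core-hours, about {total_hours / 1e6:.1f} million")
```

This gives 14,892,000 core-hours, which the slide rounds to 15 million hours.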

  30. RSA (Rivest, Shamir, Adleman) • Choose 2 (large) primes p and q • n = p*q • Choose e such that e and (p-1)(q-1) are coprime (i.e. do not share prime factors) • Choose d such that d*e = 1 mod ((p-1)(q-1)) • Public key = (n,e), private key = (n,d) • Encryption of message m: c = m^e mod n • Decryption of cipher c: m = c^d mod n • See Wikipedia for details and a working example
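
The steps on this slide can be sketched with Python's built-in big integers. This is a toy illustration with tiny primes and no padding, not a usable cryptosystem; the function names `make_keys`, `encrypt` and `decrypt` and the example values are chosen here for illustration. The modular inverse via `pow(e, -1, phi)` needs Python 3.8 or later.

```python
from math import gcd


def make_keys(p, q, e):
    """Build an RSA key pair from primes p, q and public exponent e."""
    n = p * q
    phi = (p - 1) * (q - 1)
    if gcd(e, phi) != 1:
        raise ValueError("e must be coprime to (p-1)(q-1)")
    d = pow(e, -1, phi)  # d*e = 1 mod (p-1)(q-1)
    return (n, e), (n, d)


def encrypt(m, public_key):
    n, e = public_key
    return pow(m, e, n)  # c = m^e mod n


def decrypt(c, private_key):
    n, d = private_key
    return pow(c, d, n)  # m = c^d mod n


# Toy example with small primes (illustrative values only).
public_key, private_key = make_keys(p=61, q=53, e=17)
cipher = encrypt(65, public_key)
plain = decrypt(cipher, private_key)
print(public_key, private_key, cipher, plain)
```

Breaking the scheme amounts to recovering p and q from n, which is why the size of n matters: once the factors are known, d follows from the same `make_keys` computation.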

  31. RSA factorization result • Factorization of RSA768, the following 768-bit, 232-digit number from RSA's challenge list: • 1230186684530117755130494958384962720772853569595334792197322452151726400507263657518745202199786469389956474942774063845925192557326303453731548268507917026122142913461670429214311602221240479274737794080665351419597459856902143413 = 33478071698956898786044169848212690817704794983713768568912431388982883793878002287614711652531743087737814467999489 * 36746043666799590428244633799627952632279158164343087642676032283815739666511279233373417143396810270092798736308917
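
The size claims on this slide can be checked from the two prime factors alone, since Python integers have arbitrary precision. The sketch multiplies the factors as printed above and inspects the product's bit length and digit count; the variable names are illustrative.

```python
# The two prime factors as printed on the slide.
p = 33478071698956898786044169848212690817704794983713768568912431388982883793878002287614711652531743087737814467999489
q = 36746043666799590428244633799627952632279158164343087642676032283815739666511279233373417143396810270092798736308917

n = p * q

print("bits in n:   ", n.bit_length())
print("digits in n: ", len(str(n)))
print("digits in p, q:", len(str(p)), len(str(q)))
```

Comparing `str(n)` with the number printed on the slide is a quick way to catch digits lost in transcription.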
