Fabrizio Petrini, IBM TJ Watson
Scalable Algorithms for Massive-scale Graphs
KAUST, SC11, November 2011
[Image: LinkedIn network visualization]
Data in the real world: Large Complex Networks

Internet: nodes are the routers, edges are the connections between them. Problems include discovering the network topology and solving bandwidth, flow, and shortest-path problems (1.8 billion Internet users and growing).

Network security: large graphs can be used to dynamically track network behavior and identify potential attacks.

Domains: Social Networks, Internet, Finance, Biological, Transportation

Data sources: DIMACS, Wikipedia
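Shortest paths on unweighted graphs like these reduce to breadth-first search, the kernel at the heart of Graph500. For reference, a minimal sequential BFS over a CSR adjacency list might look like the sketch below; this is illustrative only, not the scalable distributed implementation this talk is about:

```c
#include <stdlib.h>

/* Sequential BFS over a CSR adjacency list: fills dist[] with hop
 * distances from src (-1 = unreachable). Illustrative sketch only --
 * not the distributed Graph500 implementation discussed in this talk. */
void bfs(int n, const int *row, const int *col, int src, int *dist) {
    int *queue = malloc(n * sizeof *queue);
    int head = 0, tail = 0;
    for (int v = 0; v < n; v++) dist[v] = -1;
    dist[src] = 0;
    queue[tail++] = src;
    while (head < tail) {
        int u = queue[head++];
        for (int e = row[u]; e < row[u + 1]; e++) {  /* neighbors of u */
            int v = col[e];
            if (dist[v] == -1) {        /* first visit sets the distance */
                dist[v] = dist[u] + 1;
                queue[tail++] = v;
            }
        }
    }
    free(queue);
}
```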
Data in the real world: Large Complex Networks

Social networks: nodes represent people, joined through personal or professional acquaintance, or through groups and communities.
• Facebook: grew from 100 million to 150 million users in 4 months; more than 400 million users currently
• Twitter: 75 million users (Jan 2010), adding 6.2 million new users per month

Data sources: Nielsen Report, The Inquirer
Blue Gene/Q packaging hierarchy

1. Chip: 16+2 PowerPC cores
2. Single chip module
3. Compute card: one chip module, 8/16 GB DDR3 memory, heat spreader for H2O cooling
4. Node card: 32 compute cards, optical modules, link chips; 5D torus
5a. Midplane: 16 node cards
5b. I/O drawer: 8 I/O cards w/ 16 GB, 8 PCIe Gen2 x8 slots, 3D I/O torus
6. Rack: 2 midplanes
7. System: 96 racks, 20 PF/s

• Sustained single-node performance: 10x Blue Gene/P, 20x Blue Gene/L
• MF/Watt: 6x Blue Gene/P, 10x Blue Gene/L (~2 GF/W, Green 500 criteria)
• Software and hardware support for programming models that exploit node hardware concurrency
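The 20 PF/s system figure follows directly from the packaging counts above. A quick arithmetic check (the constants come from this slide and the chip slide below; only the conversion is ours):

```c
#include <stdio.h>

/* Blue Gene/Q packaging arithmetic, using the counts from the slide. */
int main(void) {
    const long nodes = 96L      /* racks */
                     * 2        /* midplanes per rack */
                     * 16       /* node cards per midplane */
                     * 32;      /* compute cards (nodes) per node card */
    const double gflops_per_node = 204.8;  /* peak, from the chip slide */
    printf("nodes = %ld, peak = %.1f PF/s\n",
           nodes, nodes * gflops_per_node / 1.0e6);
    /* prints: nodes = 98304, peak = 20.1 PF/s -- the slide's 20 PF/s */
    return 0;
}
```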
Message latency and message rate

[Chart: single-threaded latency and message-rate breakdown per messaging stage; individual figures (134, 245, 530 pclk, 779 pclk, 1150 pclk, 1184 pclk) lost their stage labels in extraction]

• Total MPI latency: 2864 pclk
• Total MPI overhead: 1172 pclk
• Our Graph500 implementation relies on SPI rather than MPI
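For context on how such figures are typically gathered (this is a generic benchmark, not the authors' SPI-level harness), a standard MPI ping-pong estimates one-way latency as half the averaged round trip; multiplying by the 1.6 GHz clock converts seconds to pclk:

```c
#include <mpi.h>
#include <stdio.h>

/* MPI ping-pong latency sketch -- a generic benchmark, not the harness
 * behind the slide's figures. One-way latency is half the averaged
 * round-trip time; x 1.6e9 converts seconds to BG/Q pclk. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 10000;
    char byte = 0;
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0) {
        double one_way_s = (MPI_Wtime() - t0) / (2.0 * iters);
        printf("one-way latency: %.0f pclk\n", one_way_s * 1.6e9);
    }
    MPI_Finalize();
    return 0;
}
```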
Blue Gene/Q chip architecture

[Block diagram: 17 PPC cores, each with L1 cache, L1 prefetcher (PF), and quad FPU, connected through a full crossbar switch to 16 x 2 MB L2 slices, two DDR-3 controllers driving external DDR3, the network/DMA unit, test logic, and PCI Express; a 2 GB/s I/O link connects to the I/O subsystem. Note: chip I/O shares function with PCI Express.]

• 16+1 core SMP; each core is 4-way hardware threaded
• Transactional memory and thread-level speculation
• Quad floating-point unit on each core; 204.8 GF peak per node
• Frequency: 1.6 GHz
• 563 GB/s bisection bandwidth to the shared L2 (Blue Gene/L at LLNL has 700 GB/s for the entire system)
• 32 MB shared L2 cache
• 42.6 GB/s DDR3 bandwidth (1.333 GHz DDR3; 2 channels, each with chipkill protection)
• 10 intra-rack interprocessor links at 2.0 GB/s each (10 x 2 GB/s intra-rack and inter-rack, 5-D torus)
• One I/O link at 2.0 GB/s
• 16 GB memory per node
• 55 W chip power
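The headline peak and memory-bandwidth figures are consistent with the clock and unit widths above. A quick check, assuming the quad FPU issues one 4-wide fused multiply-add per core per cycle and 16-byte-wide DDR3 channels (the usual way these peaks are counted):

```c
#include <stdio.h>

/* Sanity-check the slide's headline figures from clock and widths.
 * Assumes one 4-wide FMA per core per cycle (2 flops per lane) and
 * two 16-byte-wide DDR3 channels at 1.333 GT/s. */
int main(void) {
    double node_gf = 16    /* cores */
                   * 4     /* FMA lanes */
                   * 2     /* flops per FMA */
                   * 1.6;  /* GHz */
    double ddr_gbs = 2 * 16 * 1.333;  /* channels x bytes x GT/s */
    printf("peak: %.1f GF/node, DDR3: %.1f GB/s\n", node_gf, ddr_gbs);
    /* prints: peak: 204.8 GF/node, DDR3: 42.7 GB/s
       (the slide rounds the latter to 42.6 GB/s) */
    return 0;
}
```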
Scalable atomic operations (example: fetch_and_inc implementing a queueing lock)

[Block diagram: PPC cores with L1/PF and FPUs reaching the 2 MB L2 slices through the crossbar switch; numbered labels mark the steps of the atomic operation]

• Latency: 1 L2 round trip + 4 L2 cycles per thread, where N is the number of threads
• For N = 64 threads and a 75-cycle L2 round trip: 75 + 4 x 64 = 331 cycles
• Compared to ~9600 cycles for the standard implementation
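A queueing lock built on fetch_and_inc is commonly realized as a ticket lock. The sketch below uses portable C11 atomics; it is a generic software rendering, not the BG/Q mechanism measured above, where the L2 itself performs the atomic update and serves waiting threads:

```c
#include <stdatomic.h>

/* Ticket (queueing) lock built on fetch_and_inc -- a portable C11
 * sketch, not the BG/Q L2 hardware mechanism measured on the slide.
 * Each thread atomically takes a ticket, then waits for its turn;
 * the lock is granted in FIFO order. Initialize both fields to 0. */
typedef struct {
    atomic_uint next_ticket;   /* target of the fetch_and_inc */
    atomic_uint now_serving;   /* ticket currently holding the lock */
} ticket_lock;

static void ticket_lock_acquire(ticket_lock *l) {
    unsigned me = atomic_fetch_add_explicit(&l->next_ticket, 1,
                                            memory_order_relaxed);
    while (atomic_load_explicit(&l->now_serving,
                                memory_order_acquire) != me)
        ;  /* spin until our ticket is served */
}

static void ticket_lock_release(ticket_lock *l) {
    atomic_fetch_add_explicit(&l->now_serving, 1, memory_order_release);
}
```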