“The Architecture of Massively Parallel Processor CP-PACS” • Taisuke Boku, Hiroshi Nakamura, et al. • University of Tsukuba, Japan • by Emre Tapcı
Outline • Introduction • Specification of CP-PACS • Pseudo Vector Processor PVP-SW • Interconnection Network of CP-PACS • Hyper-crossbar Network • Remote DMA message transfer • Message broadcasting • Barrier synchronization • Performance Evaluation • Conclusion, References, Questions & Comments
Introduction • CP-PACS: Computational Physics by Parallel Array Computer Systems • Goal: construct a dedicated MPP for computational physics, in particular for the study of Quantum Chromodynamics (QCD) • Center for Computational Physics, University of Tsukuba, Japan
Specification of CP-PACS • MIMD parallel processing system with distributed memory. • Each Processing Unit (PU) has a RISC processor and a local memory. • 2048 such PUs, connected by an interconnection network. • 128 I/O units that support a distributed disk space.
Specification of CP-PACS • Theoretical performance • To solve problems such as QCD and astrophysical fluid dynamics, a great number of PUs is required. • For budget and reliability reasons, the number of PUs is limited to 2048.
Specification of CP-PACS • Node processor • The function of the node processors is improved first. • Caches do not work efficiently on ordinary RISC processors for these workloads. • A new technique that takes over the role of the cache is introduced: PVP-SW
Specification of CP-PACS • Interconnection Network • 3-dimensional Hyper-Crossbar (3-D HXB) • Peak throughput of a single link: 300 MB/sec • Provides • Hardware message broadcasting • Block-stride message transfer • Barrier synchronization
Specification of CP-PACS • I/O system • 128 I/O units, equipped with RAID-5 hard disk system. • 528 GB total system disk space. • RAID-5 system increases fault tolerance.
Pseudo Vector Processor PVP-SW • MPPs require high-performance node processors. • A node processor cannot achieve high performance unless the cache system works efficiently. • Little temporal locality exists in these applications. • The data space of an application is much larger than the cache size.
Pseudo Vector Processor PVP-SW • Vector processors • Main memory access is pipelined. • The vector length of load/store is long. • Load/store is executed in parallel with arithmetic execution. • These features are required in the node processor, so PVP-SW is introduced. • It is a pseudo-vector scheme: vector-style processing without a dedicated vector unit.
Pseudo Vector Processor PVP-SW • The number of registers cannot simply be increased, because the register field in the instruction format is limited. • So a new technique, Slide-Windowed Registers, is introduced.
Pseudo Vector Processor PVP-SW • Slide-Windowed Registers • The physical registers are organized into logical windows; a window consists of 32 registers. • The total number of physical registers is 128. • Registers are divided into global registers and window registers. • Global registers are static and shared by all windows. • Window registers are not shared between windows. • Only one window is active at a time.
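The register organization can be pictured with a small C model. This is only an illustrative sketch: the 8-register global split, the wrap-around rule, and the logical_to_physical helper are assumptions made for the example, not the actual CP-PACS register mapping.

```c
#include <stdio.h>

/* Illustrative model (not the real hardware layout): 128 physical
 * floating-point registers, addressed through a 32-register logical
 * window whose base is selected by the FW-STP pointer.  The split
 * between global and windowed registers is assumed for this sketch. */
#define PHYS_REGS   128
#define WINDOW_SIZE 32
#define NUM_GLOBAL  8          /* assumption: first 8 logical regs are global */

static double phys[PHYS_REGS]; /* physical register file */
static int    fw_stp = 0;      /* sliding window base pointer (FW-STP) */

/* Map a logical register number (0..31) to a physical register. */
static int logical_to_physical(int logical)
{
    if (logical < NUM_GLOBAL)
        return logical;        /* globals: shared by all windows */
    /* window registers slide with FW-STP, wrapping inside the windowed area */
    return NUM_GLOBAL +
           (fw_stp + (logical - NUM_GLOBAL)) % (PHYS_REGS - NUM_GLOBAL);
}

int main(void)
{
    fw_stp = 0;
    printf("logical r10 -> physical r%d\n", logical_to_physical(10));
    fw_stp = 24;               /* slide the window: the FWSTPSet idea */
    printf("logical r10 -> physical r%d\n", logical_to_physical(10));
    return 0;
}
```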
Pseudo Vector Processor PVP-SW • Slide-Windowed Registers • The active window is identified by a pointer, FW-STP. • New instructions are introduced to deal with FW-STP: • FWSTPSet: sets a new location for FW-STP (slides the window). • FRPreload: loads data from memory into a window register. • FRPoststore: stores data from a window register into memory.
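To show how these instructions would be used, here is a hedged C sketch of a pseudo-vectorized loop: two plain arrays stand in for two register windows, and the double-buffered preload mimics what FRPreload, FRPoststore and FWSTPSet would do in hardware. The daxpy kernel, the block size, and the buffer names are illustrative assumptions, not the CP-PACS code generation scheme.

```c
#include <stddef.h>

/* Sketch of the pseudo-vector idea in plain C: while the arithmetic for
 * the current block runs, data for the next block is "preloaded" into a
 * second buffer that stands in for another register window.  On CP-PACS
 * the loads/stores would be FRPreload/FRPoststore instructions and the
 * buffer switch an FWSTPSet; here it is modeled with two arrays.
 * Assumes n is a multiple of BLK for brevity. */
#define BLK 32                          /* one logical window: 32 registers */

void daxpy_pseudo_vector(size_t n, double a,
                         const double *x, double *y)
{
    double win[2][BLK];                 /* two "windows" for double buffering */
    size_t nblk = n / BLK;
    int cur = 0;

    /* preload the first block (FRPreload analogue) */
    for (int j = 0; j < BLK; j++) win[cur][j] = x[j];

    for (size_t b = 0; b < nblk; b++) {
        int nxt = cur ^ 1;
        /* preload the next block while the current one is being used */
        if (b + 1 < nblk)
            for (int j = 0; j < BLK; j++)
                win[nxt][j] = x[(b + 1) * BLK + j];

        /* arithmetic on the current window; writing the result back to
         * memory plays the role of the poststore */
        for (int j = 0; j < BLK; j++)
            y[b * BLK + j] += a * win[cur][j];

        cur = nxt;                      /* FWSTPSet analogue: slide the window */
    }
}
```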
Interconnection Network of CP-PACS • The topology is a 3-dimensional Hyper-Crossbar Network (HXB) • 8 x 17 x 16 nodes: 2048 PUs and 128 I/O units. • Along each dimension, the nodes are interconnected by a crossbar. • For example, on the Y dimension, a Y x Y crossbar is used. • Routing is simple: route on the three dimensions consecutively. • Wormhole routing is employed.
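The dimension-by-dimension routing rule can be sketched in a few lines of C. Because every dimension is a full crossbar, correcting one coordinate costs a single hop, so any node is reached in at most three hops. The X-then-Y-then-Z order and the route() helper are assumptions for illustration; the slide only states that the dimensions are routed consecutively.

```c
#include <stdio.h>

/* Sketch of dimension-order routing on a 3-D hyper-crossbar: each
 * mismatched coordinate is corrected by one crossbar hop. */
typedef struct { int x, y, z; } coord;

static void route(coord src, coord dst)
{
    coord cur = src;
    if (cur.x != dst.x) { cur.x = dst.x; printf("X-crossbar hop -> (%d,%d,%d)\n", cur.x, cur.y, cur.z); }
    if (cur.y != dst.y) { cur.y = dst.y; printf("Y-crossbar hop -> (%d,%d,%d)\n", cur.x, cur.y, cur.z); }
    if (cur.z != dst.z) { cur.z = dst.z; printf("Z-crossbar hop -> (%d,%d,%d)\n", cur.x, cur.y, cur.z); }
}

int main(void)
{
    coord src = {0, 0, 0}, dst = {5, 12, 9};   /* example coordinates */
    route(src, dst);
    return 0;
}
```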
Interconnection Network of CP-PACS • Wormhole routing and the HXB together have these properties: • Small network diameter • A torus of the same size can be simulated. • Message broadcasting by hardware. • A binary hypercube can be emulated. • Throughput is high even under random transfer.
Interconnection Network of CP-PACS • Remote DMA transfer • Making a system call to the OS and copying data into the OS area is costly. • Instead, a remote node's memory is accessed directly. • Remote DMA is advantageous because: • Mode switching (kernel/user mode) is avoided. • Redundant data copying (between user and kernel space) is not done.
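CP-PACS's remote DMA predates portable APIs, but its user-level semantics (writing directly into a remote node's memory without kernel copies on the data path) resemble one-sided communication. The sketch below uses MPI_Put purely as an analogy, not as the CP-PACS interface.

```c
#include <mpi.h>
#include <stdio.h>

/* Analogy only: MPI one-sided communication is used to sketch the
 * remote-DMA idea.  CP-PACS used its own remote DMA hardware, not MPI.
 * Run with at least 2 MPI ranks. */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[4] = {0};                /* memory exposed for remote access */
    MPI_Win win;
    MPI_Win_create(buf, sizeof buf, sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        double data[4] = {1.0, 2.0, 3.0, 4.0};
        /* write directly into rank 1's window: the remote-DMA analogue */
        MPI_Put(data, 4, MPI_DOUBLE, 1, 0, 4, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1)
        printf("rank 1 received %.1f %.1f %.1f %.1f\n",
               buf[0], buf[1], buf[2], buf[3]);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```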
Interconnection Network of CP-PACS • Message Broadcasting • Supported by hardware. • First performed on one dimension • Then performed on the other dimensions • Hardware mechanisms prevent deadlock when two nodes broadcast at the same time. • Hardware partitioning is possible. • A broadcast message is sent only to the nodes in the sender's partition.
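A toy C simulation of the per-dimension coverage may make the broadcast order concrete. The 4 x 4 x 4 grid and the has_msg array are illustrative assumptions; on the real machine each step is a crossbar transfer performed in hardware.

```c
#include <stdio.h>
#include <string.h>

/* Sketch of dimension-by-dimension broadcasting on a small 3-D grid:
 * the source covers its X line, then every holder covers its Y line,
 * then its Z line, so the whole grid is reached in three steps. */
#define NX 4
#define NY 4
#define NZ 4

static int has_msg[NX][NY][NZ];

int main(void)
{
    memset(has_msg, 0, sizeof has_msg);
    int sx = 1, sy = 2, sz = 3;                          /* source node */
    has_msg[sx][sy][sz] = 1;

    for (int x = 0; x < NX; x++)                         /* step 1: along X */
        has_msg[x][sy][sz] = 1;

    for (int x = 0; x < NX; x++)                         /* step 2: along Y */
        for (int y = 0; y < NY; y++)
            has_msg[x][y][sz] = 1;

    for (int x = 0; x < NX; x++)                         /* step 3: along Z */
        for (int y = 0; y < NY; y++)
            for (int z = 0; z < NZ; z++)
                has_msg[x][y][z] = 1;

    int covered = 0;
    for (int x = 0; x < NX; x++)
        for (int y = 0; y < NY; y++)
            for (int z = 0; z < NZ; z++)
                covered += has_msg[x][y][z];
    printf("nodes reached: %d of %d\n", covered, NX * NY * NZ);
    return 0;
}
```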
Interconnection Network of CP-PACS • Barrier Synchronization • A synchronization mechanism is required in inter-processor communication. • CP-PACS supports a hardware barrier synchronization facility. • It makes use of special synchronization packets, distinct from ordinary data packets. • Partitioned pieces of the network can also use barrier synchronization.
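The semantics of a barrier (no participant proceeds until all have arrived) can be sketched in shared-memory C with atomics. This is an analogy of the behavior only; CP-PACS implements it in the network hardware with synchronization packets, not with shared counters.

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

/* Counter-based barrier sketch: the last thread to arrive resets the
 * counter and bumps the generation number, releasing the others. */
#define NTHREADS 4

static atomic_int arrived    = 0;
static atomic_int generation = 0;

static void barrier_wait(void)
{
    int gen = atomic_load(&generation);
    if (atomic_fetch_add(&arrived, 1) == NTHREADS - 1) {
        atomic_store(&arrived, 0);          /* last arrival resets and releases */
        atomic_fetch_add(&generation, 1);
    } else {
        while (atomic_load(&generation) == gen)
            ;                               /* spin until released */
    }
}

static void *worker(void *arg)
{
    int id = *(int *)arg;
    printf("thread %d before barrier\n", id);
    barrier_wait();
    printf("thread %d after barrier\n", id);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    int id[NTHREADS];
    for (int i = 0; i < NTHREADS; i++) {
        id[i] = i;
        pthread_create(&t[i], NULL, worker, &id[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```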
Performance Evaluation • Based on the LINPACK benchmark. • LU decomposition of a matrix. • The outer product method is used, based on a 2-dimensional block-cyclic distribution. • All floating-point and load/store operations are done in the PVP-SW manner.
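The outer-product LU step at the core of the benchmark can be shown as a small serial C routine. Pivoting, the 2-D block-cyclic distribution over PUs, and the PVP-SW-style inner loops are all omitted here; the matrix and its size are made up for illustration.

```c
#include <stdio.h>

#define N 4

/* Serial sketch of LU decomposition by the outer-product method: at
 * step k the trailing submatrix is updated with a rank-1 (outer)
 * product of the k-th column and k-th row.  No pivoting, for brevity. */
static void lu_outer_product(double a[N][N])
{
    for (int k = 0; k < N; k++) {
        for (int i = k + 1; i < N; i++)
            a[i][k] /= a[k][k];                 /* column of L */
        for (int i = k + 1; i < N; i++)         /* rank-1 update of trailing block */
            for (int j = k + 1; j < N; j++)
                a[i][j] -= a[i][k] * a[k][j];
    }
}

int main(void)
{
    double a[N][N] = {
        {4, 3, 2, 1},
        {3, 4, 3, 2},
        {2, 3, 4, 3},
        {1, 2, 3, 4},
    };
    lu_outer_product(a);                        /* L and U stored in place */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) printf("%7.3f ", a[i][j]);
        printf("\n");
    }
    return 0;
}
```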
Conclusion • CP-PACS is operational at the University of Tsukuba. • It is being used for large-scale QCD calculations. • Sponsored by Hitachi Ltd. and a Grant-in-Aid of the Ministry of Education, Science and Culture, Japan.
References • T. Boku, H. Nakamura, K. Nakazawa, Y. Iwasaki, “The Architecture of Massively Parallel Processor CP-PACS”, Institute of Information Sciences and Electronics, University of Tsukuba.