1 / 16

Issues of HPC software

Issues of HPC software. ——From the experience of TH-1A. Lu Yutong NUDT. TH-1A system. Installed in NSCC-TJ, Aug. 2010 Hybrid MPP structure: CPU & GPU Custom software stack Peak performance 4.7Pflop/s, Linpack 2.566PFlop/s, No.1Top500

baina
Download Presentation

Issues of HPC software

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Issues of HPC software ——From the experience of TH-1A Lu Yutong NUDT

  2. TH-1A system Installed in NSCC-TJ, Aug. 2010 Hybrid MPP structure: CPU & GPU Custom software stack Peak performance 4.7Pflop/s, Linpack 2.566PFlop/s, No.1Top500 Power consumption 4.04MW(635.15MF/W), No.11Green500

  3. Hardware • 4 kinds of nodes, 2 sets of network • Custom cabinet • Water cooling system • 3 kinds of LSI chips • CPU: FT-1000(PSoC) • High radix router ASIC:NRC • Network interface ASIC:NIC • 15 kinds of PCB boards

  4. TH-1A software Software stack

  5. What’s the point • Customization • Optimization Oriented to the system architecture

  6. Operating system • Kylin Linux • compute node kernel • Provide virtual running environment • Isolated running environments for different users • Custom software package installation • QoS support • Power aware computing

  7. Glex communicating system • Proprietary Interconnection based on high radix router • High bandwidth packet and RDMA communication • Zero copy user space RDMA • MPI base on GLEX: Bandwidth 6.3GB/s • Accelerate collective operation with hardware support in communication interface • Fault tolerance • Rapid error detection in large scale interconnection • Rebuild communication links

  8. Resource management system • Resource management, job scheduler • Heterogenous resources management and topology-aware scheduling • Large scale parallel job launcher (custom) • Improve system structure, optimized protocol, network performance, file system performance • logN, klogN • System power management • Automatic CR supporting • Accounting Enhance

  9. Global parallel file system • Object storage architecture (Lustre based) • Capacity: 2 PB, Scalability • Performance: Collective BW • Optimized file system protocol over proprietary interconnection network • Confliction release for concurrency accessing • Fine-grain distributed file lock mechanism • Optimized file cache policy • Reliability enhancement • Fault tolerance of network protocol • Data objects placement • Soft-raid

  10. Compiler system • C, C++, Fortran, Java • OpenMP, MPI, OpenMP/MPI • CUDA, OpenCL • Heterogeneous programming framework • Accelerate the large scale, complex applications, • Use the computing power of CPUs and GPUs, hide the GPU programming to users • Inter-node homogeneous parallel programming (users) • Intra-node heterogeneous parallel computing (experts)

  11. Compiler Optimized • Accelerating HPL (MPI(custom)+OpenMP+Cuda) • Adaptive partition ● Asynchronous data transfer • Software pipeline ● Affinity scheduling ● Zero-copy

  12. Programming environment • Virtual running environments • Provide services on demand • Parallel toolkits • Based on Eclipse • To integrate all kinds of tools • Editor, debugger, profiler • Work flow support • Support QoS negotiate • Reservation

  13. Nowadays • Our principle • Practicality and Usability • HPC system software • Mature technology,correctness and functionality • Optimizing technology,increase performance , scalarbility and reliability • Needs of new technology and theory

  14. Programming Language • Productive language • X10, Chapel, Fortress • Global Arrays, …… • MPI is still vitality • Compiler independence • Distributed features • MPI + sth

  15. Exascale computing • Hardware architecture • Many cores system • Nano-Photonics • System Software-- (Adaptive + Intelligent) • Extreme parallelism • Heterogenous computing • Fault-tolerant computing • Power management • Application software • Legacy codes supporting • Creative model

  16. Thank you!

More Related