Dilemma of Parallel Programming Xinhua Lin (林新华) HPC Lab of SJTU @XJTU, 17th Oct 2011
Disclaimers • I am not funded by CRAY • Slides marked with the Chapel logo are taken from Brad Chamberlain's talk 'The Mother of All Chapel Talks', with his permission • Funny pictures are from the Internet
About Me and the HPC Lab at SJTU • Directing the HPC Lab • Co-translator of PPP • Co-founder of the HMPP CoC for AP & Japan • One of the MS HPC Invitation Institutes @ SH • Support for the HPC Center of SJTU • Hold the SJTU HPC Seminar monthly http://itis.grid.sjtu.edu.cn/blog
Three Challenges for ParaProg in the Multi/Many-core Era • Revolution vs. Evolution • Low level vs. High level • Performance vs. Programmability • Performance vs. Performance Portability For more details: Paper version: <中国教育网络> (China Education Network), special issue on HPC and Cloud, Sep 2011 Online version: http://itis.grid.sjtu.edu.cn/blog
Outline • The right level to expose parallelism • Review of ParaProg languages • Multiresolution and Chapel
Can we stop water/parallelism? The stack, top to bottom: Language, Library, OS, ISA, Hardware
Performance vs. Programmability High level: higher-level abstractions (ZPL, HPF) hide the target machine ("Why don't I have more control?"). Low level: MPI, OpenMP, and pthreads expose the implementing mechanisms of the target machine ("Why is everything so tedious?")
ParaProg Education • Tired of teaching yet another specific language • MPI for clusters • OpenMP for SMP, then multi-core CPUs • CUDA for GPUs, and now OpenCL • More on the way… • Have had to explain the same concepts with different tools • A single language to explain them all? • Similar situation in OS education • Production OSes: Linux, Unix, and Windows • An OS only for education: Minix
Hybrid Programming Model • MPI alone is insufficient in the multi/many-core era • OpenMP for multi-core • CUDA/OpenCL for many-core* • So-called hybrid programming was invented as a temporary solution: workable but ugly • MPI+OpenMP for multi-core clusters • MPI+CUDA/OpenCL for GPU clusters like Tianhe-1A • A similar two-level idea appears within CUDA (thread and thread-block) and OpenCL (work-item and work-group) * We will wait and see how OpenMP works on Intel MIC
ParaProg from Different Angles • Low level (exposes implementation mechanisms) • MPI, CUDA, and OpenCL • OpenMP • High level • PGAS: CAF, UPC, and Titanium • Global view: NESL, ZPL • APGAS: Chapel, X10 • Directive based • HMPP, PGI, Cray directives
What is Multiresolution? Structure the language in a layered manner, permitting it to be used at multiple levels as required/desired • support high-level features and automation for convenience • provide the ability to drop down to lower, more manual levels • use appropriate separation of concerns to keep these layers clean Language concepts, layered on top of the target machine: Distributions, Data Parallelism, Task Parallelism, Locality Control, Base Language, Target Machine
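To make the layered idea concrete, here is a minimal Chapel sketch of my own (not from the talk): the array, the Block distribution, and the per-locale loop are illustrative choices, and the syntax follows later Chapel releases rather than the 1.3.0 mentioned below. The same update is written once at the high, data-parallel level and once a level lower, with tasking and locality spelled out by hand.

use BlockDist;

config const n = 1000;

// High level: a block-distributed domain and a data-parallel forall.
// The compiler and runtime decide how iterations map to locales and tasks.
const D = {1..n} dmapped Block(boundingBox={1..n});
var A: [D] real;

forall i in D do
  A[i] = 2.0 * i;

// Lower level: drop down to explicit tasking and locality.
// One task per locale; each task walks only its locale's piece of D.
coforall loc in Locales do
  on loc do
    for i in D.localSubdomain() do
      A[i] = 2.0 * i;

Both loops compute the same result; the point of multiresolution is that the second, more manual form stays available in the same language when the automated one is not good enough.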
Where Chapel Was Born: HPCS HPCS: High Productivity Computing Systems (DARPA et al.) • Goal: raise the productivity of high-end computing users by 10x • Productivity = Performance + Programmability + Portability + Robustness • Phase II: Cray, IBM, Sun (July 2003 – June 2006) • Evaluated the entire system architecture's impact on productivity… • processors, memory, network, I/O, OS, runtime, compilers, tools, … • …and new languages: Cray: Chapel, IBM: X10, Sun: Fortress • Phase III: Cray, IBM (July 2006 – 2010) • Implement the systems and technologies resulting from Phase II • (Sun also continues work on Fortress, without HPCS funding)
Global-view vs. Fragmented Problem: "Apply 3-pt stencil to vector" Global-view: the whole vector is updated at once; each element becomes the average of its two neighbors, (left + right)/2. Fragmented: each process applies the same stencil only to its local piece of the vector, so the computation is expressed as several partial averages stitched together at the boundaries.
Global-view vs. SPMD Code

SPMD:
def main() {
  var n: int = 1000;
  var locN: int = n/numProcs;
  var a, b: [0..locN+1] real;

  if (iHaveRightNeighbor) {
    send(right, a(locN));
    recv(right, a(locN+1));
  }
  if (iHaveLeftNeighbor) {
    send(left, a(1));
    recv(left, a(0));
  }
  forall i in 1..locN {
    b(i) = (a(i-1) + a(i+1))/2;
  }
}

Global-view:
def main() {
  var n: int = 1000;
  var a, b: [1..n] real;

  forall i in 2..n-1 {
    b(i) = (a(i-1) + a(i+1))/2;
  }
}
Chapel Overview • A design principle for HPC: "Support the general case, optimize for the common case" • Data parallelism (from ZPL) + task parallelism (from the Cray MTA) + a scripting-language feel • The latest version, 1.3.0, is available as open source: http://sourceforge.net/projects/chapel Language concepts, layered on top of the target machine: Distributions, Data Parallelism, Task Parallelism, Locality Control, Base Language, Target Machine
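As a taste of those three ingredients together, a tiny sketch of my own (not from the talk; the task count and the toy reduction are illustrative): a config constant gives the script-like feel of setting values from the command line, coforall gives task parallelism, and a reduction over a forall expression gives data parallelism.

config const numTasks = 4;   // script-like: override with --numTasks=8 at run time

// task parallelism: spawn numTasks tasks and wait for them all
coforall tid in 1..numTasks do
  writeln("hello from task ", tid);

// data parallelism: a parallel reduction over a range
const total = + reduce [i in 1..100] i;
writeln("sum 1..100 = ", total);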
Chapel Example: Heat Transfer Jacobi iteration on an n x n grid A: one boundary edge is held at 1.0, and each interior point is repeatedly replaced by the average of its four neighbors (divide by 4) until the maximum change between sweeps drops below a small epsilon.
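For reference, a sketch of that kernel, modeled on the standard Chapel heat-transfer example (the grid size, the epsilon, and the choice of which boundary edge is set to 1.0 are illustrative; syntax follows later Chapel releases rather than 1.3.0):

config const n = 6,
             epsilon = 1.0e-5;

const BigD = {0..n+1, 0..n+1},   // full grid, including the boundary
      D    = {1..n,   1..n};     // interior points

var A, Temp: [BigD] real;

A[n+1, 1..n] = 1.0;              // hold one boundary edge at 1.0

do {
  // each interior point becomes the average of its four neighbors
  forall (i, j) in D do
    Temp[i, j] = (A[i-1, j] + A[i+1, j] + A[i, j-1] + A[i, j+1]) / 4;

  // largest change in this sweep (Chapel's do-while may read variables
  // declared in the loop body)
  const delta = max reduce abs(A[D] - Temp[D]);
  A[D] = Temp[D];
} while (delta > epsilon);

The global-view style shows up here as well: the whole-array operations and the forall over the interior domain say nothing about how the grid is split across processors.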
Chapel as Minix in ParaProg • If I were to offer a ParaProg class, I’d want to teach about: • data parallelism • task parallelism • concurrency • synchronization • locality/affinity • deadlock, livelock, and other pitfalls • performance tuning • …
Conclusion: Major Points • Programmability and performance are the perennial dilemma of ParaProg • Multiresolution sounds perfect in theory but is not yet mature enough for production • However, Chapel could serve as the Minix of ParaProg