Conference Review Presented by: Utku Aydonat

PPoPP’06: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, New York, March 29-31, 2006. Conference review presented by Utku Aydonat. Outline: conference overview, brief summaries of sessions, keynote speeches and panel, best paper.


Presentation Transcript


  1. PPoPP’06: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, New York, March 29-31, 2006 • Conference Review Presented by: Utku Aydonat

  2. Outline • Conference overview • Brief summaries of sessions • Keynote speeches & Panel • Best paper CARG

  3. Conference Overview • History: 90, 91, 93, 95, 97, 99, 01, 03, 05, 06 • Primary focus: anything related to parallel programming • Algorithms • Communication • Languages • 8 sessions, 26 papers • Dominating topics: multicores, parallelization techniques

  4. Conference Overview • PPoPP: Paper Acceptance Statistics

     Year  Submitted  Accepted  Rate
     2006         91        25   27%
     2005         87        27   31%
     2003         45        20   44%
     1999         79        17   22%
     1997         86        26   30%

  5. Overview of Sessions • Communication • Languages • Performance Characterization • Shared Memory Parallelism • Atomicity Issues • Multicore Software • Transactional Memory • Potpourri

  6. Session 1: Communication • “Collective Communication on Architectures that Support Simultaneous Communication over Multiple Links”, E. Chan, R. van de Geijn (UTexas), W. Gropp, R. Thakur (Argonne National Lab.) • Adapts MPI collective communication algorithms to supercomputer architectures that support simultaneous communication with multiple nodes. • Theoretically, latency can be reduced; in practice the reduction is not achieved because of the algorithms and overheads. • “Performance Evaluation of Adaptive MPI”, Chao Huang, Gengbin Zheng (UIUC), Sameer Kumar (IBM T. J. Watson), Laxmikant Kale (UIUC) • Designs and evaluates AMPI, which supports processor virtualization • Benefits: load balancing, adaptive overlapping, independence from the available number of processors, etc.

  7. Session 1: Communication • “Mobile MPI Programs on Computational Grids”, Rohit Fernandes, Keshav Pingali, Paul Stodghill (Cornell) • Checkpointing system for C programs using MPI • Able to take checkpoints on an Alpha cluster and restart them on Windows • “RDMA Read Based Rendezvous Protocol for MPI over InfiniBand: Design Alternatives and Benefits”, Sayantan Sur, Hyun-Wook Jin, Lei Chai, Dhabaleswar K Panda (Ohio State) • A rendezvous protocol in MPI using RDMA read. • Increases communication / computation overlap.

  8. Session 2: Languages • “Global-View Abstractions for User-Defined Reductions and Scans”, Steve J. Deitz, David Callahan, Bradford L. Chamberlain (Cray), Lawrence Snyder (U. of Washington) • Chapel programming language, developed by Cray Inc. as part of the DARPA High-Productivity Computing Systems program • Global-view abstractions for user-defined reductions and scans • “Programming for Parallelism and Locality with Hierarchically Tiled Arrays”, Ganesh Bikshandi, Jia Guo, Daniel Hoeflinger (UIUC), Gheorghe Almasi (IBM T. J. Watson), Basilio B Fraguela (Universidade da Coruña), Maria Jesus Garzaran, David Padua (UIUC), Christoph von Praun (IBM T. J. Watson) • Hierarchically Tiled Arrays (HTAs) define a tiling structure for arrays • Reduction, mapping, scan, transpose, and shift operations are defined.

  9. MinK in Chapel • Slide annotation: “Called for each element of A”

       var minimums: [1..10] integer;
       minimums = mink(integer, 10) reduce A;
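To make the idea of a user-defined reduction such as `mink` more concrete, here is a minimal Python sketch. It is not Chapel and not the paper's code: the class name `MinK`, the method names `accumulate`, `combine`, and `generate`, and the chunking helper `mink_reduce` are assumptions chosen for illustration. The point is only that a reduction can be described by a per-element step, a step that merges partial results, and a step that produces the final value.

```python
import heapq
from functools import reduce


class MinK:
    """Hypothetical reduction state that keeps the k smallest values seen."""

    def __init__(self, k):
        self.k = k
        self.values = []  # sorted ascending, at most k entries

    def accumulate(self, x):
        # Called once for each input element.
        self.values = heapq.nsmallest(self.k, self.values + [x])
        return self

    def combine(self, other):
        # Merges the partial state built from another chunk of the input.
        self.values = heapq.nsmallest(self.k, self.values + other.values)
        return self

    def generate(self):
        # Produces the final result of the reduction.
        return list(self.values)


def mink_reduce(data, k, num_chunks=4):
    # Split the input the way a parallel runtime might; the chunks are
    # processed sequentially here to keep the sketch simple.
    size = max(1, -(-len(data) // num_chunks))
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    partials = []
    for chunk in chunks:
        state = MinK(k)
        for x in chunk:
            state.accumulate(x)
        partials.append(state)
    return reduce(lambda a, b: a.combine(b), partials, MinK(k)).generate()


A = [42, 7, 19, 3, 88, 15, 61, 2, 30, 11, 5, 27]
minimums = mink_reduce(A, 3)
```

Because `combine` is associative, the partial states could be merged in any grouping, which is what would let a runtime evaluate the chunks in parallel.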

  10. HTA (figure slide)

  11. Session 3: Performance Characterization • “Performance characterization of bio-molecular simulations using molecular dynamics”, Sadaf Alam, Pratul Agarwal, Al Geist, Jeffrey Vetter (ORNL) • Investigated performance bottlenecks in MD applications on supercomputers • Found that the implementations of the algorithms do not scale • “On-line Automated Performance Diagnosis on Thousands of Processors”, Philip C. Roth (ORNL), Barton P. Miller (U. of Wisconsin, Madison) • Distributed and scalable performance analysis tool • Can analyze large applications with 1024 processes and presents the results in a folded graph.

  12. Session 3: Performance Characterization • “A Case Study in Top-Down Performance Estimation for a Large-Scale Parallel Application”, Ilya Sharapov, Robert Kroeger, Guy Delamarter (Sun Microsystems) • Performance estimation of HPC workloads on future architectures • Based on low-level analysis and scalability predictions. • Predicts the performance of the Gyrokinetic Toroidal Code executed on Sun’s future architectures

  13. Session 4: Shared Memory Parallelism • “Hardware Profile-guided Automatic Page Placement for ccNUMA Systems”, Jaydeep Marathe, Frank Mueller (North Carolina State U.) • Profiles memory accesses and places pages accordingly. • 20% performance improvement and 2.7% overhead. • “Adaptive Scheduling with Parallelism Feedback”, Kunal Agrawal, Yuxiong He, Wen Jing Hsu, Charles Leiserson (Mass. Inst. of Tech.) • Allocates processors to jobs based on the past parallelism of the job. • Uses an R-trimmed mean for the feedback.

  14. Session 4: Shared Memory Parallelism • “Predicting Bounds on Queuing Delay for Batch-scheduled Parallel Machines”, John Brevik, Daniel Nurmi, Rich Wolski (UCSB) • Binomial Method Batch Predictor (BMBP), which bases its predictions on past wait times. • Uses the 95th percentile, and its predictions are close to the wait times actually experienced. • “Optimizing Irregular Shared-Memory Applications for Distributed-Memory Systems”, Ayon Basumallik, Rudolf Eigenmann (Purdue) • Converts OpenMP applications to MPI-based applications • Uses an inspection loop to find non-local accesses and reorder loops.
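As a rough illustration of how a binomial argument can turn past wait times into a bound on a percentile, the sketch below uses the standard order-statistic result: the k-th smallest of n samples is at least the q-quantile exactly when at most k-1 samples fall below that quantile, and that count is binomially distributed. This is a generic textbook construction, not the BMBP implementation; the function names and the 95% confidence default are assumptions, and any history management the real predictor performs is left out.

```python
from math import comb


def binomial_cdf(k, n, p):
    # P(Bin(n, p) <= k)
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))


def quantile_upper_bound(wait_times, quantile=0.95, confidence=0.95):
    """Return an observed wait time that bounds the given quantile from above
    with at least the requested confidence, or None if the history is too
    short to support such a bound."""
    xs = sorted(wait_times)
    n = len(xs)
    for k in range(1, n + 1):
        # The k-th smallest sample is >= the true quantile iff at most k-1
        # samples fall below the quantile.
        if binomial_cdf(k - 1, n, quantile) >= confidence:
            return xs[k - 1]
    return None
```

With these defaults the history must contain at least 59 samples before any bound exists, because the largest sample exceeds the 95th percentile with probability 1 - 0.95^n, which first reaches 0.95 at n = 59.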

  15. OpenMP-to-MPI (figure slide)
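The inspection-loop idea from the OpenMP-to-MPI paper can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: a block distribution of the data, a loop body that reads `x[idx[i]]`, and the hypothetical helper names `owner` and `inspect`. It shows only how an inspector can separate iterations that touch local data from those that need remote elements, so that local iterations can be scheduled first; it does not perform any MPI communication.

```python
def owner(index, block_size):
    # Block distribution: element i lives on process i // block_size.
    return index // block_size


def inspect(my_rank, my_iterations, idx, block_size):
    """Inspector: split this process's iterations into those that touch only
    local data and those that need remote elements, and list what to fetch."""
    local_iters, remote_iters = [], []
    to_fetch = {}  # owner rank -> set of remote element indices
    for i in my_iterations:
        target = idx[i]
        src = owner(target, block_size)
        if src == my_rank:
            local_iters.append(i)
        else:
            remote_iters.append(i)
            to_fetch.setdefault(src, set()).add(target)
    return local_iters, remote_iters, to_fetch


# Hypothetical example: 8 elements over 2 processes, loop body reads x[idx[i]].
idx = [0, 5, 2, 7, 4, 1, 6, 3]
local_iters, remote_iters, to_fetch = inspect(0, range(0, 4), idx, block_size=4)
```

An executor would then request the elements in `to_fetch`, run `local_iters` while that data is in flight, and run `remote_iters` once it has arrived.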

  16. Session 5: Atomicity Issues • “Proving Correctness of Highly-Concurrent Linearizable Objects”, Viktor Vafeiadis (U. of Cambridge), Maurice Herlihy (Brown U.), Tony Hoare (Microsoft Research Cambridge), Marc Shapiro (INRIA Rocquencourt & LIP6) • Proves the safety of concurrent objects using the Rely-Guarantee method • Each thread’s rely condition must be satisfied, and each thread’s guarantee condition must imply the other threads’ rely conditions for every operation. • “Accurate and Efficient Runtime Detection of Atomicity Errors in Concurrent Programs”, Liqiang Wang, Scott D. Stoller (SUNY at Stony Brook) • Instruments the program and obtains a profile of memory accesses • Builds a tree of the conflicting accesses and applies algorithms to check conflict and view equivalence.
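To give a feel for what checking conflict equivalence on a recorded trace involves, the sketch below applies the textbook precedence-graph test: an interleaved trace is conflict-equivalent to some serial order exactly when the graph of conflicting accesses between transactions has no cycle. This is the classical database-theory check, used here as a stand-in; it is not the tree-based algorithm from the paper and it does not cover view equivalence. The trace format and the function name are assumptions.

```python
def conflict_serializable(trace):
    """trace: list of (transaction, op, variable) with op in {'r', 'w'}."""
    edges = {}
    for i, (t1, op1, v1) in enumerate(trace):
        edges.setdefault(t1, set())
        for t2, op2, v2 in trace[i + 1:]:
            # Two accesses conflict if they come from different transactions,
            # touch the same variable, and at least one of them is a write.
            if t1 != t2 and v1 == v2 and 'w' in (op1, op2):
                edges[t1].add(t2)

    # Depth-first search for a cycle in the precedence graph.
    WHITE, GREY, BLACK = 0, 1, 2
    color = {t: WHITE for t in edges}

    def has_cycle(t):
        color[t] = GREY
        for nxt in edges[t]:
            if color[nxt] == GREY or (color[nxt] == WHITE and has_cycle(nxt)):
                return True
        color[t] = BLACK
        return False

    return not any(color[t] == WHITE and has_cycle(t) for t in edges)


# T2's write lands between T1's read and write of x: not serializable.
interleaved = [("T1", "r", "x"), ("T2", "w", "x"), ("T1", "w", "x")]
# The same accesses with T1 running to completion first: serializable.
serial = [("T1", "r", "x"), ("T1", "w", "x"), ("T2", "w", "x")]
```

In the first trace the intended atomic block of T1 is split by T2's write, which is the kind of atomicity error a runtime detector would report.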

  17. Session 5: Atomicity Issues • “Scalable Synchronous Queues”, William N. Scherer III (U. of Rochester), Doug Lea (SUNY Oswego), Michael L. Scott (U. of Rochester) • Best Paper • Details are coming up.
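For readers unfamiliar with the term, a synchronous queue has no capacity: a producer's `put` completes only when a consumer's `take` receives the item. The sketch below shows just that hand-off behaviour with a single lock and condition variable. It is a deliberately simple illustration and does not reproduce the algorithms of the paper; the class and method names are assumptions.

```python
import threading


class SynchronousQueue:
    """Lock-based hand-off: put() returns only after a take() got the item."""

    def __init__(self):
        self._cond = threading.Condition()
        self._item = None
        self._has_item = False
        self._putting = False

    def put(self, item):
        with self._cond:
            while self._putting:        # only one producer offers at a time
                self._cond.wait()
            self._putting = True
            self._item, self._has_item = item, True
            self._cond.notify_all()     # wake a waiting consumer
            while self._has_item:       # block until the item is taken
                self._cond.wait()
            self._putting = False
            self._cond.notify_all()     # let the next producer proceed

    def take(self):
        with self._cond:
            while not self._has_item:   # block until a producer offers an item
                self._cond.wait()
            item, self._item, self._has_item = self._item, None, False
            self._cond.notify_all()     # release the blocked producer
            return item


def demo(n=5):
    q = SynchronousQueue()
    producer = threading.Thread(target=lambda: [q.put(i) for i in range(n)])
    producer.start()
    received = [q.take() for _ in range(n)]
    producer.join()
    return received
```

Every operation in this version contends for the same lock and wakes all waiters, which is the kind of overhead that motivates looking for more scalable designs.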

  18. Session 6: Multicore Software • “POSH: A TLS Compiler that Exploits Program Structure”, Wei Liu, James Tuck, Luis Ceze, Wonsun Ahn (UIUC), Karin Strauss, Jose Renau (UCSC), Josep Torrellas (UIUC) • TLS compiler that divides the program into tasks and prunes the inefficient ones • Uses profiling to detect tasks that may violate frequently. • “High-performance IPv6 Forwarding Algorithm for Multi-core and Multithreaded Network Processors”, Hu Xianghui (U. of Sci. and Tech. of China), Xinan Tang (Intel), Bei Hua (U. of Sci. and Tech. of China) • New IPv6 forwarding algorithm optimized for Intel NPU features • Achieves 10 Gbps for large routing tables with up to 400K entries.

  19. Session 6: Multicore Software • “MAMA! A Memory Allocator for Multithreaded Architectures”, Simon Kahan, Petr Konecny (Cray Inc.) • A memory allocator that aggregates requests to reduce fragmentation • Transforms contention into collaboration • Experiments with micro-benchmarks show that it works

  20. Session 7: Transactional Memory • “A High Performance Software Transactional Memory System For A Multi-Core Runtime”, Bratin Saha, Ali-Reza Adl-Tabatabai, Richard L. Hudson (Intel), Chi Cao Minh, Ben Hertzberg (Stanford) • Maps each memory location to a unique lock and acquires all the relevant locks before committing a transaction • Undo-logging, write-locking/read versioning, cache-line conflict detection • “Exploiting Distributed Version Concurrency in a Transactional Memory Cluster”, Kaloian Manassiev, Madalin Mihailescu, Cristiana Amza (UofT) • Transactional Memory system on commodity clusters for generic C++ and SQL applications • Diffs are applied by readers on demand and may violate writers.
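The combination of undo logging, write locking, and read versioning mentioned for the first paper can be illustrated with a toy, single-threaded model in which two transactions are interleaved by hand. Everything here is an assumption made for illustration (the `Memory` and `Transaction` classes, per-address rather than cache-line granularity, locks taken at the first write); it is not the system described in the paper.

```python
class Conflict(Exception):
    pass


class Memory:
    def __init__(self, size):
        self.data = [0] * size
        self.version = [0] * size   # bumped on every committed write
        self.owner = [None] * size  # transaction holding the write lock, if any


class Transaction:
    def __init__(self, mem):
        self.mem = mem
        self.read_set = {}   # address -> version observed at first read
        self.undo_log = []   # (address, old value), in write order
        self.locked = set()

    def read(self, addr):
        m = self.mem
        if m.owner[addr] not in (None, self):
            raise Conflict(f"address {addr} is write-locked")
        self.read_set.setdefault(addr, m.version[addr])
        return m.data[addr]

    def write(self, addr, value):
        m = self.mem
        if m.owner[addr] not in (None, self):
            raise Conflict(f"address {addr} is write-locked")
        m.owner[addr] = self                        # write lock
        self.locked.add(addr)
        self.undo_log.append((addr, m.data[addr]))  # remember the old value
        m.data[addr] = value                        # update in place

    def commit(self):
        m = self.mem
        # Read versioning: every location read must still have the version seen.
        for addr, seen in self.read_set.items():
            if m.version[addr] != seen:
                self.abort()
                raise Conflict(f"address {addr} changed since it was read")
        for addr in self.locked:
            m.version[addr] += 1
            m.owner[addr] = None
        self.locked.clear()

    def abort(self):
        m = self.mem
        for addr, old in reversed(self.undo_log):   # roll back in reverse order
            m.data[addr] = old
        for addr in self.locked:
            m.owner[addr] = None
        self.locked.clear()
        self.undo_log.clear()


mem = Memory(4)
t1, t2 = Transaction(mem), Transaction(mem)
x = t1.read(0)              # t1 sees version 0 of address 0
t2.write(0, 10)
t2.commit()                 # address 0 moves to version 1
t1.write(1, x + 1)          # based on a stale read
try:
    t1.commit()
    committed = True
except Conflict:
    committed = False       # t1's write to address 1 has been undone
```

In-place updates make commits cheap, at the price of having to replay the undo log when validation fails.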

  21. Session 7: Transactional Memory • “Hybrid Transactional Memory”, Sanjeev Kumar (Intel), Michael Chu (U. of Mich.), Christopher Hughes, Partha Kundu, Anthony Nguyen (Intel) • Hardware and Software TM together • Extends DSTM • Conflict detection is based on loading and storing the state field of the object wrapper and the locator field.

  22. Session 8: Potpourri • “Fast and Transparent Recovery for Continuous Availability of Cluster-based Servers”, Rosalia Christodoulopoulou, Kaloian Manassiev (UofT), Angelos Bilas (U. of Crete), Cristiana Amza (UofT) • Recovery from failure on virtual shared memory systems • Based on page replication on backup nodes • Fail-free overhead of 38%; recovery cost is below 600 ms. • “Minimizing Execution Time in MPI Programs on an Energy-Constrained, Power-Scalable Cluster”, Rob Springer, David K. Lowenthal, Barry Rountree (The U. of Georgia), Vincent W. Freeh (North Carolina State U.) • Finds the combination of processor count and gear that minimizes power and execution time. • Found the optimum schedule for 50% of the programs by exploring 7% of the search space.

  23. Session 8: Potpourri • “Teaching parallel computing to science faculty: best practices and common pitfalls”, David Joiner (Kean U.), Paul Gray (U. of Northern Iowa), Thomas Murphy (Contra Costa College), Charles Peck (Earlham College) • Experience in teaching parallel programming in a community college

  24. Keynote Speeches & Panel • “Parallel Programming and Code Selection in Fortress”, Guy L. Steele Jr., Sun Fellow, Sun Microsystems Laboratories • “Parallel Programming in Modern Web Search Engines”, Raymie Stata, Chief Architect for Search & Marketplace, Yahoo!, Inc. • “Software Issues for Multicore Systems”, Moderator: James Larus (Microsoft Research), Panelists: Saman Amarasinghe (MIT), Richard Brunner (AMD), Luddy Harrison (UIUC), David Kuck (Intel), Michael Scott (U. Rochester), Burton Smith (Microsoft), Kevin Stoodley (IBM)

  25. Guy L. Steele: “Parallel Programming and Code Selection in Fortress” • To do for Fortran what Java did for C • Dynamic compilation • Platform independence • Security model including type checking • Research funded in part by DARPA through its High Productivity Computing Systems program • Don't build the language, grow it • Make programming notation closer to math • Ease use of parallelism • Can a feature be provided by a library rather than in the compiler? • Programmers (especially library writers) need not fear subroutines, functions, methods, and interfaces for performance reasons

  26. Guy L. Steele: “Parallel Programming and Code Selection in Fortress” • Type System: Objects and Traits • Traits: like interfaces, but may contain code • Primitive types are first-class • Booleans, integers, floats, characters are all objects • Transactional access to shared variables • Fortress “loops” are parallel by default • Programming language notation can become closer to mathematical notation

  27. Guy L. Steele: “Parallel Programming and Code Selection in Fortress” (figure slide)

  28. Panel: Software Issues for Multicore Systems • Performance Conscious Languages • Languages that increase programmer productivity while making it easier to optimize • New Compiler Opportunities • New languages that take performance seriously • Possible compiler support for using multicores for purposes other than parallelism • Security Enforcement • Program Introspection • Meanwhile, the vast majority of application programmers have no idea about parallelism • More dual-core in mid-2006, quad-core in 2007 (AMD) • Software Architecture Challenges (debugging, profiling, making multi-threading easier, etc.)

  29. Panel: Software Issues for Multicore Systems • Some Successes in Using Multi-Core (OS support, transactional memory, virtualization, efficient JVMs) • Parallel software systems must be much simpler, architecturally, than sequential ones if they are to have a chance of holding together • We will struggle before finally accepting that the cache abstraction does not scale • Efficient point-to-point communication is required • Most success will be achieved on nonstandard multicore platforms like graphics processors, network processors, signal processors, where there is less investment in caches. • We need new apps to drive the interest towards multicores • Where will the parallelism come from? (dataflow, reduce/map/scan, speculative parallelization, etc.)

  30. Panel: Software Issues for Multicore Systems • The explicit sacrifice of single-thread performance in favor of parallel performance • Most vulnerable communities • Those who have not previously been exposed to or had a need for parallel systems, for example: • Typical client software, mobile devices • Server transactions with significant internal complexity • Those who chronically need to drive the maximum performance from their computer systems, for example: • High performance computing • Gamers • Above 8 cores, we do not know whether multi-cores will be useful or not

  31. Readings For Future CARG • “Optimizing Irregular Shared-Memory Applications for Distributed-Memory Systems”, Ayon Basumallik, Rudolf Eigenmann (Purdue) • “POSH: A TLS Compiler that Exploits Program Structure”, Wei Liu, James Tuck, Luis Ceze, Wonsun Ahn (UIUC), Karin Strauss, Jose Renau (UCSC), Josep Torrellas (UIUC) • “MAMA! A Memory Allocator for Multithreaded Architectures”, Simon Kahan, Petr Konecny (Cray Inc.) • “Hybrid Transactional Memory”, Sanjeev Kumar (Intel), Michael Chu (U. of Mich.), Christopher Hughes, Partha Kundu, Anthony Nguyen (Intel)
