CGO 2006: The Fourth International Symposium on Code Generation and Optimization, New York, March 26-29, 2006. Conference Review. Presented by: Ivan Matosevic
Outline • Conference overview • Brief summaries of sessions • Keynote speeches • Best paper
Conference Overview • Primary focus: back-end compilation techniques • Static analysis and optimization • Profiling • Run-time techniques • 8 sessions, 29 papers • Dominating topics: multicores, dynamic compilation
Overview of Sessions • Dynamic Optimization • Object-Oriented Code Generation and Optimization • Phase Detection and Profiling • Tiled and Multicore Compilation • Static Code Generation and Optimization Issues • SIMD Compilation • Optimization Space Exploration • Security and Reliability
Session 1: Dynamic Optimization • Kim Hazelwood (University of Virginia), Robert Cohn (Intel), A Cross-Architectural Interface for Code Cache Manipulation • Pin dynamic instrumentation system with code cache • The paper describes an API for various operations with the code cache (callbacks, lookups, statistics, etc.) • Derek Bruening, Vladimir Kiriansky, Tim Garnett, Sanjeev Banerji (Determina Corporation), Thread-Shared Software Code Caches • Problem: sharing a code cache across multiple threads • Authors propose a fine-grained locking scheme • Evaluation using DynamoRIO
Session 1: Dynamic Optimization • Keith Cooper, Anshuman Dasgupta (Rice Univ.), Tailoring Graph-coloring Register Allocation For Runtime Compilation • Problem: register allocation in JIT compilers • Authors propose a novel lightweight graph-colouring technique • Weifeng Zhang, Brad Calder, Dean Tullsen (UC San Diego), A Self Repairing Prefetcher in an Event-Driven Dynamic Optimization Framework • Extension of the Trident event-driven dynamic optimization framework (previously proposed by the same authors) • Dynamic insertion of prefetching instructions based on run-time analysis
Session 2: Object-Oriented CodeGeneration and Optimization • Suresh Srinivas, Yun Wang, Miaobo Chen, Qi Zhang, Eric Lin, Valery Ushakov, Yoav Zach, Shalom Goldenberg (Intel Corporation), Java JNI Bridge: An MRTE Framework for Mixed Native ISA Execution • Use a dynamic translator for the execution of native calls to one ISA on a different ISA’s Java platform • Kris Venstermans, Lieven Eeckhout, Koen De Bosschere (Ghent University), Space-Efficient 64-bit Java Objects through Selective Typed Virtual Addressing • Use address bits on a 64-bit architecture to encode object type in order to save memory • Objects of the same type allocated in a contiguous (virtual) region
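The idea behind typed virtual addressing can be illustrated with a toy sketch. The bit layout below (48-bit addresses, tag in the top bits) is a hypothetical simplification for illustration, not the paper's actual encoding, which derives the type from the base address of the contiguous region an object lives in:

```python
# Toy illustration of typed virtual addressing (hypothetical layout, not the
# paper's exact scheme): user-space addresses rarely need the top bits of a
# 64-bit pointer, so a type id can be packed there. The object header then
# no longer needs a separate type word, saving memory per object.

TYPE_SHIFT = 48                      # assume addresses fit in the low 48 bits
ADDR_MASK = (1 << TYPE_SHIFT) - 1

def tag_pointer(addr, type_id):
    """Embed type_id in the unused high bits of addr."""
    assert addr <= ADDR_MASK
    return (type_id << TYPE_SHIFT) | addr

def type_of(ptr):
    """Recover the type id without touching the object's memory."""
    return ptr >> TYPE_SHIFT

def address_of(ptr):
    """Mask off the tag to get the real address."""
    return ptr & ADDR_MASK
```

The benefit is that a virtual call or type check can resolve the receiver's type from the reference alone, with no extra memory access to an object header.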
Session 2: Object-Oriented Code Generation and Optimization • Daryl Maier, Pramod Ramarao, Mark Stoodley, Vijay Sundaresan (IBM Canada), Experiences with Multi-threading and Dynamic Class Loading in a Java Just-In-Time Compiler • The IBM TestaRossa JIT compiler • This paper focuses on code patching and profiling in a multi-threaded environment with a lot of class loading/unloading • Lixin Su, Mikko H Lipasti (University of Wisconsin Madison), Dynamic Class Hierarchy Mutation • Run-time reassignment of objects from one derived class to another, changing their virtual tables • Offers opportunity for optimizations based on specialization
Session 3: Phase Detection and Profiling • Priya Nagpurkar, (UCSB), Michael Hind (IBM), Chandra Krintz, (UCSB), Peter Sweeney, V.T. Rajan (IBM), Online Phase Detection Algorithms • Detecting phase behaviour in virtual machines • Track dynamic program parameters (methods invoked, branch directions…) over time and apply a similarity model • Jeremy Lau, Erez Perelman, Brad Calder (UC San Diego), Selecting Software Phase Markers with Code Structure Analysis • Portions of code whose execution correlates with phase changes • Procedure calls and returns, loop boundaries • Profile-based hierarchical loop-call graph
Session 3: Phase Detection and Profiling • Shashidhar Mysore, Banit Agrawal, Timothy Sherwood, Nisheeth Shrivastava, Subhash Suri (UC Santa Barbara), Profiling over Adaptive Ranges • Voted best paper – details later • Hyesoon Kim, Muhammad Aater Suleman, Onur Mutlu, Yale N. Patt (UT-Austin), 2D-Profiling: Detecting Input-Dependent Branches with a Single Input Data Set • Predicts whether the prediction accuracy of each branch will vary across input sets • Heuristic approach used to derive representative profiling results from a single input set
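The 2D-profiling intuition can be sketched in a few lines. This is a loose illustration under assumed simplifications (a per-slice majority predictor and a made-up variance threshold), not the authors' actual heuristic:

```python
# Toy sketch of the 2D-profiling intuition: measure a branch's prediction
# accuracy in fixed-size time slices of a single profiling run. A branch
# whose per-slice accuracy fluctuates strongly is flagged as likely
# input-dependent, even though only one input set was profiled.
# The predictor model and threshold below are illustrative assumptions.

from statistics import pstdev

def flag_input_dependent(outcome_slices, threshold=0.1):
    """outcome_slices: one list of taken/not-taken outcomes (bools) per
    time slice. Returns True if accuracy varies notably across slices."""
    accuracies = []
    for outcomes in outcome_slices:
        # Accuracy of a simple majority-vote static predictor in this slice.
        taken = sum(outcomes)
        accuracies.append(max(taken, len(outcomes) - taken) / len(outcomes))
    return pstdev(accuracies) > threshold
```

A branch with uniformly high accuracy in every slice is judged stable; one whose accuracy swings between slices is reported as a candidate whose behaviour may change with the input set.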
Session 4: Tiled and Multicore Compilation • David Wentzlaff, Anant Agarwal (MIT), Constructing Virtual Architectures on a Tiled Processor • Map components of a superscalar architecture (Pentium III) onto a parallel tiled architecture (Raw) using dynamic translation • In a way, uses Raw as a coarse-grain FPGA • Aaron Smith, (UT-Austin), J. Burrill, (UMass at Amherst), J. Gibson, B. Maher, N. Nethercote, B. Yoder, D. Burger, K. S. McKinley (UT-Austin), Compiling for EDGE Architectures • TRIPS EDGE (Explicit Data Graph Execution) architecture • This paper focuses on compilation of standard C and FORTRAN benchmarks
Session 4: Tiled and Multicore Compilation • Shih-wei Liao, Zhaohui Du, Gansha Wu, Guei-Yuan Lueh (Intel), Data and Computation Transformations for Brook Streaming Applications on Multiprocessors • Parallel compiler for the Brook streaming language • An extension of C that enables specifying data parallelism • Michael L. Chu, Scott A. Mahlke (University of Michigan), Compiler-directed Object Partitioning for Multicluster Processors • Partitioning of data in clustered architectures such as Raw • I didn't really understand what programming model the authors have in mind
Session 5: Static Code Generation andOptimization Issues • Two papers about the HPUX Itanium compiler: • Dhruva R. Chakrabarti, Shin-Ming Liu (Hewlett-Packard), Inline Analysis: Beyond Selection Heuristics • Cross-module techniques for selection of inlined call sites and the choice of specialized function versions • Robert Hundt, Dhruva R. Chakrabarti, Sandya S. Mannarswamy (Hewlett-Packard), Practical Structure Layout Optimization and Advice • Data layout and placement on the heap to improve locality • Structure splitting, structure peeling, dead field removal, and field reordering
Session 5: Static Code Generation andOptimization Issues • Chris Lupo, Kent Wilken (University of California, Davis), Post Register Allocation Spill Code Optimization • Authors propose a profile-based algorithm for placement of save/restore instructions handling spilled variables in function calls • Implemented as a part of GCC • Seung Woo Son, Guangyu Chen, Mahmut Kandemir (Pennsylvania State University), A Compiler-Guided Approach for Reducing Disk Power Consumption by Exploiting Disk Access Locality • Goal: restructure code so that disk idle periods are lengthened • The approach targets array-based programs: disk layout of array data exposed to the compiler
Session 6: SIMD Compilation • Jianhui Li, Qi Zhang, Shu Xu, Bo Huang (Intel China Software Center), Optimizing Dynamic Binary Translation for SIMD Instructions • Algorithms for dynamic binary translation of SIMD instructions in general-purpose architectures (such as MMX in x86) • Evaluation using IA-32 binaries on Itanium 2 • Dorit Nuzman (IBM), Richard Henderson (Red Hat), Multi-Platform Auto-Vectorization • Implementation of automatic vectorizer for GCC 4.0
Session 7: Optimization-space Exploration • Felix Agakov, Edwin Bonilla, John Cavazos, Bjoern Franke, Grigori Fursin, Michael O'Boyle, Marc Toussaint, John Thomson, Chris Williams (U. of Edinburgh), Using Machine Learning to Focus Iterative Optimization • Predictive modelling used to search the optimization space • Targets embedded platforms – AMD Au1500 and Texas Instruments TI C6713 • Prasad Kulkarni, David Whalley, Gary Tyson (Florida State University), Jack Davidson (University of Virginia), Exhaustive Optimization Phase Order Space Exploration • Exhaustive search of the phase order space (15 phases) using aggressive pruning; takes time on the order of minutes to hours • Targets StrongARM SA-100
Session 7: Optimization-space Exploration • Zhelong Pan, Rudolf Eigenmann (Purdue University), Fast and Effective Orchestration of Compiler Optimizations for Automatic Performance Tuning • Problem: find the optimal combination of 38 GCC O3 options, targeting Pentium IV and Sparc II • The proposed heuristic algorithm provides a quality solution in time on the order of several hours
Session 8: Security and Reliability • Edson Borin, (UNICAMP), Cheng Wang, Youfeng Wu (Intel), Guido Araujo (UNICAMP), Software-Based Transparent and Comprehensive Control-Flow Error Detection • Addresses the problem of soft (transient) errors that cause branches to incorrect instructions • Implemented in SW as a part of a dynamic binary translator • Tao Zhang, Xiaotong Zhuang, Santosh Pande (Georgia Tech), Compiler Optimizations to Reduce Security Overheads • Optimizations that specifically target techniques that implement software protection with minimal HW support
Session 8: Security and Reliability • Susanta Nanda, Wei Li, Tzi-cker Chiueh (State University of NY at Stony Brook), BIRD: Binary Interpretation using Runtime Disassembly • Goal: framework for automatic detection of vulnerabilities such as buffer overflows when the source code is not available • Static and dynamic disassembly and instrumentation – targets Windows x86 applications
Keynote Speeches • Wei Li, Principal Engineer, Intel: "Parallel Programming 2.0" • Kevin Stoodley, Fellow and CTO of Compilation Technology, IBM: "Productivity and Performance: Future Directions in Compilers"
Wei Li: Parallel Programming 2.0 • Major technological change: • Moore’s Law continues to increase transistor counts • However: power, memory latency, limits to ILP are setting an effective performance ceiling • General trend towards thread-level on-chip parallelism • SMT • Chip multiprocessors
Wei Li: Parallel Programming 2.0 • "Parallel Programming 2.0" refers to the advent of multicores • A very optimistic future vision (presented as a slide graphic)
Wei Li: Parallel Programming 2.0 • Key issue – where will the parallelism come from? • Parallel programming needs to become more mainstream • Consumer vs. HPC/server/database • Inclusion into education at more elementary level • New tools for greater ease of programming • Intel’s parallel programming tools • http://www.intel.com/software
K. Stoodley: "Productivity and Performance: Future Directions in Compilers" • Limits to traditional static compilation • Overview of IBM compiler technology • Testarossa JIT compiler, Toronto Portable Optimizer, Tobey backend • Present and near-future challenges • Software abstraction complexity forces the scope of compilation to higher levels • Maintaining high performance and backwards compatibility is increasingly difficult
K. Stoodley: "Productivity and Performance: Future Directions in Compilers" • Future: convergence/combination of dynamic and static compilation technologies • [Slide diagram: xlc/xlC/xlf front ends emit W-Code into the Toronto Portable Optimizer (TPO) and TOBEY backend, producing static machine code; class/jar files feed the J9 execution engine (Java and others) with the Testarossa JIT, producing dynamic machine code; binary translation and profile-directed feedback (PDF) link the two paths]
Best Paper • Shashidhar Mysore, Banit Agrawal, Timothy Sherwood, Nisheeth Shrivastava, Subhash Suri (UC Santa Barbara): Profiling over Adaptive Ranges
Profiling over Adaptive Ranges • Problem: how to count specific events efficiently and accurately? • Code segments executed • Memory regions accessed • IP addresses of routed packets • In all cases, impossible to maintain separate counters for the entire range of values • Each basic block, memory address, IP address…
Trade-off: Precision vs. Efficiency • Profiling with uniform ranges fails to distinguish hot code • [Slide figure: profile using uniform ranges vs. profile using unlimited counters]
Higher Precision for Hot Regions • Good trade-off with limited resources: • High precision for hot regions • Low precision for colder ones, but this affects the accuracy less • Challenge: how to determine what exactly to count with what precision?
Solution: Adaptive Profiling • Start with one counter; split counters as they become hot:
Counter Merging • Problem: what if program behaviour changes after the initialization phase?
Counter Merging • Solution: perform counter merging along with splitting
Counter Merging • Counters of merged child nodes added to the parent
Counter Merging • Problem: how to identify nodes for merging? • By definition, these are the nodes that are not updated frequently • Solution: periodic batched merge operations • Since tree depth grows at a logarithmic rate, merging can be done at exponentially increasing intervals
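The splitting and merging scheme described above can be sketched as a software simulation. The tree structure and thresholds here are illustrative assumptions, not the paper's proposed hardware design:

```python
# Minimal software sketch of adaptive range profiling: a binary tree of
# counters over a value range. A leaf splits in two once it becomes hot,
# refining precision where events concentrate; cold sibling leaves are
# periodically merged back, their counts added to the parent.
# SPLIT_THRESHOLD is an illustrative tuning parameter.

SPLIT_THRESHOLD = 8

class RangeCounter:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi        # half-open range [lo, hi)
        self.count = 0
        self.left = self.right = None    # children once the node has split

    def record(self, value):
        node = self
        while node.left is not None:     # descend to the active leaf
            mid = (node.lo + node.hi) // 2
            node = node.left if value < mid else node.right
        node.count += 1
        # Split a hot leaf whose range can still be subdivided.
        if node.count >= SPLIT_THRESHOLD and node.hi - node.lo > 1:
            mid = (node.lo + node.hi) // 2
            node.left = RangeCounter(node.lo, mid)
            node.right = RangeCounter(mid, node.hi)

    def merge_cold(self, cold_threshold):
        """Batched merge: fold cold leaf pairs back into their parent."""
        if self.left is None:
            return
        self.left.merge_cold(cold_threshold)
        self.right.merge_cold(cold_threshold)
        if (self.left.left is None and self.right.left is None
                and self.left.count + self.right.count < cold_threshold):
            self.count += self.left.count + self.right.count
            self.left = self.right = None

    def leaves(self):
        """Current (range, count) resolution of the profile."""
        if self.left is None:
            return [(self.lo, self.hi, self.count)]
        return self.left.leaves() + self.right.leaves()
```

After a skewed event stream, the hot region ends up covered by narrow, precise counters while cold regions keep a single coarse counter, which is exactly the precision/efficiency trade-off the paper targets.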
Additional Contributions • Heuristics for splitting and merging • Theoretical analysis of accuracy guarantees • Proposal for hardware implementation • Experimental evaluation • Memory requirements • Average and worst-case errors on benchmarks • Performance of HW implementation • Accuracies on the order of 98.0-99.8% with only 8-64K of memory
Conclusions • Highly interesting program • My short presentation certainly doesn’t do justice to most of the mentioned works! • Readings to perhaps consider for future CARG: • D. Wentzlaff, A. Agarwal, Constructing Virtual Architectures on a Tiled Processor • A. Smith et al., Compiling for EDGE Architectures • F. Agakov et al., Using Machine Learning to Focus Iterative Optimization • (Highly subjective!)