Implementing Tomorrow's Programming Languages Rudi Eigenmann Purdue University School of ECE Computing Research Institute Indiana, USA
Computing Research Institute (CRI) CRI is the high-performance computing branch of Purdue's Discovery Park. Other Discovery Park centers: Bioscience, Nanotechnology, E-Enterprise, Entrepreneurship, Learning, Advanced Manufacturing, Environment, Oncology
Compilers are the Center of the Universe The compiler translates the programmer's view into the machine's view.
Today: the programmer writes loops and subroutines (DO I=1,n; a(I)=b(I); ENDDO; Subr doit), and the compiler turns them into machine code (Load 1,R1 ... Move R2,x ... BNE loop).
Tomorrow: the programmer writes "Do weather forecast", and the compiler maps it onto the system ("Compute on machine x", "Remote call doit").
Why is Writing Compilers Hard?… a high-level view • Translation passes are complex algorithms • Not enough information at compile time • Input data not available • Insufficient knowledge of the architecture • Not all source code available • Even with sufficient information, modeling performance is difficult • Architectures are moving targets
Why is Writing Compilers Hard?… from an implementation angle • Interprocedural analysis • Alias/dependence analysis • Pointer analysis • Information gathering and propagation • Link-time, load-time, run-time optimization • Dynamic compilation/optimization • Just-in-time compilation • Autotuning • Parallel/distributed code generation
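To make the alias/dependence problem concrete, here is a small C sketch (not from the talk): with plain pointers the compiler must assume the two arrays may overlap, which forces it to keep the loop serial; the C99 restrict qualifier, or interprocedural pointer analysis, is what licenses parallelization.

  /* Why alias analysis matters: if dst == src + 1, iteration i writes a value
     that iteration i+1 reads, so the compiler cannot parallelize or vectorize
     this loop without proving the pointers do not overlap. */
  #include <stddef.h>

  void scale(double *dst, const double *src, size_t n) {
      for (size_t i = 0; i < n; i++)       /* must stay serial: possible overlap */
          dst[i] = 2.0 * src[i];
  }

  void scale_restrict(double *restrict dst, const double *restrict src, size_t n) {
      for (size_t i = 0; i < n; i++)       /* provably independent iterations */
          dst[i] = 2.0 * src[i];
  }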
It's Even Harder Tomorrow Because we want: • All our programs to work on multicore processors • Very high-level languages • Do weather forecast … • Composition: combine the weather forecast with an energy-reservation and cooling manager • Reuse: warn me if I'm writing a module that exists "out there".
How Do We Get There? Paths towards tomorrow's programming languages Addressing the (new) multicore challenge: • Automatic Parallelization • Speculative Parallel Architectures • SMP languages for distributed systems Addressing the (old) general software engineering challenge: • High-level languages • Composition • Symbolic analysis • Autotuning
The Multicore Challenge • We have finally reached the long-expected “speed wall” for the processor clock. • (this should not be news to you!) • “… one of the biggest disruptions in the evolution of information technology.” • “Software engineers who do not know parallel programming will be obsolete in no time.”
Automatic Parallelization: Can we implement standard languages on multicore? Polaris – A Parallelizing Compiler, more specifically: a source-to-source restructuring compiler. It translates standard Fortran into Fortran plus OpenMP directives, which an OpenMP backend compiler then compiles. Research issues in such a compiler: • Detecting parallelism • Mapping parallelism onto the machine • Performing compiler techniques at runtime • Compiler infrastructure
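As a rough illustration only (Polaris itself reads and emits Fortran, not C), the essence of the source-to-source step is: prove the loop's iterations independent, then emit the same loop annotated with an OpenMP directive for the backend compiler. The C analogue below assumes a trivially parallel copy loop.

  #include <stdio.h>
  #define N 1000

  int main(void) {
      static double a[N], b[N];
      for (int i = 0; i < N; i++) b[i] = i;

      /* original serial loop:  for (i) a[i] = b[i];
         the restructurer proves independence and inserts the directive: */
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
          a[i] = b[i];

      printf("a[N-1] = %f\n", a[N - 1]);
      return 0;
  }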
State of the Art in Automatic Parallelization [Chart: speedups between 0 and 5 achieved on ARC2D, FLO52Q, HYDRO2D, MDG, SWIM, TOMCATV, and TRFD.] • Advanced optimizing compilers perform well in 50% of all science/engineering applications. • Caveats: this is true • in research compilers • for regular applications, written in Fortran or C without pointers • Wanted: heroic, black-belt programmers who know the "assembly language of HPC"
Can Speculative Parallel Architectures Help? Basic idea: • Compiler splits program into sections (without considering data dependences) • The sections are executed in parallel • The architecture tracks data dependence violations and takes corrective action.
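The toy C/OpenMP sketch below mimics this idea in software purely for illustration; a real speculative architecture does the tracking in hardware with per-task write buffers, and the chunk size, logs, and recovery code here are assumptions, not a description of any actual system.

  #include <stdio.h>
  #define N 1000
  #define CHUNKS 4
  #define CSIZE (N / CHUNKS)

  int main(void) {
      static int a[N], b[N], buf[CHUNKS][CSIZE];
      static unsigned char rd[CHUNKS][N], wr[CHUNKS][N];   /* read/write logs */
      for (int i = 0; i < N; i++) { a[i] = i; b[i] = (i * 7) % N; }

      /* run the chunks speculatively in parallel; writes are buffered */
      #pragma omp parallel for
      for (int c = 0; c < CHUNKS; c++) {
          for (int k = 0; k < CSIZE; k++) {
              int i = c * CSIZE + k, src = b[i], val;
              if (wr[c][src]) {                 /* forward the chunk's own write */
                  val = buf[c][src - c * CSIZE];
              } else {
                  rd[c][src] = 1;               /* read from memory: log it */
                  val = a[src];
              }
              buf[c][k] = val + 1;              /* a[i] = a[b[i]] + 1, buffered */
              wr[c][i] = 1;
          }
      }

      /* violation: a later chunk read a location an earlier chunk wrote */
      int violation = 0;
      for (int c1 = 0; c1 < CHUNKS; c1++)
          for (int c2 = c1 + 1; c2 < CHUNKS; c2++)
              for (int i = 0; i < N; i++)
                  if (wr[c1][i] && rd[c2][i]) violation = 1;

      if (!violation) {                         /* commit buffered writes in order */
          for (int c = 0; c < CHUNKS; c++)
              for (int k = 0; k < CSIZE; k++) a[c * CSIZE + k] = buf[c][k];
      } else {                                  /* corrective action: redo serially */
          for (int i = 0; i < N; i++) a[i] = a[b[i]] + 1;
      }
      printf("violation = %d, a[N-1] = %d\n", violation, a[N - 1]);
      return 0;
  }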
Performance of Speculative Multithreading SPEC CPU2000 FP programs executed on a 4-core speculative architecture.
We may need Explicit Parallel Programming Shared-memory architectures: OpenMP: a proven model for science/engineering programs. Suitability for non-numerical programs? Distributed computers: MPI: the assembly language of parallel/distributed systems. Can we do better?
Beyond Science & Engineering Applications 7+ Dwarfs: • Structured Grids (including locally structured grids, e.g. Adaptive Mesh Refinement) • Unstructured Grids • Fast Fourier Transform • Dense Linear Algebra • Sparse Linear Algebra • Particles • Monte Carlo • Search/Sort • Filter • Combinational Logic • Finite State Machine
Shared-Memory Programming for Distributed Applications? • Idea 1: use an underlying software distributed-shared-memory system (e.g., TreadMarks). • Idea 2: direct translation into message-passing code
OpenMP for Software DSM: Challenges • S-DSM maintains coherency at a page level; optimizations that reduce false sharing and increase page affinity are very important • In S-DSMs such as TreadMarks, the stacks are not in the shared address space; the compiler must identify shared stack variables, which requires interprocedural analysis [Diagram: Processor 1 executes A[50] = ...; at the barrier it tells Processor 2 "I have written page x"; when Processor 2 later reads A[50], it requests the page "diff" from Processor 1. The shared address space is mapped onto the distributed memories; the per-processor stacks lie outside it.]
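A minimal C/OpenMP sketch (assumed for illustration, not from the talk) of the stack-variable issue: the reduction variable below lives on the stack of compute(), which a hardware SMP shares among threads automatically but an S-DSM such as TreadMarks does not, so a translating compiler has to relocate it into the shared address space.

  #include <stdio.h>

  void compute(int n) {
      int sum = 0;                  /* stack variable, shared by the region below */
      #pragma omp parallel for reduction(+:sum)
      for (int i = 0; i < n; i++)
          sum += i;
      /* On an S-DSM the stacks are process-private, so the compiler must move
         'sum' (and any other shared stack variable it finds through
         interprocedural analysis) into shared storage, e.g. a shared heap. */
      printf("sum = %d\n", sum);
  }

  int main(void) { compute(100); return 0; }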
Optimized Performance of SPEC OMPM2001 Benchmarks on a TreadMarks S-DSM System
Direct Translation of OpenMP into Message Passing A question often asked: how is this different from HPF? • HPF: the emphasis is on data distribution. OpenMP: the starting point is explicit parallel regions. • HPF implementations apply strict data distribution and owner-computes schemes. Our approach: partial replication of shared data. Partial replication leads to: • Synchronization-free serial code • Communication-free data reads • Communication for data writes amenable to collective message passing • Irregular accesses (in our benchmarks) amenable to compile-time analysis Note: partial replication is not necessarily "data scalable"
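The fragment below is a hand-written sketch, not output of the actual translator, of what partial replication can look like in MPI: every rank holds a replica of the shared arrays, reads need no communication, and the writes each rank performs are merged with a single collective (assuming, for simplicity, that the process count divides the array size).

  #include <mpi.h>
  #include <stdio.h>
  #define N 1024

  int main(int argc, char **argv) {
      static double a[N], b[N];               /* replicated "shared" arrays */
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      for (int i = 0; i < N; i++) b[i] = i;   /* every rank initializes its replica */

      /* original OpenMP region:
         #pragma omp parallel for
         for (i = 0; i < N; i++) a[i] = 2.0 * b[i];                      */
      int chunk = N / size;                   /* assumes size divides N   */
      int lo = rank * chunk;
      for (int i = lo; i < lo + chunk; i++)   /* this rank's iterations; reads of the
                                                 replicated b need no communication */
          a[i] = 2.0 * b[i];

      /* writes each rank made to its slice of a are merged collectively */
      MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                    a, chunk, MPI_DOUBLE, MPI_COMM_WORLD);

      if (rank == 0) printf("a[N-1] = %f\n", a[N - 1]);
      MPI_Finalize();
      return 0;
  }

Because the serial parts of the program run redundantly on every rank against the replicated data, no synchronization or communication is needed outside the parallel region, which is the property the slide attributes to partial replication.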
Performance of OpenMP-to-MPI Translation [Chart: speedups of the OpenMP-to-MPI translated versions and the hand-coded MPI versions; higher is better.] Performance comparison of our OpenMP-to-MPI translated versions versus (hand-coded) MPI versions of the same programs. Hand-coded MPI represents a practical "upper bound"; "speedup" is relative to the serial version.
How does the performance compare to the same programs optimized for software DSM? [Chart: OpenMP for SDSM versus OpenMP-to-MPI; higher is better.]
How Do We Get There? Paths towards tomorrow's programming languages The (new) multicore challenge: • Automatic Parallelization • Speculative Parallel Architectures • SMP languages for distributed systems The (old) general software engineering challenge: • High-level languages • Composition • Symbolic analysis • Autotuning
(Very) High-Level Languages? Observation: "The number of programming errors is roughly proportional to the number of programming lines." Levels of abstraction, from low to high: Assembly, Fortran, object-oriented languages, scripting languages / Matlab, and (very) high-level languages, which will probably be domain-specific. How efficient are VHLLs? Potentially very efficient, because there is much flexibility in translating VHLLs; by experience so far, inefficient.
Composition: Can we compose software from existing modules? • Idea: add an "abstract algorithm" (AA) construct to the programming language • the programmer defines the AA's goal • an AA is called like a procedure • The compiler replaces each AA call with a sequence of library calls • How does the compiler do this? It uses a domain-independent planner that accepts procedure specifications as operators
Motivation: Programmers Often Write Sequences of Library Calls Example: a common BioPerl call sequence. "Query a remote database and save the result to local storage:"
Query q = bio_db_query_genbank_new("nucleotide", "Arabidopsis[ORGN] AND topoisomerase[TITL] AND 0:3000[SLEN]");
DB db = bio_db_genbank_new( );
Stream stream = get_stream_by_query(db, q);
SeqIO seqio = bio_seqio_new(">sequence.fasta", "fasta");
Seq seq = next_seq(stream);
write_seq(seqio, seq);
This sequence uses 5 data types and 6 procedure calls. Example adapted from http://www.bioperl.org/wiki/HOWTO:Beginners
Defining and Calling an AA • The AA (goal) is defined using the glossary:
algorithm save_query_result_locally(db_name, query_string, filename, format) => { query_result(result, db_name, query_string), contains(filename, result), in_format(filename, format) }
• ...and it is called like a procedure:
Seq seq = save_query_result_locally("nucleotide", "Arabidopsis[ORGN] AND topoisomerase[TITL] AND 0:3000[SLEN]", ">sequence.fasta", "fasta");
This takes 1 data type and 1 AA call.
“Ontological Engineering” • Library author provides a domain glossary • query_result(result, db, query) – result is the outcome of sending query to the database db • contains(filename, data) – file named filename contains data • in_format(filename, format) – file named filename is in format format
Implementing the Composition Idea Borrowing AI technology: a domain-independent planner (for details, see PLDI 2006). • The library specifications become the planner's operators • The call context becomes the initial state • The AA definition becomes the goal state The planner's actions form a plan, which the compiler turns into the sequence of library calls placed in the executable.
Symbolic Program Analysis • Today: many compiler techniques assume numerical constants • Needed: techniques that can reason about the program in symbolic terms, for example: • differentiate: a*x^2 -> 2*a*x • analyze ranges: y=exp; if (c) y+=5 -> y=[exp:exp+5] • recognize algorithms: c=0; DO j=1,n; if (t(j)<v) c+=1; ENDDO -> c=COUNT(t(1:n)<v)
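As an illustration of algorithm recognition (the routine count_less_than below is hypothetical, not a Polaris intrinsic), a symbolic analyzer can prove that the loop merely counts elements below a threshold and substitute a library or runtime call that may itself run as a parallel reduction.

  #include <stdio.h>

  /* hypothetical runtime routine a compiler might substitute; it could itself
     be implemented as a parallel reduction */
  static int count_less_than(const double *t, int n, double v) {
      int c = 0;
      for (int i = 0; i < n; i++)
          if (t[i] < v) c++;
      return c;
  }

  int main(void) {
      double t[8] = {1, 5, 2, 7, 3, 9, 0, 4};
      double v = 4.0;

      int c = 0;                           /* original counting loop */
      for (int j = 0; j < 8; j++)
          if (t[j] < v) c++;

      int c2 = count_less_than(t, 8, v);   /* recognized form: c = COUNT(t < v) */

      printf("%d %d\n", c, c2);            /* both print 4 */
      return 0;
  }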
Autotuning (dynamic compilation/adaptation) • Moving compile-time decisions to runtime • A key observation: compiler writers "solve" difficult decisions by creating a command-line option -> finding the best combination of options therefore means making the difficult compiler decisions.
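A minimal sketch of the idea, not the PEAK system itself: enumerate a few option combinations, build and time each, and keep the fastest. It assumes gcc, a POSIX environment, and a hypothetical kernel.c whose main() exercises the code section being tuned.

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  int main(void) {
      const char *flags[] = { "-O2", "-O3", "-O3 -funroll-loops", "-O3 -ffast-math" };
      const char *best = NULL;
      double best_time = 1e30;
      char cmd[256];

      for (int i = 0; i < 4; i++) {
          snprintf(cmd, sizeof cmd, "gcc %s -o kernel_tune kernel.c", flags[i]);
          if (system(cmd) != 0) continue;       /* skip combinations that fail */

          struct timespec t0, t1;               /* wall-clock time of one run */
          clock_gettime(CLOCK_MONOTONIC, &t0);
          if (system("./kernel_tune") != 0) continue;
          clock_gettime(CLOCK_MONOTONIC, &t1);
          double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

          printf("%-20s %.3f s\n", flags[i], s);
          if (s < best_time) { best_time = s; best = flags[i]; }
      }
      if (best) printf("best option set: %s (%.3f s)\n", best, best_time);
      return 0;
  }

Systems such as PEAK go further by tuning individual code sections rather than the whole program, which is where the large reductions in tuning time reported on the next slide come from.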
Tuning Time PEAK is 20 times as fast as whole-program tuning: on average, PEAK reduces tuning time from 2.19 hours to 5.85 minutes.
Program Performance The programs tuned by PEAK achieve the same performance as those produced by whole-program tuning.
Conclusions Advanced compiler capabilities are crucial for implementing tomorrow’s programming languages: • The multicore challenge -> parallel programs • Automatic parallelization • Support for speculative multithreading • Shared-memory programming support • High-level constructs • Composition pursues this goal • Techniques to reason about programs in symbolic terms • Dynamic tuning