410 likes | 561 Views
Polar Opposites: Next Generation Languages & Architectures. Kathryn S McKinley The University of Texas at Austin. Collaborators. Faculty Steve Blackburn, Doug Burger, Perry Cheng, Steve Keckler, Eliot Moss, Graduate Students
E N D
Polar Opposites:Next Generation Languages & Architectures Kathryn S McKinley The University of Texas at Austin
Collaborators • Faculty • Steve Blackburn, Doug Burger, Perry Cheng, Steve Keckler, Eliot Moss, • Graduate Students • Xianglong Huang, Sundeep Kushwaha, Aaron Smith, Zhenlin Wang (MTU) • Research Staff • Jim Burrill, Sam Guyer, Bill Yoder
Computing in the Twenty-First Century New and changing architectures • Hitting the microprocessor wall • TRIPS - an architecture for future technology Object-oriented languages • Java and C# becoming mainstream Key challenges and approaches • Memory gap, parallelism • Language & runtime implementation efficiency • Orchestrating a new software/hardware dance • Break down artificial system boundaries
Technology Scaling Hitting the Wall Analytically … Qualitatively … 35 nm 70 nm 100 nm 130 nm 20 mm chip edge Either way … Partitioning for on-chip communication is key
End of the Road for Out-of-Order SuperScalars • Clock ride is over • Wire and pipeline limits • Quadratic out-of-order issue logic • Power, a first order constraint • Major vendors ending processor lines • Problems for any architectural solution • ILP - instruction level parallelism • Memory latency
Where are Programming Languages? • High Productivity Languages • Java, C#, Matlab, S, Python, Perl • High Performance Languages • C/C++, Fortran • Why not both in one? • Interpretation/JIT vs compilation • Language representation • Pointers, arrays, frequent method calls, etc. • Automatic memory management costs • Obscure ILP and memory behavior
Outline • TRIPS • Next generation tiled EDGE architecture • ILP compilation model • Memory system performance • Garbage collection influence • The GC advantage • Locality, locality, locality • Online adaptive copying • Cooperative software/hardware caching
TRIPS • Project Goals • Fast clock & high ILP in future technologies • Architecture sustains 1 TRIPS in 35 nm technology • Cost-performance scalability • Find the right hardware/software balance • New balance reduces hardware complexity & power • New compiler responsibilities & challenges • Hardware/Software Prototype • Proof-of-concept of scalability and configurability • Technology transfer
Execution Substrate Interconnect topology & latency exposed to compiler scheduler Register banks Execution node Global Ctrl Branch Predictor I-cache H 0 1 2 3 D-cache/LSQ 0 I-cache 0 D-cache/LSQ 1 I-cache 1 Execution array D-cache/LSQ 2 I-cache 2 D-cache/LSQ 3 I-cache 3
Large Instruction Window • Instruction buffers add depth to execution array • 2D array of ALUs; 3D volume of instructions • Entire 3D volume exposed to compiler Control src1 opcode src2 src2 src1 opcode src2 src1 opcode opcode src1 src2 ALU Out-of-Order Instruction Buffers form a logical “z-dimension” in each node Router 4 logical frames of 4 X 4 instructions Execution Node
Execution Model • SPDI - static placement, dynamic issue • Dataflow within a block • Sequential between blocks • TRIPS compiler challenges • Create large blocks of instructions • Single entry, multiple exit, predication • Schedule blocks of instructions on a tile • Resource limitations • Registers, Memory operations
Block Execution Model • Program execution • Fetch and map block to TRIPS grid • Execute block, produce result(s) • Commit results • Repeat • Block dataflow execution • Each cycle, execute a ready instruction at every node • Single read of registers and memory locations • Single write of registers and memory locations • Update the PC to successor block • TRIPS core may speculatively execute multiple blocks (as well as instructions) • TRIPS uses branch prediction and register renaming between blocks, but not within a block start A C B D E end
Just Right Division of Labor • TRIPS architecture • Eliminates short-term temporaries • Out-of-order execution at every node in grid • Exploits ILP, hides unpredictable latencies • without superscalar quadratic hardware • without VLIW guarantees of completion time • Scale compiler - generate ILP • Large hyperblocks - predicate, unroll, inline, etc. • Schedule hyperblocks • Map independent instructions to different nodes • Map communicating instructions to same or close nodes • Let hardware deal with unpredictable latencies (loads) Exploits Hardware and Compiler Strengths
High Productivity Programming Languages • Interpretation/JIT vs compilation • Language representation • Pointers, arrays, frequent method calls, etc. • Automatic memory management costs MMTk in IBM Jikes RVM • ICSE’04, SIGMETRICS’04 • Memory Management Toolkit for Java • High Performance, Extensible, Portable • Mark-Sweep, Copying SemiSpace, Reference Counting • Generational collection, Beltway, etc.
Allocation Choices Bump-Pointer Free-List • Fast (increment & bounds check) • Can't incrementally free & reuse: must free en masse • Relatively slow (consult list for fit) • Can incrementally free & reuse cells
Allocation Choices • Bump pointer • ~70 bytes IA32 instructions, 726MB/s • Free list • ~140 bytes IA32 instructions, 654MB/s • Bump pointer 11% faster in tight loop • < 1% in practical setting • No significant difference (?) • Second order effects? • Locality?? • Collection mechanism??
Implications for Locality • Compare SS & MS mutator • Mutator time • Mutator memory performance: L1, L2 & TLB
MS/SS Crossover locality space 2.6GHz 1.6GHz 1.9GHz 3.2GHz
Locality in Memory Management • Explicit memory management on its way out • Key GC vs Explicit MM insights 20 yrs old • Technology has and is changing • Generational and Beltway Collectors • Significant collection time benefits over full heap collectors • Collect young objects • Infrequently collect old space • Copying nursery attains similar locality effects as full heap
Where are the Misses? Generational Copying Collector
Copy Order • Static copy orders • Bredth first - Cheney scan • Depth first, hierarchical • Problem: one size does not fit all • Static profiling per class • Inconsistant with JIT • Object sampling • Too expensive in our experience • OOR - Online Object Reordering • OOPSLA’04
OOR Overview • Records object accesses in each method (excludes cold basic blocks) • Finds hot methods by dynamic sampling • Reorders objects with hot fields in higher generation during GC • Copies hot objects into separate region
Static Analysis Example Method Foo { Class A a; try { …=a.b; … } catch(Exception e){ …a.c } } Hot BB Collect access info Compiler Compiler Cold BB Ignore Access List: 1. A.b 2. …. ….
Adaptive Sampling Method Foo { Class A a; try { …=a.b; … } catch(Exception e){ …a.c } } Adaptive Sampling Foo Accesses: 1. A.b 2. …. …. Foo is hot A.b is hot A b c ….. B
Advice Directed Reordering • Example • Assume (1,4), (4,7) and (2,6) are hot field accesses • Order: 1,4,7,2,6 : 3,5 1 5 4 2 3 7 6
OOR System Overview Hot Methods Source Code Look Up Access Info Database Adaptive Sampling Baseline Compiler Optimizing Compiler Adds Entries Register Hot Field Accesses GC: copying objects GC: Copies Objects Executing Code Affects Locality Advice OOR addition Input/Output Jikes RVM
Software is not enoughHardware is not enough • Problem: inefficient use of cache • Hardware limitations: set associativity, cannot predict the future • Cooperative Software/Hardware Caching • Combines high level compiler analysis with dynamic miss behavior • Lightweight ISA support conveys compiler’s global view to hardware • Compiler-guided cache replacement (evict-me) • Compiler-guided region prefetching • ISCA’03, PACT’02
Exciting Times • Dramatic architectural changes • Execution tiles • Cache & Memory tiles • Next generation system solutions • Moving hardware/software boundaries • Online optimizations • Key compiler challenges (same old…) ILP and Cache Memory Hierarchy