ABACUS: A Hardware-Based Software Profiler for Modern Processors. Sergey Blagodurov • Sergey Zhuravlev • Alexandra Fedorova School of Computing Science. Eric Matthews • Lesley Shannon School of Engineering Science. Simon Fraser University, Vancouver, BC, Canada.
Overview
• Legendary Introduction to ABACUS
• Delicious Profiling Units
• Epic Conclusion
ABACUS
ASPLOS rocks!
Performance comparison
• Memory Reuse Profile
• ABACUS avg runtime: 48.5 seconds
• Simics avg runtime: 1 hour 6 minutes
(figure panels: ABACUS, Simics)
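As a quick check on the two runtimes above, the ratio comes out at a little over 80x. A minimal sketch (the variable names are illustrative and do not come from the slides):

```python
# Average runtimes quoted on the slide, converted to seconds.
abacus_seconds = 48.5
simics_seconds = 1 * 3600 + 6 * 60  # 1 hour 6 minutes

# How many times longer the Simics run takes than the ABACUS run.
speedup = simics_seconds / abacus_seconds
print(f"Simics / ABACUS runtime ratio: {speedup:.1f}x")
```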
Conclusion
• ABACUS is a generic profiler that can be easily integrated into modern processors
• It can be used by the O/S to obtain runtime information about a thread’s behaviour to make better thread assignments
Motivation
• Future systems will be multi-core and heterogeneous
• How does the OS place threads on this architecture?
• Characterize thread behaviour
  • Instruction Mix
  • Memory Reuse Profile
  • Effectiveness of pre-fetching
  • Memory bandwidth utilization
Motivation (cont'd)
• How are these metrics collected?
  • Offline analysis
  • Code Instrumentation
  • Simulation (e.g., Simics)
    • Software-based instruction set simulator
    • Models systems with full OS support
Motivation (cont'd)
• Why not use current hardware counters?
  • Architecture-specific
  • Not all desired metrics provided
  • Help detect symptoms, not causes
  • Limited in number and in concurrent use
Goal
• Create a hardware profiler to collect thread characteristics at runtime
• Imposed constraints
  • External to processor
  • Minimally invasive
  • Cycle accurate
  • OS controllable
ABACUS
• hArdware-Based Analyzer for the Characterization of User Software
• A collection of runtime configurable profiling units
• Collects metrics useful for thread placement
• Controllable through the O/S
Hardware Platform
• Proof-of-concept System
  • LEON3 SPARC v8 Instruction Set Architecture
  • Single core, single threaded
• Test System
  • OpenSPARC Niagara T1 soft processor
  • 1 to 4 hardware threads
  • Multi-core Multi-board support
External Interface
• Bus slave and master modules
• Processing required on processor signals
• Designed such that only external interface changes with different processor/system
Portability
• Previously integrated with a LEON3 (SPARC v8 ISA) based system
• Differences:
  • AMBA Advanced High-performance Bus (AHB) vs Processor Local Bus (PLB)
  • Processor internals
Controller
• Starts or stops profiling
• Can limit profiling to a specific address range
• DMA interface for retrieving collected data
• Linux device driver support
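The start/stop and address-range behaviour listed above can be modelled in software roughly as follows. This is only a sketch under assumptions: the class `ControllerModel`, its method names, and the half-open range convention are hypothetical. The real controller is hardware reached through a Linux device driver whose interface the slides do not describe.

```python
class ControllerModel:
    """Toy software model of the controller's gating logic (hypothetical)."""

    def __init__(self):
        self.enabled = False
        self.addr_range = None  # None means "profile every address"

    def start(self, lo=None, hi=None):
        # Optionally restrict profiling to the half-open range [lo, hi).
        self.addr_range = None if lo is None or hi is None else (lo, hi)
        self.enabled = True

    def stop(self):
        self.enabled = False

    def should_profile(self, pc):
        # A sample is passed on only while profiling is enabled and the
        # PC lies inside the configured range (if one is set).
        if not self.enabled:
            return False
        if self.addr_range is None:
            return True
        lo, hi = self.addr_range
        return lo <= pc < hi


ctrl = ControllerModel()
ctrl.start(lo=0x4000_0000, hi=0x4000_1000)
print(ctrl.should_profile(0x4000_0010))  # PC inside the range
print(ctrl.should_profile(0x5000_0000))  # PC outside the range
```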
Profiling Units
• Operate on one or more processor signals:
  • Instruction
  • PC
  • Cache Reuse Distance
  • etc.
• Store data in a collection of counters
Profiling Units (cont'd)
• Focus on two dimensional metrics
  • Gives bigger picture / greater insight
• Aim to be as architecture independent as possible
Profile Unit
• Behaves like a traditional software profiler
• Operates on Program Counter
(figure labels: Code Space, Range Overlap, Range Non-Overlap, Trace)
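A PC-driven profiler of this kind can be pictured as a set of counters, each tied to a range of code addresses. The sketch below is one possible reading, not the ABACUS design: the function `profile_pcs`, the range names, and the rule that a PC inside several overlapping ranges increments each of them are all assumptions.

```python
def profile_pcs(pcs, ranges):
    """Count PC samples per named [lo, hi) code range (hypothetical model)."""
    counts = {name: 0 for name in ranges}
    for pc in pcs:
        for name, (lo, hi) in ranges.items():
            if lo <= pc < hi:
                counts[name] += 1  # every overlapping range records the hit
    return counts


# Made-up ranges and PC trace, for illustration only.
ranges = {"whole_loop": (0x100, 0x200), "inner_body": (0x140, 0x180)}
trace = [0x100, 0x144, 0x148, 0x1F0, 0x300]
print(profile_pcs(trace, ranges))
```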
Memory Reuse Unit
• Collects a measure of code or data reuse
• Utilizes Least Recently Used (LRU) stack
• Reuse distance is movement in the LRU stack or a miss
• Useful for cache contention management
Memory Reuse Unit (cont'd)
• Creates histogram of cache reuse pattern
• Range: [0, set associativity - 1] or cache miss
(figure: 4-way set-associative reuse profile; axis: Reuse Distance)
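The reuse-distance histogram can be illustrated with a small software model of one cache set. This is a sketch under assumptions, not the hardware design: it tracks a single set, identifies lines by an opaque tag, and the function name `reuse_histogram` is made up for the example.

```python
from collections import Counter


def reuse_histogram(tags, ways=4):
    """Histogram of LRU stack depths for accesses to a single cache set."""
    stack = []        # index 0 is the most recently used line
    hist = Counter()  # keys: 0 .. ways-1, or "miss"
    for tag in tags:
        if tag in stack:
            depth = stack.index(tag)  # reuse distance = depth in LRU stack
            hist[depth] += 1
            stack.pop(depth)
        else:
            hist["miss"] += 1
            if len(stack) == ways:
                stack.pop()           # evict the least recently used line
        stack.insert(0, tag)          # accessed line moves to the top
    return hist


# Made-up access sequence to one 4-way set.
print(reuse_histogram(["A", "B", "A", "C", "D", "E", "A"], ways=4))
```

With a 4-way set the keys fall in 0..3 or "miss", matching the range given on the slide.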
Instruction Mix
• Identify current instruction subset in use
• Divide instructions into logical categories
  • Load/Store
  • Floating Point
  • Control Flow
• Opcode-based table lookup
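An opcode-based classification might look like the sketch below. It assumes the SPARC v8 instruction formats (the `op` field in bits 31:30, `op2` in bits 24:22, `op3` in bits 24:19). The category names, the `OP3_TABLE` contents, and the catch-all "other" bucket are illustrative choices, since the slides do not give the actual table used by ABACUS.

```python
from collections import Counter

# op3 values (for op == 2) mapped to a category; anything else is "other".
OP3_TABLE = {
    0x34: "floating_point",  # FPop1
    0x35: "floating_point",  # FPop2
    0x38: "control_flow",    # JMPL
    0x39: "control_flow",    # RETT
    0x3A: "control_flow",    # Ticc
}


def classify(word):
    """Coarse category for a 32-bit SPARC v8 instruction word."""
    op = (word >> 30) & 0x3
    if op == 1:                       # CALL
        return "control_flow"
    if op == 3:                       # loads, stores, atomics
        return "load_store"
    if op == 0:                       # branches, SETHI, UNIMP
        op2 = (word >> 22) & 0x7
        return "control_flow" if op2 in (2, 6, 7) else "other"
    op3 = (word >> 19) & 0x3F         # op == 2: arithmetic, FP ops, jumps
    return OP3_TABLE.get(op3, "other")


# A few hand-assembled words: nop, call, a load, an FPop1, a branch, an add.
words = [0x01000000, 0x40000000, 0xC4004000, 0x85A00821, 0x12800000, 0x80000000]
mix = Counter(classify(w) for w in words)
print(dict(mix))
```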
Latency Unit
• Break down miss latency into constituent sources
  • Bus contention
  • DRAM latency
  • etc.
• For each category create a histogram of latency in cycles
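The per-category latency histogram can be sketched as one bucketed counter per source. Everything concrete here is an assumption made for illustration: the 16-cycle bucket width, the source labels, and the sample values are not from the slides.

```python
from collections import Counter, defaultdict

BUCKET_CYCLES = 16  # hypothetical histogram bucket width, in cycles


def latency_histograms(samples, bucket=BUCKET_CYCLES):
    """Build one latency histogram per source from (source, cycles) pairs."""
    hists = defaultdict(Counter)
    for source, cycles in samples:
        bucket_start = (cycles // bucket) * bucket  # lower edge of the bucket
        hists[source][bucket_start] += 1
    return hists


# Made-up samples: (latency source, observed latency in cycles).
samples = [
    ("bus_contention", 5),
    ("dram", 70),
    ("dram", 75),
    ("bus_contention", 20),
    ("dram", 130),
]
for source, hist in latency_histograms(samples).items():
    print(source, dict(hist))
```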
Stall Unit
• Break down Cycles Per Instruction
• Attribute cycles to their sources
  • Cache miss
  • Translation Lookaside Buffer (TLB) miss
  • Floating Point busy stalls
  • etc.
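Breaking CPI down by source amounts to dividing each source's cycle count by the number of retired instructions. A minimal sketch follows, with made-up counter values and a hypothetical `cpi_breakdown` helper (the "base" entry stands for cycles not attributed to any stall):

```python
def cpi_breakdown(instructions, cycles_by_source):
    """Return total CPI and the CPI contribution of each cycle source."""
    total_cycles = sum(cycles_by_source.values())
    per_source = {
        source: cycles / instructions
        for source, cycles in cycles_by_source.items()
    }
    return total_cycles / instructions, per_source


# Hypothetical counter readings for a 1000-instruction window.
counters = {"base": 1000, "cache_miss": 500, "tlb_miss": 250, "fp_busy": 250}
total_cpi, parts = cpi_breakdown(1000, counters)
print(total_cpi, parts)
```

The per-source contributions sum to the total CPI, which is what lets the stall cycles be attributed to their causes.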
Verification
• Run a subset of the SPEC CPU2006 benchmarks
  • Those with memory usage within board specs
• Collect metrics with ABACUS and Simics
• Profile for a few billion instructions
  • Limited by Simics performance
Test Platform
• Proof-of-concept System
  • Single core, single threaded
• XUP V2Pro: 90% slice utilization
Simulation Platform
• Simics System:
• Differences:
  • SPARC v9 ISA (64-bit processor)
  • Local filesystem vs NFS
LEON3 Comparison
(figure panels: ABACUS, Simics)
LEON3 Comparison (cont'd)
• DC Memory Reuse Profile
(figure panels: ABACUS, Simics)
Resource Usage
• Default: 2-way LRU Instruction Cache, 2-way LRU Data Cache, 5 Instruction Types
(figure labels: 32-bit counters, 40-bit counters, 32-bit counters, Profile Unit added)
Future Plans
• Move to multi-core/multi-threaded system
• Memory reuse distance independent of existing cache implementation
• Process tracking
• Integrate results into OS scheduler