Detailed Cache Coherence Characterization for OpenMP Benchmarks
Jaydeep Marathe (1), Anita Nagarajan (2), Frank Mueller (1)
(1) Department of Computer Science, NCSU
(2) Intel Technology India Pvt. Ltd.
NC State University
Our Focus
• Target shared memory SMP systems.
• Characterize coherence behavior at the application level.
• Metrics guide coherence optimization.

[Figure: bus-based shared memory SMP. Processor-1 through Processor-N each have a private cache and are connected by a shared bus running the coherence protocol.]
Invalidation-Based Coherence Protocol
• Cache lines have state bits.
• Data migrates between processor caches; state transitions maintain coherence.
• MESI protocol, 4 states: M: Modified, E: Exclusive, S: Shared, I: Invalid.

[Figure: processors A and B accessing variable "x".
 1. A reads "x": A's line goes I->E ("exclusive").
 2. B reads "x": A goes E->S, B goes I->S (both "shared").
 3. A writes "x": A goes S->M ("modified/dirty"), B goes S->I ("invalid"): B's cache line was invalidated.]
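The three-step scenario above can be sketched as a toy state machine for a single cache line. This is a minimal illustration of MESI transitions in a two-processor bus-based system, not the simulator from this work; the function names and the `copy_elsewhere` flag are invented for the example.

```c
#include <assert.h>

/* Toy MESI state machine for one cache line, two processors. */
typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_t;

/* Local read: on a miss, load Exclusive if no other cache holds the
 * line, Shared otherwise; a hit leaves the state unchanged. */
mesi_t local_read(mesi_t st, int copy_elsewhere) {
    if (st == INVALID) return copy_elsewhere ? SHARED : EXCLUSIVE;
    return st;
}

/* Local write: the line always ends up Modified ("dirty") locally. */
mesi_t local_write(mesi_t st) { (void)st; return MODIFIED; }

/* Snooped remote accesses: a remote read demotes a valid local copy
 * to Shared; a remote write invalidates it (the S->I on the slide). */
mesi_t snoop_read(mesi_t st)  { return st == INVALID ? INVALID : SHARED; }
mesi_t snoop_write(mesi_t st) { (void)st; return INVALID; }
```

Replaying the slide's sequence (A reads, B reads, A writes) with these functions reproduces the E, S/S, and M/I states shown in the figure.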
A Performance Perspective
• Writes to shared variables cause invalidation traffic (E->I, M->I, S->I).
• Worse, invalidations lead to coherence misses!

[Figure: 1. Processor A writes shared variables Var_1, Var_2, ... Var_n, invalidating Processor B's copies. 2. Processor B reads Var_1 ... Var_n and takes coherence misses. 3. A writes the variables again, causing more invalidations.]

"Coherence bottlenecks": reducing coherence misses and invalidations means improved performance!
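A common source of this invalidation ping-pong is false sharing, where per-thread data happens to sit on one cache line. One standard remedy is to pad hot per-thread fields out to a full line. A minimal sketch, assuming a 64-byte line size (the struct names are illustrative, not from the paper):

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed line size */

/* Unpadded: counters for different threads can share one cache line,
 * so every write invalidates the other threads' copies. */
struct hot { long count; };

/* Padded: each counter occupies a whole line, so a write by one
 * thread no longer invalidates lines other threads are using. */
struct hot_padded {
    long count;
    char pad[CACHE_LINE - sizeof(long)];
};
```

The trade-off is memory: each padded counter costs a full line, which is usually acceptable for a handful of hot variables.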
Question: Does a Coherence Bottleneck Exist?
One approach: time-based profilers.

[Figure: KAI GuideView display showing load imbalance between 2 threads, split into parallel time and imbalance time.]

• Problem: implicit information. Does the imbalance / speedup loss indicate a coherence bottleneck? We can't tell!
Does a Coherence Bottleneck Exist? (contd.)
Another approach: hardware counters.

[Figure: per-processor (P-1 ... P-4) totals of misses, coherence misses, and invalidations.]

+ Detect potential coherence bottlenecks.
- Block-level statistics: perturbation prevents fine-grained monitoring.
- Can't diagnose the cause! Which source code refs? Which data structures?
Need more detailed information!
What We Offer
• Hierarchical levels of detail: overall statistics, per-reference metrics, and an invalidator set for each reference.

  Invalidator   True-Sharing   False-Sharing
  V2            4811           20
  V3            3199           0
  Clock         200            1000
  ...           ...            ...

• Rich metrics: coherence misses, true & false sharing, per-reference invalidators.
• Facilitates easy isolation of bottleneck references!
Our Framework
• Bound OpenMP threads for SMP parallelism.
• Access traces collected using dynamic binary rewriting.
• Traces are used for incremental SMP cache simulation (L1 + L2 + coherence).

[Figure: pipeline. The controller instruments the target OpenMP executable; at run time each thread's handler logs and compresses an access trace; the SMP cache simulator replays the traces against a target descriptor (instruction line/file, global & local variables) to extract detailed coherence metrics.]
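The simulation stage replays each trace entry against a cache model. As a rough illustration of the idea only (the actual simulator models L1+L2 plus the coherence protocol; the geometry below is assumed), here is a direct-mapped miss counter driven by a stream of addresses:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NSETS     256   /* assumed: 256 sets, direct-mapped */
#define LINE_BITS 6     /* assumed: 64-byte cache lines */

typedef struct {
    uint64_t tag[NSETS];
    int      valid[NSETS];
    long     misses;
} cache_t;

void cache_init(cache_t *c) { memset(c, 0, sizeof *c); }

/* Replay one trace entry: look the line up; on a miss, count it and
 * fill the set with the new line (evicting the old one). */
void cache_access(cache_t *c, uint64_t addr) {
    uint64_t line = addr >> LINE_BITS;
    uint64_t set  = line % NSETS;
    if (!c->valid[set] || c->tag[set] != line) {
        c->valid[set] = 1;
        c->tag[set]   = line;
        c->misses++;
    }
}
```

The "incremental" aspect of the real framework means per-thread traces are consumed as they are produced rather than stored whole; this sketch only shows the replay step.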
Target Executable Instrumentation

[Figure: source of an OpenMP parallel-for loop (A[I] = B[I] + C[I] followed by a barrier), the corresponding machine code (CALL _xlsmpParallelDoSetup, LOAD B[I], LOAD C[I], STORE A[I], CALL _xlsmpBarrier), its CFG, and the instrumentation points inserted by the controller via DynInst.]

• Enhanced DynInst dynamic binary rewriting package (U. Wisconsin).
• Instrument memory access instructions (LD/ST).
• Instrument compiler-generated OpenMP construct functions.
Per-Reference Metrics
• Uniprocessor misses
• Coherence misses
• Invalidations
• List & count of invalidator references

[Figure: the "fork-join" model: serial code alternating with #pragma omp parallel regions. Per region, invalidations are classified as "in"-region vs. "across"-region, and as true sharing vs. false sharing.]
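To make the region classification concrete, consider a hypothetical two-phase computation (this code is invented for illustration; the pragmas degrade to serial loops if compiled without OpenMP, so the sketch builds either way). An invalidation whose write and victim copy both fall inside one parallel region is "in-region"; one that spans two regions is "across-region".

```c
#include <assert.h>

#define N 8
double a[N];

/* Region 1: threads write disjoint chunks of a[]. Writes near a chunk
 * boundary can invalidate lines cached by a neighboring thread of this
 * same region: "in-region" (often false-sharing) invalidations. */
void region1(void) {
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = (double)i;
}

/* Region 2: reads of a[] miss on lines that region 1's writes
 * invalidated: "across-region" coherence misses. */
double region2(void) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];
    return sum;
}
```

The tool attributes each class of invalidation back to the source reference, which is what the drill-down slides below rely on.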
In-Depth Example: SMG2000
• ASCI Purple benchmark
  • Has been scaled up to 3150 processors
  • Hybrid OpenMP + MPI parallelization
  • Code is known to be memory-intensive
• Code characteristics
  • 72 files, 24213 lines (non-whitespace)
• Instrumentation characteristics
  • 4 OpenMP threads, default workload
  • Functions instrumented: 313 (69 OpenMP, 244 others)
  • Access points instrumented: 10692
    • 8531 loads (2184 64-bit, 6329 32-bit, 18 8-bit)
    • 2161 stores (722 64-bit, 1425 32-bit, 14 8-bit)
  • Tracing: 16.73 million accesses logged.
Overall Performance: SMG2000

[Figures: A. overall misses; B. cumulative coherence misses.]

• Most L2 misses are coherence misses.
• Most invalidations result in coherence misses.
• Only ~280 out of 10692 access points show coherence activity (2.6%!).
• Only ~10% of these points account for >= 90% of the coherence misses.
Drilling Down: Per-Access-Point Metrics

[Figure: top-5 per-access-point metrics for Processor-1.]

• Group-1 refs: false-sharing in-region (same OpenMP region) invalidations dominate.
• Group-2 ref: true-sharing in-region invalidations only.
Drilling Further: A Reference & Its Invalidators

  for k = 0 to Kmax
    for j = 0 to Jmax
      #pragma omp parallel do
      for i = 0 to Imax {
        ...
        rp[k][j][i] = rp[k][j][i] - Ai[] * xp[];
      } //end omp do

[Figure: invalidator list for the rp reference, showing one cache line bouncing among P1-P4.]

• Large number of false in-region invalidations!
• Sub-optimal parallelization: fine-grained sharing.
• Solution: parallelize the outermost loop (coarsening).
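The recommended coarsening can be sketched as follows (the array shapes and the scalar Ai are simplifications invented for this example; the real SMG2000 kernel indexes Ai and xp differently). Moving the parallel directive from the innermost i-loop to the outermost k-loop gives each thread whole k-planes, so threads no longer interleave writes on the same cache lines:

```c
#include <assert.h>

/* Before: "#pragma omp parallel do" on the innermost i-loop lets
 * different threads write adjacent rp elements that share a cache
 * line (fine-grained false sharing). After coarsening, the directive
 * sits on the outermost k-loop instead: */
void relax(int Kmax, int Jmax, int Imax,
           double rp[Kmax][Jmax][Imax], const double xp[Imax],
           double Ai) {
    #pragma omp parallel for    /* each thread owns whole k-planes */
    for (int k = 0; k < Kmax; k++)
        for (int j = 0; j < Jmax; j++)
            for (int i = 0; i < Imax; i++)
                rp[k][j][i] = rp[k][j][i] - Ai * xp[i];
}
```

Coarsening also cuts fork-join overhead, since the parallel region is entered once instead of Kmax * Jmax times.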
Another Optimization

  #pragma omp parallel
  num_threads = omp_get_num_threads();

[Figure: invalidator list showing the num_threads cache line bouncing among P1-P4.]

• Multiple threads updating the same shared variable!
• Solution: remove unnecessary sharing (shared removal):

  num_threads = omp_get_max_threads();
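A compilable sketch of this fix, with a serial fallback so it builds without OpenMP (the `_OPENMP` guard and the two `omp_get_*` calls are standard OpenMP; the wrapper functions are invented for this example):

```c
#include <assert.h>

#ifdef _OPENMP
#include <omp.h>
#else  /* serial fallback so the sketch builds without OpenMP */
static int omp_get_num_threads(void) { return 1; }
static int omp_get_max_threads(void) { return 1; }
#endif

/* Before: every thread in the region writes the same shared variable,
 * bouncing its cache line among all processors (they all store the
 * same value, but each store still invalidates the others' copies). */
int count_shared_write(void) {
    int num_threads = 0;
    #pragma omp parallel
    num_threads = omp_get_num_threads();   /* N redundant writes */
    return num_threads;
}

/* After ("shared removal"): one write, and no parallel region at all,
 * since omp_get_max_threads() answers the same question serially. */
int count_single_write(void) {
    return omp_get_max_threads();
}
```

Note that omp_get_max_threads() reports the default team size, which matches omp_get_num_threads() inside a region unless the team size was overridden.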
Impact of Optimizations
• SMG2000 run on IBM SP Blue Horizon (POWER3).
• Wall-clock times for recommended full-sized workloads (threads = 1, 2, 4, 8).
• Maximum of 73% improvement for the 4th workload.
Highlights & Future Directions
Highlights
• First tool for characterizing OpenMP SMP performance.
• Dynamic binary rewriting: no source code modification!
• Detailed source-correlated statistics.
• Rich set of coherence metrics.
Future Directions
• Use of partial access traces (intermittent instrumentation).
• Other threading models: Pthreads, etc.
• Characterizing perennial server applications (Apache).
The End
Simulator Accuracy: Comparing #Invalidations
• NAS 3.0 OpenMP benchmarks + NBF.
• Total invalidations: hardware counters (IBM SP: HPM) vs. simulator (ccSIM).
• Account for OpenMP runtime overhead in HPM.
• 16.5% maximum absolute error; most benchmarks have <= 7% error.
Related Work
MemSpy: Martonosi et al. (1992)
+ Execution-driven simulation
+ Classifies by code & data objects
+ Invalidations & coherence misses
- No true/false sharing
- No invalidator lists
- Compiler-inserted instrumentation
- Uniprocessor-simulated parallel threads
SM-Prof: Brorsson et al. (1995)
+ Variable classification tool
+ Access classes of "shared/private, read/write, few/many"
- Can't detect true/false sharing or its magnitude
- No coherence misses, no invalidator lists
Rsim, Proteus, SimOS
- Architecture-oriented simulators, only bulk statistics
- Not meant for application developers
Tracing Overhead (earlier work: METRIC)
• 1-3 orders of magnitude overhead in most cases.
• Conventional breakpoints (TRAP) have > 4 orders of magnitude overhead.
Trap-Based Instrumentation (from DynInst documentation)

  application    # operations   ops/sec     dyninst time (sec)   gdb time (sec)
  compress 95    32,513         406,655.7   0.08                 74.35
  li (xlmatch)   110,209        43,607.7    2.53                 221.04
  li (compare)   4,475          640.2       6.99                 16.39
  li (binary)    401            19.4        20.69                21.62
Compression Ratio (earlier work: METRIC)
• Comparison of uncompressed and compressed stream sizes.
• 1 million accesses logged.
• 2-4 orders of magnitude compression in most cases.