Variability-Tolerance in Tightly-Coupled Parallel Computing Units Abbas Rahimi Advisors: Rajesh K. Gupta and Luca Benini
Outline
• Sources of variations
• Taxonomy of variability-tolerance
• Detect and correct timing errors
  • Timing error abstraction
  • Memoization: spatial and temporal
• Ignore timing errors
  • Ensuring the safety of error ignorance
Sources of Variability
• Variability in transistor characteristics is a major challenge in nanoscale CMOS:
  • Static variation: process (Leff, Vth)
  • Dynamic variations: aging, temperature, voltage droops
• To handle variations, designers add conservative guardbands → loss of operational efficiency
[Figure: guardband on top of the actual circuit delay under process, temperature, VCC-droop, and aging variations, from slow to fast corners]
Variability is about Cost and Scale
• Eliminating the guardband → timing errors → costly error recovery
• 3×N recovery cycles per error! (N = # of pipeline stages)
(My) Taxonomy of Variability-Tolerance
• Guardband: adaptive (predict & prevent → no timing error) vs. eliminated (timing error)
• On a timing error, is there any choice?
  • No → error recovery (detect-then-correct):
    • Abstraction: exposing timing errors to higher levels for better management
    • Memoization: recalling a recent context of error-free execution
  • Yes → error ignorance (detect & ignore): ensuring the safety of error ignorance through a set of rules
Timing Error Abstraction
• Work-unit Vulnerability (WUV): a set of vertical abstractions that reflect the manifestation of circuit-level variability (timing errors) in multiple layers of the software stack
• Characterized metadata for use in runtime optimizations
• From SW management down to HW sensors:
  • Task-level Vulnerability (TLV): work unit
  • Procedure-level Vulnerability (PLV): procedure/task
  • Sequence-level Vulnerability (SLV): instruction sequence
  • Instruction-level Vulnerability (ILV): ISA
[ILV] A. Rahimi et al., "Analysis of Instruction-level Vulnerability to Dynamic Voltage and Temperature Variations," DATE, 2012.
[SLV] A. Rahimi et al., "Application-Adaptive Guardbanding to Mitigate Static and Dynamic Variability," IEEE Trans. on Computers, 2013.
[PLV] A. Rahimi et al., "Procedure Hopping: a Low Overhead Solution to Mitigate Variability in Shared-L1 Processor Clusters," ISLPED, 2012.
[TLV] A. Rahimi et al., "Variation-Tolerant OpenMP Tasking on Tightly-Coupled Processor Clusters," DATE, 2013.
Delay Variability among Stages
• LEON3 in TSMC 65nm:
  • Temperature: −40°C to 125°C
  • Voltage: 0.72V to 1.1V
• The execute and memory stages are very sensitive to voltage and temperature variations, and also exhibit a large number of critical paths compared to the rest of the processor.
• We therefore anticipate that instructions that heavily exercise the execute and memory stages are more vulnerable to voltage and temperature variations → Instruction-level Vulnerability (ILV)
ILV and SLV Metadata
• We compute ILV (or SLV) for each instructioni (or sequencei) at every operating condition:

  ILVi = (Σj Violationj) / Ni    (analogously, SLVi with Mi cycles)

• where Ni (Mi) is the total number of clock cycles in the Monte Carlo simulation of instructioni (sequencei) with random operands, and Violationj indicates whether any stage is violated at clock cyclej.
• ILVi (SLVi) is thus the number of violated cycles over the simulated cycles for instructioni (sequencei).
• Therefore, the lower the ILV (SLV), the better.
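As a minimal sketch of the metric above (the function and array names are illustrative, not from the thesis code), ILV is simply the fraction of simulated cycles that had a violated stage:

```c
#include <assert.h>
#include <stddef.h>

/* ILV_i = (number of cycles with a violated stage) / (total simulated cycles).
 * violation[j] plays the role of Violation_j: nonzero if any pipeline stage
 * violated timing at clock cycle j of the Monte Carlo simulation. */
static double ilv(const int *violation, size_t n_cycles)
{
    size_t violated = 0;
    for (size_t j = 0; j < n_cycles; ++j)
        if (violation[j])
            ++violated;
    return (double)violated / (double)n_cycles;
}
```

The same ratio, computed over Mi cycles of a sequence, yields SLVi.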
Instruction Classification (1/2)
ILV at 0.88V, while varying temperature:
• Instructions are partitioned into three main classes:
  • Logical & arithmetic
  • Memory
  • HW multiply & divide
• The 1st class shows an abrupt, all-or-nothing behavior: most of its exercised paths have the same length, so either all instructions within this class fail or all of them make timing.
Instruction Classification (2/2)
ILV at 0.72V, while varying temperature:
• All instruction classes behave consistently across the wide range of operating conditions: as the cycle time increases gradually, the ILV drops to 0, first for the 1st class, then for the 2nd class, and finally for the 3rd class.
• For a given operating condition: ILV (3rd class) ≥ ILV (2nd class) ≥ ILV (1st class)
Sequence Classification (1/2)
SLV at (0.81V, −40°C) and at (0.81V, 125°C):
• The top 20 most frequent sequences (Seq1–Seq20) are extracted from 80 billion dynamic instructions across 32 benchmarks.
• Sequences are classified into two classes based on the similarity of their SLV values:
  • Class I (Seq1–Seq19) is a mixture of all instruction types, including memory, arithmetic/logical, and control instructions.
  • Class II (Seq20) consists only of arithmetic/logical instructions.
Sequence Classification (2/2)
• For every operating condition: SLV (Class I) ≥ SLV (Class II)
• The sequence classification is consistent across operating corners, but the SLV values of the two classes at the same corner and with the same cycle time are not equal.
• Sequences in Class I need higher guardbands than Class II: in addition to the ALU's critical paths, they activate the critical paths of memory (load/store) and of integer condition codes (control instructions).
• Loop unrolling increases the ratio of Class II to Class I → throughput gain of 5%
Single-core to Multi-core!
• Variations are further exacerbated in multi-core systems: IR drop causes different delays across the cores running an FIR kernel.
• At VDD = 0.81V, three cores cannot meet the target frequency of 830MHz; variation-aware VDD-hopping (per-core choice of 0.81V or 0.99V) tunes each core's voltage based on its delay so that all cores of the cluster meet the 830MHz target.
• The programming model and runtime environment of MIMD machines should be aware of variations → procedure hopping.
Task-Level Vulnerability (TLV)
• OpenMP tasking:
  • A convenient abstraction for programmers to express irregular and unstructured parallelism
  • Enables the scheduler to make better choices
• TLV is a per-core and per-task-type metric:

  TLV(i,j) = ΣEI / Length

  where ΣEI is the number of errant instructions during taskj on corei, and Length is the total number of executed instructions.
• The lower the TLV, the better!
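The TLV ratio above can be sketched as follows (struct and field names are assumptions for illustration, not the thesis runtime's actual types):

```c
#include <assert.h>

/* One task execution on one core: TLV = ΣEI / Length. */
typedef struct {
    unsigned errant_insns;  /* ΣEI: errant instructions observed during the task */
    unsigned length;        /* total instructions executed by the task */
} task_run_t;

static double tlv(task_run_t r)
{
    /* Guard against an empty task so the metric stays well-defined. */
    return r.length ? (double)r.errant_insns / (double)r.length : 0.0;
}
```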
Intra- and Inter-Corner TLV
• Intra-corner TLV at a fixed (1.1V, 25°C): the TLV of each task type is different (up to 9×), even within a fixed operating condition on a corei.
• Inter-corner TLV (across various operating conditions, 45nm):
  • The average TLV of the six task types is an increasing function of temperature.
  • In contrast, decreasing the voltage from the nominal point of 1.1V increases TLV.
Variation-Tolerant Cluster (1/2)
• Inspired by STM STHORM:
  • 16× 32-bit RISC cores, each with an I$ plus a variability sensor, replay, and VDD-hopping block
  • L1 SW-managed Tightly Coupled Data Memory (TCDM): multi-banked/multi-ported, with test-and-set semaphores
  • Fast concurrent read access through a low-latency logarithmic interconnect
  • One clock domain; bridge towards the NoC (L2/L3)
Variation-Tolerant Cluster (2/2)
• Every core is equipped with:
  • EDS error sensing [Bowman'09], to detect any timing error due to dynamic delay variation
  • Multiple-issue replay for error recovery [Bowman'11], to recover the errant instruction without changing the clock frequency
  • VDD-hopping [Miermont'07], to compensate for the impact of static process variation
• Online SW measurements:
  • Per-core TLV metadata characterization
  • Fast accesses through the L1 TCDM
OpenMP Tasking

#pragma omp parallel
{
  #pragma omp single
  {
    for (i = 1...N) {
      #pragma omp task
      FUNC_1 (i);
      #pragma omp task
      FUNC_2 (i);
    }
  }
} /* implicit barrier */

• task directives identify given portions of code (tasks); the example has two task types.
• A task type is defined for every occurrence of the task directive in the program.
• Task descriptors are created upon encountering a task directive and pushed into a FIFO task queue in the TCDM.
• Tasks are fetched and executed by any core encountering a barrier.
Characterizing OpenMP Tasks
• Online TLV characterization:
  • TLV table: LUT containing the TLV for every core and task type
  • Resides in the TCDM; parallel inspection from multiple cores
  • Each core collects TLV information in parallel (distributed scheduler)
  • The LUT is updated at every task execution

void handle_tasks () {
  while (HAVE_TASKS) { // Task scheduling loop
    task_desc_t *t = EXTRACT_TASK ();
    if (t) {
      float Otlv = tlv_table_read (t->task_type_id, core_id);
      /* Reset counter for this core */
      tlv_reset_task_metadata (core_id);
      /* EXEC! */
      t->task_fn (t->task_data);
      /* We executed. Fetch TLV... */
      float tlv = tlv_read_task_metadata (core_id);
      /* Update TLV: average of new and old value */
      tlv_table_write (t->task_type_id, core_id, (tlv + Otlv) / 2);
    }
  }
}
TLV-aware Extensions
(Same tasking example as before; the FIFO task queue in the TCDM is now drained with a TLV-aware fetch.)
• Variation-tolerant OpenMP scheduler:
  • A core fetches a task only if its TLV ≤ threshold, to minimize the number of errant instructions (and costly replay cycles).
  • A limited number of rejections per task avoids starvation.
Variation-aware Scheduling

taskj = HEAD_QUEUE ();
TLV(i,j) = tlv_table_read (corei, taskj);
if (TLV(i,j) > TLV_THR && corei_escape_cnt < ESCAPE_THR) {
  corei_escape_cnt++;
  escape (taskj);
} else {
  assign_to_corei (taskj);
  corei_escape_cnt = 0;
}
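The escape policy above can be paraphrased as a compilable sketch; the thresholds and the function name are illustrative, and the per-core escape counter bounds consecutive rejections so a high-TLV task cannot starve:

```c
#include <assert.h>

#define TLV_THR    0.10  /* illustrative TLV acceptance threshold */
#define ESCAPE_THR 3     /* max consecutive rejections of a task */

/* Decide whether core i takes the task at the head of the queue.
 * Returns 1 if the task is assigned to this core, 0 if it is escaped
 * (left in the queue for a less vulnerable core). */
static int try_assign(double tlv_ij, int *escape_cnt)
{
    if (tlv_ij > TLV_THR && *escape_cnt < ESCAPE_THR) {
        ++*escape_cnt;   /* reject: defer to another core */
        return 0;
    }
    *escape_cnt = 0;     /* accept (or forced accept after ESCAPE_THR) */
    return 1;
}
```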
Experimental Setup
• Architecture: SystemC-based virtual platform modeling the tightly-coupled cluster.
• Benchmarks: seven widely used computational kernels from the image-processing domain, parallelized using OpenMP tasking.
• The TLV lookup table occupies only 104–448 bytes, depending on the number of task types.
Experimental Setup: Variability Modeling
• Each core is optimized during P&R with a target frequency of 850MHz.
• At sign-off, die-to-die and within-die process variations are injected using PrimeTime VX and variation-aware TSMC 45nm libraries (derived from PCA): six cores (C0, C2, C4, C10, C13, C14) cannot meet the design-time target frequency of 850MHz.
• To emulate variations, we integrate variation models at the level of individual instructions using the ILV characterization methodology: ILV models of a 16-core LEON3 for the TSMC 45nm general-purpose process with normal-VTH cells.
• VDD-hopping is applied to compensate for the injected process variation: all cores can then work at the design-time target frequency of 850MHz, but at multiple voltage operating points.
IPC of Variability-affected Cluster
• The scheduler decreases the number of recovery cycles: cores incur fewer errant instructions.
• The normalized IPC increases by 1.17× (on average) across all benchmarks executing at 10°C.
Next: Work-unit Vulnerability (WUV)

#pragma omp parallel
{
  #pragma omp for
  for (i=0; i<N; i++)
    loop_A();         /* WU type 1 */
  #pragma omp sections
  {
    #pragma omp section
    section_A();      /* WU type 2 */
    #pragma omp section
    section_B();      /* WU type 3 */
  }
  for (i=0; i<N; i++)
    #pragma omp task
    loop_B();         /* WU type 4 */
}

• For a given work-unit type, can the scheduler select a subset of cores to reduce recovery cycles?
• WUV covers all work-sharing constructs of OpenMP.
• Hierarchical scheduling for multi-cluster systems.
(My) Taxonomy of Variability-Tolerance
• Guardband: adaptive (predict & prevent → no timing error) vs. eliminated (timing error)
• On a timing error, is there any choice?
  • No → error recovery (detect-then-correct):
    • Abstraction: exposing timing errors to higher levels for better management
    • Memoization: recalling a recent context of error-free execution
  • Yes → error ignorance (detect & ignore): ensuring the safety of error ignorance through a set of rules
Detect and Ignore
• Exploits the inherent error tolerance of some (types of) applications.
• To ensure that it is safe NOT to correct a timing error:
  (i) error rate ≤ threshold_r
  (ii) error significance ≤ threshold_s
• Elastic program model: a region of code that produces an acceptable fidelity metric while tolerating the uncorrected (thus intentionally propagated) errors under constraints (i) and (ii).
[OpenMP Approximation] A. Rahimi et al., "A Variability-Aware OpenMP Environment for Efficient Execution of Accuracy-Configurable Computation on Shared-FPU Processor Clusters," CODES+ISSS, 2013.
OpenMP Compiler Extension

#pragma omp accurate
  structured-block
#pragma omp approximate [clause]
  structured-block
where clause is: error_significance_threshold (<value N>)

Code snippet for a Gaussian filter utilizing the OpenMP variability-aware directives:

#pragma omp parallel
{
  #pragma omp accurate
  #pragma omp for
  for (i = K/2; i < (IMG_M-K/2); ++i) {    // iterate over image
    for (j = K/2; j < (IMG_N-K/2); ++j) {
      float sum = 0;
      int ii, jj;
      for (ii = -K/2; ii <= K/2; ++ii) {   // iterate over kernel
        for (jj = -K/2; jj <= K/2; ++jj) {
          float data = in[i+ii][j+jj];
          float coef = coeffs[ii+K/2][jj+K/2];
          float result;
          #pragma omp approximate error_significance_threshold(20)
          {
            result = data * coef;
            sum += result;
          }
        }
      }
      out[i][j] = sum / scale;
    }
  }
}

• The approximate block is lowered to calls that invoke the runtime scheduler and program the FPU:
  int ID = GOMP_resolve_FP (GOMP_APPROX, GOMP_MUL, 20);
  GOMP_FP (ID, data, coeff, &result);
  int ID = GOMP_resolve_FP (GOMP_APPROX, GOMP_ADD, 20);
  GOMP_FP (ID, sum, result, &sum);
• For MUL & ADD: error sensing is disabled on the 20 less significant bits of the fraction.
Runtime Support for Approximation
• The scheduler ranks all FP pipelines by their ILV: for every FP operation type, a sorted list ILV(FP1) ≤ … ≤ ILV(FPK) ≤ … ≤ ILV(FPN).
• For every approximate instruction:
  • Find a non-busy FPK such that ILV(FPK) ≤ threshold_r
  • Set error-significance (FPK) = threshold_s and configure its opmode
• Accurate instructions are allocated to any available FP pipeline.
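The allocation walk above can be sketched as a linear scan over the pre-sorted list; the struct and function names are illustrative, not the actual runtime API:

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double ilv;  /* characterized error rate of this FP pipeline */
    int busy;
} fpu_t;

/* fpus[] is assumed sorted by ascending ILV. Returns the index of the
 * first idle unit with ILV(FPK) <= threshold_r, or -1 when none qualifies
 * (the caller then falls back to accurate execution or retries). */
static int alloc_approx_fpu(const fpu_t *fpus, size_t n, double threshold_r)
{
    for (size_t k = 0; k < n; ++k)
        if (!fpus[k].busy && fpus[k].ilv <= threshold_r)
            return (int)k;
    return -1;
}
```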
Execution with Approximation Directives
• Errors within bit positions 0 to 20 of the fraction are ignored.
• Sobel program (60×60):
  • Accurate version: the shared FPUs consume 4.6μJ.
  • Approximate version: reduces the energy to 3.5μJ → 25% energy saving.
(My) Taxonomy of Variability-Tolerance
• Guardband: adaptive (predict & prevent → no timing error) vs. eliminated (timing error)
• On a timing error, is there any choice?
  • No → error recovery (detect-then-correct):
    • Abstraction: exposing timing errors to higher levels for better management
    • Memoization: recalling a recent context of error-free execution
  • Yes → error ignorance (detect & ignore): ensuring the safety of error ignorance through a set of rules
Homogeneous Workload
• With a homogeneous workload there is no more choice, and we also want to maintain full utilization (parallelism).
• The cost of recovery is exacerbated, quadratically, in SIMD pipelines:
  • Vertically (wide lanes): any error within any of the lanes causes a global stall and recovery of the entire SIMD pipeline.
  • Horizontally (deep pipes): higher pipeline latency causes a higher cost of recovery through flushing and replaying.
Memoization: in Time or Space
[Figure: spatial error correction reuses the error-free context of a neighboring lane in the same cycle; temporal error correction reuses a context memoized from an earlier, error-free execution (t−k, …, t−1).]
[Temporal] A. Rahimi et al., "Temporal Memoization for Energy-Efficient Timing Error Recovery in GPGPUs," DATE, 2014.
[Spatial] A. Rahimi et al., "Spatial Memoization: Concurrent Instruction Reuse to Correct Timing Errors in SIMD," IEEE Trans. on CAS-II, 2013.
Concurrent/Temporal Instruction Reuse (C/TIR)
• Parallel execution in SIMD provides an opportunity to reuse computation and reduce the cost of recovery by leveraging inherent value locality.
  • CIR: can an instruction be reused spatially across parallel lanes?
  • TIR: can an instruction be reused temporally by a lane itself?
• Utilizing memoization:
  • C/TIR memoizes the result of an error-free execution on an instance of data.
  • The memoized context is reused whenever the operands meet a value-locality constraint.
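As a minimal, hypothetical sketch of the TIR side (a real lane memoizes in hardware; the single-entry table, names, and exact-match policy here are illustrative assumptions):

```c
#include <assert.h>

/* One-entry memo per lane: operands and result of the last error-free
 * FP operation. On a timing error with matching operands, the lane
 * reuses the memoized result instead of triggering a costly replay. */
typedef struct { int valid; float a, b, r; } memo_t;

static int tir_lookup(const memo_t *m, float a, float b, float *r)
{
    if (m->valid && m->a == a && m->b == b) {
        *r = m->r;   /* hit: recover without replay cycles */
        return 1;
    }
    return 0;        /* miss: fall back to conventional replay */
}

static void tir_update(memo_t *m, float a, float b, float r)
{
    m->valid = 1; m->a = a; m->b = b; m->r = r;
}
```

CIR follows the same lookup idea, but the memo is filled by a concurrent lane in the same cycle rather than by the lane's own past execution.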
(My) Taxonomy of Variability-Tolerance
• Guardband: adaptive (predict & prevent → no timing error) vs. eliminated (timing error)
• On a timing error, is there any choice?
  • No → error recovery (detect-then-correct):
    • Abstraction: exposing timing errors to higher levels for better management
    • Memoization: recalling a recent context of error-free execution → approximate error correction!
  • Yes → error ignorance (detect & ignore): ensuring the safety of error ignorance through a set of rules
Value Locality Constraint
• The value-locality constraint determines whether there is value locality between the errant and memoized instructions:
  • α is a tight constraint without masking (enforces full bit-by-bit matching of the operands)
  • β relaxes the criteria of α by masking the 11 less significant bits of the fraction parts during operand comparison
  • γ ignores the 12 less significant bits of the fraction parts
• Accurate and approximate error correction: the memoized context is reused to exactly (or approximately) correct an errant execution on another instance of the same (or adjacent) data.
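A sketch of the masked comparison on IEEE-754 single precision follows; the bit-level formulation (comparing whole words with the low fraction bits masked off) is my reading of the constraint, and the function name is illustrative:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Two float operands "match" if they agree on every bit above the
 * masked low fraction bits: 0 reproduces the exact α rule, 11 gives β,
 * and 12 gives γ. memcpy performs the bit-pattern reinterpretation. */
static int values_match(float a, float b, unsigned masked_frac_bits)
{
    uint32_t ua, ub, mask = ~((1u << masked_frac_bits) - 1u);
    memcpy(&ua, &a, sizeof ua);
    memcpy(&ub, &b, sizeof ub);
    return (ua & mask) == (ub & mask);
}
```

Under β or γ, nearby data-parallel values collapse onto the same memoized context, which is exactly what raises the CIR rate at the cost of some PSNR.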
CIR and PSNR Degradation (1/2)
• γ: CIR rate of 51% for Sobel with a PSNR of 29 dB
• γ: CIR rate of 32% for Gaussian with a PSNR of 39 dB
CIR and PSNR Degradation (2/2)
• Under the value-locality constraint γ, more data-parallel values fuse into a single value → a higher CIR rate for approximate error correction, up to 76% for Sobel.
• On average, applying γ achieves a CIR rate of 51% (32%) on the Sobel (Gaussian) filter with an acceptable PSNR of 29 dB (39 dB).
Architectural Support for CIR
• Single Strong lane, Multiple Weak lanes (SSMW):
  • The SS lane memoizes the output of an error-free instruction.
  • If any MW lane faces an error, it reuses the output of the SS lane.
Effectiveness of CIR during Voltage Droops
• On average across all kernels, SSMW avoids recovery for 62% of the errant instructions, confirming the effective utilization of value locality.
Architectural Support for TIR Temporal memoization module
Energy Efficiency of TIR
• On average, TIR reaches 16% higher GFLOPS/Watt than the baseline architecture.
• TIR maintains energy efficiency even with poor locality, for example when the overall hit rate drops by 46% (EigenValue), 33% (Black-Scholes), 25% (Box), 20% (Haar), and 11% (URNG).
Energy Saving of TIR
• Under voltage overscaling:
  • TIR achieves an 8% average energy saving at the nominal voltage of 0.9V.
  • At 0.8V, TIR reaches an average energy saving of 66%: the baseline faces abrupt error rates and therefore frequent recoveries.
Next: Memristive-based Spatiotemporal Memoization
• Combine spatial (CIR) and temporal (TIR) memoization into spatiotemporal memoization.
• Memristive elements (memristor TCAM) as approximate storage for high-frequency computations.
• Collaborative compilation: kernel profiling with training datasets.
Publication Plan (mapped onto the taxonomy)
• Adaptive guardbanding: DATE'13-b, DAC'13, CODES'13-b
• Error recovery, abstraction: DATE'12, ISLPED'12, TC'13, DATE'13-a; WUV for a journal
• Error recovery, memoization: TCAS-II'13, DATE'14; spatiotemporal and memristive memoization for a journal
• Error ignorance: CODES'13-a, DAC'14; plus sampling and approximate verification
Thank you!
(Background: Farshchian's "The Fifth Day of Creation.")