Paper Report. Utilizing Dynamically Coupled Cores to Form a Resilient Chip Multiprocessor. Christopher LaFrieda, Engin Ipek, José F. Martínez, Rajit Manohar. Computer Systems Laboratory, Cornell University
Paper Report: Utilizing Dynamically Coupled Cores to Form a Resilient Chip Multiprocessor. Christopher LaFrieda, Engin Ipek, José F. Martínez, Rajit Manohar. Computer Systems Laboratory, Cornell University. 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07). Cite count: 66. Presenter: Jyun-Yan Li
Abstract • Aggressive CMOS scaling will make future chip multiprocessors (CMPs) increasingly susceptible to transient faults, hard errors, manufacturing defects, and process variations. Existing fault-tolerant CMP proposals that implement dual modular redundancy (DMR) do so by statically binding pairs of adjacent cores via dedicated communication channels and buffers. • This can result in unnecessary power and performance losses in cases where one core is defective (in which case the entire DMR pair must be disabled), or when cores exhibit different frequency/leakage characteristics due to process variations (in which case the pair runs at the speed of the slowest core).
Abstract (cont.) • Static DMR also hinders power density/thermal management, as DMR pairs running code with similar power/thermal characteristics are necessarily placed next to each other on the die. • We present dynamic core coupling (DCC), an architectural technique that allows arbitrary CMP cores to verify each other’s execution while requiring no static core binding at design time or dedicated communication hardware. Our evaluation shows that the performance overhead of DCC over a CMP without fault tolerance is 3% on SPEC2000 benchmarks, and is within 5% for a set of scalable parallel scientific and data mining applications with up to eight threads (16 processors). Our results also show that DCC has the potential to significantly outperform existing static DMR schemes.
What’s the problem • Chip multiprocessors (CMPs) have become the main path to performance growth • They are susceptible to soft errors, manufacturing defects … • CMPs based on static dual modular redundancy (DMR) • A pair can no longer check for hard or soft errors if one of its cores fails, so the entire pair must be disabled • A pair made of cores with different frequencies runs at the speed of the slower core • This degrades performance • Static pairing limits power density/thermal management • Pairs with similar power/thermal characteristics are placed next to each other
Related work • Using SMT to detect transient faults: the leading thread stores its results in a delay buffer, and the trailing thread re-executes and compares the results • Checkpointed Early Resource RecYcling (Cherry) proposes checkpoint and rollback in an out-of-order processor • [Figure: lineage of related work. Labels: AR-SMT [19]; SRT [17]; SRTR [29] (adding recovery); CRT [14] (chip-level multiprocessor); CRTR [7] (adding recovery); Cherry [11]; Cherry-MP [9] (support for multiple cores); Reunion [23]; MOESI coherence protocol; this paper: Utilizing Dynamically Coupled Cores to Form a Resilient Chip Multiprocessor]
Background • Deep submicron challenges • Soft errors • Manufacturing defects & process variations • Early lifetime failures • Fault tolerance • Synchronization • Input replication • Output comparison • Recovery • Abbreviations: CRT = Chip-Level Redundantly Threaded Processor; SRT = Simultaneous and Redundantly Threaded Processor
Deep submicron challenges • Soft errors (transient faults) • As manufacturing processes shrink, circuits become more sensitive to soft errors • Storage structures: protected by ECC or parity • Combinational logic: the major source of soft errors • Manufacturing defects & process variations • Caused by fabrication-related failure mechanisms or process variations • Burn-in testing accelerates infant mortality and exposes latent failures • Example: 9 cores on the die, 8 cores after manufacturing, 7 cores in the shipped chip • Early lifetime failures • Hard failures caused by electromigration, stress migration, time-dependent dielectric breakdown, and thermal cycling
Fault tolerance - Synchronization • Results must be compared • Both threads must be synchronized • Lockstep: both execute the same instruction in the same cycle • Hard to achieve because of contention for shared resources • Checking: the trailing thread compares its results against the leading thread's • Queues: the leading thread forwards its results, and the trailing thread reads and aligns them • SRTR & CRTR store them in dedicated queues • Reunion stores them in the speculative portion of the store buffer • Every 50 instructions in Reunion • Maximum of 100 instructions
Fault tolerance - Input replication • Problem: • A memory location (address or data) may be updated between the leading thread's load and the trailing thread's load, so the two threads can see different inputs • Load Value Queue (LVQ) - SRTR & CRTR • The leading thread forwards each load result to the trailing thread’s LVQ • The trailing thread loads its values from the LVQ • Relaxed input replication - Reunion • Roll back and single-step until the first memory instruction
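The load value queue idea above can be sketched in a few lines. This is a minimal illustration, not the SRTR/CRTR hardware design: the class name `LoadValueQueue`, its methods, and the dictionary standing in for memory are all hypothetical.

```python
from collections import deque


class LoadValueQueue:
    """FIFO of (address, value) pairs forwarded by the leading thread (hypothetical model)."""

    def __init__(self):
        self._entries = deque()

    def push(self, address, value):
        # Called by the leading thread after it performs a load.
        self._entries.append((address, value))

    def pop(self, address):
        # Called by the trailing thread instead of reading memory.
        queued_address, value = self._entries.popleft()
        if queued_address != address:
            raise RuntimeError("load address mismatch between threads")
        return value


memory = {0x10: 1}
lvq = LoadValueQueue()

# The leading thread loads from memory and forwards the value.
leading_value = memory[0x10]
lvq.push(0x10, leading_value)

# Some other agent updates memory before the trailing thread reaches its load.
memory[0x10] = 2

# The trailing thread reads the LVQ, not memory, so both threads see the same input.
trailing_value = lvq.pop(0x10)
```

Without the queue, the trailing thread would read the updated value and the comparison would report a false error.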
Fault tolerance - Output comparison • SRTR & CRTR • The leading thread forwards its results to the trailing thread’s register value queue (RVQ) and store buffer (StB) • The trailing thread compares them with its own results • Reunion • Fingerprinting • Compresses all new state produced each cycle into a 16-bit signature • Reduces the overhead of communicating results
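A rough sketch of fingerprint-style output comparison follows. It assumes a CRC-16 as the compression function (computed with Python's standard `binascii.crc_hqx`) and a made-up encoding of register updates; the actual hash and state encoding used by Reunion are not given on the slide.

```python
import binascii
import struct


def update_fingerprint(fingerprint, register_index, value):
    """Fold one register update into a running 16-bit CRC (assumed hash)."""
    packed = struct.pack("<BQ", register_index, value & 0xFFFFFFFFFFFFFFFF)
    return binascii.crc_hqx(packed, fingerprint)


def fingerprint_of(updates):
    """Compress a sequence of (register_index, value) updates into 16 bits."""
    fingerprint = 0
    for register_index, value in updates:
        fingerprint = update_fingerprint(fingerprint, register_index, value)
    return fingerprint


leading_updates = [(1, 42), (2, 7), (3, 1000)]
trailing_ok = [(1, 42), (2, 7), (3, 1000)]
trailing_faulty = [(1, 42), (2, 6), (3, 1000)]  # one flipped bit in register 2

match = fingerprint_of(leading_updates) == fingerprint_of(trailing_ok)
mismatch = fingerprint_of(leading_updates) != fingerprint_of(trailing_faulty)
```

Only the 16-bit signature has to be exchanged between the two threads, which is where the bandwidth saving comes from. A 16-bit hash can alias, so a small fraction of multi-bit errors would go undetected.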
Fault tolerance – Recovery • Backward error recovery (BER) – SRTR & CRTR • Copies the committed state of the trailing thread to the leading thread • Recovers only from transient faults • Reunion • Squashes the speculative state and restores the last checkpoint
Proposed method • Dynamic core coupling (DCC) • Cores communicate over the system bus • Latency may be greater than with static coupling • System bus traffic increases • Long checkpoint intervals are used to reduce bus traffic • Benefits • With a defective core, system degradation is 50% of that under static coupling • Cores with similar characteristics can be paired together • Hot spots are minimized by pairing distant cores • [Figure: statically coupled vs. dynamically coupled cores]
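The benefit of pairing cores with similar characteristics can be illustrated with a small sketch. The frequencies below are made-up values, the function names are hypothetical, and the pairing policy (sort by frequency, pair neighbours) is an assumption chosen for illustration; it ignores the thermal goal of pairing distant cores.

```python
def static_pairs(frequencies):
    """Adjacent cores (0,1), (2,3), ... are bound at design time."""
    return [(first, first + 1) for first in range(0, len(frequencies) - 1, 2)]


def dynamic_pairs(frequencies):
    """Pair each core with the core closest to it in frequency (assumed policy)."""
    ordered = sorted(range(len(frequencies)), key=lambda core: frequencies[core])
    return [(ordered[i], ordered[i + 1]) for i in range(0, len(ordered) - 1, 2)]


def total_throughput(pairs, frequencies):
    """Each pair runs at the speed of its slower core."""
    return sum(min(frequencies[a], frequencies[b]) for a, b in pairs)


frequencies = [3.0, 2.0, 3.0, 2.0, 2.5, 2.5, 3.0, 2.0]  # GHz, illustrative only

static_total = total_throughput(static_pairs(frequencies), frequencies)
dynamic_total = total_throughput(dynamic_pairs(frequencies), frequencies)
```

With these numbers the static binding wastes the fast cores that happen to sit next to slow ones, while the dynamic pairing keeps fast cores together.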
Private cache modifications • Support long checkpoint intervals • A large number of memory stores must be buffered • One bit is added to each line of the local cache • Unverified bit: set when the cache line is written • An unverified line can’t be written back to lower-level memory • The bit is cleared at the checkpoint, after which the line can be written back to lower-level memory • All caches are protected by ECC • Shared memory - L2 cache • Each processor of a pair is allowed to load redundantly • Only one processor can write back a dirty cache line • Master: the processor that can write back dirty data • Slave: evicts updated cache lines without writing them back • Both ignore coherence actions
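A minimal sketch of the unverified-bit behaviour described above, using a dictionary as the cache and another as lower-level memory. The class `PrivateCache` and its methods are hypothetical names, and replacement policy, ECC, and coherence are left out.

```python
class PrivateCache:
    """Toy private cache with one 'unverified' bit per line (hypothetical model)."""

    def __init__(self, is_master):
        self.is_master = is_master
        self.lines = {}  # address -> {"data", "dirty", "unverified"}

    def write(self, address, data):
        # Every store marks the line dirty and unverified.
        self.lines[address] = {"data": data, "dirty": True, "unverified": True}

    def can_evict(self, address):
        # Unverified lines must stay in the cache until the next checkpoint.
        return not self.lines[address]["unverified"]

    def checkpoint(self):
        # A successful checkpoint verifies everything written so far.
        for line in self.lines.values():
            line["unverified"] = False

    def evict(self, address, memory):
        if not self.can_evict(address):
            raise RuntimeError("unverified line cannot leave the cache")
        line = self.lines.pop(address)
        # Only the master writes dirty data back; the slave drops it silently.
        if line["dirty"] and self.is_master:
            memory[address] = line["data"]


memory = {}
master = PrivateCache(is_master=True)
slave = PrivateCache(is_master=False)
for cache in (master, slave):
    cache.write(0x40, 99)

blocked_before_checkpoint = not master.can_evict(0x40)

for cache in (master, slave):
    cache.checkpoint()
slave.evict(0x40, memory)
memory_after_slave_evict = dict(memory)
master.evict(0x40, memory)
```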
Private cache modifications (cont.) • Deadlock • Occurs if unverified dirty lines are allowed to remain in the cache after the application is descheduled by the OS • The next application writes new data to the cache • Eventually every cache line is locked • Before control is transferred, all unverified lines must be verified • Context switch • Master -> slave • Write all verified data back to memory • Slave -> master • Flush the cache
Synchronization • Triggered by a scheduled or unscheduled checkpoint request • Received by the leading processor or the trailing processor • If the comparison fails, the last checkpoint is restored and the checkpoint interval is repeated • If synchronization can’t finish within a fixed timeout period • Roll back, and the new checkpoint interval is half of the last interval • [Figure: checkpoint synchronization timeline of the leading and trailing processors. Labels: normal execution, execute enough instructions, wait, catch-up, state compression, broadcast current state, compare result, checkpoint]
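The retry policy above can be written as a small decision function. This is a sketch under assumptions: the `Outcome` enum, the function name, and the `minimum_interval` floor are all illustrative and not taken from the paper.

```python
from enum import Enum


class Outcome(Enum):
    MATCH = "match"        # compressed states agree
    MISMATCH = "mismatch"  # compressed states differ
    TIMEOUT = "timeout"    # synchronization did not finish in time


def next_step(outcome, interval, minimum_interval=1):
    """Return (rollback, next_interval) for one checkpoint attempt.

    minimum_interval is an assumed floor so the interval never reaches zero.
    """
    if outcome is Outcome.MATCH:
        return False, interval
    if outcome is Outcome.MISMATCH:
        # Restore the last checkpoint and repeat the same interval.
        return True, interval
    # Timeout: roll back and retry with half the interval.
    return True, max(minimum_interval, interval // 2)


rollback, interval = next_step(Outcome.TIMEOUT, 10000)
```

Halving the interval after a timeout shortens the amount of work that has to be re-executed and resynchronized on the next attempt.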
State compression • Reduces the bandwidth required to compare state between 2 cores • Register file • All of its state is compressed at the end of a checkpoint interval • Memory stores • Compressed each cycle during the interval • Both address and data are compressed • Store the application’s process control block
Recovery • Backward Error Recovery (BER) • Roll back to the last checkpointed state and invalidate all unverified cache lines • Transient faults can be resolved by this recovery, but permanent faults … • Forward Error Recovery (FER) • The cache controller of the master core issues a TMR request • A flag is set in the kernel’s address space • A predetermined node jumps to an interrupt vector to call the OS • The OS allocates a third core • The last checkpointed state is copied to the new core • [Figure: master, slave, and 3rd core]
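Once a third core has been allocated, a majority vote over the three cores' results can single out the faulty one. The slide does not spell out the voting step, so the sketch below is an assumption about how a TMR comparison could look; the function name and the use of integer fingerprints are hypothetical.

```python
from collections import Counter


def find_faulty_core(fingerprints):
    """Return the core that disagrees with the other two, or None if all agree.

    fingerprints maps a core name to the compressed state it produced after
    re-executing the interval; exactly three cores are expected.
    """
    if len(fingerprints) != 3:
        raise ValueError("TMR voting needs exactly three cores")
    majority_value, votes = Counter(fingerprints.values()).most_common(1)[0]
    if votes == 3:
        return None
    if votes < 2:
        raise RuntimeError("no majority: all three cores disagree")
    for core, value in fingerprints.items():
        if value != majority_value:
            return core


faulty = find_faulty_core({"master": 0xBEEF, "slave": 0x1234, "third": 0xBEEF})
```

After the faulty core is identified, the two agreeing cores could continue as a new master-slave pair, so TMR is only needed temporarily.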
Parallel application support • Node: a particular master-slave pair • Remote nodes: all other nodes • Checkpoint • Each node can initiate a checkpoint • The bus controller issues synchronization requests to all nodes • Each node sends an acknowledgment to the bus controller after its master has synchronized completely with its slave • The bus controller then issues the checkpoint request • If the states do not match, all processors roll back to the last checkpoint • [Figure: global checkpoint sequence over the bus between nodes, each a master (M) and slave (S): 1. synchronization request, 2. synchronizing, 3. ack signal, 4. checkpoint request, 5. compare]
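The five-step global checkpoint can be sketched as a short simulation. The `Node` class, the string states, and the return values are hypothetical stand-ins; bus timing and the actual state compression are not modelled.

```python
class Node:
    """A master-slave pair; the states stand in for compressed execution state."""

    def __init__(self, name, master_state, slave_state):
        self.name = name
        self.master_state = master_state
        self.slave_state = slave_state

    def synchronize(self):
        # Step 2: the slave catches up with the master.
        # Step 3: the node acknowledges to the bus controller.
        return True

    def states_match(self):
        # Step 5: compare master and slave state.
        return self.master_state == self.slave_state


def global_checkpoint(nodes):
    """Return 'commit' if every node verifies, otherwise 'rollback'."""
    # Step 1: the bus controller sends a synchronization request to all nodes.
    acknowledgments = [node.synchronize() for node in nodes]
    if not all(acknowledgments):
        return "rollback"
    # Step 4: the bus controller issues the checkpoint request.
    if all(node.states_match() for node in nodes):
        return "commit"
    # Any mismatch rolls every processor back to the last checkpoint.
    return "rollback"


healthy = [Node("n0", "a", "a"), Node("n1", "b", "b")]
one_fault = [Node("n0", "a", "a"), Node("n1", "b", "x")]
healthy_result = global_checkpoint(healthy)
fault_result = global_checkpoint(one_fault)
```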
Coherence in parallelism • Issue • Cache line data may be unverified • Unverified data can’t be written back to memory • MOESI • Allows modified data to be copied to another cache • Master or slave behavior must be considered in sharing decisions • The master supplies data to remote nodes • Slave reads are treated as ordinary reads for remote state transitions, but slave updates are ignored • Read-exclusive request when the remote copy is dirty and unverified • The slave must also copy the line and mark it unverified • If the cache buffer overflows, all cores roll back • Invalidation of a remote copy by an update when the remote copy is dirty and unverified • If the slave has no copy, the request turns into a read-exclusive • If the slave has a copy, it is marked dirty and unverified
Master-slave consistency in parallelism • Master-slave memory access window (window) • Remote intervention constraint • While a node has an R/W window open, other nodes can’t open a W window • Age table for implementation • Age: total number of loads and stores • Detects memory operations that are subject to the remote intervention constraint • Direct-mapped, untagged SRAM • [Figure: windows being opened and closed]
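The remote intervention constraint can be illustrated with a deliberately simplified tracker. This sketch does not model the age table: it uses a plain dictionary keyed by address, and it assumes a window opens when one core of a pair accesses a location and closes when its partner performs the matching access. All names are hypothetical.

```python
class WindowTracker:
    """Tracks open master-slave access windows per address (simplified model)."""

    def __init__(self):
        self.open_windows = {}  # address -> name of the node holding the window

    def open_window(self, node, address):
        # Assumed: the first core of the pair to access the address opens the window.
        self.open_windows[address] = node

    def close_window(self, node, address):
        # Assumed: the partner core's matching access closes the window.
        if self.open_windows.get(address) == node:
            del self.open_windows[address]

    def remote_write_allowed(self, node, address):
        # Another node may not open a write window while this one is open.
        holder = self.open_windows.get(address)
        return holder is None or holder == node


tracker = WindowTracker()
tracker.open_window("node0", 0x80)
blocked = not tracker.remote_write_allowed("node1", 0x80)
tracker.close_window("node0", 0x80)
allowed_after_close = tracker.remote_write_allowed("node1", 0x80)
```

The point of the constraint is that a remote write landing between the master's and the slave's access would make the two cores observe different values for the same location.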
Before the experiments • Hardware & software cost • Comparison with SRTR, CRTR, and Reunion • Fault-tolerance capability • Performance degradation or improvement • Power consumption
Model architecture • For a single application • For parallel applications with global checkpoints • Added on each node • The bus arbiter’s synchronization request • Master-slave synchronization • The master leads the slave by less than 200 cycles (100 cycles on average) • Waiting time is ignored • The checkpoint latency
DCC overhead - Sequential • Sequential applications: SPEC2000 • Baseline: single core • Most of the overhead is spent on • synchronizing the cores, compressing the register file, and communicating results over the system bus • [Figure: overhead vs. checkpoint interval; labelled values: 20%, 5%, 3%]
DCC overhead - Parallel • Parallel applications: a set of scalable scientific and data mining applications • Speedup • Overhead with a 64-entry age table • Performance overhead: 4-5% at 8 threads • Kmeans: largest overhead • Barnes: smallest overhead
Relaxed input replication • Reunion proposes a simpler scheme, relaxed input replication • Why not use it? • It relies on dedicated communication channels • and … more read-write sharing • [Figure label: little read-write sharing]
Performance under manufacturing defects • Baseline: a sequential run without fault tolerance • 2 defective cores • Static DMR (SDMR) is idealized, with no overhead • Defects are randomly distributed • Both may fall in the same SDMR pair • [Figure values: 5.56, 4.95, 2.63, 1.97]
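Why the placement of the two defects matters can be checked by enumerating every placement. The 8-core chip below is an assumption for illustration (the slide does not state the core count used for this experiment), and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def usable_pairs_static(core_count, defective):
    """A static pair (0,1), (2,3), ... survives only if both of its cores work."""
    return sum(
        1
        for first in range(0, core_count, 2)
        if first not in defective and first + 1 not in defective
    )


def usable_pairs_dynamic(core_count, defective):
    """Any two working cores can be coupled."""
    return (core_count - len(defective)) // 2


def average_pairs(core_count, defect_count, usable_pairs):
    """Average number of usable pairs over all equally likely defect placements."""
    placements = list(combinations(range(core_count), defect_count))
    total = sum(usable_pairs(core_count, set(p)) for p in placements)
    return Fraction(total, len(placements))


CORES = 8  # assumed chip size
static_average = average_pairs(CORES, 2, usable_pairs_static)
dynamic_average = average_pairs(CORES, 2, usable_pairs_dynamic)
```

Under this assumption only 4 of the 28 placements put both defects in the same static pair, so static DMR usually loses two pairs while dynamic coupling always loses exactly one.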
Conclusions • Dynamic core coupling (DCC) • Allows any CMP cores to verify each other’s execution without dedicated communication channels • Recovers from permanent faults without constant TMR • My comments • The paper discusses several fault-tolerance issues • Synchronization • Input replication • Output comparison • Coherence in parallel applications • It needs OS support for coupling the processors
Simultaneous multithreading (SMT) • One form of thread-level parallelism (TLP) • Executes multiple instructions from multiple threads at the same time • Architecture • superscalar • Picture from “Computer Architecture – a Quantitative Approach”, John L. Hennessy, David A. Patterson
MOESI cache coherency protocol • [Figure: two state-transition diagrams, one for local and one for remote accesses, over the states Modified, Owned, Exclusive, Shared, and Invalid. Edge labels: Reset, Read Hit, Write Hit, Write Hit/WB, Read Miss/BusRd, Write Miss (WB memory)]