Space Processor Radiation Mitigation and Validation Techniques for an 1,800 MIPS Processor Board
Robert Hillman, Mark Conrad, Phil Layton, Chad Thibodeau
Maxwell Technologies

Outline
• Commercial & Rad-Hard Processor Trends
• SCS750 Design Strategy
• Radiation Mitigation
• Radiation Testing
• Radiation Analysis

Commercial vs. Rad-Hard MIPS
[Figure: MIPS of commercial vs. rad-hard processors; labeled points: 80486, PPC603, PPC750, PPC750FX, PPC750GX]
Technology gap: > 10 years

Rad-Hard Upset Trend: Less SEU Immunity
[Figure: mean time to upset and processor bit count by processor; labeled points: 80486, PPC603, PPC750, PPC750FX, PPC750GX]
Mean time to upset decreasing substantially!
• Mean time to upset is decreasing exponentially due to the increasing number of bits per device (caches)
• Processor bit counts are increasing exponentially, dominated by L1 and L2 cache sizes
• Error correction is needed: rad-hard alone is insufficient as the number of bits increases
* Assumes 1E-10 upsets/bit-day

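The trend on this slide follows from simple arithmetic: if every bit upsets independently at a fixed rate, the mean time to upset shrinks in proportion to the bit count. The sketch below applies the slide's 1E-10 upsets/bit-day assumption to cache sizes. The sizes used (32 KB + 32 KB L1, 512 KB L2) are the PowerPC 750FX figures quoted later in this deck, picked here only for illustration; tag and parity bits are ignored, and the function name is hypothetical.

```python
def mean_time_to_upset_days(bits: int, upsets_per_bit_day: float) -> float:
    """Mean time between upsets, assuming independent upsets at a fixed per-bit rate."""
    return 1.0 / (bits * upsets_per_bit_day)


RATE = 1e-10  # upsets/bit-day, the assumption stated on the slide
KB = 1024     # bytes per KB

# Illustrative bit counts (data bits only; tags and parity are ignored).
l1_only_bits = (32 + 32) * KB * 8           # 32 KB I-cache + 32 KB D-cache
l1_l2_bits = (32 + 32 + 512) * KB * 8       # same L1 plus a 512 KB L2

for label, bits in [("L1 only", l1_only_bits), ("L1 + L2", l1_l2_bits)]:
    days = mean_time_to_upset_days(bits, RATE)
    print(f"{label}: {bits} bits, mean time to upset ~ {days:.0f} days")
```

Under these assumptions, adding the L2 cache multiplies the bit count by nine and cuts the mean time to upset by the same factor, which is the scaling the slide attributes to growing caches.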
Importance of L2 Cache
[Figure: performance comparison; curves: SCS750 @ 800 MHz with 512 KByte L2 cache, RH PowerPC @ 133 MHz without L2 cache; annotations: on-chip L1 cache boundary (32 KByte), 1000X]
On-chip L2 cache is critical for high performance

SCS750 Design Strategy
Commercial technology and mitigation technique:
• Latest SOI PowerPC @ 800 MHz: TMR/Resynch & Scrubbing
• High-performance SDRAM: Double Device Correct & HW Scrub (Reed-Solomon)
• On-board control logic: Actel RT-AX built-in TMR
~1 upset per day: unacceptable
Result: SCS750, 1 uncorrected error in > 300 years
Better upset immunity than rad-hard SBCs!

SCS750 Prototype Engineering Module (PEM) – Early Development Platform
FM boards will be conduction cooled, with class S or equivalent components

Radiation Tolerance
• Entire board designed from the ground up for the highest upset immunity and highest performance
• All components latch-up immune
  • Component screening
• Effective SEU rate (entire board!) < 1E-5/board-day
  • One error every 300 years or better!
  • Reed-Solomon, TMR/scrubbing, component selection
  • Typical geosynchronous orbit
• Total dose >= 100 krad
  • Component screening and/or shielding

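The link between the board-level SEU rate and the "300 years" figure is a unit conversion. The sketch below shows it, assuming a constant error rate and a 365.25-day year; the function names are hypothetical.

```python
DAYS_PER_YEAR = 365.25  # assumption: average calendar year


def years_between_errors(errors_per_board_day: float) -> float:
    """Mean time between errors in years for a constant error rate."""
    return 1.0 / errors_per_board_day / DAYS_PER_YEAR


def rate_for_interval(years: float) -> float:
    """Error rate (per board-day) that yields one error every `years` years."""
    return 1.0 / (years * DAYS_PER_YEAR)


print(f"1E-5/board-day -> one error every {years_between_errors(1e-5):.0f} years")
print(f"300 years      -> {rate_for_interval(300):.2e} errors/board-day")
```

A rate of exactly 1E-5/board-day corresponds to roughly 274 years between errors, so "300 years or better" implies a rate slightly below that bound, consistent with the "<" on the slide.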
PowerPC 750FX™ Features
• 0.13 µm SOI, copper interconnect, SiLK
• Extremely low power, runs at 1.2 V – 1.5 V
• 400 MHz: ~2 W each, > 900 MIPS
• 800 MHz: ~4 W each, > 1,800 MIPS
• L1 cache: 32 KB I, 32 KB D, with parity
• L2 cache: 512 KByte on-chip, built-in SEC/DED EDAC
• L2 cache runs at full processor clock rate
• Tags are parity protected
• Dual PLLs for on-the-fly software control of clock rate
• Future products are drop-in compatible

Processor Correction Methodology
• Triple Modular Redundant (TMR) Processing
  • Three processors, each completely isolated
  • No delay in voting (voting and controller all on one chip!)
• Detection
  • Single processor error: flags and isolates the processor
  • Detection of double processor error and auto restart
• Resynchronization & Scrubbing
  • Saves processor states, resets, and reloads the processors
  • Proven to correct any single event effect in the processors
  • < 1 ms delay, < 0.1% overhead with 1 second scrub
Transparent to application software!

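The voting and detection steps above can be illustrated with a small majority voter. This is only a software sketch of the general TMR idea: on the SCS750 the voter is hardware on the controller chip, and the function name, the integer "outputs", and the return convention here are all assumptions made for the example.

```python
from typing import Optional, Tuple


def vote(a: int, b: int, c: int) -> Tuple[Optional[int], Optional[int]]:
    """Majority-vote three processor outputs.

    Returns (value, faulty_index):
      - all agree:      (value, None)
      - one dissents:   (majority value, index of the dissenting processor)
      - no two agree:   (None, None), i.e. a double error that needs a restart
    """
    if a == b == c:
        return a, None
    if a == b:
        return a, 2
    if a == c:
        return a, 1
    if b == c:
        return b, 0
    return None, None


# Processor 1 disagrees: the majority value is used and processor 1 is flagged.
value, faulty = vote(0x1234, 0x1235, 0x1234)
print(hex(value), faulty)
```

With a single dissenting processor the system keeps running on the majority value while the flagged processor is isolated; only when no two outputs agree is there no trustworthy result.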
Processor Correction Snapshot
[Figure: annotated capture of a correction event: Processor 2 is upset; CPU 2 held in reset; all three are resynchronized]

Processor Error Correction Flow
[Flow diagram; elements listed below]
• Start state: no errors
• User-called resynch
• Scrub timer expired
• One processor error: isolate processor and flag error
• Double processor error: restart 3 processors and flag double error
• Waits for software semaphore to continue
• Resynchronize processors: < 1 ms delay
Annotations: software transparent; no uncorrected errors; overhead < 0.1%; occurs typically once in > 12,000 years

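One way to read the flow on this slide is as a small state machine. The sketch below is a reading of the slide's labels, not the controller's actual logic: the state and event names are hypothetical, and the exact arrows (in particular where the software semaphore wait sits) are assumptions, since the diagram itself is not reproduced here.

```python
from enum import Enum, auto


class State(Enum):
    NO_ERRORS = auto()       # start state
    ISOLATED = auto()        # one processor flagged and isolated
    WAIT_SEMAPHORE = auto()  # waiting for software to allow a resynch
    RESYNC = auto()          # resynchronizing processors (< 1 ms on the slide)
    RESTART = auto()         # restarting all three processors


class Event(Enum):
    SINGLE_ERROR = auto()
    DOUBLE_ERROR = auto()
    SCRUB_TIMER = auto()
    USER_RESYNC = auto()
    SEMAPHORE = auto()
    DONE = auto()


# Assumed transitions; any (state, event) pair not listed leaves the state unchanged.
TRANSITIONS = {
    (State.NO_ERRORS, Event.SINGLE_ERROR): State.ISOLATED,
    (State.NO_ERRORS, Event.DOUBLE_ERROR): State.RESTART,
    (State.NO_ERRORS, Event.SCRUB_TIMER): State.WAIT_SEMAPHORE,
    (State.NO_ERRORS, Event.USER_RESYNC): State.WAIT_SEMAPHORE,
    (State.ISOLATED, Event.DOUBLE_ERROR): State.RESTART,
    (State.ISOLATED, Event.SEMAPHORE): State.RESYNC,
    (State.WAIT_SEMAPHORE, Event.SEMAPHORE): State.RESYNC,
    (State.RESYNC, Event.DONE): State.NO_ERRORS,
    (State.RESTART, Event.DONE): State.NO_ERRORS,
}


def step(state: State, event: Event) -> State:
    """Apply one event to the current state."""
    return TRANSITIONS.get((state, event), state)


def run(events, state: State = State.NO_ERRORS) -> State:
    """Apply a sequence of events and return the final state."""
    for event in events:
        state = step(state, event)
    return state


# A single upset: isolate, wait for the semaphore, resynchronize, back to normal.
print(run([Event.SINGLE_ERROR, Event.SEMAPHORE, Event.DONE]))
```

Keeping the transitions in a table makes it easy to compare the sketch against the diagram and to adjust any arrow that was read differently.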
SDRAM Error Correction Comparison
[Figure: SDRAM error correction schemes compared, ranked from worst to best]
Assumptions: 256 MByte memory; memory scrubbed once per day
SCS750 has built-in, programmable hardware memory scrubbing (DMA)

Development Plan
Prototype – Sept 02
• Demonstrated TMR (voting), error detection, resynchronization, radiation performance
Phase I – July 03
Phase II – Dec 03
SCS750 Proto Engineering
• Software compatible with flight
• Same form factor
• Commercial parts
• Customer development
SCS750 EM – Q2 04
• Engineering boards will use “E” or “I” (Engineering or Industrial Grade)
SCS750 FM – Q4 04
• Flight board
• Fully qualified
• All parts class “S” or equivalent
Actel Axcelerator (RT-AX) for flight; Xilinx FPGAs for prototype development

Maxwell 750FX vs. JPL 750FX
[Figure: single processor error comparison, Maxwell (dynamic testing) vs. JPL (register testing); data: Maxwell]
Good correlation

Texas A&M Cyclotron Facility: SCS750 PEM - Processor Test Setup
[Figure: upsets by processor]
Good uniformity

Double Processor Error Rate
SCS750 Triple Processor Results
[Figure: double processor error cross section as a function of scrub rate]
Excellent correlation between actual and computed results!

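The slide does not give the model behind the "computed" curve. A common first-order estimate, shown here purely as an assumption, treats upsets in each processor as independent Poisson events at rate λ that stay latent until the next scrub. For a scrub interval T with λT much less than 1, the chance that two of the three processors are upset within the same interval is about 3(λT)², which gives a double-error rate of roughly 3λ²T. The function name and the numbers in the example are hypothetical.

```python
import math


def double_error_rate(single_rate: float, scrub_interval: float) -> float:
    """First-order double-error rate for three processors.

    Assumes independent Poisson upsets at `single_rate` per processor and
    single_rate * scrub_interval << 1. Both arguments share one time unit.
    """
    return 3.0 * single_rate**2 * scrub_interval


# Hypothetical beam-test numbers: 1e-3 upsets/s per processor.
lam = 1e-3
for interval_s in (1.0, 0.5, 0.1):
    print(f"scrub every {interval_s:>4} s -> {double_error_rate(lam, interval_s):.1e} double errors/s")
```

Under this model the double-error rate falls linearly as the scrub interval shortens, and quadratically as the single-processor upset rate drops.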
Uncorrected Error Rate Comparison
[Table or figure not recovered]
Notes:
• Assuming equivalent shielding of 100 mil aluminum
• GCR = Galactic Cosmic Ray background at solar minimum
• DCF = JPL Design Case Flare (at one A.U.) (1 day)
* Scrub @ 10/sec, performance overhead of < 1%
Source: G. Swift (JPL), RADECS 2003

CONCLUSION
• Under heavy-ion irradiation, the SCS750 upset mitigation scheme detected and corrected all single processor errors.
• When a double processor error occurred, the SCS750 successfully recovered with a reboot and reported the double processor error.
• Power cycling was never required for correction.
The SCS750 architecture, using commercial processors, gives better upset immunity and higher performance than rad-hard processors.