
Luis E. Cordova, Duncan A. Buell

A Novel High-level Dynamic Hardware-Software Remapping Technique for Mission Critical Reconfigurable Computers


Presentation Transcript


  1. A Novel High-level Dynamic Hardware-Software Remapping Technique for Mission Critical Reconfigurable Computers Luis E. Cordova, Duncan A. Buell Cordova

  2. Outline
     • Problem and definitions
     • Motivation
     • Architecture
     • N Techniques
     • Advantages
     • Disadvantages
     • Lessons learned

  3. Problem and Definitions
     Reconfigurable computers (RCs):
     • RCs are built from FPGAs + CPUs + memories, etc.
     • RCs are general-purpose embedded platforms
     • RCs are used to accelerate scientific applications
     Problem:
     • Achieve fault tolerance over heterogeneous hardware
     • An RC requires knowledge of where the electronics inside the satellite is used, its orbit, for how long, and which direction it is facing → the solution is adaptability
     Notes: There are more FPGAs than microprocessors in an RC (examples: SRC, BEE, etc.). This is the chicken-and-egg dilemma: "SRAM-based FPGAs are less reliable than microprocessors", but they are reconfigurable.

  4. Motivation
     [Figure: ground tracking of the LEO orbit of the CEASE device. Source: [Amptek03]]

  5. Case: SRC Hardware Architecture
     [Figure: block diagram of the SRC hardware. Labels include two P4 (2.8 GHz) processors with L2 caches, MIOC, computer memory (8 GB), PCI-X, DDR interface, uP board, SNAP, a control FPGA (XC2V6000), on-board memory (24 MB), user FPGA 1 and FPGA 2 (both XC2V6000), and chain ports. Link labels include 22400 MB/s, 4800 MB/s (6x 64 bits), 4256 MB/s, 2400 MB/s (192 bits), 2x800 MB/s, 1064 MB/s, and 108-bit links. The maximum payload rate is 1400 MB/s, and each chain port provides 2400 MB/s. Source: [SRC]]

  6. Fault Tolerance Techniques
     • Dynamic FPGA-HOST remapping
     • Dynamic FPGA-FPGA remapping
     • FPGA Checkpointing
     • System-level radiation tolerance
     • Sanity checks with golden copy
     • Streaming Heart Beat Signal
     • Redundancy-based Data integrity
     • Control Flow Tolerance
     • Memory scrubbing
     • Hardware-Software Backup Threading
     • Dynamic Spatial Radiation Tolerance
     • HW-HW and HW-SW injection
     Technique categories shown on the slide: Remapping and Recovery, Monitoring, Protection, Profiling

  7. System Dynamic Remapping
     Dynamic redistribution between uProc and FPGAs:
     • Program code: main( ) { comp 1 … comp N } // end
     • Radiation environment (SETs/SEUs)
     • Faults on the uProc are handled by other methods
     • Faults on the FPGA side
     • Speedup demand of the computation
     Trade-offs: parallelism, tolerance, FPGA resources
     [Figure: labels include comp 4, comp 7, User FPGA 1, and User FPGA 2.]

  8. Static Host-FPGA Mapping
     [Figure: Hybrid Computer Under Test (HCUT), showing a processor and a MAP (reconfigurable fabric) with BRAM and OBM. Labeled routines: main ( ), map_function ( ), parity & check ( ), check & parity ( ), saboteur ( ) (twice), self_repair ( ), diagnose ( ).]

  9. Remapping and Monitoring
     Hierarchical tolerance: uProc level, MAP level, RTL level
     [Figure: labels include Host RadHard uProc, on-board memory (OBM), Chip 1, Chip 2, dynamic remapping between uProc and FPGAs, streaming heart beat, bridge heart beat, and dynamic remapping between FPGAs.]

  10. Top level Remapping
      The amount of hardware functionality mapped goes from more to less across levels 1, 2, and 3.

      if (mapIt(mapnum1)) {
          fprintf(stdout, "Hybrid level 1 failed!\n");
          fprintf(stdout, "Entering hybrid level 2.\n");
          if (mapIt(mapnum2)) {
              fprintf(stdout, "Hybrid level 2 failed!\n");
              fprintf(stdout, "Entering level 3 (full software).\n");
              /* Computation on Software */
              computeInSoftware(A, B, C, D);
          } else {
              user2(n, A, B, C, D, &time, 0);
          }
      } else {
          user1(n, A, B, C, D, &time, 0);
      }
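The three-level fallback on this slide can be exercised without hardware. The following is a minimal sketch, assuming (as the slide's code suggests) that the mapping call returns nonzero on failure. `map_it`, `map_available`, and `run_with_fallback` are hypothetical names introduced here for illustration; they are not part of the SRC API.

```c
#include <assert.h>
#include <stdio.h>

#define NUM_MAPS 3

/* Hypothetical availability table standing in for real MAP hardware:
 * map_available[k] != 0 means MAP number k can be used. */
int map_available[NUM_MAPS] = {0, 1, 1};

/* Hypothetical stand-in for the slide's mapIt(): 0 = success, nonzero = failure. */
int map_it(int mapnum) {
    assert(mapnum >= 0 && mapnum < NUM_MAPS);
    return map_available[mapnum] ? 0 : 1;
}

/* Returns the level that ends up running the computation:
 * 1 or 2 for a hybrid (MAP) level, 3 for the full-software fallback. */
int run_with_fallback(int mapnum1, int mapnum2) {
    if (map_it(mapnum1) == 0) {
        return 1;                      /* user1(...) would run here */
    }
    fprintf(stdout, "Hybrid level 1 failed! Entering hybrid level 2.\n");
    if (map_it(mapnum2) == 0) {
        return 2;                      /* user2(...) would run here */
    }
    fprintf(stdout, "Hybrid level 2 failed! Entering level 3 (full software).\n");
    return 3;                          /* computeInSoftware(...) would run here */
}
```

Returning the level instead of calling the compute routines directly keeps the decision logic separate from the computation, which makes it easy to test each failure combination.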

  11. Block RAM 'Flip-Flop' Scrubbing

      // computation
      for (i = 0; i < n; i++) {
          tmr_in = al[i];
          saboteur = bl[i];               // reading input stream
          // Block RAM Scrubbing Technique
          if (i % 2) {
              bram_rw = scrubb_flip[i];   // parity check flag error
              scrubb_flop[i] = bram_rw;
          } else {
              bram_rw = scrubb_flop[i];   // parity check flag error
              scrubb_flip[i] = bram_rw;
          }
          // bram_rw is used later on
          ...

      [Figure: two Block RAMs, "Scrubb flip" and "Scrubb flop", each with parity bits. On each iteration one is read (with a parity check) into bram_rw and the other is written; the roles swap between even and odd iterations.]
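The flip/flop alternation on this slide can be modeled in software. This is a minimal sketch under stated assumptions: the parity bit that the Block RAM hardware would provide is modeled explicitly in a struct, and `scrub_word`, `parity32`, `scrub_step`, and `SCRUB_WORDS` are hypothetical names, not part of the original design.

```c
#include <assert.h>
#include <stdint.h>

#define SCRUB_WORDS 8   /* hypothetical buffer depth */

/* One stored word plus an explicit even-parity bit (models the BRAM parity bit). */
typedef struct {
    uint32_t data;
    uint8_t  parity;
} scrub_word;

/* Returns 1 when the word has an odd number of set bits, 0 otherwise. */
uint8_t parity32(uint32_t v) {
    uint8_t p = 0;
    while (v) {
        p ^= (uint8_t)(v & 1u);
        v >>= 1;
    }
    return p;
}

/* One scrubbing step: read from one buffer, check parity, rewrite the other.
 * Odd i reads flip and writes flop; even i does the reverse, as on the slide.
 * Sets *error to 1 when the parity check fails; never clears it. */
uint32_t scrub_step(scrub_word *flip, scrub_word *flop, int i, int *error) {
    assert(i >= 0 && i < SCRUB_WORDS);
    scrub_word *src = (i % 2) ? flip : flop;
    scrub_word *dst = (i % 2) ? flop : flip;
    uint32_t bram_rw = src[i].data;
    if (parity32(bram_rw) != src[i].parity) {
        *error = 1;                     /* upset detected on read */
    }
    dst[i].data = bram_rw;
    dst[i].parity = parity32(bram_rw);
    return bram_rw;
}
```

The sketch only flags a mismatch. What to do next (for example, reloading the word from a known-good copy) is left open, since the slide does not specify it.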

  12. Hardware-Hardware & Software-Hardware Fault Injection

      // datapath level module redundancy -- DPLMR
      result_1 = tmr_in * bram_rw + (saboteur & 16LL);
      result_2 = tmr_in * bram_rw + (saboteur & 8LL);
      result_3 = tmr_in * bram_rw + (saboteur & 4LL);
      result_4 = tmr_in * bram_rw + (saboteur & 2LL);
      ...

      [Figure: redundant data-paths 1 to N. Each data-path k multiplies tmr_in by bram_rw and adds the saboteur bit for k to produce result_k.]
      • Hardware-Hardware (LFSR = linear feedback shift register)
      • Software-Hardware (recall previous slide): saboteur = bl[i]; // reading input stream

  13. Dynamic Spatial Radiation Hardening 1

      if ((result_1 == result_2) && (en_hub1 == 1) && (en_hub2 == 1)) {
          final_result = result_1; mul_diagnose_opt = 12;
      } else if ((result_2 == result_3) && (en_hub2 == 1) && (en_hub3 == 1)) {
          final_result = result_2; mul_diagnose_opt = 23;
      } else if ((result_3 == result_4) && (en_hub3 == 1) && (en_hub4 == 1)) {
          final_result = result_3; mul_diagnose_opt = 34;
      } else if ((result_4 == result_5) && (en_hub4 == 1) && (en_hub5 == 1)) {
          final_result = result_4; mul_diagnose_opt = 45;
      } else if ((result_5 == result_1) && (en_hub5 == 1) && (en_hub1 == 1)) {
          final_result = result_5; mul_diagnose_opt = 51;
      } else {
          final_result = result_5; mul_diagnose_opt = 55;
      }

      [Figure: data-paths 1 to N feed an enabling hub; the enabled results (result_A, result_B, result_C) go to a voter that produces final_result and the multi-diagnose option.]
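The pairwise voter on this slide generalizes to a loop over adjacent data-paths. The following is a minimal sketch, assuming five data-paths held in arrays; `vote` and `N_PATHS` are hypothetical names, and any nonzero enable value is treated as enabled (the slide tests for exactly 1).

```c
#include <assert.h>

#define N_PATHS 5

/* Picks the first adjacent pair of enabled data-paths whose results agree.
 * Returns the diagnose code used on the slide (12, 23, 34, 45, 51) and stores
 * the agreed value in *final_result. If no enabled pair agrees, returns 55 and
 * falls back to the last data-path's result, as the slide's final else does. */
int vote(const long long result[N_PATHS], const int en_hub[N_PATHS],
         long long *final_result) {
    assert(final_result != 0);
    for (int k = 0; k < N_PATHS; k++) {
        int next = (k + 1) % N_PATHS;          /* wraps 5 -> 1 */
        if (en_hub[k] && en_hub[next] && result[k] == result[next]) {
            *final_result = result[k];
            return (k + 1) * 10 + (next + 1);  /* e.g. paths 2 and 3 -> 23 */
        }
    }
    *final_result = result[N_PATHS - 1];
    return 55;
}
```

With data-paths 1 to 3 enabled and all agreeing, the call returns 12; if data-path 1 is corrupted, it returns 23 and still yields the correct value.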

  14. Dynamic Spatial Radiation Hardening 2

      // on-next-iteration do enable/disable redundant datapaths circularly
      if (temp_mul_diagnose_opt != mul_diagnose_opt) {
          temp_v  = en_hub5;
          en_hub5 = en_hub4;
          en_hub4 = en_hub3;
          en_hub3 = en_hub2;
          en_hub2 = en_hub1;
          en_hub1 = temp_v;
      }
      temp_mul_diagnose_opt = mul_diagnose_opt;

      [Figure: example with N = 5 and en_hub1..en_hub5 = 1, 1, 1, 0, 0, so the enabled data-paths are 1, 2, 3 (*). The enable bits rotate through temp_v. Data-paths 1 to N feed the enabling hub; result_A, result_B, and result_C go to the voter, which produces final_result and the multi-diagnose option.]
      * implementing an LFSR is similar
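The circular shift of the enable bits can be written over an array. This is a minimal sketch, assuming the five enables are stored in `en_hub[0..4]`; `rotate_enables`, `update_enables`, and `N_HUBS` are hypothetical names introduced for illustration.

```c
#include <assert.h>

#define N_HUBS 5

/* Rotates the enable bits by one position: en_hub[k] takes the old en_hub[k-1],
 * and en_hub[0] takes the old last entry (the slide's temp_v). */
void rotate_enables(int en_hub[N_HUBS]) {
    int temp_v = en_hub[N_HUBS - 1];
    for (int k = N_HUBS - 1; k > 0; k--) {
        en_hub[k] = en_hub[k - 1];
    }
    en_hub[0] = temp_v;
}

/* Rotates only when the diagnose code changed since the previous iteration.
 * Returns the code to remember for the next call (temp_mul_diagnose_opt). */
int update_enables(int en_hub[N_HUBS], int prev_opt, int mul_diagnose_opt) {
    assert(en_hub != 0);
    if (prev_opt != mul_diagnose_opt) {
        rotate_enables(en_hub);
    }
    return mul_diagnose_opt;
}
```

Starting from enables 1, 1, 1, 0, 0, one rotation gives 0, 1, 1, 1, 0, so data-path 1 is retired and data-path 4 is brought in.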

  15. Control Flow Tolerance: IF statement

      // Agent-based control flow technique
      // Macro arguments and body are parenthesized so that nested uses expand correctly
      #define xor(x,y) (((x) & !(y)) | (!(x) & (y)))
      control_flag1 = 0; control_flag2 = 0;
      ...
      if (condition) { control_flag1 = 1; ... }
      ...
      if (condition & tolerance) { control_flag2 = 1; ... }
      error_flag = xor(xor(condition, control_flag1), control_flag2) ...;

      [Figure: the "if true" path sets control_flag1 and the "if false" path sets control_flag2 through muxes; condition and the two flags are combined into error_flag.]
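The idea of flagging each branch and cross-checking the flags against the condition can be shown in a few lines. This is a minimal sketch and not the exact expression from the slide, whose full form is elided: it assumes one flag per branch and treats "exactly one flag set, matching the condition" as the healthy case. `checked_if`, `XOR01`, and the `skip_branch` fault-injection parameter are hypothetical.

```c
#include <assert.h>

/* Logical XOR on truth values; fully parenthesized so it nests safely. */
#define XOR01(x, y) ((!(x)) != (!(y)))

/* Runs a guarded if/else and returns 1 when the branch flags disagree with
 * the condition. skip_branch != 0 injects a fault: neither branch body runs,
 * mimicking a control-flow upset. */
int checked_if(int condition, int skip_branch) {
    int control_flag1 = 0;   /* set inside the "true" branch  */
    int control_flag2 = 0;   /* set inside the "false" branch */

    if (condition && !skip_branch)  { control_flag1 = 1; }
    if (!condition && !skip_branch) { control_flag2 = 1; }

    /* The two branches are mutually exclusive by construction. */
    assert(control_flag1 + control_flag2 <= 1);

    /* Error when flag1 does not mirror the condition,
     * or when the two flags are not complementary. */
    return XOR01(condition, control_flag1) || !XOR01(control_flag1, control_flag2);
}
```

A call with `skip_branch` set to 0 returns 0 for either value of the condition, and a call with `skip_branch` set to 1 returns 1, so a lost branch is detected regardless of which branch should have run.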

  16. Control Flow Tolerance: FOR statement

      // Agent-based control flow technique
      #pragma src parallel sections
      {
          #pragma src section
          {
              // data path
              for (i = 0; i < sz; i++) { control_counter1++; }
          }
          #pragma src section
          {
              // dummy path
              for (i = 0; i < sz; i++) { control_counter2++; }
          }
      }
      if (control_counter1 == control_counter2) { error = 0; } else { error = 1; }

      [Figure: the data path drives counter1 and the dummy path drives counter2; comparing the two counters produces error.]
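The counter comparison between a data path and a dummy path can be tried in ordinary sequential C. This is a minimal sketch that runs the two loops one after the other rather than in parallel SRC sections; `checked_for` and the `drop_at` fault-injection parameter are hypothetical names introduced for illustration.

```c
#include <assert.h>

/* Sums data[0..sz-1] in the data path while a dummy loop counts the same trip
 * count independently. drop_at injects a fault: the data path loses the
 * iteration with that index (pass -1 for no fault).
 * Returns 1 when the two counters disagree, 0 otherwise. */
int checked_for(const int *data, int sz, int drop_at, long *sum) {
    assert(data != 0 && sum != 0 && sz >= 0);
    int control_counter1 = 0;   /* data path  */
    int control_counter2 = 0;   /* dummy path */

    *sum = 0;
    for (int i = 0; i < sz; i++) {
        if (i == drop_at) {
            continue;           /* injected control-flow upset */
        }
        *sum += data[i];
        control_counter1++;
    }
    for (int i = 0; i < sz; i++) {
        control_counter2++;
    }
    return control_counter1 != control_counter2;
}
```

A lost iteration changes both the sum and the data-path counter, so the mismatch with the dummy counter signals that the result should not be trusted.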

  17. Resource Utilization
      Area is crucial for assessing efficiency, but it is also a flexible variable that we can tune with our programming model.
      Table I. Resource Utilization [table contents not present in the transcript]
      * Design levels: 1 = bare-bones design, 2 = radhard design (moderate), 3 = radhard design (high)
      Total for chip xc2v6000-ff1517-4: 33,792 slices (x2 FFs), 144 Mult/BRAM

  18. FPGA Checkpointing

      // attempt to back up the On-Board-Memory (OBM) banks
      if (status == temporary_failure) {
          obm_single_dma_stripe_backup(status, backed_up_obm_data);
      } else if (status == at_speed_backup) {
          obm_double_dma_looping_backup(status, backed_up_obm_data);
      } else { // FPGA unrecoverable
          backed_up_obm_data = NULL;
          status = 0;
      }

      [Figure: labels include Host RadHard uProc, control, status, backed_up_obm_data, OBM banks A to H, Chip 1, and Chip 2.]

  19. Hardware-Software Backup Threading
      Two types of threads:
      1. POSIX thread backup
      2. FPGA leading thread
      [Figure: two microprocessors, uP1 and uP2, linked by message passing. Each is shown with an FPGA routine for comp x and an OpenMP backup.]

  20. Compute Data Integrity

      // Compute Data Integrity technique
      int main(){
          rst_count = 0; hw_valid = 0;
          ...
          for (i = 0; i < compute_blocks; i++) {
              for (j = 0; j < sz; j++) {
                  if (hw_valid) {
                      sw_array->aarray[j] = hw_array->aarray[j];
                  } else {
                      hw_array->aarray[j] = sw_array->aarray[j];
                  }
              }
              ...

      [Figure: hw_valid selects the copy direction between hw_array and sw_array.]

  21. Hardware-Software Backup Threading

              // Backup threading technique
              pthread_create(&thread_hw, NULL, &foo_hw, NULL);
              pthread_create(&thread_sw, NULL, &foo_sw, NULL);
              pthread_testcancel();
              pthread_join(thread_hw, NULL);
              pthread_join(thread_sw, NULL);
              printf("compute_block done! \n");
              if (rst_count > 2) {
                  system("snap Reset");
                  rst_count = 0;
              }
          }
          printf("job done! \n");
          return(0);
      }
      ...

      [Figure: thread_hw runs foo_hw and thread_sw runs foo_sw. Labels: hw_valid = 1, rst_count++, hw_valid = 0.]

  22. "foo_sw" Software Thread

      // foo_sw : software version of function foo
      // (a pthread start routine takes a void * argument; it is unused here)
      void *foo_sw(void *arg){
          pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, NULL);
          printf("I am thread foo_sw \n");
          for (j = 0; j < sz; j++) {
              sw_array->aarray[j] = 1 + sw_array->aarray[j];
          }
          status = pthread_cancel(thread_hw);
          pthread_testcancel();
          printf("canceling thread_hw with status = %i\n", status);
          pthread_exit(NULL);
          return NULL;
      }

  23. "foo_hw" Hardware Thread

      // foo_hw : hardware version of function foo
      // (a pthread start routine takes a void * argument; it is unused here)
      void *foo_hw(void *arg){
          pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, NULL);
          printf("I am thread foo_hw \n");
          rst_count++;
          foo_hw_map(hw_array->aarray, hw_array->mapno);
          rst_count--;
          hw_valid = 1;
          status = pthread_cancel(thread_sw);
          pthread_testcancel();
          printf("canceling thread_sw with status = %i\n", status);
          pthread_exit(NULL);
          return NULL;
      }

  24. SystemC – Calling MAP C from C++
      • Offline: development is seamless and based on code transformation that can be copy/pasted to a MAP C design
      • Online: Online Interface (OIF). The MAP hardware is treated as an object. Computation is performed at the high level, e.g. main ( )
      [Figure: labels include main ( ), stimulus, FIR, display, and foo_hw, with the signals reset, CLK, input_valid, sample, output_data_ready, and result.]

  25. Sanity Checking

      // Read back (only if the API supports it)
      p_bitstream_new = JTAG_bitstream_read_back();
      error = compare(p_bitstream_new, p_bitstream_golden);

      // Sanity checking with hw module database
      foo_hw_1(argument_1, result_1);
      ...
      foo_hw_N(argument_N, result_N);
      for (i = 0; i < modules; i++) {
          // result[i] and golden[i] stand for result_1..result_N and their golden copies
          error[i] = compare(result[i], golden[i]);
      }

      [Figure: the outputs of foo_1 ( ) to foo_N ( ) are compared against golden values (software-computed or stored) to produce error [ ].]
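The per-module golden-copy comparison can be sketched as a loop over result and golden arrays. This is a minimal sketch, assuming each module produces one `long long` result; `compare_buf` and `sanity_check` are hypothetical names, and the bitstream read-back part is not modeled because it depends on a vendor API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte-wise comparison of an observed buffer against its golden copy.
 * Returns 0 when they match and 1 when they differ. */
int compare_buf(const void *observed, const void *golden, size_t len) {
    assert(observed != NULL && golden != NULL);
    return memcmp(observed, golden, len) != 0;
}

/* Checks every module's result against its golden value.
 * Fills error[i] with 0 or 1 and returns the number of failing modules. */
int sanity_check(const long long *result, const long long *golden,
                 int modules, int *error) {
    int failures = 0;
    for (int i = 0; i < modules; i++) {
        error[i] = compare_buf(&result[i], &golden[i], sizeof result[i]);
        failures += error[i];
    }
    return failures;
}
```

Keeping a per-module `error[]` array, rather than a single flag, identifies which hardware module disagreed with its golden copy, which is what a remapping decision would need.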

  26. Advantages
      Hardening:
      • Dynamic levels of radiation hardening or customization
      • The system description is fully synthesizable in both SW (compiled -> processor) and HW (forged -> C-to-FPGA compilation)
      Fault injection:
      • Fault injection can be specified at a high level (ANSI C or Fortran) and can be interfaced with scripts for verification and test
      Simulation and emulation capabilities:
      • At-speed tolerance check, debugging, cycle-accurate simulation, hardware emulation

  27. Disadvantages
      Too high level:
      • Optimization is initially attempted only through the hardware compiler
      • Further optimization is achieved by a skilled or experienced programmer
      • Fine tuning is possible at the expense of time, yet this obstacle is being overcome by more advanced hardware compiler technology and published programmer techniques

  28. Lessons Learned
      • Tested high-level advanced fault-tolerance techniques
      • Develop high-performance embedded computing techniques that are power-aware and versatile enough to counteract different radiation scenarios
      • High-performance supercomputing methodologies need terrestrial-based radiation hardening because of amplifying effects in supercomputers comprising a large number of processing elements
