1. Single Event Upset (SEU) Mitigating Techniques in a Space Radiation Environment for the FPGA based Iterative Repair Processor Group Presentation (11/30/2007)
Jeffrey M. Carver
2. Outline Introduction
Background
Fault Tolerant Techniques
Configuration Frames
DMRH and Fan-out design
Iterative Repair Processor Fault Protected
SEU Simulator
Current Results
Conclusions and Program of Study
Publications
4. Space Applications FPGAs are being used in space applications because of:
Low cost over ASICs
Reconfigurable ability
Can be optimized for a specific application
Problems that occur in space
Single Event Upsets (SEUs) occur when a memory cell changes values because of the radiation in the environment.
Radiation also affects combinational logic by causing temporary glitches, which have been measured lasting from 0.3 ns to 1.3 ns.
For FPGAs this means that fault tolerant techniques need to be applied to protect the storage memory, configuration memory, and combinational logic on an FPGA.
5. Research Goal To find and apply fault tolerant techniques for a system designed for space applications (Iterative Repair Processor).
Once the fault tolerant techniques to apply have been identified, an SEU Simulator for testing the robustness of each technique will be developed and used. The techniques will then be applied and tested.
6. Outline Introduction
Background
Fault Tolerant Techniques
Configuration Frames
DMRH and Fan-out design
Iterative Repair Processor Fault Protected
SEU Simulator
Current Results
Conclusions and Program of Study
Publications
7. Triple Modular Redundancy (TMR) Triplication of the module with a voting circuit that votes on the correct output of the device.
Variants of this concept are used:
Analog component used as the voting circuit
Using 2-3 voting circuits with tri-state buffers
TMR in time
(Picture in the upper-right) An example from FPGA Editor showing a tri-state buffer being used on an output port. It implements the voting circuit with tri-state buffer variant.
(Picture in the bottom-right) An example using 3 voting circuits with tri-state buffers.
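The voting idea can be illustrated with a small software model. This is only a sketch of a bitwise 2-of-3 majority vote, not the tri-state voter circuit from the pictures; the names `majority_vote` and `flip_bit` are made up for the illustration.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority of the three module outputs."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(word: int, bit: int) -> int:
    """Model a single event upset by inverting one bit of a word."""
    return word ^ (1 << bit)


golden = 0b10110010

# One replica is upset: the other two outvote it.
voted = majority_vote(golden, flip_bit(golden, 5), golden)
print("single upset masked:", voted == golden)

# The same bit upset in two replicas defeats the voter.
voted = majority_vote(flip_bit(golden, 0), flip_bit(golden, 0), golden)
print("double upset masked:", voted == golden)
```

The second case shows why TMR only guarantees masking of a single upset among the three copies.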
8. Hamming Codes A Hamming code inserts check bits throughout the word.
An Improved Hamming Code can require an extra check bit, but it appends the check bits onto the end of the word.
Both can correct a single error in a word.
Hamming Relationship = # check bits required
Hamming Codes can also be implemented so that they can Double Error Detect (DED). This means a two-bit Multiple Event Upset (MEU) in the word can be detected, but not corrected.
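A minimal sketch of the classic scheme with check bits inserted throughout the word (at positions 1, 2, 4, ...). It assumes that the relationship the slide refers to is the standard single-error-correcting bound 2^r >= m + r + 1 for m data bits and r check bits; the function names are illustrative and this is not the encoder used in the IR Processor.

```python
def check_bits_required(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single error correction)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def encode(data: list[int]) -> list[int]:
    """Place data bits at non-power-of-two positions and fill in check bits."""
    r = check_bits_required(len(data))
    n = len(data) + r
    code = [0] * (n + 1)          # index 0 unused, positions are 1-based
    bits = iter(data)
    for pos in range(1, n + 1):
        if pos & (pos - 1):       # not a power of two: data position
            code[pos] = next(bits)
    for i in range(r):
        p = 1 << i                # check bit p covers positions with bit p set
        parity = 0
        for pos in range(1, n + 1):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity
    return code[1:]


def correct(code: list[int]) -> tuple[list[int], int]:
    """Return (corrected codeword, 1-based error position or 0 if clean)."""
    syndrome = 0
    for pos, bit in enumerate(code, start=1):
        if bit:
            syndrome ^= pos
    fixed = list(code)
    if 0 < syndrome <= len(code):
        fixed[syndrome - 1] ^= 1
    return fixed, syndrome


word = encode([1, 0, 1, 1])
upset = list(word)
upset[4] ^= 1                     # simulate an SEU in one stored bit
fixed, position = correct(upset)
print(position, fixed == word)
```

This version corrects one flipped bit per word. Double error detection needs one more overall parity bit, which is not modeled here.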
9. TMR vs. Hamming TMR
Requires at least a 200 percent increase in space.
It is good for small memory and state machines.
Hamming Codes
Good for large memories.
Requires check bits, Hamming Encoder, and Hamming Decoder.
Seen to increase timing delay over TMR.
Based on the paper that this information came from, we decided to use TMR on small memory elements and Hamming Codes on the larger memory elements.
This is because Hamming Codes require resources to implement the Hamming Encoder and Decoder, which can be a large overhead on small memory elements.
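A rough way to see the storage side of this trade-off is to compare extra bits per word. The sketch below counts storage bits only, assuming the standard single-error-correcting bound 2^r >= m + r + 1; it deliberately ignores the encoder, decoder, and voter logic, so it is not a full area estimate.

```python
def hamming_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def hamming_overhead_percent(data_bits: int) -> float:
    """Extra storage for check bits, as a percentage of the data width."""
    return 100.0 * hamming_check_bits(data_bits) / data_bits


TMR_OVERHEAD_PERCENT = 200.0  # two extra copies of every bit

for width in (4, 8, 32, 64):
    print(f"{width:2d}-bit word: Hamming {hamming_overhead_percent(width):6.2f}% "
          f"vs TMR {TMR_OVERHEAD_PERCENT:.0f}%")
```

The storage overhead of the check bits shrinks as the word grows, while the fixed cost of the encoder and decoder is what makes Hamming Codes unattractive for small memory elements.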
10. DWC-CED Duplication with Comparison combined with Concurrent Error Detection (DWC-CED)
Two modules perform the same operation and their output is compared. (savings of area)
If the outputs do not match then it takes one more clock cycle to run the concurrent error detection method that finds which module is correct.
The problem is finding a test that detects all possible errors that can occur in a module.
We did not use this method because the paper shows that it can be difficult to find a CED technique that catches all possible errors. For a few modules the authors achieved 100% error coverage, but for others they did not show a technique with 100% coverage.
11. Other Techniques Other techniques for SEUs and even Multiple Event Upsets (MEUs) in memory.
Cross Parity
Reed-Muller
Reed-Solomon
Reed-Solomon with Hamming Codes
The problem is the resource requirement of these techniques.
These methods are more complex than the more popular ones discussed, such as TMR and Hamming Codes. Because they are more complex, they require more resources to implement the Encoder/Decoder. That is why we did not use these methods.
12. Outline Introduction
Background
Fault Tolerant Techniques
Configuration Frames
DMRH and Fan-out design
Iterative Repair Processor Fault Protected
SEU Simulator
Current Results
Conclusions and Program of Study
Publications
13. Configuration Frames 1 bit wide
Span an HCLK Row
16 CLBs in Height
Size is 41 32-bit words
Block Types
CLBs/CLKs/DSPs/IOBs
BRAM Interconnect
BRAM Contents
Multiple minor frames per major column
A Frame Address on the Virtex-4 is composed of the following fields:
Block Type
Top/Bottom Bit
HCLK row
Major Column
Minor Column
To understand how the frames are laid out on the FPGA, we briefly discuss how a frame address is composed.
To access the configuration frame in the upper-left part of the FPGA, you would give the following values for the fields of the frame address:
BlockType=CLBs/CLKs/DSPs/IOBs
top/bottom bit=0
HCLKrow=2
MajorFrame=0
MinorFrame=0
14. Major Frames Numbering starts from 0 on the left and increases to the right
SX35 Example
CLBs/CLKs/DSPs/IOBs
CLBs: 1-6, 8-15, 17-22, 25-30, 32-39, 41-46
CLKs: 24
DSP: 7, 16, 31, 40
IOBs: 0, 23, 47
BRAM Interconnect: 0-7
BRAM Content: 0-7
15. Minor Frames per Major Frame There are multiple minor frames per major frame. The number of minor frames depends on the type of major frame being written to.
Information for total minor frames per column type is from file xhwicap_i.h.
CLBs – 22 total minor frames
DSPs – 21 total minor frames
IOBs – 30 total minor frames
CLKs – 3 total minor frames
BRAM Interconnect – 20 total minor frames
BRAM Content – 64 total minor frames
Numbering is from 0 to totalMinorFrames-1
To read every minor frame for the IOB column in the upper-left part of the FPGA, you would read with the following fields:
BlockType=CLBs/CLKs/DSPs/IOBs
top/bottom bit=0
HCLKrow=2
MajorFrame=0
MinorFrame=0-29 (change the minor frame number to get the next frame, and continue until every possible minor frame has been read)
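The addressing scheme above can be sketched as a small data structure. The minor-frame counts are the ones listed on this slide; the class and function names are hypothetical and are not taken from xhwicap_i.h, and no register bit layout is implied.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterator


class BlockType(Enum):
    LOGIC = "CLBs/CLKs/DSPs/IOBs"
    BRAM_INTERCONNECT = "BRAM Interconnect"
    BRAM_CONTENT = "BRAM Content"


# Total minor frames per major column type, as listed on the slide.
MINOR_FRAMES = {
    "CLB": 22,
    "DSP": 21,
    "IOB": 30,
    "CLK": 3,
    "BRAM_INTERCONNECT": 20,
    "BRAM_CONTENT": 64,
}


@dataclass(frozen=True)
class FrameAddress:
    block_type: BlockType
    top_bottom: int
    hclk_row: int
    major: int
    minor: int


def minor_frame_addresses(block_type: BlockType, top_bottom: int, hclk_row: int,
                          major: int, column_kind: str) -> Iterator[FrameAddress]:
    """Yield one address per minor frame, numbered 0 to totalMinorFrames-1."""
    for minor in range(MINOR_FRAMES[column_kind]):
        yield FrameAddress(block_type, top_bottom, hclk_row, major, minor)


# Every minor frame of the IOB column in the upper-left part of the FPGA.
iob_frames = list(minor_frame_addresses(BlockType.LOGIC, 0, 2, 0, "IOB"))
print(len(iob_frames), iob_frames[0], iob_frames[-1])
```

A real readback loop would pass each address to the configuration interface in turn. That call is left out because its exact form depends on the driver.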
16. Frame Layout Size is 41 32-bit words (1312 bits total)
Frames in the bottom half are mirror images of frames in the top half, with the exception of the vertical HCLK rows that contain the global and regional clocks. (ug071.pdf – Xilinx)
Top Half: 1311 to 0
(word 40 to word 0)
Bottom Half: 0 to 1311
(word 0 to 40)
The whole point of understanding the frames is that we can lay out the circuit to be tested in frames that are not shared with the rest of the circuits on the FPGA. That way we corrupt only the configuration frames of the circuit in which we want to simulate SEUs. Without understanding the frames we could not simulate an SEU in the configuration frames, because we would not know where the SEU was being injected. This information is also useful for avoiding corruption of the configuration frames that belong to the simulator circuit.
17. Fault Correction Techniques Techniques for repairing faults in the configuration frames of the FPGA
Scrubbing – Just reload the configuration data from a device like an SEU-immune EEPROM.
Error Checking and Correcting (ECC) frames
Embed Hamming Codes inside the configuration frame
Available in the Virtex-4 devices
In order for these to be used, a design must not use resources that use the configuration frames for memory (e.g., shift registers).
Some circuit must read out the configuration frames and write back the corrected frame when an error is detected. Xilinx provides a circuit that does this automatically for a design, discussed in xapp714.pdf.
Shift registers use part of the configuration frame for memory. This means the configuration frame is constantly changing, making it difficult to detect an SEU just by reading the configuration frame.
18. Outline Introduction
Background
Fault Tolerant Techniques
Configuration Frames
DMRH and Fan-out design
Iterative Repair Processor Fault Protected
SEU Simulator
Current Results
Conclusions and Program of Study
Publications
19. DMRH Double Modular Redundancy with Hold
On a disagreement, a signal is sent to the ICAP Controller, which scans and fixes up errors in the areas of the modules.
The disagreement signal is also sent to the controller to pause at the current iteration.
A transient error will disappear in 1 clock cycle.
Best for combinational logic and parallel designs
The problem is the time delay to fix up the frame(s)
This method saves on space requirements compared to TMR. The problem with this method is the time required to fix up the frames after an error is detected. The advantage of this method over DWC-CED is that it can detect and fix up 100% of the errors in the configuration frames. It just takes much longer to do so than DWC-CED.
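The hold behavior can be sketched as a behavioral model. This is an illustration under assumptions, not the actual controller: the model waits one cycle to see whether a disagreement was a transient before charging the repair time, and `repair_cycles=3000` borrows the single-frame read/write time measured later in this presentation. `FaultyModule` and `dmrh_evaluate` are hypothetical names.

```python
from typing import Callable


class FaultyModule:
    """One copy of a combinational module whose output can be corrupted."""

    def __init__(self, func: Callable[[int], int]):
        self.func = func
        self.transient_cycles = 0   # glitch that clears by itself
        self.config_fault = False   # persists until the frame is repaired

    def output(self, x: int) -> int:
        y = self.func(x)
        if self.transient_cycles > 0:
            self.transient_cycles -= 1
            return y ^ 1
        return y ^ 1 if self.config_fault else y

    def repair(self) -> None:
        """Stands in for the ICAP controller rewriting the module's frames."""
        self.config_fault = False


def dmrh_evaluate(copy_a: FaultyModule, copy_b: FaultyModule, x: int,
                  repair_cycles: int = 3000) -> tuple[int, int]:
    """Return (agreed output, number of cycles the iteration was held)."""
    held = 0
    repaired = False
    while True:
        ya, yb = copy_a.output(x), copy_b.output(x)
        if ya == yb:
            return ya, held
        held += 1                           # hold the current iteration
        if held >= 2 and not repaired:      # still disagreeing: not a transient
            copy_a.repair()
            copy_b.repair()
            held += repair_cycles
            repaired = True
        elif repaired:
            raise RuntimeError("copies still disagree after repair")


a = FaultyModule(lambda v: v + 1)
b = FaultyModule(lambda v: v + 1)
a.transient_cycles = 1
print(dmrh_evaluate(a, b, 4))   # transient: short hold
b.config_fault = True
print(dmrh_evaluate(a, b, 4))   # configuration upset: hold includes repair time
```

Two copies can only show that something is wrong, not which copy is wrong, which is why both areas are repaired and why an identical fault in both copies would go unnoticed in this model.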
20. Fan-out design Used in some of the multiplexers in the design.
Can tolerate an SEU in the LUTs or in 1 of the lines after it is fanned out to the slices.
The words being selected are Hamming Code protected.
Reduces the need for redundancy
The problem is an upset that occurs before the line is fanned out to the different slices.
We are not protecting the muxes used in the design, in order to see how much of a factor routing is in designs. Since the words being selected in the mux are Hamming code protected, a corruption of 1 bit in the word can be tolerated. So if a line is corrupted after the select line is fanned out to the muxes, it should be okay. The problem should occur if the line is corrupted before it is fanned out to the slices.
The picture below shows a select line that is mapped to 2 different slices.
21. Outline Introduction
Background
Fault Tolerant Techniques
Configuration Frames
DMRH and Fan-out design
Iterative Repair Processor Fault Protected
SEU Simulator
Current Results
Conclusions and Program of Study
Publications
22. Iterative Repair (IR) Processor Design The far-left picture shows an overview of the IR Processor with the proposed techniques to be applied to it.
The upper-right picture shows the Simulator circuit and how it can interact with the IR Processor. Notice that the MicroBlaze can communicate with the IR Processor over the OPB bus. The Error Detector detects whether the IR Processor changed its behavior from the last run. The Memory of Best Scores holds the data from a run with no faults injected. The Continue controller is used to stop the IR Processor at different iterations.
23. Copy Processor Notice that DMRH was applied to the combinational circuitry while TMR was applied to the control circuit.
24. Alter Processor Same as before: notice that DMRH was applied to the combinational circuitry while TMR was applied to the control circuit and to the small memory element. The Random Number Generator shows no fault protection because a slight alteration of the random number generator does not matter. The problem would be an upset that stopped the random number generator from generating new numbers.
25. Evaluate Processor Is composed of three sub-processors
Dependency Graph Violation
Total Schedule Length
Resource Over-utilization
This shows an overview of the Evaluate processor, which is really composed of three sub-processors.
26. Dependency Graph Violation Sub-Processor Notice DMRH on combinational elements and TMR on the small memory elements. TMR is also applied on the control circuit.
27. Total Schedule Length Sub-Processor Notice DMRH on combinational elements and TMR on the small memory elements. TMR is also applied on the control circuit.
28. Resource Over-utilization Sub-Processor Originally we thought that we could detect and fix up a frame in a short period of time. Recent testing using the HWICAP on the OPB bus showed that it takes 18 us (1800 clock cycles) to write a configuration frame and 30 us (3000 clock cycles) to read and write a configuration frame.
The max latency of the IR processor for an iteration is 235 clock cycles. We originally thought that, since all other processors complete before this one, there would be time to fix up errors in the other stages before this processor finished. So we planned to apply TMR to this entire processor so that it would not hold up the iteration completion, while the other stages could use DMRH because they had time to spare to wait for the fix-up. With these new measurements, the other stages will also hold up the iteration completion, so DMRH poses more overhead than was originally thought. Since there will be holds in the other stages anyway, we might just apply the same techniques to this processor as to the other stages.
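To put the measurements in perspective, the arithmetic below expresses the fix-up time in units of IR iterations. The 100 MHz clock is inferred from "18 us = 1800 clock cycles" and is not stated directly; the figures are for a single frame.

```python
CLOCK_MHZ = 100          # inferred: 1800 cycles / 18 us
ITERATION_CYCLES = 235   # max latency of one IR iteration


def cycles(microseconds: int) -> int:
    """Convert a duration in microseconds to clock cycles."""
    return microseconds * CLOCK_MHZ


write_cycles = cycles(18)        # write one configuration frame
read_write_cycles = cycles(30)   # read and write one configuration frame

print(f"write:      {write_cycles} cycles = "
      f"{write_cycles / ITERATION_CYCLES:.1f} iterations")
print(f"read/write: {read_write_cycles} cycles = "
      f"{read_write_cycles / ITERATION_CYCLES:.1f} iterations")
```

Even a single-frame fix-up costs several full iterations, which is why the hold cannot be hidden behind the slack of the faster stages.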
29. Accept Processor Notice DMRH on combinational elements and TMR on the small memory elements. TMR is also applied on the control circuit.
30. Adjust Temperature Processor Notice DMRH on combinational elements and TMR on the small memory elements. TMR is also applied on the control circuit.
31. Outline Introduction
Background
Fault Tolerant Techniques
Configuration Frames
DMRH and Fan-out design
Iterative Repair Processor Fault Protected
SEU Simulator
Current Results
Conclusions and Program of Study
Publications
32. BYU SEU Simulator Requires 3 Virtex 1000 FPGAs
Does not directly corrupt flip-flops
Corrupts bits in bitstream
Advantage is that the design to test is on a separate FPGA board.
33. Xilinx SEU Simulator (xapp714) Requires 1 Virtex-4 FPGA
Does not directly corrupt flip-flops
Cannot see which frame address and configuration bit are being corrupted. (It is stated to start from the first bit in configuration memory.)
Clunky interface to use for simulating SEUs
Uses embedded ECC frames
Corrupts every configuration frame on the board. Unknown how/if it actually corrupts BRAM Interconnect and Content frames.
The xapp714 device was designed for autonomous detection and correction of SEUs in the configuration frames. Error injection was an added feature that was not a priority in development, and thus it lacks features we wanted for simulating SEUs.
34. USU SEU Simulator (Tool Flow) There are a few steps the user has to do before the simulator will work correctly. The user has to specify which output to observe for changes. The user has to provide some way in the design to pause it. Paused means that the clock is running but the FFs are not changing. Without the pause feature, the only tests that can be run on the flip-flops are the Stuck-At tests.
The user has to specify in the code which frames to corrupt, but this should only take a minute. The problem is that this relies on the user understanding how frames are laid out on the FPGA. Without specifying which frames to corrupt, the only test that can be run is the one that corrupts the specific elements on the FPGA (LUTs, SRINV mux, FFs).
Currently the user specifies how long, in clock cycles, the design takes to run. This could be automated in the future. The run length is specified so that a timeout can be implemented in case the design never finishes.
The user has to specify which major frames correspond to DSPs, IOBs, and GCLKs for the board that the tests are running on. This has to be done in the simulator and in the output imaging code.
35. USU SEU Simulator Uses 1 FPGA (tester circuit and design to test on the same FPGA)
Corrupts all bits in the configuration frames in the design-to-test area.
Tests corrupting FFs
3 Techniques
GCAPTURE/GRESTORE
Intermediate Corruption
Stuck-At Tests
The Design to Test does not share configuration frames with the simulator circuit. This is so that we can simulate corrupting frames only in the Design to Test area and not in the Simulator Circuit.
We go sequentially through each bit in the configuration frame and change it to the opposite value. If a change in behavior is observed in the IR Processor, we mark this configuration bit as sensitive.
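The sequential bit sweep can be sketched as follows. This is a software stand-in, not the simulator code: `run_design` is a hypothetical callback that represents loading a frame and observing the design's output, and the bit ordering inside a frame (word index times 32 plus bit offset) is an assumption made for the illustration.

```python
from typing import Callable, List

WORDS_PER_FRAME = 41
BITS_PER_WORD = 32


def find_sensitive_bits(frame: List[int],
                        run_design: Callable[[List[int]], int]) -> List[int]:
    """Flip each frame bit in turn and record the ones that change the output."""
    golden = run_design(frame)              # run with no faults injected
    sensitive = []
    for bit in range(WORDS_PER_FRAME * BITS_PER_WORD):
        word, offset = divmod(bit, BITS_PER_WORD)
        corrupted = list(frame)             # original frame stays untouched
        corrupted[word] ^= 1 << offset      # change the bit to the opposite value
        if run_design(corrupted) != golden:
            sensitive.append(bit)
    return sensitive


def toy_design(frame: List[int]) -> int:
    """Toy design whose behavior depends only on the low 4 bits of word 0."""
    return frame[0] & 0xF


print(find_sensitive_bits([0] * WORDS_PER_FRAME, toy_design))
```

Working on a copy of the frame plays the role of restoring the original frame before the next bit is tested, so each run contains exactly one injected upset.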
36. Flip-Flop Architecture FFs share all lines except the D (Data) input and the XQ/YQ output
The SRINV mux controls the reset line given to the FFs
The SRMODE configuration bit determines what value the FF is set to on reset.
The INIT bit is the value of the FF when the bitstream is first loaded onto the FPGA
If radiation causes the SRINV mux to select the other input, a reset is sent to both flip-flops (assuming they were not currently being reset). This poses a problem because it can upset both flip-flops, resulting in a Multiple Event Upset in the overall design.
37. GCAPTURE/GRESTORE Method GCAPTURE – loads the INIT bits of all FFs and Input/Output Buffer (IOB) registers with the current value of the register
GRESTORE – sets all registers to their INIT bit values.
Put device into a paused state (where FFs are not changing, SR input to FFs low, and clock signal still active).
Then do a GCAPTURE, change the INIT bit in the desired FF, and follow with a GRESTORE.
The GCAPTURE command can be issued by writing a sequence of instructions through the ICAP port, or by instantiating the CAPTURE primitive in a design and driving its input high.
GCAPTURE is how we get access to the current value in the FF. This enables us to simulate an upset in the flip-flop.
The problem with using a GRESTORE in our simulator is that it also restores the FFs in the simulator circuit. This means the simulator circuit is restored to a state that was not intended.
38. Intermediate Corruption Method Put the device into a paused state.
Issue a GCAPTURE command
Based on the INIT bits, set the SRMODE of the 2 FFs in the slice.
Set the FF to be changed to reset to the opposite of its current value.
Set the other FF to reset to its current value
Change the SRINV multiplexer to select the other value. (This causes a reset of the FFs)
Fix up the SRINV multiplexer and SRMODE bits.
The device can then be resumed.
This method works by changing the value in the FF through resetting the FF.
We first get the current value of the FF. We change the SRMODE configuration bits so that the FF resets to the desired value when a reset occurs. We cause the reset by changing the SRINV mux to select the other input.
Before resuming the device we undo our changes so that we simulate an SEU only on that iteration. If we did not undo the changes, the method would keep simulating an SEU in the FFs.
The problem with this method is getting the device into a paused state on an arbitrary clock cycle.
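The steps above can be sketched with a toy model of one slice. This is a behavioral illustration only: it assumes the reset takes effect on a clock edge, treats reading `ff` as the result of a GCAPTURE plus readback, and uses made-up names (`Slice`, `inject_ff_upset`) rather than any real configuration API.

```python
from dataclasses import dataclass


@dataclass
class Slice:
    ff: list[int]       # current values of the two flip-flops
    srmode: list[int]   # value each FF takes when reset is asserted
    srinv: int = 0      # 0: SR line passed through, 1: SR line inverted
    sr: int = 0         # SR input, held low while the device is paused

    def reset_active(self) -> bool:
        return bool(self.sr ^ self.srinv)

    def clock(self) -> None:
        """One clock edge while paused: only the reset path can change the FFs."""
        if self.reset_active():
            self.ff = list(self.srmode)


def inject_ff_upset(s: Slice, target: int) -> None:
    """Flip one FF of a paused slice via its reset path, then undo the setup."""
    captured = list(s.ff)                      # stands in for GCAPTURE + readback
    saved_srmode, saved_srinv = list(s.srmode), s.srinv
    other = 1 - target
    s.srmode[target] = captured[target] ^ 1    # target resets to the opposite value
    s.srmode[other] = captured[other]          # other FF resets to its own value
    s.srinv ^= 1                               # select the other input: reset fires
    s.clock()
    s.srinv = saved_srinv                      # fix up SRINV and SRMODE
    s.srmode = saved_srmode


s = Slice(ff=[1, 0], srmode=[0, 0])
inject_ff_upset(s, target=0)
print(s.ff, s.srmode, s.srinv)
```

Restoring SRINV and SRMODE before resuming is what limits the injected upset to a single iteration. Leaving them changed would amount to the Stuck-At behavior instead.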
39. Stuck-At Method Device can be in a paused state.
In this method FFs are configured to be stuck at a desired value during operation of device.
Configure SRMODE bits to the desired value to be stuck at. Possible combos {00, 01, 10, 11}
Change SRINV mux to select opposite line.
After the device run, undo the changes that were made.
Best if device never resets FFs during operation.
Helps reveal SEU sensitivity of specific FFs on any clock cycles.
40. Outline Introduction
Background
Fault Tolerant Techniques
Configuration Frames
DMRH and Fan-out design
Iterative Repair Processor Fault Protected
SEU Simulator
Current Results
Conclusions and Program of Study
Publications
41. Design Mapped from PlanAhead This shows part of the IR Processor that was mapped to the Virtex-4 board. The image in the bottom-right helps describe which resources PlanAhead shows as mapped. Note that if a routing line is mapped through a LUT, PlanAhead does not always show the LUT as being mapped.
42. Bit Markup of Sensitive Resources This is the output image format that BYU used. It shows the sensitive areas of the design based on position in the configuration frame. It gives a general idea of the sensitive areas of the FPGA, but not exact information.
43. Map of Sensitive Resources This shows the results of the flip-flop tests that we ran. It also shows tests of some specific resources in the Slice, such as LUTs, the SRINV mux, and configuration bits for some of the resources. This gives the user specific information about which resources are sensitive to an SEU. Knowing which FFs are sensitive to an SEU is important because FFs are used in state machines. This output gives the exact Slices that are known to be sensitive to an SEU, instead of the approximation from the bit markup technique. The problem is that we have to rely on the developers to tell us which bits correspond to which resources. That is why we could show only those elements in the Slice, and not every element. So the Bit Markup method is good for giving a general idea across every possible configuration bit, while this display format is good for giving specific details on what is sensitive, including the FFs in the design.
44. CLBs Tested The 2 SEUs come from taking 42 * 4.9% = 2.058, which rounds to 2.
45. DSPs, BRAMs Tested See the included images for the markup showing the DSPs, BRAM interconnect, and BRAM content. Note that there are sensitive bits even in areas where no DSPs are mapped in the design. This is because routing still passes through the DSPs, and that routing is sensitive to an SEU. Another interesting observation is that we were able to wipe out a ROM in the design by changing one configuration bit in the BRAM interconnect. This means that an SEU in a configuration bit can have side effects on other resources in the design.
We are not sure why we can't write to these bits, but we identified them through testing. So if you ever want to do partial reconfiguration on the BRAM content, do not write a '1' to these locations.
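One way to follow that advice is to force the unwritable positions to 0 before a BRAM content frame is written back. The sketch below is only an illustration: the actual bit locations are the ones shown on the slide image and are not reproduced here, so `reserved_positions` and the frame contents are hypothetical placeholders.

```python
def mask_reserved_bits(frame_bits, reserved_positions):
    """Return a copy of frame_bits with every reserved position forced to 0."""
    return [0 if i in reserved_positions else bit
            for i, bit in enumerate(frame_bits)]

# Hypothetical positions; substitute the locations found through testing.
reserved_positions = {2, 5}
frame_bits = [1, 1, 1, 0, 1, 1]

safe_bits = mask_reserved_bits(frame_bits, reserved_positions)
print(safe_bits)
```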
46. Outline Introduction
Background
Fault Tolerant Techniques
Configuration Frames
DMRH and Fan-out design
Iterative Repair Processor Fault Protected
SEU Simulator
Current Results
Conclusions and Program of Study
Publications
47. Conclusions Simulator Tool status
Simulates SEUs in CLBs, FFs, DSPs, BRAM interconnects, and BRAM content.
Needs to have a method to reload entire device when a permanent change in pattern is detected.
Need to test full TMR design
Need to test proposed fault tolerant design
Have fault techniques automatically applied when IR Processor is being generated
Thesis defense in August? Ram is going to help me put it in a partial reconfiguration region. That way, when a side effect is detected, as with the BRAM content, we will just reload the design. We did not reload the entire design every time because that would make the tests take too long to complete; currently it takes around a day to run all the tests. So we want to reload the entire bitstream for the circuit under test only when necessary, and normally repair only the configuration frame we changed, to keep the simulator running fast.
From Jonathan's work, we will take the automatically generated IR Processor and have it generate the circuit with the proposed fault techniques applied. The techniques it will apply are TMR, DMRH, and Hamming codes. We will analyze the graphs to see which elements are memory and which are combinational logic, so we know which technique to apply.
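The repair policy described in these notes (fix only the frame that was changed, and reload the full bitstream only when the error persists) can be sketched as a small Python loop. This is a toy model under stated assumptions, not the simulator itself: `ToyDevice`, its methods, and the frame, BRAM, and sensitivity data are all hypothetical stand-ins for the real configuration interface.

```python
import copy

class ToyDevice:
    """Toy stand-in for the circuit under test (hypothetical model)."""

    def __init__(self, golden_frames, golden_bram, sensitive, side_effect):
        self.golden_frames = golden_frames
        self.golden_bram = golden_bram
        self.sensitive = sensitive      # (frame, bit) flips that break the output
        self.side_effect = side_effect  # (frame, bit) flips that also wipe BRAM content
        self.reload()
        self.full_reloads = 0           # do not count the initial load

    def reload(self):
        """Slow path: rewrite the whole bitstream, including BRAM content."""
        self.frames = copy.deepcopy(self.golden_frames)
        self.bram = list(self.golden_bram)
        self.full_reloads = getattr(self, "full_reloads", 0) + 1

    def flip_bit(self, frame, bit):
        self.frames[frame][bit] ^= 1
        if (frame, bit) in self.side_effect:
            self.bram = [0] * len(self.bram)

    def repair_frame(self, frame):
        """Fast path: rewrite only the frame that was changed."""
        self.frames[frame] = list(self.golden_frames[frame])

    def output_ok(self):
        if self.bram != self.golden_bram:
            return False
        return not any(
            value != self.golden_frames[f][b] and (f, b) in self.sensitive
            for f, row in enumerate(self.frames)
            for b, value in enumerate(row)
        )

def run_campaign(dev):
    """Flip every configuration bit once and record the ones that cause errors."""
    found = []
    for f in range(len(dev.golden_frames)):
        for b in range(len(dev.golden_frames[f])):
            dev.flip_bit(f, b)
            if not dev.output_ok():
                found.append((f, b))
            dev.repair_frame(f)          # normal case: fix one frame only
            if not dev.output_ok():      # error persists: a side effect occurred
                dev.reload()             # fall back to a full reload
    return found

golden = [[0, 1, 0, 1], [1, 1, 0, 0]]
dev = ToyDevice(golden, [7, 7, 7], sensitive={(0, 2)}, side_effect={(1, 0)})
found = run_campaign(dev)
print(found, dev.full_reloads)
```

In this toy run only the one flip with a BRAM side effect triggers a full reload; every other bit is handled by the single-frame repair, which is the point of the policy.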
48. Outline Introduction
Background
Fault Tolerant Techniques
Configuration Frames
DMRH and Fan-out design
Iterative Repair Processor Fault Protected
SEU Simulator
Current Results
Conclusions and Program of Study
Publications
49. Publications Journal Articles under review
IET Transactions on Computers and Digital Techniques
Phillips, J., Sudarsanam, A., Kallam, R., Carver, J., and Dasu, A., “Methodology to Derive Polymorphic Soft-IP Cores for FPGAs”
50. Publications Conference Papers under review
DAC 2008
Carver, J., Phillips, J., and Dasu, A., “Improved SEU Simulator for Virtex 4 FPGAs”
51. Publications Planned Journal Papers
IEEE Design & Test of Computers or IEEE Transactions on Reliability
Carver, J., Phillips, J., and Dasu, A., “SEU Mitigating Techniques for a FPGA based Iterative Repair Processor”