Scientific Computing in Space Using COTS Processors
Roger Sowada, Honeywell DSES, roger.j.sowada@honeywell.com
Jeremy Ramos, Honeywell DSES, jeremy.ramos@honeywell.com
David Lupia, Honeywell DSES, david.lupia@honeywell.com
Agenda
• Introduction
• Background
• Detailed Description
• Implementation Approach
• Development Efforts
Acknowledgements
• University of Florida
  • Key contributors to software prototype effort and research
  • Alan George and the High-performance Computing and Simulation Lab
• Physical Sciences Inc.
  • SEU Sensor Provider
  • Gary Galica and Robin Cox
• WW Technologies Inc.
  • RPI Middleware Provider
  • Chris Walters and Technical Staff
• NASA New Millennium Program
  • Program Sponsor
Processing Platforms for New Science
• The success of recent rover missions is a perfect example of the type of science we want to support.
• Though returns from rover missions are significant, they could be orders of magnitude greater with sufficient autonomy and on-board processing capabilities.
• Similarly, deep space probes as well as Earth-orbiting instruments can benefit from increases in on-board processing capabilities.
• In all cases, increases in science data returns are dependent on the capabilities of the spacecraft's processing platform.
Payload Processing Conceptual Model
[Figure: data flows from a sensor array through three processing stages to low-bandwidth telemetry. Stages: Time Dependent Processing (TDP), sample-level signal processing; Object Dependent Processing (ODP), frame-level signal processing; Mission Dependent Processing (MDP), high-level logic operations. Axes: data rates (Mbps) and algorithm complexity (MIPS/MOPS), plotted against algorithm complexity/abstraction across TDP, ODP, and MDP.]
Technology Advance
• A spacecraft onboard payload data processing system architecture, including a software framework and a set of fault tolerance techniques, which provides:
  • An architecture and methodology that enables COTS-based, high-performance, scalable, multi-computer systems, incorporating reconfigurable co-processors and supporting parallel/distributed processing for science codes, and that accommodates future COTS parts/standards through upgrades.
  • An application software development and runtime environment that is familiar to science application developers and facilitates porting of applications from the laboratory to the spacecraft payload data processor.
  • An autonomous and adaptive controller for fault tolerance configuration, responsive to environment, application criticality, and system mode, that maintains required dependability and availability while optimizing resource utilization and system efficiency.
  • Methods and tools that allow prediction of the system's behavior in the space environment, including predictions of availability, dependability, fault rates/types, and system-level performance.
Radiation Environments
• Traditionally, microelectronics have been designed and manufactured specifically for use in radiation environments.
• Some COTS microelectronic manufacturing processes yield components that are partly resistant to radiation effects (tolerant to TID and immune to latch-up).
• In most cases Single Event Effects are of greatest concern, resulting mostly in bit flips (SEUs) and functional interrupts (SEFIs).
[Figure: Natural Radiation]
• Discrete simulation for 7 orbits of a Xilinx V2 FPGA
• Shows trend driven by changes in particle flux
• Orbit: 300 km perigee, 1400 km apogee, 70° inclination
N-Modular Redundancy
• The popular approach for mitigating SEUs is to employ fixed component-level redundancy.
• This technique can be applied at all levels of the system hierarchy, from circuit to box.
• One major disadvantage of fixed redundancy is low efficiency and unrealized system capacity.
Example N-Modular Redundancy
• TMR (Triple Modular Redundancy)
• Typically used in COTS-based microprocessor and Xilinx FPGA-based reconfigurable designs.
[Figure: Module 1, Module 2, and Module 3 feed a Majority Voter]
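The majority voter in a TMR arrangement can be sketched in a few lines of Python. This is only an illustration of the general technique, not code from the EAFTC project; the function names `majority_vote` and `disagreeing_modules` and the 32-bit word width are assumptions made for the example.

```python
def majority_vote(a: int, b: int, c: int, width: int = 32) -> int:
    """Return the bitwise 2-of-3 majority of three module outputs.

    Each output bit takes the value that at least two of the three
    modules agree on, so a single upset module is masked.
    """
    mask = (1 << width) - 1  # hypothetical word width, default 32 bits
    return ((a & b) | (a & c) | (b & c)) & mask


def disagreeing_modules(a: int, b: int, c: int, width: int = 32) -> list:
    """Return the 1-based indices of modules whose output differs from the vote."""
    mask = (1 << width) - 1
    voted = majority_vote(a, b, c, width)
    return [i for i, v in enumerate((a, b, c), start=1) if (v & mask) != voted]


if __name__ == "__main__":
    # Module 3 has a flipped bit; the voter masks it and flags the module.
    print(bin(majority_vote(0b1010, 0b1010, 0b1110)))
    print(disagreeing_modules(0b1010, 0b1010, 0b1110))
```

The sketch also shows the cost the slide points to: three modules do the work of one regardless of how benign the environment is.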
Adaptive Fault Tolerance
• Current COTS-based space computing/electronics systems use fixed-architecture designs based on brute-force, worst-case fault masking techniques.
• Triple Modular Redundancy (TMR) is typically a hard-wired design approach for Rad Tolerant G4 PPC processors and Xilinx FPGAs.
• The effectiveness and performance (MIPS/W) gains that a COTS device brings are degraded substantially by the use of a fixed-design, worst-case redundancy scheme.
• EAFTC enables the computer subsystem to take advantage of changing orbital environments during a mission life, using the COTS processing elements more efficiently as the environment allows.
• This allows the EAFTC system to adaptively trade performance versus reliability in real time.
[Figure: EAFTC-based system: COTS processing components in a reconfigurable architecture, environmental sensing (radiation, position), adaptive control algorithms, software-implemented FT]
EAFTC Operational Scenario
• EAFTC exploits the relation between SEU rate and orbit position as well as the variable criticality of system tasks.
• The fundamental process implemented in the system consists of three steps:
  • Measure the environment and system state.
  • Assess the environmental threat to the applications' availability.
  • Adapt the processing applications' configuration (i.e., fault tolerance) to effectively mitigate the threat presented by the environment.
• On average, more computation can be performed using EAFTC with less energy.
[Figure: MIPS per Watt and SEU rates versus orbit position, comparing average MIPS/Watt for the EAFTC design with MIPS/Watt for the worst-case design]
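The measure, assess, and adapt steps can be illustrated with a minimal Python sketch. Everything specific here is an assumption for illustration: the three threat levels, the SEU-rate thresholds, the `Task` type, and the replica counts are hypothetical and are not taken from the EAFTC controller.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable


class Threat(Enum):
    LOW = 0
    MEDIUM = 1
    HIGH = 2


@dataclass
class Task:
    name: str
    critical: bool  # hypothetical two-level criticality


def assess_threat(seu_rate: float, medium: float = 1e-4, high: float = 1e-2) -> Threat:
    """Step 2: map a measured SEU rate to a threat level (thresholds are made up)."""
    if seu_rate >= high:
        return Threat.HIGH
    if seu_rate >= medium:
        return Threat.MEDIUM
    return Threat.LOW


def choose_replicas(task: Task, threat: Threat) -> int:
    """Step 3: pick a replication level from threat and task criticality."""
    if threat is Threat.HIGH:
        return 3  # fall back to TMR-style triplication for everything
    if threat is Threat.MEDIUM:
        return 3 if task.critical else 2
    return 2 if task.critical else 1  # benign environment: free up nodes


def adapt(tasks: Iterable[Task], seu_rate: float) -> Dict[str, int]:
    """One pass of the loop; seu_rate stands in for the step 1 measurement."""
    threat = assess_threat(seu_rate)
    return {t.name: choose_replicas(t, threat) for t in tasks}


if __name__ == "__main__":
    tasks = [Task("navigation", True), Task("imaging", False)]
    for rate in (1e-5, 1e-3, 1.0):
        print(rate, adapt(tasks, rate))
```

In a benign part of the orbit the non-critical task runs unreplicated, which is where the extra delivered computation comes from; as the measured rate rises, the configuration converges on full triplication.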
Hardware Architecture
• APC Cluster
  • Consists of several APC nodes
  • Networked together with RapidIO
• Adaptive Processing Computer (APC)
  • Reconfigurable processing node
  • Multiple modes/configurations
  • High-performance COTS processor (PPC)
  • RapidIO network interface
  • Reconfigurable co-processor
• System Controller
  • Controller for the APC Cluster
  • Hosts EAFTC controller software and other experiment-related control software
  • RadHard processor and interfaces for reliable control of the COTS cluster
• SEU Alarm
  • Provides a measure of SEU-inducing flux and particle energy
  • Used by the EAFTC controller to determine the real-time SEU threat level
  • Separate heavy ion and proton sensors
EAFTC Application Platform
[Figure: software stack for the System Controller and Data Processor nodes, joined by the network. Layer labels: Application Specific, Mission Specific, Generic Fault Tolerant Framework, OS/Hardware Specific.]
• Applications: Scientific Application, Application Specific FT, FT Manager, EAFTC Controller, Job Manager, FT Control Applications
• Inputs and libraries: Policies, Configuration Parameters, Application FT Lib, Co Proc Lib
• Application Programming Interface (API)
• FT Middleware: Local Management Agents, Replication Services, Fault Detection
• SAL (System Abstraction Layer)
• OS, Hardware, FPGA, Network
EAFTC Middleware
• Provides a high-performance platform for parallel/distributed applications
  • Cluster and job management to provide a single system view to the application
  • Message Passing Interface API
  • Platform abstraction to include OS system calls and hardware registers
  • Mission-level customization through policies
  • Scalable architecture to support clustering of resources on a multi-computer system
  • Reconfigurable co-processor devices for application acceleration
• Provides a high-availability platform for applications
  • An autonomous and adaptive controller for fault tolerance configuration that maintains required dependability and availability while optimizing resource utilization and system efficiency
  • Checkpoint and rollback service for application recovery in the event of a fault
  • Application-level replication services to facilitate reliable deployment of applications on SEU-susceptible COTS processing resources
• EAFTC Middleware offers numerous benefits as a system platform
  • Capitalizes on cost savings in the use of commercial hardware
  • Capitalizes on the latest processing technology through technology refresh
  • Reduces cost and extends system life through a software-based middleware solution
  • Scales to meet system requirements
  • Customizable degree of fault tolerance to meet specific system needs
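The checkpoint and rollback idea listed above can be sketched generically in Python. This is not the EAFTC middleware API, which the slides do not show; the `Checkpointer` class, the `run_with_recovery` helper, and the use of `RuntimeError` to stand in for a detected fault are all assumptions for illustration.

```python
import copy


class Checkpointer:
    """Hypothetical in-memory checkpoint store holding one saved state."""

    def __init__(self):
        self._saved = None

    def checkpoint(self, state):
        # Deep copy so later in-place changes cannot corrupt the snapshot.
        self._saved = copy.deepcopy(state)

    def rollback(self):
        if self._saved is None:
            raise RuntimeError("no checkpoint taken")
        return copy.deepcopy(self._saved)


def run_with_recovery(steps, state, checkpointer, max_retries=3):
    """Run each step; on a detected fault, restore the last checkpoint and retry."""
    for step in steps:
        checkpointer.checkpoint(state)
        for _attempt in range(max_retries + 1):
            try:
                state = step(state)
                break  # step succeeded, move on to the next one
            except RuntimeError:
                # RuntimeError stands in for a fault reported by fault detection.
                state = checkpointer.rollback()
        else:
            raise RuntimeError("step failed after retries")
    return state
```

A real service would write checkpoints to storage that survives a node reset; the in-memory copy here only shows the control flow of save, detect, restore, and retry.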
EAFTC Technology Advances to TRL7 Flight Experiment
[Figure: TRL4, TRL5, TRL6, and TRL7 technology validation steps, with increasing fidelity and capability]
TRL4 Validation
• Demonstrated basic EAFTC technologies in a laboratory environment on a COTS hardware testbed, including radiation source and sensor
  • Environment Sensor
  • Alert Generator
  • High Availability Middleware
  • Replication Services
• NASA adds requirement for fault tolerant cluster and MPI capability
TRL5 Validation
• Demonstrate basic EAFTC technologies in a laboratory environment on testbed hardware with partially integrated Fault Tolerance Services
• Develop predictive models
• Validate and refine predictive models and predictive model parameters with experiment data
• Partial set of canonical fault injection experiments
TRL6 Validation
• Demonstrate enhanced EAFTC technologies in a laboratory environment on prototype flight hardware, including exposure to a radiation beam
• Validate and refine predictive models and predictive model parameters with experiment data
• Complete set of canonical fault injection experiments
TRL7 Validation
• Demonstrate EAFTC technologies in a real space environment
• Validate predictive models and predictive model parameters with experiment data
• TRL7 experiments will be identical to those performed and wrung out during TRL6 demonstration and validation
EAFTC Model Flow
[Figure: chain of models, each listed below with its inputs and outputs]
• Rad Effects Model
  • Inputs: orbit; epoch; radiation characterization of components; system architecture; HW architecture
  • Outputs: particle fluxes, energies, and component SEE effects
• HW SEU Susceptibility Model
  • Inputs: decomposed HW architecture; comprehensive fault model; canonical fault model (canonical fault types)
  • Outputs: fault rates for each fault type in the canonical fault model (λn)
• Availability & Reliability Models
  • Inputs: probability that a fault affects the application; detection coverage for each fault/error type in the canonical fault model; recovery coverage for each fault/error type in the canonical fault model; detection and recovery latencies for each fault; number of mode change types and rates; time to effect a mode change; probability that a mode change is successful
  • Outputs: availability and reliability
• Performance Model
  • Inputs: mission application characterization and constraints; peak throughput per CPU; number of nodes in cluster; algorithm/architecture coupling efficiency for the application; network-level parallelization efficiency; measured OS and FT services overhead; measured execution times for applications
  • Outputs: delivered throughput; delivered throughput density; effective system utilization
TRL4 EAFTC System Technology Demonstration
• Successful demonstration of the EAFTC system
• The EAFTC prototype comprises key technology elements:
  • Cluster Computer
  • Autonomous Controller
  • Replication Services
• Environment input is simulated via SPENVIS radiation models
• Instrumentation for power utilization is included in the model
• Profiling is integrated on the Data Processors for CPU utilization measurement
• Workload is provided via a synthetic benchmark application on the Data Processors
Computer Capacity Experiment
TMR 3-node system
• Average power: 72 Watts
• Average system effective MIPS: 973 MIPS
• Average system efficiency: 13 MIPS/Watt
EAFTC 4-node system
• Average power: 97 Watts
• Average system effective MIPS: 2661 MIPS
• Average system efficiency: 28 MIPS/Watt
Comparison (EAFTC relative to TMR): 35% increase in power consumption, 173% increase in effective MIPS, and 115% increase in efficiency
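The comparison percentages follow directly from the averages quoted on the slide, as the short Python check below shows. The dictionary keys and the helper names `percent_increase` and `compare` are made up for this example; the numbers are the slide's own.

```python
def percent_increase(baseline: float, new: float) -> float:
    """Relative increase of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Averages as quoted on the slide.
TMR_3_NODE = {"power_w": 72, "effective_mips": 973, "mips_per_watt": 13}
EAFTC_4_NODE = {"power_w": 97, "effective_mips": 2661, "mips_per_watt": 28}


def compare(baseline: dict, candidate: dict) -> dict:
    """Percent increase of each metric, rounded to the nearest whole percent."""
    return {k: round(percent_increase(baseline[k], candidate[k])) for k in baseline}


if __name__ == "__main__":
    print(compare(TMR_3_NODE, EAFTC_4_NODE))
    # {'power_w': 35, 'effective_mips': 173, 'mips_per_watt': 115}
```

Note that the 115% figure is obtained from the quoted 13 and 28 MIPS/Watt averages; dividing average MIPS by average power gives slightly different efficiencies, which suggests the slide's efficiency values were averaged separately.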
TRL5 Platform
• Consists of 4 Data Processors implemented with COTS Single Board Computers (SBCs) and PCI Mezzanine Cards (PMCs)
• Each SBC will carry a PPC 750FX microprocessor running the Linux operating system and a software fault injector for fault simulation
• Each PMC will carry a Xilinx Virtex2 FPGA that will serve as the co-processor for its host SBC
• The System Controller will be implemented with a software development unit of our flight SBC
• All nodes in the cluster will be interconnected via a GigE switch
• A Development Workstation will be used for software development, experiment control, and instrumentation data collection
• Software Implemented Fault Injection (SWIFI) will be the primary method for simulating faults. Other methods may also be used, such as manual node resets, network traffic fault injection (via software or hardware fault injection methods), and faults inserted through the test port
New Millennium Program Space Technology 8
• New Millennium Program (NMP)
  • NASA program for technology development
  • Currently working on its 8th technology development program (ST8)
  • In Formulation phase to evaluate 4 subsystem technologies (one of them EAFTC)
• The objective of the NMP ST8 EAFTC mission is to validate EAFTC technology at TRL7 through experimentation in space.
  • SSR 7/05
  • PDR 5/06 (TRL5)
  • CDR 5/07 (TRL6)
  • Launch 12/08 (TRL7 after a 6-month on-orbit experiment)
• Our team's overall goal is to demonstrate that EAFTC is a competitive and low-risk solution for missions needing COTS high-performance on-board payload processing.
• We will demonstrate that EAFTC can significantly improve the performance of a COTS-based computer in orbit.
Summary
• EAFTC is an enabling technology for high-performance spacecraft computing.
• As part of our NMP-sponsored efforts, a TRL4 system has been demonstrated.
• Efforts continue towards a TRL5 system demonstration.