BG/L Hardware Bringup
Burkhard Steinmacher-Burow for the Bringup Team

Outline
• Introduction to BG/L Hardware Bringup
• Some Bringup Techniques and Approaches:
  • Schedule
  • Preparation, Preparation, Preparation
  • People
  • The Hardware Team
  • BG/L Online - Making First Hardware Available
  • Pursuit, Communication and Tracking of Issues
  • Cycle-Reproducibility in HW and Simulation
• Some Lessons Learned
• Summary

Introduction to BG/L Hardware Bringup
Large multi-site team led by Alan Gara.
Bringup of first 1-node through 65536-node BG/L machines.
Aggressive schedule. Some dates:
• June 16, 2003 – First BG/L compute chips at Watson. RiscWatch program load and PPC440 execution.
• June 18, 2003 – BlueMatter FFT on single node.
• June 24, 2003 – Sys.Sw. runs on single node: Linpack, STREAMS, NAS.
• July 30, 2003 – LLNL review of single- and multi-node results.
• Nov 2003 – CEO milestone: 1TFlop LINPACK on 512 nodes.
• Nov 2003 – Release RIT2 of BG/L compute chips to manufacturing.
• Feb 2, 2004 – First RIT2 chips arrive at Watson.
• Early 2005 – 65536 node BG/L running applications at LLNL.
Techniques: The remainder of this talk.
Focus on past 1-512 node bringup, not future 1024-65536 nodes.

Bringup Schedule
Bringup schedule is dictated by availability of chips, because:
• Chip is the fastest changing technology. In other words, Moore’s Law halves value of BG/L every 18 months.
• Chips are BG/L’s costliest area to iterate:
  • About $1 million per iteration.
  • More than 3 months per iteration.
Since schedule is gated by chips:
• At each bringup stage, require other hardware and software ready in advance.
• Preparation, preparation, preparation.

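To make the schedule pressure concrete, here is a rough sketch of the value-halving claim on this slide. It assumes a simple exponential decay with an 18-month half-life, which is one reading of the slide and not a model given in the talk; the function name `remaining_value` is illustrative.

```python
def remaining_value(months: float, halving_months: float = 18.0) -> float:
    """Fraction of value left after `months`, assuming value halves
    every `halving_months` (assumed exponential decay, not from the talk)."""
    return 0.5 ** (months / halving_months)


# How much value is left after one or more chip iterations of about 3 months each.
for months in (3, 6, 12, 18):
    print(f"{months:2d} months: {remaining_value(months):.1%} of value remains")
```

Under this assumption, a single chip iteration of just over 3 months costs roughly 11% of the machine's value, on top of the direct cost of about $1 million.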
Preparation, Preparation, Preparation
Example preparations before chip arrival:
• BG/L hardware, minus chips, up and running, including JTAG path.
• Link chip hw 9 months earlier, so had mature network signaling.
• Simulation, simulation, simulation. Much effort spent validating and verifying chip design:
  • Software regression suite of ~180 tests driving full chip simulation.
  • Bus functional model stressing memory and networks.
  • Aggressive multinode torus and tree testbenches.
  • Low-level chip startup via JTAG.
• Much effort on timing, physical design, …, manufacturing.
• Greenlight and other scenarios helped ensure that all sw is in place.
To compete against COTS Linux clusters, need working components!

People in BG/L Hardware Bringup
Hardware team divided into subteams. Here list each person in only one subteam. Most at Watson.
• Packaging: Paul Coteus, Todd Takken, Rick Rand, Gerry Kopcsay, Randall Bickford, Rich Swetz, Paul Crumley, Thomas Cipolla, Shawn Hall, Larry Mok.
• Bringup SW Environment: Dong Chen, Ralph Bellofatto, Mark Giampapa.
• Test Interface: Ruud Haring, Marc Dombrowa, Sarabjeet Singh.
• Memory System: Martin Ohmacht, Dirk Hoenicke, Ben Nathanson, Valentina Salapura.
• Network Signalling: Minhua Lu, Al Gara.
• Torus: Phil Heidelberger, Pavlos Vranas, Sarabjeet Singh.
• Tree and Global Interrupt: Matthias Blumrich, Dirk Hoenicke, Lurng-Kuo Liu, Burkhard Steinmacher-Burow.
• PowerPC440+FPU: Jim Sexton.

People in BG/L Hardware Bringup (cont. 1)
Additional participants in BG/L Hardware bringup:
• Blue Matter Team: TJC Ward, Blake Fitch, Alex Rayshubski, …
• Applications Team: John Gunnels, Gyan Bhanot, Bob Walkup, …
• System Software Team: Jose Moreira, George Almasi, …
• Continuity with Rochester: Fariba Kasemkhani, Tom Liebsch, Karl Solie.
• 440 and FPU: Ken Dockser (Raleigh), Jim Goldade (Rch).
• LBIST/ABIST: Steve Douskey, Jim Daily (Rochester).
• Cronus: Chris Engel (Rch).
• DDR Controller: Jim Marcella (Rch).
• Clocks: Matt Ellavsky, Bruce Rudolph (Rch), Balaji Gopalsamy (Bangalore).
• RiscWatch: Nabil Rizk, Anthony Marsala (Raleigh).
• FIB Chips: Marsha Abramo (Burlington).
• Torus and Tree Networks: Narasimha Adiga, Arun Umamaheshwaran, Krishna Desai (Bangalore), Mickey Tsao, Mike Wazlowski, Brett Tremaine.
Whether at Watson or elsewhere, same access to BG/L hardware! See later section on ‘BG/L online’.

People in BG/L Hardware Bringup (cont. 2)
Additional groups supporting BG/L Hardware bringup:
• Raleigh 440 team.
• Watson CSS board assembly and rework.
• Watson IT group on ethernet issues.
• IGS on infrastructure.
• Austin, Rochester, Poughkeepsie on AIX/LL/DFS issues.
• Watson metal shop.
• Watson physical plant.
• . . .
Fast support from all groups was critical to meeting the bringup schedule!
Example: Watson CSS handled large peaks of assembly volume and had quick turnaround on rework or change requests.
Beyond bringup, additional people and groups for design, physical design, manufacturing, ….

The Hardware Team
On the job from start to finish:
• Those who proposed the project and set the goals, schedule and resources, were given responsibility to complete it. This creates a huge level of drive and commitment, in all phases including bringup.
• Hardware team covers all phases: application studies, architecture, high level simulation, design, verification, bringup, application porting. This allowed:
  • Bringup benefits from deep BG/L expertise. For example:
    • Determining appropriate goals and resources for an area of BG/L.
  • Bringup assistance is designed into BG/L hardware. Some examples:
    • Cycle-reproducibility described in a later section.
    • 3 levels of loopback on torus and tree for fault isolation.
    • Error injection: torus and tree links, memory to test ECC.
    • Backdoors into devices.

The Hardware Team (cont.)
Members have very similar view of BG/L:
• Goals, architecture and rest of view described in Jan 2001 Program Plan.
• With similar view, can quickly reach technical consensus. Required for fast bringup and fast other phases.
Trust individuals with responsibility over an area.
• Work required is best recognized by those doing the work.
• Trust is good, control is better: Also had lists of lists.
Many members are involved in several areas of BG/L.
• Many eyes help avoid fatal hardware errors.
• Helps ensure reasonable tradeoffs across areas. For example, resource use.

BG/L Online – Introduction
• Staggered chip delivery first allowed a few 2-node ‘service stations’, then a few 32-node cards, and eventually a 512-node midplane at Watson.
• IBM intranet provides access to the nodes. Same access whether you are at Watson or in Rochester or in India. Truly the same, since physical access is very rarely required.
• Bringup performed simultaneously by multiple largely-independent teams.
• Hardware logically partitioned and allocated to the different bringup efforts. Allocation changes at daily bringup meetings.

BG/L Online – The Hardware
A tremendous success of the packaging group!
• Shortly after a chip enters Watson, it is powered, configured and online.
• Version changes of cards, power converters and other packaging have not been visible to users.

BG/L Online – The Bringup Software Environment
All access through proxy server:
• Intranet – proxy – ethernet – iDo_chip – JTAG.
• Supports entire spectrum of tools:
  • RiscWatch.
  • Cronus.
  • Hardware host console, including fast boot tools and mailboxes.
  • Blade microkernel and programming environment to conveniently configure and expose BG/L nodes for diagnostic codes.
Critical: Fast changes to investigate and resolve issues!

BG/L Online – The End Result
RIT1 hardware good enough for:
• System software development. LLNL milestones. BlueMatter code development.
• Tests on RIT1 HW to release RIT2 to manufacturing.
HW Bringup Present and Future:
• Accumulated above suite of HW test code.
• If all goes well, little new code required to verify and validate RIT2.
• Evolving subset of codes used as manufacturing diagnostics:
  - Aim to catch all card-level fails.
  - Aim to categorize fails to identify problem areas of chip.
• Evolving subset of codes used as scaling diagnostics:
  - Identify the (intermittently) broken component among 65536 nodes.
• In other words:
  • Worrying less about inevitable remaining design errors.
  • Worrying more about chip and component failures.

Pursuit, Tracking and Communication of Issues
Hardware team relentlessly pursues every anomaly to root cause:
• No matter how trivial or innocent. “Leave no thread unpulled!”
• Non-understood hardware does not move until resolved!
Hardware team carefully tracks every issue in issues database.
Related effort automatically captures anomalies in RAS database.
Issues are widely communicated, since having more people involved:
• Increases chance of finding best resolution.
• Increases chance of recognizing related new issues.
• Decreases chance of chasing the same bug twice.

Cycle Reproducibility in HW and Simulation
A single node code on BG/L Hardware is cycle reproducible:
• All components leave reset cycle-reproducibly.
• Ping-pong reset routine synchronizes reset across components, in particular the 2 PPC440 cores and the DRAM refresh.
When randomly searching, hardware hits bugs much faster than simulation:
• Node simulation is on the order of 1 million cycles per hour.
• HW is 700 Mcycles/second * 512 nodes in parallel = about 1 billion times faster.
Hunt for bugs by random search application:
• Application loops around ping-pong reset. This minimizes relatively lengthy application load time.
• Within loop execute for only a few million cycles. This keeps simulation time down to a reasonable few hours.
A bug in simulation is essentially solved:
• Have all values of all signals and powerful tools to investigate them.
Heavily used to investigate BG/L memory system!

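The factor of about 1 billion on this slide can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted on the slide; the variable names and the helper `sim_hours` are illustrative, and the 3-million-cycle run length is an assumed example of "a few million cycles".

```python
# Figures quoted on the slide.
SIM_CYCLES_PER_HOUR = 1_000_000   # node simulation: about 1 million cycles per hour
HW_CLOCK_HZ = 700_000_000         # hardware: 700 Mcycles/second per node
NODES = 512                       # nodes searching in parallel

sim_cycles_per_second = SIM_CYCLES_PER_HOUR / 3600
hw_cycles_per_second = HW_CLOCK_HZ * NODES
speedup = hw_cycles_per_second / sim_cycles_per_second


def sim_hours(cycles: int) -> float:
    """Hours needed to replay `cycles` of a failing run in node simulation."""
    return cycles / SIM_CYCLES_PER_HOUR


print(f"hardware vs simulation: {speedup:.2e}x")          # about 1.29e+09
print(f"replaying 3M cycles: {sim_hours(3_000_000):.1f} hours in simulation")
```

This is why the search loop is kept to a few million cycles per pass: hardware finds the failing case quickly, and cycle reproducibility lets the same few million cycles be replayed in simulation within hours.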
Some Lessons Learned
Would be nice to have:
• Better coordination of potential hardware issues arising from applications and system software running on early hardware. Hard to solve due to:
  • Diverging goals:
    • HW people worry about missing an issue.
    • SW and application people want working hardware.
  • Difficulty quickly identifying hw or sw as root cause.
• Better hw and sw support for small few-node partitions.
  • BG/L Online allowed many teams to simultaneously access early hardware. So hardware was often allocated in small partitions.
• No major design errors in silicon.
  • Had 1 in memory system. Solution: New more aggressive simulation of memory unit.

Summary
Enthusiastic efforts from many non-BG/L IBMers crucial for quick BG/L bringup.
• Thanks!
State of BG/L Bringup:
• Linear scale: Have brought up 512 of 65536 nodes. In other words, less than 1% of the final machine.
• Log scale: Have brought up 2^9 of 2^16 nodes. In other words, more than half the final machine.
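The two progress figures in the summary follow from the node counts alone, with 512 = 2^9 and 65536 = 2^16. A minimal check, with illustrative variable names:

```python
import math

brought_up = 512      # nodes brought up so far
final = 65536         # nodes in the final machine

linear_fraction = brought_up / final
log_fraction = math.log2(brought_up) / math.log2(final)   # 9 doublings out of 16

print(f"linear scale: {linear_fraction:.2%}")   # less than 1%
print(f"log scale:    {log_fraction:.2%}")      # more than half
```

On the log scale each step is a doubling of machine size, so 9 of the 16 doublings from a single node are already done.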