A Brief History of Configurable Computing Machines
Brad L. Hutchings, Brigham Young University
Presented at MAPLD'99

ENIAC Vital Statistics
• 19,000 Tubes
• 1,500 Relays
• 70,000 Resistors
• 10,000 Capacitors
• 5,000,000 soldered joints
• 40 Racks
• Cost: $486,804.22
• 200,000 Watts

Programming ENIAC
[Flowchart]
• Conceptual Design: determine general organization.
• Block Diagram.
• Manual Routing: manual partition, place and route (weeks).
• Test: execute.
• Feedback paths when the test DOES NOT WORK: re-wire, design modifications, or start over.

Speedup

| Application | Computer (human) | Differential Analyzer | ENIAC |
|---|---|---|---|
| 60-sec trajectory calculation | 72,000 sec | 900 sec | 30 sec |
| Speedup | 1 | 80 | 2400 |
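
The speedup row is the baseline time divided by each machine's time. A minimal sketch of that arithmetic, using only the three times from the slide; the function name `speedup` and the dictionary layout are illustrative, and reading "Computer" as a human computer is an assumption.

```python
def speedup(baseline_seconds, seconds):
    """Speedup relative to the baseline: baseline time / machine time."""
    return baseline_seconds / seconds

# Times for the 60-second trajectory calculation, copied from the slide.
# Assumption: "Computer" means a human computer, used here as the baseline.
times = {
    "human computer": 72_000,
    "differential analyzer": 900,
    "ENIAC": 30,
}

baseline = times["human computer"]
speedups = {name: speedup(baseline, t) for name, t in times.items()}

for name, s in speedups.items():
    print(f"{name}: {s:g}x")
# human computer: 1x
# differential analyzer: 80x
# ENIAC: 2400x
```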

Was ENIAC really a configurable computer?
• Coarse function units
  • Function controlled by front-panel switch:
    • Addition
    • Subtraction
    • Multiplication/division
    • Square root
• Sequencing via interconnect:
  • Function units connected with cables.
• Programming ENIAC:
  • Programming functional units
  • Interconnecting functional units (place and route?)

ENIAC was Configurable
• Fixed-function units were manually wired.
• Trunk lines were crossbars.
• Match structure to the problem.
• ENIAC was later hacked to be sequential (for ballistic tables):
  • Local programming was the configurable model.
  • Central programming was the stored-program model.
  • Performance was lower.
  • Programming was much easier.
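
As a loose software analogy (not a model of ENIAC's actual hardware), "programming" in the configurable style means two things: choosing what each function unit does, and choosing how the units are wired together. The sketch below is hypothetical throughout: the unit names, the `run` helper, and the example computation are illustrative assumptions.

```python
import math

# Operation each function unit can be switched to (the "front-panel switch").
OPERATIONS = {
    "add": lambda x, y: x + y,
    "sub": lambda x, y: x - y,
    "mul": lambda x, y: x * y,
    "sqrt": lambda x: math.sqrt(x),
}

def run(units, wiring, inputs):
    """Evaluate units in the listed order.

    units  -- list of (unit_name, operation_name): the per-unit configuration
    wiring -- maps each unit to the signals cabled into it (the interconnect)
    inputs -- initial signal values
    """
    signals = dict(inputs)
    for name, op in units:
        args = [signals[src] for src in wiring[name]]
        signals[name] = OPERATIONS[op](*args)
    return signals

# Configure four units and cable them to compute sqrt(a*a + b*b).
units = [("u1", "mul"), ("u2", "mul"), ("u3", "add"), ("u4", "sqrt")]
wiring = {"u1": ["a", "a"], "u2": ["b", "b"], "u3": ["u1", "u2"], "u4": ["u3"]}

result = run(units, wiring, {"a": 3.0, "b": 4.0})
print(result["u4"])  # 5.0
```

Changing the problem means changing `units` and `wiring`, not the function units themselves, which is the sense in which the structure is matched to the problem.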

ENIAC-Chip Vital Statistics
• 0.5 micron process
• CMOS N-well
• Triple metal
• 174,569 transistors
• All function-unit interconnections are programmable crossbars.
• Reminiscent of FPGA architectures.

What did we learn from ENIAC?
• Manual place and route took too long.
• ENIAC was later hard-wired for ballistic calculations.
• The stored-program concept was much more convenient.
• You could really build something out of 18,000 tubes.

Fixed + Variable Structure: Estrin (1960)
[Block diagram with labels: General-Purpose Computer; Supervisory control; Variable Structure.]

Variable Structure Layout
[Diagram labels]
Hardwired Module
• Arithmetic
• Flip-Flop
• Memory
Motherboard (plug-board)

Developing Applications
• Step 1: Manually wire up the machine to include the desired hardware.
  • Plug in modules.
  • Wire up the backplane.
• Step 2: Compiler automatically maps code to hardware.

Wiring Estrin's Machine
[Flowchart]
• Conceptual Design: determine general organization.
• Block Diagram.
• Manual Routing: manual partition, place and route (days?).
• Test: execute.
• Feedback paths when the test DOES NOT WORK: re-wire, design modifications, or start over.

Projected Speedups

| Application | Speed Gain | Flip-Flops | Arithmetic Units | High-Speed Memory |
|---|---|---|---|---|
| Number sieve | 1000 | 340 | - | - |
| Eigenvalues | 4 | 364 | 1 | 16K |
| Dynamic programming | 2.5 | 80 | 1 | - |
| Parabolic partial differential eq. | 3.5 | 380 | 2 | 500 |
| Log-exp-nth power | 8.5 | 125 | 1 | 1024 |
| Exponential | 6.0 | 125 | 1 | 1024 |
| nth power | 7.0 | - | - | - |
| Trig-inverse trig. | 4 | 200 | 1 | 256 |
| Random-number generator | 4 | 40 | 1 (serial) | - |

What did we learn from Estrin?
• Special-purpose computers have speed advantages.
• 1960s technology was not sufficient to implement Estrin's ideas.

Modern System Architectures
• Mix of configurable and fixed-function devices.
  • Memories, interconnect, crossbars.
• Connected to an I/O device.
  • I/O device manages data collection.
• Configurable system is loosely coupled.
• Application-implementation strategy:
  • Entire application on configurable platform.
  • 90/10 analogy.

Splash Architecture ('92)
[Architecture diagram. Recoverable labels: Sparc Host; Interface Board (DMA, XL, XR); Splash Array Boards, each with FPGAs X1-X16, X0, and a Crossbar; Processing Element: XC4010 with 256K x 16 RAM; bus widths 16, 20, 36.]

Splash-2 Vital Statistics
• 17-272 Xilinx 4010 FPGAs (10K gates)
• 17-272 256K x 16-bit memories
• Systolic data path
• Extensible
• Programmed using commercial tools.
• Comprehensive VHDL model of entire system.
• Debugging capability.
• Cost: ???

Splash Development Approach
[Flowchart]
• Conceptual Design: determine general organization.
• Manual Partition: write VHDL code for each FPGA.
• Synthesize: synthesize VHDL files to XNF (30 minutes to hours).
• Xilinx Backend: place and route XNF (30 minutes to hours).
• Download & Test: execute on Splash.
• Feedback paths (DOES NOT FIT, DOES NOT WORK): resynthesize, rewrite VHDL, or start over.

Speedups

| Application | VAX 6620 | CM-2 | SPLASH2 |
|---|---|---|---|
| Edit-Distance | 1.0M CUPS | 5.9M CUPS | 3,000M CUPS |
| Speedup | 1 | 5.9 | 3,000 |
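
CUPS is read here as "cell updates per second", the usual throughput measure for dynamic-programming sequence comparison; that reading is an assumption, since the slide does not expand the acronym. The slide also does not say how edit distance was implemented on each machine, so the sketch below uses the textbook Levenshtein recurrence only to show what one "cell update" is. The names `edit_distance` and `seconds_at` are illustrative.

```python
def edit_distance(a, b):
    """Return (distance, cell_updates) for the textbook Levenshtein recurrence."""
    prev = list(range(len(b) + 1))   # row 0: distance from "" to each prefix of b
    cell_updates = 0
    for i, ca in enumerate(a, start=1):
        curr = [i]                   # column 0: distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            curr.append(min(prev[j] + 1,      # deletion
                            curr[j - 1] + 1,  # insertion
                            substitution))
            cell_updates += 1        # one table cell computed
        prev = curr
    return prev[-1], cell_updates

def seconds_at(cell_updates, mcups):
    """Time to perform cell_updates at a rate given in millions of CUPS."""
    return cell_updates / (mcups * 1e6)

distance, updates = edit_distance("kitten", "sitting")
print(distance, updates)  # 3 42

# Rates from the slide, in millions of cell updates per second.
rates = {"VAX 6620": 1.0, "CM-2": 5.9, "SPLASH2": 3000.0}
for machine, mcups in rates.items():
    print(machine, seconds_at(1_000_000, mcups))
```

Comparing two 1,000-character sequences needs 1,000,000 cell updates, so the rates in the table translate directly into running time for a fixed problem size.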

DECPeRLe-1 Architecture ('92)
[Architecture diagram. Recoverable labels: an array of XC3090 FPGAs; four 1MB memories; address lines adrN, adrE, adrS, adrW; In and Out ports; Host connection.]

DECPeRLe-1 Vital Statistics
• 20 XC3090 FPGAs
• 4 x 1MB memories
• High-speed host I/O.
• Programmed with custom CAD software: PAMDC
• Debug capabilities
• Cost: $15K

DECPeRLe-1 Development Approach
[Flowchart]
• Conceptual Design: determine general organization.
• Manual Partition: write C++ code for each FPGA; manually place circuitry; generate XNF.
• Xilinx Backend: place and route unplaced XNF (30 minutes to hours).
• Download & Test: execute on DECPeRLe-1.
• Feedback paths (DOES NOT FIT, DOES NOT WORK): rewrite C++ or start over.

Speedups

| Application | Cray-II | DECPeRLe-1 |
|---|---|---|
| Long-integer arithmetic | 4 Mb/s | 66 Mb/s |
| Speedup | 1 | 16 |

| Application | Anything else | DECPeRLe-1 |
|---|---|---|
| RSA Cryptography | - | 185 kb/s |
| Speedup | 1 | < 10 |

HP Teramac Vital Statistics
• PLASMA FPGA: 1 FPGA
• MCM: 27 FPGAs
• PCB (4 MCMs): 108 FPGAs
• Chassis (8-16 boards): 1728 FPGAs
• ~1 million gates
• 0.5 Gigabytes RAM
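
The FPGA counts on this slide multiply up through the packaging hierarchy. A small sketch of that arithmetic, assuming the labels pair up as 27 FPGAs per MCM, 4 MCMs per PCB, and 8 to 16 PCBs per chassis (the pairing is inferred from the numbers, and the constant names are illustrative):

```python
# Packaging hierarchy as read from the slide (the pairing is an assumption).
FPGAS_PER_MCM = 27
MCMS_PER_PCB = 4
BOARDS_PER_CHASSIS = (8, 16)   # minimum and maximum board count

fpgas_per_pcb = FPGAS_PER_MCM * MCMS_PER_PCB
fpgas_per_chassis = tuple(fpgas_per_pcb * n for n in BOARDS_PER_CHASSIS)

print(fpgas_per_pcb)       # 108
print(fpgas_per_chassis)   # (864, 1728)
```

The 1728 figure on the slide corresponds to a fully populated 16-board chassis; an 8-board chassis would hold 864 FPGAs.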

Teramac Development Approach
[Flowchart]
• Conceptual Design: determine general organization.
• Write VHDL: write VHDL code for the entire system; generate Verilog netlist; translate to Teramac .opt file (???).
• Teramac Backend: partition, place and route .opt file (30 minutes).
• Download & Test: execute on Teramac.
• Feedback paths (DOES NOT FIT, DOES NOT WORK): start over (get a bigger Teramac).

Speedup

| Application | Workstation | DECPeRLe-1 |
|---|---|---|
| 3-D Convolution (speedup) | 1 | < 50-100x |

Comparing the 3 systems: Debug
• All supported symbolic debug.
• Splash, DECPeRLe-1, Teramac:
  • Users could access FPGA state via original signal names.
• Teramac supported checkpoint, restart, and breakpoints.
• DECPeRLe-1 supported dynamic animation of system state.

Comparing the 3 systems: Design
• DECPeRLe-1
  • Carefully used custom tools and optimized to achieve the highest performance.
• Splash-2
  • Used commercial CAD but achieved lower performance.
• Teramac
  • Completely automatic. Lowest performance and utilization. Most expensive platform.

What did we learn?
• In-circuit debug has a tremendous impact.
• Both custom tools and commercial CAD have their place.
• Simple architectures are easier to map to.
• If the architecture is too complex, hide it behind tools.
• Comprehensive simulation models are essential.
• More memory (ports) is good.

Other lessons...
• You really could build something with nearly 1K FPGAs (Teramac).
• Interfacing with Sparcs (SBUS) can be difficult (Splash-2).
• The engineers lied (so says Brian Schott).
• CAD tools got in the way then and they get in the way now.

More Lessons
• The run-time environments for the platforms haven't changed.
• The programming environments still haven't changed much.