Stream Computations Organized for Reconfigurable Execution (SCORE). Eylon Caspi, Michael Chu, Randy Huang, Joseph Yeh, John Wawrzynek, University of California, Berkeley – BRASS group; André DeHon, California Institute of Technology – Dept. Computer Science. http://brass.cs.berkeley.edu/SCORE/
Goal: Software Survival • Software for microprocessors survives on new devices • Binary compatibility • Automatic improvement • Software for reconfigurable devices does not • Substantial effort to port/redeploy FPL 2000 (8/30/00)
Outline • Problem: Software Survival • A New Compute Model • SCORE Components • Preliminary Results • Future Work
Why Can’t Reconfig. Software Survive? • Resource constraints/sizes are exposed: • to programmer • in low-level representation (netlist) • Design revolves around device size • Algorithmic structure • Exploited parallelism
The SCORE Approach • A compute model with unbounded resources • Efficient hardware virtualization • Demand paging
Page-Compatible Devices • Family of devices with: • Common page definition • Varying number of pages • Binary Compatibility • Automatic Performance Improvement
Virtualizing a Netlist (is bad) • Netlist is sensitive to timing • Disallow asynchronous features (e.g. busses) • Synchronous • WASMII [Ling+Amano, FCCM ’93] • Page I/O via registers • Execute each cycle of every page • Huge reconfiguration overhead! [Figure: Page Execution timeline; labels: Execute, Reconfigure; axis: time]
Previous Attempts at Virtualization • Multi-context • DPGA [DeHon, FPGA ’94] • TM-FPGA [Xilinx, FCCM ’97] • Configuration Cache • Striped • PipeRench [CMU, FPGA ’98] • Pipelined reconfiguration • Restricted to feed-forward pipelines
Streams • Stream is: • Unidirectional page-to-page link • FIFO queue of data tokens • Unbounded depth • Goal • Less frequent reconfiguration • Batch process block of inputs • Amortize reconfiguration cost over large data set
Stream Implementation • Only one endpoint (page) loaded • Stream = memory buffer • Desire distributed, on-chip memory • Both endpoints (pages) loaded • Stream = wire
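The two cases on this slide can be sketched in a few lines of Python. This is only an illustration: the names `Stream` and `realization` are hypothetical and not part of SCORE, and the handling of a stream with neither endpoint loaded is an assumption the slide does not cover.

```python
from collections import deque


class Stream:
    """Unidirectional FIFO link between a producer page and a consumer page."""

    def __init__(self, producer, consumer):
        self.producer = producer
        self.consumer = consumer
        self.tokens = deque()  # no depth limit, modelling "unbounded depth"

    def write(self, token):
        self.tokens.append(token)

    def read(self):
        return self.tokens.popleft()


def realization(stream, loaded_pages):
    """Return how the stream is implemented for a given set of loaded pages."""
    if stream.producer in loaded_pages and stream.consumer in loaded_pages:
        return "wire"
    # Only one endpoint loaded: tokens are held in a memory buffer.
    # Assumption: the same holds when neither endpoint is loaded.
    return "memory buffer"
```

For example, `realization(Stream("DCT", "Zig-zag"), {"DCT"})` gives `"memory buffer"`, while loading both pages gives `"wire"`.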
Execution Example: Spatial [Figure: DCT, Zig-zag, Quantize / ZLE, and Huffman Enc. pages]
Execution Example: Time-Multiplexed [Figure: DCT, Zig-zag, Quant / ZLE, and Huffman Enc. pages]
SCORE Components • Graph-based Compute Model • Scheduler • Run-time Support • Hardware Support
SCORE Compute Model • Computation = graph of compute nodes • Concretely: compute pages • Abstractly: operators with local state (FSM) • Communication = streaming data flow • Storage = • Streams • Memory segments, accessed through streams
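A toy rendering of this model in Python may help make it concrete. It is a sketch under stated assumptions, not SCORE's programming interface: `Operator`, `connect`, and `run` are hypothetical names, each operator has exactly one input and one output stream, and streams are plain FIFO queues.

```python
from collections import deque


class Operator:
    """Compute node with local state, one input stream and one output stream."""

    def __init__(self, name, step, state=0):
        self.name = name
        self.step = step      # step(state, token) -> (new_state, result)
        self.state = state
        self.inp = None
        self.out = None

    def fire(self):
        """Consume one input token if one is waiting; return True if fired."""
        if not self.inp:
            return False
        token = self.inp.popleft()
        self.state, result = self.step(self.state, token)
        self.out.append(result)
        return True


def connect(src, dst):
    """Create a stream (FIFO queue) from src's output to dst's input."""
    stream = deque()
    src.out = stream
    dst.inp = stream
    return stream


def run(operators):
    """Fire operators until no operator has input tokens left."""
    fired = True
    while fired:
        fired = any([op.fire() for op in operators])


# Two-node graph: a stateless scaler feeding a running-sum accumulator.
scale = Operator("scale", lambda s, x: (s, 2 * x))
accum = Operator("accum", lambda s, x: (s + x, s + x))
scale.inp = deque([1, 2, 3])
connect(scale, accum)
accum.out = deque()
run([scale, accum])
print(list(accum.out))  # [2, 6, 12]
```

The accumulator carries its running sum as local state between tokens, which is the "operator with local state (FSM)" view; the only communication between the two nodes is the stream created by `connect`.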
SCORE Hardware Model • Paged FPGA • Compute Page (CP) • Fixed-size slice of RC hardware • Fixed number of I/O ports • Distributed, on-chip memory • Configurable Memory Block (CMB) • Stream access • High-level interconnect • Microprocessor • Run-time support + user code
SCORE Run-Time Support • Mechanics of run-time reconfiguration • Page swap [context save/load] • Reconfigure interconnect • Page Scheduling • Which page to run where, when • Static … Dynamic
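To illustrate the "which page to run, when" decision, here is a minimal dynamic policy in Python. The policy (run pages that have input waiting, least recently run first) and the name `pick_pages` are assumptions made for illustration; the slides do not specify SCORE's actual scheduling algorithm.

```python
def pick_pages(pending_tokens, num_physical, last_run):
    """Choose the virtual pages to load for the next timeslice.

    pending_tokens: dict mapping page name -> input tokens waiting
    num_physical:   number of physical compute pages available
    last_run:       dict mapping page name -> timeslice index of last run
    """
    # Only pages with input waiting can make progress.
    runnable = [page for page, count in pending_tokens.items() if count > 0]
    # Hypothetical fairness rule: least recently run pages go first.
    runnable.sort(key=lambda page: last_run.get(page, -1))
    return runnable[:num_physical]


pending = {"DCT": 64, "Zig-zag": 0, "Quant/ZLE": 10, "Huffman": 5}
history = {"DCT": 3, "Quant/ZLE": 1, "Huffman": 2}
print(pick_pages(pending, 2, history))  # ['Quant/ZLE', 'Huffman']
```

A static scheduler would instead fix this choice ahead of time; the function above sits at the fully dynamic end of the spectrum the slide mentions.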
Functional Simulation • FPGA based on HSRA [Berkeley, FPGA ’99] • CP: 512 4-LUTs • CMB: 2 Mbit DRAM • Area for CP-CMB pair: 0.25 µm: 12.9 mm² (1/9 of PII-450); 0.18 µm: 6.7 mm² (1/16 of PIII-600) • Page reconfiguration: 5000 cycles (from CMB) • Synchronous operation (same clock speed as processor) • x86 microprocessor • Page Scheduler task • Swap on timer interrupt (every 250,000 cycles) • Fully dynamic scheduling
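As a rough check on these parameters, the sketch below computes the share of each timeslice spent reconfiguring. It assumes (the slide does not say this) that each timer interrupt triggers `swaps_per_slice` reconfigurations of 5000 cycles each, counted inside the 250,000-cycle slice. The function name is hypothetical.

```python
RECONFIG_CYCLES = 5_000      # page reconfiguration from CMB (from the slide)
TIMESLICE_CYCLES = 250_000   # timer interrupt period (from the slide)


def reconfig_overhead(reconfig=RECONFIG_CYCLES,
                      timeslice=TIMESLICE_CYCLES,
                      swaps_per_slice=1):
    """Fraction of a timeslice spent reconfiguring rather than computing."""
    return swaps_per_slice * reconfig / timeslice


# One reconfiguration per slice.
print(f"{reconfig_overhead():.1%}")
# Assumption: context save and load each cost one reconfiguration.
print(f"{reconfig_overhead(swaps_per_slice=2):.1%}")
```

Under these assumptions the overhead is 2% of the slice for a single reconfiguration and 4% for two, in contrast to the per-cycle reconfiguration of netlist virtualization.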
Applications • Multimedia processing applications • Hand-partitioned into 512-LUT pages • Good applications • Primarily feed-forward (feedback loops fit in HW) • Bad applications • Large, tight feedback loops (e.g. ADPCM)
Application       Pages   Segments
JPEG Encode       13      6
JPEG Decode       13      4
MPEG Encode       45      102
Wavelet Encode    14      6
Wavelet Decode    15      6
Application: JPEG Encode
Scaling Results: JPEG Encode [Figure: Total Time (makespan in millions of cycles) vs. Physical Compute Pages]
Summary • SCORE enables software survival on reconfigurable systems • Binary compatibility • Automatic performance scaling • Virtual Hardware • Requirements: • Graph-based compute model • Paged FPGA hardware • Run-time support for RTR/Scheduling
Future Work • Compilation/CAD • Partitioning FSM operators into pages • Study architectural parameters • Page size • CMB size • Tolerable reconfiguration time • Scheduling • Static scheduling
More Info on the Web • SCORE project: • http://brass.cs.berkeley.edu/SCORE/ • Tutorial: • http://brass.cs.berkeley.edu/documents/score_tutorial.html