The Scale Vector-Thread Processor
Ronny Krashinsky, Christopher Batten, Krste Asanović
Computer Science and Artificial Intelligence Laboratory, MIT
http://cag.csail.mit.edu/scale

Vector-Thread Architecture

The vector-thread (VT) programming model includes a control processor which manages an array of virtual processors (VPs). A virtual processor contains a set of registers and can execute RISC-like instructions. VP instructions are grouped into atomic instruction blocks (AIBs), the unit of work issued to a VP at one time. AIBs can be sent to the VPs using either a vector-fetch, where the AIB is broadcast to all VPs, or a thread-fetch, where a VP fetches its own AIB. The VT programming model also includes vector memory operations, which efficiently move data between all of the VPs and memory, as well as VP memory operations, which allow an individual VP to perform its own independent loads and stores. Finally, a cross-VP network enables a VP to send data directly to the next VP.

Scale Microarchitecture

[Block diagram: the control processor and the vector-thread unit share a crossbar and arbiter to a non-blocking cache built from four 8KB SRAM banks (with MSHR and tag arrays), which connects to a synchronous memory interface and an asynchronous host interface (HTIF). The vector-thread unit contains four lanes; each lane holds four execution clusters (C0-C3), a command management unit (CMU), a vector memory unit (VMU), per-VP registers, and pending thread-fetch state. The vector load, store, and refill units (VLU, VSU, VRU), the AIB fill unit, and the cross-VP network tie the lanes together.]

Data Parallel Loop with No Conditionals

The VT model enables efficient execution of purely vectorizable code. Traditional vector machines need to use vector registers to store all intermediate values. AIBs allow VT to use low-energy temporary state (such as bypass latches) for intermediate results. In contrast to threaded machines, VT can make use of efficient vector memory accesses and fine-grain synchronization.

C Code:

  for (i = 0; i < N; i++) {
    x = A[i] + B[i];
    C[i] = x >> 15;
  }

[Execution diagram: the control processor issues vector loads of A and B, a vector-fetch of the add/shift AIB, and a vector store of C; VP0 through VPN execute the ld, +, >>, and st operations spread across Lanes 0-3.]
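To make the vector-fetch and AIB semantics above concrete, the following is a minimal plain-C++ model of how the data-parallel loop maps onto the abstract VT machine. It is not Scale code: the VirtualProcessor struct, the register names, and the one-VP-per-element mapping are invented for illustration (a real program would stripmine the loop across the available VPs).

  #include <cstdint>
  #include <cstdio>
  #include <vector>

  // Hypothetical model of a virtual processor: a handful of private registers.
  struct VirtualProcessor {
    int32_t r0 = 0, r1 = 0, r2 = 0;
  };

  int main() {
    const int N = 8;  // one VP per element, purely for illustration
    std::vector<int32_t> A(N, 1 << 16), B(N, 1 << 15), C(N, 0);
    std::vector<VirtualProcessor> vps(N);

    // "Vector load": move one element of A and B into each VP's registers.
    for (int i = 0; i < N; i++) { vps[i].r0 = A[i]; vps[i].r1 = B[i]; }

    // "Vector fetch": broadcast one AIB (modeled as a lambda) to every VP.
    // The intermediate sum x lives only in VP-local temporary state, not in
    // a long-lived vector register.
    auto aib = [](VirtualProcessor& vp) {
      int32_t x = vp.r0 + vp.r1;  // add
      vp.r2 = x >> 15;            // shift
    };
    for (auto& vp : vps) aib(vp);

    // "Vector store": move each VP's result back to memory.
    for (int i = 0; i < N; i++) C[i] = vps[i].r2;

    for (int i = 0; i < N; i++) std::printf("C[%d] = %d\n", i, (int)C[i]);
    return 0;
  }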
ADPCM Decode Execution Example

C Code:

  void decode(int len, u_int8_t* in, int16_t* out) {
    int i;
    int index = 0;
    int valpred = 0;
    for (i = 0; i < len; i++) {
      u_int8_t delta = in[i];
      index += indexTable[delta];
      index = index < IX_MIN ? IX_MIN : index;
      index = IX_MAX < index ? IX_MAX : index;
      valpred += stepsizeTable[index] * delta;
      valpred = valpred < VALP_MIN ? VALP_MIN : valpred;
      valpred = VALP_MAX < valpred ? VALP_MAX : valpred;
      out[i] = valpred;
    }
  }

Scale Code:

  vtu_decode_ex:
  .aib begin
    c0 sll  pr0, 2      -> cr1             # word offset
    c0 lw   cr1(sr0)    -> c1/cr0          # load index
    c0 copy pr0         -> c3/cr0          # copy delta
    c1 addu cr0, prevVP -> cr0             # accum index
    c1 slt  cr0, sr0    -> p               # index min
    c1 psel cr0, sr0    -> sr2             # index min
    c1 slt  sr1, sr2    -> p               # index max
    c1 psel sr2, sr1    -> c0/cr0, nextVP  # index max
    c0 sll  cr0, 2      -> cr1             # word offset
    c0 lw   cr1(sr1)    -> c3/cr1          # load step
    c3 mulh cr0, cr1    -> c2/cr0          # step*delta
    c2 addu cr0, prevVP -> cr0             # accum valpred
    c2 slt  cr0, sr0    -> p               # valpred min
    c2 psel cr0, sr0    -> sr2             # valpred min
    c2 slt  sr1, sr2    -> p               # valpred max
    c2 psel sr2, sr1    -> pr0, nextVP     # valpred max
  .aib end

Although the algorithm includes many iteration-independent instructions, it also includes two loop-carried dependencies which must be executed serially. Each iteration executes over a period of 45 cycles, but the combination of multiple lanes, cluster decoupling within each lane, and the cross-VP network leads to as many as six loop iterations (VPs) executing simultaneously.

Data Parallel Loop with Conditionals

Traditional vector machines usually use store masks (or possibly predication) to handle loops with conditional control flow. Although predication is also available in the VT model, one can instead use conditional thread-fetches to avoid executing unused instructions. Efficient outer-loop parallelization is enabled by simply using a vector-fetch for the outer loop and thread-fetches for the inner loop. Inner loops with data-dependent exit conditions are also possible.

C Code:

  for (i = 0; i < N; i++) {
    x = A[i];
    if (B[i] == 1)
      x = x * 37;
    C[i] = x;
  }

[Execution diagram: the control processor issues vector loads and a vector-fetch; each VP compares B[i] (==) and conditionally thread-fetches (br) the multiply (x), and a vector store then writes C.]

Loop with Cross Iteration Dependencies

Traditional vector machines can usually handle only a few special cross-iteration dependencies, and threaded machines usually need to pass these values through memory. VT can use the fine-grain cross-VP network to pass values to the next iteration of the loop; the execution of the next iteration waits until the data is received from the previous iteration. The control processor initializes the cross-VP network and reads the final result.

C Code:

  x = 0;
  for (i = 0; i < N; i++) {
    y = A[i] + x;
    x = y >> 15;
    B[i] = x;
  }

[Execution diagram: the control processor performs a vector load (vload), writes the initial x onto the cross-VP network (xvpwrite), and vector-fetches the add/shift AIB; each VP receives x from prevVP and forwards its result to nextVP. The control processor then reads the final x back (xvpread) and issues the vector store of B (vstore).]
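As a rough illustration of the cross-VP mechanism described above, here is a sequential C++ sketch in which each loop iteration plays the role of one VP: the recurrence value travels through a modeled cross-VP variable rather than through memory, with the control processor seeding it and reading the final value. The crossVP variable and the one-iteration-per-VP mapping are illustrative only, not Scale code.

  #include <cstdint>
  #include <cstdio>
  #include <vector>

  int main() {
    const int N = 8;
    std::vector<int32_t> A(N, 1 << 16), B(N, 0);

    // Control processor seeds the recurrence onto the modeled cross-VP
    // network (the role of xvpwrite).
    int32_t crossVP = 0;

    // Each iteration of this loop plays the role of one VP. The load of A[i]
    // is independent work; the add cannot start until the value arrives from
    // the previous VP (in hardware the receiving cluster simply stalls).
    for (int i = 0; i < N; i++) {
      int32_t a = A[i];      // independent of other iterations
      int32_t x = crossVP;   // receive from prevVP
      int32_t y = a + x;
      x = y >> 15;
      B[i] = x;
      crossVP = x;           // send to nextVP
    }

    // Control processor reads the final value (the role of xvpread).
    std::printf("final x = %d, B[N-1] = %d\n", (int)crossVP, (int)B[N - 1]);
    return 0;
  }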
Free Running Threads

VT includes atomic memory operations to allow the VPs to run fully threaded code. For example, the control processor might push data into a shared work queue, and each VP then uses atomic operations to pull work off the queue and process it. Each VP uses thread-fetches for independent control flow and VP loads and stores to access memory.

Pseudocode:

  for (i = 0; i < N; i++)
    spawn(&thread);

  thread() {
    while (1) {
      y = atomic(x | 1);   // atm.or: try to grab the lock
      if (y == 0) {        // lock was free
        count++;
        atomic(x & 0);     // atm.and: release the lock
      }
    }
  }

[Execution diagram: each VP runs an independent stream of thread-fetches (fetch, br), VP loads and stores (ld, st), and atomic memory operations (atm.or, atm.and).]

Single Scale Cluster

The datapath, register file, and execute directive queue have all been preplaced using our custom framework. Although we were able to use the Artisan memory compiler to generate an SRAM for the AIB cache, there was no suitable memory compiler for the cluster's 2-read/2-write-port register file. With preplacement we were able to make use of special latch standard cells and hierarchical tri-state bit-lines to create a reasonable register file implementation using purely static logic. We used a similar technique for the various CAM arrays in the design.

[Figure: a single Scale cluster, showing the control logic, execute directive queue, register file, AIB cache, and datapath with its adder and shifter.]

Standard Cell Preplacement

We have developed a C++ based preplacement framework which manipulates standard cells using the OpenAccess libraries. Preplacement has several important advantages, including improved area utilization, decreased congestion, improved timing, and decreased tool run time. The framework allows us to write code that instantiates and places standard cells in a virtual grid and programmatically creates logical nets to connect them together. The framework then processes the virtual grid to determine the absolute position of each cell within the preplaced block. We preplaced 230 thousand cells (58% of all standard cells) in various datapaths, memory arrays, and crossbar buffers and tri-states. After preplacing a block, we export a Verilog netlist and a DEF file for use by the rest of the toolflow. Although the preplaced blocks do not need to be synthesized, we still feed them into the synthesis tool so that it can correctly optimize logic which interfaces with the preplaced blocks. During place & route we use TCL scripts to flexibly position the blocks around the chip. Although we preplace the standard cells, the datapath routing is done automatically; we have found that automatic routing produces reasonably regular routes for signals within the preplaced blocks.
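The following self-contained C++ is a sketch of the virtual-grid idea only, not of the actual framework or the OpenAccess API: cells are placed by row and column, then resolved to absolute coordinates and printed as DEF-style placement records. The VirtualGrid and Cell classes, the cell masters (FADDX1, MUX2X1), and the row height and column pitch are all hypothetical.

  #include <cstdio>
  #include <string>
  #include <vector>

  // Hypothetical cell record: instance name, standard-cell master, and a
  // position in the virtual grid (row = datapath stage, col = bit slice).
  struct Cell {
    std::string instance;
    std::string master;
    int row;
    int col;
  };

  class VirtualGrid {
   public:
    VirtualGrid(double rowHeight, double colPitch)
        : rowHeight_(rowHeight), colPitch_(colPitch) {}

    void place(const std::string& inst, const std::string& master,
               int row, int col) {
      cells_.push_back(Cell{inst, master, row, col});
    }

    // Resolve virtual-grid positions to absolute coordinates at the block
    // origin and print DEF-style placement records.
    void emit(double originX, double originY) const {
      for (const Cell& c : cells_) {
        double x = originX + c.col * colPitch_;
        double y = originY + c.row * rowHeight_;
        std::printf("- %s %s + PLACED ( %.2f %.2f ) N ;\n",
                    c.instance.c_str(), c.master.c_str(), x, y);
      }
    }

   private:
    double rowHeight_, colPitch_;
    std::vector<Cell> cells_;
  };

  int main() {
    // Illustrative pitches and cell masters only.
    VirtualGrid grid(/*rowHeight=*/5.04, /*colPitch=*/2.8);
    for (int bit = 0; bit < 32; bit++) {
      grid.place("adder/fa" + std::to_string(bit), "FADDX1", /*row=*/0, bit);
      grid.place("mux/m"   + std::to_string(bit), "MUX2X1", /*row=*/1, bit);
    }
    grid.emit(0.0, 0.0);
    return 0;
  }

In the real flow, as described above, the resolved placements and the generated netlist are exported as Verilog and DEF for the synthesis and place & route tools.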
Implementation and Verification Toolflow

[Toolflow diagram: C++ microarchitectural models lead to the golden Verilog RTL, and preplacement code drives the standard cell preplacement framework. Tenison VTOC translates the Verilog into a C++ RTL simulator (2-state RTL sim), which is checked against the C++ ISA simulator (functional sim) and Synopsys VCS (4-state RTL sim) by running directed and random test programs and comparing test memory dumps. Synopsys Design Compiler synthesizes the design against Artisan SRAMs and standard cells; Cadence First Encounter performs floorplanning, clock, power, and place & route; Synopsys Formality verifies the RTL against the gates; Mentor Graphics Calibre runs DRC, LVS, and RCX; and Synopsys Nanosim provides transistor-level simulation.]

Scale Prototype Chip in TSMC 0.18μm

Clock and Power Distribution

The chip includes a custom-designed voltage-controlled oscillator to enable testing at various frequencies. Scale can be clocked either by the on-chip VCO or by an external clock input. The clock tree was automatically synthesized using Encounter, and the maximum trigger-edge skew is 233ps. Scale uses a fine-grained power distribution grid over the entire chip. The standard cells have a height of nine Metal 3/5 tracks and draw power and ground from Metal 1 strips which cover two tracks. We route power/ground strips horizontally on Metal 5, directly over the Metal 1 power/ground strips, leaving seven Metal 3/5 tracks unobstructed for signal routing. We route power/ground strips vertically on Metal 6; these cover three Metal 2/4 tracks and are spaced nine tracks apart. The power distribution uses 21% of Metal 6 and 17% of Metal 5.

Chip Test Platform

The chip test infrastructure includes a host computer, a test baseboard, and a daughter card with a socket for Scale. The test baseboard includes a host interface and a memory controller implemented on a Xilinx FPGA, as well as 96MB of SDRAM, configurable power supplies, and a tunable clock generator. The host interface is clocked by the host and uses a low-bandwidth asynchronous protocol to communicate with Scale, while the memory controller and the SDRAM use a synchronous clock generated by Scale. Using this test setup, the host computer is able to download and run programs on Scale while monitoring the power consumption at various voltages and frequencies. To allow us to run real programs which include file I/O and other system calls, a simple proxy kernel marshals up system calls, sends them to the host, and waits for the results before resuming program execution.

Microarchitectural Simulation Results

[Bar chart: kernel speedup over the Scale control processor for rgbcmyk, rgbyiq, hpg, adpcm.dec, sha, dither, lookup, pntrch, and qsort with 1, 2, 4, and 8 lanes.]

Kernel speedup is measured against compiling the benchmark and running it on the Scale control processor. Mem-B is the average number of bytes of L1-to-main-memory traffic per cycle. Ld-El and St-El are the numbers of load and store elements transferred to and from the cache per cycle. Loop types include: data-parallel loops with no control flow (DP), data-parallel loops with control flow or inner loops (DC), loops with cross-iteration dependencies (XI), and free-running threads (FT). Memory access types include: unit-stride and strided vector memory accesses (VM), segment vector memory accesses (SVM), and individual VP loads and stores (VP).

Initial Results

[Table: initial results for running the adpcm.dec benchmark from on-chip RAM with no cache tag checks.]

This work was partially supported by a DARPA PAC/C award, an NSF CAREER award, an NSF graduate fellowship, a CMI research grant, and donations from Infineon Technologies and Intel. We acknowledge and thank Albert Ma for designing the VCO and providing extensive help with CAD tools, Mark Hampton for implementing VTorture and the Scale compilation tools, Jaime Quinonez for the baseline datapath tiler implementation, Jared Casper for work on an initial cache design and documentation, and Jeffrey Cohen for initial work on VTorture.