Polymorphic Processors: How to Expose Arbitrary Hardware Functionality to Programmers

Polymorphic Processors:How to Expose Arbitrary Hardware Functionality to Programmers Stamatis Vassiliadis Computer Engineering, EEMCS, TU Delft http://ce.et.tudelft.nl Member of HiPEAC

… 50% ASIC 20% Very Large The limitation: B • Techniques: • ILP • pipeline • technology 10X Potential Zero Execution (PZE) introduced in 87-88 and published in IBM Journal of R&D 94 Timewise we execute two instructions (50% code elimination) 2X 0.5 0.9 PZE and the Amdahl’s law program Max speedup = 2.0 Excluding start-up reduced 5 cycles to 3 speedup 1.6 83% efficiency Why polymorphic? We can ride the Amdahl’s curve easier and faster

predictive coding compression (ZIP) Original image Filtered image • Goal: get image with more 0’s bitstream • Is it possible?: spatial redundancy (adjacent pixels often have same values => many differences between them =0 ) Transmission: decoding Research questions: UNZIP Original image Filtered image • What does Paeth means in terms of computations? • Can I put it on hardware? • What is my gain? bitstream Motivating example Paeth coding

c b a Paeth(d)= one ofa,b,c, which is closest to initial prediction p = a+b-c Original Filtered Paeth p=a+b-c Filtered=Original-Paeth =4 - 4 =0 0 0 0 0 0 0 0 0 0 0 0 0 pb=|p-b| pa=|p-a| pc=|p-c| pa<=pb? pb<=pc? pa<=pc? 0 3 0 0 0 3 3 3 0 0 3 3 0 0 1 0 0 3 4 4 0 3 3 4 1 0 c b 1 0 a d 0 0 0 0 0 3 4 5 0 3 4 4 c=3, b=3 a=4, d=4 p =4+3-3=4 Paeth(d)=a=4 1 0 Paeth area:…………… 6 8-bit adders Motivating example

C-code Altivec code What it does CSI code bptr = prev_row+1; dptr = curr_row+1; predptr= predict_row+1; for(i=1; i < length; i++){ c = *(bptr-1); b = *bptr; ... .... ... if(...) *predptr = a; else if (..) else *predptr = c; ...... bptr++; } initialize li r5, 1 csi_mt_scr r1, SCR1, 0 csi_mt_scr r5, SCR1, 1 ..totally 20 instructions load unpack process csi_paeth predptr, bptr,dptr pack ONE INSTRUCTION For all loop iterations store Looping Altivec code Example: Paeth Prediction (PNG) li r5, 0 ….totally 6instructions loop: lvx vr03, r1 # load c's lvx vr04, r2 # load a's vsidoi vr05, vr01, vr03, 1 # load b's vmrghb vr07, vr03, vr00 # unpack vmrglb vr08, vr03, vr00 # unpack …totally 6 instructions #Compute vadduhs vr15, vr09, vr11 # a+b vadduhs vr16, vr10, vr12 # vsubshs vr15, vr15, vr07 # vsubshs vr16, vr16, vr08 # ..totally 76 instructions #Pack: vpkshus vr28, vr28, 29 # pack #Store: stvx vr28, r3, 0 # store #Loop control addi r1, r1, 16 …….. bneq r7, r0, loop # Loop • Altivec iteration: 95 instructions per 16 pixels. • CSI code : 1 instruction for all iterations (+20 setup instructions) CSI Instruction design : latency: …………. 5 cycles throughput: ………16 pixels/1 cycle ( EUROMICRO99) area:…………… 24 32-bit adders Cycle = 1 ALU operation

Execution time: on 4-issue CPU, with 32 byte-wide CSI unit, normalised to non-CSI execution Dynamic instruction counts, normalised to non-CSI counts Results: Instruction count and execution time reduction Bench: Paeth kernel, 132-element vectors (132 pixels in a row)

Programming paradigm (HW and SW descriptions coexisting in a program) Processor architecture (behavior + logical structure) Compilation Microarchitecture New kind of tools Research Questions Motivating example: Obvious observations NO way I can do this on fixed hardware I can do this if the hardware changes functionality at my wishes. EASIER SAID THAN DONE ! I have to answer the following: How can I identify the code for hardware implementation? How can I implement “arbitrary” code? Is the hardwired code substituted by new instructions? How can I substitute this code with SW/HW descriptions say at the source level? How can I automatically generate the “transformed” program?

Program P’ DATA Outline Program P GPP RH  A MEM FPGA What to do: RESULTS • Identify the “” code • Show hardware feasibility of “” in FPGA Tools Microarchitecture Architecture Programming Paradigm Compiler • Map “” into reconfigurable hardware (RH) • Eliminate the identified code • Add code to have “equivalent” behavior • Compile new program • Execute MOLEN Introduce reconfigurable microcode (- code) Specific code in hardware left to the programmer/hardware designer One time 8 new instructions for any ISA Co-processor paradigm (e.g. vector) New register file for parameter passing Sequential consistency Split-join parallelism Function like code

Code Architecture … Re- targeted Compiler int fact(int n) { if(n<1) return n else return(n*fact(n-1)); } Binary Code call f(.) HDL … XILINX VIRTEX-II PRO FPGA IBM PowerPC HDL f(.) Tool Chain New Program where Hardware/software descriptions co-exist Human Directives C2C Critical ? NO A U T O L I B hand coded YES

The MOLEN ISA Divide RC into two logical phases “SETEXECUTE address” “function” independent No new op-codes Implementation and ISA independent Reconfigurable design (two instructions) Parameter passing: twonew instructions + Register file Execute on reconfigurable One instruction Arbitrary number of parameter passing Speeding up: reconfiguration and execution Twoinstructions for prefetching Parallel execution : split via a Molen instruction and join via a GPP instruction or onespecial instruction Modularity: by implementing at least the minimal MOLEN instruction set and by reconfiguring to it. Total: 8new instructions ( SAMOS‘03)

Instruction Set Partitioning 8 instructions grouped in 6 instruction categories: partial SET (P-SET) Complete SET (C-SET) SET < address > EXECUTE < address > MOVTX and MOVFX. SET PREFETCH < address > EXECUTE PREFETCH < address > BREAK: Minimal Preferred Complete

In parallel no data dependency Sequence Control Example h: mov a -> r1 movtx r1 ->XR2 mov b -> r2 movtx r2 ->XR3 mov c -> r3 movtx r3 -> XR4 set address_set_op1 set address_set_op2 ldc 2 ->r4 movtx r4 ->XR0 ldc 4 ->r5 movtx r5 ->XR1 execute address_ex_op1 execute address_ex_op2 movfx XR2 -> r6 mov r6 -> m movfx XR4 -> r7 mov r7 -> n #pragma call_fpga op1 int f( int x, int y) { … } #pragma call_fpga op2 int g(int x) { … } int h(int a, int b, int c) { int m,n, ...; m=f(a, b); n=g(c); …… }

MicroProgram On-chip storage Permanently stored From memory Permanently stored Reconfigurable Microcode Storage Frequently used FIXED Less frequently used PAGEABLE Frequently used • Fixed on-chip storage for frequently used microcode • Pageable on-chip storage for less frequently used microcode ( IEEE MICRO‘03)

 R/P CS- /  H CS- (fixed) CS-, if present (pageable) = reconfigurable unit (CCU) from execution hardware Residence Table SEQUENCER CSAR  set FIXED PAGEABLE FIXED PAGEABLE  CS- = reconfigurable unit (CCU) to execution hardware execute microinstruction CONTROL STORE - The -code unit Determine next microinstruction M I R

memory 00: load values into adder 01: shift_ins 02: add_ins 03: shift2_ins 04: SKIP 05: BACK 06: store 07: end_op Resident (0); Pageable (1) end_op Control Store address (CS-); Memory address () More on Architectural support Instruction format An example microprogram: • located in memory starting at address  • address  point to first microinstruction • terminated by an end_op instruction word OPC address 

Arbitrates (redirects) instructions between GPP and RP The arbiter also controls the loading of microcode CCU has direct access to the data memory X registers to exchange parameters between GPP and RU -unit controls CCU by microinstructions The MOLEN -coded processor (FPL’01)

The Molen Prototype Molen prototype implemented on Virtex II Pro Molen machine organization

Trivial HW costs The Prototype Features A VHDL model has been synthesized for Virtex II Pro technology • 64KBytes data and 64KBytes instructions (on-chip) mems; • 64-bit data memory bus; • 64-bit instruction memory bus; • 64 bitsmicrocode word length; • 32MBytes, memory segment for microprograms; • 8Kx64-bit -control store using Dual Port Block RAMs (BRAM); • 512x32-bit XREGs implemented in BRAMs. • Three clock domains: • PPC clock – 250MHz; • MEM clock – 83 MHz; • User clock – external. Utilization of FPGA resources (no CCU): ( FCCM 04 )

Compiling for the Molen C application FCCM Compiler File_n.c MAIN.c SUIF frontend MOLEN extension Machine SUIF backend framework alpha backend x86 backend

SUIF + MachineSUIF Molen extension ISA extension (SET/EXEC) PowerPC backend Register extension (XRs) The Molen Compiler • IBM PowerPC 405 GPP in Virtex II Pro • Register file extension (XRs) • ISA extension ( FPL 03-04 )

Code for a “function” • Example: C code: res = alpha(param1, param2); HW movtx XR1 ← param1 movtx XR2 ← param2 Send parameters HW reconfiguration set <address_alpha_set> HW execution exec <address_alpha_exec> movfx res ← XR3 Return result

Original code Modified code mrk 2, 14 mov $vr2.s32 <- main.z movtx $vr1.s32(XR) <- $vr2.s32 ldc $vr4.s32 <- 7 movtx $vr3.s32(XR) <- $vr4.s32 set address_op1_SET ldc $vr6.s32(XR) <- 0 movtx $vr7.s32(XR) <- vr6.s32 exec address_op1_EXEC movfx $vr8.s32 <- $vr5.s32(XR) mov main.x <- $vr8.s32 main: mrk 2,13 ldc $vr0.s32 <- 5 mov main.z <- $vr0.s32 mrk 2, 14 ldc $vr2.s32 <- 7 cal $vr1.s32 <- f(main.z, $vr2.s32) mov main.x <- $vr1.s32 mrk 2, 15 ldc $vr3.s32 <- 0 ret $vr3.s32 .text_end main Sequence Control Example Code generation: C code #pragma call_fpga op1 int f(int a, int b){ int c,i; c=0; for(i=0; i<b; i++) c = c + a<<i + i; c = c>>b; return c; } void main(){ int x,z; z=5; x= f(z, 7); }

SAD16 SAD128 SAD256 DCT IDCT carphone 6.5 18.9 22.2 302.3 24.4 claire 8.3 23.9 28.2 302.2 24.4 container 12.2 35.2 41.5 302.1 24.4 tennis 12.1 35.0 41.2 302.1 32.3 MPEG-2 encoder MPEG-2 decoder SAD16 SAD128 SAD256 DCT IDCT IDCT carphone 1.76 1.94 1.95 1.14 1.01 1.94 claire 1.90 2.06 2.08 1.13 1.01 1.56 container 2.07 2.20 2.21 1.12 1.01 1.63 tennis 2.22 2.40 2.41 1.10 1.01 1.65 The Experiment (hand tuned HW) Step 1. Obtain MPEG-2profiling data on a PowerPC system Step 2. Measurethe kernels speedups on the prototype: Step 3. Overall speedup per kernel

The MOLEN prototype speeds the MPEG-2 codec up between 93% and 98% of the theoretically max. attainable speedups. SAD TSE Performance gain Recall Smax a Implem. in reconf. Theoretically attainable MAX = 0 Speedup DCT 3.2 3.0 2.8 MPEG-2 Encoder 2.6 Measured experimentally 0.65 0.67 0.68 0.71 a Real vs. Theoretical Speedups Step 4. Application speedup Time

35.1 33.7 91.5 54 mpeg2enc Instruction Counts 137 million 46 million

M-JPEG (HWAutomatically Generated ) • M-JPEG multimedia benchmark • DCT * hardware implementation • Molen prototype ( FPL 04 )

Performance MJPEG 2.5 speedup

Conclusions • We have shown a new: • microarchitecture • processor architecture • programming paradigm • compilation • We have shown that it is easier and faster to ride the Amdahl’s curve with polymorphic processors!

Contact information Computer Engineering Laboratory: http://ce.et.tudelft.nl MOLEN homepage: http://ce.et.tudelft.nl/MOLEN Personal homepage: http://ce.et.tudelft.nl/~stamatis OVERVIEW Paper: The Molen Polymorphic Processor IEEE Transactions on computers NOV 04

Polymorphic Processors: How to Expose Arbitrary Hardware Functionality to Programmers

Polymorphic Processors: How to Expose Arbitrary Hardware Functionality to Programmers

Presentation Transcript

Computing System Element Choices

Chapter 5 Array Processors

Hardware Design

Introduction of Intel Processors

Polymorphic Malware Detection

A+ Guide to Hardware: Managing, Maintaining, and Troubleshooting, Sixth Edition

The Microarchitecture of FPGA-Based Soft Processors

COSC2767: Object-Oriented Programming

High Performance Computing on Heterogeneous Networks

Hardware Description Language

2-Hardware Design Basics of Embedded Processors

Status Report

A+ Guide to Hardware: Managing, Maintaining, and Troubleshooting, 5e

2-Hardware Design Basics of Embedded Processors (cont.)

Polymorphic SARPs

High-Performance Hardware Monitors to Protect Network Processors from Data Plane Attacks

Chapter Overview

Assembly Language

Embedded Hardware

Lecture 10: Processors

Multiscalar processors