Reconfigurable / FPGA HPC computing in 2014
Reconfigurable Computing Roberto Innocente inno@sissa.it Part 1 of 2 May 10, 2014 R.Innocente 1
Flexibility (from - to +)
- ASIC (Application Specific Integrated Circuit): very inflexible, designed to solve just 1 problem. Energy, space and time efficient.
- Reconfigurable hardware (?): flexible, but still sufficiently energy, time and space efficient.
- GPP (General Purpose Processor): very flexible, can solve any problem. Energy, space and time inefficient.
History
Gerald Estrin/1
Gerald Estrin is credited with having proposed, in the 1960s, the first reconfigurable computer: the Fixed plus Variable (F+V) structure computer.
Gerald Estrin. ACM 1960. Organization of computer systems: the fixed plus variable structure computer.
Gerald Estrin/2
He envisioned that important gains in performance could be achieved when many computations are executed on appropriate problem oriented configurations.
F+V is made of:
- a high speed general purpose computer (the F part): initially an IBM 7090
- high speed special structures of various sizes (the V part), problem specific: trigonometric functions, logarithms, exponentials, n-th powers, complex arithmetic, …
V is made of a motherboard with 36 module positions, which can undergo:
- Function reconfiguration: physically changing some modules
- Routing reconfiguration: changing part of the back wiring
The Rammig machine (1977): investigation of a reconfigurable machine with no manual or mechanical intervention
Today's reconfigurable hardware
It was born out of the wish to replace assorted discrete logic ICs (Integrated Circuits), and was later used to rapidly prototype large ASICs (Application Specific ICs) or to implement SoCs (Systems on Chip).
The following slides assume readers who work in scientific computing and are not electrical engineers.
Basic digital circuits
- AND
- INVERTER
- OR
- MUX
- D type FF (flip-flop)
- Shift register
Usually 0 = 0 V, 1 = some positive voltage
SSI 74xx IC
PLD
Inconveniences of standard discrete logic circuits:
- 14 pin packages with 4/6 logic functions each
- often you had to cross the whole PCB to find a free OR or inverter
- even if you needed only a few gates, you still had to place an IC with 4/6 of them
Hence the idea of the PLD (Programmable Logic Device):
- SPLD (Simple: PAL/PLA)
- CPLD (Complex)
in which a simple interconnection network could be configured, by blowing some internal fuses (fuse technology), to implement combinational logic.
Disjunctive normal form (aka sum of products)
Each boolean function of some boolean variables can be represented as a sum of minterms (a minterm is the product of all variables, each either plain or complemented).
With 3 boolean variables a, b, c:
$f(a,b,c) = a\bar{b}c + \bar{a}b\bar{c}$
$a\bar{b}c$ and $\bar{a}b\bar{c}$ are 2 of the $2^3 = 8$ minterms.
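The sum-of-minterms idea can be sketched in a few lines of Python. This is only an illustration: the helper names `minterms` and `dnf`, the `majority` example function and the use of `x'` for the complement of `x` are assumptions made here, not part of the slides.

```python
from itertools import product

def minterms(f, n):
    """Return the 0/1 input tuples for which the boolean function f is true."""
    return [bits for bits in product((0, 1), repeat=n) if f(*bits)]

def dnf(f, names):
    """Build the sum-of-products string of f; x' denotes the complement of x."""
    terms = []
    for bits in minterms(f, len(names)):
        # One minterm: every variable appears, plain if 1, complemented if 0.
        terms.append("".join(v if b else v + "'" for v, b in zip(names, bits)))
    return " + ".join(terms) if terms else "0"

def majority(a, b, c):
    """Example function: true when at least two of the three inputs are 1."""
    return (a + b + c) >= 2

print(dnf(majority, "abc"))  # a'bc + ab'c + abc' + abc
```

Of the 8 possible minterms of 3 variables, the majority function uses 4.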
PLA (Programmable Logic Array)
$f_1 = p_1 + p_2 + p_3 = x_1 x_2 + x_1\bar{x}_3 + \bar{x}_1\bar{x}_2 x_3$
(the figure also shows the product term $x_1 x_3$)
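A PLA can be pictured as a programmable AND plane feeding a programmable OR plane. The sketch below models the f1 of this slide that way. The data layout (`AND_PLANE`, `OR_PLANE`) and the function name `pla` are hypothetical choices for illustration, not a description of any real device.

```python
# AND plane: each product term maps an input index to the value it requires
# (1 = plain input, 0 = complemented input); unlisted inputs are not connected.
AND_PLANE = [
    {0: 1, 1: 1},        # p1 = x1 x2
    {0: 1, 2: 0},        # p2 = x1 x3'
    {0: 0, 1: 0, 2: 1},  # p3 = x1' x2' x3
]

# OR plane: each output lists the product terms it sums.
OR_PLANE = {"f1": [0, 1, 2]}

def pla(inputs):
    """Evaluate all PLA outputs for a tuple of inputs (x1, x2, x3)."""
    products = [all(inputs[i] == v for i, v in term.items()) for term in AND_PLANE]
    return {name: int(any(products[k] for k in terms))
            for name, terms in OR_PLANE.items()}

print(pla((1, 1, 0)))  # {'f1': 1}
```

Reprogramming the device then amounts to changing the two tables, not the evaluation logic.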
FPGA
CPLDs also showed their limits; therefore, between 1985 and 1990, Xilinx introduced a more flexible design, the FPGA (Field Programmable Gate Array), in which the interconnection network is much more flexible and onto which sequential circuits can also be mapped easily.
FPGA idea
1985, Xilinx, Ross Freeman (inventor of the FPGA): "What if we could develop the equivalent of a circuit board full of standard logic parts (like TTL and PAL devices) on a single high density programmable logic chip?"
- post fabrication programmability by end users
- fabless semiconductor company
Today
FPGA market
Dominated by 2 players:
- Altera
- Xilinx
(Chart from sourcetech411, 2010)
Their combined share has grown from 67% in 2010 to 90% of the market today (4.5 billion USD revenues in 2012).
An important question: are FPGAs green?
FPGA: ~ 20 W
- Virtex-7 2000T (one of the top FPGAs): ~ 20 W
- Xilinx showed 3600 copies of its 8 bit processor nanoblaze running on a Virtex-7, consuming 20 W
CPU: ~ 100 W
- Core i7-4770K Haswell (22 nm), 3.5 GHz, 4 cores: 84 W
- Core i7-3930K Sandy Bridge-E (32 nm), 3.2 GHz, 6 cores: 130 W
- Xeon E7458 Dunnington (45 nm), 2.4 GHz: 90 W
- Xeon E7460 Dunnington (45 nm), 2.66 GHz: 130 W
GPU: ~ 220 W
- Nvidia Tesla M2090: 225 W
- Nvidia Tesla K20X: 235 W
This is only a partial answer. We need to be able to estimate FPGA performance to give a more useful index.
FPGA architecture
Sea of gates: logic blocks are like islands in a sea of interconnections
(Figure from RF and Wireless World)
Virtex family (from L. Zhuo)
- 1998 Virtex: 250 nm, 100 MHz, 25k-60k cells
- 2000 Virtex-E: 180 nm, 300 MHz, 1k-70k cells
- 2000 Virtex II: 150 nm, 420 MHz, up to 168 multipliers, up to 93k 4-LUTs
- 2005 Virtex-4: 90 nm, 500 MHz, up to 200k cells
- 2007 Virtex-5: 65 nm, 550 MHz, up to 330k cells
- Virtex-6: 40 nm, 288-2k DSP, up to 500k 6-LUTs
- 2010 Virtex-7: 28 nm, ~500 MHz, up to 2000k cells
- 2014 Virtex-US: 20 nm, up to 4400k cells
Up to ~ 7 billion transistors. For comparison:
- Intel 2014 15-core Xeon IvyBridge-EX: ~ 4.3 billion transistors
- Nvidia 2012 GK110 Kepler: ~ 7 billion transistors
FPGA/CPU evolution
Virtex-7 is not monolithic
2.5D technology: 4 FPGA tiles with a silicon interposer that provides 10k interconnections between layers
Enabling technologies
Programming technology/1 Disordered except at very low range Antifuse SRAM OTP (One Time Programmable) Pass transistor in switch block
Programming technology/2
Antifuse
- pros: cheap, small
- cons: requires special processing, one time programming
SRAM
- pros: can be deployed with a standard semiconductor process, can be easily reprogrammed
- cons: large area required (6 transistors)
Confware
The configuration of an FPGA (which is compiled to a stream of bits) is neither hardware nor software. Someone coined the neologism confware: the configuration of reconfigurable hardware.
How do you configure an FPGA?
The SRAM cells are chained as a long shift register, loaded serially by clocking in the confware.
Virtex-7 2000T = 440 Mbits of SRAM cells
(simplified: large FPGAs can also load the confware in parallel)
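Serial loading through a shift register can be sketched as follows. The function names are hypothetical, and the 100 MHz configuration clock used in the time estimate is purely an assumption for illustration, not a figure from the slides or from a datasheet.

```python
def load_confware(bitstream, n_cells):
    """Clock a bitstream into a chain of n_cells SRAM cells, one bit per clock."""
    cells = [0] * n_cells
    for bit in bitstream:
        # On each clock every cell takes the value of its predecessor
        # and the new bit enters the first cell of the chain.
        cells = [bit] + cells[:-1]
    return cells

def serial_load_seconds(n_bits, clock_hz):
    """Time needed to shift in n_bits at one bit per clock cycle."""
    return n_bits / clock_hz

print(load_confware([1, 0, 1, 1], 4))  # [1, 1, 0, 1]: the first bit ends up deepest
print(serial_load_seconds(440e6, 100e6))  # assumed 100 MHz clock
```

Under that assumed clock, 440 Mbits take a few seconds to shift in serially, which suggests why parallel loading matters for large devices.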
Logic Blocks/Logic Cells
Fine/coarse grain logic blocks
From:
- a single transistor (Crosspoint: went bankrupt)
- a logic gate
To:
- a complete processor (FPNA: field programmable node array)
NB. FPNA also stands for field programmable neural array
CLB (Configurable Logic Blocks)
Homogeneous:
- Logic cells: 4 input LUT (LookUp Table) + FlipFlop
Heterogeneous (modern development):
- Logic cells
- DSP (Digital Signal Processing)
- Memory blocks
- I/O blocks
This differentiation is necessary to allow operations like multiplication/addition to be mapped efficiently. The heterogeneous architecture is prevalent now.
The blocks are configured by SRAM bits, usually loaded through serial ports as already pointed out.
Standard Logic Cell
- 4 input LUT (16 bits of SRAM for configuration)
- D type FlipFlop
- 2:1 Mux (1 bit of SRAM for configuration)
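A minimal behavioural sketch of such a cell, assuming the usual arrangement in which the 2:1 mux selects between the LUT output and the flip-flop output. The class name `LogicCell` and its interface are hypothetical.

```python
class LogicCell:
    """4-input LUT + D flip-flop + 2:1 mux, configured by 16 + 1 SRAM bits."""

    def __init__(self, lut_bits, registered):
        self.lut_bits = lut_bits      # 16 configuration bits: the LUT contents
        self.registered = registered  # 1 configuration bit: mux selects the FF
        self.ff = 0                   # current state of the D flip-flop

    def clock(self, x3, x2, x1, x0):
        """Return the cell output for these inputs, then latch the LUT value."""
        address = (x3 << 3) | (x2 << 2) | (x1 << 1) | x0
        comb = (self.lut_bits >> address) & 1
        out = self.ff if self.registered else comb
        self.ff = comb  # the flip-flop captures the LUT output on the clock edge
        return out

XOR_X1_X0 = 0x6666  # LUT contents for x1 XOR x0 (x3 and x2 ignored)
cell = LogicCell(XOR_X1_X0, registered=1)
print(cell.clock(0, 0, 0, 1), cell.clock(0, 0, 0, 0))  # 0 1: output lags one clock
```

With `registered=0` the same cell behaves as pure combinational logic.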
Standard LUT (Look Up Table)
- 16 x 1 memory
- any boolean function of 4 inputs (Bit 3 .. Bit 0):
Dec Bin Out
0 0000 0
1 0001 1
2 0010 0
3 0011 0
4 0100 1
5 0101 0
6 0110 1
7 0111 1
.. .. ..
$f = \bar{x}_3\bar{x}_2\bar{x}_1 x_0 + \bar{x}_3 x_2\bar{x}_1\bar{x}_0 + \bar{x}_3 x_2 x_1\bar{x}_0 + \bar{x}_3 x_2 x_1 x_0$
NB. LUT rhymes with nut
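The LUT is just a 16 x 1 memory addressed by the 4 inputs. The sketch below stores the function f of this slide as 16 configuration bits and checks it against the boolean formula. The names `lut4`, `CONFIG` and `f` are illustrative, and rows 8 to 15 are taken to be 0 because every term of f contains the complement of x3.

```python
def lut4(config, x3, x2, x1, x0):
    """Evaluate a 4-input LUT: the inputs form an address into 16 config bits."""
    address = (x3 << 3) | (x2 << 2) | (x1 << 1) | x0
    return (config >> address) & 1

# Bit i of CONFIG is the output for input value i: rows 1, 4, 6 and 7 are 1.
CONFIG = 0x00D2

def f(x3, x2, x1, x0):
    """The same function written as a sum of products."""
    n3, n2, n1, n0 = 1 - x3, 1 - x2, 1 - x1, 1 - x0
    return (n3 & n2 & n1 & x0) | (n3 & x2 & n1 & n0) | (n3 & x2 & x1 & n0) | (n3 & x2 & x1 & x0)

print([lut4(CONFIG, (i >> 3) & 1, (i >> 2) & 1, (i >> 1) & 1, i & 1) for i in range(8)])
# [0, 1, 0, 0, 1, 0, 1, 1]
```

Changing the 16 bits changes the function; the lookup hardware stays the same.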
Uses of Logic Cell
- 2^4 = 16 x 1 bit memory
- any boolean function of 4 inputs
- 4:1 multiplexer
Virtex-7 Logic Block basics
Virtex-7 Logic slice
4 x 32 = 128 bit shift register
(Figure from Xilinx)
Virtex-7 CLB slice
- 6-input LUT
- 2 5-input LUTs with the same inputs
- 2 arbitrary boolean functions, one of 3 inputs and one of 2 inputs or fewer
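One way to picture a 6-input LUT that doubles as two 5-input LUTs with shared inputs: the 64 configuration bits are split into two halves of 32, and the sixth input selects between them. The sketch below follows that idea. The names `lut6`, `make_init`, and the outputs `o6`/`o5` are illustrative, and the exact pin behaviour of a real device should be checked against the vendor documentation.

```python
def lut6(init, a6, a5, a4, a3, a2, a1):
    """Return (o6, o5) for a 64-bit LUT content 'init'."""
    low = (a5 << 4) | (a4 << 3) | (a3 << 2) | (a2 << 1) | a1
    o5 = (init >> low) & 1                # lower 5-input LUT
    o6 = (init >> ((a6 << 5) | low)) & 1  # a6 picks the lower or upper half
    return o6, o5

def make_init(upper, lower):
    """Pack two 5-input functions f(a1, ..., a5) into one 64-bit LUT content."""
    init = 0
    for addr in range(32):
        bits = [(addr >> k) & 1 for k in range(5)]  # a1 .. a5
        init |= lower(*bits) << addr
        init |= upper(*bits) << (32 + addr)
    return init

def and5(*b):
    return int(all(b))

def parity5(*b):
    return sum(b) & 1

INIT = make_init(upper=parity5, lower=and5)
print(lut6(INIT, 1, 1, 1, 1, 1, 1))  # (1, 1): parity and AND of five ones
```

With the sixth input held at 1, the two outputs deliver two different functions of the same five inputs.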
Altera ALM
Interconnection network
Interconnection network
- Hierarchical routing
- Island type routing (predominant)
- Nearest neighbours
The interconnection network can consume 80% of the area of an FPGA!
Programmable switch
SRAM routing: coarse/fine grain 5 bit SRAM 1 bit SRAM
Details of island type routing
Disjoint/Wilton switch blocks
- Disjoint: a wire can only leave on a wire with the same number, which creates routing domains
- Wilton: a wire can change domain in at least one direction
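The effect of routing domains can be illustrated with a toy model that tracks only the wire number as a signal crosses successive switch blocks. The `disjoint` rule follows the definition above. The `rotating` rule is a deliberately simplified stand-in for a domain-changing block and is NOT the actual Wilton permutation. All names are hypothetical.

```python
def disjoint(track, width, turn):
    """Disjoint block: the wire number never changes."""
    return track

def rotating(track, width, turn):
    """Simplified domain-changing block: a turn moves to the next wire number."""
    return (track + 1) % width if turn else track

def reachable_tracks(block, start, width, hops):
    """Wire numbers reachable from 'start' after crossing up to 'hops' blocks."""
    seen = {start}
    frontier = {start}
    for _ in range(hops):
        # Try both going straight and turning at every block.
        nxt = {block(t, width, turn) for t in frontier for turn in (False, True)}
        frontier = nxt - seen
        seen |= nxt
    return seen

print(reachable_tracks(disjoint, 0, 4, 3))  # {0}: locked into one routing domain
print(reachable_tracks(rotating, 0, 4, 3))  # {0, 1, 2, 3}
```

In the disjoint case a signal that starts on wire 0 can only ever use wire 0, which is exactly what limits the router's freedom.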
Channel segments distribution
Columnar architecture: 7 series Xilinx FPGA
DSP blocks & floating point
FPGAs floating point in 1994
Fagin & Renard report that floating point operators can be implemented, but that doing so is impractical: no FPGA then in existence could contain even a single multiplier circuit!
B. Fagin and C. Renard. Field Programmable Gate Arrays and Floating Point Arithmetic. IEEE Transactions on VLSI Systems, 2(3), September 1994.
FPGA fp in 1995
Shirazi et al., along the same lines as Fagin & Renard, propose 2 custom fp formats, 16 and 18 bits wide in total, and provide add, sub, mul and div operators for them.
N. Shirazi, A. Walters, and P. Athanas. Quantitative Analysis of Floating Point Arithmetic on FPGA Based Custom Computing Machines. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, April 1995.
FPGA fp in 2002
Belanovic & Leeser present a library of parameterized, variable width floating point operators (a superset of the IEEE formats).
Pavle Belanovic and Miriam Leeser. A Library of Parameterized Floating-point Modules and Their Use. 2002.
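What "variable width" means can be sketched with a decoder parameterized by the exponent and fraction widths. This is not the cited library: the function `decode_float` is hypothetical, IEEE-style conventions (bias, hidden bit, subnormals) are assumed, infinities and NaNs are ignored, and the 7 + 10 split used for the 18 bit example is an assumption made only for illustration.

```python
def decode_float(bits, exp_bits, man_bits):
    """Decode a (sign, biased exponent, fraction) pattern of arbitrary widths."""
    sign = (bits >> (exp_bits + man_bits)) & 1
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    fraction = (bits & ((1 << man_bits) - 1)) / (1 << man_bits)
    bias = (1 << (exp_bits - 1)) - 1
    if exponent == 0:
        value = fraction * 2.0 ** (1 - bias)               # subnormal: no hidden 1
    else:
        value = (1 + fraction) * 2.0 ** (exponent - bias)  # normal: hidden 1
    return -value if sign else value

print(decode_float(0x3FC00000, 8, 23))  # IEEE single precision: 1.5
print(decode_float(0xC000, 5, 10))      # IEEE half precision: -2.0
print(decode_float(63 << 10, 7, 10))    # assumed 18 bit format (1 + 7 + 10): 1.0
```

Narrower formats trade range and precision for fewer logic resources per operator.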
What allowed the breakthrough?
The addition, by the major vendors, of hardware multipliers (called DSP blocks) to their FPGAs from 2000 on:
- first Xilinx, on Virtex II
- soon after Altera, on Stratix
Over the last decade this also sparked the interest of the HPC community: Cray XD1, SGI RASC, Convey HC1.
HPRC = High Performance Reconfigurable Computing
FPGA MAC operation
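A MAC (multiply-accumulate) step computes acc = acc + a * b, and chaining such steps gives a dot product. The sketch below models this with a wrap-around two's complement accumulator. The 48 bit width is an assumption suggested by the DSP48 name, and the function names `mac` and `dot` are hypothetical.

```python
def mac(acc, a, b, acc_bits=48):
    """One multiply-accumulate step with a wrap-around signed accumulator."""
    mask = (1 << acc_bits) - 1
    total = (acc + a * b) & mask
    if total >> (acc_bits - 1):  # top bit set: reinterpret as a negative value
        total -= 1 << acc_bits
    return total

def dot(xs, ys):
    """Dot product as a chain of MAC steps, one per pair of operands."""
    acc = 0
    for a, b in zip(xs, ys):
        acc = mac(acc, a, b)
    return acc

print(dot([1, 2, 3], [4, 5, 6]))  # 32
```

In hardware each step is a single pass through the block, so a dot product of length n needs n clock cycles per block.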
Virtex-7 DSP48 high level 1 bit 2 bit
(Figure from Xilinx)