Reconfigurable/FPGA computing, part 2: use of FPGAs in HPC
Reconfigurable computing Roberto Innocente inno@sissa.it Part 2 of 2 May 10, 2014 R.Innocente 63
FPGA lingo
Core
A core, in FPGA lingo, is a function ready to be instantiated into your design as a “black box”. It can be supplied as HDL or as a schematic, and it supports design re-use.
Soft/hard cores
On FPGAs, functional modules can be implemented:
- using standard FPGA resources (logic blocks, DSPs, memory blocks): softcores
- as an ASIC block on the FPGA die: hardcores
When the manufacturer puts a processor on the FPGA as a hardcore, the device is sold as a SoC (System on Chip): e.g. the dual ARM on the Xilinx Zynq-7000, or the PowerPC on Xilinx Virtex-II Pro / Virtex-4 FX.
IP/open cores
The "soft" attribute is implied: these are hardware designs written in an HDL (possibly using vendor libraries).
- open-source cores (http://opencores.org/): the OpenRISC 1000 architecture from the OpenCores community, the Lattice Semiconductor LM32, the LEON3 from Aeroflex Gaisler and the OpenSPARC family from Oracle
- proprietary IP (Intellectual Property) cores: floating-point operators, FFT, matrix computations
Commercial offers
Picocomputing
- SC6 1U: up to 16 FPGAs
- SC6 4U: up to 48 FPGAs
- EX-800, EX-600
(From PICOCOMPUTING)
Bittware Terabox: 16 Altera Stratix-V (from Bittware)
DINIGROUP: cluster of 4 Virtex-7 (from DINIGROUP)
DINIGROUP: cluster of 40 Kintex-7 (from DINIGROUP)
Maxeler MPC-X
Daresbury Lab, UK: the dataflow supercomputer will feature Maxeler-developed MPC-X nodes capable of an equivalent 8.52 TFLOPs per 1U and 8.97 GFLOPs/Watt.
Convey HC-2, HC-2ex
Cray XT5h
“Cray introduces an hybrid supercomputer that can integrate multiple processor architectures into a single system and accelerate high performance computing (HPC) workflows. The Cray XT5h delivers higher sustained performance, by applying alternative processor architectures across selected applications within an HPC workflow. The Cray XT5h supports a variety of processor technologies, including scalar processors based on AMD Opteron(TM) dual and quad-core technologies, vector processors, and FPGA accelerators.”
CHREC: Center for High Performance Reconfigurable Computing (UF/BYU/GWU/VTECH)
CHREC Novo-G: 384 FPGAs
“Novo-G is the most powerful reconfigurable supercomputer in the known world. This unique machine features 192 top-end, 40nm FPGAs (Altera Stratix-IV E530) and 192 top-end, 65nm FPGAs (Stratix-III E260).”
http://www.chrec.org/ (pronounced "shreck")
CHREC/2
BLAST, like Smith-Waterman, computes the local alignment of 2 sequences:
- Novo-BLAST, the Novo-G/CHREC implementation: faster, same sensitivity
IPC (Isotope Pattern Calculator) of a protein identification algorithm:
- speed-up of 52-366x on a single FPGA, 1259x on 4 FPGAs, 3340x on a node (16 FPGAs)
References for Applications
Linear Algebra for RC
Juan Gonzalez and Rafael C. Núñez, "LAPACKrc: Fast linear algebra kernels/solvers for FPGA accelerators" (JP 2009), DOD funded
DCT, FFT on FPGAs
U. Meyer-Baese, "Digital Signal Processing with Field Programmable Gate Arrays", 3rd edition (2007), Springer Verlag
MD on FPGA
There are many papers about porting Molecular Dynamics algorithms to FPGAs, with substantially positive conclusions from experiments on 1-2 FPGAs. In recent years, however, these results have faced an embarrassing comparison with ANTON (Shaw et al.). We can't forget that ANTON is a really huge machine consuming over 100 kW, made of 512 dedicated ASICs running at 1 GHz. Comparing it with a few FPGAs consuming 40-60 W is improper.
Reference: M. A. Khan, M. Chiu, M. C. Herbordt, "FPGA-Accelerated Molecular Dynamics" (2013)
Neural networks on FPGAs
Omondi, Rajapakse (editors), "FPGA Implementations of Neural Networks" (2006)
An ANN (Artificial Neural Network) in integer arithmetic performs 40x better than on a GPP (general-purpose processor), and this on an old FPGA, 3 generations old.
Altera Arria 10
Arria10
Arria 10 variable precision DSP block
[Figure: Altera DSP block with inputs A, B, C, D; A + C*D = 2 flops]
Arria 10 estimated single-precision FP performance
- 2 flops per cycle per DSP block
- 1688 single-precision floating-point DSP blocks (GX660)
1688 * 2 = 3376 flops per cycle
3376 * 0.5 GHz ~ 1.7 Teraflops in single precision
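The arithmetic on this slide can be reproduced with a few lines of Python. This is only a sketch of the peak estimate: the function name `peak_flops` is made up for illustration, and the 0.5 GHz clock is the nominal figure assumed on the slide, not a measured one.

```python
def peak_flops(dsp_blocks, flops_per_dsp_per_cycle, clock_hz):
    """Nominal peak: every DSP block delivers its flops on every clock cycle."""
    return dsp_blocks * flops_per_dsp_per_cycle * clock_hz

# Figures from the slide: 1688 DSP blocks (GX660), 2 flops each (A + C*D), 0.5 GHz.
arria10_gx660 = peak_flops(1688, 2, 0.5e9)
print(f"{arria10_gx660 / 1e12:.2f} Tflops single precision")
```

The result is an upper bound: it assumes all DSP blocks are kept busy on every cycle.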
Hard single-precision FP on FPGA ?!?
For people who can live with single precision this seems a very attractive new feature. But many think it wastes too many generic resources, and claim that what was missing were simpler blocks!
Back of the envelope performance estimation
Back of the envelope performance estimation
Given the number of
- LUTs
- FFs
- DSPs
offered by an FPGA, and the resources used by each operator, estimate the maximum number of operators that can be implemented on the FPGA.
Today FPGA clocks are ~500 MHz = 0.5 GHz (the unavoidable price for flexibility), so 2000 flops per cycle = 1 Teraflops.
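The estimate described here can be sketched as follows. The resource and cost numbers in the example are hypothetical round figures chosen for illustration, and the function names are not from any vendor tool; the sketch also assumes a single operator type that produces one result per cycle.

```python
def max_operators(available, cost):
    """Largest n such that n * cost[r] <= available[r] for every resource r used."""
    return min(available[r] // c for r, c in cost.items() if c > 0)

def peak_gflops(n_operators, clock_ghz=0.5):
    """One flop per operator per cycle at the assumed ~0.5 GHz FPGA clock."""
    return n_operators * clock_ghz

# Hypothetical device and operator, for illustration only.
available = {"lut": 1_000_000, "ff": 2_000_000, "dsp": 2000}
cost = {"lut": 500, "ff": 600, "dsp": 0}  # an operator built without DSPs

n = max_operators(available, cost)
print(n, "operators ->", peak_gflops(n), "Gflops")
```

With these numbers the LUTs are the limiting resource, and 2000 operators at 0.5 GHz give the 1 Teraflops quoted on the slide. Mixing operator variants that use different resources, as the following slides do, fills the device more completely than a single variant.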
Xilinx Virtex-7 family
Virtex-7 slices: 4 x 6-input LUTs, 8 FFs
Virtex-7 DSPs: 25-bit pre-adder, 25x18 multiplier, 48-bit accumulator
Virtex LUT ~ 1.6 standard LUTs
Custom precision, 17/24 bits floating point
op   dsp  lut+ff  lut   ff    #      tot dsp  tot lut   tot ff
*    2    103     90    112   1080   2160     208440    232200
*    1    113     97    104   0      0        0         0
*    0    377     336   376   0      0        0         0
+    0    369     301   393   1510   0        1011700   1150620
Tot                           2590   2160     1220140   1382820
(tot lut = # x (lut+ff + lut), tot ff = # x (lut+ff + ff))

Virtex-7 V2000T available resources
slices   LUT x slice  FF x slice  dsp    6-input LUT  ff        1.6 standard LUTs
305400   4            8           2160   1221600      2443200   1954560
IEEE single precision, 32 bits
op   dsp  lut+ff  lut   ff    #      tot dsp  tot lut   tot ff
*    3    120     103   105   700    2100     156100    157500
*    2    160     128   160   0      0        0         0
*    1    331     283   331   0      0        0         0
*    0    665     629   669   0      0        0         0
+    2    293     225   327   25     50       12950     15500
+    0    500     407   541   1160   0        1052120   1207560
Tot                           1885   2150     1221170   1380560

Virtex-7 V2000T available resources
slices   LUT x slice  FF x slice  dsp    6-input LUT  ff        1.6 standard LUTs
305400   4            8           2160   1221600      2443200   1954560
IEEE double precision, 64 bits
op   dsp  lut+ff  lut   ff    #      tot dsp  tot lut   tot ff
*    11   325     279   421   196    2156     118384    146216
*    10   371     299   456   0      0        0         0
*    9    439     356   510   0      0        0         0
*    0    2361    2317  2418  0      0        0         0
+    3    895     705   945   1      3        1600      1840
+    0    989     794   1029  617    0        1100111   1245106
Tot                           814    2159     1220095   1393162

Virtex-7 V2000T available resources
slices   LUT x slice  FF x slice  dsp    6-input LUT  ff        1.6 standard LUTs
305400   4            8           2160   1221600      2443200   1954560
Virtex UltraScale XCVU440 (20 nm, sampling): IEEE double precision, 64 bits
op   dsp  lut+ff  lut   ff    #      tot dsp  tot lut   tot ff
*    11   325     279   421   261    2871     157644    194706
*    10   371     299   456   0      0        0         0
*    9    439     356   510   0      0        0         0
*    0    2361    2317  2418  0      0        0         0
+    3    895     705   945   3      9        4800      5520
+    0    989     794   1029  1321   0        2355343   2665778
Tot                           1585   2880     2517787   2866004

Virtex UltraScale available resources
slices   LUT x slice  FF x slice  dsp    6-input LUT  ff        1.6 standard LUTs
314820   8            16          2880   2518560      5037120   4029696
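The double-precision totals on the last two slides can be re-derived from the per-operator figures. The sketch below assumes that an operator's LUT use is the sum of the "lut+ff" and "lut" columns, and its FF use the sum of "lut+ff" and "ff"; this reading is an assumption, chosen because it reproduces the slide totals. The 0.5 GHz clock and one result per operator per cycle are also assumptions carried over from the back-of-the-envelope slide.

```python
# Per-operator cost for IEEE double precision, as (dsp, lut, ff).
# Assumption: lut = "lut+ff" + "lut" columns, ff = "lut+ff" + "ff" columns.
COST = {
    "mul_11dsp": (11, 325 + 279, 325 + 421),
    "add_3dsp": (3, 895 + 705, 895 + 945),
    "add_0dsp": (0, 989 + 794, 989 + 1029),
}

def totals(mix):
    """Return (operators, dsp, lut, ff) for a {operator_name: count} mix."""
    n = sum(mix.values())
    dsp = sum(COST[op][0] * k for op, k in mix.items())
    lut = sum(COST[op][1] * k for op, k in mix.items())
    ff = sum(COST[op][2] * k for op, k in mix.items())
    return n, dsp, lut, ff

# Operator mixes taken from the "#" column of the two slides.
virtex7 = totals({"mul_11dsp": 196, "add_3dsp": 1, "add_0dsp": 617})
ultrascale = totals({"mul_11dsp": 261, "add_3dsp": 3, "add_0dsp": 1321})

CLOCK_GHZ = 0.5  # assumed clock
for name, (n, dsp, lut, ff) in [("Virtex-7 V2000T", virtex7), ("XCVU440", ultrascale)]:
    print(f"{name}: {n} operators, {dsp} DSP, {lut} LUT, {ff} FF, "
          f"about {n * CLOCK_GHZ} Gflops peak")
```

Both mixes stay just under the available 6-input LUT and DSP counts. For the XCVU440, 1585 operators at 0.5 GHz come close to the 800 Gflops used for the Virtex-US entry in the power comparison.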
Relative power dissipation/1
TDP / peak nominal double-precision FP performance:
Intel Q6600 2.4 GHz               105 W /   38 Gflops = 2763 mW/Gflops
Intel Haswell i7-4770K 3.5 GHz     84 W /  112 Gflops =  750 mW/Gflops
Intel IvyBridge 3770K 3.5 GHz      77 W /  112 Gflops =  687 mW/Gflops
Nvidia Tesla M2090                225 W /  666 Gflops =  337 mW/Gflops
Nvidia Tesla K20X                 235 W / 1310 Gflops =  179 mW/Gflops
Xilinx Virtex-US                   20 W /  800 Gflops =   25 mW/Gflops
Relative to the Virtex-US figure: Intel CPUs ~30x, Nvidia GPUs ~10x.
FPGA computing = green computing
Relative power dissipation/2
[Figure: bar chart of mW/Gflops (axis 0 to 3000 mW) for Virtex-7, Tesla K20X, Tesla M2090, Intel i7-3770K, Intel i7-4770K and Intel Q6600 2.4 GHz]
Gflops per Watt
Peak nominal double-precision FP performance / TDP:
Intel Q6600 2.4 GHz                 38 Gflops / 105 W =  0.36 Gflops/W
Intel Haswell i7-4770K 3.5 GHz     112 Gflops /  84 W =  1.33 Gflops/W
Intel IvyBridge 3770K 3.5 GHz      112 Gflops /  77 W =  1.45 Gflops/W
Nvidia Tesla M2090                 666 Gflops / 225 W =  2.96 Gflops/W
Nvidia Tesla K20X                 1310 Gflops / 235 W =  5.57 Gflops/W
Xilinx Virtex-US                   800 Gflops /  20 W = 40    Gflops/W
Relative to the Virtex-US figure: Intel CPUs ~30x, Nvidia GPUs ~10x.
FPGA computing = green computing
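The two efficiency metrics used on these slides are reciprocals of each other up to a factor of 1000. A small sketch that recomputes them from the TDP and peak figures quoted on the slides (the helper names are illustrative):

```python
# (TDP in W, peak double-precision Gflops), as quoted on the slides.
DEVICES = {
    "Intel Q6600": (105, 38),
    "Intel Haswell i7-4770K": (84, 112),
    "Intel IvyBridge 3770K": (77, 112),
    "Nvidia Tesla M2090": (225, 666),
    "Nvidia Tesla K20X": (235, 1310),
    "Xilinx Virtex-US": (20, 800),
}

def gflops_per_watt(tdp_w, gflops):
    return gflops / tdp_w

def mw_per_gflops(tdp_w, gflops):
    return 1000 * tdp_w / gflops

fpga = gflops_per_watt(*DEVICES["Xilinx Virtex-US"])
for name, (tdp_w, gflops) in DEVICES.items():
    eff = gflops_per_watt(tdp_w, gflops)
    print(f"{name:24s} {eff:6.2f} Gflops/W  {mw_per_gflops(tdp_w, gflops):7.1f} mW/Gflops  "
          f"FPGA advantage {fpga / eff:5.1f}x")
```

These are nominal peak and TDP figures, so the ratios indicate orders of magnitude rather than measured efficiency.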
Top green500 list
green500_rank | Mflops/Watt | total_power (kW) | Year | Total Cores | Name | Manufacturer | Country | Computer
1 | 4,503 | 28 | 2013 | 2720 | TSUBAME-KFC | NEC | Japan | LX 1U-4GPU/104Re-1G Cluster, Intel Xeon E5-2620v2 6C 2.100GHz, Infiniband FDR, NVIDIA K20x
2 | 3,632 | 53 | 2013 | 5120 | Wilkes | Dell | United Kingdom | Dell T620 Cluster, Intel Xeon E5-2630v2 6C 2.600GHz, Infiniband FDR, NVIDIA K20
3 | 3,518 | 79 | 2013 | 4864 | HA-PACS TCA | Cray Inc. | Japan | Cray 3623G4-SM Cluster, Intel Xeon E5-2680v2 10C 2.800GHz, Infiniband QDR, NVIDIA K20x
4 | 3,186 | 1,754 | 2012 | 115984 | Piz Daint | Cray Inc. | Switzerland | Cray XC30, Xeon E5-2670 8C 2.600GHz, Aries interconnect, NVIDIA K20x
5 | 3,131 | 81 | 2013 | 5720 | romeo | Bull SA | France | Bull R421-E3 Cluster, Intel Xeon E5-2650v2 8C 2.600GHz, Infiniband FDR, NVIDIA K20x
6 | 3,069 | 923 | 2013 | 74358 | TSUBAME 2.5 | NEC/HP | Japan | Cluster Platform SL390s G7, Xeon X5670 6C 2.930GHz, Infiniband QDR, NVIDIA K20x
7 | 2,702 | 54 | 2013 | 3080 | | IBM | United States | iDataPlex DX360M4, Intel Xeon E5-2650v2 8C 2.600GHz, Infiniband FDR14, NVIDIA K20x
8 | 2,629 | 270 | 2013 | 15840 | | IBM | Germany | iDataPlex DX360M4, Intel Xeon E5-2680v2 10C 2.800GHz, Infiniband, NVIDIA K20x
9 | 2,629 | 56 | 2013 | 3264 | | IBM | United States | iDataPlex DX360M4, Intel Xeon E5-2680v2 10C 2.800GHz, Infiniband, NVIDIA K20x
10 | 2,359 | 71 | 2010 | 4620 | CSIRO GPU Cluster | Xenon Systems | Australia | Nitro G16 3GPU, Xeon E5-2650 8C 2.000GHz, Infiniband FDR, Nvidia K20m
11 | 2,351 | 179 | 2012 | 38400 | SANAM | Adtech | Saudi Arabia | Adtech, ASUS ESC4000/FDR G2, Xeon E5-2650 8C 2.000GHz, Infiniband FDR, AMD FirePro S10000
12 | 2,299 | 82 | 2011 | 16384 | | IBM | United States | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
13 | 2,299 | 82 | 2012 | 16384 | Cetus | IBM | United States | BlueGene/Q, Power BQC 16C 1.600GHz, Custom Interconnect
14 | 2,299 | 82 | 2012 | 16384 | | IBM | Poland | BlueGene/Q, Power BQC 16C 1.600GHz, Custom Interconnect
15 | 2,299 | 82 | 2013 | 16384 | | IBM | United States | BlueGene/Q, Power BQC 16C 1.600GHz, Custom Interconnect
16 | 2,299 | 82 | 2012 | 16384 | Vesta | IBM | United States | BlueGene/Q, Power BQC 16C 1.60GHz, Custom
17 | 2,299 | 82 | 2012 | 16384 | | IBM | United States | BlueGene/Q, Power BQC 16C 1.60GHz, Custom
18 | 2,243 | 237 | 2013 | 10920 | HPCC | Hewlett-Packard | United States | Cluster Platform SL250s Gen8, Xeon E5-2665 8C 2.400GHz, Infiniband FDR, Nvidia K20m
Power/Energy efficiency
Power Dissipation
A chip is made of millions of CMOS FETs. When an input switches, you need to charge a small capacitance:
$E_d = \frac{1}{2} C V^2$
Doing this f times a second gives, together with some constant static dissipation:
$P_T = k C V^2 f + P_s$
However, if you increase the frequency a lot, the chip becomes unstable unless you also increase the voltage (and with it the leakage). Therefore the behaviour vs f is in fact superlinear.
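A short numeric sketch of the formula $P_T = k C V^2 f + P_s$ makes the superlinear behaviour concrete. The parameter values are arbitrary illustration values, not data for any real chip, and the case where the voltage rises in proportion to the frequency is an assumed simplification.

```python
def total_power(k, c, v, f, p_static=0.0):
    """P_T = k * C * V^2 * f + P_s (dynamic switching power plus static power)."""
    return k * c * v ** 2 * f + p_static

# Arbitrary illustration values: k = 1, C = 1 nF switched, V = 1 V, f = 1 GHz.
base = total_power(1.0, 1e-9, 1.0, 1e9)

# Doubling f at constant voltage doubles the dynamic power ...
same_voltage = total_power(1.0, 1e-9, 1.0, 2e9)

# ... but if V has to rise with f (here assumed proportional), power grows as f^3.
raised_voltage = total_power(1.0, 1e-9, 2.0, 2e9)

print(same_voltage / base, raised_voltage / base)
```

The gap between the two ratios is the reason why frequency increases become so expensive once the voltage has to follow.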
Dennard scaling (1974)
Each generation, with scaling factor S = 1.4:
- S = 1.4x faster transistors
- S = 1.4x lower capacitance
- scale Vdd by S => S^2 = 2x lower energy
- S^2 = 2x more transistors
Performance scales as S^3 = 2.8 while power density stays constant across generations.
Fred Pollack (Intel) famous graph (1999)
F. Pollack, “New microarchitecture challenges in the coming generations of CMOS process technology”
Power density increases!!! In 2004/2005 we hit the power wall => frequency increases stop.
End of Dennard scaling
- S = 1.4x faster transistors
- S = 1.4x lower capacitance
- S^2 = 2x more transistors
In submicron technology the voltage can no longer be scaled down, so power increases by S^2 = 2.
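The scaling factors on the Dennard slides can be checked by multiplying them out. The sketch below models the power of a fixed-area chip as transistors x capacitance x voltage^2 x frequency, relative to the previous generation; the function and its flags are illustrative only.

```python
import math

def relative_power(s, voltage_scales, frequency_scales):
    """Power of a fixed-area chip after one generation, relative to the previous one."""
    transistors = s ** 2                          # S^2 more transistors
    capacitance = 1 / s                           # S lower capacitance each
    voltage = 1 / s if voltage_scales else 1.0    # Vdd scaled by S, or stuck
    frequency = s if frequency_scales else 1.0    # S faster, or held constant
    return transistors * capacitance * voltage ** 2 * frequency

S = math.sqrt(2)  # ~1.4 per generation

dennard = relative_power(S, voltage_scales=True, frequency_scales=True)
post_dennard = relative_power(S, voltage_scales=False, frequency_scales=True)
print(f"Dennard scaling: {dennard:.2f}x, fixed voltage: {post_dennard:.2f}x")
```

With voltage scaling the factors cancel and power density stays constant; with the voltage stuck, power grows by S^2 = 2 per generation.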
MOS subthreshold current
When scaling down the geometry, you scale down the drain voltage to avoid high electric fields and to decrease the energy required to switch. You also have to scale down the threshold voltage to sustain the 30% decrease in gate delay. The small voltage swing that remains cannot completely turn off the transistor. Subthreshold leakage, which was ignored in the past, can consume up to 1/2 of the total power on modern VLSI chips.
Subthreshold leakage
VT design tradeoff
- Low VT for high ON current: $I_{Dsat} \propto (V_{DD} - V_T)^2$. Low VT => high IDS, good for the ON condition.
- High VT for low OFF current. High VT => low leakage, good for the OFF condition.
Phenomenology: a 60-200 mV swing of VGS changes IDS by one order of magnitude. Today's VT of 0.2-0.5 V doesn't allow the VGS swing needed to shut off the transistor.
[Figure: log IDS vs VGS]
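The rule of thumb on this slide can be turned into a rough number. The sketch below assumes a simplified model in which the current drops by exactly one decade for each subthreshold-swing step of VGS below VT, all the way down to VGS = 0; the function name and the example values are illustrative.

```python
def off_current_ratio(vt_mv, swing_mv_per_decade):
    """I(V_GS = 0) / I(V_GS = V_T), assuming one decade of current per swing step."""
    return 10 ** (-vt_mv / swing_mv_per_decade)

# Illustration with a 100 mV/decade swing (inside the 60-200 mV range of the slide).
high_vt = off_current_ratio(500, 100)   # V_T = 0.5 V
low_vt = off_current_ratio(200, 100)    # V_T = 0.2 V
print(f"V_T = 0.5 V: {high_vt:.0e}   V_T = 0.2 V: {low_vt:.0e}")
```

In this simplified model, lowering VT from 0.5 V to 0.2 V raises the OFF current by three orders of magnitude relative to the current at threshold.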
Multicore scaling
65 nm: 4 cores, 45 nm: 8 cores, 32 nm: 16 cores
Every generation 2x cores, at the same or slightly increasing frequency.
Multicore scaling at constant frequency
- S^2 = 2x more transistors
- S = 1.4x lower capacitance
=> S = 1.4x lower utilization
We hit the utilization wall => dark silicon
End of multicore scaling
65 nm: 4 cores, 45 nm: 5.7 cores, 32 nm: 8 cores
Every generation 1.4x cores, at the same or slightly increasing frequency. The rest is dark or dim silicon (“uncore”).
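The core counts on the multicore slides follow from the per-generation growth factors. As a sketch: with S^2 more transistors but only S lower capacitance, at constant voltage and frequency and under a fixed power budget, only a fraction 1/S of the chip can be active, so the number of usable cores grows by S^2 / S = S per generation. The helper name is illustrative.

```python
def core_counts(start_cores, growth, generations):
    """Usable cores per generation, starting from start_cores."""
    return [start_cores * growth ** g for g in range(generations)]

S = 2 ** 0.5  # ~1.4 per generation

ideal = core_counts(4, S ** 2, 3)        # all transistors usable: 2x per generation
limited = core_counts(4, S ** 2 / S, 3)  # utilization wall: 1.4x per generation

for node, a, b in zip(["65 nm", "45 nm", "32 nm"], ideal, limited):
    print(f"{node}: {a:.1f} cores ideal, {b:.1f} cores within the power budget")
```

The difference between the two columns is the dark (or dim) silicon.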
Dark silicon and the end of multicore scaling
Doug Burger (Microsoft) at HiPEAC 2013:
- until 2004: each semiconductor generation gave transistors that were smaller, faster and consumed less
- from 2004 to now: we still got smaller transistors, but we could not run them faster (power wall)
- in the future: we will still get smaller transistors, but we will not be able to use all of them together (dark silicon) or at maximum speed
Scaling the utilization wall
G. Venkatesh, ASPLOS '10: “while the area budget continues to increase exponentially, the power budget has become a first-order design constraint in current processors. In this regime, utilizing transistors to design specialized cores that optimize energy-per-computation becomes an effective approach to improve the system performance.”
The Utilization Wall: “With each successive process generation, the percentage of a chip that can switch at full frequency drops exponentially due to power constraints.” [Venkatesh, ASPLOS '10]
Single-chip heterogeneous computer (E. Chung): greater energy efficiency by combining a GPP with unconventional cores (U-cores): GPU, FPGA, DSP, ASICs ...