Let’s Open Up New Fields for Next 10X! Koji Inoue, Kyushu University, Japan. We Need More Performance! But… Higher performance is needed more than ever: supercomputing, desktops, laptops, cellular phones, home games, and so on. But power is a SERIOUS problem: device reliability and global warming.
Let’s Open Up New Fields for Next 10X!
Koji Inoue
Kyushu University, Japan
ISLPED'08@Bangalore
We Need More Performance! But…
• Higher performance is needed more than ever!
  • Supercomputing, Desktop, Laptop, Cellular Phone, Home Games, and so on
• But power is a SERIOUS problem!
  • Device Reliability
  • Global Warming
• Need drastic improvement!
  • Not Incremental!
Figure: Total Power Consumption in Japan, in billions of kWh (y-axis 50 to 250), for 2006, 2025, and 2050, with growth annotations 5X and 12X. Source: http://www.meti.go.jp/press/20071207005/03_G_IT_ini.pdf
How?
• Fundamental concepts for low power have matured in “Our Current Field”!
  • DVS, DVFS, Selective Activation, Exploiting Prediction, and so on…
• Suggestion
  • Move to a new (or strange) field!
  • Orchestrate computing resources effectively!
New Fields!
• Revisit the system stack from circuit/architecture level to algorithm level on new fields!
  • Superconducting rapid single-flux-quantum (SFQ)
  • New Reconfigurable Devices
  • 3D-IC Implementation with Wireless Links
Yokohama National University, Nagoya University, Advanced Industrial Science and Technology, Keio University
The Case for SFQ-RDP Project
10 TFLOPS Desk-top Computer
RSFQ with new architecture
Kyushu University, Yokohama National University, Nagoya University, SRL/ISTEC
Superconducting rapid single-flux-quantum (SFQ): Device Level
• Superconducting wire
• Ballistic propagation
Figure: Energy-delay products. Axes: bit energy [J], gate delay [s].
Figure: MCM developed by SRL.
Large-Scale Reconfigurable Data-Path for SFQ: Architecture Level
• 1K FPUs operate at 80 GHz
• Re-configurable operand network
• Much simpler organization for SFQ design (no feedback loops)
• Strike a good balance between parallel execution and sequential execution
How To Exploit A Number of FPUs: Compiler Level
• Large data-flow graphs (DFGs) are extracted from source code
• They are executed in a pipelined fashion
• SFQ-RDP is used as an “Efficient Accelerator”
Figure labels: Application Program; DFG Extractor (w/ source-level optimization); Mapping
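The idea of executing an extracted DFG in pipeline fashion can be sketched as follows. This is only a minimal illustration, not the project's DFG extractor or mapper: the graph `DFG`, its node names, and the helpers `asap_stages` and `group_by_stage` are all hypothetical. Each operation is placed in the earliest stage at which its operands are available, which yields a purely feed-forward layout.

```python
from collections import defaultdict

# Hypothetical DFG for r = (a*b + c*d) * f(e): node -> (operation, operand names).
# Names that are not keys of the dict (a, b, c, d, e) are external inputs.
DFG = {
    "m1": ("MUL", ["a", "b"]),
    "m2": ("MUL", ["c", "d"]),
    "s1": ("ADD", ["m1", "m2"]),
    "f1": ("FUNC", ["e"]),
    "r": ("MUL", ["s1", "f1"]),
}


def asap_stages(dfg):
    """Give each node the earliest stage at which all its operands are ready."""
    stage = {}

    def visit(node):
        if node not in dfg:  # external input: ready before stage 1
            return 0
        if node not in stage:
            _, operands = dfg[node]
            stage[node] = 1 + max(visit(o) for o in operands)
        return stage[node]

    for node in dfg:
        visit(node)
    return stage


def group_by_stage(stage):
    """Collect node names per stage, in stage order."""
    rows = defaultdict(list)
    for node, s in stage.items():
        rows[s].append(node)
    return {s: sorted(nodes) for s, nodes in sorted(rows.items())}


stages = asap_stages(DFG)
layout = group_by_stage(stages)
for s, nodes in layout.items():
    print(f"stage {s}: {nodes}")
```

Operations in the same stage have no dependences on each other, so they could run on different FPUs at once, while successive loop iterations could stream through the stages one after another.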
How To Exploit A Number of FPUs: Algorithm Level
Computation of molecular orbital
while (I < 1000):
  tei(4,4,4,4)=
    (((3+2*p*(4*PAx*PBx+PBx**2+PAx**2*(1+2*p*PBx**2)))*(3+2*q*(4*QCx*QDx+QDx**2+QCx**2*(1+2*q*QDx**2)))*f(0,t))/(p**2*q**2)
    +(4*(3+2*p*(4*PAx*PBx+PBx**2+PAx**2*(1+2*p*PBx**2)))*PQx*(QCx+QDx)*(3+2*q*QCx*QDx)*f(1,t))/(p*q*(p+q))
    (4*(PAx+PBx)*(3+2*p*PAx*PBx)*PQx*(3+2*q*(4*QCx*QDx+QDx**2+QCx**2*(1+2*q*QDx**2)))*f(1,t))/(p*q*(p+q))
    (8*(PAx+PBx)*(3+2*p*PAx*PBx)*(QCx+QDx)*(3+2*q*QCx*QDx)*(((p+q)*f(1,t))+2*p*PQx**2*q*f(2,t)))/(p*q*(p+q)**2)
    +(2*(3+2*p*(4*PAx*PBx+PBx**2+PAx**2*(1+2*p*PBx**2)))*(3+q*(QCx**2+4*QCx*QDx+QDx**2))*(((p+q)*f(1,t))+2*p*PQx**2*q*f(2,t)))/(p*q**2*(p+q)**2)
    +(2*(3+p*(PAx**2+4*PAx*PBx+PBx**2))*(3+2*q*(4*QCx*QDx+QDx**2+QCx**2*(1+2*q*QDx**2)))*(((p+q)*f(1,t))+2*p*PQx**2*q*f(2,t)))/(p**2*q*(p+q)**2)
    +(4*(3+2*p*(4*PAx*PBx+PBx**2+PAx**2*(1+2*p*PBx**2)))*PQx*(QCx+QDx)*(3*(p+q)*f(2,t)+2*p*PQx**2*q*f(3,t)))/(q*(p+q)**3)
    +(8*(3+p*(PAx**2+4*PAx*PBx+PBx**2))*PQx*(QCx+QDx)*(3+2*q*QCx*QDx)*(3*(p+q)*f(2,t)+2*p*PQx**2*q*f(3,t)))/(p*(p+q)**3)
    (8*(PAx+PBx)*(3+2*p*PAx*PBx)*PQx*(3+q*(QCx**2+4*QCx*QDx+QDx**2))*(3*(p+q)*f(2,t)+2*p*PQx**2*q*f(3,t)))/(q*(p+q)**3)
    (4*(PAx+PBx)*PQx*(3+2*q*(4*QCx*QDx+QDx**2+QCx**2*(1+2*q*QDx**2)))*(3*(p+q)*f(2,t)+2*p*PQx**2*q*f(3,t)))/(p*(p+q)**3)
    +((3+2*p*(4*PAx*PBx+PBx**2+PAx**2*(1+2*p*PBx**2)))*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(q**2*(p+q)**4)
    (8*(PAx+PBx)*(3+2*p*PAx*PBx)*(QCx+QDx)*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(q*(p+q)**4)
    (8*(PAx+PBx)*(QCx+QDx)*(3+2*q*QCx*QDx)*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(p*(p+q)**4)
    +(4*(3+p*(PAx**2+4*PAx*PBx+PBx**2))*(3+q*(QCx**2+4*QCx*QDx+QDx**2))*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(p*q*(p+q)**4)
    +((3+2*q*(4*QCx*QDx+QDx**2+QCx**2*(1+2*q*QDx**2)))*(3*(p+q)**2*f(2,t)+4*p*PQx**2*q*(3*(p+q)*f(3,t)+p*PQx**2*q*f(4,t))))/(p**2*(p+q)**4)
    (4*p*(PAx+PBx)*(3+2*p*PAx*PBx)*PQx*(15*(p+q)**2*f(3,t)+4*p*PQx**2*q*(5*(p+q)*f(4,t)+p*PQx**2*q*f(5,t))))/(q*(p+q)**5)
    +(8*(3+p*(PAx**2+4*PAx*PBx+PBx**2))*PQx*(QCx+QDx)*(15*(p+q)**2*f(3,t)+4*p*PQx**2*q*(5*(p+q)*f(4,t)+p*PQx**2*q*f(5,t))))/(p+q)**5
    +(4*PQx*q*(QCx+QDx)*(3+2*q*QCx*QDx)*(15*(p+q)**2*f(3,t)+4*p*PQx**2*q*(5*(p+q)*f(4,t)+p*PQx**2*q*f(5,t))))/(p*(p+q)**5)
    (8*(PAx+PBx)*PQx*(3+q*(QCx**2+4*QCx*QDx+QDx**2))*(15*(p+q)**2*f(3,t)+4*p*PQx**2*q*(5*(p+q)*f(4,t)+p*PQx**2*q*f(5,t))))/(p+q)**5
    +(8*(PAx+PBx)*(QCx+QDx)*(15*(p+q)**3*f(3,t)+30*p*PQx**2*q*(p+q)*(3*(p+q)*f(4,t)+2*p*PQx**2*q*f(5,t))8*p**3*PQx**6*q**3*f(6,t)))/(p+q)**6
    +(2*(3+p*(PAx**2+4*PAx*PBx+PBx**2))*(15*(p+q)**3*f(3,t)30*p*PQx**2*q*(p+q)*(3*(p+q)*f(4,t)+2*p*PQx**2*q*f(5,t))+8*p**3*PQx**6*q**3*f(6,t)))/(q*(p+q)**6)
    +(2*(3+q*(QCx**2+4*QCx*QDx+QDx**2))*(15*(p+q)**3*f(3,t)30*p*PQx**2*q*(p+q)*(3*(p+q)*f(4,t)+2*p*PQx**2*q*f(5,t))+8*p**3*PQx**6*q**3*f(6,t)))/(p*(p+q)**6)
  I ++
787 MUL, 261 ADD, 69 FUNC
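Operation counts like the MUL / ADD / FUNC figures on this slide can be gathered mechanically by walking an expression's syntax tree. The sketch below uses Python's standard `ast` module on a short fragment written in the same notation. The helper `count_ops` and its classification rule are assumptions made for illustration (multiplication, division, and powers are counted as MUL, addition and subtraction as ADD, calls such as `f(1,t)` as FUNC), since the slide does not state how its counts were obtained.

```python
import ast
from collections import Counter


def count_ops(expr: str) -> Counter:
    """Count arithmetic operations and function calls in one expression."""
    counts = Counter()
    for node in ast.walk(ast.parse(expr, mode="eval")):
        if isinstance(node, ast.BinOp):
            # Assumed rule: *, / and ** all occupy a multiplier-class unit.
            if isinstance(node.op, (ast.Mult, ast.Div, ast.Pow)):
                counts["MUL"] += 1
            elif isinstance(node.op, (ast.Add, ast.Sub)):
                counts["ADD"] += 1
        elif isinstance(node, ast.Call):
            counts["FUNC"] += 1
    return counts


# Illustrative fragment only, not the full tei(4,4,4,4) expression.
fragment = "(3+2*q*QCx*QDx)*f(1,t)/(p*q*(p+q))"
ops = count_ops(fragment)
print(dict(ops))
```

Each counted operation corresponds to one DFG node, so the totals give a rough idea of how many FPUs a single loop iteration could keep busy.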
Koji’s Message (from Core to Data-Center)
• Move to new fields!
• Orchestrate computing resources effectively!
  • Efficient acceleration (Parallelization, Specialization)
  • Strike a good balance among many factors (Concurrency Management)
Thanks!!
New Fields!
• Revisit the system stack from circuit/architecture level to algorithm level on emerging devices!
Figure: SFQ-RDP system diagram. Labels:
• 4.2 K, SFQ 0.5um process
• SFQ RDP (32 FPUs × 32 chips) (4 GFLOPS/FPU)
• 1024 FPUs@MCM (34 chips) × 4 MCMs
• Blocks shown: FPU, ORN, SB, SMAC
• CMOS CPU (1 chip)
• 2TB memory module (FB-DIMM [DDR3@1333MHz, 128GB] × 16 modules)
• Memory bandwidth per MCM: 256GB/s (= 16GB/s × 16 channels)
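The figures on this diagram can be cross-checked with a little arithmetic. The sketch below only multiplies the numbers given above. The variable names are hypothetical, peak rates are theoretical upper bounds rather than measured results, and the 4-MCM aggregate assumes that each of the four MCMs carries its own 1024 FPUs, which is one reading of the label.

```python
# Values copied from the diagram labels.
FPUS_PER_CHIP = 32
FPU_CHIPS_PER_MCM = 32
GFLOPS_PER_FPU = 4
MCMS = 4                 # assumption: every MCM holds a full 1024-FPU array
GBPS_PER_CHANNEL = 16    # GB/s
CHANNELS_PER_MCM = 16
GB_PER_MODULE = 128
MODULES = 16

fpus_per_mcm = FPUS_PER_CHIP * FPU_CHIPS_PER_MCM           # 1024 FPUs
peak_gflops_per_mcm = fpus_per_mcm * GFLOPS_PER_FPU        # theoretical peak
peak_gflops_total = peak_gflops_per_mcm * MCMS             # under the 4-MCM reading
bandwidth_per_mcm = GBPS_PER_CHANNEL * CHANNELS_PER_MCM    # GB/s
bytes_per_flop = bandwidth_per_mcm / peak_gflops_per_mcm   # memory bytes per FLOP at peak
memory_tb = GB_PER_MODULE * MODULES / 1024                 # GB -> TB

print(f"FPUs per MCM:        {fpus_per_mcm}")
print(f"Peak per MCM:        {peak_gflops_per_mcm / 1000:.3f} TFLOPS")
print(f"Peak for {MCMS} MCMs:     {peak_gflops_total / 1000:.3f} TFLOPS")
print(f"Bandwidth per MCM:   {bandwidth_per_mcm} GB/s")
print(f"Bytes per FLOP:      {bytes_per_flop}")
print(f"Memory capacity:     {memory_tb} TB")
```

The low bytes-per-FLOP ratio is one way to see why large DFGs matter here: intermediate values need to flow directly between FPUs instead of going back to memory.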