Building Fake Body Parts: Digital Mockups
Frank Vahid, Univ. of California, Riverside
• Chen Huang (UC Riverside, now Amazon)
• Bailey Miller (UC Riverside, intern at SpaceX)
• Prof. Tony Givargis (UC Irvine)
• Ting-Shuo Chou (UC Irvine)
• Others...
Support provided by NSF, SRC, Dept. of Educ. Also CareFusion, Xilinx, METI
Models of the physical world that run in real time
• Test cyber-physical systems
[Figure labels: transducer models, environment model, physical mockup, digital mockup. URL on slide: http://www.nhlbi.nih.gov/]
Issue: Real-time is achieved via inaccuracy
• “2-3 minutes to simulate one breath accurately”
• Weibel lung complexity:
  4 gen: 32 ODEs
  6 gen: 128 ODEs
  8 gen: 512 ODEs
  10 gen: 2048 ODEs
[Figure labels: V[1],R[1]; V[2],R[2]; V[7],R[7]]
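The four data points on this slide are consistent with the ODE count doubling with every added generation. The sketch below encodes that fit as 2^(generations + 1); this formula is an assumption inferred from the listed numbers, not something the slide states, and `weibel_ode_count` is a hypothetical helper name.

```python
def weibel_ode_count(generations: int) -> int:
    """ODE count that reproduces the four data points listed on the slide.

    Assumption: the count follows 2 ** (generations + 1). The slide gives
    only the data points, not a formula.
    """
    if generations < 0:
        raise ValueError("generations must be non-negative")
    return 2 ** (generations + 1)


if __name__ == "__main__":
    for gens in (4, 6, 8, 10):
        print(f"{gens} gen: {weibel_ode_count(gens)} ODEs")
```

Under the same assumption, the 11-generation model mentioned later in the deck would come to 4096 ODEs.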
PC & GPU
[Bar chart: Performance (ms), axis 0 to 1000, for PC(1), PC(4), and GPU on five models: Weibel, Neuron, Weibel + gas, Weibel + hemo, Hemodynamic. Labeled bar values: 1522, 1490, 1430, 1184.]
Speedup vs real-time: PC(1): 0.8x, PC(4): 3.1x, GPU: 1.6x
• Parallel computations + neighbor communication: seems like a great match for FPGAs
FPGAs: Sw circuits (parallel)
C code for FIR filter (runs on a processor):
  for (i=0; i < 128; i++) y += c[i] * x[i];
• 1000's of instructions
• Several thousand cycles
Circuit for FIR filter (runs on an FPGA):
• ~7 cycles (though slower clock)
• Speedup > 10x-100x
[Figure: rows of multipliers (*) and adders (+) on the FPGA, shown next to processors]
FPGAs “101” (A Quick Intro)
[Figure: an FPGA drawn as lookup tables (LUT, each a 4x2 memory) connected through 2x2 switch matrices (SM), with example truth tables and configuration bits for inputs a, b, c and outputs such as F, G, D, E.]
HLS / FPGAs
• High-level synthesis: compiler that converts a program to circuits
[Bar chart: Performance (ms), axis 0 to 1000, for PC(1), PC(4), GPU, and HLS on five models: Weibel, Neuron, Weibel + gas, Weibel + hemo, Hemodynamic. Labeled bar values: 1522, 1490, 1430, 1184.]
Speedup vs real-time: PC(1): 0.8x, PC(4): 3.1x, GPU: 1.6x, HLS/FPGA: 3.2x
Network of synchronized PEs on FPGAs
• General Processing Element (PE)
• Iterative ODE solver (Euler/RK4)
• 0.1 ms / 0.01 ms timestep
• 1 PE: 300 MHz
[Figure: FPGA digital mockup built from a network of PEs]
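As a software illustration of the iterative solver each PE runs, here is a minimal forward-Euler sketch using the 0.01 ms timestep from the slide. The `decay` test equation and its 10 ms time constant are hypothetical, chosen only because the exact answer is known; the hardware PEs are not implemented this way.

```python
import math
from typing import Callable, List, Sequence

Derivative = Callable[[float, Sequence[float]], List[float]]


def euler_step(f: Derivative, t: float, y: Sequence[float], dt: float) -> List[float]:
    """Advance the state y by one forward-Euler step of size dt."""
    dydt = f(t, y)
    return [yi + dt * di for yi, di in zip(y, dydt)]


def simulate(f: Derivative, y0: Sequence[float], dt: float, steps: int) -> List[float]:
    """Apply `steps` Euler steps starting from y0 at t = 0."""
    t, y = 0.0, list(y0)
    for _ in range(steps):
        y = euler_step(f, t, y, dt)
        t += dt
    return y


def decay(t: float, y: Sequence[float]) -> List[float]:
    """Hypothetical test equation dy/dt = -y / tau with tau = 10 ms."""
    return [-y[0] / 10.0]


if __name__ == "__main__":
    # 1000 steps of 0.01 ms cover 10 ms of simulated time.
    result = simulate(decay, [1.0], dt=0.01, steps=1000)
    print("Euler:", result[0], "exact:", math.exp(-1.0))
```

A smaller timestep reduces the gap between the Euler result and the exact value at the price of more steps per simulated millisecond, which is the accuracy versus real-time tradeoff the deck is about.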
Synthesis tool
• Phase 1: Maps ODEs to virtual PEs using simulated annealing
• Phase 2: Converts virtual PEs to physical circuits using FPGA place-route
[Figure labels: 10K iterations, 150K iterations]
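To make Phase 1 concrete, the following is a minimal simulated-annealing sketch that binds ODEs to PEs. The objective (minimize the cycle count of the busiest PE), the linear cooling schedule, and the names `max_pe_cycles` and `anneal_binding` are all assumptions for illustration; the slides do not describe the tool's actual cost function or move set.

```python
import math
import random
from typing import List, Sequence, Tuple


def max_pe_cycles(binding: Sequence[int], ode_cycles: Sequence[int], num_pes: int) -> int:
    """Cycle count of the busiest PE for a given ODE-to-PE binding."""
    load = [0] * num_pes
    for ode, pe in enumerate(binding):
        load[pe] += ode_cycles[ode]
    return max(load)


def anneal_binding(ode_cycles: Sequence[int], num_pes: int,
                   iterations: int = 10_000, t0: float = 10.0,
                   seed: int = 0) -> Tuple[List[int], int]:
    """Search for a binding with a small busiest-PE cycle count."""
    rng = random.Random(seed)
    current = [rng.randrange(num_pes) for _ in ode_cycles]
    current_cost = max_pe_cycles(current, ode_cycles, num_pes)
    best, best_cost = list(current), current_cost
    for k in range(iterations):
        # Assumed linear cooling; the tiny offset avoids division by zero.
        temp = t0 * (1.0 - k / iterations) + 1e-9
        ode = rng.randrange(len(ode_cycles))
        old_pe = current[ode]
        current[ode] = rng.randrange(num_pes)  # move one ODE to a random PE
        new_cost = max_pe_cycles(current, ode_cycles, num_pes)
        delta = new_cost - current_cost
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current_cost = new_cost  # accept the move
            if new_cost < best_cost:
                best, best_cost = list(current), new_cost
        else:
            current[ode] = old_pe  # reject the move and restore
    return best, best_cost


if __name__ == "__main__":
    cycles = [4, 4, 4, 4, 2, 2, 2, 2]  # hypothetical per-ODE cycle costs
    binding, cost = anneal_binding(cycles, num_pes=4)
    print("binding:", binding, "busiest PE cycles:", cost)
```

Worse moves are accepted with a probability that shrinks as the temperature falls, which lets the search leave local minima early and settle later.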
General PEs
[Bar chart: Performance (ms), axis 0 to 1000, for PC(1), PC(4), GPU, HLS, and general PEs on five models: Weibel, Neuron, Weibel + gas, Weibel + hemo, Hemodynamic. Labeled bar values: 1522, 1490, 1430, 1184.]
Speedup vs real-time: PC(1): 0.8x, PC(4): 3.1x, GPU: 1.6x, HLS: 3.2x, General PEs: 4.9x
Problem: More PEs, lower frequency
• Internal PE critical path (within a PE: DSP, INST MEM, DATA MEM)
• Inter-PE critical path (between PEs on the FPGA)
[Chart: real ODEs/sec and ODEs/sec lost due to frequency drop; 11-gen Weibel model, Virtex6 240T FPGA, general PEs]
Use model structure to improve
• Avoid using FPGA placement (Phase 2)
• Graph embedding: map guest graph to host graph, minimizing max wire length
• Guest: virtual PEs
• Host: physical PEs (FPGA)
Phase 2: Map virtual PEs to physical PEs
• Guest graph goes through the embedding algorithm to the host (FPGA)
• Embedding algorithms [1][2][3]: H-tree embedding, linear embedding, direct map embedding
[1] Zienicke, P. 1990. Embeddings of Treelike Graphs into 2-Dimensional Meshes. (WG '90).
[2] Aleliunas, R., and Rosenberg, A.L. 1982. On Embedding Rectangular Grids in Square Grids. (Computers '82).
[3] Berman, F., and Snyder, L. 1987. On Mapping Parallel Algorithms into Parallel Architectures. (PDC '87).
2D grid of physical PEs
• Bypass FPGA placement
• (Phase 1: May require "graph folding" first to reduce #PEs)
[Figure: equation pairs EqP1/EqV1 through EqP7/EqV7 placed onto a 2D grid of physical PEs on the FPGA]
Compare/backup: Simulated annealing
• Cost function: C = w1*sum + w2*max + w3*gaps
  Sum = sum of wire distances
  Max = max wire length (Euclidean dist.)
  Gaps = wires across architectural features
• Neighbor function: Swap PEs based on distance to neighbors
[Figure: PEs P1 and P2 swapped on the FPGA]
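The cost function above can be sketched as follows. Two points are assumptions, since the slide does not define them: Sum is computed with the same Euclidean distance that the slide specifies for Max, and an "architectural feature" is modeled as a vertical column at a given x position that a wire crosses when its endpoints lie on opposite sides. `placement_cost` and `gap_columns` are hypothetical names.

```python
import math
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]


def placement_cost(pos: Dict[str, Point],
                   wires: Sequence[Tuple[str, str]],
                   gap_columns: Sequence[float],
                   w1: float = 1.0, w2: float = 1.0, w3: float = 1.0) -> float:
    """C = w1*sum + w2*max + w3*gaps for one placement of PEs."""
    lengths = [math.dist(pos[a], pos[b]) for a, b in wires]
    total = sum(lengths)                 # Sum: total wire distance
    longest = max(lengths, default=0.0)  # Max: longest single wire
    # Gaps: count wire/column crossings (assumed feature model).
    gaps = sum(
        1
        for a, b in wires
        for gx in gap_columns
        if min(pos[a][0], pos[b][0]) < gx < max(pos[a][0], pos[b][0])
    )
    return w1 * total + w2 * longest + w3 * gaps


if __name__ == "__main__":
    pos = {"P1": (0.0, 0.0), "P2": (3.0, 4.0), "P3": (0.0, 1.0)}
    wires = [("P1", "P2"), ("P1", "P3")]
    print(placement_cost(pos, wires, gap_columns=[1.5]))
```

Raising w2 relative to w1 pushes the annealer toward shortening the single longest wire, which is the quantity that limits clock frequency in the inter-PE critical path.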
Results
[Figure panels: simulated annealing placement, no placement strategy, embedding placement. Panel captions: "4 generations shown", "5 generations shown", "5 generations shown".]
Results
• 2D Neuron model, 256 PEs, Xilinx Virtex6
• No impact on size
• 20% more power
[Chart annotation: "Not routable"]
Graph embedding (general PEs)
Speedup vs real-time (avg): PC(1): 0.8x, PC(4): 3.1x, GPU: 1.6x, HLS: 3.2x, General PE: 4.9x, Grph emb (GPE): 11.2x
Miller, B., F. Vahid, and T. Givargis. Embedding-Based Placement of Processing Element Networks on FPGAs for Physical Model Simulation. ACM Int. Symp. on FPGAs, 2013.
Custom Processing Element
• Custom datapath to solve a specific type of equation:
  V' = F1 - F2
  F' = P1-P2-(F*CR)*CL
• Custom PE for each ODE type
• Modified synthesis tool to create custom PEs for the given ODEs first, then synthesize ODEs to PEs
[Figure: PE datapath with inputs, Input_sel, data RAM (Address, We), const ROM (Address), SUB and MUL units, controller, and output; FPGA digital mockup with interface controller]
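In software terms, one timestep of such a custom PE for this pair of equations could look like the sketch below. It evaluates the right-hand sides exactly as written on the slide, with ordinary operator precedence, and advances both states with one Euler step. The function name and the reading of CR and CL as constants are assumptions.

```python
from typing import Tuple


def custom_pe_step(V: float, F: float,
                   F1: float, F2: float,
                   P1: float, P2: float,
                   CR: float, CL: float,
                   dt: float) -> Tuple[float, float]:
    """One Euler step for V' = F1 - F2 and F' = P1-P2-(F*CR)*CL."""
    dV = F1 - F2                  # one subtraction
    dF = P1 - P2 - (F * CR) * CL  # two subtractions, two multiplications
    return V + dt * dV, F + dt * dF


if __name__ == "__main__":
    V, F = custom_pe_step(V=1.0, F=2.0, F1=3.0, F2=1.0,
                          P1=5.0, P2=1.0, CR=0.5, CL=2.0, dt=0.5)
    print(V, F)
```

Because the operation sequence is fixed per equation type, a custom PE can hardwire it in SUB and MUL units instead of fetching instructions, which is where the speed advantage over a general PE comes from.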
Custom PEs
[Bar chart: Performance (ms), axis 0 to 1000, for PC(1), PC(4), GPU, HLS, general PEs, and custom PEs on five models: Weibel, Neuron, Weibel + gas, Weibel + hemo, Hemodynamic. Labeled bar values: 1522, 1490, 1430, 1184.]
Speedup vs real-time (avg): PC(1): 0.8x, PC(4): 3.1x, GPU: 1.6x, HLS: 3.2x, General PE: 4.9x, Grph emb (GPE): 11.2x, Custom PE: 6.1x
Huang, Vahid, Givargis. Synthesis of networks of custom processing elements for real-time physical system emulation. Transactions on Design Automation of Electronic Systems (TODAES), 2013 (to appear).
Networks of Heterogeneous PEs
• General PE: slow, flexible (can solve any type of ODE)
• Custom PE: fast, inflexible (only solves one type of ODE)
• Multi-type PE: combines multiple types of ODEs into a single custom PE
Huge solution space:
• How to choose types of PEs?
• How many PEs to allocate?
• How to bind ODEs to PEs?
[Figure: FPGA digital mockup with interface]
Huang, Miller, Vahid, Givargis. Synthesis of Heterogeneous Processing Elements for Physical System Emulation. CODES+ISSS 2012, Oct, 2012.
Automatic allocation and binding
[Flowchart nodes: initial random allocation; simulated annealing ODE-to-PE mapper; cycles of each PE; PE allocator; new PE allocation; decision "Better solution" (Y/N); best solution]
Heterogeneous PEs
[Bar chart: Performance (ms), axis 0 to 1000, for PC(1), PC(4), GPU, HLS, general PEs, custom PEs, and heterogeneous PEs on five models: Weibel, Neuron, Weibel + gas, Weibel + hemo, Hemodynamic. Labeled bar values: 1522, 1490, 1430, 1184.]
Speedup vs real-time (avg): PC(1): 0.8x, PC(4): 3.1x, GPU: 1.6x, HLS: 3.2x, General PE: 4.9x, Grph emb (GPE): 11.2x, Custom PE: 6.1x, Heterog PE: 34.5x
C. Huang, B. Miller, F. Vahid, T. Givargis. Synthesis of Custom Networks of Heterogeneous Processing Elements for Complex Physical System Emulation. IEEE/ACM Conf on Hardware/Software Codesign and System Synthesis (CODES/ISSS, part of ESWEEK), Finland, Oct 2012.
Network of general/custom/heterogeneous PEs vs. HLS (regularity extraction)
• Performance (ms): time to emulate 1000 ms, using Euler with 0.01 ms step
• Size: equivalent LUTs
• Heterogeneous PE compared with each alternative, as (Speed, Size):
  vs. HLS: (10x, 1.1x)
  vs. general PE: (7x, 0.85x)
  vs. custom PE: (6x, 1.35x)
Speedup / dollar
• Heterogeneous PEs: 3x better than PC(4), 4.5x better than GPU
• FPGA: Easier to build custom interfaces
Platform costs:
• CPU (I7-950 + Intel X58 board): $480
• GPU (GTX460 + I3-540 + H55 board): $380
• FPGA (Xilinx Virtex6 240T-2 board): $1800
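The ratios on this slide can be approximately reproduced from the platform prices listed here and the average speedups reported elsewhere in the deck (PC(4): 3.1x, GPU: 1.6x, heterogeneous PEs: 34.5x). That the slide's figures were derived exactly this way is an assumption; the sketch only shows that the numbers are consistent.

```python
from typing import Dict, Tuple

# (average speedup vs real-time, platform price in dollars), from the slides
PLATFORMS: Dict[str, Tuple[float, float]] = {
    "PC(4)": (3.1, 480.0),
    "GPU": (1.6, 380.0),
    "Heterogeneous PEs": (34.5, 1800.0),
}


def speedup_per_dollar(name: str) -> float:
    """Average speedup divided by platform price."""
    speedup, price = PLATFORMS[name]
    return speedup / price


def advantage(name: str, baseline: str) -> float:
    """How many times better `name` is than `baseline` in speedup/dollar."""
    return speedup_per_dollar(name) / speedup_per_dollar(baseline)


if __name__ == "__main__":
    for baseline in ("PC(4)", "GPU"):
        ratio = advantage("Heterogeneous PEs", baseline)
        print(f"Heterogeneous PEs vs {baseline}: {ratio:.2f}x")
```

The computed ratios come out close to the 3x and 4.5x quoted on the slide.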
Other projects
• Assistive monitoring
  www.cs.ucr.edu/~vahid/assistivemonitoring/
  http://www.youtube.com/watch?feature=player_embedded&v=Sf8tU-78lXs
• Web-based learning
  "Textbook is dead"
  Multi-univ synergy
  pcpp.zyante.com (C++)
• Embedded systems educ.
  New prog. model, virtual lab, programmingembeddedsystems.com
  Also riosscheduler.org
• Drunk driving (DUI)
  duicam.org
  http://www.utsandiego.com/news/2013/feb/11/ucr-drunken-driving-app/
Summary
Speedup vs real-time (avg): PC(1): 0.8x, PC(4): 3.1x, GPU: 1.6x, HLS: 3.2x, General PE: 4.9x, Grph emb (GPE): 11.2x, Custom PE: 6.1x, Heterog PE: 34.5x (Grph emb + HPE: 48.5x)
• FPGAs: Fastest cost-effective execution of physical models
  http://www.youtube.com/watch?v=ThUKVhqoA3Q
• Future
  Manycore device
  Beyond testing CPS
  Implement end-products
http://www.meti.com/
Questions?