Understanding Heterogeneous Systems

Lecture 16: Heterogeneous Systems
(Thanks to Wen-Mei Hwu for many of the figures)

What are Heterogeneous Systems?
• Programmable -- not restricted to one particular application, though may be heavily optimized for a class of applications.
• Multi-core -- Multiple, independent execution units on a chip
  • Some people are starting to use the term “many-core” for architectures where there are enough cores that you have to use a non-sequential programming model to get full performance out of the system.
• Heterogeneous -- Cores are different
  • Optimize cores for specific types of applications
  • Can schedule for performance or power

Why are they Interesting?
Embedded applications have tough performance and power requirements
Example: GSM decoder requires 10 Minst/second in software
Motorola V70 GSM cell phone has power budget of approximately 0.8 watts total when in use.
• Includes both encode and decode
• Includes microphone, speaker, radio

Application-Specific Integrated Circuits
[Figure: ASIC block diagram. Labels: Custom Logic, Buffer, Custom Logic, Input Data, Output Data, Control, CPU.]

Why Not Keep Using ASICs?
• Decreasing Product Cycles
• Design Time/Cost
  • Transistors/chip rising at 50%/year
  • Transistors/designer day rising at 10%/year
  • Re-usable cores helping some, but not enough
  • Mask cost greater than $1M
  • Need to fabricate many chips to justify a design
• Lack of Flexibility
  • More and more, consumers want multifunction devices (ex. Cell phone with camera)
  • Increases design time, cost
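The two growth rates above compound into a widening gap between what fits on a chip and what a team can design. A minimal C sketch of that arithmetic, assuming both rates stay constant from year to year (the function name `design_gap` is hypothetical):

```c
/* Ratio of transistors available per chip to transistors a designer can
 * produce per day, after `years` years, with both normalized to 1.0 at
 * year 0. Assumes the slide's rates hold steady: 50%/year and 10%/year. */
double design_gap(int years)
{
    double chip = 1.0;      /* transistors per chip, relative to year 0 */
    double designer = 1.0;  /* transistors per designer-day, relative to year 0 */
    for (int y = 0; y < years; y++) {
        chip *= 1.50;
        designer *= 1.10;
    }
    return chip / designer;
}
```

Under these assumptions the gap grows by 1.5/1.1, about 1.36x, each year, so `design_gap(5)` lands between 4x and 5x.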

Why Heterogeneous Systems?
• Different parts of programs have different requirements
  • Control-intensive portions need good branch predictors, speculation, big caches to achieve good performance
  • Data-processing portions need lots of ALUs, have simpler control flows
• Power Consumption
  • Features like branch prediction, out-of-order execution, tend to have very high power/performance ratios.
• Applications often have time-varying performance requirements
• Observation: Much of the performance, power advantages of ASICs comes from application-specific memory, not application-specific processing
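One way to read "schedule for performance or power" with time-varying requirements: at each point, pick the cheapest core that still meets the current demand. The sketch below is only an illustration of that idea; `struct core`, `pick_core`, and any numbers a caller supplies are hypothetical and do not come from the lecture.

```c
#include <stddef.h>

/* Hypothetical description of one core type. */
struct core {
    const char *name;
    double minst_per_sec;  /* sustained throughput, millions of instructions/s */
    double watts;          /* power drawn while running */
};

/* Return the index of the lowest-power core that still meets the required
 * throughput, or -1 if no core is fast enough. */
int pick_core(const struct core *cores, size_t n, double required_minst)
{
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (cores[i].minst_per_sec < required_minst)
            continue;  /* too slow for the current demand */
        if (best < 0 || cores[i].watts < cores[best].watts)
            best = (int)i;
    }
    return best;
}
```

When demand is low the small core wins on power; when demand rises past what it can sustain, the selection falls back to the larger core.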

Changing Memory to Communication
[Figure: CPU, DRAM, and PEs. Data labels: Az_4, synth, res2, m_syn, F_g3, F_g4, syn, Ap3, Ap4, h, tmp, tmp1, tmp2. PE stages: Weight_Ai, Weight_Ai, Residu, Copy + Set_zero, Syn_filt, Corr0/Corr1, preemph, Syn_filt, agc.]

    Weight_Ai (Az, F_ga3, Ap3)
    Weight_Ai (Az, F_g4, Ap4)
    Residu (Ap3, &syn_subfr[i],)
    Copy (Ap3, h, 11)
    Set_zero (&h[11], 11)
    Syn_filt (Ap4, h, h, 22, &h)
    tmp = h[0] * h[0];
    for (i = 1 ; i < 22 ; i++) tmp = tmp + h[i] * h[i];
    tmp1 = tmp >> 8;
    tmp = h[0] * h[1];
    for (i = 1 ; i < 21 ; i++) tmp = tmp + h[i] * h[i+1];
    tmp2 = tmp >> 8;
    if (tmp2 <= 0) tmp2 = 0;
    else tmp2 = tmp2 * MU;
    tmp2 = tmp2/tmp1;
    preemphasis (res2, temp2, 40)
    Syn_filt (Ap4, res2, &syn_p), 40, mem_syn_pst, 1);
    agc (&syn[i_subfr], &syn) 29491, 40)
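The Corr0/Corr1 fragment on this slide can be written as a self-contained C function. This is a sketch under stated assumptions: the slide does not give the value of MU, so it is passed in as `mu`; the function name `tilt_factor`, the integer widths, and the divide-by-zero guard are additions that are not on the slide.

```c
#include <stdint.h>

#define H_LEN 22  /* length of h[] implied by the loop bounds on the slide */

/* Corr0/Corr1 step from the slide. `mu` stands in for the slide's MU
 * constant, whose value the slide does not give. */
int32_t tilt_factor(const int16_t h[H_LEN], int32_t mu)
{
    int64_t corr0 = 0;  /* sum of h[i] * h[i],   i = 0..21 */
    int64_t corr1 = 0;  /* sum of h[i] * h[i+1], i = 0..20 */

    for (int i = 0; i < H_LEN; i++)
        corr0 += (int32_t)h[i] * h[i];
    for (int i = 0; i < H_LEN - 1; i++)
        corr1 += (int32_t)h[i] * h[i + 1];

    int64_t tmp1 = corr0 >> 8;
    /* The slide shifts first and then clamps non-positive values to zero.
     * Clamping first gives the same result and avoids right-shifting a
     * negative value. */
    int64_t tmp2 = (corr1 <= 0) ? 0 : (corr1 >> 8);
    tmp2 *= mu;  /* a clamped zero stays zero, as in the slide's if/else */

    if (tmp1 == 0)  /* guard added for this sketch; not on the slide */
        return 0;
    return (int32_t)(tmp2 / tmp1);
}
```

Every operand here is read from `h[]`, which is the point the slide makes: the arithmetic is small, and the traffic to and from the buffers is what dominates.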

View from source code
• Note how memory operations dominate
• Note presence of “expensive” instructions

Not as Easy as it Looks
Order of access to data may make transforming memory ops into communication hard
[Figure: Residu, Syn_filt, and preemphasis plotted against time, exchanging res through MEM; access-order labels [39:0] and [0:39].]

Compilers to the Rescue!
• Remove anti-dependence by array renaming
• Apply loop reversal to match producer/consumer I/O
• Convert array access to inter-component communication
Interprocedural pointer analysis + array dependence test + array access pattern summary + interprocedural memory data flow
[Figure: Residu, preemphasis, and Syn_filt plotted against time, now passing res and res2 directly between stages.]
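Two of the transformations listed above, loop reversal and converting array access into communication, can be shown on a toy producer/consumer pair. This is a sketch only: `produce` is a hypothetical stand-in for a stage such as Residu, the consumer is a plain sum, and array renaming is not shown.

```c
#define N 40  /* subframe length used on the slides */

/* Hypothetical producer stage: one output value per input sample. */
static int produce(const int *x, int i)
{
    return x[i] * 2 + 1;
}

/* Before: the producer fills res[] in ascending order, then the consumer
 * walks it in descending order, so every value goes through memory. */
long sum_via_memory(const int *x)
{
    int res[N];
    long acc = 0;
    for (int i = 0; i < N; i++)
        res[i] = produce(x, i);
    for (int i = N - 1; i >= 0; i--)
        acc += res[i];
    return acc;
}

/* After: the consumer loop is reversed to ascending order (legal here
 * because integer addition does not depend on order), which matches the
 * producer's output order. Each value is then handed over directly and
 * the res[] buffer disappears. */
long sum_via_stream(const int *x)
{
    long acc = 0;
    for (int i = 0; i < N; i++) {
        int v = produce(x, i);  /* stands in for a producer-to-consumer channel */
        acc += v;
    }
    return acc;
}
```

The reversal is only safe because this consumer is order-independent; proving that kind of property for real code is what the dependence tests and access-pattern summaries named on the slide are for.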

Heterogeneous Processor Vision
• General-purpose processor orchestrates activity
• Memory transfer module schedules system-wide bulk data movement
• Accelerators can use scheduled, streaming communication… or can operate on locally-buffered data pushed to them in advance
• Accelerated activities and associated private data are localized for bandwidth, power, efficiency
[Figure: block diagram. Labels: GPP, MTM, MAIN MEMORY, ACC (several), LOCAL MEMORY.]
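The division of roles above can be mimicked in plain C: one function plays the memory transfer module, one plays an accelerator working only on its local buffer, and one plays the general-purpose processor that sequences them. All names, the buffer size, and the sum-of-squares kernel are hypothetical choices for this sketch, not part of the lecture.

```c
#include <stddef.h>
#include <string.h>

#define LOCAL_WORDS 64  /* hypothetical local-memory capacity */

/* Hypothetical accelerator with a private local memory. */
struct accelerator {
    int local[LOCAL_WORDS];
    size_t count;
};

/* Stand-in for the memory transfer module: one bulk copy from main memory
 * into an accelerator's local memory. Returns 0 on success, -1 if the
 * block does not fit. */
int mtm_push(struct accelerator *acc, const int *main_mem, size_t n)
{
    if (n > LOCAL_WORDS)
        return -1;
    memcpy(acc->local, main_mem, n * sizeof main_mem[0]);
    acc->count = n;
    return 0;
}

/* Accelerated activity: touches only the locally buffered data. */
long acc_sum_of_squares(const struct accelerator *acc)
{
    long total = 0;
    for (size_t i = 0; i < acc->count; i++)
        total += (long)acc->local[i] * acc->local[i];
    return total;
}

/* General-purpose processor role: request the transfer, then start the
 * accelerator. Returns -1 if the data could not be pushed. */
long gpp_run(const int *main_mem, size_t n, struct accelerator *acc)
{
    if (mtm_push(acc, main_mem, n) != 0)
        return -1;
    return acc_sum_of_squares(acc);
}
```

The kernel never dereferences a main-memory pointer, which is the "pushed to them in advance" case from the slide; a streaming variant would instead consume values as the transfer module delivers them.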

Intel Network Processor -- Existing Example
[Figure: block diagram. Labels: sixteen Micro engines, XScale Core, Hash Engine, Scratch-pad SRAM, CSRs, PCI, SPI4 / CSIX with RFIFO and TFIFO, three RDRAM, four QDR SRAM.]

STI Cell Processor -- Emerging Example
• Power Processor Element (PPE) (Simplified 64-bit PowerPC with VMX)
• Synergistic Processing Element (SPE): SPE1 through SPE8
• Element Interconnect Bus (EIB) internal communication system.
• Dual configurable high-speed channels (38.4 GB/sec ea.)
• Dual 12.8 GB/sec memory busses.
[Figure: block diagram. The PPE and eight SPEs sit on the EIB, with two Memory Controllers (each to RAM) and two I/O Controllers.]

Overview of the Rest of the Semester
• This is the last formal lecture
  • If we haven’t covered it already, we can’t really expect you to use it on your projects
• Final project proposal due Tuesday in class
• I’ll be in my office (208 CSL) during class on 3/27 to provide an opportunity to discuss project issues
• Quiz 2 is 3/29
• Final project demos are 5/3
