Embedded Computer Architecture 5KK73 MPSoC Platforms, Part 2: Cell. Bart Mesman and Henk Corporaal
The Complexity Crisis I have always wished that my computer would be as easy to use as my telephone. My wish has come true. I no longer know how to use my telephone. --Bjarne Stroustrup
The first SW crisis Time Frame: ’60s and ’70s • Problem: Assembly Language Programming • Computers could handle larger, more complex programs • Needed to get Abstraction and Portability without losing Performance • Solution: • High-level languages for von Neumann machines: FORTRAN and C
The second SW crisis Time Frame: ’80s and ’90s • Problem: Inability to build and maintain complex, robust applications requiring multi-million lines of code developed by hundreds of programmers • Computers could handle larger, more complex programs • Needed to get Composability and Maintainability • High performance was not an issue: left to Moore’s Law
Solution • Object Oriented Programming • C++, C# and Java • Also… • Better tools • Component libraries, Purify • Better software engineering methodology • Design patterns, specification, testing, code reviews
Today: Programmers are Oblivious to Processors • Solid boundary between Hardware and Software • Programmers don’t have to know anything about the processor • High-level languages abstract away the processor • Ex: Java bytecode is machine independent • Moore’s law does not require programmers to know anything about the processor to get good speedups • Programs are oblivious of the processor -> work on all processors • A program written in C in the ’70s still works, and runs much faster today • This abstraction provides a lot of freedom for the programmers
Contents • Hammer your head against 4 walls • Or: Why Multi-Processor • Cell Architecture • Programming and porting • plus case-study
What’s stopping them? • General-purpose uni-cores have stopped historic performance scaling • Power consumption • Wire delays • DRAM access latency • Diminishing returns of more instruction-level parallelism
Figure [Patterson]: processor vs. memory performance, 1980–2005 (log scale). µProc improves 55%/year (“Moore’s Law”), DRAM only 7%/year; the Processor-Memory Performance Gap grows 50%/year.
Now what? • Latest research ideas drained • Tried every trick in the book So: we’re fresh out of ideas. Multi-processor is all that’s left!
Low power through parallelism • Sequential Processor • Switching capacitance C • Frequency f • Voltage V • P = f·C·V² • Parallel Processor (two times the number of units) • Switching capacitance 2C • Frequency f/2 • Voltage V’ < V • P = (f/2)·2C·V’² = f·C·V’² < f·C·V²
Architecture methods: Powerful Instructions (1) MD-technique • Multiple data operands per operation • SIMD: Single Instruction Multiple Data Vector instruction: for (i=0; i<64; i++) c[i] = a[i] + 5*b[i]; becomes c = a + 5*b Assembly: set vl,64 / ldv v1,0(r2) / mulvi v2,v1,5 / ldv v1,0(r1) / addv v3,v1,v2 / stv v3,0(r3)
Architecture methods: Powerful Instructions (1) • Sub-word parallelism • SIMD on restricted scale • Used for multi-media instructions • Motivation: use a powerful 64-bit ALU as 4 x 16-bit ALUs • Examples: MMX, SUN-VIS, HP MAX-2, AMD-K7/Athlon 3DNow!, TriMedia II • Example: sum of absolute differences, Σ_{i=1..4} |a_i − b_i|
MPSoC Issues • Homogeneous vs Heterogeneous • Shared memory vs local memory • Topology • Communication (Bus vs. Network) • Granularity (many small vs few large) • Mapping • Automatic vs manual parallelization • TLP vs DLP • Parallel vs Pipelined
Cell/B.E. - the history • Sony/Toshiba/IBM consortium • Austin, TX – March 2001 • Initial investment: $400,000,000 • Official name: STI Cell Broadband Engine • Also goes by Cell BE, STI Cell, Cell • In production for: • PlayStation 3 from Sony • Mercury’s blades
Cell/B.E. – the architecture • 1 x PPE 64-bit PowerPC • L1: 32 KB I$ + 32 KB D$ • L2: 512 KB • 8 x SPE cores: • Local store: 256 KB • 128 x 128 bit vector registers • Hybrid memory model: • PPE: Rd/Wr • SPEs: Asynchronous DMA • EIB: 205 GB/s sustained aggregate bandwidth • Processor-to-memory bandwidth: 25.6 GB/s • Processor-to-processor: 20 GB/s in each direction
C++ on Cell 1. Send the code of the function to be run on the SPE 2. Send the address to fetch the data from 3. DMA the data into the LS from main memory 4. Run the code on the SPE 5. DMA the data out of the LS to main memory 6. Signal the PPE that the SPE has finished the function
Conclusions • Multi-processors inevitable • Huge performance increase, but… • Hell to program • Got to be an architecture expert • Portability?