Microprocessors, Advanced: Partitioning an Embedded System for Multicore Design
January 31, 2012
Jack Ganssle
The Schedule Grows Faster Than The Code!
IBM data:
  person-yrs   LOC/month
  1            439
  10           220
  100          110
  1000         55
COCOMO: Schedule = C * KLOC^M (C and M are both > 1)
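The superlinear growth in the COCOMO-style formula above can be sketched in a few lines. The constants C and M here are assumptions chosen for illustration (real COCOMO calibrations depend on project type); the point is only that with M > 1, doubling the code size more than doubles the schedule.

```python
# Illustrative sketch of Schedule = C * KLOC^M with C and M both > 1.
# C = 2.4 and M = 1.12 are placeholder values, not from the talk.

def schedule(kloc, c=2.4, m=1.12):
    """Schedule in arbitrary units for a program of `kloc` thousand lines."""
    return c * kloc ** m

# Doubling the code size from 10 KLOC to 20 KLOC:
ratio = schedule(20) / schedule(10)
print(round(ratio, 2))  # ≈ 2.17 — more than 2x, since M > 1
```

This is the same effect as the IBM table: as programs grow, productivity per person-month falls, so the schedule grows faster than the code.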
Partitioning Code
Fact: The easiest way to write great modules fast is to keep them small, with few dependencies.
Smaller functions:
• have fewer bugs (bug rates are 2 to 6x lower)
• are more likely to meet specs
• are done faster.
We Turn Micros into Mainframes
[Diagram: an 8051 with sensors and an interface, running 1,000,000 lines of code]
A Better Design
[Diagram: supervisory code coordinating several small processors, each running its own I/O code]
Interprocessor Communications
[Diagram: a main CPU linked over I2C, a serial interface, to dedicated processors for serial/encryption, a rangefinder, and transaction processing]
The Tradeoff
[Diagram: the classic project triangle of schedule, quality, and features]
Requirements Scrubbing = 73.6%!
Don’t Wait for Hardware
• Build an I/O board that plugs into the PC
• Simulate!
• Virtualization – Virtutech, CoWare, VaST
• FitNesse: http://fitnesse.org/
• CatsRunner: www.agilerules.com/projects/catsrunner/index.phtml
What About Multicore?
[Diagram: a CPU connected directly to memory; accesses take hundreds of nsec, CPUs run at tens of MHz]
Then Came Prefetchers
[Diagram: a prefetch queue between CPU and memory; accesses drop to under 100 nsec, CPUs at tens of MHz]
Then Came Pipelines
[Diagram: CPU and memory; accesses of 30-50 nsec, CPUs at tens of MHz]
Old: Fetch -> Decode -> Execute, one instruction at a time
Pipelined: the fetch, decode, and execute stages overlap across successive instructions
Cache
[Diagram: a cache between CPU and memory; CPU and cache run at CPU speed (hundreds of MHz), memory at 30-50 nsec]
Cache Splits in Two
[Diagram: CPU with an L1 cache at CPU speed (over 1 GHz), an L2 cache at 3-5 nsec, and memory at 30-50 nsec]
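A quick back-of-the-envelope calculation shows why this hierarchy works. Using rough latencies in the spirit of the slides, and hit rates that are purely assumptions for illustration, the average memory access time lands far closer to L1 speed than to DRAM speed:

```python
# Average memory access time (AMAT) sketch. The latencies roughly
# follow the slides (L2 at 3-5 ns, memory at 30-50 ns); the L1
# latency and both hit rates are assumptions for illustration.
L1_NS, L2_NS, MEM_NS = 1.0, 4.0, 40.0
l1_hit, l2_hit = 0.95, 0.90   # assumed hit rates

amat = l1_hit * L1_NS + (1 - l1_hit) * (l2_hit * L2_NS + (1 - l2_hit) * MEM_NS)
print(round(amat, 2))  # ≈ 1.33 ns — close to L1, despite slow DRAM
```

With high hit rates, the slow memory at the bottom of the hierarchy is almost invisible; with low hit rates, the CPU runs at memory speed.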
SMP
Symmetric Multiprocessing (SMP) – multiple identical CPUs working with a shared memory array.
[Diagram: two CPU cores sharing one memory]
Amdahl’s Law for SMP
Max speedup = 1 / (f + (1 - f)/n)
Where:
n = number of processors
f = fraction of the operation that cannot be parallelized
With an Infinite # CPUs
[Chart: speedup vs. the portion that is not parallelizable; as the core count goes to infinity, speedup approaches 1/f]
Best Case: 66% Parallelizable
[Chart: speedup vs. number of cores; with 66% of the work parallelizable, speedup flattens out below 3x no matter how many cores are added]
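The flattening curve above falls straight out of Amdahl's Law. A minimal sketch, using the formula as given on the slides (f = serial fraction, n = number of cores):

```python
# Amdahl's Law: max speedup = 1 / (f + (1 - f)/n),
# where f is the fraction that cannot be parallelized.

def amdahl_speedup(f, n):
    return 1.0 / (f + (1.0 - f) / n)

# The 66%-parallelizable case from the slide (f = 0.34):
for n in (1, 2, 4, 16, 1024):
    print(n, round(amdahl_speedup(0.34, n), 2))
# Speedup creeps toward, but never reaches, 1/0.34 ≈ 2.94.
```

Going from 4 cores to 1024 cores buys less than a 1.5x improvement here, which is the heart of the multicore caution in the rest of the talk.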
But Memory is a Bottleneck!
[Diagram: two CPU cores, each with its own L1 cache (typically 32 KB), sharing an L2 cache (typically 2-4 MB) in front of a single memory]
And so is Comm
[Diagram: four CPU cores with per-core L1 caches, two shared L2 caches, and one memory]
Then there’s the cache coherency problem.
The Irony
• Programs in L1 run blazingly fast
• But why use a 32-bit CPU that can address 4 GB to run a 32 KB program?
A Colorimeter SMP Design
[Diagram: three cores (R, G, B) on a common bus to shared memory, each with its own A/D converter and display, and each running the same loop:]
- Read A/D
- FIFO data
- Do FIR
- Calculate R (or G, or B)
- Display
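The per-channel loop in the colorimeter example can be sketched as below. The filter taps and A/D readings are made-up placeholders, not from the talk; the point is only the shape of the work each core repeats.

```python
# Sketch of one colorimeter channel's processing step: take the most
# recent A/D samples from a FIFO and run a small FIR filter over them.
# Taps and sample values are illustrative assumptions.

def fir(samples, taps):
    """Convolve the newest len(taps) samples with the filter taps."""
    recent = samples[-len(taps):][::-1]   # newest sample first
    return sum(s * t for s, t in zip(recent, taps))

taps = [0.25, 0.5, 0.25]            # simple low-pass taps (assumption)
fifo = [10, 12, 11, 13, 12]         # pretend A/D readings
channel_value = fir(fifo, taps)
print(channel_value)  # 12.25
```

In the SMP design, three cores each run this loop against shared memory; the AMP design on the next slide instead dedicates hardware to each stage.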
ASMP
Asymmetric Multiprocessing (ASMP or AMP) – multiple CPUs, identical or not, each running a specific activity.
[Diagram: two CPU cores, each with its own memory, connected by a communications link]
A More Natural Design via AMP
[Diagram: three per-channel pipelines, each running A/D -> FIFO -> FIR on dedicated hardware, feeding Calc R, Calc G, and Calc B stages with their own displays]
Another Assembly Line
[Diagram: data flowing through a chain of CPUs, each with its own private memory, assembly-line fashion]
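The assembly-line idea can be sketched with queues: each stage is an independent worker with its own input queue, like CPUs with private memories passing data down a chain. The two stage functions here are toy placeholders, not from the talk.

```python
# Minimal pipeline sketch: two stages connected by queues, each
# running in its own thread, with None as a shutdown sentinel.
import queue
import threading

def stage(fn, q_in, q_out):
    """Pull items from q_in, apply fn, and pass results to q_out."""
    while True:
        item = q_in.get()
        if item is None:          # sentinel: propagate and stop
            q_out.put(None)
            break
        q_out.put(fn(item))

q1, q2, q3 = queue.Queue(), queue.Queue(), queue.Queue()
t1 = threading.Thread(target=stage, args=(lambda x: x + 1, q1, q2))
t2 = threading.Thread(target=stage, args=(lambda x: x * 2, q2, q3))
t1.start(); t2.start()

for x in (1, 2, 3):
    q1.put(x)
q1.put(None)

results = []
while (r := q3.get()) is not None:
    results.append(r)
t1.join(); t2.join()
print(results)  # [4, 6, 8]
```

As with a hardware assembly line, throughput scales with the number of stages as long as data keeps flowing, but the stages communicate only through their queues, not shared state.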
Implications
Multicore can give huge performance improvements, but for non-parallel problems it may not yield much.
It’s hard, or impossible, to predict the speedup of most algorithms once they grow larger than L1.
Many embedded apps are largely non-parallelizable.
In some cases AMP offers a better solution than SMP.