This presentation will cover practical techniques for optimizing embedded systems, including following a process, setting quantitative goals, considering platform architecture, optimizing algorithms, estimation and modeling, helping the compiler, power considerations, and multi-core systems.
Embedded TechCon: Practical Techniques for Embedded System Optimization Processes
Rob Oshana, robert.Oshana@freescale.com
Agenda
• Follow a process
• Define the goals, quantitatively
• The platform architecture makes a big difference
• Don’t be naïve about the algorithms
• Do some estimation and modeling
• Help out the compiler if possible
• Power is becoming more important
• What about multiple cores?
• Track what you are doing
There is a right way and a wrong way
Donald Knuth: "Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."
Discipline and an iterative approach are the keys to effective serial performance tuning:
• Use measurements and careful analysis to guide decision making
• Change one thing at a time
• Meticulously re-measure to confirm that each change has been beneficial
Symptoms
• Excessive optimization
• Premature optimization
• Fixation on efficiency
These consume project resources, delay release, and compromise software design without directly improving performance.
Model first before optimizing; there are always tradeoffs.
Functional vs. non-functional requirements
• Functional: "The embedded software shall... (monitor, control, etc.)" (what the system should do)
• Non-functional: "The embedded software shall be... (fast, reliable, scalable, etc.)" (how well the system should do it)
Spend time up front understanding your non-functional requirements. "It has to be really fast" and "it has to be able to kick <competitor A>'s butt" are examples of real performance "requirements"; state them quantitatively instead, for example:
• IPFwd fast path (Kpps): Should = 600, Must = 550
There is a Difference Between Latency and Throughput
"It is not possible to determine both the position and momentum of an object beyond a certain amount of precision." (Heisenberg's uncertainty principle)
Similarly, it is very difficult to design a system that provides both low latency and high throughput. Real-world systems (media, eNodeB, etc.) need both, so tune the system for the right balance. Example targets:
• Latency: 10 usec average, 50 usec maximum wake-up latency for RT tasks
• Throughput: 50 Mbps UL, 100 Mbps DL for 512 B packets
Map the application to the right compute resource:
• CPU: latency-oriented cores
• GPU: throughput-oriented cores
• Or offload to the cloud
Embedded performance can be estimated prior to writing the code
• Maximum CPU performance: "What is the maximum number of times the CPU can execute your algorithm?" (max # channels)
• Maximum I/O performance: "Can the I/O keep up with this maximum # channels?"
• Available high-speed memory: "Is there enough high-speed internal memory?"
Then answer: (1) what is the CPU load (% of maximum), and (2) at this CPU load, what other functions can I perform?
Example: performance calculation. How many channels can the core handle given this algorithm?
Algorithm: 200-tap (nh) low-pass FIR filter; frame size: 256 (nx) 16-bit elements; sampling frequency: 48 kHz

CPU:
• FIR benchmark: (nx/2) * (nh+7) = 128 * 207 = 26,496 cycles/frame
• Frames per second: sampling freq / frame size = 48000 / 256 = 187.5 frames/s
• MIP calculation: (frames/s) * (cycles/frame) = 187.5 * 26,496 = 4.97M cycles/s
• Conclusion: the FIR takes ~5 MIPS on Embedded Core XYZ
• Max # channels: 60 @ 300 MHz (does not include overhead for interrupts, control code, RTOS, etc.)

I/O:
• Required I/O rate: 48 Ksamples/s * 16 bits * 60 channels = 46.08 Mbps
• DSP serial-port rate (full duplex): 50.00 Mbps, so the serial port keeps up
• DMA rate: (2 x 16-bit transfers/cycle) * 300 MHz = 9600 Mbps, so DMA keeps up

Memory:
• Required data memory: (60 * 200) + (60 * 4 * 256) + (60 * 2 * 199) = 97K x 16-bit
  (assumes 60 different filters, a 199-element delay buffer, and double-buffered rcv/xmt)
• Available internal memory: 32K x 16-bit, so the full channel count does NOT fit in internal memory
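The slide's arithmetic can be captured in a few lines of C so the estimate can be re-run as parameters change. This is a minimal sketch: the numbers (tap count, frame size, sampling rate, 300 MHz clock) come from the slide, and the (nx/2)*(nh+7) cost model is specific to the slide's hypothetical "Embedded Core XYZ", not a general FIR cost formula.

```c
#include <assert.h>

/* Estimation parameters, all taken from the slide's example. */
enum { NH = 200, NX = 256, FS = 48000 };   /* taps, frame size, sample rate */
#define CPU_HZ 300000000.0                 /* hypothetical 300 MHz core     */

/* Core-specific FIR benchmark: (nx/2) * (nh+7) cycles per frame. */
long cycles_per_frame(void)   { return (NX / 2) * (NH + 7L); }

/* How often a frame fills: sampling rate / frame size. */
double frames_per_sec(void)   { return (double)FS / NX; }

/* Millions of cycles per second consumed by one channel. */
double mips_per_channel(void) { return frames_per_sec() * cycles_per_frame() / 1e6; }

/* Channels the core can sustain, ignoring interrupt/RTOS overhead. */
int max_channels(void)        { return (int)(CPU_HZ / (frames_per_sec() * cycles_per_frame())); }
```

With the slide's numbers this reproduces 26,496 cycles/frame, ~4.97 MIPS per channel, and 60 channels at 300 MHz.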
Estimation results drive options
Application: simple, low-end (CPU load 5-20%). What do you do with the other 80-95%?
• Additional functions/tasks
• Increase the sampling rate (increase accuracy)
• Add more channels
• Decrease voltage/clock speed (lower power)
Application: complex, high-end (CPU load 100%+). How do you split up the tasks wisely?
• GPP/uC (user interface) + DSP (all signal processing)
• DSP (user i/f, most signal processing) + FPGA (high-speed tasks)
• GPP (user i/f) + DSP (most signal processing) + FPGA (high-speed tasks)
Help out the compiler
• A compiler maps high-level code to a target platform while preserving the defined behavior of the high-level language
• The target may provide functionality that is not directly mapped into the high-level language
• The application may use algorithmic concepts that are not handled by the high-level language
• Understanding how the compiler generates code is important to writing code that will achieve the desired results
Big compiler impact 1: instruction-level parallelism (ILP)
Without more information, the compiler must assume stores may alias loads, forcing it to perform operations sequentially. The restrict qualifier promises the compiler that loads and stores are independent, so operations can be performed in parallel and SIMD optimizations become possible.
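A minimal sketch of the aliasing point: the function names below are illustrative, not from the presentation. Compiled with optimization, the restrict version is free to load several elements of a and b at once, because the pointers are guaranteed not to overlap.

```c
#include <stddef.h>

/* Without restrict: out could overlap a or b, so each store may feed a
   later load and the compiler keeps the iterations sequential. */
void add_may_alias(float *out, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* With restrict: the programmer promises no overlap, so independent
   loads and stores can be reordered and vectorized (SIMD). */
void add_restrict(float * restrict out,
                  const float * restrict a,
                  const float * restrict b,
                  size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```

Note that restrict is a promise, not a check: passing overlapping pointers to the restrict version is undefined behavior.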
Big compiler impact 2: data locality

Original loop (B is read column-wise, so spatial locality on B is poor):

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            A[i][j] = B[j][i];

After unrolling the outer loop and fusing the copies of the inner loop:

    for (i = 0; i < N; i += 2)
        for (j = 0; j < N; j++) {
            A[i][j]   = B[j][i];
            A[i+1][j] = B[j][i+1];
        }

• Spatial locality of B is enhanced: B[j][i] and B[j][i+1] are adjacent in memory
• The larger loop body increases the available ILP
• General guideline: align computation and locality
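The two loop versions above can be packaged as compilable functions to confirm the transformation preserves behavior; N = 64 and the function names are illustrative choices, not from the slide.

```c
#define N 64

/* Naive version: B is traversed column-wise, one element per cache
   line touched on each inner iteration. */
void copy_naive(int A[N][N], int B[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = B[j][i];
}

/* Unroll-and-jam: outer loop unrolled by 2, inner loops fused.
   Each inner iteration now reads B[j][i] and B[j][i+1], which are
   adjacent in memory, improving spatial locality on B.
   Assumes N is even. */
void copy_unrolled(int A[N][N], int B[N][N])
{
    for (int i = 0; i < N; i += 2)
        for (int j = 0; j < N; j++) {
            A[i][j]   = B[j][i];
            A[i+1][j] = B[j][i+1];
        }
}
```

Both functions compute the same transpose; only the memory-access pattern differs, which is exactly what the locality guideline is about.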
Use the right algorithm: think Big O
An O(n^2) algorithm will eventually lose to an O(n log n) one no matter how well its inner loop is tuned; shaving the per-iteration cost from 200 to 100 to 40 cycles cannot compensate for a worse growth rate as n increases.
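A concrete way to see the growth-rate gap is to count comparisons rather than cycles. The sketch below (illustrative, not from the presentation) contrasts O(n) linear search with O(log n) binary search on a sorted array: for n = 1024, the worst case is 1024 comparisons versus about 11.

```c
/* Global iteration counter so the two algorithms can be compared. */
long cmp_count;

/* O(n): scan every element until the key is found. */
int linear_search(const int *a, int n, int key)
{
    for (int i = 0; i < n; i++) {
        cmp_count++;
        if (a[i] == key)
            return i;
    }
    return -1;
}

/* O(log n): halve the search interval each iteration (sorted input). */
int binary_search(const int *a, int n, int key)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        cmp_count++;
        if (a[mid] == key)
            return mid;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return -1;
}
```

The same reasoning applies to sorting (O(n^2) insertion sort vs. O(n log n) merge sort) and any other algorithm choice: pick the right complexity class first, then micro-optimize.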
Understand performance patterns (and anti-patterns)
[Diagram legend: green = data, red = control, blue = termination]
Power Optimization: Active vs. Static Power
Power consumption in CMOS circuits: P_total = P_active + P_static
Deriving the active (dynamic) term:
• Capacitance is charge per volt: C = q / V, so q = C * V
• Work to charge a node: W = V * q, so W = V * (C * V) = C * V^2
• Power is work over time, i.e., how many times per second we oscillate the circuit: P = W / T, and since T = 1/F, P = W * F
• Substituting: P = C * V^2 * F
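The P = C * V^2 * F relation makes the payoff of voltage scaling obvious: frequency enters linearly, but voltage enters squared. A tiny sketch with made-up example values (the capacitance, voltage, and frequency below are illustrative, not figures from the presentation):

```c
/* Dynamic CMOS power: P = C * V^2 * F.
   c_farads : effective switched capacitance (illustrative value)
   volts    : supply voltage
   freq_hz  : switching frequency */
double dynamic_power(double c_farads, double volts, double freq_hz)
{
    return c_farads * volts * volts * freq_hz;
}
```

For example, halving the supply voltage at a fixed frequency cuts dynamic power by 4x, while halving the frequency alone only cuts it by 2x; this is why voltage/frequency scaling schemes drop voltage whenever the lower frequency permits it.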
Top Ten Power Optimization Techniques
1. Architect software to have natural "idle" points (including low-power boot)
2. Use interrupt-driven programming (no polling; use the OS to block)
3. Place code and data close to the processor to minimize off-chip accesses (and overlay from non-volatile to fast memory)
4. Place frequently accessed code/data close to the CPU (use hierarchical memory models)
5. Optimize for size to reduce footprint, memory, and the corresponding leakage
6. Optimize for speed to enable more CPU idle time or a reduced CPU frequency (benchmark and experiment!)
7. Don't over-calculate: use minimum data widths, reduce bus activity, use smaller multipliers
8. Use DMA for efficient transfers (not the CPU)
9. Use co-processors to handle/accelerate frequent or specialized processing
10. Use more buffering and batch processing to allow more computation at once and more time in low-power modes
Above all, use the OS to scale voltage/frequency, and analyze/benchmark as you go (make it right first!)
When you have more than one core to optimize (multicore)
Goal: exploit multicore resources.
Step 1: optimize the serial implementation first. It is easier, less time consuming, and less likely to introduce bugs, and reducing the serial gap means less parallelization is needed; it also lets the parallelization effort focus on parallel behavior rather than a mix of serial and parallel issues.
Serial optimization is not the end goal:
• Apply changes that will facilitate parallelization and the performance improvements it can bring
• Avoid serial optimizations that interfere with or limit parallelization, such as introducing unnecessary data dependencies or exploiting details of the single-core hardware architecture (such as cache capacity)
There’s Amdahl and then there’s Gustafson (know the difference)
• Amdahl (conventional wisdom): speedup decreases as the serial fraction S of the code grows (diminishing returns), imposing a fundamental limit of 1/S on speedup. It assumes the parallel-to-serial code ratio is fixed for any given application, which may be unrealistic.
• Gustafson: applies when the problem scales with the machine rather than keeping a fixed code ratio, e.g., networking/routing. Speedup then becomes roughly proportional to the number of cores in the system. Packet processing provides exactly this kind of opportunity for parallelism.
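The two laws are one line each, so the difference is easiest to see numerically. Below, s is the serial fraction and n the core count; the function names are ours, but the formulas are the standard statements of the two laws: Amdahl's speedup is 1 / (s + (1-s)/n) for a fixed problem size, and Gustafson's scaled speedup is n - s*(n-1) when the problem grows with the machine.

```c
/* Amdahl's law: fixed problem size; the serial fraction s caps the
   speedup at 1/s no matter how many cores n are added. */
double amdahl_speedup(double s, int n)
{
    return 1.0 / (s + (1.0 - s) / n);
}

/* Gustafson's law: problem size scales with n; speedup grows almost
   linearly in the core count. */
double gustafson_speedup(double s, int n)
{
    return (double)n - s * (n - 1);
}
```

With 10% serial code on 16 cores, Amdahl predicts a speedup of 6.4 (and a hard ceiling of 10x), while Gustafson predicts 14.5; for workloads like packet processing, where more cores simply handle more flows, the Gustafson view is the realistic one.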
Multithreaded programming has some hazards:
• Deadlock
• Livelock
• False sharing
• Data hazards (races)
• Lock contention
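False sharing is the least obvious hazard in the list: two threads writing logically unrelated variables that happen to share a cache line will ping-pong that line between cores. The usual fix is padding, shown in this sketch (the 64-byte line size is an assumption; check your core's manual, or C11's max_align_t facilities, for the real value).

```c
#include <stddef.h>

#define CACHE_LINE 64   /* assumed cache-line size; hardware-dependent */

/* Hazard: two per-thread counters packed into one cache line.
   Core 0 writing 'a' and core 1 writing 'b' invalidate each other's
   copy of the line on every write, even though the data is unshared. */
struct counters_packed {
    long a;
    long b;
};

/* Fix: pad each counter out to a full cache line so the two cores
   never touch the same line. Costs memory, buys scalability. */
struct counters_padded {
    long a;
    char pad_a[CACHE_LINE - sizeof(long)];
    long b;
    char pad_b[CACHE_LINE - sizeof(long)];
};
```

Profiling tools that report cache-coherence traffic are the reliable way to confirm false sharing before restructuring data this way.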
Optimize for the best-case scenario, not the worst case
• Fast path: no lock contention means no system call
• Since most operations will not require arbitration between threads, the contended path is rarely taken
• This is especially effective when the number of threads is low
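A futex-style lock is the classic example of this principle: the uncontended acquire is a single user-space compare-and-swap with no system call. The sketch below uses C11 atomics; it is illustrative only, and where a real implementation would park the thread in the kernel on contention, this one simply spins.

```c
#include <stdatomic.h>

typedef struct {
    atomic_int locked;   /* 0 = free, 1 = held */
} fastlock_t;

void fastlock_init(fastlock_t *l)
{
    atomic_init(&l->locked, 0);
}

void fastlock_acquire(fastlock_t *l)
{
    int expected = 0;
    /* Best case: nobody holds the lock, so this single CAS succeeds
       immediately -- no system call, no kernel involvement. */
    while (!atomic_compare_exchange_weak(&l->locked, &expected, 1)) {
        /* Contended (rare) case: a real futex-based lock would make a
           kernel wait call here; this sketch just retries. */
        expected = 0;
    }
}

void fastlock_release(fastlock_t *l)
{
    atomic_store(&l->locked, 0);
}
```

The design choice is exactly the slide's point: pay nothing in the common uncontended case, and accept a more expensive path for the rare contended one.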
Top Ten Performance Optimization Techniques for Multicore
1. Achieve proper load balancing
2. Improve data locality and reduce false sharing
3. Use affinity scheduling if necessary
4. Choose lock granularity carefully
5. Manage lock frequency and ordering
6. Remove unnecessary synchronization barriers
7. Prefer asynchronous over synchronous communication where possible
8. Tune scheduling
9. Use a worker thread pool and manage the thread count
10. Use parallel libraries (pthreads, OpenMP, etc.)
Recommendation: start developing crawl charts that track quantitative targets over time, for example:
• DL throughput: 60 Mbps (with MCS=27, DL MIMO)
• UL throughput: 20 Mbps (with MCS=20 for UL)
Recommendation: form a Performance Engineering team
The team sits between feature development and the mainline: as feature content, configuration settings, and SoC features/NPIs are upstreamed from the repository/branches/patches into the upstream kernel and the SoC kernel, the performance engineering team owns feature merge and feature integration.