FAST: Frequency-Aware Static Timing Analysis. By Kiran Seth, Aravindh Anantaraman, Frank Mueller and Eric Rotenberg, Center for Embedded Systems Research, Departments of CS & ECE, North Carolina State University
Real-Time Systems • Tasks have deadlines: they must terminate on time • Classification • Hard real-time: a missed deadline is catastrophic • Soft real-time: a missed deadline lowers QoS • Multi-tasking real-time systems require scheduling algorithms • The scheduler arbitrates among tasks online • A (static) schedulability test ensures deadlines are met • requires a known Worst-Case Execution Time (WCET)
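To make the schedulability test concrete: for independent periodic tasks whose deadlines equal their periods, the classic EDF test only checks that total utilization, the sum of WCET/period over all tasks, does not exceed 1. The sketch below assumes exactly that task model; `edf_schedulable` and the task values are hypothetical.

```python
def edf_schedulable(tasks):
    """Utilization-based EDF test.

    tasks: list of (wcet, period) pairs in the same time unit.
    Assumes independent periodic tasks with deadline == period.
    """
    utilization = sum(wcet / period for wcet, period in tasks)
    return utilization <= 1.0

# Hypothetical task sets (WCETs and periods in ms).
print(edf_schedulable([(1, 4), (2, 8), (3, 12)]))  # U = 0.75 -> True
print(edf_schedulable([(3, 4), (4, 8)]))           # U = 1.25 -> False
```

The test is only as safe as the WCET values fed into it, which is why the next slides focus on obtaining safe WCET bounds.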
Static Timing Analysis • To schedule tasks in real-time systems, we need the • Worst-Case Execution Time (WCET) and • Worst-Case Execution Cycles (WCEC) • Experimentally measured WCETs are unsafe bounds • due to input & hardware complexity • Use a static timing analysis toolset to obtain safe WCET bounds
Static Instruction Cache Analysis • Described in [Mueller RTS-J’00] • Interprocedural data-flow analysis • Predicts each cache reference as one of • always-hit • always-miss • first-hit • first-miss • Each instruction is categorized • for each loop level • and for each function (treated as a loop with one iteration)
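A minimal sketch of how a category could translate into worst-case misses for one instruction at one loop level, assuming the usual reading of the categories: first-miss misses only on the first iteration and hits afterwards, while first-hit hits only on the first iteration and misses afterwards. The enum and function names are hypothetical.

```python
from enum import Enum

class Category(Enum):
    ALWAYS_HIT = "always-hit"
    ALWAYS_MISS = "always-miss"
    FIRST_HIT = "first-hit"    # assumed: hit on first iteration, misses afterwards
    FIRST_MISS = "first-miss"  # assumed: miss on first iteration, hits afterwards

def worst_case_misses(category, iterations):
    """Misses charged to one instruction over `iterations` loop iterations."""
    if iterations <= 0:
        return 0
    if category is Category.ALWAYS_HIT:
        return 0
    if category is Category.ALWAYS_MISS:
        return iterations
    if category is Category.FIRST_MISS:
        return 1
    if category is Category.FIRST_HIT:
        return iterations - 1
    raise ValueError(f"unknown category: {category}")

# Hypothetical loop with 10 iterations.
for cat in Category:
    print(cat.value, worst_case_misses(cat, 10))
```

Because functions are treated as loops with one iteration, the same rule applies at function level with `iterations=1`.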
Static Data Cache Simulation • Accurate static timing analysis also requires data cache analysis • Currently, our data cache analysis tool is not accurate enough • too many restrictions, not general enough for real code • improvements by [Vera RTSS’03] • Solutions • Assume all data accesses hit: WCET highly underestimated • Assume all data accesses miss: WCET highly overestimated • Assume a cache big enough to hold the entire data set • treat first-time accesses as misses (cold misses only), all others as hits • Accurate? Yes. But what if caches are smaller? • No significant impact on this study
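The cold-miss-only assumption boils down to one rule: the first access to each memory block misses, every later access hits. The static analysis derives this without running the program; the sketch below merely applies the same rule to a hypothetical address trace to illustrate the counting, with the block size as an assumed parameter.

```python
def cold_miss_count(addresses, block_size):
    """Misses under the 'cache holds the whole data set' assumption:
    only the first access to each memory block misses (cold miss)."""
    seen_blocks = set()
    misses = 0
    for addr in addresses:
        block = addr // block_size
        if block not in seen_blocks:
            seen_blocks.add(block)
            misses += 1
    return misses

# Hypothetical trace: a loop reading 16 four-byte ints twice, 32-byte blocks.
trace = [0x1000 + 4 * k for k in range(16)] * 2
print(cold_miss_count(trace, 32))  # 64 bytes span 2 blocks -> 2 misses
```

If the real cache is smaller than the data set, capacity and conflict misses are ignored, which is exactly the caveat raised on the slide.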
Static Timing Analyzer • Path & tree-based approach [Healy IEEE TC’99] • Find nodes in the CFG and derive WCEC for each node • A node is a function or loop • WCET is calculated bottom-up • Standard timing analysis assumptions apply • No recursion • All loop bounds must be known • No function pointers
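A greatly simplified sketch of the bottom-up idea, assuming each node's worst-case path has already been collapsed into a single per-iteration cycle count. The real analyzer [Healy IEEE TC’99] also selects the longest path and distinguishes first from later iterations for cache categories; none of that is modeled here, and the `Node` structure is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A function or loop in the timing tree (hypothetical structure)."""
    name: str
    body_cycles: int       # worst-case cycles of the node's own code per iteration
    iterations: int = 1    # loop bound; a function is a loop with one iteration
    children: list = field(default_factory=list)  # nested loops / callees

def tree_wcec(node):
    """Bottom-up WCEC: each child is resolved before its parent uses it.

    A finite tree exists only because there is no recursion and every
    loop bound is known, matching the standard assumptions.
    """
    per_iteration = node.body_cycles + sum(tree_wcec(c) for c in node.children)
    return node.iterations * per_iteration

# Hypothetical program: main() -> outer loop (bound 10) -> inner loop (bound 4).
inner = Node("inner", body_cycles=3, iterations=4)
outer = Node("outer", body_cycles=5, iterations=10, children=[inner])
main = Node("main", body_cycles=20, children=[outer])
print(tree_wcec(main))  # 20 + 10 * (5 + 4 * 3) = 190
```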
Motivation of FAST • Dynamic Voltage Scaling (DVS) scheduling schemes • change the system's frequency/voltage • to save power without missing deadlines • Several DVS scheduling schemes available • Good fit for real-time systems • Most real-time systems • have low utilization • are low-power embedded systems • Potential for considerable energy savings with DVS
Problem • Current DVS schemes: • ignore the effect of frequency scaling on WCEC • assume WCEC is constant across frequencies • hence overestimate WCET at lower frequencies • To demonstrate the problem • compute the WCET of a C-Lab benchmark with our static timing analysis tool • for frequencies 100MHz – 1GHz • compare the observed WCEC & WCET with the assumption made by DVS schemes
Actual vs. Assumed WCEC for FFT • WCEC changes with frequency scaling • WCEC increases at higher frequencies • Constant memory latency: 100ns
Actual vs. Assumed WCET for FFT • Frequency chosen by DVS for WCET = 5ms: • assumed: ~550 MHz • actual: ~150 MHz
Parametric Frequency Model Problem: • DVS • considers processor frequency scaling • ignores the effect of frequency scaling on memory accesses • With frequency scaling: • cycle counts of processor operations remain constant • except for memory operations: this is the problem • DVS schemes overestimate the WCET at lower frequencies • cannot fully utilize available slack • potential power savings largely wasted
Parametric Frequency Model Solution: • Calculate WCEC • accounting for effects of memory accesses • using the new parametric frequency model • Model: WCEC(f) = i + mN = i + mLf • i: Invariant # of worst-case cycles (for non-memory operations) • m: # of worst-case memory accesses • N: # of cycles per memory access • depends on memory latency L and frequency f: N = Lf
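The model is easy to evaluate directly. The sketch below expresses L in ns and f in MHz so that N = L·f/1000 cycles, and rounds N up to whole cycles (the rounding is an assumption, not part of the slide). The task parameters i and m, and the 100 ns latency borrowed from the FFT slide, are hypothetical inputs.

```python
import math

def cycles_per_access(latency_ns, freq_mhz):
    """N = L * f, rounded up to whole cycles (rounding is assumed)."""
    return math.ceil(latency_ns * freq_mhz / 1000)

def wcec(i, m, latency_ns, freq_mhz):
    """Parametric frequency model: WCEC(f) = i + m * N(f)."""
    return i + m * cycles_per_access(latency_ns, freq_mhz)

def wcet_us(i, m, latency_ns, freq_mhz):
    """Worst-case execution time in microseconds: WCEC(f) / f."""
    return wcec(i, m, latency_ns, freq_mhz) / freq_mhz

# Hypothetical task: i = 1,000,000 invariant cycles, m = 20,000 memory
# accesses, memory latency L = 100 ns.
i, m, L = 1_000_000, 20_000, 100
for f in (100, 1000):
    print(f, "MHz:", wcec(i, m, L, f), "cycles,", wcet_us(i, m, L, f), "us")

# A DVS scheme assuming constant WCEC (taken at 1 GHz) would predict:
print("assumed WCET at 100 MHz:", wcec(i, m, L, 1000) / 100, "us")
```

With these made-up numbers, WCEC grows from 1,200,000 cycles at 100 MHz to 3,000,000 cycles at 1 GHz. The WCET at 100 MHz is 12 ms, whereas treating the 1 GHz WCEC as constant predicts 30 ms, which is the overestimation at low frequencies described above.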
Using the Parametric Frequency Model • Instruction sequence: • A: add R2, R1, R3 • B: load R4, [M1] • C: add R2, R1, R4 • D: add R2, R1, R5 • The sequence is simulated through a simple pipeline to explain the parametric frequency model • Simple pipeline: • 6 stages • data & instruction caches • N = 10
Example 0: Cache Hits • Recall: B is a load instruction • WCEC = 9 + 0N • Each row represents a pipeline stage • Time (and cycle count) increases horizontally
Example 1: Effect of I-cache Miss • WCEC = 9 + 1N • The stall due to the I-cache miss is shown • The model accurately captures memory latency, however long
Example 2: Effect of D-cache Miss • Recall: B is a load instruction • WCEC = 9 + 1N • The stall due to the D-cache miss is shown • Again, the model captures memory latency, however long • Notice: no useful work is done during stall cycles
Example 3: Effect of I- & D-cache Miss • WCEC = 9 + 2N • I-cache miss first, then D-cache miss • Useful cycles can overlap with stall cycles • also during high-latency execution operations • e.g. floating-point, multiply, … overlapping a D-cache miss • This leads to overestimation, which is rare in practice; the WCET remains safe
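The four examples follow one pattern: 4 instructions on a 6-stage pipeline need 4 + 6 - 1 = 9 cycles when everything hits, and each miss adds N stall cycles. The sketch below assumes stalls never overlap with each other or with useful work, so its result is an upper bound, as Example 3 notes; the function name is hypothetical.

```python
def example_wcec(n_instructions, depth, icache_misses, dcache_misses, n):
    """Worst-case cycles of a straight-line sequence on an in-order pipeline.

    Assumes every cache miss stalls the pipeline for N cycles and that
    stalls never overlap (hence a safe upper bound).
    """
    base = n_instructions + depth - 1  # fill the pipeline, then one per cycle
    return base + (icache_misses + dcache_misses) * n

# Sequence A-D (4 instructions), 6-stage pipeline, N = 10.
cases = {"Example 0": (0, 0), "Example 1": (1, 0),
         "Example 2": (0, 1), "Example 3": (1, 1)}
for name, (i_miss, d_miss) in cases.items():
    print(name, example_wcec(4, 6, i_miss, d_miss, n=10))
```

For N = 10 this prints 9, 19, 19 and 29 cycles for Examples 0 to 3, i.e. 9 + 0N, 9 + 1N, 9 + 1N and 9 + 2N.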
Experimental Validation • Combine the frequency model with our static timing analyzer • FAST tool • produces WCEC equations • Experiment to validate results from the FAST tool • Run benchmarks through the FAST tool • obtain an equation representing the WCEC of each benchmark • Run the same benchmarks through the traditional timing analysis tool • Vary frequencies: 100MHz-1GHz
Frequency-Aware Static Timing Analysis (FAST) • The FAST tool is “as accurate” as traditional static timing analysis • Slight overestimation for floating-point benchmarks (stall cycles overlapping long-latency FP operations, cf. Example 3)
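FAST in EDF Scheduling with DVS • DVS with EDF: $\sum_k C_k/P_k \le \alpha$, where $\alpha = f_c/f_m$ ($f_c$: current frequency, $f_m$: maximum frequency) • FAST with EDF: $\sum_k \frac{i_k + m_k L \alpha f_m}{P_k\, \alpha f_m} \le 1$ • Schedulability test: $\frac{1}{f_m}\sum_k \frac{i_k}{P_k} \le \alpha \left(1 - L \sum_k \frac{m_k}{P_k}\right)$ • Implemented the frequency model for 3 EDF-DVS algorithms • algorithms by [Pillai & Shin] • Look-ahead improved: • at completion, consider the next deadline • up to 34% additional energy savings (5-11% on avg.) at low utilization • but 0.5-8% less savings at high utilization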
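A sketch of static (offline) frequency selection under both tests. Under the constant-WCEC assumption, $\alpha$ must cover $\sum_k C_k/P_k$ with $C_k = (i_k + m_k L f_m)/f_m$; under FAST, solving the modified test for $\alpha$ gives $\alpha \ge \frac{\sum_k i_k/P_k}{f_m\,(1 - L\sum_k m_k/P_k)}$, which requires $L\sum_k m_k/P_k < 1$. Both are single passes over the task set. The tuple layout (i_k, m_k, P_k), the discrete frequency levels and all numbers are hypothetical.

```python
def min_alpha_assumed(tasks, latency, f_max):
    """Constant-WCEC assumption: alpha >= sum(C_k / P_k), where
    C_k = (i_k + m_k * L * f_max) / f_max is the WCET at f_max."""
    return sum((i + m * latency * f_max) / f_max / p for i, m, p in tasks)

def min_alpha_fast(tasks, latency, f_max):
    """FAST test solved for alpha:
    alpha >= (sum(i_k / P_k) / f_max) / (1 - L * sum(m_k / P_k))."""
    stall_share = latency * sum(m / p for _, m, p in tasks)
    if stall_share >= 1.0:
        return float("inf")  # memory stalls alone fill every period
    return sum(i / p for i, _, p in tasks) / f_max / (1.0 - stall_share)

def pick_frequency(alpha, f_max, levels):
    """Lowest available frequency f with f / f_max >= alpha, else None."""
    for f in sorted(levels):
        if f / f_max >= alpha:
            return f
    return None

# Hypothetical single task: i = 1e6 cycles, m = 2e4 accesses, P = 11 ms,
# memory latency L = 100 ns, f_max = 1 GHz, levels 100 MHz .. 1 GHz.
tasks = [(1e6, 2e4, 0.011)]
L, F_MAX = 100e-9, 1e9
levels = [k * 100e6 for k in range(1, 11)]
for name, alpha in (("assumed", min_alpha_assumed(tasks, L, F_MAX)),
                    ("FAST", min_alpha_fast(tasks, L, F_MAX))):
    print(name, round(alpha, 4), pick_frequency(alpha, F_MAX, levels) / 1e6, "MHz")
```

For this made-up task, the constant-WCEC test needs $\alpha \approx 0.27$ (300 MHz level), while the FAST test needs $\alpha \approx 0.11$ (200 MHz level): the extra slack comes purely from not scaling memory stall cycles with frequency.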
Improving DVS schemes • Use the parametric frequency model to improve DVS schemes • provides accurate WCET • improved energy savings • Architectural simulator: SimpleScalar+Wattch [Brooks ISCA’00] • 6-stage simple in-order pipeline processor model • I-cache and D-cache (8KB each) • Run 4-8 tasks simultaneously (scheduler runs as its own task) • More accurate than the E ~ V²f model? • Results are newer than those in the paper
Static RT-DVS vs. FAST Static RT-DVS • Base case: EDF • tasks run at 1GHz • idle: 100MHz • no sleep mode (small task periods) • Tasksets • 1: integer • 2: float • 3: mix • Static scheme beats base EDF: 12-60% energy savings • FAST-Static even better: 40-78% savings • at both high and lower utilization
Cycle-conserving RT-DVS vs. FAST Cycle-conserving RT-DVS • Dynamic scheduling: cycles left unused by early completion are reclaimed as slack • Cycle-conserving: 57-72% energy savings • FAST: 71-80% savings
Look-ahead RT-DVS vs. FAST Look-ahead RT-DVS • Most aggressive DVS: early completion + maximum deferral • Look-ahead: 68-80% savings, slightly higher than cycle-conserving • FAST: slightly better in most cases, at 72-83%
Look-ahead RT-DVS vs. FAST Look-ahead RT-DVS (E ~ V²f model) • Higher savings: up to 96%? • Ratio of look-ahead to FAST savings is similar • The Wattch detailed power model is probably more accurate
Conclusion • Energy savings in real-time systems can be significantly improved by considering the effects of frequency scaling on WCET • FAST + Static RT-DVS • as good as Look-Ahead RT-DVS • with less overhead • The parametric frequency model easily tracks the effects of frequency scaling on WCET • FAST works best with • many cache misses • highly inaccurate D-cache analysis (usually the case), which FAST can make up for • high memory latency • insufficient dynamic slack reclaiming (during DVS scheduling) • Integrated into real-time hardware support [VISA ISCA’03]
Old DVS Scheduling Simulator • Event-based simulator of the scheduler • Has to assume a miss rate for the tasks in dynamic schemes • Uses the E ~ V²f energy model • Gives a good idea of the savings, but is it accurate?
DVS schemes (Pillai & Shin) • Static RT-DVS – Uses static slack available in the schedule. • Cycle-conserving RT-DVS – Uses static slack + dynamic slack due to early completion. • Look-ahead RT-DVS – Uses static slack + dynamic slack due to early completion + latest possible scheduling (look-ahead).
Complexity • Original EDF test O(n) • Modified EDF test still O(n)