F A S T Frequency-Aware Static Timing Analysis

F A S T Frequency-Aware Static Timing Analysis • By Kiran Seth, Aravindh Anantaraman, Frank Mueller and Eric Rotenberg • Center for Embedded Systems Research, Departments of CS & ECE, North Carolina State University

Real-Time Systems • Tasks have a deadline and must terminate on time • Classification • Hard real-time: missed deadline → catastrophe • Soft real-time: missed deadline → low QoS • Multi-tasking real-time systems require scheduling algorithms • Scheduler ensures task arbitration online • Schedulability test ensures deadlines are met (static test) • Requires known Worst-Case Execution Time (WCET)

Static Timing Analysis • To schedule tasks in real-time systems, need • Worst-Case Execution Time (WCET) and • Worst-Case Execution Cycles (WCEC) • Experimental WCET → unsafe bounds • Due to input & hardware complexity • Use a static timing analysis toolset to obtain safe WCET bounds

Static Instruction Cache Analysis • Work explained in [Mueller RTS-J’00] • Interprocedural data-flow analysis • Predicts each cache reference as one of • always-hit • always-miss • first-hit • first-miss • Each instruction categorized • for each loop level • and function (loop w/ 1 iteration)
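The four categories translate into a worst-case miss count per loop. The following is a minimal sketch in Python, assuming the usual reading of the categories (first-miss: only the first reference misses; first-hit: only the first reference hits). The names `CacheCategory` and `worst_case_misses` are hypothetical and are not taken from the toolset described in [Mueller RTS-J’00].

```python
from enum import Enum


class CacheCategory(Enum):
    ALWAYS_HIT = "always-hit"
    ALWAYS_MISS = "always-miss"
    FIRST_HIT = "first-hit"
    FIRST_MISS = "first-miss"


def worst_case_misses(category: CacheCategory, iterations: int) -> int:
    """Upper bound on misses of one instruction over `iterations` loop iterations.

    Assumed semantics (not confirmed by the slides):
      first-miss: first reference misses, all later ones hit
      first-hit:  first reference hits, all later ones miss
    """
    if iterations < 0:
        raise ValueError("iterations must be non-negative")
    if iterations == 0 or category is CacheCategory.ALWAYS_HIT:
        return 0
    if category is CacheCategory.ALWAYS_MISS:
        return iterations
    if category is CacheCategory.FIRST_MISS:
        return 1
    if category is CacheCategory.FIRST_HIT:
        return iterations - 1
    raise ValueError(f"unknown category: {category!r}")
```

A function is handled like a loop with one iteration, so `worst_case_misses(category, 1)` covers that case.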

Static Data Cache Simulation • Accurate static timing analysis needs data cache analysis • Currently, data cache analysis tool not accurate enough • Too many restrictions, not general enough for real code • Improvements by [Vera RTSS’03] • Solutions • Treat all data accesses as hits: WCET highly underestimated • Treat all data accesses as misses: WCET highly overestimated • Assume a cache big enough to fit the entire data set • Assume first-time accesses are misses (cold misses only), otherwise hits • Accurate? Yes. But what if caches are smaller? • No significant impact on this study

Static Timing Analyzer • Path & tree-based approach [Healy IEEE TC’99] • Find nodes in the CFG and derive WCEC for each node • A node is a function or loop • WCET is calculated bottom-up • Standard timing analysis assumptions apply  • No recursion • All loop bounds must be known • No function pointers
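The bottom-up calculation can be pictured with a toy tree of nodes. This sketch is a deliberate simplification and not the path analysis of [Healy IEEE TC’99]: it assumes each child node executes once per iteration of its parent, and the `Node` fields and the cycle numbers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A function or loop in the timing tree (hypothetical structure)."""
    name: str
    path_cycles: int            # worst-case cycles of one iteration, excluding children
    max_iterations: int = 1     # known loop bound; a function counts as 1 iteration
    children: List["Node"] = field(default_factory=list)


def node_wcec(node: Node) -> int:
    """Bottom-up WCEC: children are resolved before their parent.

    Recursion here walks the tree, which is finite because the analyzed
    program itself is assumed to have no recursion.
    """
    per_iteration = node.path_cycles + sum(node_wcec(c) for c in node.children)
    return node.max_iterations * per_iteration


inner = Node("inner_loop", path_cycles=10, max_iterations=100)
outer = Node("outer_loop", path_cycles=5, max_iterations=10, children=[inner])
main = Node("main", path_cycles=20, children=[outer])
print(node_wcec(main))
```

The sketch also shows why the listed assumptions matter: without a known `max_iterations` for every loop, no node can be given a finite bound.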

Motivation of FAST • Dynamic Voltage Scaling (DVS) scheduling schemes • Change frequency/voltage for system • save power without missing deadlines • Several DVS scheduling schemes available • Good fit for real-time systems • Most real-time systems • have low utilization • are low-power embedded systems • Potential for considerable energy savings with DVS

Problem • Current DVS schemes: • Ignore effects of frequency scaling on WCEC • Assume WCEC is constant across frequencies • Overestimate WCET at lower frequencies • To demonstrate the problem • Obtain WCET of C-Lab benchmark with static timing analysis tool • For frequencies 100 MHz – 1 GHz • Assess observed WCEC & WCET vs. assumption made by DVS schemes

Actual vs. Assumed WCEC for FFT • WCEC changes with frequency modulation • WCEC increases with higher frequency • Constant memory latency: 100 ns

Actual vs. Assumed WCET for FFT • Difference in chosen frequency for DVS w/ WCET = 5 ms • assumed: ~550 MHz • actual: ~150 MHz

Parametric Frequency Model: Problem • DVS • Considers processor frequency scaling • Ignores effect of frequency scaling on memory accesses • With frequency scaling: • Cycles for processor operations remain constant • Except for memory operations → problem • DVS schemes overestimate the WCET at lower frequencies • Cannot fully utilize available slack • Power savings potential largely wasted

Parametric Frequency Model: Solution • Calculate WCEC • accounting for effects of memory accesses • using the new parametric frequency model • Model: WCEC(f) = i + mN = i + mLf • i: invariant # of worst-case cycles (for non-memory operations) • m: # of worst-case memory accesses • N: # of cycles per memory access • depends on memory latency L and frequency f: N = Lf
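The model is small enough to write down directly. The sketch below evaluates WCEC(f) = i + mLf, derives WCET(f) = WCEC(f)/f, and compares the lowest frequency that meets a deadline under the model with the frequency chosen when WCEC is assumed constant. The function names and the task parameters (i, m, deadline) are hypothetical and are not the FFT benchmark's values; the inequality f ≥ i / (D − mL) is obtained here by solving WCET(f) ≤ D for f.

```python
def wcec(f_hz, i_cycles, m_accesses, latency_s):
    """WCEC(f) = i + m * N, with N = L * f cycles per memory access."""
    return i_cycles + m_accesses * latency_s * f_hz


def wcet(f_hz, i_cycles, m_accesses, latency_s):
    """Worst-case execution time in seconds at frequency f."""
    return wcec(f_hz, i_cycles, m_accesses, latency_s) / f_hz


def min_frequency(deadline_s, i_cycles, m_accesses, latency_s):
    """Lowest f with WCET(f) <= D:  i/f + m*L <= D  =>  f >= i / (D - m*L)."""
    memory_time = m_accesses * latency_s  # frequency-independent part of WCET
    if deadline_s <= memory_time:
        raise ValueError("deadline cannot be met at any frequency")
    return i_cycles / (deadline_s - memory_time)


def min_frequency_constant_wcec(deadline_s, i_cycles, m_accesses, latency_s, f_max_hz):
    """Frequency picked when WCEC is assumed fixed at its value for f_max."""
    return wcec(f_max_hz, i_cycles, m_accesses, latency_s) / deadline_s


# Hypothetical task: 1M invariant cycles, 10k worst-case memory accesses,
# 100 ns memory latency, 5 ms deadline, 1 GHz maximum frequency.
I_CYCLES, M_ACCESSES, LATENCY_S = 1_000_000, 10_000, 100e-9
DEADLINE_S, F_MAX_HZ = 5e-3, 1e9

print(min_frequency(DEADLINE_S, I_CYCLES, M_ACCESSES, LATENCY_S) / 1e6, "MHz")
print(min_frequency_constant_wcec(DEADLINE_S, I_CYCLES, M_ACCESSES,
                                  LATENCY_S, F_MAX_HZ) / 1e6, "MHz")
```

For these made-up numbers the model allows roughly 250 MHz, while the constant-WCEC assumption asks for roughly 400 MHz, which is the same kind of gap the FFT plots illustrate.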

Using the Parametric Frequency Model • Instruction sequence: A: add R2, R1, R3; B: load R4, [M1]; C: add R2, R1, R4; D: add R2, R1, R5 • Sequence simulated through a simple pipeline to explain the parametric frequency model • Simple pipeline: • 6 stages • Data & instruction cache • N = 10

Example 0: Cache Hits • Recall: B is load instruction • WCEC = 9 + 0N • Each row represents a pipeline stage • Time (and cycle count) increases horizontally

Example 1: Effect of I-cache Miss • WCEC = 9 + 1N • Stall due to I-cache miss is shown • Model accurately captures memory latency, however long

Example 2: Effect of D-cache Miss • Recall: B is load instruction • WCEC = 9 + 1N • Stall due to D-cache miss is shown • Again, model captures memory latency, however long • Notice: during stall cycles, no useful work is done

Example 3: Effect of I- & D-cache Miss • WCEC = 9 + 2N • I-cache miss first, then D-cache miss • Overlap between useful cycles & stall cycles • Also during high-latency execution operations • E.g. floating-point, multiply, … → overlap w/ D-cache miss • Leads to overestimation: rare in practice, WCET still safe
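Plugging the pipeline's N = 10 into the four examples gives concrete cycle counts. The snippet below is only a worked-arithmetic sketch: the 9 invariant cycles and the miss counts come from the slides, while the names `INVARIANT_CYCLES`, `EXAMPLES` and `example_wcec` are hypothetical.

```python
INVARIANT_CYCLES = 9  # cycles for instructions A-D when every access hits

# (slide title, number of worst-case memory accesses m)
EXAMPLES = [
    ("Example 0: cache hits", 0),
    ("Example 1: I-cache miss", 1),
    ("Example 2: D-cache miss", 1),
    ("Example 3: I- and D-cache miss", 2),
]


def example_wcec(misses, n_cycles):
    """WCEC = i + m * N for the four-instruction sequence."""
    return INVARIANT_CYCLES + misses * n_cycles


for title, misses in EXAMPLES:
    print(f"{title}: WCEC = 9 + {misses}N = {example_wcec(misses, 10)} cycles")
```

Only the `m * N` term changes with N, so rerunning the loop with a larger `n_cycles` shows how a higher clock frequency inflates the cycle count of the miss cases while Example 0 stays at 9.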

Experimental Validation • Combine frequency model with our static timing analyzer • FAST tool • WCEC → FAST equations • Experiment to validate results from FAST tool • Run benchmarks through FAST tool • An equation representing WCEC for each benchmark is obtained • Run same benchmarks through traditional timing analysis tool • Vary frequencies: 100 MHz – 1 GHz

Frequency-Aware Static Timing Analysis (FAST) • FAST tool is “as accurate” as traditional static timing analysis • Slight overestimation in case of floating-point benchmarks

FAST in EDF Scheduling with DVS • DVS with EDF: $\sum_k C_k/P_k \le \alpha$, where $\alpha = f_c/f_m$ • FAST with EDF: $\sum_k (i_k + m_k L f_m)/(P_k f_m) \le \alpha$ • Schedulability test: $\dfrac{\sum_k i_k/P_k}{f_m \left(1 - L \sum_k m_k/P_k\right)} \le \alpha$ • Implemented frequency model for 3 EDF-DVS algorithms • Algorithms by [Pillai & Shin] • Look-ahead improved: • At completion, consider next deadline • Up to 34% additional energy savings (5-11% on avg.) at low utilization • But 0.5-8% less savings at high utilization
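The schedulability test can be evaluated in a single pass over the task set. The sketch below assumes the test follows from requiring total utilization at the scaled frequency α·f_m to stay at or below 1, with each task's WCET taken from the parametric model; that derivation, the function names, and the two-task example are assumptions for illustration and are not taken from the FAST implementation.

```python
def frequency_aware_alpha(tasks, latency_s, f_max_hz):
    """Smallest scaling factor alpha = f_c / f_m that passes the test.

    tasks: iterable of (i_k, m_k, P_k) = (invariant cycles, worst-case
    memory accesses, period in seconds). Returns None if memory stalls
    alone already use up all the time.
    """
    cycle_demand = 0.0   # sum of i_k / P_k, in cycles per second
    memory_share = 0.0   # L * sum of m_k / P_k, fraction of time in memory
    for i_k, m_k, p_k in tasks:
        cycle_demand += i_k / p_k
        memory_share += latency_s * m_k / p_k
    if memory_share >= 1.0:
        return None
    return cycle_demand / (f_max_hz * (1.0 - memory_share))


def constant_wcec_alpha(tasks, latency_s, f_max_hz):
    """Classic test: sum of C_k / P_k with C_k the WCET at f_max."""
    return sum((i_k + m_k * latency_s * f_max_hz) / (p_k * f_max_hz)
               for i_k, m_k, p_k in tasks)


# Hypothetical task set: 100 ns memory latency, 1 GHz maximum frequency.
TASKS = [(1_000_000, 10_000, 10e-3), (2_000_000, 20_000, 20e-3)]
print(frequency_aware_alpha(TASKS, 100e-9, 1e9))
print(constant_wcec_alpha(TASKS, 100e-9, 1e9))
```

For this made-up task set the frequency-aware bound comes out near 0.25 and the constant-WCEC bound near 0.4, so the frequency-aware test admits a lower operating frequency. Both functions visit each task once.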

Improving DVS schemes • Use parametric frequency model to improve DVS schemes • Provide accurate WCET • Improved energy savings • Architectural simulator: SimpleScalar+Wattch [Brooks ISCA’00] • 6-stage simple in-order pipeline processor model • I-cache and D-cache (8KB each) • Run 4-8 tasks simultaneously (scheduler runs as its own task) • More accurate than E ~ V²f model? • Results newer than paper

Static RT-DVS vs. FAST Static RT-DVS • Base case: EDF • Tasks at 1 GHz • Idle: 100 MHz • no sleep mode: small task periods • Task sets • 1: integer • 2: float • 3: mix • Static scheme better than base EDF: 12-60% energy savings • FAST-Static even better: 40-78% savings • high + lower utilization

Cycle-conserving RT-DVS vs. FAST cycle-conserving RT-DVS • Dynamic scheduling → early completion, reclaimed as slack • Cycle-conserving: 57-72% energy savings • FAST: 71-80% savings

Look-ahead RT-DVS vs. FAST Look-ahead RT-DVS • most aggressive DVS: early completion + max. deferral • Look-ahead: slightly higher savings than cycle-conserving @ 68-80% • FAST: slightly better in most cases @ 72-83%

Look-ahead RT-DVS vs. FAST Look-ahead RT-DVS • E ~ V²f model • Higher savings: up to 96%? • Ratio look-ahead / FAST similar • Wattch detailed power model • Probably more accurate

Conclusion • Energy savings in real-time systems can be significantly improved by considering the effects of frequency scaling on WCET • FAST + Static RT-DVS • as good as Look-ahead RT-DVS • less overhead • The parametric frequency model can easily track effects of frequency scaling on WCET • FAST tool works best with • Many cache misses • If D-cache analysis is highly inaccurate (usually true) • FAST can make up for it • High memory latency • Insufficient dynamic slack reclaiming (during DVS scheduling) • Integrated into real-time hardware support [VISA ISCA’03]

BACKUP SLIDES

The V²f model

Old DVS Scheduling Simulator • Event-based simulator of the scheduler • Has to assume a miss rate for the tasks in dynamic schemes • Uses E ~ V²f energy model • Gives a good idea about savings, but how accurate?

Static RT-DVS vs. FAST Static RT-DVS

Cycle-conserving RT-DVS vs. FAST cycle-conserving RT-DVS

Look-ahead RT-DVS vs. FAST Look-ahead RT-DVS

DVS schemes (Pillai & Shin) • Static RT-DVS – Uses static slack available in the schedule. • Cycle-conserving RT-DVS – Uses static slack + dynamic slack due to early completion. • Look-ahead RT-DVS – Uses static slack + dynamic slack due to early completion + latest possible scheduling (look-ahead).

Complexity • Original EDF test: O(n) • Modified EDF test: still O(n)
