Timing Analysis for Modern Architectures Sang Lyul Min Dept. of Computer Engineering Seoul National University
Overview • Intra-task analysis (WCET analysis) • Cache memory • Pipelined execution • Inter-task analysis • Cache memory • Experiments • Conclusions and Future Work
Intra-task Analysis • Why is WCET analysis important? • A safe and tight WCET (worst-case execution time) estimate is a prerequisite for correct and accurate schedulability analysis
Schedulability Analysis Examples • Utilization bound-based approach • Response time-based approach
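As a concrete instance of the utilization bound-based approach, here is a sketch of the classic Liu-Layland rate-monotonic test; the task parameters are invented for illustration:

```python
# Sketch of the utilization bound-based schedulability test
# (Liu & Layland): n periodic tasks are schedulable under
# rate-monotonic scheduling if total utilization does not exceed
# n * (2^(1/n) - 1). Task parameters below are illustrative.

def rm_utilization_test(tasks):
    """tasks: list of (wcet, period) pairs; returns True if the
    utilization-bound test guarantees schedulability."""
    n = len(tasks)
    utilization = sum(c / t for c, t in tasks)
    bound = n * (2 ** (1 / n) - 1)
    return utilization <= bound

# Three tasks with (WCET, period); utilization = 0.1+0.2+0.25 = 0.55
print(rm_utilization_test([(1, 10), (4, 20), (10, 40)]))  # True: 0.55 <= 0.779...
```

Note the test is sufficient but not necessary: a task set that fails the bound may still be schedulable, which is where the response time-based approach comes in.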
Good Old Days • No cache memory • No pipelined execution • Fixed instruction execution times (simple table look-up)
Timing Schema • S: S1; S2 → T(S) = T(S1) + T(S2) • S: if (exp) then S1 else S2 → T(S) = T(exp) + max(T(S1), T(S2)) • S: while (exp) S1 → T(S) = N × (T(exp) + T(S1)) + T(exp), for a loop bound of N iterations
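A minimal sketch of how the original schema combines bounds, with + for sequencing and max over branches; the tuple encoding of statements is our own invention for illustration:

```python
# Toy evaluator for the original timing schema: WCET bounds are
# plain numbers, combined with + (sequence), max (branches), and
# multiplication by the loop bound. Node shapes are illustrative.

def wcet(stmt):
    kind = stmt[0]
    if kind == "prim":                    # primitive block: fixed time
        return stmt[1]
    if kind == "seq":                     # S: S1; S2  ->  T(S1) + T(S2)
        return sum(wcet(s) for s in stmt[1:])
    if kind == "if":                      # S: if (exp) S1 else S2
        _, exp, s1, s2 = stmt             # -> T(exp) + max(T(S1), T(S2))
        return wcet(exp) + max(wcet(s1), wcet(s2))
    if kind == "while":                   # S: while (exp) S1, at most n times
        _, exp, body, n = stmt            # -> n*(T(exp)+T(S1)) + T(exp)
        return n * (wcet(exp) + wcet(body)) + wcet(exp)
    raise ValueError(kind)

prog = ("seq", ("prim", 5),
               ("if", ("prim", 1), ("prim", 10), ("prim", 3)),
               ("while", ("prim", 1), ("prim", 4), 2))
print(wcet(prog))  # 5 + (1+10) + (2*(1+4)+1) = 27
```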
Pipelined Execution [Figure: pipeline reservation table over cycles 1-21 for the sequence div.s $f2, $f4, $f6; lw $8, 4($sp); nop; mul.s $f8, $f10, $f12; addiu $9, $8, 4 across stages IF, RD, FRD, ALU, FALU, MD, FMUL, FDIV, MEM, FMEM, WB, FWB, FFWB]
The Problem [Figure: reservation tables (stages IF, RD, ALU, MD, DIV, MEM, WB) showing that the same instruction sequence takes a different number of cycles depending on the pipeline state left behind by the preceding path]
Our Approach • Define a PA (Path Abstraction) structure that encodes • elements whose timings are affected by neighboring paths • elements that affect other paths' timings • Define a concatenation operation (⊕) on PAs, generalizing the + operation • Define a pruning operation on PAs, generalizing the max operation
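The approach can be sketched in miniature for a single direct-mapped cache block; the PA fields, the miss penalty, and all block names below are invented for illustration:

```python
# Toy sketch of the Path Abstraction (PA) idea for one cache
# block: each PA keeps a WCET bound plus the first and last memory
# blocks referenced (the parts that affect / are affected by
# neighboring paths). Numbers and block names are made up.

MISS_PENALTY = 10

def concat(pa1, pa2):
    """Concatenate two PAs: if the second path's first reference
    finds the block the first path left behind, credit one hit."""
    b1, first1, last1 = pa1
    b2, first2, last2 = pa2
    saving = MISS_PENALTY if first2 == last1 else 0
    return (b1 + b2 - saving, first1, last2)

def prune(pas):
    """Keep only the worst PA for each (first, last) abstraction:
    a PA with the same boundary behavior and a smaller bound can
    never become the overall worst case."""
    best = {}
    for bound, first, last in pas:
        key = (first, last)
        if key not in best or best[key][0] < bound:
            best[key] = (bound, first, last)
    return sorted(best.values())

paths = [(48, "b6", "b1"), (40, "b6", "b1"), (30, "b6", "b8")]
print(prune(paths))   # the 40-cycle PA is pruned: same (b6, b1) abstraction
```

The real PA described on the following slides tracks these boundary references for every cache block and a head/tail reservation table for the pipeline, but the concatenate-then-prune structure is the same.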
Instruction Cache Modeling [Figure: evolution of the contents of cache blocks 0 and 1 along a path referencing memory blocks b2, b3, b4; each reference is classified as hit, miss, or hit/miss depending on the unknown initial cache contents]
PA Structure for Instruction Cache [Figure: PA recording, for each cache block, the first_reference (b2, b3 — whose hit/miss outcome depends on the preceding path) and the last_reference (b4, b3 — which determines the cache state seen by the following path); t_execution = 38 cycles]
Example: Concatenation and Pruning [Figure: concatenating PAs (first/last references over blocks b1, b4-b8) yields candidate paths of 48, 68, 78, 102, and 126 cycles; pruning discards candidates that can no longer become the worst case]
Pipelined Execution Modeling [Figure: pipeline reservation table over cycles 1-21 for the sequence div.s $f2, $f4, $f6; lw $8, 4($sp); nop; mul.s $f8, $f10, $f12; addiu $9, $8, 4 across stages IF, RD, FRD, ALU, FALU, MD, FMUL, FDIV, MEM, FMEM, WB, FWB, FFWB]
PA Structure for Pipelined Execution [Figure: full reservation table for the example sequence over cycles 1-21; t_max = 21 cycles]
PA Structure for Pipelined Execution [Figure: the PA keeps only the head and tail portions of the reservation table — the cycles where the path can interact with its neighbors; t_execution = 38 cycles]
Example: Concatenation and Pruning [Figures: concatenating the S1 PA (t_max = 17 cycles) with two candidate S2 PAs (t_max = 22 and 14 cycles) by overlapping tail and head yields combined reservation tables with t_max = 37 and 26 cycles; since 26 cycles < 37 cycles − (5-cycle head + 5-cycle tail), the 26-cycle PA can never become the worst case and is pruned]
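Concatenation of pipelined PAs can be pictured as sliding the second path's head under the first path's tail until no pipeline stage is claimed twice in the same cycle. A toy sketch, with made-up reservation entries:

```python
# Toy sketch of pipelined-path concatenation: each path carries the
# cycles at which it occupies each stage near its boundaries (the
# "head" and "tail" of its PA). Concatenation slides the second path
# left as far as no stage is claimed twice in one cycle. Stage names
# and cycle numbers below are illustrative, not from the slides.

def concat_pipeline(tail, head, t1, t2):
    """tail/head: {stage: set of occupied cycles} for path 1's end
    (cycles counted from path 1's start) and path 2's beginning
    (cycles counted from path 2's start). t1, t2: standalone
    durations. Returns the combined duration after maximal overlap."""
    for overlap in range(min(t1, t2), -1, -1):  # try largest overlap first
        shift = t1 - overlap                    # path 2 starts at cycle shift
        conflict = any(
            c1 == c2 + shift
            for stage in tail
            for c1 in tail[stage]
            for c2 in head.get(stage, ())
        )
        if not conflict:
            return shift + t2
    return t1 + t2                              # no overlap possible

tail = {"MEM": {7}, "WB": {8}}      # path 1's last stage uses (8 cycles total)
head = {"IF": {1}, "MEM": {4}}      # path 2's early stage uses (5 cycles total)
print(concat_pipeline(tail, head, 8, 5))  # 9: the paths overlap by 4 cycles
```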
Combined PA Structure [Figure: the combined PA carries both the instruction-cache part (first_reference / last_reference per cache block) and the pipeline part (head and tail of the reservation table); t_execution = 38 cycles]
Extended Timing Schema • S: S1; S2 → W(S) = W(S1) ⊕ W(S2) • S: if (exp) then S1 else S2 → W(S) = (W(exp) ⊕ W(S1)) ∪ (W(exp) ⊕ W(S2)) • S: while (exp) S1 → W(S) obtained by repeated concatenation of W(exp) and W(S1) up to the loop bound • where W(S) is the set of path abstractions (PAs) of S, and pruning is applied after each operation
Comparison with Original Timing Schema • timing element: WCET bound → Path Abstraction • path concatenation: + → ⊕ • path elimination: max → pruning
Inter-task Analysis [Figure: a preemptive schedule of tasks t1, t2, t3 over time 0-20, showing jobs t1,1-t1,5, t2,1-t2,2, and t3,1; the lower-priority jobs are repeatedly preempted by higher-priority ones]
Two-Step Approach • 1. Local (per-task) analysis: estimate the number of useful cache blocks at each execution point • 2. Global analysis: calculate the cache-related preemption delay using a linear programming technique
Local Analysis (1) • A cache block is useful if it contains a memory block that may be re-referenced before being replaced. • The number of useful cache blocks at an execution point gives an upper bound on the cache-related preemption cost at that point. [Figure: cache snapshot at point P, with the useful cache blocks highlighted among memory blocks m0-m7]
Local Analysis (2) • Definitions • RMB_p(c): set of memory blocks that may reside in cache block c at point p • LMB_p(c): set of memory blocks that may be the first reference to cache block c after point p • A useful cache block at point p is a cache block whose RMBs and LMBs have at least one common memory block.
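A minimal sketch of this definition, with made-up RMB/LMB sets:

```python
# Sketch of the local analysis: a cache block is useful at point p
# when some memory block both may reside in it (RMB) and may be its
# first reference after p (LMB). The sets below are illustrative.

def useful_blocks(rmb, lmb):
    """rmb, lmb: dicts mapping cache block -> set of memory blocks.
    Returns the cache blocks whose RMB and LMB sets intersect; their
    count bounds the cache-related preemption cost at this point."""
    return {c for c in rmb if rmb[c] & lmb.get(c, set())}

rmb = {0: {"m0", "m4"}, 1: {"m5"}, 2: {"m2", "m6"}}
lmb = {0: {"m4"}, 1: {"m1"}, 2: {"m6"}}
print(sorted(useful_blocks(rmb, lmb)))   # [0, 2]
```

Here block 1 is not useful: m5 may be cached at p, but the next reference to that cache block (m1) misses regardless of whether a preemption evicted m5.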
Local Analysis (3) [Figure: example execution point with the useful cache blocks marked]
Local Analysis of Each Task → Preemption Cost Table • t1 → largest preemption cost f1 • t2 → f2 • t3 → f3 • ... • tn → fn
Global Analysis (1) • Augmented response time equation: Ri = Ci + Σ j∈hp(i) ⌈Ri / Tj⌉ Cj + PCi(Ri), where PCi(Ri) is the cache-related preemption delay suffered by task i • Solved by iterative fixed-point computation
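The iteration can be sketched as follows; folding a per-preemption cost fj into the interference term is a simplification of the full preemption-delay term, and all task parameters are invented:

```python
# Sketch of solving an augmented response-time equation by fixed-
# point iteration: R_i = C_i + sum_j ceil(R_i/T_j) * (C_j + f_j),
# charging cost f_j per higher-priority preemption. This folds the
# preemption delay into the interference sum as a simplification;
# task parameters below are illustrative.
import math

def response_time(c_i, hp, deadline=10**6):
    """hp: list of (C_j, T_j, f_j) for higher-priority tasks, where
    f_j bounds the cache-related cost of one preemption by task j."""
    r = c_i
    while True:
        r_next = c_i + sum(
            math.ceil(r / t) * (c + f) for c, t, f in hp
        )
        if r_next == r:
            return r                    # fixed point reached
        if r_next > deadline:
            return None                 # no convergence within the deadline
        r = r_next

print(response_time(5, [(2, 10, 1)]))   # 8: one preemption, 2+1 extra cycles
```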
Global Analysis (2) • Linear programming formulation: maximize the total cache-related preemption delay, subject to constraints on the number of feasible preemptions
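The objective and constraints of the LP are not reproduced on this slide; purely as an illustration of the idea, here is a brute-force stand-in that maximizes total preemption cost Σ fj·gj over bounded, jointly constrained preemption counts gj (all numbers invented):

```python
# Brute-force stand-in for the global analysis LP (illustration
# only): choose preemption counts g_j, each within its own bound and
# jointly limited, to maximize the total cache-related preemption
# delay sum_j f_j * g_j. A real solver would use an LP package.
from itertools import product

def max_preemption_delay(costs, bounds, total_limit):
    """costs: per-task cost f_j of one preemption; bounds: max
    preemption count per task; total_limit: joint feasibility bound
    on the total number of preemptions."""
    best = 0
    for counts in product(*(range(b + 1) for b in bounds)):
        if sum(counts) <= total_limit:      # joint constraint
            best = max(best, sum(f * g for f, g in zip(costs, counts)))
    return best

print(max_preemption_delay([3, 5], [2, 2], total_limit=3))  # 1*3 + 2*5 = 13
```

The joint limit is what makes the problem non-trivial: without it, the maximizer would simply take every count at its individual bound.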
Limitations • Not all useful cache blocks are replaced. • Some preemptions are not feasible. [Figure: main memory to cache mapping, and a schedule of t1, t2, t3 over time 0-120 illustrating infeasible preemptions within response time R3]
Enhanced Approach • Uses two new features 1. Scenario-sensitive preemption cost 2. Additional constraints from task phasing
Experiments • Task set with 4 tasks: FFT, LUD, LMS, FIR • Three different cache mappings of the tasks (cache mapping 1, 2, 3)
Conclusions • Intra-task Analysis • Extended Timing Schema • PA (Path Abstraction) • Concatenation (⊕) and pruning operations • Inter-task Analysis • Data Flow Analysis • Response Time Equation • Linear Programming Technique
Future Work • Data Cache Analysis • WCET Analysis for Advanced Architectures (Superscalar and VLIW) • I/O (DMA) Timing Analysis
Related Papers • S.-S. Lim et al., "An Accurate Worst Case Timing Analysis for RISC Processors," IEEE Transactions on Software Engineering, 21(7):593-604, July 1995. • C.-G. Lee et al., "Analysis of Cache-related Preemption Delay in Fixed-priority Preemptive Scheduling," IEEE Transactions on Computers, 47(6):700-713, June 1998. • http://archi.snu.ac.kr/symin/