Jae W. Lee and Krste Asanovic { leejw, krste } @mit

METERG: Measurement-Based End-to-End Performance Estimation Techniquein QoS-Capable Multiprocessors Jae W. Lee and Krste Asanovic {leejw, krste}@mit.edu MIT Computer Science and Artificial Intelligence Lab April 5, 2006

Our focus: Running soft real-time apps (e.g. media codecs, web services) on general-purpose MPs Assuming multiprogrammed workloads: real-time + best-effort apps Problem: Contention for shared resources causes wider variationof a task’s execution time. Multiprocessors Are Ubiquitous Embedded CMP (e.g. ARM MPCore) Desktops P1 Pn Handheld/ Embedded Shared bus I/O MEM Servers

For fixed input to fixed task, execution time varies depending on the activities of other processors because of shared memory system. Execution Time Variation in MPs Execution time Processor 1 Processor n Severe Contention No Contention Shared Bus Shared Memory Shared Resource Instance # of TaskX i+2 i+4 i+1 i+3 i

Execution time increases by 85 % when number of background processes increases from 0 to 7. How can we limit the impact of resource contention?  QoS Support Impact of Resource Contention Measure Execution Time Variable Number of Background Processes Normalized Execution Time of memread on P1(a fixed input) 1.85 Processor 2 Processor 1 Processor 8 1.05 1.02 1.00 Shared Bus Shared Memory Shared Resource Number of Background Processes

Memory QoS guarantees per-processor memory bandwidth and latency. It guarantees minimum performance and distributes unclaimed bandwidth to sharers in order to maximize throughput. Challenge: How can we find the minimal QoS parametersthat meet a RT task’s given performance goal? Minimal reservation improves total system throughput. Memory QoS Reduces MP Execution Time Variation Range of Execution Time Without QoS With QoS Instance # of TaskX i+1 i+2 i i+3 i+4

User’s performance goal (user metric) should be translated into QoS parameters (system metric). User metric: Execution time, Transactions per Second (TPS), Frames per Second (FPS) System metric: Memory bandwidth, Latency Example: MP server virtualization Question: What guarantees can you make to users? Users prefer user metric (e.g. TPS) to system metric (e.g. Memory bandwidth). Translating Performance Goal in User Metric into QoS Parameters Partition 1 Partition 2 Partition n • Minimal QoS parameters are important to maximize system throughput. Multiprocessor System Substrate

Static timing analyses: Tools for WCET analyses Consist of Hardware modeling + program analysis [+] Can find minimal resource reservation for all possible inputs [-] Analysis cost is prohibitive. Becoming harder for increasing hardware/software complexity in MPs Possibly overkill for soft RT apps: Our primary concern is performance degradation from resource contention (not from input data). We can toleratea small number of deadline violations to maximize overall system throughput. Finding Minimal Resource Reservation (1): Analysis

Measurement-based Approach Measures task’s execution time for a certain input set Motto: “The best model for a system is the system itself.” [Colin-RTSS ’03] Goal: To find execution time under worst-case contention (=TMAX(x, i)) for given QoS parameter (x) and instruction sequence (i) Finding Minimal Resource Reservation (2): Measurement Probability distribution (x: QoS parameter) Range of variation TMAX(x, i) Execution Time for fixed inst seq i Degree of Contention Low ← → High

Measurement-based Approach (Cont) [+] Easy [-] Not absolutely safe Can be practically useful for soft real-time apps  Can be useful if we are given “near-worst” input set [-] Measured time varies by degree of contention in runtime  We will address this problem in the rest of this talk! Finding Minimal Resource Reservation (2): Measurement Probability distribution (x: QoS parameter) Range of variation TMAX(x, i) Execution Time for fixed inst seq i Degree of Contention Low ← → High

METERG QoS System: Improving measurement-based approach Estimates TMAX(x, i) easily by measurement Provides hardware support for safe and tightestimation for TMAX(x, i) Introduces two QoS modes Deployment mode (operation mode) Enforcement mode (measurement mode for estimation) Our Approach: METERG Probability distribution 1. Always greater than TMAX(x, i) (Safe) Range of variation (Deployment) Range of variation (Enforcement) 2. Very narrow and close to TMAX(x, i) (Tight) Execution Time for fixed inst seq i TMAX(x, i)

Now each QoS block supports two operation modes: 1) Deployment mode: Conventional QoS mode (Operation) QoS parameters are treated as a lower bound of received resources in runtime. 2) Enforcement mode: New QoS Mode (Estimation) QoS parameters are treated as an upper bound of received resources in runtime. 50 50 Modifying Shared Blocks to Support Two QoS Modes QoS Parameter 1) Deployment Mode 2) Enforcement Mode

Single-threaded apps; no shared objects. No contentions within a node. (e.g. multithreaded processors) No preemption of a scheduled task. Negligible impact of initial states on execution time. Assumptions

Baseline MP System • Simple fixed frame-based scheduling for memory access • Example (x1=0.2, xn=0.3) • xi determines • Minimum bandwidth • Worst-case latency Processor 1 Processor n x1 xn Frame (size = 10 Timeslots) Shared Bus 1 n n 1 n Shared Memory QoS-capable Shared block Not taken: Given to a requester by round-robin fashion Taken By P1 xi: Fraction of time slots reserved for Processor i

Example (x1=0.2) 0.2 0.2 Runtime Usage Example in METERG System Reservation: Processor 1 Processor n Frame (size = 10 Time Slots) 1 1 x1 xn Shared Bus Runtime usage (Deployment): Shared Memory QoS-capable Shared block Not reserved but used Reserved & used Runtime usage (Enforcement): X X X X X X X X Reserved but unused Not allowed to use Reserved & used

How to find minimal resource reservation with METERG: In measurement phase: Step 1: Measure performance in enforcement mode for a QoS parameter. Step 2: Iterate Step 1 to find a minimal QoS parameter (yet meeting the performance goal). Step 3: Store the minimal QoS parameter. In operation phase: Step 4: Execute the program in deployment mode with the stored QoS parameter for guaranteed performance. Obtain Minimal Resource Reservation Using METERG

Shared bus example again (x1=0.2) Bandwidth inequality is guaranteed but not latency. BandwidthDEP (xi) ≥ BandwidthENF (xi) (O) MAX [LatencyDEP (xi)] ≤ MIN [LatencyENF (xi)] (X)  May cause end-to-end performance inversion. Bandwidth Guarantees Are Not Enough: An Example Deployment Mode Enforcement Mode Memory Request Memory Request 4 Timeslots 2 Timeslots X X X X X X X X Used by other processors Not allowed to use

Relaxed enforcement (R-ENF) mode: Bandwidth-only guarantees BandwidthDEP (xi) ≥ BandwidthR-ENF (xi) There is a (small) chance for observed execution time in enforcement mode to be smaller than TMAX(x, i). (Not Safe) Strict enforcement (S-ENF) mode: Bandwidth and latency guarantees BandwidthDEP (xi) ≥ BandwidthS-ENF (xi) && MAX[LatencyDEP (xi)] ≤ MIN[LatencyS-ENF (xi)] Observed execution time is always greater than TMAX(x, i) in enforcement mode as long as there is no timing anomaly in processor.(Safe) Estimation of TMAX(x, i) is looser. Two Enforcement Modes: Safety-Tightness Tradeoff

Delay queue In Deployment mode: Bypassed In Enforcement mode: Deferring delivery to processor for “lucky” messages Memory System with Strict Enforcement Mode: An Implementation METERG QoS Memory block Processor 1 OpMode1 Mem_Req Mem_Reply TMAX(DEP)(x1): Worst-Case latency in Deployment Mode for given param x1 Tactual1: Actual time taken to service request 1 Delay Queue (bypassed in DEP mode) Processor n Data Timer V 1 data1 TMAX(DEP)(x1)-Tactual1 1 data2 TMAX(DEP)(x1)-Tactual2 0

Simulation Setup 8-way SMP running Linux In-order core running with 5x faster clock than the system bus Detailed memory contention model with a simple fixed-frame TDM bus 32 KB unified blocking L1 cache / No L2 cache Synthetic benchmark: Memread Infinite loop accessing a large chunk of memory sequentially Performance bottlenecked by memory system performance Evaluation: Simulation Setup Processor 1 Processor 8 Request Reply Request Reply OpMode8 OpMode1 E D E D Network Interface (NI) Shared Bus Shared Memory METERG QoS block

Evaluation: Varying Number of Processes Measure P1 (x1 = 0.25) Variable # of BE Processes Normalized Execution Time 1.85 (BE) Processor 2 Processor 1 Processor 8 1.45 (ENF) 1.10 (DEP) Shared Bus Shared Memory 0 3 7 1 Shared Resource # of Background Processes (x1 =0.25) • Execution time degradation as number of background processes increases from 0 to 7 without QoS (Best-Effort): 85 % with QoS (Deployment): 10 % • Estimated execution time upper bound for given QoS parameter (x1=0.25): 45 %

Evaluation: Varying QoS Parameters Measure P1 (Variable x1) 7 BE Background Processes Normalized Execution Time Processor 2 Processor 1 Processor 8 2.35 (ENF) 1.51 (DEP) Shared Bus 1.09 1.01 Shared Memory Shared Resource QoS Parameter (x1) • Performance estimation becomes tighter as QoS parameter (x1) increases. • Because the worst-case latency in accessing memory decreases accordingly

Performance impact on a QoS process by other QoS processes is negligible (<2%) as long as system is not oversubscribed. Evaluation: Varying Number of QoS Processes Estimated lower bound for xi=0.25 Normalized IPC* 0.91 0.91 0.90 0.91 0.90 0.89 0.81 (0.69) (0.54) 0.45 0.37 8-BE Case 0.27 0.14 1 QoS + 7 BE 2 QoS + 6 BE 3 QoS + 5 BE 4 QoS + 4 BE QoS Parameters: (0.25, 0, 0, 0, 0, 0, 0, 0) (0.25, 0.25, 0, 0, 0, 0, 0, 0) (0.25, 0.25, 0.25, 0, 0, 0, 0, 0) (0.25, 0.25, 0.25, 0.20, 0, 0, 0, 0) (* Higher Number = Better Performance)

In general-purpose MPs, shared resource contention causes large variation of execution time of a task. (~2x increase observed in 8-processor case) METERG QoS System provides an easy way (by measurement) to estimate execution time under worst-case resource contention for a given QoS parameter. Introducing a new QoS mode (Enforcement Mode) for performance estimation Using METERG QoS System, minimal QoS parameter can be easily found for given instruction sequence and performance goal. Conclusion

Jae W. Lee and Krste Asanovic { leejw, krste } @mit

Jae W. Lee and Krste Asanovic { leejw, krste } @mit

Presentation Transcript

Krste Asanovic Speaker: David Biancolin Electrical Engineering and Computer Sciences

Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley

Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley

GrabCut Interactive Image (and Stereo) Segmentation Joon Jae Lee Keimyung University

Emmett Witchel Krste Asanović MIT Lab for Computer Science

Jae-Gil Lee Department of Knowledge Service Engineering KAIST

Jae Kim

Seok-jae , Lee VLSI Signal Processing Lab. Korea University

Cargo Preference and Restrictions Applying to Specific Trades Name : Jae-sun Lee

Krste Asanovic krste@eecs.berkeley inst.eecs.berkeley /~cs252/sp14

Krste Asanovic krste@eecs.berkeley inst.eecs.berkeley /~cs252/sp14

Quantum Spin Liquid Patrick Lee MIT

Jae-Weon Lee* (Jungwon Univ.) Jungjai Lee (Daejin Univ.) Hyungchan Kim (Chungju Univ.)

Jae W. Lee (MIT) Man Cheuk Ng (MIT) Krste Asanovic (UC Berkeley)

Jongmin Han † , Hyun Yim † , Soekho Son † , Jae-Shin Lee ‡

David W. Lee and J. Brian Lowry

Jae Young Lee, Christine E. Schmidt

Emmett Witchel Josh Cates Krste Asanovic MIT Lab for Computer Science

Jared Casper, Ronny Krashinsky, Christopher Batten, Krste Asanović

Mark Hampton and Krste Asanović April 9, 2008

by W K Lee

Jae Lee English A August 25, 2008