Instruction Cache Memory Issues in Real-Time Systems
Licentiate dissertation
Filip Sebek
October 11th, 2002
Opponent: Axel Jantsch (KTH)
Examiner: Lars Wanhammar (LiTH)
Outline of this dissertation
• Seminar
  • About this thesis (Lennart Lindh)
  • Thesis presentation (Filip Sebek)
  • Comments and questions (Axel Jantsch and Filip Sebek)
  • Questions from the audience
• Consideration (Lars Wanhammar, Axel Jantsch, and Lennart Lindh)
• Festivity (?) at the department
Organisation
[Organisation chart: RT Systems Design Lab, Comp. Architecture Lab, Computer Science Lab; Graduate Education, Lic school, Int'l MSc school; Undergraduate Education]
• Stefan Sjöberg: Design ASIC/FPGA with Top Down Design Flow and VHDL (RealFast ABB)
• Leif Enblom (ABB APR): Multiprocessor system for (ABB KK)
• Joakim Persson: Redundant System (ProTang, KK)
• Mohammed El Shobaki: System Monitoring/Debugging of S/Multiprocessor Systems
• Stefan Stjernen: IP Design (RealFast, Industrial Research School: Electronic Design)
• Johan Stärner: Multiprocessor Architecture (KK)
• Tommy Klevin: Bus analyzer (RealFast)
• Filip Sebek: Instruction Cache Memory Issues in RTS
• Raimo Haukilahti KTH/MDH: Low-Power Techniques for HW-RTOS (KTH)
The title and the questions
• Title: Instruction Cache Memory Issues in Real-Time Systems
• Initial questions
  • How do I measure the cache-related preemption delay in a real-time system?
  • Is a cache memory really a problem in real-time systems?
Automatic control – Real-time system
• A real-time system must produce correct results in time
• Examples
  • An air bag deploying
  • An armored tank firing while moving
  • A supertanker turning
  • A toaster
• Get input: sample…
• Compute: execute instructions
• Actuate: control the process…
• = Action!
Real-time system implementation
• Often implemented as many "small" cyclic programs (tasks or processes) that communicate with each other
[Figure: communicating tasks labelled Alarm task, Sample task, Computation, Actuate]
What Real-Time research is about:
• Predicting execution time (of a task)
  • Difficult: many parameters
    • Input data sensitive
    • Program design
    • Hardware dependent
    • Compiler dependent
  • Several methods
• Scheduling tasks
  • static or dynamic
  • may allow pre-emption
What is a cache memory?
[Figure: CPU, CACHE, MEM and I/O, with a fast path (~95%) and a slow path (~5%)]
• Cache memories are faster than primary memory and keep pace with CPU speed
• They reduce bus-traffic congestion
• They save energy
• Instruction fetch time becomes variable with caches: hit-time and miss-penalty
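The last bullet can be made concrete with the usual first-order model: every fetch pays the hit time, and a miss pays the miss penalty on top. The sketch below assumes that linear model; the function name and the example numbers are illustrative and do not come from the thesis.

```c
/* Average instruction fetch time under a simple linear model:
 * every fetch costs the hit time, and the fraction `miss_ratio`
 * of fetches additionally costs the miss penalty.
 * All three parameters are illustrative assumptions. */
double avg_fetch_time_ns(double hit_time_ns, double miss_penalty_ns,
                         double miss_ratio)
{
    return hit_time_ns + miss_ratio * miss_penalty_ns;
}
```

With a hypothetical hit time of 2 ns and a miss penalty of 8 ns, a 25% miss-ratio gives 4 ns per fetch, while a 100% miss-ratio gives 10 ns, which is why the fetch time is no longer a constant.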
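How does a cache memory work?
• Cache hit and cache miss
• Locality
  • Temporal locality: memory references close in time (loops and functions)
  • Spatial locality: memory references close in space (cache block and wide data bus)

    #define SIZE 100  /* assumed value; the slide does not define SIZE */

    int funk(int term)
    {
        int vector[SIZE] = {0};  /* initialized so that += reads defined values */
        int i, sum = 0;

        /* The loop body repeats (temporal locality) and the array is
           walked sequentially (spatial locality). */
        for (i = 0; i < SIZE; i++) {
            vector[i] += term;
            sum += vector[i];
        }
        return sum;
    }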
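Hit, miss and locality can be illustrated with a toy cache model. The sketch below assumes a direct-mapped instruction cache with 64 lines of 16 bytes. These parameters and all names are chosen for illustration only and do not describe the processor used in the thesis.

```c
#include <stdint.h>
#include <string.h>

#define NUM_LINES   64u  /* assumed number of cache lines */
#define BLOCK_BYTES 16u  /* assumed block size: 4 words of 4 bytes */

typedef struct {
    uint32_t tag[NUM_LINES];
    uint8_t  valid[NUM_LINES];
} icache_t;

/* Empties the cache (all lines invalid). */
void icache_reset(icache_t *c)
{
    memset(c, 0, sizeof *c);
}

/* Looks up one instruction address.
 * Returns 1 on a hit; on a miss the block is loaded and 0 is returned. */
int icache_fetch(icache_t *c, uint32_t addr)
{
    uint32_t block = addr / BLOCK_BYTES;
    uint32_t line  = block % NUM_LINES;
    uint32_t tag   = block / NUM_LINES;

    if (c->valid[line] && c->tag[line] == tag)
        return 1;
    c->valid[line] = 1;
    c->tag[line]   = tag;
    return 0;
}

/* Fetches n_instr consecutive 4-byte instructions starting at `start`
 * and returns the number of misses. */
unsigned run_loop_body(icache_t *c, uint32_t start, unsigned n_instr)
{
    unsigned misses = 0;
    for (unsigned i = 0; i < n_instr; i++)
        misses += !icache_fetch(c, start + 4u * i);
    return misses;
}
```

Running an 8-instruction loop body twice on an empty cache shows both kinds of locality: the first pass misses only once per block (spatial locality), and the second pass does not miss at all (temporal locality).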
Cache memories and real-time
• Cache memories make execution time variable
  • Sample, execute, actuate: action!
  • Sample, execute, actuate: action!
  • Sample, execute, actuate: action! Missed deadline?
• Analysis is non-trivial:
  • cache contents depend on the execution path
  • the execution path depends on the cache contents
Predicting cache behavior
• Avoidance and simplifications
  • Disable the cache!
  • Specially designed processors and caches
• Static analysis (Paper C)
  + no probe effects
  + safe overestimation
  - difficult for modern hardware
• Simulation
  + simple
  - the simulator must model the hardware correctly
• Real measurement (Papers A, B, D)
  + can measure on complex systems
  - probe effect
Measurement and probe effect
• Most measurements affect the measured object when the probe is added to or removed from the measured environment.
• Examples:
  • A warm thermometer measures a glass of cold water
  • A computer monitoring system measures CPU load
• Reduce the intrusion (probe effect) to a minimum!
The Built-in Performance Monitor
• Exploit the performance monitor that is built into the CPU
• 4 registers on the MPC750
• Counts events
  • L1 instruction fetch misses
  • Branch misses
  • Processor clocks
  • Completed instructions
  • Completed loads/stores
  • …
• NON-INTRUSIVE!
My questions revised
• Initial questions:
  • How do I measure the cache-related preemption delay in a real-time system?
  • Is a cache memory really a problem in real-time systems?
• Modified questions:
  • Is there a simple(r) way to predict or measure cache misses in a real-time system?
  • Can an instruction cache cause a missed deadline when it is enabled?
  • How large is the cache-related pre-emption delay in absolute and relative terms?
Outline of this presentation • Introduction • The cache memory and real-time • Measurement and probe effect • CPX2000 – “SARA system” • My own questions • Synthetic code generation • Analysis • Determine worst-case cache miss-ratio of a program • Measure instruction execution time w/wo cache • Measure cache related preemption delay • Conclusion and future work
Current state in presentation: • We have 3 questions! • We have an experimental system! • We can measure on it with a small intrusion! • Q: Measure on what program?
Code generation: size • Workbench • Standard benchmark? (Rhealstone, EEMBC etc.) • Measure worst-case situations • Synthetic code – size specific • One big loop • addis r3,r3,0x0000 = 4 bytes • Not representative code – no problem! • Swap out cache contents – find maximum cost • Code size measured in “cache size”
Code generation: miss-ratio
• One method (out of several): "play with spatial locality"
• Method: jump instructions break spatial locality
• Requirements: code size 2 × cache size
• Result: miss-ratios from 1/block size up to 100%
Patterns for a block of 4 words (m = miss, h = hit, n.u. = not used):
  25%:   L1: nop(m)  nop(h)   nop(h)   nop(h)     L2: nop(m)  nop(h)   nop(h)   nop(h)
  33%:   L1: nop(m)  nop(h)   J L2(h)  n.u.       L2: nop(m)  nop(h)   J L3(h)  n.u.
  50%:   L1: nop(m)  J L2(h)  n.u.     n.u.       L2: nop(m)  J L3(h)  n.u.     n.u.
  100%:  L1: J L2(m) n.u.     n.u.     n.u.       L2: J L3(m) n.u.     n.u.     n.u.
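The generation method can be sketched as a small emitter that writes one cache block at a time. This is a minimal sketch, not the generator used in the thesis: the function names are hypothetical, the output uses the PowerPC mnemonics `nop` and `b` as an assumption about the target assembler, and every block is assumed to start on a cache-block boundary.

```c
#include <stdio.h>

/* Miss-ratio of a block in which `used` instructions execute before
 * control leaves the block: the first fetch misses, the rest hit. */
double block_miss_ratio(unsigned used)
{
    return 1.0 / (double)used;
}

/* Writes one synthetic block of `block_words` instructions to `out`.
 * `used` instructions are executed; if the block is not fully used,
 * the last executed instruction is a branch to the next block and the
 * remaining words are padding that is never executed.
 * Returns 0 on success, -1 on invalid arguments. */
int emit_block(FILE *out, unsigned label, unsigned used, unsigned block_words)
{
    if (used == 0 || used > block_words)
        return -1;

    /* A fully used block falls through into the next one, so it needs
     * no branch; otherwise the last used slot holds the branch. */
    unsigned nops = (used == block_words) ? used : used - 1;

    fprintf(out, "L%u:\n", label);
    for (unsigned i = 0; i < nops; i++)
        fprintf(out, "    nop\n");
    if (used < block_words) {
        fprintf(out, "    b L%u\n", label + 1);
        for (unsigned i = used; i < block_words; i++)
            fprintf(out, "    nop    # not used (padding)\n");
    }
    return 0;
}
```

Calling `emit_block(stdout, 1, 2, 4)` would print the 50% pattern for label `L1`. A complete generator would repeat this until the code covers twice the cache size and make the last block branch back to the first; that loop is left out here.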
1. Code interpretation: miss-ratio
(the reverse of the process that generates code with a fixed miss-ratio)

Instruction   Block size = 4 words   Block size = 8 words
i1            miss  1/4              miss  1/6
i2            hit   1/4              hit   1/6
i3            hit   1/4              hit   1/6
i4            hit   1/4              hit   1/6
i5            miss  1/2              hit   1/6
beq 10        hit   1/2              hit   1/6
i7            -     -                -     -
i8            -     -                -     -
i9            -     -                -     -
i10           miss  1/3              miss  1/4
i11           hit   1/3              hit   1/4
i12           hit   1/3              hit   1/4
jmp 18        miss  1/1              hit   1/4
i14           -     -                -     -
i15           -     -                -     -
i16           -     -                -     -
Miss-ratio    4/10 = 40%             2/10 = 20%
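The two miss-ratios on this slide can be reproduced by replaying the executed instructions against a given block size. The sketch below assumes an initially empty cache that is large enough for nothing to be evicted; the function name and the zero-based word addresses (i1 is word 0, so `beq` is word 5 and `jmp` is word 12) are conventions chosen for the example.

```c
#include <stddef.h>

#define MAX_BLOCKS 64  /* assumed upper bound on blocks in the example */

/* Executed path from the slide: i1..i5, beq (taken), i10..i12, jmp. */
static const unsigned example_trace[] = {0, 1, 2, 3, 4, 5, 9, 10, 11, 12};

/* Counts the misses for a trace of executed instruction word addresses.
 * A fetch misses the first time its block is touched and hits afterwards.
 * Addresses beyond MAX_BLOCKS blocks are outside the model and ignored. */
unsigned count_misses(const unsigned *trace, size_t n, unsigned block_words)
{
    unsigned char loaded[MAX_BLOCKS] = {0};
    unsigned misses = 0;

    for (size_t i = 0; i < n; i++) {
        unsigned block = trace[i] / block_words;
        if (block >= MAX_BLOCKS)
            continue;
        if (!loaded[block]) {
            loaded[block] = 1;
            misses++;
        }
    }
    return misses;
}
```

For the ten executed instructions, `count_misses(example_trace, 10, 4)` yields 4 misses (40%) and `count_misses(example_trace, 10, 8)` yields 2 misses (20%), matching the slide.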
1. Code interpretation: miss-ratio
(the reverse of the process that generates code with a fixed miss-ratio)
[Figure: hit/miss annotation and local miss-ratio per instruction for i1 to i16, line size = 4 words]
1. Code interpretation: miss-ratio
• Determine the worst-case cache miss-ratio (WCCMR)
  • The highest frequency of misses possible for a program!
  • Depends on the execution path (actually on input data)
[Figure: alternative execution paths, one with a higher and one with a lower miss-ratio]
• The WCCMR path is the most energy-consuming!
• Optimize for
  • Speed or size
  • Energy consumption
1. Key concepts bounding WCCMR
• Spatial locality analysis
  • Determine each instruction's "local miss-ratio"
• Execution path analysis
  • Determine the weight of each basic block (loop dependent)
• Search
  • Find the execution path with the highest cache miss-ratio
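The search step can be illustrated once the first two analyses have produced, for each candidate execution path, a total miss count and a total instruction count. The sketch below simply compares pre-enumerated paths; the thesis does not describe its search at this level, so the data layout and the exhaustive comparison are assumptions made for illustration.

```c
#include <stddef.h>

/* Totals for one execution path, with loop weights already applied. */
typedef struct {
    unsigned misses;        /* cache misses along the path */
    unsigned instructions;  /* instructions executed along the path */
} path_t;

/* Returns the index of the path with the highest miss-ratio,
 * or -1 if there is no path with at least one instruction.
 * Ratios are compared by cross-multiplication to avoid rounding. */
int worst_case_path(const path_t *paths, size_t n)
{
    int best = -1;

    for (size_t i = 0; i < n; i++) {
        if (paths[i].instructions == 0)
            continue;
        if (best < 0 ||
            (unsigned long long)paths[i].misses * paths[best].instructions >
            (unsigned long long)paths[best].misses * paths[i].instructions)
            best = (int)i;
    }
    return best;
}
```

For three hypothetical paths with 4/10, 2/10 and 3/5 misses per instruction, the function selects the third (60%). Enumerating every path does not scale to programs with many branches, so a real analysis would need a smarter search.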
1. Result (finding WCCMR)

    ...
    if (a > b) {
        ...
        ...
        do {
            ...
        } while (c > d);
    } else {
        ...
        ...
        while (e < 3) {
            ...
        }
    }
    ...

[Figure: the code regions are labelled (1) to (6) and the worst-case path is marked "max !!"]
2. When is a cache memory beneficial?
• On a cache miss, the complete cache block is loaded
  • If the cache block is larger than one instruction, the miss costs more than a single fetch: the miss-penalty
• A cache can reduce system performance!
  • High miss-ratio AND long miss-penalty
• Experiment:
  • Generate code with a fixed miss-ratio
  • Measure the time
  • Plot the average execution time
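The break-even point that this experiment looks for can be written down under a simple linear model: with the cache enabled, an instruction costs the hit time plus the miss-ratio times the miss-penalty; with the cache disabled, it costs a constant. The sketch below solves for the miss-ratio at which both are equal. The model, the function name and the example numbers are assumptions and are not measurements from the thesis.

```c
/* Miss-ratio at which cached and uncached execution take equally long,
 * assuming  cached = hit_ns + m * miss_penalty_ns  and
 *           uncached = uncached_ns  (constant per instruction).
 * Returns -1.0 if the miss penalty is not positive.
 * A result above 1.0 means the cache wins at every possible miss-ratio. */
double threshold_miss_ratio(double uncached_ns, double hit_ns,
                            double miss_penalty_ns)
{
    if (miss_penalty_ns <= 0.0)
        return -1.0;
    return (uncached_ns - hit_ns) / miss_penalty_ns;
}
```

With hypothetical values of 10 ns per uncached instruction, a 2 ns hit time and a 16 ns miss-penalty, the threshold is 0.5: code that misses on more than half of its fetches would run faster with the cache disabled.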
2. Threshold miss-ratio level (@CPX2000)
[Plot: execution time (ns/instruction) versus cache miss-ratio (%), with one curve for cache enabled and one for cache disabled; threshold level at 84%]
2. When is a cache memory beneficial?
• Concluding question:
  • "When is instruction caching beneficial?"
• Answer:
  • "Always" (!!)
  • "No code is so jumpy"
  • "No missed deadlines"
  • "Safe!"
• (New Q&As)
  • "Why 84% miss?"
  • "Low refill penalty"
  • "Why?"
  • "Burst refill!"
[Figure: two diagrams of I/O, CPU, CACHE and MEM, one labelled "HIT" (Request) and one labelled "MISS!" (Request, Refill block)]
3. Cache Related Preemption Delay
• Extrinsic cache behavior: task interference
  • Non-preemptive systems
  • Preemptive systems
• Cache Related Preemption Delay (CRPD)
[Figure: miss-ratio over time for T1 followed by T2 without preemption, and for T1, T2, T1 when T2 preempts T1 and T1 later resumes]
3. CRPDmax measurement
[Figure: non-preempted versus preempted execution of T1 (labels: iteration 1, iteration 2, i3, i4, i4 (cont.)); miss-ratio over time as T2 preempts T1 and T1 resumes]
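3. CRPDmax measurement
[Figure: timeline of a non-preempted and a preempted iteration; OS: 43-87 µs]
Timestamps in ascending order (ns), matching a to e in the formula:
a = 915399425, b = 918791225, c = 921219825, d = 921592925, e = 922751625
CRPD = ((e - d) + (c - b)) - (b - a) = 195 500 ns = 195.5 µs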
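The CRPD formula above can be checked with a few lines of code. The sketch assumes that the five timestamps map to a through e in ascending order, which is the assignment that reproduces the 195 500 ns on the slide; the function name is illustrative.

```c
#include <stdint.h>

/* Cache related preemption delay from five timestamps in nanoseconds:
 *   a..b  one non-preempted iteration of the task
 *   b..c  the preempted iteration up to the preemption
 *   d..e  the rest of that iteration after the task resumes
 * The delay is the extra time the preempted iteration needs. */
int64_t crpd_ns(int64_t a, int64_t b, int64_t c, int64_t d, int64_t e)
{
    int64_t non_preempted = b - a;
    int64_t preempted     = (c - b) + (e - d);
    return preempted - non_preempted;
}
```

With the slide's timestamps, `crpd_ns(915399425, 918791225, 921219825, 921592925, 922751625)` evaluates to 195500, that is 195.5 µs. The interval from c to d, during which the preempting task runs, is deliberately left out of the sum.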
3. CRPD (@CPX2000)
[Plot: CRPD (microseconds) versus T1 task size (% of cache size); maximum marked at 195.5 µs]
Conclusions and summary of results • The worst-case cache miss-ratio of a program can be identified to quantify the energy usage of the memory system • The CPX2000 system cannot miss any deadline because of an enabled instruction cache. • Synthetic workbenches can force a system into a worst-case state • The cache related preemption delay has been measured as a function of task size.
Future Work
• None!
• Develop the analysis method for worst-case cache miss-ratio levels
  • by including temporal locality
• Data caches
  • (Generate synthetic code)
  • Measure CRPD
  • Measure threshold miss-ratio level
Acknowledgements • Research was funded by • KK-stiftelsen • Department of Computer Science and Engineering (Mälardalen University) • Thank you… • Supervisor Professor Dr. Ing. Lennart Lindh • All people at the Computer Architecture Lab • My family