Performance Counters on Intel® Core™ 2 Duo Xeon® Processors Intel Software College
Objective
• At the successful completion of this module, you will be able to
  • Use the VTune™ Performance Analyzer to identify micro-architectural bottlenecks in software running on Intel® Core™ 2 Duo Xeon® processors
  • Address the performance bottleneck for Intel® Core™ 2 Duo Xeon® processors
Agenda
• Core® micro-architecture review
• Event basics
• Events identifying Intel® Core™ 2 Duo Xeon® processor bottlenecks
• Summary
Next Generation Micro-Architecture: Intel® Core™ 2 Duo Processor
[Diagram: two cores (CPU-0 and CPU-1), each with a 32 KB L1 instruction cache, a 32 KB L1 data cache, L0/L1 DTLBs and a PMH, sharing a 4 MB L2 cache connected to the FSB]
Architecture Block and Instruction Flow
[Block diagram: Fetch/Decode (32 KB instruction cache, Next IP, Branch Target Buffer, 4-issue Instruction Decode, Microcode Sequencer, Register Allocation Table (RAT)); Execute (32-entry Reservation Stations (RS), Scheduler/Dispatch Ports feeding integer arithmetic, integer shift/rotate, FP add, FP div/mul and SIMD units, Load, Store Address and Store Data ports, Memory Order Buffer (MOB), 32 KB data cache); Retire (96-entry Re-Order Buffer (ROB), IA Register Set); Bus Unit to L2 cache/memory]
Disclaimer: This block diagram is for example purposes only. Significant hardware blocks have been arranged or omitted for clarity. Some resources (Bus Unit, L2 Cache, etc.) are shared between cores.
VTune™ Analyzer Event Basics: Events Versus Samples
• A performance counter increments on the CPU every time an event occurs
• A sample of the execution context is recorded every time a performance counter overflows
• Events = samples * sample after value
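As a rough illustration of the events/samples relation on this slide, the following Python sketch converts a sample count back into an estimated event count. The function name and the numbers are hypothetical and chosen only for the example.

```python
def estimated_events(samples: int, sample_after_value: int) -> int:
    """Estimate the total event count: events = samples * sample after value."""
    return samples * sample_after_value


# Hypothetical collection: 1,500 samples with a sample after value of 2,000,000.
print(estimated_events(1_500, 2_000_000))
```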
VTune™ Analyzer Event Basics: Retired Versus Non-Retired Events
• Retired events include only events that occur due to instructions that are committed to the machine state.
  • For example, when measuring the Loads Retired event, a load that occurs on a mispredicted execution path is not counted
• Most retired events can also be precise events.
  • No event skid
VTune™ Analyzer Event Basics: Event Skid
• On Pentium® 4 and Intel Xeon™ processors, an event can be attributed to a line a few lines after the one where it actually occurred in the disassembly or source view. This skid is due to interrupt latency.
VTune™ Analyzer Event Basics: Precise Events
• Do not suffer from event skid
• Use hardware to record the address where the event occurs
• Reduce the number of events you can collect at once
VTune™ Analyzer Event Basics: Precise Events (cont.)
[Figure: two panels labeled “On” and “Off”]
VTune™ Analyzer Event Basics: Event Ratios
• Calculate common processor performance metrics
• Built in to VTune™ analyzer
VTune™ Analyzer Event Basics: Clockticks and Instructions Retired
• Clockticks measure CPU cycles
  • Clockticks / processor frequency = time in seconds
• Instructions retired = the number of instructions committed to the processor state (executed completely)
• Cycles per instruction (CPI) = clockticks / instructions retired
High CPI usually indicates opportunities for optimization.
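The two ratios on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the counter values are assumptions, not VTune™ analyzer output.

```python
def elapsed_seconds(clockticks: int, frequency_hz: float) -> float:
    """Time in seconds = clockticks / processor frequency."""
    return clockticks / frequency_hz


def cycles_per_instruction(clockticks: int, instructions_retired: int) -> float:
    """CPI = clockticks / instructions retired."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return clockticks / instructions_retired


# Hypothetical counts for a 2.4 GHz processor.
ticks = 4_800_000_000
instructions = 3_200_000_000
print(elapsed_seconds(ticks, 2.4e9))
print(cycles_per_instruction(ticks, instructions))
```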
VTune™ Analyzer Event Basics: Clockticks Versus Non-halted Clockticks
• Clockticks = halted + non-halted cycles (but no sleep cycles)
• The clockticks event measures cycles when the physical processor is not in any sleep mode.
• The non-halted clockticks event measures the cycles in which a logical processor is neither asleep nor halted.
• If you measure clockticks on a Hyper-Threading Technology-enabled system while running a single-threaded application, you will see many samples around the halt instruction in processor.sys.
Agenda
• Core® micro-architecture review
• Event basics
• Performance tuning for Intel® Core™ 2 Duo Xeon® processors
  • Events for performance
  • Performance optimization methodology
  • X86 cycle accounting
• Summary
Performance Events along Uop Flow (1)
[Block diagram: same pipeline as in “Architecture Block and Instruction Flow”, with the Bus Unit path to L2 cache/memory]
Memory Access
• Latencies
  • L1 miss that hits L2: ~10 cycles
  • L2 miss, access to memory: ~300 cycles (server/FBD)
  • L2 miss, access to memory: ~165 cycles (desktop/DDR2)
• Cache bandwidth
  • Bandwidth to cache: ~8.5 bytes/cycle
• Memory bandwidth
  • Desktop: ~6 GB/sec/socket (Linux*)
  • Server: ~3.5 GB/sec/socket
* Other names and brands may be claimed as the property of others.
Performance Events for the Front End
Memory BW = 64 * Bus_Trans_Mem * freq / Cpu_Clk_Unhalted
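The bandwidth formula on this slide can be evaluated as in the Python sketch below. It assumes that the factor 64 is the number of bytes moved per bus memory transaction and that `freq` is the core frequency in Hz. The function name and the counter values are hypothetical.

```python
BYTES_PER_TRANSACTION = 64  # the factor 64 in the slide formula (assumed: bytes per transaction)


def memory_bandwidth_bytes_per_sec(bus_trans_mem: int,
                                   cpu_clk_unhalted: int,
                                   frequency_hz: float) -> float:
    """Memory BW = 64 * Bus_Trans_Mem * freq / Cpu_Clk_Unhalted."""
    if cpu_clk_unhalted <= 0:
        raise ValueError("cpu_clk_unhalted must be positive")
    return BYTES_PER_TRANSACTION * bus_trans_mem * frequency_hz / cpu_clk_unhalted


# Hypothetical counts: 100 million memory transactions over 2.4 billion
# unhalted cycles on a 2.4 GHz core, i.e. one second of execution.
bw = memory_bandwidth_bytes_per_sec(100_000_000, 2_400_000_000, 2.4e9)
print(f"{bw / 1e9:.2f} GB/sec")
```

Dividing `Cpu_Clk_Unhalted` by the frequency gives the elapsed unhalted time, so the formula is simply bytes transferred divided by seconds.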
Lab Activity 1: Calculating the Memory Access Bandwidth
• In this lab, you will use the VTune™ analyzer to collect performance counter events and calculate the memory bandwidth from them
Performance Events along Uop Flow (2)
[Block diagram: same pipeline as in “Architecture Block and Instruction Flow”, with the annotations “Resource_Stalls measures here” and “transfer from Decode”]
Performance Events of Resource_Stalls
• Uop flow to OOO engine blocked by downstream cause
• Resource_Stalls.BR_MISS_CLEAR
  • Pipeline stalls due to flushing mispredicted branches
• Combine in Resource_stalls.CLEAR
  • Mispredicted branch followed by fp inst
• Resource_Stalls.ROB_FULL
  • 96 instructions in ROB
• Resource_Stalls.LD_ST
  • All Store or Load buffers in use
• Resource_Stalls.RS_FULL
  • 32 instructions waiting for inputs in Reservation Station
Measuring Instruction Starvation
• There really is no good way to do this
  • Anti-correlate with Resource_Stalls.RS_FULL
• There could be
  • Cycles Decode queue is empty
  • Cycles RS is empty
  • Cycles ROB is empty
Performance Events along Uop Flow (3)
[Block diagram: same pipeline as in “Architecture Block and Instruction Flow”, with the annotations “Rs_uops_dispatched measures at Execution” and “Other stalls measures at Execution”]
Measuring Efficiency in the Execution Stage
• OOO engine optimizes instruction issue to functional units from the Reservation Station
  • Uops wait there until their inputs are available
• RS_UOPS_DISPATCHED measures the number of uops dispatched from the RS on each cycle
• There are chains preventing the OOO engine from executing in parallel
  • Partial Register Stall
  • Partial Flag Register Stall
  • Domain bypass
  • Others…
Performance Events along Uop Flow (4)
[Block diagram: same pipeline as in “Architecture Block and Instruction Flow”, with the annotation “Uops_retired measures at Retirement”]
Retirement vs Dispatch
• Which function to work on first?
• For loops, the difference is due to OOO execution
• Fewer false positives when “stalls” are measured at dispatch
Performance Optimization Methodology
• This style of optimization has 2 components
  • Minimizing instruction count (path length)
    • A sort of “tree height” minimization
  • Minimizing deviations from ideal execution
    • Generically thought of as “stall cycles”
• Treating both equally is critical
Stalls, Execution Imperfection and Performance Analysis
• Stall cycles are used to indicate less than perfect execution
• An architectural decomposition of “stalls” can be used to guide the selection of architectural events
• The IP correlation of “stalls” and arch events then guides the optimization effort
• Stalls have 4 basic components in x86
  • Front End stalls
    • Execution stage instruction starvation (Front End)
  • Mispredicted branch pipeline flushing
  • Execution stalls
    • (Waiting on input/Scoreboard, L2 miss, BW, DTLB, glass jaws, etc.)
  • Cycles wasted executing instructions that are not retired
X86 Cycle Accounting and SW Optimization
• Cpu_clk_unhalted = “stalls” + dispatch = “stalls” + non_ret_dispatch + ret_dispatch
• Callouts on the equation
  • Traditional stall removal
  • Reduce branch mispredictions, PGO
  • Improve optimization to reduce instruction count, split loops to increase ILP
• Resource_stalls.br_miss_clear will estimate stalls due to pipeline flush
Cycle Accounting on X86
• Cycles = “stalls” + dispatch
  • An equality by definition
• Cycles ~ CPU_CLK_UNHALTED.CORE
  • For CPU-intensive applications/sampling
• Stall cycles = cycles with no uops dispatched = RS_UOPS_DISPATCHED.CYCLES_NONE
• Dispatch cycles = RS_UOPS_DISPATCHED
Cycle Accounting on X86 (cont.)
• Dispatch ~ cycles_dispatch_retiring_uops + cycles_dispatch_non_retiring_uops
  • Assumes no overlap of retired/non-retired uops
  • Worst case scenario
• Non-retired uops = rs_uops_dispatched - (uops_retired.any + uops_retired.fused)
• Non-retired uop cycles ~ non-retired uops / avg_uops_per_cycle
• Fractional wasted work = rs_uops_dispatched / (uops_retired.any + uops_retired.fused) - 1
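The wasted-work estimates on this slide can be sketched in Python as follows. The slide does not define `avg_uops_per_cycle`, so the sketch takes it as an input supplied by the caller. The function name and the counts are hypothetical.

```python
def wasted_work(rs_uops_dispatched: int,
                uops_retired_any: int,
                uops_retired_fused: int,
                avg_uops_per_cycle: float) -> dict:
    """Apply the slide's formulas for non-retired uops and wasted work."""
    retired = uops_retired_any + uops_retired_fused
    if retired <= 0 or avg_uops_per_cycle <= 0:
        raise ValueError("retired uops and avg_uops_per_cycle must be positive")
    non_retired = rs_uops_dispatched - retired
    return {
        # Non-retired uops = rs_uops_dispatched - (uops_retired.any + uops_retired.fused)
        "non_retired_uops": non_retired,
        # Non-retired uop cycles ~ non-retired uops / avg_uops_per_cycle
        "non_retired_cycles": non_retired / avg_uops_per_cycle,
        # Fractional wasted work = rs_uops_dispatched / retired - 1
        "fractional_wasted_work": rs_uops_dispatched / retired - 1,
    }


# Hypothetical counts: 1,200 uops dispatched, 1,000 retired, 2 uops per cycle.
print(wasted_work(1_200, 900, 100, 2.0))
```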
Pulling Cycle Accounting Together
[Chart omitted] Illustrative Example Only, Not Real Data
Decomposing Stalls: Elephants First
Pipeline Flush = Resource_Stalls.Br_Miss_Clear / cycles
L2 Hits = (MEM_LOAD_RETIRED.L1D_LINE_MISS - MEM_LOAD_RETIRED.L2_LINE_MISS) * 10 / cycles
DTLB/L2 Miss = event count * penalty / cycles
FE + Scoreboard = Stalls - all of the above
[Chart omitted] Illustrative Example Only, Not Real Data
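A minimal Python sketch of this decomposition follows, with every term expressed as a fraction of total cycles. It assumes that “Stalls” is the stall-cycle count (cycles with no uops dispatched) divided by cycles, and it leaves the L2 miss penalty as a parameter because the slide does not fix it. The DTLB term is omitted because no DTLB event or penalty is given. All names and numbers are hypothetical.

```python
L2_HIT_PENALTY_CYCLES = 10  # the factor 10 in the slide's "L2 Hits" formula


def decompose_stalls(cycles: int,
                     stall_cycles: int,
                     br_miss_clear: int,
                     l1d_line_miss: int,
                     l2_line_miss: int,
                     l2_miss_penalty_cycles: float) -> dict:
    """Split the stall fraction into the slide's components (DTLB omitted)."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    stalls = stall_cycles / cycles
    pipeline_flush = br_miss_clear / cycles
    l2_hits = (l1d_line_miss - l2_line_miss) * L2_HIT_PENALTY_CYCLES / cycles
    l2_miss = l2_line_miss * l2_miss_penalty_cycles / cycles
    return {
        "stalls": stalls,
        "pipeline_flush": pipeline_flush,
        "l2_hits": l2_hits,
        "l2_miss": l2_miss,
        # FE + Scoreboard = Stalls - all of the above
        "fe_scoreboard": stalls - (pipeline_flush + l2_hits + l2_miss),
    }


# Hypothetical counts, using a 165-cycle L2 miss penalty.
for name, fraction in decompose_stalls(
        1_000_000, 500_000, 50_000, 20_000, 1_000, 165).items():
    print(f"{name}: {fraction:.3f}")
```

Because the remainder is computed by subtraction, any over-counting in the named components shrinks the “FE + Scoreboard” share, and it can even go negative.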
Decomposing Unstalled Cycles
Non_Retired = (1 - (Uops_retired.any + Uops_retired.fused) / RS_Uops_Dispatched) * RS_Uops_Dispatched.Cycles_None / CPU_CLK_UNHALTED.CORE
OOO Bursts = Uops_Retired.Any - Stalls - Non_Retired
[Chart omitted] Illustrative Example Only, Not Real Data
Pulling it All Together
Risks over-counting / minimizing FE + Scoreboard, but offers a guide to execution inefficiencies
[Chart omitted] Illustrative Example Only, Not Real Data
The “Big 4” Events for Performance
• CPU_CLK_UNHALTED.CORE
• RS_UOPS_DISPATCHED.CYCLES_NONE
• MEM_LOAD_RETIRED.L2_LINE_MISS
• BUS_TRANS_ANY.SELF
CYCLES, STALLS, UNPREFETCHED LOADS and BANDWIDTH
Architectural Pitfalls: The Ants
Contribute to “FE + Scoreboard”
And don’t forget micro-fusion, macro-fusion, etc.
A Heuristic Break-down for Stall Analysis
[Flow chart: starting from “Stalls?”, the branches cover the “Big 4” (L2 cache) and L1D cache, Front End Stalls, Exe Unit Stalls, Resource Stalls, Retirement Efficiency, and others. Labels on the chart include instruction decoding and LCP, register-related and domain-related stalls, and RS-related and RAT-related stalls.]
A Heuristic Break-down for Stall Analysis (cont.)
Lab Activity 2: Using a SW Tool to Reduce the Instruction Count
• In this lab, you will practice using the Intel compiler vectorization switch to reduce the instruction count.
Lab Activity 3: Addressing the Performance Bottleneck in the Front End
• In this lab, you will use “Big 4” event analysis to identify and address a performance issue that arises in the Front End of the processor.
Lab Activity 4: Addressing the Performance Bottleneck in the Execution Core
• In this lab, you will identify and address the performance issue caused in the execution core of the processor.
A Loop Methodology
• Identify hot functions and raise optimization
  • Fix alignments, split loops to enhance vectorization
• Identify BW limited functions
  • Merge BW loops with FP limited loops
• Identify L2 misses and add sw prefetch
• Optimize flow through OOO Engine
  • Use loop splitting to assist here
More Detailed Event Selection Hierarchy
[Table omitted]
SAV values selected so ratio of samples ~ absorbs penalty
More Detailed Event Selection Hierarchy (cont.)
[Table omitted]
SAV values selected so ratio of samples ~ absorbs penalty
EX: L1 miss/L2_hit penalty is 10 cycles
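One possible reading of the SAV note is that an event's sample after value is set to the clocktick SAV divided by the event's penalty. Since events = samples * SAV, the ratio of event samples to clocktick samples then approximates the fraction of cycles lost to that event. The Python sketch below illustrates this reading only. The function names and SAV values are hypothetical, not values taken from the slide's table.

```python
def sav_for_event(clocktick_sav: int, penalty_cycles: int) -> int:
    """Pick an event SAV so that one event sample ~ one clocktick sample of penalty."""
    if penalty_cycles <= 0:
        raise ValueError("penalty_cycles must be positive")
    return max(1, clocktick_sav // penalty_cycles)


def cycle_fraction(event_samples: float, clocktick_samples: float) -> float:
    """With SAVs chosen as above, the sample ratio approximates penalty cycles / cycles."""
    return event_samples / clocktick_samples


# Hypothetical setup: clocktick SAV of 2,000,000 and the 10-cycle
# L1 miss / L2 hit penalty from the slide.
event_sav = sav_for_event(2_000_000, 10)

# Hypothetical run: 100,000,000 cycles and 1,000,000 L2 hits.
clocktick_samples = 100_000_000 / 2_000_000
event_samples = 1_000_000 / event_sav
print(event_sav, cycle_fraction(event_samples, clocktick_samples))
```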
Summary
• Utilize Core™ micro-architecture for software performance
  • Front end
  • OOO execution core
• Use the VTune™ analyzer to identify micro-architectural bottlenecks in your software.
• Use a cycle accounting methodology to improve performance.
Micro-Architecture Comparison
[Comparison table omitted]
++ Cedar Mill/Dempsey
** NGMA = Next Generation Micro-Architecture (Conroe/Woodcrest)
= per core
Execution Unit Comparisons: NGMA and Intel NetBurst® Micro-Architecture
[Diagram: execution ports and their units for the two micro-architectures. Units shown include integer arithmetic, integer shift/rotate, integer multiply, FP add, FP mul/div, SIMD, load and store. Labels include Ports 0, 1, 2, 4 and 5 and “2x Core Freq”.]
DTLB Structure
[Chart omitted; series: L1 $ Hit, L1DTLB Hit; L1 $ Hit, L1DTLB Miss; L2 $ Hit, L1DTLB Miss]
Disclaimer: Data is from a pointer chasing microbenchmark and for illustrative purposes only