Investigating cache performance and memory access patterns impact on algorithm efficiency, using empirical and analytic methods to improve program runtime. Includes experiments, results, and cache design parameters.
Cache Miss Analysis of Walsh-Hadamard Transform Algorithms Mihai A. Furis Advisor: Jeremy Johnson, Ph.D. Department of Computer Science
Abstract Processor speed has been increasing at a much greater rate than memory speed, leading to the so-called processor-memory gap. To compensate for this gap in performance, modern computers rely heavily on a hierarchical memory organization with a small amount of fast memory called cache. The true cost of memory access is hidden, provided data can be obtained from cache. Substantial improvement in the runtime of a program can be obtained by making intelligent algorithmic choices that better utilize cache. In this work, we investigate cache performance using both empirical and analytic techniques. The goal is to better understand how algorithmic choices affect cache performance. Using cache simulators and hardware counters, we compare cache performance for different memory access patterns, and use this data to model and analyze the cache behavior of a more complicated algorithm.
Objective, strategy and results • Objective Determine the relation between memory access patterns (algorithm) and the memory architecture. Develop a performance model to analyze and predict cache behavior for an algorithm. Improve and optimize algorithms by making intelligent choices based on the performance model. • Strategy 1) Investigate strided access patterns on different memory configurations. 2) Extend results to a more complicated algorithm (WHT). • Experiments and Results 1) Measured runtime performance and cache misses of a benchmark program. 2) Used a simulator to investigate different memory configurations. 3) Measured performance and cache misses of the WHT algorithm(s) and developed a parameterized model to predict cache behavior. 4) Analyzed the cache performance of the WHT.
Outline • Part I • Review of cache design parameters • Tools for measuring cache performance • Performance counters, memory trace & simulator • Investigation of strided memory access patterns • Part II • Review family of WHT algorithms • Cache model for WHT • Investigation of cache misses for WHT
Cache structure and organization [Figure: Processor, ITLB, DTLB, L1 Instruction Cache, L1 Data Cache, L2 Unified Cache, Main Memory]
Cache Design Parameters • Cache size Usually a power of two. • Block size The smallest amount of data that can be transferred between memory and cache. Larger blocks provide a form of prefetching. For a direct-mapped cache, the mapping between main memory and cache is given by the formula: (Block address) MOD (Number of blocks in the cache) • Associativity Provides a set of locations in the cache, any of which can hold a block that maps to that set. Caches range from direct mapped through set associative to fully associative. Mapping formula: (Block address) MOD (Number of sets in cache)
Three C model • Compulsory misses The first access to a block is not in the cache, so the block must be brought into the cache. These are also called cold-start misses or first-reference misses. • Capacity misses If the cache cannot contain all the blocks needed during execution of a program, capacity misses will occur due to blocks being discarded and later retrieved. • Conflict misses If the block-placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set.
The Stride program
    for (csize = CACHE_MIN; csize <= CACHE_MAX; csize = csize * 2)
        for (stride = 1; stride <= csize/2; stride = stride * 2) {
            sec = 0;                        /* initialize timer */
            limit = csize - stride + 1;     /* cache size this loop */
            steps = 0;
            do {                            /* repeat until collect 1 second */
                sec0 = get_seconds();       /* start timer */
                for (i = SAMPLE * stride; i != 0; i = i - 1)   /* larger sample */
                    for (index = 0; index < limit; index = index + stride)
                        x[index] = x[index] + 1;               /* cache access */
                steps = steps + 1;          /* count completed timing passes */
                sec = sec + (get_seconds() - sec0);            /* accumulate elapsed time */
            } while (sec < 1.0);            /* until collect 1 second */
        }
The memory access pattern for the Stride program [Figure: memory locations 1, 2, 3, 4, 5, 6, 7, ... with the elements touched for stride values 0, 1, 2, 3]
Performance Analysis Tools Performance Counters In recent years, microprocessors have been designed with special hardware support for measuring and monitoring their performance. The performance-monitoring interface used in this work is the Performance Counter Library (PCL). More information about PCL can be found at http://www.fz-juelich.de/zam/PCL/. The hardware support takes the form of a set of performance counters with a defined set of countable events. The PCL interface allows us to initialize the counters with a specified set of events, record these events, and retrieve the results at the end. Cache Simulator We used the Dinero cache simulator to simulate the execution of the Stride program for different cache sizes and strides on a virtual machine similar to our lab machine n1-10-78 (Pentium III).
Machine Configuration
Identification: GenuineIntel, Pentium III (n1-10-78.mcs.drexel.edu)
Hardware:
• TLB
  Instruction: 4K-Byte pages, 4-way set associative, 32 entries; 4M-Byte pages, fully associative, two entries
  Data: 4K-Byte pages, 4-way set associative, 64 entries; 4M-Byte pages, 4-way set associative, eight entries
• L1 Cache
  Instruction: 16K Bytes, 4-way set associative, 32-byte line size
  Data: 16K Bytes, 2-way or 4-way set associative, 32-byte line size
• L2 Unified Cache: 512K Bytes, 4-way set associative, 32-byte line size
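The Walsh-Hadamard transform The Walsh-Hadamard Transform of a signal $x$ of size $N = 2^n$ is the matrix-vector product $WHT_N \cdot x$, where $WHT_N = \underbrace{WHT_2 \otimes \cdots \otimes WHT_2}_{n \text{ factors}}$ and $WHT_2 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$.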
WHT factorizations • Recursive factorization • Iterative factorization • General Factorization
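For reference, the three factorizations named above are commonly written as follows. This is a sketch in standard tensor-product notation; the symbols $I_m$ (the $m \times m$ identity) and the partition $n = n_1 + \cdots + n_t$ are assumptions about notation rather than a transcription of the slide.

```latex
% Recursive factorization
WHT_{2^n} = \left(WHT_2 \otimes I_{2^{n-1}}\right)\left(I_2 \otimes WHT_{2^{n-1}}\right)

% Iterative factorization
WHT_{2^n} = \prod_{i=1}^{n} \left(I_{2^{i-1}} \otimes WHT_2 \otimes I_{2^{n-i}}\right)

% General factorization, for n = n_1 + \cdots + n_t
WHT_{2^n} = \prod_{i=1}^{t} \left(I_{2^{n_1 + \cdots + n_{i-1}}} \otimes WHT_{2^{n_i}} \otimes I_{2^{n_{i+1} + \cdots + n_t}}\right)
```

The recursive form is the general form with $t = 2$ and $n_1 = 1$, and the iterative form is the general form with $t = n$ and every $n_i = 1$.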
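WHT algorithms (for $N = N_1 \cdot N_2 \cdots N_t$ with $N_i = 2^{n_i}$)
    R = N; S = 1;
    for i = t, …, 1
        R = R / N_i;
        for j = 0, …, R-1
            for k = 0, …, S-1
                apply WHT_{N_i} in place to the N_i elements of x starting at index j·N_i·S + k with stride S;
        S = S * N_i;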
Space of WHT algorithms (partition trees) [Figure: example partition trees rooted at 7]
Vector Breakdown Strategies: The Interleaved Split [Figure: a 16-element vector, indices 0 to 15, broken down according to a partition tree with nodes 4, 3, 1, 2, 1]
Vector Breakdown Strategies: The Cut Split [Figure: a 16-element vector, indices 0 to 15, broken down according to a partition tree with nodes 4, 1, 3, 1, 2]
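Generating random WHT factorization trees A split of a node of size n is encoded by writing n ones in a row and placing a bit in each of the n-1 gaps between them; a 1 bit marks a cut. Example for n = 8: the ones 1 1 1 1 1 1 1 1 with gap bits 0 1 0 0 1 1 0 give 2 | 3 | 1 | 2 = [2, 3, 1, 2].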
Conclusions and Future Work • Developed model for counting the number of cache misses in a WHT algorithm. • Empirically investigated the number of cache misses for different WHT algorithms and cache parameters. Future Work • Theoretical understanding of max, min, avg., and distribution of cache misses for WHT. • Refine model to account for runtime. • Generalize to other algorithms.