National Sun Yat-sen University Embedded System Laboratory. A Proposal for a New Hardware Cache Monitoring Architecture. Presenter: Chien-Chih Chen. Proceedings of the 2002 workshop on Memory system performance.
Abstract The analysis of the memory access behavior of applications, an essential step for successful cache optimization, is a complex task. It needs to be supported with appropriate tools and monitoring facilities. Currently, however, users can rely only on either simulation-based approaches, which deliver a large degree of detail but are restricted in their applicability, or on hardware counters embedded in processors, which can track only very few, mostly global events and hence provide only limited data.
Abstract (cont.) In this work a proposal for a novel hardware monitoring facility is presented which exhibits both the detail of traditional simulations and the low overhead of hardware counters. Like the latter approach, it is targeted towards an implementation within the processor for direct and non-intrusive access to caches and memory buses. Unlike traditional counters, however, it delivers a detailed picture of the complete memory access behavior of applications. This is achieved by generating so-called memory access histograms, which show access frequencies in relation to the application's address space. Such spatial memory access information can then be used for efficient program optimization by focusing on the code and data segments found to exhibit poor cache behavior.
What’s the Problem? • Existing approaches to analyzing the memory access behavior of applications can be roughly divided into two classes: simulation and hardware counters. • Tracing large applications can significantly slow down simulation due to the huge number of memory operations. • The counter approach is restricted to very few individual counters in each processor.
Related Works • [1] [2] A sampling-based system, DCPI, and its extension, which utilize the information available from hardware counters to build histograms. • [5] [10] Cache performance simulators. • [3] [8] [13] Hardware counters inside the processor. • Proposed hardware cache monitoring architecture (this work).
Cache Monitoring Architecture • Cache-like Counter Organization • User-definable Granularity • Multilevel Monitoring • Adding Temporal Information • Integration Aspects
User-definable Granularity • Filter each event using an address range of interest. • Mask all monitored events, allowing unimportant lower bits to be dropped from any address. • Aggregate neighboring events during the monitoring process.
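The three granularity controls above can be sketched in software. This is a minimal illustration, not the paper's hardware design; the address range, the 5-bit mask (32-byte blocks), and the function name are assumptions for the example.

```python
# Hypothetical sketch of the granularity controls: an address-range
# filter, a low-bit mask, and aggregation of neighboring events into
# a single counter per masked address (block).
from collections import defaultdict

RANGE_START, RANGE_END = 0x1000, 0x2000  # assumed address range of interest
GRANULARITY_BITS = 5                     # drop 5 low bits -> 32-byte blocks

def record_events(addresses):
    """Count events per 32-byte block within the range of interest."""
    counters = defaultdict(int)
    for addr in addresses:
        if not (RANGE_START <= addr < RANGE_END):   # range filter
            continue
        block = addr >> GRANULARITY_BITS            # mask unimportant low bits
        counters[block] += 1                        # aggregate neighboring events
    return counters

# 0x1000, 0x1004, 0x101C fall into the same 32-byte block; 0x3000 is filtered out
hist = record_events([0x1000, 0x1004, 0x101C, 0x1020, 0x3000])
```

Coarsening `GRANULARITY_BITS` merges more neighboring addresses into one counter, which is exactly the accuracy-versus-counter-count trade-off discussed later in the talk.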
Multilevel Monitoring • Used to compute both hit and miss rates for any cache level with respect to a given address region.
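One way to read this: with a monitor at each cache level counting accesses per block, the L1 hit rate for a region follows from the accesses that reached the next level. A small illustrative sketch (the function and dictionary layout are assumptions, not the paper's exact scheme):

```python
# Sketch: l1_counts[block] = accesses observed at L1 for that block;
# l2_counts[block] = accesses that reached L2, i.e. L1 misses.
def region_hit_rate(l1_counts, l2_counts, region_blocks):
    """L1 hit rate over an address region, from per-level monitor counts."""
    accesses = sum(l1_counts.get(b, 0) for b in region_blocks)
    misses = sum(l2_counts.get(b, 0) for b in region_blocks)
    return (1.0 - misses / accesses) if accesses else 0.0

l1 = {0x80: 100, 0x81: 50}   # per-block accesses seen at L1
l2 = {0x80: 10, 0x81: 25}    # per-block accesses that missed in L1
region_hit_rate(l1, l2, [0x80, 0x81])  # (150 - 35) / 150
```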
Adding Temporal Information • The main drawback of this approach is that it does not provide any temporal information about memory access behavior; it only captures the aggregated memory behavior of the whole monitoring period. • Monitor barriers, which trigger a complete swap-out of all partial results.
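The barrier-triggered swap-out can be sketched as follows; the class name, ring size, and phase tagging are assumptions for illustration, not the paper's hardware interface.

```python
# Hypothetical sketch: at each monitored barrier, swap the partial
# counter set out into a ring buffer, tagged with the phase number,
# so per-phase (temporal) access histograms can be reconstructed.
from collections import deque

class TemporalMonitor:
    def __init__(self, ring_size=4):
        self.counters = {}                    # partial histogram of current phase
        self.ring = deque(maxlen=ring_size)   # oldest snapshots are overwritten
        self.phase = 0

    def access(self, block):
        self.counters[block] = self.counters.get(block, 0) + 1

    def barrier(self):
        # complete swap-out of all partial results; start a fresh histogram
        self.ring.append((self.phase, self.counters))
        self.counters = {}
        self.phase += 1

m = TemporalMonitor()
m.access(0x80); m.access(0x80); m.barrier()   # phase 0 swapped out
m.access(0x81); m.barrier()                   # phase 1 swapped out
```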
Integration Aspects • How to swap counter values out into the ring buffer without influencing the measurements themselves. • Mask these accesses through appropriate filter mechanisms or by establishing separate store pipelines. • The monitor observes physical addresses, which need to be translated into the virtual address space for evaluation at the application level using the correct page table.
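The last point, mapping monitored physical addresses back into the application's virtual address space, can be sketched as an inverse page-table lookup. The 4 KB page size and the dictionary-based page table are assumptions for illustration.

```python
# Sketch of the post-processing step: invert a (captured) page table
# mapping virtual page number -> physical frame number, to recover the
# virtual address behind a monitored physical address.
PAGE_SHIFT = 12  # assume 4 KB pages

def phys_to_virt(phys_addr, page_table):
    """page_table: {virtual page number: physical frame number}."""
    frame = phys_addr >> PAGE_SHIFT
    offset = phys_addr & ((1 << PAGE_SHIFT) - 1)
    for vpn, pfn in page_table.items():   # invert the mapping
        if pfn == frame:
            return (vpn << PAGE_SHIFT) | offset
    return None  # frame does not belong to this application

pt = {0x10: 0x3, 0x11: 0x7}
phys_to_virt(0x7123, pt)  # frame 0x7 -> virtual page 0x11
```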
Experimental Setup using SIMT • A general tool designed for the evaluation of NUMA shared-memory machines, including simulation of the complete memory hierarchy. • Non-Uniform Memory Access (NUMA) is a computer memory design used in multiprocessors, where the memory access time depends on the memory location relative to a processor. • Single CPU with a two-level memory hierarchy: an 8 KB direct-mapped L1 and a 256 KB two-way set-associative L2 cache, each with 32-byte cache lines. • Benchmarks: LU decomposition, RADIX sort, and WATER.
Histograms of RADIX • Show cache and memory access behavior. • Show cache miss rates across the multilevel memory hierarchy.
Histograms of various LU phases • Show partial, per-phase memory access behavior. • Phases 3 and 5 show high locality on very few memory blocks. • Phases 4 and 5 show accesses over a large range of blocks.
Hardware Complexity & Runtime Overhead • All histograms are shown without any filtering and at the highest possible accuracy, assuming cache-line granularity. • The on-line delivery of memory access histograms leads to runtime overhead. • Reducing the granularity and increasing the number of counters are two options to compensate for this overhead and keep the monitor's influence minimal. • Reducing the granularity, however, also reduces accuracy. • Increasing the number of counters increases hardware complexity.
Results • High runtime overhead is reduced by coarsening the granularity. • 32 or 64 counters is a reasonable number, and hence implies reasonable hardware complexity. • (Figure: runtime overhead versus number of counters, granularity, and swap-out events.)
Conclusions • Novel architecture without major hardware complexity overhead. • Experiments show that the precision of these access histograms is close to that of a full simulation. • This monitoring system is capable of observing any kind of event that can be related to an address.
Comments • Multilevel memory monitoring architecture. • Partial and temporal cache behavior aggregation technique. • Counter array and ring buffer technique.