Cache coherence for CMPs
Miodrag Bolic
Private cache
• Each cache bank is private to a particular core.
• Cache coherence is maintained at the L2 cache level.
• Examples: Intel Montecito [81], AMD Opteron [56], and IBM POWER6 [63].
Private cache
Advantages
• Short L2 cache access latency.
• Little network traffic: the local L2 cache bank filters most memory requests, so few coherence messages are injected into the interconnection network.
Disadvantages
• Data blocks can be duplicated in several caches.
• If the working set accessed by the different cores is not well balanced, some caches can be over-utilized while others are under-utilized.
Shared cache
• Cache coherence is maintained at the L1 level.
• The bits usually chosen to map a block to a particular bank are the least significant ones.
• Examples: Piranha [16], Hydra [47], Sun UltraSPARC T2 [105], and Intel Merom [104].
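The bank mapping described on this slide can be sketched as follows. This is a minimal illustration, not the mapping of any particular chip: the function name `home_bank`, the 64-byte block size, and the 16 banks are assumptions, and the bank-select bits are taken to be the least significant bits of the block number (the address with the block offset removed).

```python
def home_bank(address: int, block_size: int = 64, num_banks: int = 16) -> int:
    """Return the shared-L2 bank that holds the block containing `address`.

    Assumes (hypothetically) power-of-two block size and bank count.
    """
    assert block_size > 0 and block_size & (block_size - 1) == 0
    assert num_banks > 0 and num_banks & (num_banks - 1) == 0
    offset_bits = block_size.bit_length() - 1  # bits that address a byte inside the block
    block_number = address >> offset_bits      # drop the block offset
    return block_number & (num_banks - 1)      # least significant bits select the bank


# Consecutive blocks map to consecutive banks and wrap around after num_banks.
banks = [home_bank(block * 64) for block in range(20)]
```

With this kind of interleaving, neighbouring blocks are spread over all banks regardless of which core touches them.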
Shared caches
Advantages
• Single copy of each block.
• Workload balancing: the utilization of each cache bank does not depend on the working set accessed by each core. Blocks are distributed uniformly among the cache banks in a round-robin fashion, so the aggregate cache capacity is increased.
Disadvantage
• Many requests are serviced by remote banks (L2 NUCA architecture).
Hammer protocol
• Used in AMD Opteron systems.
• It relies on broadcasting requests to all tiles to resolve cache misses.
• It targets systems that use unordered point-to-point interconnection networks.
• On every cache miss, Hammer sends a request to the home tile. If the memory block is present on-chip, the request is forwarded to the rest of the tiles to obtain the requested block.
• All tiles answer the forwarded request by sending either an acknowledgement or the data message to the requesting core.
• The requesting core must wait until it has received a response from every other tile. Once it has received all the responses, it sends an unblock message to the home tile.
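The message flow of a single miss, as listed on this slide, can be traced with a small sketch. It is a simplification under stated assumptions, not the AMD implementation: the names `hammer_miss` and `Message` are hypothetical, the block is assumed to be on-chip with exactly one tile (`owner`) supplying the data, and the home tile's own cache is treated like any other tile that receives a forwarded request.

```python
from typing import List, Tuple

Message = Tuple[str, int, int]  # (kind, source tile, destination tile)


def hammer_miss(requester: int, home: int, owner: int, num_tiles: int) -> List[Message]:
    """List the messages generated by one cache miss in a Hammer-style protocol."""
    msgs: List[Message] = [("request", requester, home)]        # hop 1: miss goes to the home tile
    others = [t for t in range(num_tiles) if t != requester]
    msgs += [("forward", home, t) for t in others]               # hop 2: home broadcasts the request
    for t in others:                                             # hop 3: every tile answers the requester
        kind = "data" if t == owner else "ack"
        msgs.append((kind, t, requester))
    msgs.append(("unblock", requester, home))                    # sent after all responses have arrived
    return msgs


trace = hammer_miss(requester=0, home=5, owner=9, num_tiles=16)
```

In this model a miss on a 16-tile chip produces 32 messages, of which only one carries data, which illustrates why the broadcast is costly in traffic.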
Hammer protocol
Disadvantages
• Three hops are required in the critical path before the requested data block is obtained.
• Broadcasting invalidation messages considerably increases the traffic injected into the interconnection network and, therefore, its power consumption.
Directory protocol
• To accelerate cache misses, the directory information is not stored in main memory. Instead, it is usually stored on-chip at the home tile of each block.
• In tiled CMPs, the directory structure is split into banks, which are distributed across the tiles.
• Each directory bank tracks a particular range of memory blocks.
Directory protocol
• The indirection problem:
  • Every cache miss must reach the home tile before any coherence action can be performed.
  • This adds unnecessary hops to the critical path of cache misses.
• The directory memory overhead needed to keep track of the sharers of each memory block can be intolerable for large-scale configurations.
• Example: block size 16 bytes, 64 tiles
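The arithmetic behind the example can be worked out as below. The slide does not say how sharers are encoded, so this sketch assumes a full-map bit vector with one presence bit per tile for every block; the function name `full_map_overhead` is hypothetical, and state bits and tags are ignored.

```python
def full_map_overhead(block_bytes: int, num_tiles: int) -> float:
    """Directory bits per data bit, assuming one presence bit per tile per block."""
    sharer_bits = num_tiles       # full-map bit vector: one bit for each tile
    data_bits = block_bytes * 8   # size of the tracked memory block
    return sharer_bits / data_bits


overhead = full_map_overhead(block_bytes=16, num_tiles=64)  # 64 / 128 = 0.5
```

Under that assumption, the slide's configuration needs 64 directory bits for every 128 data bits, a 50% overhead, and the ratio grows linearly with the number of tiles.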
Mapping between cache entries and directory entries
• One way to keep the size of the directory entries constant is to store duplicate tags.
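A duplicate-tag directory can be sketched roughly as follows. This is only an illustration under assumptions: the class `DuplicateTagDirectory` and its method names are hypothetical, and each tile's duplicate tags are modelled as a Python set rather than as a mirror of the cache's set and way structure.

```python
from typing import List, Set


class DuplicateTagDirectory:
    """Keeps a copy of the tags held in each tile's private cache."""

    def __init__(self, num_tiles: int) -> None:
        self.tags: List[Set[int]] = [set() for _ in range(num_tiles)]

    def on_fill(self, tile: int, block: int) -> None:
        """Record that `tile` now caches `block`."""
        self.tags[tile].add(block)

    def on_evict(self, tile: int, block: int) -> None:
        """Record that `tile` no longer caches `block`."""
        self.tags[tile].discard(block)

    def sharers(self, block: int) -> List[int]:
        """Find the sharers by searching every tile's duplicate tags."""
        return [tile for tile, tags in enumerate(self.tags) if block in tags]


directory = DuplicateTagDirectory(num_tiles=4)
directory.on_fill(0, 0x40)
directory.on_fill(2, 0x40)
```

Each stored entry is a single tag, so its size does not depend on the number of tiles; the cost moves to the lookup, which has to check the duplicate tags of every tile.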