Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking Jason F. Cantin, Mikko H. Lipasti, and James E. Smith International Symposium on Computer Architecture June 7th, 2005
Overview of Idea Coarse-Grain Coherence Tracking: • Monitors coherence status of memory at a multi-line granularity • Uses the coarse-grain information to identify requests that don’t need a coherence broadcast • Sends these requests directly to memory ISCA 2005
Problem: Snoop-based systems support a limited number of processors • Limited broadcast bandwidth • Increasing memory latency [Figure labels: Broadcast Network, Data Network, P, NC, $, MC, DRAM]
Opportunity • Some data requests don’t need a broadcast • Requests for non-shared data • Fetches of unmodified instructions • Write-backs • Some non-data requests don’t need to leave the processor • Requests to upgrade a copy that is not shared • Requests to flush copies of lines that are not cached elsewhere
Unnecessary Broadcasts [figure]
Our Approach • Identify requests that don’t need a broadcast • Send data requests directly to memory • Reduce broadcast traffic • Reduce latency in some systems • Avoid sending non-data requests externally • Further reduce broadcast traffic • Reduce latency
Coarse-Grain Coherence Tracking • Memory is divided into coarse-grain regions • Aligned, power-of-two multiple of cache line size • Can range from two lines to a physical page • A cache-like structure is added to each processor for monitoring coherence at the granularity of regions • Region Coherence Array (RCA)
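Because regions are aligned and a power-of-two multiple of the line size, mapping an address to its region is a shift or mask. The sketch below is illustrative only: the 64-byte line size is an assumption, the 512-byte region size is one of the sizes mentioned later in the talk, and the function names are hypothetical.

```python
LINE_SIZE = 64           # bytes per cache line (assumed for illustration)
LINES_PER_REGION = 8     # must be a power of two; gives 512-byte regions
REGION_SIZE = LINE_SIZE * LINES_PER_REGION


def region_base(addr):
    """Address of the first byte of the aligned region containing addr."""
    return addr & ~(REGION_SIZE - 1)


def region_number(addr):
    """Index of the region containing addr (what an RCA would tag on)."""
    return addr // REGION_SIZE


def line_in_region(addr):
    """Which of the region's lines (0 .. LINES_PER_REGION - 1) holds addr."""
    return (addr % REGION_SIZE) // LINE_SIZE


print(hex(region_base(0x12C4)))   # 0x1200
print(region_number(0x12C4))      # 9
print(line_in_region(0x12C4))     # 3
```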
Coarse-Grain Coherence Tracking • Each entry has an address tag, state, and count of lines cached by the processor • The state indicates if the processor and / or other processors are sharing / modifying lines in the region • On cache misses, the region state is read to determine if a broadcast is necessary
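The miss-time check described above can be sketched as a lookup followed by a decision on the region state. This is a minimal sketch under stated assumptions: the four state names are a simplified, hypothetical model rather than the state encoding used in the talk, the RCA is modeled as a plain dictionary, and the rule that a read for a shared copy may skip the broadcast when other processors hold only clean lines is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class RegionState(Enum):
    # Simplified, hypothetical states; not the encoding used in the talk.
    INVALID = 0        # nothing known about the region
    EXCLUSIVE = 1      # no other processor caches lines of the region
    OTHERS_CLEAN = 2   # other processors may hold unmodified lines
    OTHERS_DIRTY = 3   # other processors may hold modified lines


@dataclass
class RegionEntry:
    tag: int                                   # region address tag
    state: RegionState = RegionState.INVALID   # coarse-grain coherence state
    line_count: int = 0                        # lines of this region cached locally


def needs_broadcast(rca, region_tag, wants_exclusive):
    """Decide on a cache miss whether the request must be broadcast."""
    entry = rca.get(region_tag)
    if entry is None or entry.state is RegionState.INVALID:
        return True                 # no region information: snoop as usual
    if entry.state is RegionState.EXCLUSIVE:
        return False                # nobody else has lines: go to memory
    if entry.state is RegionState.OTHERS_CLEAN:
        return wants_exclusive      # assumed: shared reads can skip the snoop
    return True                     # others may hold modified data


rca = {0b001: RegionEntry(tag=0b001, state=RegionState.EXCLUSIVE, line_count=1)}
print(needs_broadcast(rca, 0b001, wants_exclusive=False))  # False
print(needs_broadcast(rca, 0b010, wants_exclusive=False))  # True
```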
Coarse-Grain Coherence Tracking • On snoops, the region state provides a response for the region • Piggy-backed onto the conventional response • Used to update other processors’ region state • RCA maintains inclusion over caches • When regions are evicted, their lines are evicted • RCA must respond correctly if a region’s lines are cached • Replacement algorithm uses line count
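Inclusion and the line-count-aware replacement can be sketched as two small helpers. This is an illustrative sketch, not the replacement policy from the talk: preferring the entry with the fewest cached lines and breaking ties by LRU order are assumptions, and both function names are hypothetical. The two-lines-per-region default mirrors the example slides.

```python
def choose_victim(entries):
    """Pick an RCA entry to replace.

    entries: list of (region_tag, line_count), least recently used first.
    min() returns the first minimal element, so ties go to the LRU entry,
    and a region with no cached lines is always chosen when one exists.
    """
    return min(entries, key=lambda e: e[1])[0]


def evict_region(cache_lines, region_tag, lines_per_region=2):
    """Maintain inclusion: evicting a region evicts its cached lines.

    cache_lines: set of line tags held in the cache (mutated in place).
    Returns the set of line tags that had to be evicted.
    """
    victims = {line for line in cache_lines
               if line // lines_per_region == region_tag}
    cache_lines.difference_update(victims)
    return victims


print(choose_victim([(5, 2), (7, 0), (9, 0)]))  # 7
lines = {0b0010, 0b0011, 0b1000}
print(sorted(evict_region(lines, 0b001)))       # [2, 3]
print(sorted(lines))                            # [8]
```

Choosing an empty region first means the eviction forces no cache lines out, which is the cost the inclusion requirement would otherwise impose.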
Example: Conventional Snooping • P0 loads 10000₂ • MISS • Snoop performed • Response sent • Data transfer [Animated diagram: P0 broadcasts "Read: P0, 10000₂" on the network; the snoop responses are "Invalid"; the line with tag 0010 in P0's cache ($0) goes Invalid → Pending → Exclusive once the data arrives; P1's cache ($1) stays Invalid]
Coarse-Grain Coherence Tracking (Region Coherence Array added; two lines per region) • P0 loads 10000₂ • MISS • Snoop performed • Response sent • Data transfer [Animated diagram: P0 broadcasts "Read: P0, 10000₂"; the snoop responses are "Invalid, Region Not Shared", so P0 has exclusive access to the region; line 0010 in $0 goes Invalid → Pending → Exclusive; P0's RCA entry for region 001 goes Invalid → Pending → DI]
Coarse-Grain Coherence Tracking (Region Coherence Array added; two lines per region) • P0 loads 11000₂ • MISS, Region Hit • Direct request sent • Data transfer [Animated diagram: the region state is exclusive, so a broadcast is unnecessary; "Read: P0, 11000₂" goes directly to memory; line 0011 in $0 goes Invalid → Pending → Exclusive; P0's RCA entry for region 001 stays DI]
Coarse-Grain Coherence Tracking (Region Coherence Array added; two lines per region) • P1 stores 10000₂ • MISS • Snoop performed • Hits in P0 cache • Response sent • Data transfer [Animated diagram: P1 broadcasts "RFO: P1, 10000₂"; the snoop responses are "Owned, Region Owned", so the region is not exclusive anymore; line 0010 in $0 goes Exclusive → Pending → Invalid and P0's RCA entry for region 001 goes DI → DD; line 0010 in $1 goes Invalid → Pending → Modified and P1's RCA entry for region 001 goes Invalid → Pending → DD]
Overhead • Storage space needed for RCA • 3-6% storage overhead for cache • Two bits needed in snoop response for region response • Path to memory needed to avoid broadcasts • Simple with on-chip memory controllers • May leverage data network
Simulator PHARMsim: • Execution-driven simulator built on top of SimOS-PPC • Four 4-way superscalar out-of-order processors • Two-level hierarchy with split L1, unified L2 caches • Separate address / data networks, similar to Fireplane • Region Coherence Array with the same number of sets and associativity as the L2
Workloads • Scientific • Ocean, Raytrace, Barnes • Multiprogrammed • SPECint2000_rate • Commercial • TPC-W, TPC-B, TPC-H, SPECweb99, SPECjbb2000
Broadcasts Avoided [figure]
Snoop Traffic Reduction – Peak [chart; surviving data labels: 64%, 51%, 38%]
Snoop Traffic Reduction – Average [chart; surviving data labels: 47%, 74%, 86%]
Execution Time [chart; surviving data label: 91.2%]
Remaining Opportunity • With 512B regions, ~10% of requests are broadcast unnecessarily • A third of the 10% are due to region false sharing • Half of the 10% miss in the RCA • Potential for prefetching
Inclusion Overhead • Regions with no lines cached are replaced first
Conclusion Coarse-Grain Coherence Tracking: • Reduces broadcast traffic • Most data requests sent directly to memory • Reduces latency • Many requests not sent to central arbitration point • Many non-data requests not sent externally • Improves scalability and performance
The End
Inclusion Evictions [figure]
Ordering • Ordering point is now the Region Coherence Array • A direct request is ordered once it accesses the RCA • Direct requests are serialized with respect to snoop requests • A direct request occurs either before or after a snoop • All must appear to access and update the RCA atomically • No two processors can have exclusive access to a region at the same time (no races)
Comparison to RegionScout [figure]
Execution Time [figure]