The Power of Priority: NoC based Distributed Cache Coherency
Evgeny Bolotin, Zvika Guz, Israel Cidon, Ran Ginosar, Avinoam Kolodny
QNoC Research Group, EE Department, Technion, Haifa, Israel
Chip Multi-Processor (CMP)
• Dual-core: monolithic shared cache
• Multi-core: large cache, shared cache, distributed cache
• NoC-based: how?
Future Cache - Physics Perspective
• Large cache → large access time
• Global wire delay: distance reached in a single cycle
  • Today: ~25% of chip
  • In 10 years: ~1% of chip
[Figure: global wire delay vs. gate delay across technology nodes from 250 down to 32. Source: ITRS 2003]
[Figure: fraction of chip reachable in 1 clock cycle. Source: Keckler et al. ISSCC 2003]
Large monolithic cache is not scalable
NUCA - Non Uniform Cache Architecture
Banked cache over NoC
• Smaller bank → smaller access time
• Multiple banks → multiple ports
• Closer bank → smaller access time
NUCA = non-uniform access times
Cache-line placement policy
• Static NUCA (SNUCA)
• Dynamic NUCA (DNUCA)
Sources: Kim et al. ASPLOS 2002; Beckmann et al. MICRO 2004
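To make the "closer bank, smaller access time" point concrete, here is a minimal sketch of a static NUCA lookup. It is not taken from the slides: the 4x4 grid, the 64-byte line, the interleaving of lines across banks by low-order line-address bits, and all cycle counts are illustrative assumptions, and `snuca_bank` and `access_cycles` are hypothetical names.

```python
# Hypothetical parameters, chosen only for illustration.
GRID_W, GRID_H = 4, 4        # 16 cache banks on a 4x4 grid
LINE_BYTES = 64              # cache-line size
BANK_CYCLES = 3              # access time of one small bank
HOP_CYCLES = 1               # delay per router hop


def snuca_bank(addr):
    """Static NUCA: the bank is fixed by the low-order bits of the line address."""
    line = addr // LINE_BYTES
    bank = line % (GRID_W * GRID_H)
    return bank % GRID_W, bank // GRID_W   # (x, y) position of the bank


def access_cycles(core_xy, addr):
    """Round-trip hop delay to the bank plus the bank access itself."""
    bx, by = snuca_bank(addr)
    hops = abs(core_xy[0] - bx) + abs(core_xy[1] - by)
    return 2 * hops * HOP_CYCLES + BANK_CYCLES
```

Under these assumptions a core at (0, 0) reaches its local bank in 3 cycles and the far-corner bank in 15, which is the non-uniformity the slide refers to.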
Issues in NUCA-based CMP
• NoC performance → CMP performance
• Cache coherency and transaction order (correctness)
• Search (in DNUCA)
• Different traffic types (e.g. fetch vs. prefetch)
• Synchronization (locks)
NoC Services for CMP?
Cache Coherency over NoC
How do we maintain coherency over NoC?
• Distributed directory
• Snooping
• Central directory
[Figure: cache bank with distributed directory]
Distributed Cache Coherency
Cache access → multiple NoC transactions
Example: simple read transaction
[Figure legend: ctrl. packet, data packet]
Read Transaction of Modified Block
[Figure legend: ctrl. packet, data packet]
Read Exclusive of Shared Block
[Figure legend: ctrl. packet, data packet]
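The packet diagrams for the three transactions above did not survive extraction. As a rough stand-in, the sketch below lists the NoC packets each transaction could generate under a generic textbook directory protocol. This is an assumption, not the authors' exact protocol: the message names, the write-back to the home bank, and the function names are all hypothetical.

```python
CTRL, DATA = "ctrl", "data"   # short control packet vs. long data packet


def read_shared(req, home):
    """Simple read: request to the home bank, block returned."""
    return [(req, home, CTRL, "read request"),
            (home, req, DATA, "block")]


def read_modified(req, home, owner):
    """Read of a block that another cache holds in modified state (assumed flow)."""
    return [(req, home, CTRL, "read request"),
            (home, owner, CTRL, "forward request"),
            (owner, req, DATA, "block"),
            (owner, home, DATA, "write-back")]


def read_exclusive_shared(req, home, sharers):
    """Read exclusive of a shared block: invalidate every sharer first (assumed flow)."""
    msgs = [(req, home, CTRL, "read-exclusive request")]
    for s in sharers:
        msgs.append((home, s, CTRL, "invalidate"))
        msgs.append((s, home, CTRL, "invalidate ack"))
    msgs.append((home, req, DATA, "block"))
    return msgs


def count(msgs):
    """Return (number of ctrl packets, number of data packets)."""
    kinds = [m[2] for m in msgs]
    return kinds.count(CTRL), kinds.count(DATA)
```

Even in this simplified form, a single cache access expands into several packets, and most of them are short control packets.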
Basic NoC to Support CMP
Off-the-shelf (Vanilla) NoC:
• Grid of wormhole routers
• Unicast only
• Ordering in network
• Static routing
• No virtual channels
• Smart interfaces
Can We Do Better?
[Figure: vanilla NoC]
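The slide does not say which static routing function the vanilla NoC uses. One common choice on a grid is dimension-ordered (XY) routing, sketched here purely as an assumption. Because every packet between a given source and destination follows the same path, a scheme like this also keeps packets of one flow in order. `xy_route` is a hypothetical helper.

```python
def xy_route(src, dst):
    """Return the router coordinates visited from src to dst,
    moving along X first and then along Y (dimension-ordered routing)."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path
```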
Observations: L2 Access
A) Delay = Queueing + NoC transactions
B) All NoC transactions are equally important
C) NoC transactions consist of:
• Short ctrl. packets
• Long data packets
Idea: Differentiate between Ctrl. and Data
• Solution: Preemptive Priority NoC
• Give priority to short ctrl. packets
Preemptive Priority NoC: QNoC
Service Levels (SL):
• Dedicated wormhole buffer
• Preemptive priority scheduling
[Figure: QNoC multiple-SL router and multiple-SL link]
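The two properties listed above, a dedicated buffer per service level and preemptive priority scheduling, can be illustrated with a toy model of one output port. This is a sketch under stated assumptions, not the QNoC router design: the class name, the two-level setup, the convention that level 0 is most urgent, and flit-granularity preemption are all chosen here for illustration.

```python
from collections import deque


class PriorityOutputPort:
    """One output link with a dedicated flit buffer per service level (SL).
    Assumption: SL 0 is the highest priority."""

    def __init__(self, num_levels=2):
        self.buffers = [deque() for _ in range(num_levels)]

    def enqueue(self, level, packet_id, num_flits):
        """Queue all flits of a packet in the buffer of its service level."""
        for i in range(num_flits):
            self.buffers[level].append((packet_id, i))

    def send_one_flit(self):
        """Forward one flit from the highest-priority non-empty buffer.
        A low-priority packet is paused between flits, never dropped."""
        for buf in self.buffers:
            if buf:
                return buf.popleft()
        return None


port = PriorityOutputPort()
port.enqueue(1, "data", 4)            # long data packet, low priority
order = [port.send_one_flit()[0]]     # its first flit goes out
port.enqueue(0, "ctrl", 1)            # short ctrl packet arrives mid-transfer
while True:
    flit = port.send_one_flit()
    if flit is None:
        break
    order.append(flit[0])
# order == ["data", "ctrl", "data", "data", "data"]
```

The control flit overtakes the remaining data flits, so the short packet is delayed by at most one flit time instead of a whole long packet.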
Example: Vanilla NoC
Transaction 1 (blue): long data
Transaction 2 (red): short req., long resp.
Without contention: X = delay of long packet, δ = delay of short packet
Vanilla NoC example
• Blue delay ~ X
• Red delay ~ 2X+δ
• Average delay ~ 1.5X
[Figure: transactions between nodes A and B]
Example: Priority NoC
Transaction 1 (blue): long data
Transaction 2 (red): short req., long resp.
Without contention: X = delay of long packet, δ = delay of short packet
Vanilla NoC example
• Blue delay = X
• Red delay = 2X+δ
• Average delay ~ 1.5X
Priority NoC example
• Blue delay = X+δ
• Red delay = X+δ
• Average delay ~ X
Potential delay reduction ~ 0.5X
[Figure: transactions between nodes A and B]
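The averages quoted in this example follow from a short calculation. The reading below assumes that in the vanilla NoC the short request of transaction 2 waits behind the entire long data packet of transaction 1, and that in the priority NoC it preempts that packet instead. The symbols D and the subscripts are introduced here for illustration only.

```latex
\begin{align*}
\text{Vanilla:}\quad & D_{\text{blue}} = X, \qquad D_{\text{red}} = X + \delta + X = 2X + \delta,\\
& \bar{D}_{\text{vanilla}} = \tfrac{1}{2}\,(3X + \delta) \approx 1.5X \quad (\delta \ll X),\\
\text{Priority:}\quad & D_{\text{blue}} = X + \delta, \qquad D_{\text{red}} = \delta + X,\\
& \bar{D}_{\text{priority}} = X + \delta \approx X,\\
\text{Saving:}\quad & \bar{D}_{\text{vanilla}} - \bar{D}_{\text{priority}} = \tfrac{1}{2}\,(X - \delta) \approx 0.5X.
\end{align*}
```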
Priority NoC: Different Destinations
• Very important in wormhole
• When ctrl. packet is blocked by other worms
[Figure labels: long data, short req.]
Protocol Correctness
Need state-preserving serialization of transactions in the processor interface
Numerical Evaluation
• CMP simulator (SIMICS)
  • Simulate parallel benchmarks
  • Obtain L2-cache access traces
• QNoC simulator (OPNET)
  • Simulate distributed coherence protocol over NoC
  • Measure total RD/RX L2-access delay
  • Measure total program throughput
Priority NoC: Results
Delay Reduction vs. Network Load
• Short ctrl. packet gets high priority
• Long data packet gets low priority
[Figures: RD Delay - Apache; RD/RX Delay Reduction - Apache]
Priority NoC: Several Benchmarks
[Figures: Delay Reduction; Program Speedup]
So Far: The Power of Priority
• Simplicity - Almost for Free
• Significant CMP Speed-up
• Good For:
  • Coherency
  • Traffic differentiation (e.g. Fetch vs. Pre-Fetch)
  • Search in DNUCA
  • Synchronization (Locks)
Advanced Support Functions
• Special Broadcast for Short Messages
  • Broadcast service (e.g. search in DNUCA)
  • Wormhole broadcast is slow and expensive → store-and-forward (S&F) broadcast embedded in wormhole
• Virtual Ring
  • No Additional Cost
  • For Invalidation Multicast
  • Snooping or synchronization
Summary
• NoC at the service of CMP!
• Shared cache over NoC
• Priority is powerful
• Built-in support functions