ECE 669 Parallel Computer Architecture Lecture 17 Memory Systems



1. ECE 669 Parallel Computer Architecture, Lecture 17: Memory Systems

2. Memory Characteristics
• Caching performance is important for system performance
• Caching is tightly integrated with networking
• Physical properties
• Consider topology and distribution of memory
• Develop an effective coherency strategy
• LimitLESS approach to caching
• Allow scalable caching

3. Perspectives
• Programming model and caching, or: the meaning of shared memory
• Sequential consistency (Lamport): the final state of memory is as if all reads and writes were executed in some single serial order, with each processor's own order maintained. [This notion borrows from similar notions of sequential consistency in transaction processing systems.]
(Figure: requests w1, r1, w2, r2, w3, r3, ... from several processors funneled into memory in one interleaved serial order.)

4. Coherent Cache Implementation
• Twist: on a write to a shared location
• Invalidation is sent in the background
• Processor proceeds without waiting
(Figure: processor P writes A = 1 into its cache C and proceeds; memory M still holds A = 0 while the invalidation is in flight.)

5. Does caching violate this model?
(Figure: two processors P1 and P2, each with a cache C, attached to memories M over a network; initially A = 0 and x = 0 everywhere.)

6. Does caching violate this model?
• P1 executes: A = 1; x = 1;
• P2 executes: LOOP: if (x == 0) goto LOOP; b = A;
• If b == 0 at the end, sequential consistency is violated
(Figure: same two-processor setup; both caches still hold A = 0, x = 0.)

7. Does caching violate this model?
• P1 executes: A = 1; x = 1; the new values land in P1's own cache first
• P2 executes: LOOP: if (x == 0) goto LOOP; b = A;
• If b == 0 at the end, sequential consistency is violated

8. Does caching violate this model?
• P1 executes: A = 1; x = 1;
• P2 executes: LOOP: if (x == 0) goto LOOP; b = A;
• An invalidation is delayed in the network, so P2 observes x = 1 while still holding a stale cached A = 0
• P2 reads b = 0: VIOLATION! Sequential consistency is violated (a runnable version of this test follows below)
(Figure: numbered steps 1-5 trace the writes and the delayed invalidation through the caches, memories, and network.)
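Aside: the litmus test on slides 5-8 can be written down and run. Below is a minimal C11 sketch (assuming a compiler that provides <threads.h> and <stdatomic.h>); with relaxed atomics, the compiler and hardware are free to delay the propagation of A = 1 past x = 1, so the outcome b == 0 is permitted, which is exactly the violation slide 8 shows.

    #include <stdatomic.h>
    #include <stdio.h>
    #include <threads.h>

    atomic_int A = 0, x = 0;

    int p1(void *arg) {                    /* processor P1 */
        (void)arg;
        atomic_store_explicit(&A, 1, memory_order_relaxed);   /* A = 1 */
        atomic_store_explicit(&x, 1, memory_order_relaxed);   /* x = 1 */
        return 0;
    }

    int p2(void *arg) {                    /* processor P2 */
        (void)arg;
        while (atomic_load_explicit(&x, memory_order_relaxed) == 0)
            ;                              /* LOOP: if (x == 0) goto LOOP */
        int b = atomic_load_explicit(&A, memory_order_relaxed);  /* b = A */
        printf("b = %d\n", b);             /* b == 0 would violate SC */
        return 0;
    }

    int main(void) {
        thrd_t t1, t2;
        thrd_create(&t1, p1, NULL);
        thrd_create(&t2, p2, NULL);
        thrd_join(t1, NULL);
        thrd_join(t2, NULL);
        return 0;
    }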

9. Does caching violate this model?
• Same program, but P1 now waits at a fence: the write x = 1 is not issued until the invalidation of A is acknowledged (ACK)
• By the time P2 sees x = 1, its stale copy of A is gone, so it reads b = 1: o.k. (a C11 rendering follows below)
(Figure: numbered steps 1-7 show the invalidation, the ACK, the fence at P1, and only then the write x = 1.)
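In C11 terms, slide 9's fence can be expressed with release/acquire ordering; the functions below are drop-in replacements for p1/p2 in the sketch above. Mapping the lecture's fence-and-ACK mechanism onto release/acquire is my assumption about the closest modern equivalent, not something the slides state.

    int p1_fenced(void *arg) {
        (void)arg;
        atomic_store_explicit(&A, 1, memory_order_relaxed);
        atomic_store_explicit(&x, 1, memory_order_release);  /* "fence": A = 1 completes first */
        return 0;
    }

    int p2_fenced(void *arg) {
        (void)arg;
        while (atomic_load_explicit(&x, memory_order_acquire) == 0)
            ;                               /* spin until x = 1 is visible */
        int b = atomic_load_explicit(&A, memory_order_relaxed);
        printf("b = %d\n", b);              /* now guaranteed: b = 1, o.k. */
        return 0;
    }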

10. Does caching violate this model?
• Not if we are careful: ensure that at any instant t, no two processors see different values of a given variable.
On a write (see the sketch below):
• Lock the datum
• Invalidate all copies of the datum
• Update the central copy of the datum
• Release the lock on the datum
• Do not proceed until the write completes (all acks received)
How do we implement an update protocol? Hard!
• Lock the central copy of the datum
• Mark all copies as unreadable
• Update all copies, releasing the read lock on each copy after its update
• Unlock the central copy
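A sketch of the four-step invalidation write above, for one directory entry at the home node. This is illustrative, not the lecture's protocol: dir_entry, send_invalidate, and wait_for_acks are made-up names, and GCC's __sync builtins stand in for a hardware lock.

    #include <stdio.h>

    typedef struct {
        int      lock;      /* held while a write is in progress */
        unsigned sharers;   /* bit i set => node i caches a copy */
        int      value;     /* central (home) copy of the datum  */
    } dir_entry;

    /* Hypothetical network operations, stubbed out here. */
    static void send_invalidate(int node)     { printf("inv -> node %d\n", node); }
    static void wait_for_acks(unsigned nodes) { (void)nodes; /* block until all ack */ }

    void coherent_write(dir_entry *e, int new_value) {
        while (__sync_lock_test_and_set(&e->lock, 1))
            ;                               /* 1. lock the datum               */
        for (int i = 0; i < 32; i++)        /* 2. invalidate all cached copies */
            if (e->sharers & (1u << i))
                send_invalidate(i);
        wait_for_acks(e->sharers);          /*    do not proceed till acks got */
        e->sharers = 0;
        e->value = new_value;               /* 3. update the central copy      */
        __sync_lock_release(&e->lock);      /* 4. release the lock             */
    }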

11. Writes are looooong (high-latency) operations
• Solutions:
1. Build latency-tolerant processors (Alewife)
2. Change shared-memory semantics [solve a different problem!]
3. Weaker memory semantics
Basic idea: guarantee completion of writes only at "fence" operations. A typical fence is a synchronization point (or the programmer inserts fences explicitly).
Use (see the sketch below):
• Modify shared data only within critical sections
• Propagate changes at the end of the critical section, before releasing the lock
Higher-level locking protocols must guarantee that others do not try to read or write an object that has been modified and read by someone else. For most parallel programs this is no problem (see later).
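One way to express "propagate changes at the end of the critical section, before releasing the lock" is with C11 release/acquire atomics, where the release store plays the role of the fence. A minimal sketch; the spinlock implementation is mine, not from the lecture.

    #include <stdatomic.h>

    static atomic_int the_lock = 0;
    static int shared_data;     /* modified only inside the critical section */

    void acquire(void) {
        int expected = 0;
        while (!atomic_compare_exchange_weak_explicit(
                   &the_lock, &expected, 1,
                   memory_order_acquire, memory_order_relaxed))
            expected = 0;       /* lost the race: reset and retry */
    }

    void release(void) {
        /* Release ordering = the fence: every write to shared_data in the
           critical section becomes visible before the lock reads as free. */
        atomic_store_explicit(&the_lock, 0, memory_order_release);
    }

    void update(int v) {
        acquire();
        shared_data = v;        /* safe: only lock holders touch it      */
        release();              /* changes propagate before lock release */
    }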

12. Memory Systems
• Memory storage
• Communication
• Processing
• Programmer's view
• Physically: where memory sits with respect to reads
(Figure: three organizations: monolithic memory shared through a network; distributed memory modules across the network; and distributed memory with local memory attached to each processor.)

13. Addressing
• I. Like uniprocessors: a flat address, possibly with a translation phase for virtual memory systems
• II. Object-oriented models: an address is an (object ID, offset) pair, resolved through a table mapping each ID to a location (a decode sketch follows below)
(Figure: Model I splits an address into node ID and offset fields; Model II looks the object ID up in a table to find its node and location.)
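A sketch of the two addressing models; the field widths and table layout are illustrative assumptions, not from the lecture.

    #include <stdint.h>

    /* Model I: flat address, high bits name the home node. */
    #define OFFSET_BITS 24
    static inline int      node_of(uint32_t addr)   { return (int)(addr >> OFFSET_BITS); }
    static inline uint32_t offset_of(uint32_t addr) { return addr & ((1u << OFFSET_BITS) - 1); }

    /* Model II: (object ID, offset); a table maps each ID to its current
       location, so an object migrates by updating one table entry. */
    typedef struct { int node; uint32_t base; } obj_loc;
    static obj_loc obj_table[1024];            /* hypothetical mapping table */

    static inline obj_loc locate(int obj_id) { return obj_table[obj_id]; }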

14. Issues in virtual memory (also naming)
• Goals:
• Illusion of a lot more memory than physically exists
• Protection: allows multiprogramming
• Mobility of data: indirection allows ease of migration
• Premise:
• Want a large, virtualized, single address space
• But physically distributed, local memory

15. Memory Performance Parameters
• Size (per node)
• Bandwidth (accesses per second)
• Latency (access time)
Size:
• An issue of cost
• Uniprocessors: 1 MByte per MIPS
• Multiprocessors? A raging debate. E.g., Alewife: 1/8 MByte of memory per MIPS; Firefly: 2 MBytes per MIPS
What affects the memory size decision? Key issues:
• Communication bandwidth vs. memory size tradeoffs
• Balanced design: all components roughly equally utilized

16. No VM: VA = PA
• Address = (processor #, offset)
• Relatively small address space
(Figure: processors P addressing processor-memory modules PM directly by node number and offset.)

17. Virtual Memory: At-Source Translation
• Large address space
• Straightforward extension from uniprocessors
• Translate (xlate) in software, in the cache, or in TLBs (a software sketch follows below)
(Figure: each processor translates its VA to a PA before the reference leaves for the processor-memory modules PM.)
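A minimal sketch of at-source translation done in software; the page size, table shape, and names are assumptions for illustration, not the lecture's design.

    #include <stdint.h>

    #define PAGE_BITS 12                     /* assume 4 KB pages */
    typedef struct { int node; uint32_t pframe; } pte;
    static pte page_table[1 << 8];           /* hypothetical (small) per-node table */

    typedef struct { int node; uint32_t pa; } remote_ref;

    /* Translate a VA into (home node, PA) before the request leaves. */
    static remote_ref xlate(uint32_t va) {
        pte e = page_table[(va >> PAGE_BITS) & 0xff];    /* index by virtual page # */
        remote_ref r = { e.node,
                         (e.pframe << PAGE_BITS) | (va & ((1u << PAGE_BITS) - 1)) };
        return r;                                        /* send request to r.node  */
    }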

18. VM: At-Destination Translation
• On a page fault at the destination:
• Fetch the page/object from a local disk, or
• Send a message to the appropriate disk node
(Figure: the reference travels as (node #, memory address) to the home node, which translates it to a PA, or takes a miss, on arrival.)

19. Next, bandwidth and latency
• In the interests of keeping the memory system as simple as possible, and because distributed memory provides high peak bandwidth, we will not consider interleaved memories as in vector processors
• Instead, look at:
• Reducing the bandwidth demand of processors
• Reducing the latency of memory
• Exploit locality (the property of reuse): caches

20. Caching Techniques for Multiprocessors
• How are caches different from local memory?
• Fine-grain relocation of blocks
• HW support for management, especially for coherence
• Smaller, faster, integrable
• Otherwise similar properties to local memory
(Figure: processors P, each with a cache C, connected through a network to memories M.)

21. (Same bullets as slide 20.) Say, no caching: every read (rd) must cross the network to memory.

22. (Same bullets as slide 20.) With caches: repeated reads hit locally in the cache instead of crossing the network.

23. (Same bullets as slide 20.) With caches, a network request is needed only on: a write (wrt) to a clean block, or a read of a block that is dirty in a remote cache. This is the coherence problem (see the sketch below).
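A sketch of when the network gets involved, using a simple MSI-style state machine per cache block. This is my illustration of the rule on slide 23, not the lecture's actual controller; the net_* helpers are hypothetical stubs.

    #include <stdint.h>
    #include <stdio.h>

    typedef enum { INVALID, SHARED, DIRTY } cstate;

    /* Hypothetical network operations, stubbed out here. */
    static void net_read_request(uint32_t a)  { printf("rd req %x\n", (unsigned)a); }  /* may fetch from a remote dirty cache */
    static void net_write_request(uint32_t a) { printf("wr req %x\n", (unsigned)a); }  /* invalidates the other copies        */

    cstate on_read(cstate s, uint32_t addr) {
        if (s == INVALID)
            net_read_request(addr);             /* miss: the remote copy may be dirty */
        return (s == DIRTY) ? DIRTY : SHARED;   /* hits are satisfied locally         */
    }

    cstate on_write(cstate s, uint32_t addr) {
        if (s != DIRTY)
            net_write_request(addr);            /* wrt to a clean block: network req  */
        return DIRTY;                           /* dirty write hits stay local        */
    }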

24. Summary
• Understand how delay affects cache performance
• Maintain sequential consistency
• Physical properties: consider topology and distribution of memory
• Develop an effective coherency strategy
• Simplicity and software maintenance are key
