Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System. Ben Gamsa, Orran Krieger, Jonathan Appavoo, Michael Stumm (Department of Electrical and Computer Engineering, University of Toronto, 1999). Presented by: Anusha Muthiah, 4th Dec 2013
Why Tornado? Previous shared-memory multiprocessor operating systems evolved from designs for the architectures that existed at the time.
Locality – Traditional System
• Memory latency was low: fetch – decode – execute without stalling
• Memory faster than CPU = okay to leave data in memory
[Diagram: CPUs issuing loads & stores over a shared bus to shared memory]
Locality – Traditional System
• Memory got faster, but the CPU got faster than memory
• Relative memory latency was high: fast CPU = useless!
[Diagram: CPUs on a shared bus to shared memory]
Locality – Traditional System
• Reading memory was very important
• Fetch – decode – execute cycle
[Diagram: CPUs on a shared bus to shared memory]
Cache
• Extremely close, fast memory
• Programs hit a lot of addresses within a small region
• E.g. a program is a sequence of instructions: adjacent words on the same page of memory, and more so inside a loop
[Diagram: each CPU with its own cache, connected by a shared bus to shared memory]
Cache
• Repeated access to the same page of memory was good!
[Diagram: each CPU with its own cache, connected by a shared bus to shared memory]
Memory latency
• CPU executes at the speed of the cache, making up for memory latency
• But the cache is really small: how can it help us more than 2% of the time?
[Diagram: CPU with a small cache, connected by a bus to memory]
Locality
• Locality: the same value or storage location being frequently accessed
• Temporal locality: reusing some data/resource within a small duration
• Spatial locality: use of data elements within relatively close storage locations
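To make the two kinds of locality concrete, here is a tiny C sketch (my own illustration, not from the slides): the loops walk consecutive array elements (spatial locality) and revisit the same accumulator and data on every pass (temporal locality).

```c
/* Illustrative sketch only (not from the paper): spatial and temporal locality. */
#include <stdio.h>

#define N 1024

int main(void) {
    static int data[N];
    long sum = 0;

    /* Spatial locality: consecutive elements share cache lines, so one
     * miss brings in several useful neighbours. */
    for (int i = 0; i < N; i++)
        data[i] = i;

    /* Temporal locality: 'sum', the loop counters, and the array are
     * reused again and again within a short time. */
    for (int pass = 0; pass < 4; pass++)
        for (int i = 0; i < N; i++)
            sum += data[i];

    printf("sum = %ld\n", sum);
    return 0;
}
```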
Cache
• All CPUs accessing the same page in memory
[Diagram: each CPU with its own cache, connected by a shared bus to shared memory]
Example: Shared Counter
A copy of the counter exists in both processors' caches; the counter in memory is shared by CPU-1 and CPU-2.
[Diagram sequence: (1) the counter starts at 0 in memory, with no cached copies; (2) CPU-1 reads it, caching 0; (3) CPU-1 increments it to 1, taking the cache line in exclusive mode; (4) CPU-2 reads it, so the line is held in shared mode by both caches (reads are OK); (5) CPU-2 increments it to 2, moving the line from shared to exclusive and invalidating CPU-1's copy]
Example: Shared Counter Terrible Performance!
Problem: Shared Counter
• The counter bounces between CPU caches, leading to a high cache miss rate
• Try an alternate approach: convert the counter to an array
• Each CPU has its own counter, so updates can be local
• Because addition is commutative, the total number of increments is just the sum of the per-CPU counters
• To read the counter, you add up all the per-CPU counters (see the sketch after this list)
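A minimal C sketch of the array-based counter, assuming one slot per CPU and one thread per slot (the slot count, names, and pthread setup are my illustration, not the paper's code):

```c
/* Array-based counter sketch: each CPU/thread increments only its own slot;
 * reading the counter sums all slots. Illustrative names and sizes. */
#include <pthread.h>
#include <stdio.h>

#define NCPU 2
#define INCREMENTS 1000000

static long counter[NCPU];             /* one counter per CPU, contiguous */

static void *worker(void *arg) {
    int cpu = (int)(long)arg;
    for (int i = 0; i < INCREMENTS; i++)
        counter[cpu]++;                /* local update, no lock needed */
    return NULL;
}

int main(void) {
    pthread_t t[NCPU];
    for (long c = 0; c < NCPU; c++)
        pthread_create(&t[c], NULL, worker, (void *)c);
    for (int c = 0; c < NCPU; c++)
        pthread_join(t[c], NULL);

    long total = 0;                    /* read = add up all per-CPU counters */
    for (int c = 0; c < NCPU; c++)
        total += counter[c];
    printf("total = %ld\n", total);
    return 0;
}
```

As the next slides show, this version still disappoints: the two slots are adjacent in memory and land on the same cache line.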
Example: Array-based Counter
An array of counters, one for each CPU; each CPU updates only its own entry.
[Diagram sequence: (1) both entries start at 0 in memory; (2) CPU-1 increments its entry to 1; (3) CPU-2 increments its entry to 1; (4) to read the counter, CPU-2 adds all the entries: 1 + 1 = 2]
Performance: Array-based Counter Performs no better than ‘shared counter’!
Why doesn't the array-based counter work?
• Data is transferred between main memory and the caches in fixed-size blocks called cache lines
• If two counters sit in the same cache line, only one processor at a time can hold that line for writing
• Ultimately, the line still has to bounce between CPU caches
[Diagram: the two counters (1, 1) sharing a single cache line in memory]
False sharing
[Diagram sequence: the two counters (0,0) share one cache line. (1) CPU-1 caches the line; (2) CPU-2 caches it too (shared mode); (3) CPU-1 increments its counter to (1,0), invalidating CPU-2's copy; (4) CPU-2 re-caches the line (shared again); (5) CPU-2 increments its counter to (1,1), invalidating CPU-1's copy]
Ultimate Solution
• Pad the array so each counter sits on an independent cache line
• Spread the counter components out in memory
Example: Padded Array
Individual cache lines for each counter, so updates are independent of each other.
[Diagram sequence: (1) both counters start at 0, each on its own cache line; (2) CPU-1 and CPU-2 increment their own counters to 1 with no invalidations]
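A C sketch of the padded version, assuming a 64-byte cache line (the line size and the pthread setup are illustrative assumptions; real code would query the machine's actual line size):

```c
/* Padded per-CPU counter sketch: each slot is padded to its own cache line,
 * so one CPU's update never invalidates another CPU's line. */
#include <pthread.h>
#include <stdio.h>

#define NCPU 2
#define CACHE_LINE 64                          /* assumed line size */
#define INCREMENTS 1000000

struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];       /* keep neighbours off this line */
};

static struct padded_counter counter[NCPU];

static void *worker(void *arg) {
    int cpu = (int)(long)arg;
    for (int i = 0; i < INCREMENTS; i++)
        counter[cpu].value++;                  /* touches only this CPU's line */
    return NULL;
}

int main(void) {
    pthread_t t[NCPU];
    for (long c = 0; c < NCPU; c++)
        pthread_create(&t[c], NULL, worker, (void *)c);
    for (int c = 0; c < NCPU; c++)
        pthread_join(t[c], NULL);

    long total = 0;
    for (int c = 0; c < NCPU; c++)
        total += counter[c].value;
    printf("total = %ld\n", total);
    return 0;
}
```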
Performance: Padded Array Works better
Cache
• All CPUs accessing the same page in memory
• Write sharing is very destructive
[Diagram: each CPU with its own cache, connected by a shared bus to shared memory]
Shared Counter Example
• Minimize read/write sharing and write sharing = minimize cache-coherence overhead
• Minimize false sharing
Now and then
• Then (uniprocessor): locality is good
• Now (multiprocessor): sharing is bad. Don't share!
• Traditional OSes were implemented to have good locality
• Running the same code on the new architecture = cache interference
• Adding more CPUs can even be detrimental
• Each CPU wants to run out of its own cache without interference from others
Modularity
• Minimize write sharing and false sharing
• How do you structure a system so that CPUs don't share the same memory?
• Then: no objects
• Now: split everything and keep it modular
The paper's approach:
• Object Oriented Approach
• Clustered Objects
• Existence Guarantees
Goal
Ideally:
• The CPU uses data from its own cache
• The cache hit rate is good
• Cache contents stay in the cache and are not invalidated by other CPUs
• Each CPU has good locality of reference (it frequently accesses its own cached data)
• Sharing is minimized
Object Oriented Approach
• Goal: minimize access to shared data structures, and minimize the use of shared locks
• Operating systems are driven by the requests of applications on virtual resources
• For good performance, requests to different virtual resources must be handled independently
How do we handle requests independently?
• Represent each resource by a different object
• Try to reduce sharing
• If an entire resource is locked, requests get queued up; avoid making the resource a source of contention
• Use fine-grained locking instead
Coarse/Fine-grained locking example
• The process object maintains the list of mapped memory regions in the process's address space
• On a page fault, it searches this list to find the responsible region and forwards the page fault to it
• With a coarse-grained lock, the whole table gets locked even though only one region is needed; other threads get queued up
[Diagram: Thread 1 holding a lock over the Process Object's entire region list (Region 1 … Region n)]
Coarse/Fine-grained locking example
• Use individual locks instead, one per region, so Thread 1 and Thread 2 can work on different regions concurrently
[Diagram: Thread 1 and Thread 2 each locking a different region in the list (Region 1 … Region n)]
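A rough C sketch of the fine-grained alternative, using simplified stand-ins for the process object and its region list (not Tornado's actual data structures): the list lock is held only long enough to find the region, and only that region's lock is held while the fault is handled.

```c
/* Fine-grained locking sketch: one lock per region instead of one lock
 * over the whole region list. Types and fields are illustrative. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct region {
    unsigned long start, end;          /* address range this region maps */
    pthread_mutex_t lock;              /* fine-grained: one lock per region */
    struct region *next;
};

struct process {
    pthread_mutex_t list_lock;         /* protects only the list structure */
    struct region *regions;
};

/* Handle a page fault: find the responsible region and lock just that
 * region, instead of locking the whole table. */
static bool handle_fault(struct process *p, unsigned long addr) {
    pthread_mutex_lock(&p->list_lock);         /* short: just to walk the list */
    struct region *r = p->regions;
    while (r && !(addr >= r->start && addr < r->end))
        r = r->next;
    if (r)
        pthread_mutex_lock(&r->lock);          /* lock only the region we need */
    pthread_mutex_unlock(&p->list_lock);

    if (!r)
        return false;                          /* no region maps this address */

    /* ... translate the fault and forward it to the region's FCM here ... */

    pthread_mutex_unlock(&r->lock);
    return true;
}

int main(void) {
    static struct region r2 = { 0x2000, 0x3000, PTHREAD_MUTEX_INITIALIZER, NULL };
    static struct region r1 = { 0x1000, 0x2000, PTHREAD_MUTEX_INITIALIZER, &r2 };
    static struct process p = { PTHREAD_MUTEX_INITIALIZER, &r1 };

    printf("fault at 0x2800 handled: %d\n", handle_fault(&p, 0x2800));
    printf("fault at 0x9000 handled: %d\n", handle_fault(&p, 0x9000));
    return 0;
}
```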
Key memory management object relationships in Tornado
• A page fault is delivered to the Process object, which forwards the request to the responsible Region
• The Region translates the fault address into a file offset and forwards the request to the FCM
• If the file data is cached in memory, the address of the corresponding physical page frame is returned to the Region, which makes a call to the Hardware Address Translation (HAT) object
• If the file data is not cached in memory, the FCM requests a new physical page frame and asks the Cached Object Rep to fill the page from the file
[Diagram: Process, Region, FCM, Cached Object Rep and HAT objects connected by the calls described above]
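The same flow as a compact, self-contained C sketch; every type and function name here is a hypothetical stand-in for the objects in the diagram (Region, FCM, Cached Object Rep, HAT), not Tornado's real interfaces:

```c
/* Hypothetical sketch of the page-fault path above; not Tornado's real API. */
#include <stdio.h>
#include <stdlib.h>

typedef unsigned long addr_t;
typedef unsigned long foff_t;                   /* file offset */
typedef struct { foff_t offset; } page_frame;

/* FCM: return the cached frame for a file offset, or NULL if not cached. */
static page_frame *fcm_lookup(foff_t offset) { (void)offset; return NULL; }

/* FCM: request a new physical page frame (stubbed with malloc here). */
static page_frame *fcm_new_frame(void) { return malloc(sizeof(page_frame)); }

/* Cached Object Rep: fill the frame with the file data at this offset. */
static void cor_fill_from_file(page_frame *f, foff_t o) { f->offset = o; }

/* HAT: install the virtual-to-physical translation. */
static void hat_map(addr_t vaddr, page_frame *f) {
    printf("map vaddr 0x%lx -> frame for file offset %lu\n", vaddr, f->offset);
}

/* Region: translate the fault address into a file offset, get the page from
 * the FCM (filling it from the file if it is not cached), then call the HAT
 * object to install the mapping. */
static void region_handle_fault(addr_t fault_addr, addr_t region_start,
                                foff_t file_base) {
    foff_t offset = file_base + (fault_addr - region_start);

    page_frame *frame = fcm_lookup(offset);
    if (frame == NULL) {                        /* file data not cached */
        frame = fcm_new_frame();                /* new physical page frame */
        cor_fill_from_file(frame, offset);
    }
    hat_map(fault_addr, frame);
}

int main(void) {
    region_handle_fault(0x4000a000UL, 0x40000000UL, 0);
    return 0;
}
```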
Advantage of using the Object Oriented Approach
• Consider the performance-critical case of the in-core page fault (a fault on a page that is already resident in memory)
• Only the objects specific to the faulting resource are invoked
• The locks acquired and the data structures accessed are internal to those objects
• In contrast, many OSes use a global page cache or a single HAT layer, which becomes a source of contention
Clustered Object Approach
• The object-oriented approach is good, but some resources are just too widely shared
• Enter clustering
• E.g. the thread dispatch queue: if a single list is used, there is high contention (a bottleneck plus more cache-coherence traffic)
• Solution: partition the queue and give each processor a private list (no contention!)
Clustered Object Approach
• All clients access a clustered object using a common clustered object reference, so it looks like a single object
• It is actually made up of several component objects; each rep handles calls from a subset of processors
• Each call to a clustered object is automatically directed to the appropriate local rep
[Diagram: one clustered object reference fanning out to per-processor representatives]
Clustered object: Shared Counter Example
• It looks like a single shared counter, but it is actually made up of representative counters that each CPU can access independently
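A minimal C sketch of such a clustered counter, with an explicit CPU index standing in for Tornado's automatic per-processor rep lookup (all names here are illustrative): increments go to the caller's local representative, while a read must combine all representatives.

```c
/* Clustered-object-style counter sketch: one reference, per-CPU reps. */
#include <stdio.h>

#define NCPU 4
#define CACHE_LINE 64

struct counter_rep {                    /* one representative per CPU */
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};

struct clustered_counter {              /* the "single" object clients see */
    struct counter_rep rep[NCPU];
};

/* inc(): directed to the local representative for the calling CPU. */
static void counter_inc(struct clustered_counter *c, int cpu) {
    c->rep[cpu].value++;                /* independent update, no sharing */
}

/* read(): must consult every representative and combine their values. */
static long counter_read(const struct clustered_counter *c) {
    long total = 0;
    for (int i = 0; i < NCPU; i++)
        total += c->rep[i].value;
    return total;
}

int main(void) {
    static struct clustered_counter c;  /* zero-initialized */
    counter_inc(&c, 0);
    counter_inc(&c, 1);
    counter_inc(&c, 1);
    printf("counter = %ld\n", counter_read(&c));   /* prints counter = 3 */
    return 0;
}
```

How many reps to use, and which processors each one serves, is the "degree of clustering" discussed on the next slide.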
Degree of Clustering
• One rep for the entire system: Cached Object Rep – read mostly; all processors share a single rep
• One rep for a cluster of neighbouring processors: Region – read mostly; on the critical path for all page faults
• One rep per processor: FCM – maintains the state of the pages of a file cached in memory; the hash table for the cache is split across the many reps