Operating System Support for Improving Data Locality on CC-NUMA Machines • CSE597A Presentation by V.N.Murali
WHY CC-NUMA? • Scales as the number of nodes increases • Attractive property: transparent access to both local and remote memory, at the cost of higher access latency to remote memory • Two variations: CC-NUMA (Stanford DASH, MIT Alewife, Sequent) and CC-NOW (Sun s3.mp)
OS support • Most important issue: data locality • OS-supported page migration and replication improve performance by as much as 30%
Issues in Migration/Replication • When should pages be migrated? • When should pages be replicated? • Both are needed to boost performance. • When not to migrate/replicate is also important. • Which system parameter can be used to decide? Ideas?
Differences from S/W shared memory • In S/W DSM, migration and replication (M&R) are needed for correctness; on CC-NUMA, M&R is purely an optimization. • In S/W DSM, M&R is triggered by page faults; on CC-NUMA, M&R is triggered by cache misses.
If a workload exhibits good cache locality, it benefits less from M&R; hence the need for selective criteria for moving pages. • Study based on the SimOS environment.
Solution • How do we improve data locality? • Three access patterns: a) primarily accessed by a single process; b) mostly read by many processes; c) both read and written by many processes • Which method should be applied to a), b), and c)?
Costs to be considered • 1) Cost of determining candidate pages for M&R (cost of cache misses/TLB misses) • 2) Overhead of M&R (new mappings, allocating a page, flushing the TLB) • 3) Actual data transfer • 4) Memory pressure!
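One way to read the cost list above is as a break-even test: moving a page pays off only if the remote-miss latency it saves exceeds the cost of finding the page, remapping it, and copying it. The sketch below is only an illustration of that arithmetic. The function name, its parameters, and all numbers are assumptions made for this example and do not come from the study; memory pressure is not modeled.

```python
def net_benefit_cycles(future_misses, remote_latency, local_latency,
                       detect_cost, kernel_overhead, copy_cost):
    """Estimated cycles saved by moving one page (negative means a net loss)."""
    # Each future miss that becomes local saves the latency difference.
    saved = future_misses * (remote_latency - local_latency)
    # Costs 1) to 3) from the slide; cost 4), memory pressure, is not modeled.
    return saved - (detect_cost + kernel_overhead + copy_cost)


# Hypothetical numbers, chosen only to show the arithmetic.
gain = net_benefit_cycles(future_misses=1000, remote_latency=300, local_latency=100,
                          detect_cost=5_000, kernel_overhead=50_000, copy_cost=20_000)
print(gain)  # 125000
```

With few expected future misses the same call goes negative, which is the case where doing nothing is the better choice.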
Decision tree • Miss rate to page: LOW -> do nothing; HIGH -> check sharing • Sharing HIGH -> check write frequency and memory pressure: HIGH -> do nothing; LOW -> replicate • Sharing LOW -> check migration rate: HIGH -> do nothing; LOW -> migrate
Summary of the algorithm • “Hot page”: a page whose miss counter for some processor reaches the trigger threshold • If the miss counter for this page on any other processor reaches the sharing threshold, the page is considered for replication; otherwise it is considered for migration. • The page is replicated only if its write counter has not exceeded the write threshold, and migrated only if its migrate counter has not exceeded the migrate threshold.
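The decision procedure summarized above can be sketched in a few lines. This is a minimal illustration, not the IRIX implementation: the names `Action`, `Thresholds`, and `decide`, the `memory_pressure` flag, and the threshold values are all assumptions made for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NOTHING = "nothing"
    MIGRATE = "migrate"
    REPLICATE = "replicate"


@dataclass
class Thresholds:
    # Illustrative placeholder values, not tuned settings from the study.
    trigger: int = 128
    sharing: int = 32
    write: int = 1
    migrate: int = 1


def decide(miss_counts, requester, write_count, migrate_count, t,
           memory_pressure=False):
    """miss_counts maps processor id -> cache misses to this page."""
    # A page is "hot" only once the requester's counter reaches the trigger.
    if miss_counts.get(requester, 0) < t.trigger:
        return Action.NOTHING
    # Shared if any other processor's counter reaches the sharing threshold.
    shared = any(n >= t.sharing
                 for cpu, n in miss_counts.items() if cpu != requester)
    if shared:
        # Replicate only if writes are rare and memory is not under pressure.
        if write_count > t.write or memory_pressure:
            return Action.NOTHING
        return Action.REPLICATE
    # Unshared page: migrate unless it has already been migrated too often.
    if migrate_count > t.migrate:
        return Action.NOTHING
    return Action.MIGRATE
```

For example, `decide({0: 200}, 0, 0, 0, Thresholds())` selects migration because only processor 0 misses on the page, while adding a second processor with 50 misses selects replication.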
Implementation details • The directory controller maintains the miss counters and generates a low-priority interrupt. • It batches a few hot pages before raising the interrupt. • On a write to a replicated page, the replicas are collapsed to a single page.
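The counting and batching described above happen in the directory controller hardware; the following is only a software model of the idea. The class name `MissCounters`, its fields, and the `batch_size` parameter are assumptions made for this sketch.

```python
from collections import defaultdict


class MissCounters:
    """Software model of per-page, per-processor miss counters with batching."""

    def __init__(self, trigger, batch_size):
        self.trigger = trigger
        self.batch_size = batch_size
        self.counts = defaultdict(int)  # (page, cpu) -> number of misses
        self.pending = []               # hot pages waiting for the next interrupt

    def record_miss(self, page, cpu):
        """Count one miss; return a batch of hot (page, cpu) pairs when an
        interrupt would be raised, otherwise None."""
        key = (page, cpu)
        self.counts[key] += 1
        # The page becomes hot exactly when its counter reaches the trigger.
        if self.counts[key] == self.trigger:
            self.pending.append(key)
        # Raise one interrupt for several hot pages instead of one per page.
        if len(self.pending) >= self.batch_size:
            batch, self.pending = self.pending, []
            return batch
        return None
```

Batching amortizes the interrupt and kernel entry cost over several pages, at the price of acting on each hot page slightly later.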
IRIX changes • Replication support • Finer grain locking • Page table back mappings
Workloads • Engineering workload: large, sequential, memory-intensive applications (Verilog simulator, Flashlite) • Parallel application: Raytrace, a parallel graphics algorithm • Scientific workload: Splash • Decision support database • Multiprogrammed software: Pmake
Performance analysis • Three factors: a) user stall time; b) fraction of misses satisfied in local memory; c) kernel overhead • Engineering: large user stall time, hence the best performance gain; both migration and replication were used successfully • Raytrace: mostly read-only accesses, so it benefits mainly from replication
Splash: three parallel applications (Raytrace, Ocean, Volume rendering). Migration helps Ocean; Raytrace and Volume rendering can benefit from replication. • Database: mostly read accesses, hence replication helps
Alternative policies • Static policies and dynamic policies • Static: round robin, first touch, post facto (similar to the optimal page replacement algorithm) • Dynamic: migration only, replication only, migration plus replication
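Two of the static policies listed above are simple enough to sketch. The function names and the `place(page, touching_node)` interface are assumptions made for this illustration; post facto placement is omitted because it needs knowledge of the whole reference trace in advance.

```python
from itertools import count


def round_robin_placer(num_nodes):
    """Assign each new page to the next node in turn, ignoring who touches it."""
    next_index = count()
    homes = {}

    def place(page, touching_node):
        if page not in homes:
            homes[page] = next(next_index) % num_nodes
        return homes[page]

    return place


def first_touch_placer():
    """Assign each page to the node of the first processor that touches it."""
    homes = {}

    def place(page, touching_node):
        # setdefault keeps the first recorded home and ignores later touches.
        return homes.setdefault(page, touching_node)

    return place


rr = round_robin_placer(num_nodes=2)
ft = first_touch_placer()
print([rr(p, 1) for p in ("a", "b", "c", "a")])  # [0, 1, 0, 0]
print(ft("a", 3), ft("a", 1))                    # 3 3
```

Both policies fix a page's home once and never revisit it, which is what distinguishes them from the dynamic policies.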