Memory Subsystem Performance of Programs Using Copying Garbage Collection. Authors: Amer Diwan, David Tarditi, Eliot Moss. Presented by: Ronen Shabo. Introduction. Heap allocation with copying garbage collection is believed to have poor memory subsystem performance.
Memory Subsystem Performance of Programs Using Copying Garbage Collection. Authors: Amer Diwan, David Tarditi, Eliot Moss. Presented by: Ronen Shabo.
Introduction • Heap allocation with copying garbage collection is believed to have poor memory subsystem performance. • However, with the appropriate memory subsystem organization, heap allocation can have good memory subsystem performance.
Agenda • Background: memory subsystem (cache, write buffer, page mode, CPI), copying garbage collection, SML • Related work • Methodology • Results and Analysis • Conclusions
Cache • CPUs keep getting faster relative to DRAM memory chips. • A solution to this problem is to add a small, fast memory called a cache. A cache works by reducing the average memory access time. This is possible because memory accesses exhibit temporal and spatial locality.
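The claim that a cache reduces the average memory access time can be illustrated with the standard textbook relation (hit time plus miss rate times miss penalty). This is a minimal sketch; the function name and the cycle counts are illustrative assumptions, not numbers from the study.

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles.

    Every access pays the hit time; the fraction of accesses that
    miss additionally pays the miss penalty.
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_time + miss_rate * miss_penalty


# Illustrative numbers only (assumptions, not measurements from the study).
# Without a cache, every access goes to DRAM and costs 15 cycles.
no_cache = average_access_time(hit_time=15, miss_rate=0.0, miss_penalty=0)
# With a cache: 1-cycle hits, 5% of accesses miss and pay 15 more cycles.
with_cache = average_access_time(hit_time=1, miss_rate=0.05, miss_penalty=15)
```

Locality is what keeps the miss rate low: with a high miss rate the second term dominates and the cache stops helping.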
[Figure: cache organization. Labels: cache, set, associativity, valid bit, tag, block, subblocks with per-subblock valid bits (V).]
Cache hit policies • On a read hit: read the word from the cache. • Write through: write the word to both the cache and memory. • Write back: write the word to the cache only and mark the block as dirty. When the block is evicted from the cache, write it to memory if it is dirty.
Cache miss policies • On a read miss, the block is copied from main memory. • Write no allocate: do not allocate a block in the cache. Send the write to main memory without putting it in the cache. • Write allocate, no subblock placement: allocate a block in the cache. Fetch the corresponding block from main memory. Write the word to the cache and to memory. • Write allocate, subblock placement: allocate a block in the cache. Write the word to the cache and to memory. Invalidate the remaining words in the block.
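The three write-miss policies above can be compared with a toy simulator. This is a simplified sketch under stated assumptions: a direct-mapped, write-through cache with 4-byte words and one valid bit per word, counting only block fetches from main memory. The class and function names are hypothetical and this is not the Tycho simulator used in the study.

```python
WORD = 4  # bytes per word (assumption)


class DirectMappedCache:
    """Toy direct-mapped, write-through cache that counts block fetches."""

    def __init__(self, size_bytes, block_bytes, write_allocate, subblock):
        self.block_bytes = block_bytes
        self.num_blocks = size_bytes // block_bytes
        self.words_per_block = block_bytes // WORD
        self.write_allocate = write_allocate
        self.subblock = subblock
        self.tags = [None] * self.num_blocks
        self.valid = [[False] * self.words_per_block
                      for _ in range(self.num_blocks)]
        self.fetches = 0  # blocks read from main memory

    def _locate(self, addr):
        block_no = addr // self.block_bytes
        index = block_no % self.num_blocks
        word = (addr % self.block_bytes) // WORD
        return block_no, index, word

    def _fetch(self, block_no, index):
        # Copy the whole block from main memory; all words become valid.
        self.tags[index] = block_no
        self.valid[index] = [True] * self.words_per_block
        self.fetches += 1

    def read(self, addr):
        block_no, index, word = self._locate(addr)
        if self.tags[index] == block_no and self.valid[index][word]:
            return  # read hit
        self._fetch(block_no, index)  # read miss

    def write(self, addr):
        block_no, index, word = self._locate(addr)
        if self.tags[index] == block_no:
            self.valid[index][word] = True  # write hit (memory updated too)
            return
        if not self.write_allocate:
            return  # write no allocate: the write goes to memory only
        if self.subblock:
            # Allocate without fetching; only the written word is valid.
            self.tags[index] = block_no
            self.valid[index] = [False] * self.words_per_block
            self.valid[index][word] = True
        else:
            self._fetch(block_no, index)  # fetch the block, then write


def run(write_allocate, subblock, n_words=256):
    """Initialize n_words sequentially, then read them all back."""
    cache = DirectMappedCache(8 * 1024, 16, write_allocate, subblock)
    addrs = [i * WORD for i in range(n_words)]
    for a in addrs:
        cache.write(a)
    for a in addrs:
        cache.read(a)
    return cache.fetches
```

On this initialize-then-read trace, `run(False, False)` and `run(True, False)` both fetch one block per 16-byte block touched (64 fetches), while `run(True, True)` fetches none, because every word is written before it is read.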
Memory Subsystem • Write buffer: a queue containing writes that are to be sent to main memory. • Page mode: main memory is divided into DRAM pages. Page-mode writes reduce the latency of writes to the same DRAM page. • CPI (cycles per useful instruction): the number of CPU cycles needed to complete a program divided by the total number of useful instructions executed.
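The CPI definition can be written as a one-line computation. This is a minimal sketch: the function name and the counts are made up for illustration, and it assumes, as the methodology of this study does, that every instruction except a nop counts as useful.

```python
def cycles_per_useful_instruction(total_cycles, total_instructions, nops):
    """CPI = CPU cycles to complete the program / useful instructions.

    Assumption: all instructions besides nops are considered useful.
    """
    useful = total_instructions - nops
    if useful <= 0:
        raise ValueError("program must execute at least one useful instruction")
    return total_cycles / useful


# Made-up counts: 1.5M cycles, 1.1M instructions, of which 100K are nops.
example_cpi = cycles_per_useful_instruction(1_500_000, 1_100_000, 100_000)
```

Cycles spent waiting on the memory subsystem add to the numerator without adding useful instructions, so they show up directly as a higher CPI.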
Copying Garbage Collection • Memory is divided into two areas, FROMSPACE and TOSPACE. • Memory allocation is done from FROMSPACE. • When FROMSPACE is full, the collector moves all live objects from FROMSPACE to TOSPACE. • The two areas then exchange names.
Generational Copying GC • Split objects into multiple areas by age. • Scan the areas holding older objects less frequently. • Copy long-surviving objects to the older generations' areas.
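The copy step above can be sketched as a Cheney-style scan. This is a simplified model, not the SML/NJ collector: each space is a Python list, an object is a list of fields, and the `Ref` class and `collect` function are hypothetical names introduced for illustration.

```python
class Ref:
    """A pointer to an object, stored as its index in the current space."""

    def __init__(self, index):
        self.index = index


def collect(fromspace, roots):
    """Copy every object reachable from roots into a fresh tospace.

    Returns the new space and the roots rewritten to point into it.
    """
    tospace = []
    forward = {}  # fromspace index -> tospace index (forwarding table)

    def copy(ref):
        # Copy the object on first visit; afterwards reuse its new address.
        if ref.index not in forward:
            forward[ref.index] = len(tospace)
            tospace.append(list(fromspace[ref.index]))
        return Ref(forward[ref.index])

    new_roots = [copy(r) for r in roots]

    # Scan copied objects in order, copying whatever they point to.
    scan = 0
    while scan < len(tospace):
        obj = tospace[scan]
        for i, field in enumerate(obj):
            if isinstance(field, Ref):
                obj[i] = copy(field)
        scan += 1
    return tospace, new_roots


# Objects 0 and 2 point at each other and are reachable from the root;
# objects 1 and 3 are garbage.
heap = [[1, Ref(2)], [99], [2, Ref(0)], [Ref(1)]]
live, new_roots = collect(heap, [Ref(0)])
# "Exchange names": the survivors become the space to allocate from.
heap = live
```

Only live objects are touched, so the cost of a collection is proportional to the amount of surviving data, not to the size of the space.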
SML Standard ML: • Call by value • Safe • Polymorphic • Functional • Garbage collected. SML/NJ compiler: • Makes allocation cheap and function calls fast: allocation is done in-line, aggressive β-reduction (inlining) of function calls is used, and registers are used extensively. • Allocates procedure activation records on the heap instead of the stack.
Related work: advantages of this work • This work distinguishes between read misses and write misses and their penalties. Previous work used overall miss ratios. • This work models the entire memory subsystem, including the write buffer and DRAM page mode. Previous work did not model the entire memory subsystem. • A study of the effect of cache write policies on the performance of C and Fortran programs reaches conclusions that support ours: write allocate with subblock placement is the preferred architecture.
Methodology • Tools: QPT, used to produce memory traces for SML/NJ programs; Tycho, used for the memory subsystem simulation. • Performance: performance numbers are given in CPI. All instructions besides nops are considered useful. • Benchmarks: the benchmarks are eight programs, listed in the next table:
Memory Subsystem Simulation The memory features and penalties used in this study are restricted to those of currently popular RISC workstations. All simulations use: • Write buffer (depth 6) • Page mode • Separate data and instruction caches • Write-through policy The simulations vary: • Cache size, from 8K to 128K • Direct-mapped and two-way set-associative caches (with LRU replacement) • Block size of 16 or 32 bytes • Write allocate versus write no allocate • Subblock placement versus no subblock placement
Results and Analysis Analysis of SML/NJ programs: • Programs do heap allocation at a rate of 0.2-0.4 words per instruction. • The majority of writes are initialization writes. • Writes come in bunches, since they initialize newly allocated areas. An aggressive write policy is therefore necessary. • To avoid waiting for writes to memory: use a write buffer and fast page mode. • To avoid reading the cache block on a write miss: use a write-allocate cache policy with subblock placement.
Results: CW, summary graphs
Results: CW, write allocate with subblock placement, block size 16 bytes
Conclusions Write miss policy and subblock placement: • It is clear from this study that the best cache organization is write allocate with subblock placement. (Surprisingly, for direct-mapped caches larger than 64K, the memory subsystem overhead of SML/NJ programs is acceptable: less than 16%.) • The performance of write allocate / no subblock placement is almost identical to that of write no allocate / no subblock placement. (An address is read soon after it is written, even with an 8K cache. Since our programs allocate 0.4-0.9 bytes per instruction, the read of a block occurs within 9K-20K instructions.) Associativity: • Increasing the associativity improves the CPI. (This improvement is smaller than the one obtained from subblock placement.) • Higher associativity improves instruction cache performance but has little impact on the data cache. (Much of the instruction cache penalty is due to conflict misses, while the data cache penalty is due to capacity misses.)
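One way to read the 9K-20K figure above is as the number of instructions a program needs to allocate one cache's worth of new data. This is an interpretation offered as a sketch, not something the slide states explicitly, and the function name is hypothetical.

```python
def instructions_to_fill(cache_bytes, bytes_per_instruction):
    """Instructions executed while allocating cache_bytes of new data."""
    return cache_bytes / bytes_per_instruction


CACHE_BYTES = 8 * 1024  # the 8K cache mentioned above

# Fastest allocation rate (0.9 bytes/instruction): about 9.1K instructions.
fast = instructions_to_fill(CACHE_BYTES, 0.9)
# Slowest allocation rate (0.4 bytes/instruction): about 20.5K instructions.
slow = instructions_to_fill(CACHE_BYTES, 0.4)
```

Under this reading, a newly written block is read back before sequential allocation has swept through the whole 8K cache, so it makes little difference whether the block was fetched at write time (write allocate) or at the first read (write no allocate).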
Conclusions Block size: • Increasing the block size from 16 to 32 bytes improves performance. Cache size: • Increasing the cache size improves performance. • Most of the CPI improvement comes from the instruction cache. (From related work, we expect a sharp improvement once the cache can fit the allocation area; 512K is large enough for most benchmarks.) Write buffer: • A six-deep write buffer with page mode is sufficient to absorb the bursty writes. (Their contribution to the CPI is negligible.)
Summary An in-depth study of the memory subsystem was made, and the results show that: • Programs with intensive heap allocation performed poorly on most memory subsystems. However, on some machines (DECstation 5000/200) the performance was good. • The most crucial parameter for good performance was subblock placement; with it, the overhead was under 16% for caches larger than 64K. • Associativity and cache size (up to 128K) mattered more for the instruction cache. Higher associativity and larger block size made only a small contribution.