A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling. Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang, and Guangming Tan. Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS). ISPASS 2012, April 2, 2012.
Background • Memory behavior is a key factor in program performance. • Understanding memory behavior is important for identifying bottlenecks in both the architecture and the application. • For example: • The TLB is an essential component of the memory system. • Applications’ working sets keep growing larger, leading to frequent TLB misses. • Study 1: TLB misses can degrade system performance by 5~14% [Bhargava’08]. • Study 2: a large number of TLB misses in multi-threaded programs are redundant and predictable, which implies optimization potential [Bhattacharjee’08]. • Both studies were done by memory profiling.
Memory Profiling • Memory profiling collects memory behavior information during program execution. • Profiling can be performed • for different hardware components: TLB, cache, DRAM • at different software levels: function, objects (array, list, etc.), application, whole system
Object Memory Profiling • An object is a group of data stored as a unit [Wu’04] • Object-level profiling distinguishes regular patterns from mixed and irregular traces • Valuable for optimization: • Memory trace compression • Data layout • Object-level prefetching • Cache partitioning [Soft-OLP, PACT 2009] [Figure labels: Object Trace, Application Traces, Whole System Traces; Irregular, Regular]
Current Profiling Approaches • Existing approaches • Compiler-driven: requires source code and re-compiling/re-linking • Instrumentation: heavy overhead • Simulation: slow, with accuracy problems • Performance counters: lack detailed information • None of them can observe the page table walks caused by TLB misses • We propose a hybrid hardware/software approach for object memory profiling • Accurate: real applications on a real system • Lightweight • Tracks page table walks at object level
Outline • Background • Design and Implementation • Experimental Results • Conclusion
An Overview [Figure: Physical Address Trace (0x398f24a, 0x398f24b, 0x398f24c, …, 0x1af4aa, 0x1af4a6, 0x1af4a8, …, 0x38d2cfc, 0x38d2cfd, …); Virtual Address Trace (0x1f05000, 0x1f06000, 0x1f07000, …, 0x1f15000, 0x1f16000, 0x1f17000, …, 0x1f25000, 0x1f26000, …); Object Access Pattern: Matrix (VA: 0x1f05000)]
HMTT • Hybrid Memory Trace Toolkit • A DDR3 SDRAM compatible memory trace monitoring system • Adopts hardware snooping technology • Memory trace record: <time_stamp, r/w, phy_addr> • Advantages: • Platform independent • Negligible overhead • Full-system real memory traces, including OS activity and page table walks [Photo labels: PCIE cable connector; DIMM plugged on the other side]
Challenge (1) • How to translate the physical address trace into the virtual address trace of a specific process? • Modify the OS kernel to obtain the page table • Look up each phy_addr in the dumped page table • Generate the virtual address trace of each process
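The lookup step above can be sketched as a reverse map from physical frame number to (pid, virtual page number). This is only an illustration under stated assumptions: the 4 KiB page size, the dictionary layout of the dumped page tables, the pid 42, and the names `build_reverse_table` and `to_virtual` are all hypothetical and not taken from the toolkit. The sample addresses are borrowed from the overview slide purely as example values.

```python
PAGE_SHIFT = 12                       # assumption: 4 KiB pages
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


def build_reverse_table(dumped_page_tables):
    """Invert {pid: {virtual_page: physical_frame}} into {physical_frame: (pid, virtual_page)}.

    Simplification: a frame shared by several processes keeps only the
    last mapping seen.
    """
    reverse = {}
    for pid, table in dumped_page_tables.items():
        for vpn, pfn in table.items():
            reverse[pfn] = (pid, vpn)
    return reverse


def to_virtual(reverse, phy_addr):
    """Return (pid, virtual_address) for a physical address, or None if unmapped."""
    entry = reverse.get(phy_addr >> PAGE_SHIFT)
    if entry is None:
        return None
    pid, vpn = entry
    # Keep the in-page offset, swap the frame number for the page number.
    return pid, (vpn << PAGE_SHIFT) | (phy_addr & OFFSET_MASK)


# Hypothetical dump: process 42 maps virtual page 0x1f05 to physical frame 0x398f.
dumped = {42: {0x1F05: 0x398F}}
reverse = build_reverse_table(dumped)
hit = to_virtual(reverse, 0x398F24A)    # inside the mapped frame
miss = to_virtual(reverse, 0x0001000)   # frame 0x1 is not in the dump
```

A dictionary keyed by frame number gives constant-time lookups per trace record, which matters when a trace holds millions of entries.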
Challenge (2) • How to synchronize hardware and software when a page table update occurs in the kernel? • Physical pages are allocated and freed in the kernel • Trigger annotations in the OS VM module • Update the dumped page table • Send a sync_tag to the hardware
Challenge (3) • How to translate virtual addresses to objects without modifying source code? • malloc() is the point where a virtual address range becomes associated with an object • Override malloc() through a dynamic library, so that matrix = malloc(0x1000) is served by matrix = mymalloc(0x1000), which records the object in the Object-VA Mapping Table [Figure labels: Virtual Address Space; Object: matrix; Object-VA Mapping Table]
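Once allocations are recorded, resolving a virtual address to its object is an interval lookup. The sketch below shows one possible shape for such an Object-VA mapping table, using a sorted list and binary search. The class `ObjectMap`, its methods, and the `min_size` filter (motivated by the 4KB monitoring threshold mentioned on the Overhead slide) are hypothetical names for illustration. Frees and overlapping ranges are not handled.

```python
import bisect


class ObjectMap:
    """Maps virtual address ranges to object names (illustrative only)."""

    def __init__(self, min_size=0):
        self.min_size = min_size
        self._starts = []    # sorted start addresses
        self._objects = []   # (start, size, name), parallel to _starts

    def record_alloc(self, name, start, size):
        """Record one allocation; objects smaller than min_size are ignored."""
        if size < self.min_size:
            return
        i = bisect.bisect_left(self._starts, start)
        self._starts.insert(i, start)
        self._objects.insert(i, (start, size, name))

    def lookup(self, vaddr):
        """Return the name of the object containing vaddr, or None."""
        i = bisect.bisect_right(self._starts, vaddr) - 1
        if i < 0:
            return None
        start, size, name = self._objects[i]
        return name if vaddr < start + size else None


objects = ObjectMap(min_size=0x1000)
objects.record_alloc("matrix", 0x1F05000, 0x1000)   # matrix = malloc(0x1000)
objects.record_alloc("tiny", 0x2000000, 0x10)       # below threshold, dropped
inside = objects.lookup(0x1F05040)
past_end = objects.lookup(0x1F06000)
dropped = objects.lookup(0x2000000)
```

Binary search keeps each lookup logarithmic in the number of tracked objects, and the size filter keeps that number small.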
Put them all together [Figure: the physical address trace, interleaved with sync_tag and page walk records, is translated through the Dumped Page Table into the virtual address trace, and then through the Object-VA Mapping Table into the object access pattern, e.g. Matrix (VA: 0x1f05000)] • Use the page table to distinguish three types of memory access: • sync_tag: update the dumped page table • Access to the page table itself: a page table walk due to a TLB miss • Any other memory access: translate to a virtual address
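The three-way split above can be sketched as a small classifier over trace records of the form <time_stamp, r/w, phy_addr>. How a sync_tag is actually encoded in the trace is not stated on the slides; this sketch assumes, hypothetically, that it shows up as an access to a reserved physical range. `CONFIG_RANGE`, `PT_FRAMES`, and `classify` are illustrative names, and the 4 KiB page size is an assumption.

```python
from collections import Counter

PAGE_SHIFT = 12   # assumption: 4 KiB pages


def classify(phy_addr, config_range, page_table_frames):
    """Label one physical access as 'sync_tag', 'page_walk', or 'data'."""
    low, high = config_range
    if low <= phy_addr < high:
        return "sync_tag"      # assumed encoding: access to a reserved region
    if (phy_addr >> PAGE_SHIFT) in page_table_frames:
        return "page_walk"     # the frame holds page table entries
    return "data"              # ordinary access, to be translated to a VA


CONFIG_RANGE = (0xF0000000, 0xF0001000)   # hypothetical reserved region
PT_FRAMES = {0x1AF}                        # hypothetical frames holding page tables

trace = [
    (0, "r", 0x398F24A),
    (1, "r", 0x1AF4AA),
    (2, "w", 0xF0000010),
    (3, "r", 0x38D2CFC),
]
kinds = [classify(addr, CONFIG_RANGE, PT_FRAMES) for _, _, addr in trace]
counts = Counter(kinds)
```

Checking the reserved range first means a sync record is never mistaken for data, whatever the page table says about that address.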
Validation • SpMV benchmark (CSR): y = ax * xhost • Micro-benchmark: • The error is less than 2% • Our system is able to distinguish regular access patterns from irregular ones
Overhead • Two main sources of overhead: • Dumping page table traces (+ dump_pt) • Dumping the object-VA mapping (+ dump_obj) • Monitoring only objects >= 4KB still covers most memory references [Chart labels: <2%, <1%]
Case Study 1: BFS (Breadth-First Search) • The column object accounts for about 71% of page walks, making it the key object • Optimization: use huge pages for the column object • Speedup: about 12% for 8-thread, 8% for 128-thread [Chart label: 8.18%]
Case Study 2: Canneal (PARSEC) • Cache-aware simulated annealing (SA) to minimize the routing cost of a chip design • Two objects contribute most of the memory accesses: _elements and _location • The memory accesses barely change as the thread count increases.
Case Study 2: Canneal • The _elements object contributes most of the increased page walks • Put the _elements object into huge pages to reduce TLB misses • Speedup: about 5% for 8-thread
Conclusion • We have designed and implemented a hybrid hardware/software approach for object-relative memory profiling. • Accurate: real applications on a real system • Lightweight • Tracks page table walks at object level • We present two case studies that show how the approach helps users better understand memory behavior and optimize performance. • We intend to use this approach to analyze virtual machines running on real machines.
Thanks! Questions?
Memory Profiling Approaches [Comparison table] Note: √ = Yes, × = No, * = Maybe
Reverse Page Table • Maps a physical address to (pid, virtual address)
Validation • Access objects with different patterns: • a0: all read accesses, forward • a1: 3/4 read and 1/4 write accesses, forward • a2: 2/4 read and 2/4 write accesses, forward • a3: 1/4 read and 3/4 write accesses, backward • a4: all write accesses, backward • Object size 256MB, access step 64B, 4M requests per object [Figure labels: a0, a4]
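The parameters above are self-consistent: a 256MB object walked in 64B steps yields 4M requests. The sketch below generates such a pattern as (r/w, offset) pairs. The slides do not say how reads and writes are interleaved; placing the writes first in every group of four accesses is an assumption made here, and `access_pattern` and `writes_per_4` are hypothetical names.

```python
def access_pattern(size, step, writes_per_4, backward=False):
    """Yield (rw, offset) pairs covering `size` bytes in `step`-byte strides.

    Assumption: in every group of four accesses, the first `writes_per_4`
    are writes and the rest are reads.
    """
    count = size // step
    indices = range(count - 1, -1, -1) if backward else range(count)
    for k, i in enumerate(indices):
        rw = "w" if (k % 4) < writes_per_4 else "r"
        yield rw, i * step


# Request count for the slide's parameters: 256MB object, 64B step.
requests = (256 * 2**20) // 64

# Tiny instances of a1 (1/4 writes, forward) and a4 (all writes, backward).
a1_small = list(access_pattern(size=256, step=64, writes_per_4=1))
a4_small = list(access_pattern(size=256, step=64, writes_per_4=4, backward=True))
```

A generator keeps memory use flat even for the full 4M-request pattern, since records are produced one at a time.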
HMTT Configuration Space • A reserved physical memory region • Can be accessed from both source code and binary code