Performance Analysis of HPC with Lmbench Didem Unat Supervisor: Nahil Sobh July 22nd 2005 netfiles.uiuc.edu/dunat2/www
Lmbench: Micro-Benchmark Suite • Simple, portable benchmarks • Compares the performance of different Unix systems • Measures latency and bandwidth • Analyzes only the performance of the processor, memory, network, file system, and disk • Free software
Compiler & optimization issues • The GNU C compiler was used on all resources except copper • The IBM xlc compiler was used on copper • All benchmarks were compiled with optimization -O, except the benchmarks that calculate clock speed and context-switch times
Metrics in the Benchmark • Bandwidth: Pipe/TCP, cached file read, memory copy, memory read/write • Latency: system call, signal handling, process creation, basic CPU operations, context switching, inter-process communication, file and VM system, memory read latencies
Inter-Process Communication Bandwidth (MB/sec) • Transfers 64 MB of data in 64 KB chunks through: • Unix pipe • Unix sockets • TCP/IP sockets
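The pipe case can be sketched as follows. This is an illustrative Python sketch, not lmbench's own code: the function name `pipe_bandwidth` is hypothetical, and a writer thread stands in for the separate writer process a real benchmark would use. Interpreter overhead is included in the result, so the number is only a rough indication.

```python
import os
import threading
import time

def pipe_bandwidth(total_bytes=64 * 1024 * 1024, chunk_bytes=64 * 1024):
    """Move total_bytes through a Unix pipe in chunks; return (bytes received, MB/sec)."""
    read_fd, write_fd = os.pipe()
    chunk = b"x" * chunk_bytes

    def writer():
        sent = 0
        while sent < total_bytes:
            view = memoryview(chunk)[: min(chunk_bytes, total_bytes - sent)]
            while len(view) > 0:             # os.write may accept fewer bytes than offered
                n = os.write(write_fd, view)
                view = view[n:]
                sent += n
        os.close(write_fd)                   # lets the reader see end-of-file

    thread = threading.Thread(target=writer)
    start = time.perf_counter()
    thread.start()
    received = 0
    while True:
        data = os.read(read_fd, chunk_bytes)
        if not data:                         # EOF: the writer closed its end
            break
        received += len(data)
    elapsed = max(time.perf_counter() - start, 1e-9)
    thread.join()
    os.close(read_fd)
    return received, received / (1024 * 1024) / elapsed

if __name__ == "__main__":
    nbytes, mb_per_sec = pipe_bandwidth()
    print(f"{nbytes} bytes through a pipe at {mb_per_sec:.1f} MB/sec")
```

The same loop applies to Unix and TCP/IP sockets if the pipe descriptors are replaced by a connected socket pair.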
Cached file read • A reread benchmark, intended to be used on a file that is already in memory • File reread: copies data from the kernel's file system page cache into the process's buffer • Mmap reread: maps the entire file (8 MB) into the process's address space
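A minimal sketch of the two reread paths, assuming a temporary 8 MB file and a 64 KB read size (both parameters are choices made here for illustration, and `reread_bandwidths` is a hypothetical name). The file is read once first so that the timed passes hit memory rather than disk.

```python
import mmap
import os
import tempfile
import time

def reread_bandwidths(size_bytes=8 * 1024 * 1024, chunk_bytes=64 * 1024):
    """Return (bytes via read, bytes via mmap, read MB/sec, mmap MB/sec)."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\x01" * size_bytes)
        path = tmp.name
    try:
        # buffering=0 gives an unbuffered file, so each read() is a real system call
        with open(path, "rb", buffering=0) as f:
            f.read()                          # first pass warms the page cache
            f.seek(0)

            # File reread: read() copies from the page cache into a process buffer
            start = time.perf_counter()
            read_total = 0
            while True:
                buf = f.read(chunk_bytes)
                if not buf:
                    break
                read_total += len(buf)
            read_s = max(time.perf_counter() - start, 1e-9)

            # Mmap reread: map the whole file, then touch it chunk by chunk
            start = time.perf_counter()
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
                mmap_total = 0
                for offset in range(0, len(mapped), chunk_bytes):
                    mmap_total += len(mapped[offset:offset + chunk_bytes])
            mmap_s = max(time.perf_counter() - start, 1e-9)
    finally:
        os.unlink(path)
    mb = 1024 * 1024
    return read_total, mmap_total, read_total / mb / read_s, mmap_total / mb / mmap_s

if __name__ == "__main__":
    r, m, read_bw, mmap_bw = reread_bandwidths()
    print(f"read: {read_bw:.1f} MB/sec, mmap: {mmap_bw:.1f} MB/sec")
```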
Memory copy • Measures how fast the system can bcopy data • bcopy copies n bytes from a source buffer to a destination buffer • An 8 MB to 8 MB copy, which does not fit in the cache • Kernel bcopy and C library bcopy • C library bcopy shown in the next slide
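The copy measurement reduces to timing one bulk copy between two buffers that are too large for the cache. The sketch below is an assumption-laden stand-in: it uses slice assignment between two `bytearray` objects, which is assumed to be carried out as a single bulk memory copy by the Python runtime, rather than calling bcopy directly. The name `copy_bandwidth` is hypothetical.

```python
import time

def copy_bandwidth(size_bytes=8 * 1024 * 1024, repeats=10):
    """Copy size_bytes from src to dst several times; return (copy ok, best MB/sec)."""
    src = bytearray(b"\xab" * size_bytes)
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src                          # one bulk copy of the whole buffer
        best = min(best, time.perf_counter() - start)
    best = max(best, 1e-9)                    # guard against a zero timer reading
    return dst == src, size_bytes / (1024 * 1024) / best

if __name__ == "__main__":
    ok, mb_per_sec = copy_bandwidth()
    print(f"copy correct: {ok}, {mb_per_sec:.1f} MB/sec")
```

Taking the best of several runs is a design choice made here to reduce noise from other activity on the machine.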
Memory read/write • Read: measures the time to read data into the processor, using an unrolled loop that sums up a series of integers • Write: measures the time to write data to memory, using an unrolled loop that stores a value into an integer
Operating System Entry / Signal Handling / Process Creation Costs • Process-related latencies • System call: null call, null I/O, stat, open/close • Signal handling: signal installation, signal handling • Process creation: fork + exit, fork + execve, fork + /bin/sh -c
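Two of these latencies can be sketched in a few lines. The sketch assumes a Unix system and uses `os.getppid()` as the "null" system call and `os.fork()` followed by an immediate exit for process creation; the function names are hypothetical. Python's call overhead is included, so the results overstate the true kernel cost.

```python
import os
import time

def null_call_latency_us(iterations=100_000):
    """Average time per trivial system call (getppid), in microseconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        os.getppid()
    return (time.perf_counter() - start) / iterations * 1e6

def fork_exit_latency_us(iterations=50):
    """Average time to fork a child that exits at once and to reap it, in microseconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        pid = os.fork()
        if pid == 0:
            os._exit(0)                       # child: exit without running cleanup handlers
        os.waitpid(pid, 0)                    # parent: wait for the child to finish
    return (time.perf_counter() - start) / iterations * 1e6

if __name__ == "__main__":
    print(f"null call:   {null_call_latency_us():.2f} us")
    print(f"fork + exit: {fork_exit_latency_us():.2f} us")
```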
Context Switching • The time to save the state of one process and restore the state of another process • The processes are connected in a ring of Unix pipes • A token is passed from process to process • Each process allocates an array and sums it to simulate work • The reported context-switch time does not include the overhead of doing the work • Two parameters: number and size of processes
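The ring of pipes can be sketched as below, assuming a Unix system. This is a simplified illustration with a hypothetical function name: the processes do no array work (size zero), and the pipe overhead is not subtracted, so the figure is an upper bound on the context-switch time rather than the value lmbench would report.

```python
import os
import time

def ring_hop_time_us(num_procs=4, laps=200):
    """Pass a one-byte token around a ring of processes; return microseconds per hop."""
    assert num_procs >= 2
    # pipes[i] carries the token from process i to process (i + 1) % num_procs
    pipes = [os.pipe() for _ in range(num_procs)]

    def keep_only(read_fd, write_fd):
        # Close every pipe end this process does not use, so EOF can propagate
        for r, w in pipes:
            if r != read_fd:
                os.close(r)
            if w != write_fd:
                os.close(w)

    children = []
    for i in range(1, num_procs):
        pid = os.fork()
        if pid == 0:
            read_fd, write_fd = pipes[i - 1][0], pipes[i][1]
            keep_only(read_fd, write_fd)
            while True:
                token = os.read(read_fd, 1)
                if not token:                 # upstream closed: shut down
                    break
                os.write(write_fd, token)     # forward the token to the next process
            os._exit(0)
        children.append(pid)

    # The parent is process 0: it injects the token and waits for it to return
    read_fd, write_fd = pipes[num_procs - 1][0], pipes[0][1]
    keep_only(read_fd, write_fd)
    start = time.perf_counter()
    for _ in range(laps):
        os.write(write_fd, b"t")
        os.read(read_fd, 1)
    elapsed = time.perf_counter() - start
    os.close(write_fd)                        # EOF cascades around the ring
    for pid in children:
        os.waitpid(pid, 0)
    os.close(read_fd)
    return elapsed / (laps * num_procs) * 1e6

if __name__ == "__main__":
    print(f"{ring_hop_time_us():.2f} us per hop")
```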
Inter-Process Communication Latencies • Passing a small message back and forth between two processes • The time reported is one round trip • Message size: a byte or a word • Metrics: pipe, Unix socket, UDP, TCP, RPC over UDP and TCP, TCP connection latency
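A round-trip measurement of this kind can be sketched with a Unix socket pair and a forked echo process. This is an illustrative sketch assuming a Unix system; `unix_socket_round_trip_us` is a hypothetical name, and the reported time includes interpreter overhead.

```python
import os
import socket
import time

def unix_socket_round_trip_us(round_trips=1000):
    """Bounce one byte between two processes; return (last reply, microseconds per round trip)."""
    parent_sock, child_sock = socket.socketpair()
    pid = os.fork()
    if pid == 0:
        parent_sock.close()
        while True:
            data = child_sock.recv(1)
            if not data:                      # parent closed its end
                break
            child_sock.sendall(data)          # echo the byte back
        os._exit(0)
    child_sock.close()
    reply = b""
    start = time.perf_counter()
    for _ in range(round_trips):
        parent_sock.sendall(b"x")
        reply = parent_sock.recv(1)
    elapsed = time.perf_counter() - start
    parent_sock.close()
    os.waitpid(pid, 0)
    return reply, elapsed / round_trips * 1e6

if __name__ == "__main__":
    _, latency = unix_socket_round_trip_us()
    print(f"{latency:.2f} us per round trip")
```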
File & VM System • File create/delete: creates a number of small files in the current working directory and then removes them • Mmap latency: the cost of mapping and unmapping files of varying sizes • Prot fault: the time to catch a protection fault • Page fault: the cost of faulting in pages from a file • 100 fd select: the time to do a select on n file descriptors
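The file create/delete measurement can be sketched as follows. As an assumption made for safety, the sketch works in a temporary directory rather than the current working directory, creates empty files, and uses a hypothetical function name.

```python
import os
import tempfile
import time

def create_delete_rates(num_files=1000):
    """Create then remove num_files empty files; return (creates/sec, deletes/sec, files left)."""
    with tempfile.TemporaryDirectory() as directory:
        paths = [os.path.join(directory, f"f{i}") for i in range(num_files)]

        start = time.perf_counter()
        for path in paths:
            os.close(os.open(path, os.O_CREAT | os.O_WRONLY, 0o644))
        create_s = max(time.perf_counter() - start, 1e-9)

        start = time.perf_counter()
        for path in paths:
            os.unlink(path)
        delete_s = max(time.perf_counter() - start, 1e-9)

        remaining = len(os.listdir(directory))
    return num_files / create_s, num_files / delete_s, remaining

if __name__ == "__main__":
    creates, deletes, _ = create_delete_rates()
    print(f"{creates:.0f} creates/sec, {deletes:.0f} deletes/sec")
```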
Memory Latencies • Measures memory read latency for varying memory sizes and strides • The size of the array starts from 512 bytes • The stride varies from 16 to 1024 • Does not include the instruction execution time
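The access pattern behind this measurement is a chain of dependent loads: each element stores the index of the next element, one stride further on, so a load cannot start until the previous one finishes. The sketch below shows that pattern only. In Python the interpreter overhead dominates the timing, so the returned figure is not a usable memory latency, and the function name and the 8-byte element size are assumptions made here.

```python
import array
import time

def chase_latency_ns(size_bytes, stride_bytes, loads=200_000):
    """Walk a strided chain of dependent loads; return (final index, ns per load)."""
    chain = array.array("q")
    item = chain.itemsize                     # bytes per element (8 on common platforms)
    n = max(size_bytes // item, 1)            # number of elements in the array
    step = max(stride_bytes // item, 1)       # stride expressed in elements
    # Element i holds the index of the element one stride ahead, wrapping at the end
    chain.extend((i + step) % n for i in range(n))

    idx = 0
    start = time.perf_counter()
    for _ in range(loads):
        idx = chain[idx]                      # next load depends on this result
    elapsed = time.perf_counter() - start
    return idx, elapsed / loads * 1e9

if __name__ == "__main__":
    for size in (512, 64 * 1024, 8 * 1024 * 1024):
        for stride in (16, 128, 1024):
            _, ns = chase_latency_ns(size, stride)
            print(f"size={size:>8} stride={stride:>4}: {ns:.1f} ns/load")
```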
Conclusion • The best has problems
Thank you! Have a nice weekend!