SMP, 64-bit Unix and Kernel Compilation
Guntis Barzdins, Girts Folkmanis
How to run MPI
[guntisb@zars mpi]$ ls -l
total 392
-rw-rw-r-- 1 guntisb guntisb 122 Apr 28 07:08 Makefile
-rw-rw-r-- 1 guntisb guntisb 13 May 17 14:33 mfile
-rw-rw-r-- 1 guntisb guntisb 344 May 12 09:28 mpi.jdl
-rw-rw-r-- 1 guntisb guntisb 2508 Apr 28 07:08 mpi.sh
-rwxrwxr-x 1 guntisb guntisb 331899 May 17 14:48 passtonext
-rw-rw-r-- 1 guntisb guntisb 3408 Apr 28 07:08 passtonext.c
-rw-rw-r-- 1 guntisb guntisb 2132 May 17 14:48 passtonext.o
[guntisb@zars mpi]$ more mfile
localhost:4
[guntisb@zars mpi]$
[guntisb@zars mpi]$ make
mpicc passtonext.c -o passtonext -lmpich -lm
[guntisb@zars mpi]$ mpirun -np 2 -machinefile mfile passtonext
guntisb@localhost's password:
Nodename=zars.latnet.lv Rank=0 Size=2
INFO: zars.latnet.lv (0 of 2) sent 73 value to 1 of 2
INFO: zars.latnet.lv (0 of 2) received 74 value from 1 of 2
Nodename=zars.latnet.lv Rank=1 Size=2
INFO: zars.latnet.lv (1 of 2) received 73 value from 0 of 2
INFO: zars.latnet.lv (1 of 2) sent 73+1=74 value to 0 of 2
[guntisb@zars mpi]$
Thanks to Jānis Tragheims!!
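The session above uses a machinefile containing `localhost:4` and starts two processes. As a minimal sketch, the helper below sums the slots a machinefile declares and prints the matching `mpirun` command line. The function names `machinefile_slots` and `mpirun_command` are made up for this illustration, and counting a host without `:count` as one slot is an assumption.

```shell
#!/bin/sh
# Sum the process slots declared in a "host[:count]" machinefile.
# Assumption: a host listed without ":count" provides one slot.
machinefile_slots() {
    awk -F: '
        /^[ \t]*(#|$)/ { next }               # skip comments and blank lines
        { total += ($2 == "" ? 1 : $2) }      # bare host counts as one slot
        END { print total + 0 }
    ' "$1"
}

# Print (rather than execute) the mpirun command line from the session.
mpirun_command() {
    echo "mpirun -np $1 -machinefile $2 $3"
}

mfile=$(mktemp)
printf 'localhost:4\n' > "$mfile"
echo "slots available: $(machinefile_slots "$mfile")"
mpirun_command 2 mfile passtonext
rm -f "$mfile"
```

Printing the command instead of running it keeps the sketch usable on machines without MPI installed.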
Homework 3b variant • If homework 3a has been completed, then for part 3b you may write and test your own MPI program that computes with at least 4 processes. This MPI program does not have to be run on the Grid!
Message-passing performance comparison • [Figure: SMP MPI performance] • [Figure: Gigabit Ethernet MPI performance]
AMD Opteron 800 HPC Processing Node • HPC strengths • Flat SMP-like memory model: • All four processors reside within the same 2^48 memory map • Expandable to 8P NUMA • Glue-less coherent multiprocessing: • Low latency and high bandwidth, ~1600 MT/s (6.4 GB/s) • 32 GB of memory on a high-bandwidth external memory bus (>5.3 GB/s) • Native high-bandwidth memory-mapped I/O (>25 Gbit/s)
Sufficiently Uniform Memory Organization (SUMO) • Advantages • Software view of memory is SMP • Latency difference between local and remote memory is a function of the number of processors in the node • 1P and 2P look like an SMP machine • 3P and 4P are NUMA-like but can still be viewed as a ccUMA or asymmetric SMP node • >4P can be viewed as ccUMA and, depending on the cache hit rate, may or may not require a NUMA-aware OS • Physical address space is flat and can be viewed as fully coherent or not (MOESI states) • DRAM can be contiguous or interleaved • Additional processor nodes bring true increased memory bandwidth • Designed for lower overall system chip count (glue-less interface) • Disadvantages • 3P and 4P nodes work better if the OS is “aware” of the memory map • >4P may require a NUMA (non-uniform memory architecture) aware OS if the cache hit rate is low • A 2.6.9 kernel is needed to take full advantage of the NUMA architecture of Opterons
Future NUMA Systems: Scaling beyond 8 Processors • [Figure: sixteen 4P nodes joined by switches (SW0 to SW3) into an interconnect fabric] • Scaling beyond 8P is enabled • External coherent HyperTransport switch (coherent interconnect) • Snoop filter • Data caching • Up to 16 processors within the same 2^40 SMP memory space
High Density HPC Cluster: SprayCool Technology from ISR • 16 cards • 16 GFLOPS per card • 256 GFLOPS peak throughput • 64 GB of memory per card • 1 terabyte of system memory • 2,240 cubic inches (16" × 10" × 14") • 114 MFLOPS per cubic inch • 0.46 GB of memory per cubic inch • ~6 kW • ~3 watts per cubic inch
AMD Opteron Beowulf 4P SMP Processing Node • One 4P SMP node • 16 GFLOPS • 32 GB DRAM • 10 GB/s memory bandwidth • [Block diagram: four AMD Opteron processors, each with 8 GB DRAM on a 200-333 MHz 9-byte registered DDR interface; 16x16 HyperTransport @ 1600 MT/s; links to AMD-8131 PCI-X tunnels (PCI-X); AMD-8111 I/O hub (legacy PCI, VGA PCI graphics, USB 1.0, AC97, UDMA133, MII 10/100 PHY, LPC with FLASH and SIO, 100BaseT management LAN)]
Extending the 32-bit instruction set • Intel and AMD schemes are very similar • 48-bit virtual address space • 64-bit general-purpose registers (GPRs) • Support for 64-bit addressing and integer math • Eight extra GPRs added • Eight extra XMM registers added • Difference: EM64T supports SSE3 instructions; Opteron has 3DNow!
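To make the address-width figures concrete, here is a small sketch that converts an address width in bits into the size of the address space. The helper name `addr_space_gib` is hypothetical, and the calculation assumes the shell provides 64-bit integer arithmetic.

```shell
#!/bin/sh
# Size of an address space of the given width, in GiB (2^bits / 2^30).
# Valid for widths from 30 to 62 bits with 64-bit shell arithmetic.
addr_space_gib() {
    echo $(( 1 << ($1 - 30) ))
}

echo "32-bit: $(addr_space_gib 32) GiB"                   # 4 GiB
echo "48-bit: $(( $(addr_space_gib 48) / 1024 )) TiB"     # 256 TiB
```

The 48-bit result matches the 256 TB upper bound that appears in the memory-layout slide of this deck.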
Status of 64-bit Linux • 64-bit Linux is a stable operating platform • The Opteron CPU and associated platforms are sufficiently reliable • The Opteron CPU gives slightly better performance than the Xeon for significantly less power draw • 64-bit compilation and optimization can lead to significant performance gains on both AMD and Intel
Increased Memory for 32-bit Applications • 32-bit server, 4 GB DRAM • OS & app share the small 32-bit VM space • 32-bit OS & applications all share 4 GB DRAM • Leads to small dataset sizes & lots of paging • 64-bit server, 12 GB DRAM • App has exclusive use of the 32-bit VM space • 64-bit OS can allocate each application large dedicated portions of the 12 GB DRAM • OS uses VM space way above 32 bits • Leads to larger dataset sizes & reduced paging • [Figure: on the 32-bit server, the 32-bit OS and 32-bit app share one 0 to 4 GB virtual address space (split at 2 GB) backed by 4 GB DRAM; on the 64-bit server, each 32-bit app has its own unshared 0 to 4 GB virtual space, the 64-bit OS lives in virtual memory extending to 256 TB, backed by 12 GB DRAM]
Compilers • Opteron 250 • Legacy executable, i386: 1290 VUPS • Gcc 3.4.2 optimized: 2440 VUPS • Pathscale compiler: 2677 VUPS • Xeon 3.6 GHz • Legacy executable, i386: 1386 VUPS • Gcc 3.4.2 optimized: 2309 VUPS • Intel 8.1 compiler: 2910 VUPS • Intel 8.1 compiler with profile feedback: 4332 VUPS • Intel Fortran and C 8.1 use SSE3 instructions to optimize, which makes the resulting code incompatible with Opterons. • For comparison, Pentium III 1.0 GHz = 568 VUPS.
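The VUPS figures are easier to compare as ratios against the legacy i386 executable on the same CPU. The sketch below does that with integer arithmetic; `relative_pct` is a hypothetical helper, and results are truncated to whole percent.

```shell
#!/bin/sh
# Relative throughput of a build versus a baseline, as an integer percentage
# (truncated, since POSIX shell arithmetic is integer-only).
relative_pct() {
    echo $(( $2 * 100 / $1 ))
}

echo "Opteron 250, gcc 3.4.2 vs legacy i386:  $(relative_pct 1290 2440)%"   # 189%
echo "Xeon 3.6, gcc 3.4.2 vs legacy i386:     $(relative_pct 1386 2309)%"   # 166%
echo "Xeon 3.6, Intel 8.1 + profile feedback: $(relative_pct 1386 4332)%"   # 312%
```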
Computing Strategy: x86-64 • Legacy: 32-bit OS • Both AMD Athlon 64 and AMD Opteron processors run any legacy 32-bit OS • Compatible with all legacy drivers, OS and BIOS • No application recompile required, no emulation layer • 64-bit OS • Desired applications can be written or ported to leverage the full 64-bit capabilities of x86-64 • Migrate only where warranted, and at the user’s pace • 32-bit applications run under the 64-bit OS • BIOS is standard 32-bit x86 code • Transfer to 64-bit operation occurs under OS load/startup control • 64-bit mode does not use segmentation: flat addressing
[Figure: AMD64 software stack. User level: AMD64 application and IA32 application (64-bit processes), compatibility thunking layer. Kernel level: thunking layer, AMD64 operating system, AMD64 device drivers]
64-bit OS & Application Interaction • 32-bit compatibility mode • 64-bit OS runs existing 32-bit apps with leading-edge performance • No recompile required; 32-bit code is executed directly by the CPU • 64-bit OS provides 32-bit libraries and a “thunking” translation layer for 32-bit system calls • 64-bit mode • Migrate only where warranted, and at the user’s pace, to fully exploit AMD64 • 64-bit OS requires all kernel-level programs and drivers to be ported • Any program that is linked or plugged in to a 64-bit program (ABI level) must be ported to 64 bits • [Figure: user level: a 32-bit application (32-bit thread, 4 GB expanded address space) and a 64-bit application (64-bit thread, 512 GB or 8 TB address space); kernel level: translation layer, 64-bit operating system, 64-bit device drivers]
Problems 64/32 • On a pure 64-bit Linux installation you can't run binary-only programs compiled for IA32, or applications that haven't been ported to AMD64 yet (e.g. OpenOffice.org), because 32-bit applications cannot be linked against 64-bit libraries; running a 32-bit application requires the 32-bit versions of its libraries as well. • Multiarch • Architectures like sparc64 or powerpc64, which use lib for the default 32-bit libraries and lib64 for the extra 64-bit libraries, default to executing 32-bit applications • amd64 defaults to 64-bit binaries because of the performance benefits of 64-bit mode • Thus, to avoid rewriting the creation rules of virtually every binary-arch package to install libraries in lib64 instead of lib, and to find a solution for all multiarch-capable platforms, various people are working on so-called multiarch support.
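Whether a given binary or library is 32-bit or 64-bit can be read from its ELF header: the byte at offset 4 (EI_CLASS) is 1 for 32-bit and 2 for 64-bit objects. The following is a minimal sketch of that check; `elf_class` is a hypothetical helper name, and `/bin/sh` is only an assumed example path.

```shell
#!/bin/sh
# Report whether a file is a 32-bit or 64-bit ELF object by reading the
# EI_CLASS byte (offset 4) that follows the 4-byte "\177ELF" magic.
elf_class() {
    # od prints the first five bytes as unsigned decimals; "set --" splits them.
    set -- $(od -An -tu1 -N5 "$1")
    if [ "$1 $2 $3 $4" != "127 69 76 70" ]; then
        echo "not ELF"
        return 1
    fi
    case $5 in
        1) echo "32-bit" ;;
        2) echo "64-bit" ;;
        *) echo "unknown class"; return 1 ;;
    esac
}

# Demonstration on a system binary; /bin/sh is an assumed path.
if [ -r /bin/sh ]; then
    echo "/bin/sh: $(elf_class /bin/sh)"
fi
```

Running the same check over an application and the libraries it needs shows quickly whether their classes match.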
Sun Solaris 64-bit • The Solaris Operating System supports both 32-bit and 64-bit hardware. Customers with 32-bit hardware can run the Solaris Operating System and take advantage of the many features in the Solaris Operating System that are not explicitly related to 64 bits (e.g., dynamic reconfiguration, scalability enhancements, performance improvements). Customers can run a 32-bit application on 64- or 32-bit hardware with the Solaris Operating System without any change to the application. Note that Solaris for x86, prior to Solaris 10, supports only a 32-bit kernel.
Mac OS X Xcode • At the heart of Xcode 2.0 is Apple’s version of gcc 3.5, the next generation of the industry-standard gcc compiler. The new compiler helps you get more performance from your existing code by using a number of advanced optimization techniques. Auto-vectorization, a technique borrowed from the world of supercomputing, helps you to unlock the power of the Velocity Engine in every PowerPC G4 and G5 system without writing vectorized code. With the new 64-bit support in Mac OS X Tiger, Xcode gives you the ability to create applications such as computation and rendering engines that use 64-bit memory addressing. This is ideal for data-intensive applications, which can run faster by accessing data in memory, rather than via disk access. Xcode gives you the tools for building and debugging 64-bit applications for PowerPC G5 and Mac OS X Tiger, as well as letting you create Fat Binaries that contain both 32-bit and 64-bit executables.
Kernel source • Download the source from www.kernel.org or a mirror. • Unpack: • cd /usr/src • tar xzvf linux-<Version>.tar.gz • ln -s linux-<Version> linux • Source root: • /usr/src/linux
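The unpack-and-symlink steps can be sketched as a small function. The point to note is that the `linux` symlink should point at the unpacked directory, not at the tarball. `unpack_kernel` is a hypothetical helper, and the demonstration uses a tiny stand-in tarball with a made-up version name rather than real kernel sources.

```shell
#!/bin/sh
# Unpack linux-<Version>.tar.gz into a source directory and point the
# "linux" symlink at the unpacked tree (not at the tarball).
unpack_kernel() {
    tarball=$1
    srcdir=$2
    tree=$(basename "$tarball" .tar.gz)      # e.g. linux-<Version>
    tar xzf "$tarball" -C "$srcdir" || return 1
    rm -f "$srcdir/linux"                    # drop a stale symlink, if any
    ln -s "$tree" "$srcdir/linux"
}

# Demonstration with a tiny stand-in tarball in a scratch directory;
# "linux-0.0-demo" is a made-up version used only for this sketch.
work=$(mktemp -d)
mkdir -p "$work/build/linux-0.0-demo" "$work/src"
echo "# placeholder" > "$work/build/linux-0.0-demo/Makefile"
tar czf "$work/linux-0.0-demo.tar.gz" -C "$work/build" linux-0.0-demo
unpack_kernel "$work/linux-0.0-demo.tar.gz" "$work/src"
ls -l "$work/src"
rm -rf "$work"
```

On a real system the second argument would be /usr/src, which normally requires root privileges.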
Kernel source tree
.
|-- Documentation
|-- arch
|   |-- alpha
|   |-- arm
|   |-- i386
|   |   |-- boot
|   |   |   |-- compressed
|   |   |   `-- tools
|   |   |-- kernel
|   |   |-- lib
|   |   |-- math-emu
|   |   `-- mm
|   |-- ia64
|   |-- m68k
|   |-- ppc
|   |-- ppc64
|   |-- sparc
|   |-- sparc64
|   |-- um
|   `-- x86_64
|-- drivers
|   |-- acpi
|   |-- atm
|   |-- block
|   |-- bluetooth
|   |-- cdrom
|   |-- char
|   |-- hotplug
|   |-- ide
|   |-- ieee1394
|   |-- input
|   |-- isdn
|   |-- macintosh
|   |-- net
|   |-- parport
|   |-- pci
|   |-- pcmcia
|   |-- pnp
|   |-- scsi
|   |-- sgi
|   |-- sound
|   |-- usb
|   `-- video
|-- fs
|   |-- autofs
|   |-- ext2
|   |-- ext3
|   |-- fat
|   |-- isofs
|   |-- minix
|   |-- msdos
|   |-- ntfs
|   |-- reiserfs
|   |-- smbfs
|   |-- udf
|   `-- vfat
|-- include
|   |-- asm -> asm-um
|   |-- asm-alpha
|   |-- asm-arm
|   |-- asm-i386
|   |-- asm-ia64
|   |-- asm-m68k
|   |-- asm-ppc
|   |-- asm-ppc64
|   |-- asm-sparc
|   |-- asm-sparc64
|   |-- asm-um
|   |-- asm-x86_64
|   |-- linux
|   |-- math-emu
|   |-- net
|   |-- pcmcia
|   |-- scsi
|   `-- video
|-- init
|-- ipc
|-- kernel
|-- lib
|-- mm
|-- net
|   |-- 802
|   |-- appletalk
|   |-- atm
|   |-- bluetooth
|   |-- core
|   |-- ethernet
|   |-- ipv4
|   |-- ipv6
|   |-- ipx
|   |-- irda
|   |-- packet
|   |-- unix
|   `-- x25
`-- scripts
Linux Source Tree Layout • /usr/src/linux: Documentation, arch, drivers, fs, include, init, ipc, kernel, lib, mm, net, scripts • arch: alpha arm i386 ia64 m68k mips mips64 ppc s390 sh sparc sparc64 • drivers: acorn atm block cdrom char dio fc4 i2c i2o ide ieee1394 isdn macintosh misc net … • fs: adfs affs autofs autofs4 bfs coda cramfs devfs devpts efs ext2 fat hfs hpfs … • include: asm-alpha asm-arm asm-generic asm-i386 asm-ia64 asm-m68k asm-mips asm-mips64 linux math-emu net pcmcia scsi video … • net: 802 appletalk atm ax25 bridge core decnet econet ethernet ipv4 ipv6 ipx irda khttpd lapb …
Sizes (linux-2.4.0-test2)
size    directory         entries  files  loc
90M     /usr/src/linux/   19       7645   2.6M
4.5M    Documentation     97       380    na
16.5M   arch              12       1685   466K
54M     drivers           31       2256   1.5M
5.6M    fs                70       489    150K
14.2M   include           19       2262   285K
28K     init              2        2      1K
120K    ipc               6        6      4.5K
332K    kernel            25       25     12K
80K     lib               8        8      2K
356K    mm                19       19     12K
5.8M    net               33       453    162K
400K    scripts           26       42     12K
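Figures like the ones in this table can be gathered with standard tools. The sketch below is one possible way to do it and is not how the original numbers were produced: `dir_stats` and `tree_stats` are hypothetical helpers, and counting lines only in `*.c`, `*.h` and `*.S` files is an assumption about what "loc" means.

```shell
#!/bin/sh
# Print size (KB), directory, top-level entries, file count and line count
# for one directory. Counting lines only in *.c, *.h and *.S files is an
# assumption; the slide does not say how "loc" was measured.
dir_stats() {
    size_kb=$(du -sk "$1" | awk '{ print $1 }')
    entries=$(ls -A "$1" | wc -l | tr -d ' ')
    files=$(find "$1" -type f | wc -l | tr -d ' ')
    loc=$(find "$1" -type f \( -name '*.c' -o -name '*.h' -o -name '*.S' \) \
              -exec cat {} + | wc -l | tr -d ' ')
    printf '%sK\t%s\t%s\t%s\t%s\n' "$size_kb" "$1" "$entries" "$files" "$loc"
}

# Walk the top-level directories of a source tree (default: /usr/src/linux).
tree_stats() {
    root=${1:-/usr/src/linux}
    for d in "$root"/*/; do
        [ -d "$d" ] && dir_stats "${d%/}"
    done
    return 0
}

# Demonstration on a scratch tree with a single three-line source file.
demo=$(mktemp -d)
mkdir "$demo/init"
printf 'int main(void)\n{\n}\n' > "$demo/init/main.c"
tree_stats "$demo"
rm -rf "$demo"
```

Calling `tree_stats /usr/src/linux` on an unpacked kernel tree prints one tab-separated row per top-level directory.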
linux/arch • Subdirectories for each current port. • Each contains kernel, lib, mm, boot and other directories whose contents override code stubs in architecture independent code. • lib contains highly-optimized common utility routines such as memcpy, checksums, etc. • arch as of 2.4: • alpha, arm, i386, ia64, m68k, mips, mips64. • ppc, s390, sh, sparc, sparc64.
linux/drivers • Largest amount of code in the kernel tree (~1.5M lines). • device, bus, platform and general directories. • drivers/char – n_tty.c is the default line discipline. • drivers/block – elevator.c, genhd.c, linear.c, ll_rw_blk.c, raidN.c. • drivers/net – specific drivers and general routines Space.c and net_init.c. • drivers/scsi – scsi_*.c files are generic; sd.c (disk), sr.c (CD-ROM), st.c (tape), sg.c (generic). • General: • cdrom, ide, isdn, parport, pcmcia, pnp, sound, telephony, video. • Buses – fc4, i2c, nubus, pci, sbus, tc, usb. • Platforms – acorn, macintosh, s390, sgi.
linux/fs • Contains: • virtual filesystem (VFS) framework. • subdirectories for actual filesystems. • vfs-related files: • exec.c, binfmt_*.c – files for mapping new process images. • devices.c, block_dev.c – device registration, block device support. • super.c, filesystems.c. • inode.c, dcache.c, namei.c, buffer.c, file_table.c. • open.c, read_write.c, select.c, pipe.c, fifo.c. • fcntl.c, ioctl.c, locks.c, dquot.c, stat.c.
linux/include • include/asm-*: • Architecture-dependent include subdirectories. • include/linux: • Header info needed both by the kernel and user apps. • Usually linked to /usr/include/linux. • Kernel-only portions guarded by #ifdefs • #ifdef __KERNEL__ • /* kernel stuff */ • #endif • Other directories: • math-emu, net, pcmcia, scsi, video.
linux/init • Just two files: version.c, main.c. • version.c – contains the version banner that prints at boot. • main.c – architecture-independent boot code. • start_kernel is the primary entry point.
linux/ipc • System V IPC facilities. • If disabled at compile-time, util.c exports stubs that simply return -ENOSYS. • One file for each facility: • sem.c – semaphores. • shm.c – shared memory. • msg.c – message queues.
linux/kernel • The core kernel code. • sched.c – “the main kernel file”: • scheduler, wait queues, timers, alarms, task queues. • Process control: • fork.c, exec.c, signal.c, exit.c etc… • Kernel module support: • kmod.c, ksyms.c, module.c. • Other operations: • time.c, resource.c, dma.c, softirq.c, itimer.c. • printk.c, info.c, panic.c, sysctl.c, sys.c.
linux/lib • kernel code cannot call standard C library routines. • Files: • brlock.c – “Big Reader” spinlocks. • cmdline.c – kernel command line parsing routines. • errno.c – global definition of errno. • inflate.c – “gunzip” part of gzip.c used during boot. • string.c – portable string code. • Usually replaced by optimized, architecture-dependent routines. • vsprintf.c – libc replacement.
linux/mm • Paging and swapping: • swap.c, swapfile.c (paging devices), swap_state.c (cache). • vmscan.c – paging policies, kswapd. • page_io.c – low-level page transfer. • Allocation and deallocation: • slab.c – slab allocator. • page_alloc.c – page-based allocator. • vmalloc.c – kernel virtual-memory allocator. • Memory mapping: • memory.c – paging, fault-handling, page table code. • filemap.c – file mapping. • mmap.c, mremap.c, mlock.c, mprotect.c.
linux/scripts • Scripts for: • Menu-based kernel configuration. • Kernel patching. • Generating kernel documentation.
Linux Kernel Configuration • Download the source code and extract it under /usr/src • Customize kernel configuration • make config • make menuconfig • make xconfig • Two menu items: Networking options and Network device support • Device support • Three options: y, m, n
Kernel Config • Get sources from kernel.org • Unpack in your home directory: gzip -cd linux-2.4.XX.tar.gz | tar xvf - • make menuconfig • make dep • make bzImage
Building and Installing Kernel • Compiling the kernel. • make dep • [ make clean ] • make bzImage • make modules • Installing the kernel. • make modules_install • [ make install ] • Kernel stored in /usr/src/linux-2.4/arch/i386/boot/bzImage and copied to /boot • cp arch/i386/boot/bzImage /boot/ • Edit your lilo.conf and run /sbin/lilo • Reboot.
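The build and install steps above can be collected into one sequence. The sketch below only prints the commands and never runs them, so it is safe to try anywhere. `kernel_build_steps` is a hypothetical helper, and the default architecture of i386 mirrors the path used on the slide.

```shell
#!/bin/sh
# Print the 2.4-era build and install sequence from the slides for a given
# architecture. Commands are only echoed, never executed.
kernel_build_steps() {
    arch=${1:-i386}
    printf '%s\n' \
        "make dep" \
        "make clean" \
        "make bzImage" \
        "make modules" \
        "make modules_install" \
        "cp arch/$arch/boot/bzImage /boot/" \
        "/sbin/lilo"
}

kernel_build_steps i386
```

The final /sbin/lilo step applies only when LILO is the boot loader.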
LILO vs. GRUB • LILO • /sbin/lilo must be re-run to update the mini-bootloader in the MBR • Cannot read the file system itself • GRUB • Multistage loader • Can read the file system itself • Parameter passing (runlevel, init) to the kernel • Effectively a hack: modifies the address and name, inside the kernel, of the process to start
Linux Loader (LILO)
# sample /etc/lilo.conf
boot = /dev/hda
delay = 40
password=SOME_PASSWORD_HERE
default=vmlinuz-stable
vga = normal
root = /dev/hda1
image = vmlinuz-2.5.99
    label = net test kernel
    restricted
image = vmlinuz-stable
    label = stable kernel
    restricted
other = /dev/hda3
    label = Windows 2000 Professional
    restricted
    table = /dev/hda
GRUB
# /etc/grub.conf generated by anaconda
timeout=10
splashimage=(hd0,1)/grub/splash.xpm.gz
password --md5 $1$ÕpîÁÜdþï$J08sMAcfyWW.C3soZpHkh.
title Red Hat Linux (2.4.18-3custom)
    root (hd0,1)
    kernel /vmlinuz-2.4.18-3custom ro root=/dev/hda5
    initrd /initrd-2.4.18-3.img
title Red Hat Linux (2.4.18-3) Emergency kernel (no afs)
    root (hd0,1)
    kernel /vmlinuz-2.4.18-3 ro root=/dev/hda5
    initrd /initrd-2.4.18-3.img
title Windows 2000 Professional
    rootnoverify (hd0,0)
    chainloader +1
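Each Linux entry in the configuration above follows the same pattern: title, boot device, kernel line, optional initrd. As a rough sketch, a new entry for a freshly built kernel could be generated like this. `grub_entry` is a hypothetical helper, and the fixed `ro` option simply copies the entries shown on the slide.

```shell
#!/bin/sh
# Emit a GRUB (legacy) menu entry in the style of the grub.conf above.
# Arguments: title, GRUB boot device, kernel image, root device, initrd.
grub_entry() {
    printf 'title %s\n' "$1"
    printf '\troot %s\n' "$2"
    printf '\tkernel %s ro root=%s\n' "$3" "$4"
    if [ -n "$5" ]; then
        printf '\tinitrd %s\n' "$5"
    fi
}

grub_entry "Red Hat Linux (2.4.18-3custom)" "(hd0,1)" \
    /vmlinuz-2.4.18-3custom /dev/hda5 /initrd-2.4.18-3.img
```

The output would be appended to grub.conf by hand after review. Unlike LILO, GRUB does not need a separate install step after the file changes.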
/boot
unix boot # pwd; ls -lRp
/boot
.:
total 1582
lrwxrwxrwx 1 root root 1 Sep 23 14:11 boot -> ./
drwxr-xr-x 2 root root 1024 Sep 23 15:34 grub/
-rw-r--r-- 1 root root 458622 Sep 23 14:58 initrd-2.4.26-gentoo-r9
-rw-r--r-- 1 root root 1137878 Sep 23 14:50 kernel-2.4.26-gentoo-r9
drwx------ 2 root root 12288 Sep 23 13:49 lost+found/
./grub:
total 846
-rw-r--r-- 1 root root 30 Sep 23 15:34 device.map
-rw-r--r-- 1 root root 11264 Sep 23 15:34 e2fs_stage1_5
-rw-r--r-- 1 root root 10256 Sep 23 15:34 fat_stage1_5
-rw-r--r-- 1 root root 9216 Sep 23 15:34 ffs_stage1_5
-rw-r--r-- 1 root root 245 Sep 23 15:34 grub.conf
-rw-r--r-- 1 root root 1495 Sep 23 15:32 grub.conf.sample
-rw-r--r-- 1 root root 11456 Sep 23 15:34 jfs_stage1_5
lrwxrwxrwx 1 root root 9 Sep 23 15:32 menu.lst -> grub.conf
-rw-r--r-- 1 root root 9600 Sep 23 15:34 minix_stage1_5
-rwxr-xr-x 1 root root 196836 Sep 23 15:32 nbgrub
-rwxr-xr-x 1 root root 197860 Sep 23 15:32 pxegrub
-rw-r--r-- 1 root root 12864 Sep 23 15:34 reiserfs_stage1_5
-rw-r--r-- 1 root root 33856 Sep 23 15:32 splash.xpm.gz
-rw-r--r-- 1 root root 512 Sep 23 15:34 stage1
-rw-r--r-- 1 root root 135148 Sep 23 15:34 stage2
-rwxr-xr-x 1 root root 196900 Sep 23 15:32 stage2.netboot
-rw-r--r-- 1 root root 8896 Sep 23 15:34 vstafs_stage1_5
-rw-r--r-- 1 root root 12840 Sep 23 15:34 xfs_stage1_5
./lost+found:
total 0
unix boot #
unix grub # more grub.conf
default 0
# How many seconds to wait before the default listing is booted.
timeout 5
title=gentoo
    root (hd0,0)
    kernel /kernel-2.4.26-gentoo-r9 root=/dev/ram0 init=/linuxrc ramdisk=8192 real_root=/dev/hda2
    initrd /initrd-2.4.26-gentoo-r9
title GNU/Linux (2.4.25)
    root (hd0,4)
    kernel (hd0,4)/boot/kernel-2.4.25_pre6-gss root=/dev/ram0 init=/linuxrc real_root=/dev/hda5 vga=0x317 splash=verbose
    initrd (hd0,4)/boot/initrd-2.4.25_pre6-gss
unix grub #
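The second entry above pairs `root (hd0,4)` with `real_root=/dev/hda5`: GRUB (legacy) numbers both disks and partitions from zero, while Linux partition numbers start at one. A minimal sketch of that translation follows. `grub_device` is a hypothetical helper, and mapping disks in letter order (hda to hd0, hdb to hd1) is an assumption, since the actual BIOS ordering is recorded in grub/device.map.

```shell
#!/bin/sh
# Translate a Linux IDE/SCSI device name into GRUB (legacy) notation.
# GRUB numbers disks and partitions from zero: /dev/hda5 -> (hd0,4).
# Assumption: disks map in letter order (hda=hd0, hdb=hd1); the real
# BIOS ordering is recorded in grub/device.map and may differ.
grub_device() {
    dev=${1#/dev/}
    case $dev in
        [hs]d[a-z]*) ;;
        *) echo "unsupported device: $1" >&2; return 1 ;;
    esac
    letter=$(printf '%s' "$dev" | cut -c3)
    part=$(printf '%s' "$dev" | cut -c4-)
    case $part in
        *[!0-9]*) echo "bad partition in: $1" >&2; return 1 ;;
    esac
    disk=$(( $(printf '%d' "'$letter") - 97 ))     # 'a' is ASCII 97
    if [ -z "$part" ]; then
        echo "(hd$disk)"
    else
        echo "(hd$disk,$(( part - 1 )))"
    fi
}

grub_device /dev/hda1     # (hd0,0)
grub_device /dev/hda5     # (hd0,4)
grub_device /dev/hdb      # (hd1)
```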
Building a kernel for another platform • unix linux # make menuconfig ARCH=x86_64 • /usr/src/linux/arch • alpha cris ia64 mips parisc ppc64 s390x sh64 sparc64 arm i386 m68k mips64 ppc s390 sh sparc x86_64 • Cross-compile • make HOSTCC="gcc -m32" ARCH="x86_64" bzImage • Get a gcc wrapper in order to cross-compile on an i386 host • http://www.jukie.net/~bart/debian/amd64/scripts/gcc.bart
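Whether `ARCH=` has to be given depends on whether the target differs from the machine doing the build. The sketch below illustrates that decision. `kernel_arch` and `bzimage_command` are hypothetical helpers, and only the i?86 to i386 mapping is handled, which is a simplification of what the kernel Makefile does with `uname -m`.

```shell
#!/bin/sh
# Map "uname -m" output to a kernel ARCH directory name. Only the i?86
# case is folded here; other machine names are passed through unchanged
# (a simplification of what the kernel Makefile does).
kernel_arch() {
    printf '%s\n' "$1" | sed -e 's/^i.86$/i386/'
}

# Print the build command: plain "make" for a native build, or with an
# explicit ARCH= when the target differs from the host.
bzimage_command() {
    host=$(kernel_arch "$1")
    target=$2
    if [ "$host" = "$target" ]; then
        echo "make bzImage"
    else
        echo "make ARCH=$target bzImage"
    fi
}

bzimage_command "$(uname -m)" x86_64
```

A real cross-build also needs a compiler that can emit code for the target, which is what the HOSTCC setting and the gcc wrapper on the slide address.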
Linux modules • Driver modules are in /lib/modules
[root@dafinn net]# pwd; ls
/lib/modules/2.4.22-1.2166.nptlsmp/kernel/drivers/net
3c509.o     b44.o    eepro100.o  netconsole.o   pppox.o        tg3.o
3c59x.o     bonding  epic100.o   ns83820.o      ppp_synctty.o  tlan.o
8139cp.o    de4x5.o  ethertap.o  pcmcia         r8169.o        tulip
8139too.o   dl2k.o   fealnx.o    pcnet32.o      sis900.o       tun.o
82596.o     dmfe.o   irda        ppp_async.o    sk98lin        typhoon.o
8390.o      dummy.o  mii.o       ppp_deflate.o  slhc.o         via-rhine.o
acenic.o    e100     natsemi.o   ppp_generic.o  smc9194.o      wireless
amd8111e.o  e1000    ne2k-pci.o  pppoe.o        starfire.o
Reasons for recompiling the kernel • Make the kernel smaller • Add a new device • Modify a system parameter
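In the listing above each 2.4 module is a single `.o` file, while entries without a suffix are subdirectories. As a small sketch, module names can be derived by stripping the suffix. `module_names` is a hypothetical helper, and the demonstration uses empty stand-in files, not real modules.

```shell
#!/bin/sh
# List module names found in a 2.4-style module directory, where each
# module is a single .o file; subdirectories are left out.
module_names() {
    for f in "$1"/*.o; do
        [ -f "$f" ] || continue
        basename "$f" .o
    done
}

# Demonstration on a scratch directory populated with a few of the names
# from the listing above (empty stand-in files, not real modules).
demo=$(mktemp -d)
touch "$demo/3c59x.o" "$demo/tg3.o"
mkdir "$demo/e100"
module_names "$demo"
rm -rf "$demo"
```

On a system of that era the argument would be a path such as /lib/modules/$(uname -r)/kernel/drivers/net.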