Optimizing Android Performance with GCC Compiler
• Name - Geunsik Lim
• e-Mail - leemgs.at.gmail.com
• Nick - invain (인베인)
• Blog - http://blog.naver.com/invain/
• Mar-12-2010, Fri
• This document may be freely modified and redistributed, but any reuse of the material must credit the source at the bottom right.
CONTENTS
Android Technology Session
• Optimization Strategies for the lightweight android
• Android Toolchain Roadmap
• Building Android Toolchain
• GA Search For Compiler Options
• Thoughtful abstraction & specifications
• Profile-Guided Optimization
• FDO Illustration & Performance
• Lightweight IPO (LIPO)
• Redundancy Elimination
• Optimizing Dalvik Memory Management
• Observation of WebView Bench & Fhourstones
• Experimental Result
• Systematic Optimizations
http://leemgs.fedorapeople.org
• Reference: GCC internals manual, Shih-wei Liao's paper, Dan Kegel's crosstool, Fedora 11 documentation (SMP)
What is Optimization?
• In mathematics and computer science, optimization (also called mathematical programming) refers to choosing the best element from some set of available alternatives.
• The first optimization technique, known as steepest descent, goes back to Gauss (mathematician and scientist).
• In practice this means solving problems in which one seeks to minimize or maximize a real function by systematically choosing the values of real or integer variables from within an allowed set.
• Studies in optimizing: code size, performance, power
• [Figure: embedded software size over time, 2000 to 2030]
Where is a Hole for Optimization?
• Application? (Application framework, Application)
• Middleware? (Dalvik, Core/Func lib)
• OS Kernel? (Linux)
• Hardware? (Snapdragon, S5PC1XX)
7 Optimization Strategies 1/2
1) Data-driven tool deployment:
  • Regularly evaluate, then leverage, the winner among optimizing toolchains
2) Judicious abstraction & specifications: A fundamental methodology
  • Visibility of a function should match the API spec in the programmer's design
  • Tradeoff in splitting into Java and native code: this interface affects performance
  • PacketVideo (= OpenCore/OpenMAX; multimedia framework): the semiconductor industry looks for APIs to differentiate
3) Systematic parameter setting: A key driver in performance/size
7 Optimization Strategies 2/2
4) Profile-guided optimizations: A useful methodology
  • Feedback-Directed Optimizations (FDO): Build-Run-Build with our arm-xxx-eabi-gcc
  • Class loading profiler (aka Preload profiler): Zygote's preloading. Trade-off between boot-up time and app init time.
5) Scope-enhancing optimizations: Interprocedural optimizations via arm-xxx-eabi-gcc -fripa
  • In the current implementation, -fripa only turns on cross-module inlining analysis.
6) Redundancy elimination: Identical Code Folding (ICF)
7) Memory management optimization in Dalvik (covered only briefly, in the interest of time)
Data-driven tool deployment
• Analyze the tool candidates (source: Google):
  • Cost competitiveness
  • Product differentiation
• [Charts: size improvement on Dream phone; speedup on Dream phone (run 100X)]
• Google tracks 13 numbers daily; there is room to show only 4 of them here.
Analyze 6 Toolchains
• Based on Google Android perflab benchmark results:
  • Baseline: Donut (ver 1.6) toolchain, gcc-4.2.1
  • Size:
    • Both gcc-4.4.x versions: 17.8% improvement
    • gcc-4.3.3 and the Code Sourcery version of gcc-4.3.3: 15% better
    • gcc-4.3.1: 3% improvement
  • Performance: no significant variance among the 6 toolchains, so gcc-4.4.3's size benefit comes with no performance penalty
• Code Sourcery for ARM has no significant performance or size benefit over Android's version of gcc.
  • Code Sourcery's strength: addressing ARM hardware errata early. Those fixes have to be ported to gcc-4.4.3.
• gcc-4.4.3 wins: the toolchain moved to 4.4.3, skipping 4.3.
Android Toolchain Roadmap
• All pieces come from open source
  • GCC, binutils, gdb, gmp, mpfr
• Patches for bug fixing and optimization
  • Take patches from upstream
  • Submit our patches to upstream
• Native developers can also use the Android NDK
  • http://developer.android.com/sdk/android-2.1.html (API Level 7, Jan 2010)
Latest Android Toolchain
• Google changed the default cross-compiler on Nov-16-2009.
• The default architecture is still armv5te for compatibility.
Building Android Toolchain (1/2)
• Android uses the Bionic C library
  • BSD license: keeps the GPL out of the user's sphere for Android Market.
  • Smaller and faster than glibc and uClibc:
    • glibc 2.11: /lib/libc.so, 1,208,224 bytes
    • uClibc 0.9.30: /lib/libc.so, 424,235 bytes
    • Bionic (Eclair): /system/lib/libc.so, 243,948 bytes
  • Bionic has built-in support for important Android-specific services, e.g., system properties and logging
  • Very limited support for POSIX, C++, etc.
• If libstdc++-v3 is needed:
  • Enable libstdc++-v3 when configuring the toolchain.
  • Statically link in the necessary components.
  • /system/lib/libstdc++.so is only 5,124 bytes, which reduces size dramatically.
Building Android Toolchain (2/2)
• Barebone-style building:
  • Inside the Android tree
  • Specify all system and Bionic header file paths, shared library paths, libgcc.a, crtbegin_*.o, crtend_*.o, etc.
• Standalone-style building:
  • Latest prebuilt gcc-4.4.0 toolchain
  • Convenient for native developers:
      arm-xxx-eabi-gcc -mandroid --sysroot=<path-to-sysroot> hello.c -o hello
    (<path-to-sysroot> is a pre-compiled copy of Bionic)
• Download:
  • Old) http://android.git.kernel.org/?p=platform/prebuilt.git;a=tree;f=linux-x86/toolchain;h=1cf27fca792be850f7b18e0c76762787c7b5c8c9;hb=4b06260a916be762d0dd1b93e97306f1b90e3889
  • Now) http://android.git.kernel.org/pub/?C=M;O=D
Thread functions in Bionic (thread API list, Eclair)
• The Bionic library ships the POSIX C thread functions inside /system/lib/libc.so (see ./bionic/libc/include/pthread.h).
• Android's POSIX thread API does not support pthread_rwlock_*, pthread_rwlockattr_*, pthread_barrier_*, pthread_barrierattr_*, or pthread_spin_* from the POSIX 1003.1j-2000 standard.
• The Android toolchain includes GDB, which uses /system/lib/libthread_db.so for thread debugging of Android applications.
How to compile android source faster 1/2
• Use a multi-core Linux desktop to build Android.
• The purpose of the "make" utility (maintained by Paul Smith) is to determine automatically which pieces of a large program need to be recompiled, and to issue the commands to recompile them.
• The `-j' or `--jobs' option tells make to execute many commands simultaneously.
• The Bash script below applies this idea; as written it builds the kernel uImage target with about 20% more jobs than CPUs.
    F11-invain#> vi build-android-kernel.sh
    #!/bin/bash
    # created by invain for the best performance when compiling kernel source.
    # Count logical CPUs (one "cpu cores" line per processor entry).
    realnum=$(grep -c "cpu cores" /proc/cpuinfo)
    # Add 20% more jobs than CPUs, rounded to the nearest integer.
    bestnum=$(( realnum + $(printf %.0f "$(echo "$realnum*0.2" | bc)") ))
    # -B: SCHED_BATCH policy, -n 1: nice level 1, -e: run the command.
    schedtool -B -n 1 -e make -j "$bestnum" uImage
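The job-count arithmetic in the script can be sketched in C to make the rounding explicit. This is only an illustration: `best_make_jobs` is a hypothetical helper name, and the 20% headroom is simply the factor the script uses, not a general rule.

```c
/* Hypothetical helper mirroring the script's arithmetic:
 * jobs = cores + round(cores * 0.2), computed in integer math. */
int best_make_jobs(int cores)
{
    if (cores < 1) {
        cores = 1;                      /* guard against a failed CPU count */
    }
    int extra = (cores * 2 + 5) / 10;   /* cores * 0.2 rounded to nearest */
    return cores + extra;
}
```

For a quad-core machine this gives 5 jobs, which is the `-j` value marked as the recommendation in the measurements on the next slide.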
How to compile android source faster 2/2
• Evaluation when compiling the full Android sources.
Tested on Intel Core i5 Lynnfield 750 (quad-core @ 2.66 GHz) by DeolPool
    time make -j4  : 19m 10s
    time make -j5  : 18m 52s  (recommended)
    time make -j8  : 19m 15s
    time make -j64 : 19m 54s
Tested on Intel Core2 Quad Yorkfield Q9400 (quad-core @ 2.66 GHz) by invain
    time make -j4  : 22m 49s
    time make -j5  : 22m 31s  (recommended)
    time make -j8  : 28m 47s
    time make -j64 : 51m 19s
How to confirm whether the CPU & Linux are 32-bit or 64-bit
• CPU core specification
    [invain@fedora11 ~]$ grep flag /proc/cpuinfo
    flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm constant_tsc arch_perfmon pebs bts pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 lahf_lm tpr_shadow vnmi flexpriority
  • The lm flag is an abbreviation of "Long Mode" (64-bit).
• Linux kernel information
    [invain@fedora11 ~]$ uname -a
    Linux invain 2.6.33-rt4-smp #1 SMP Tue Feb 2623:11:04 UTC 2010 x86_64 x86_64 x86_64 GNU/Linux
Thoughtful abstraction & specifications
1. Goal: Visibility of a function should match the API spec in the programmer's design.
2. Solution: Systematically apply the 5 steps listed under "The 5 steps". Fundamentally, go through the APIs of each library and consciously decide what should be "public" and what shouldn't.
3. Result: ~500 KB savings for the OpenCore libs.
4. Key: Hidden functions can be garbage collected entirely if they are unused locally.
5. Toolchain options: -ffunction-sections, -Wl,--gc-sections (Android.mk)
The 5 steps:
  Step 1. linux-arm.mk: add -fvisibility=hidden
  Step 2. *.h: add __attribute__((visibility("default"))) to each public function declaration (GCC spells the public visibility "default")
  Step 3. Build: invain@fedora11$> make -j <???>
  Step 4. Read the link errors, e.g. /tmp/GoOgLe.o: In function foo: Bar.c: undefined reference to "baz"
  Step 5. Export the missing symbol, e.g. __attribute__((visibility("default"))) int baz; then repeat steps 3 to 5 until no failure remains
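A minimal sketch of what the annotated sources look like after these steps, assuming the library is built with -fvisibility=hidden. The macro name `API_PUBLIC` and both function names are hypothetical, chosen only for illustration.

```c
/* Assumed build flags for the library: -fvisibility=hidden -ffunction-sections
 * and, at link time, -Wl,--gc-sections. */
#define API_PUBLIC __attribute__((visibility("default")))

/* Not part of the API: hidden under -fvisibility=hidden, so the linker
 * may discard it if nothing inside the library still calls it. */
int helper_scale(int x)
{
    return x * 2;
}

/* Part of the API: explicitly exported from the shared library. */
API_PUBLIC int api_scale_and_offset(int x, int offset)
{
    return helper_scale(x) + offset;
}
```

Only `api_scale_and_offset` would appear in the library's dynamic symbol table; callers outside the library that reference `helper_scale` would hit the "undefined reference" error from step 4.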
Parameter Setting
• Parameter setting is a key driver in performance/size optimizations.
• Case study: for the Android tree, find the best:
  • Compiler parameters
  • Compiler options
• Parameter space exploration via a genetic algorithm (GA).
• What is a genetic algorithm (GA)? A search technique used in computing to find exact or approximate solutions to optimization and search problems. Ref: http://www.genetic-programming.com
GA Search For Compiler Options
• Initialization: create an initial population of randomly generated option sets.
• Selection: drop a portion of the option sets, namely those that build binaries with lower fitness values.
• Reproduction: produce new option sets by crossover and mutation of the remaining ones.
• Termination: stop when an expected result is reached or when there is not enough time left for searching; otherwise go back to selection.
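The loop above can be sketched in a few lines of C. This is a toy under stated assumptions, not the search tool from the talk: each bit of an option set stands for one hypothetical on/off compiler option, and `toy_fitness` is a made-up stand-in for "build the tree with this option set and measure the result".

```c
#include <stdint.h>

#define POP 8        /* population size */
#define NUM_OPTS 8   /* each bit = one hypothetical on/off compiler option */
#define OPT_MASK ((1u << NUM_OPTS) - 1u)

/* Small deterministic generator so runs are repeatable. */
static uint32_t rng_state = 12345u;
static uint32_t next_rand(void)
{
    rng_state = rng_state * 1664525u + 1013904223u;
    return rng_state >> 16;
}

/* Made-up fitness (higher is better); a real search would build and measure. */
int toy_fitness(unsigned opts)
{
    int score = 0;
    if (opts & 1u)   score += 3;
    if (opts & 4u)   score += 2;
    if (opts & 32u)  score += 4;
    if (opts & 128u) score -= 5;
    return score;
}

/* Insertion sort, best option set first (POP is small). */
static void sort_by_fitness(unsigned pop[POP])
{
    for (int i = 1; i < POP; i++) {
        unsigned key = pop[i];
        int j = i - 1;
        while (j >= 0 && toy_fitness(pop[j]) < toy_fitness(key)) {
            pop[j + 1] = pop[j];
            j--;
        }
        pop[j + 1] = key;
    }
}

/* Runs the GA for a fixed number of generations; returns the best set found. */
unsigned ga_search(int generations)
{
    unsigned pop[POP];

    /* Initialization: baseline (no options) plus random option sets. */
    pop[0] = 0u;
    for (int i = 1; i < POP; i++) {
        pop[i] = next_rand() & OPT_MASK;
    }
    for (int g = 0; g < generations; g++) {
        sort_by_fitness(pop);
        /* Selection keeps the top half; reproduction refills the rest. */
        for (int i = POP / 2; i < POP; i++) {
            unsigned a = pop[next_rand() % (POP / 2)];
            unsigned b = pop[next_rand() % (POP / 2)];
            unsigned mask = next_rand() & OPT_MASK;
            unsigned child = (a & mask) | (b & ~mask);   /* uniform crossover */
            child ^= 1u << (next_rand() % NUM_OPTS);     /* one-bit mutation */
            pop[i] = child;
        }
    }
    sort_by_fitness(pop);   /* termination: generation budget is used up */
    return pop[0];
}
```

Because the top half always survives, the best fitness never gets worse from one generation to the next, so the result is at least as good as the baseline option set.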
Options That Control Optimization
These options control various sorts of optimizations.
• "-O0": Reduce compilation time and make debugging produce the expected results. This is the default.
• "-O1": Optimizing compilation takes somewhat more time, and a lot more memory for a large function.
• "-O2": Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. (For kernel/applications.)
• "-O3": Turns on all optimizations specified by -O2 and also turns on the -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload and -ftree-vectorize options.
• "-Os": Optimize for size. -Os enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size.
Reduce Code Size by Option Search
• We search for the configuration that reduces size the most, using the compiler option search approach (unit: byte).
• Android default inline options:
    -finline-functions -fno-inline-functions-called-once
• Options that we found:
    -finline -fno-inline-functions -finline-functions-called-once --param max-inline-insns-auto=62 --param inline-unit-growth=0 --param large-unit-insns=0 --param inline-call-cost=4
• [Chart: native system image size for GCC-4.2.1, GCC-4.4.3, and GCC-4.4.3 (tuned)]
Profile-Guided Optimization: Toolchain enables FDO (Feedback-Directed Optimization)
• Example: either tmp1 or tmp2 must be spilled before tmp3 is defined; profile data can indicate which choice is cheaper.
    tmp1 = ...
    tmp2 = ...
    ...
    tmp3 = ...
    ... = tmp1
    ... = tmp2
    ...
Instrumentation Based FDO
1. Build twice.
2. Find representative input.
3. Instrumentation run: 2~3X slower, but this perturbation is OK because threading in Android is not that time sensitive (after all, the target is an ARM11 or Cortex-A8 core).
4. One profile per file, dumped at application exit.
Workflow:
  Step 1. Build the instrumented binary: arm-xxx-eabi-gcc -fprofile-generate=./profile ...
  Step 2. Run the instrumented binary on representative input data; this produces Profile.zip.
  Step 3. Build the optimized binary with FDO: arm-xxx-eabi-gcc -fprofile-use=./profile.zip ...
• http://gcc.gnu.org/onlinedocs/gcc-4.4.3/gcc.pdf (page 102)
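As a rough illustration of the kind of code that benefits, consider a function whose branch behavior depends on the input. The function name `count_digits` is hypothetical, and a driver `main` that feeds it representative input is assumed but not shown. Compiling with -fprofile-generate, running on that input, and recompiling with -fprofile-use lets the compiler lay out the loop for the branch outcome the profile actually observed.

```c
#include <stddef.h>

/* Counts ASCII digits in a NUL-terminated string.
 * Whether the digit branch is hot or cold depends on the input data,
 * which is exactly what the instrumented run records. */
size_t count_digits(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (*s >= '0' && *s <= '9') {
            n++;    /* taken often for numeric logs, rarely for prose */
        }
    }
    return n;
}
```

If the training input is not representative (for example prose when production data is mostly numeric), the recorded branch weights can steer the optimizer the wrong way, which is why step 2 of the list matters.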
FDO Performance
• Global hotness for ARM (HOT_BB_COUNT_FRACTION; branch prediction routines for the GNU compiler, gcc-4.4.x/gcc/predict.c)
• 1% improvement on Android's skia library.
• Smaller effects on smaller Android benchmarks.
• [Results table not recovered; unit: bytes. Source: Google]
Scope-Enhancing Optimization: Inter-Procedural Optimizations (IPO)
parent.c:
    int bar(int i, int j);   /* defined in child.c */
    int foo(int i, int j)
    {
        return bar(i, j) + bar(j, i);
    }
child.c:
    int bar(int i, int j)
    {
        return i - j;
    }
• The optimization opportunity is decided by the scope of the code the compiler can see.
• Scope is limited mainly by artificial source boundaries; IPO enhances the scope.
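To make the opportunity concrete, the sketch below puts both functions from the slide in one translation unit and adds a hand-written version of what cross-module inlining would expose. `foo_after_inlining` is a hypothetical name used only for comparison; it is not compiler output.

```c
/* child.c from the slide */
int bar(int i, int j)
{
    return i - j;
}

/* parent.c from the slide: two opaque calls when bar() lives elsewhere */
int foo(int i, int j)
{
    return bar(i, j) + bar(j, i);
}

/* Hand-inlined form: once bar()'s body is visible, the expression is
 * (i - j) + (j - i), which is algebraically 0, so an optimizer that sees
 * both modules can remove the calls and most of the arithmetic. */
int foo_after_inlining(int i, int j)
{
    return (i - j) + (j - i);
}
```

With separate compilation the compiler building parent.c cannot see this; it has to emit both calls.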
Problem with Traditional IPO
• Parameter setting is a key driver in performance/size optimizations.
• Case study: for the Android tree, find the best:
  • Compiler parameters
• CMI: Cross Module Inlining
Solution: Profile Feedback Based Lightweight IPO (LIPO)
• To get the best potential out of IPO, integrate IPO with FDO, seamlessly!
  • perf (IPO + FDO) > perf (IPO) + perf (FDO)
• Move Inter-Procedural Analysis (IPA) to the end of the training run execution, into the binary: make global decisions earlier!
• Write the IPA results into the profile.
• During profile-use compilation:
  • Compile each file, as usual, with the augmented profile
  • Read the additional IPA results
  • Pull in auxiliary modules and extend the scope
☞ Memo: http://gcc.gnu.org/wiki/LightweightIpo
LIPO Improves Performance: Use -fripa
• LIPO targets C/C++, and Android uses C/C++ (except for some assembly code).
• Baseline: FDO enabled.
• Degradations are within the noise range.
• We just got the ARM version of LIPO to work:
  • Run: f11#> arm-xxx-eabi-gcc -fprofile-generate=/data/local/profile -fripa -mandroid
  • Then replace -fprofile-generate with -fprofile-use for the final optimizing build.
Performance Evaluation Result
• The Standard Performance Evaluation Corporation (SPEC) is a non-profit corporation formed to establish, maintain and endorse a standardized set of relevant benchmarks that can be applied to the newest generation of high-performance computers. (http://www.spec.org/)
Redundancy Elimination: Identical Code Folding (ICF)
• Identify identical functions and merge them at link time.
• Implemented in the binutils gold linker.
• Triggered with the option --icf.
• Debug support is available through call tables.
• ICF in gold yields a 5% size reduction on x86-64 binaries.
• We are still getting the gold linker to work with Android ARM. We estimate ~5% further Android size reduction on top of garbage collection. Stay tuned.
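A small sketch of the kind of redundancy ICF targets: two functions written for different purposes whose bodies, and therefore whose generated code, can be identical. Both function names are hypothetical.

```c
/* Different names and intent, same body: after compilation both are
 * typically the same instruction sequence, so a folding linker can keep
 * one copy and point both symbols at it. */
int rect_area(int width, int height)
{
    return width * height;
}

int scale_value(int value, int factor)
{
    return value * factor;
}
```

In C++ code bases the same effect often comes from template instantiations that differ only in type name, which is one plausible reason the savings add up across a whole system image.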
Optimizing Memory Management
• Each Dalvik (by Dan Bornstein) virtual machine has its own heap.
• Dalvik uses the dlmalloc API to manage its heap:
  • Allocate memory with mspace_calloc
  • Release memory with mspace_free
• [Diagram: Dalvik "new object" -> mspace_calloc -> Dalvik heap; Dalvik "release object" -> mspace_free -> Dalvik heap]
Various Headrooms for Memory Management Optimizations
• Object allocation log in WebViewBench (class:size); some of the objects have the same size:
    . . .
    [Ljava/util/HashMap$Entry;:24
    Ljava/util/HashMap$Entry;:24
    Landroid/webkit/PerfChecker;:16
    Landroid/webkit/LoadListener;:156
    Landroid/webkit/ByteArrayBuilder;:20
    Ljava/util/LinkedList;:20
    Ljava/util/LinkedList$Link;:20
    Ljava/util/LinkedList;:20
    Ljava/util/LinkedList$Link;:20
    Ljava/lang/String;:24
    Ljava/lang/String;:24
    Landroid/webkit/FrameLoader;:48
    Ljava/lang/String;:24
    . . .
• [Chart: high-ratio object sizes in WebViewBench]
Observation of WebView Bench
• The size ratio between allocation and release is almost the same.
Observation of Fhourstones (FreeBSD benchmarks)
• This integer benchmark solves positions in the game of Connect-4, as played on a vertical 7x6 board.
• The share of objects with size = 44 is extremely high in this case.
• http://homepages.cwi.nl/~tromp/c4/Fhourstones.tar.gz
Many Objects Alloc/Released in Short Time
• Optimization: add a buffer cache of memory chunks.
Buffer Cache: Release
• [Diagram: Dalvik releases a String object (size = 24); the memory chunk (size = 24) is placed in the buffer cache, which sits in front of the Dalvik heap]
Buffer Cache: Allocate
• [Diagram: Dalvik needs a String object (size = 24) and asks the buffer cache "Do you have a memory chunk of size 24?"; a cached memory chunk (size = 24) is handed back without going to the Dalvik heap]
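The release and allocate paths can be sketched as a tiny chunk cache in C. This is a simplified illustration, not Dalvik's implementation: it caches a single size class, uses the standard `calloc`/`free` as stand-ins for `mspace_calloc`/`mspace_free`, and all names and the slot count are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE  24   /* size class from the slides (e.g. String) */
#define CACHE_SLOTS 4    /* tiny for illustration; the slides test 16,384 and 65,536 */

static void *cache[CACHE_SLOTS];
static int cached = 0;

/* Allocate: reuse a cached chunk of the right size if one is available. */
void *cached_alloc(size_t size)
{
    if (size == CHUNK_SIZE && cached > 0) {
        void *chunk = cache[--cached];
        memset(chunk, 0, CHUNK_SIZE);   /* keep calloc-style zeroing */
        return chunk;
    }
    return calloc(1, size);             /* stand-in for mspace_calloc */
}

/* Release: park the chunk in the cache instead of returning it to the heap. */
void cached_free(void *chunk, size_t size)
{
    if (chunk == NULL) {
        return;
    }
    if (size == CHUNK_SIZE && cached < CACHE_SLOTS) {
        cache[cached++] = chunk;
        return;
    }
    free(chunk);                        /* stand-in for mspace_free */
}

/* Number of chunks currently parked in the cache. */
int cached_count(void)
{
    return cached;
}
```

The design trades a bounded amount of held-back memory for skipping the allocator on the hot path, which pays off when many same-sized objects are allocated and released in quick succession.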
Experimental Result 1/2
• Allocation performance improvement in Fhourstones
• Release performance improvement in Fhourstones
• [Charts: x-axis is the number of buffer cache slots (no pool, 16,384, 65,536). Source: Google]
Experimental Result 2/2
• Allocation performance improvement in WebViewBench
• Release performance improvement in WebViewBench
• [Charts: x-axis is the number of buffer cache slots (no pool, 16,384, 65,536). Source: Google]
Summary
• Systematic Optimizations
1. Toolchain: regularly evaluate and leverage
  • E.g., leverage the newest lightweight IPO and ICF
2. There is no substitute for thoughtful abstraction & specifications
3. Systematic parameter setting: a key driver of performance
4. Data-driven: profile it
5. Optimizing memory management time for Android/Dalvik is important.
Quiz#1) Throughput according to /init daemon
./android-2.1/system/core/sh/init.c (minimal bootable environment)
• For the /init executable, the ancestor of all processes on the Android platform, which of the following gives the most ideal QuickBoot at power-on?
  1) Running an init that was built statically
  2) Running an init that was built as a shared (dynamic) executable
  3) Running an init that was built shared and then pre-linked
  4) Building init statically when the Toolbox source is small, and shared when the source is large
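Quiz#2) License Issue of C++ standard lib
• The C++ standard library (/system/lib/libstdc++.so) used in the Android platform's rootfs is GPL-licensed. If so, must the source of userspace code (e.g. *.apk) that links against functions in this library be fully disclosed when a customer requests it?
  1) Of course. If a customer asks, the commercial application must be disclosed.
  2) Android is officially under the Apache License, so nothing has to be disclosed.
  3) Even though the C++ standard library is GPL, printing the full text of the exception clause in the product manual means the application source need not be disclosed to customers.
  4) There is an obligation to disclose to buyers of the application, but requests from non-buyers can be declined.
  5) The application seller can simply change their phone number quickly and go into hiding for a while.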
Quiz#3) How to get the maximum free memory
• Look at the available RAM in the figures below: 54 MB before and 93 MB after, roughly a twofold difference. What is the cause?
• [Figure: Before]
• [Figure: After]