Last Level Cache (LLC) Performance of Data Mining Workloads on a CMP
A Case Study of Parallel Bioinformatics Workloads
Aamer Jaleel, Intel VSSAD / University of MD, ajaleel@eng.umd.edu, Aamer.Jaleel@intel.com
Matthew Mattina, Tilera Corporation, mmattina@tilera.com
Bruce Jacob, University of MD ECE Department, blj@eng.umd.edu
Paper Motivation
• Growth of CMPs and Design Issues
• Growth of Data and Emergence of New Workloads: Recognition, Mining, and Synthesis (RMS) Workloads
[Figure: CPU/cache diagram alongside examples of the world's increasing data: database, medicine, finance, spatial, and stock data]
Paper Contributions • First to characterize the memory behavior of parallel data-mining workloads on a CMP • Bioinformatics workloads • Sharing Analysis: • Varying amounts of data shared between threads • Shared data frequently accessed • Degree of sharing is f(cache size) • Cache Performance Studies: • Private vs. shared cache studies • Greater sharing yields better shared cache performance
Bioinformatics • Using software to understand and analyze biological data • Why bioinformatics? • Sophisticated algorithms and huge data sets • Uses mathematical and statistical methods to solve biological problems • Sequence analysis • Protein structure prediction • Gene classification • And many, many more… Src: http://www.imb-jena.de/~rake/Bioinformatics_WEB
Parallel Bioinformatics Workloads • Structure Learning: • GeneNet – Hill Climbing, Bayesian network learning • SNP – Hill Climbing, Bayesian network learning • SEMPHY – Structural Expectation Maximization algorithm • Optimization: • PLSA – Dynamic Programming • Recognition: • SVM-RFE – Feature Selection • OpenMP workloads developed by Intel Corporation • Donated to Northwestern University, NU-MineBench Suite • http://cucis.ece.northwestern.edu/projects/DMS/MineBench.html • Also made available at: http://www.ece.umd.edu/biobench/
Experimental Methodology - Pin • Pin – x86 Dynamic Binary Instrumentation Tool • Developed at VSSAD, Intel • ATOM-like tool for Intel XScale, IA-32, and IPF Linux binaries • Provides infrastructure for writing program analysis tools – pin tools • Supports instrumentation of multi-threaded workloads • Hosted at: http://rogue.colorado.edu/Pin
The simCMPcache Pin tool • Instruments all memory references of an application • Gathers numerous cache performance statistics • Captures time varying behavior of applications
Measuring Data Sharing
• Shared Cache Line: more than one core accesses the same cache line during its lifetime in the cache
• Shared Access: an access to a shared cache line
• Active-Shared Access: an access to a shared cache line where the last accessing core differs from the current core
• Ex: Core IDs: …1, 2, 2, 2, 1, 3, 4, 3, 2, 2, 2… (accesses shown in red are active-shared accesses)
[Figure: four cores C0–C3 sharing one LLC]
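The classification above can be sketched as a small trace-driven routine (a minimal sketch; the single-line trace and the function name are illustrative, not the paper's actual simCMPcache tool):

```python
def classify_accesses(trace):
    """Classify each (core, line) access as private, shared, or active-shared.

    A line becomes shared once a second core touches it; a shared access is
    active-shared when the previous toucher of that line was a different core.
    """
    touched_by = {}   # line -> set of cores that have accessed it
    last_core = {}    # line -> core that accessed it most recently
    labels = []
    for core, line in trace:
        owners = touched_by.setdefault(line, set())
        owners.add(core)
        if len(owners) > 1:
            if last_core.get(line) != core:
                labels.append("active-shared")
            else:
                labels.append("shared")
        else:
            labels.append("private")
        last_core[line] = core
    return labels

# The slide's example sequence of core IDs, all hitting one cache line:
trace = [(c, 0x40) for c in (1, 2, 2, 2, 1, 3, 4, 3, 2, 2, 2)]
print(classify_accesses(trace))
```

On this trace, every access after a core change on the shared line is classified active-shared, matching the slide's definition.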
Data Sharing Behavior
[Figure: "How Much Shared?" and "Access Frequency" for PLSA, GeneNet, SEMPHY, SNP, and SVM across LLC sizes 4MB, 8MB, 16MB, 32MB, and 64MB (4-threaded runs), broken down by degree of sharing (1–4 threads) and cache misses]
• Sharing is dependent on the algorithm and varies with cache size
• Workloads fully utilize a 64MB LLC
• Reducing cache misses improves data sharing
• Despite the size of the shared footprint, shared data is frequently referenced
Sharing Is Phase Dependent & f(cache size)
[Figure: "How Much Shared?" over time for (a) SEMPHY and (b) SVM, 4-threaded runs, at 4MB, 16MB, and 64MB LLCs; sequential phases marked; sharing broken down by degree (1–4 threads)]
Shared/Private Cache – SEMPHY
[Figure: miss rate vs. total instructions (billions) for a private configuration (16MB total LLC, 4MB/core) and a shared 16MB LLC]
• SEMPHY with 4 threads
• Shared cache outperforms private caches
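The private-vs-shared comparison can be illustrated with a toy LRU cache model (a minimal sketch under stated assumptions: fully associative caches, a synthetic round-robin trace, and tiny capacities in units of lines; none of this is the paper's simulator or configuration):

```python
from collections import OrderedDict

class LRUCache:
    """Fully associative LRU cache of `capacity` lines; counts misses."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.misses = 0

    def access(self, line):
        if line in self.lines:
            self.lines.move_to_end(line)       # hit: refresh LRU position
        else:
            self.misses += 1                   # miss: fetch, evicting LRU if full
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)
            self.lines[line] = True

def run(trace, num_cores, total_lines, shared):
    """Replay a (core, line) trace through one shared LLC or per-core private LLCs."""
    if shared:
        llc = LRUCache(total_lines)
        caches, unique = [llc] * num_cores, [llc]
    else:
        caches = [LRUCache(total_lines // num_cores) for _ in range(num_cores)]
        unique = caches
    for core, line in trace:
        caches[core].access(line)
    return sum(c.misses for c in unique)

# Four cores repeatedly walking the same 16-line shared working set.
trace = [(core, line) for _ in range(4) for line in range(16) for core in range(4)]
shared_misses = run(trace, num_cores=4, total_lines=16, shared=True)
private_misses = run(trace, num_cores=4, total_lines=16, shared=False)
print(shared_misses, private_misses)  # 16 cold misses vs. 256 (every access misses)
```

With a heavily shared working set, the shared LLC holds one copy of the data for all cores, while each 4-line private cache thrashes on the 16-line set, echoing the slide's conclusion.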
Shared Refs & Shared Caches… (4 Threaded Run)
[Figure: GeneNet – 16MB LLC; % of total accesses by degree of sharing (1–4 threads) and cache misses, with private and shared LLC miss rates across phases A and B]
• Phase A: shared caches perform better than private caches (25%)
• Phase B: shared caches marginally better than private caches (5%)
• Shared caches are BETTER when shared data is frequently referenced
• Most workloads frequently reference shared data
Summary • This Paper: • Memory behavior of parallel bioinformatics workloads • Key Points: • Workloads exhibit a large amount of data sharing • Data sharing is a function of the total cache available • Eliminating cache misses improves data sharing • Shared data is frequently referenced • Shared caches outperform private caches, especially when shared data is frequently used
Ongoing Work on Bio-Workloads
• University of Maryland BioBench: A Benchmark Suite for Bioinformatics Applications
• BioParallel: Parallel Bioinformatics Applications Suite (In Progress)
• Brought to you by Maryland Memory-Systems Research
• "BioBench: A benchmark suite of bioinformatics applications." K. Albayraktaroglu, A. Jaleel, X. Wu, M. Franklin, B. Jacob, C.-W. Tseng, and D. Yeung. Proc. 2005 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2005), pp. 2-9, Austin, TX, March 2005.
• http://www.ece.umd.edu/biobench/