Harvesting the Opportunity of GPU-based Acceleration
Matei Ripeanu, Networked Systems Laboratory (NetSysLab), University of British Columbia
Joint work with Abdullah Gharaibeh and Samer Al-Kiswany
Networked Systems Laboratory (NetSysLab), University of British Columbia:
a golf course … a (nudist) beach … (and 199 days of rain each year)
Hybrid architectures in Top 500 [Nov’10]
Hybrid architectures
• High compute power / memory bandwidth
• Energy efficient [operated today at low efficiency]
Agenda for this talk
• GPU architecture intuition: what generates the above characteristics?
• Progress on efficiently harnessing hybrid (GPU-based) architectures
Acknowledgement: the GPU-architecture-intuition slides are borrowed from a presentation by Kayvon Fatahalian
Idea #3: Feed the cores with data. The processing elements are data hungry! Solution: a wide, high-throughput memory bus.
Idea #4: Hide memory access latency through hardware-supported multithreading (10,000x parallelism!).
The Resulting GPU Architecture: NVIDIA Tesla C2050
[Block diagram: a host machine connected over PCIe to the GPU; the GPU contains multiprocessors 1..N, each with cores 1..M, registers, an instruction unit and shared memory, plus constant, texture and global memory shared by all multiprocessors]
• 448 cores
• Four 'memories' (see the sketch below)
  • Shared: fast (~4 cycles), small (48KB)
  • Global: slow (400-600 cycles), large (up to 3GB), high throughput (~150GB/s)
  • Texture: read-only
  • Constant: read-only
• Hybrid: host and GPU connected by PCIe x16 (~4GB/s)
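To make the memory hierarchy above concrete, here is a minimal CUDA sketch (not part of the original talk; the kernel and variable names are illustrative): each thread block stages a tile of data from the large but slow global memory into the small, fast shared memory, reduces it there, and writes a single result back.

    // Illustrative CUDA kernel: stage a tile from global memory (slow, 400-600
    // cycles) into shared memory (fast, ~4 cycles), reduce it there, and write
    // one result per block back to global memory.
    #include <cuda_runtime.h>

    #define TILE 256

    __global__ void tileSum(const float *in, float *out, int n) {
        __shared__ float tile[TILE];                      // on-chip, per-multiprocessor
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;       // one coalesced global read
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {    // block-wide reduction in shared memory
            if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) out[blockIdx.x] = tile[0];  // one global write per block
    }

    int main() {
        const int n = 1 << 20;
        float *d_in, *d_out;
        cudaMalloc(&d_in, n * sizeof(float));
        cudaMalloc(&d_out, (n / TILE) * sizeof(float));
        // Input would normally be copied from host memory over the PCIe link here.
        tileSum<<<n / TILE, TILE>>>(d_in, d_out, n);
        cudaDeviceSynchronize();
        cudaFree(d_in); cudaFree(d_out);
        return 0;
    }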
GPUs offer different characteristics
• High peak compute power, high peak memory bandwidth
• But: high host-device communication overhead, limited memory space, complex to program
Projects at NetSysLab@UBC (http://netsyslab.ece.ubc.ca)
• Porting applications to efficiently exploit GPU characteristics
  • Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
  • Accelerating Sequence Alignment on Hybrid Architectures, A. Gharaibeh, M. Ripeanu, Scientific Computing Magazine, January/February 2011
• Middleware runtime support to simplify application development
  • CrystalGPU: Transparent and Efficient Utilization of GPU Power, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, TR
• GPU-optimized building blocks: data structures and libraries
  • GPU Support for Batch Oriented Workloads, L. Costa, S. Al-Kiswany, M. Ripeanu, IPCCC'09
  • Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
  • A GPU Accelerated Storage System, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, HPDC'10
  • On GPU's Viability as a Middleware Accelerator, S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, M. Ripeanu, JoCC'08
Motivating Question: How should we design applications to efficiently exploit GPU characteristics?
Context:
• A bioinformatics problem: sequence alignment
• A string matching problem
• Data intensive (~10^2 GB)
Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
Past work: sequence alignment on GPUs
MUMmerGPU [Schatz 07, Trapnell 09]: a GPU port of the sequence alignment tool MUMmer [Kurtz 04]
• ~4x speedup (end-to-end) compared to the CPU version
• More than 50% of the execution time is overhead
Hypothesis: mismatch between the core data structure (suffix tree) and GPU characteristics
Idea: trade off time for space
• Use a space-efficient data structure (though from a higher computational complexity class): the suffix array (see the sketch below)
• 4x speedup compared to the suffix tree-based GPU implementation
• Significant overhead reduction
Consequences:
• Opportunity to exploit multi-GPU systems, as I/O is less of a bottleneck
• Focus shifts towards optimizing the compute stage
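As a rough illustration of the space argument (a sketch, not the paper's implementation; the function name is made up): a suffix array stores one integer index per reference position, roughly 4 bytes per symbol versus ~20 bytes per symbol for the suffix tree, and it can be built with a plain sort.

    // Illustrative host-side construction of a suffix array: one 4-byte index
    // per reference position. Real implementations use faster construction
    // algorithms; this naive version is only for intuition.
    #include <algorithm>
    #include <cstring>
    #include <string>
    #include <vector>

    std::vector<int> buildSuffixArray(const std::string &ref) {
        std::vector<int> sa(ref.size());
        for (size_t i = 0; i < ref.size(); ++i) sa[i] = (int)i;
        std::sort(sa.begin(), sa.end(), [&ref](int a, int b) {
            return std::strcmp(ref.c_str() + a, ref.c_str() + b) < 0;
        });
        return sa;   // sa[k] = starting position of the k-th smallest suffix
    }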
Outline for the rest of this talk
• Sequence alignment: background and offloading to the GPU
• Space/time trade-off analysis
• Evaluation
Background: the Sequence Alignment Problem
Problem: find where each query most likely originated from.
• Queries: ~10^8 queries, 10^1 to 10^2 symbols per query
• Reference: 10^6 to 10^11 symbols (up to ~400GB)
[Figure: example short query fragments (e.g., ...TAGGC TGCGC..., ...GGCTA ATGCG...) aligned against a reference such as ..CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG..]
GPU Offloading: Opportunity and Challenges
Opportunity: sequence alignment is easy to partition and memory intensive; the GPU is massively parallel and has high memory bandwidth.
Challenges:
• Data intensive workload
• Large output size
• Limited memory space
• No direct access to other I/O devices (e.g., disk)
GPU Offloading: addressing the challenges
• Data-intensive problem and limited memory space => divide and compute in rounds; use search-optimized data structures
• Large output size => compressed output representation (decompressed on the CPU)
High-level algorithm (executed on the host); a CUDA sketch of this loop follows below:
  subrefs = DivideRef(ref)
  subqrysets = DivideQrys(qrys)
  foreach subqryset in subqrysets {
      results = NULL
      CopyToGPU(subqryset)
      foreach subref in subrefs {
          CopyToGPU(subref)
          MatchKernel(subqryset, subref)
          CopyFromGPU(results)
      }
      Decompress(results)
  }
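The same round-based loop expressed as a hypothetical CUDA host program, to show where the host-device copies and the kernel launch sit; the kernel body, buffer sizes and names are placeholders, not the actual MUMmerGPU or SC'10 code.

    #include <vector>
    #include <cuda_runtime.h>

    // Placeholder kernel: one thread per query; the real kernel searches the index.
    __global__ void MatchKernel(const char *qrys, int nQrys,
                                const char *subref, int refLen, int *results) {
        int q = blockIdx.x * blockDim.x + threadIdx.x;
        if (q < nQrys) results[q] = 0;        // real code records match positions
    }

    int main() {
        const int nQrys = 1 << 16, qryBytes = nQrys * 64;
        const int nRounds = 4, refChunk = 1 << 20;
        std::vector<char> h_qrys(qryBytes), h_ref((size_t)nRounds * refChunk);
        std::vector<int>  h_results(nQrys);

        char *d_qrys, *d_subref; int *d_results;
        cudaMalloc(&d_qrys, qryBytes);
        cudaMalloc(&d_subref, refChunk);
        cudaMalloc(&d_results, nQrys * sizeof(int));

        // One query subset shown; CopyToGPU(subqryset) in the pseudocode above.
        cudaMemcpy(d_qrys, h_qrys.data(), qryBytes, cudaMemcpyHostToDevice);
        for (int r = 0; r < nRounds; ++r) {   // compute in rounds over reference chunks
            cudaMemcpy(d_subref, h_ref.data() + (size_t)r * refChunk, refChunk,
                       cudaMemcpyHostToDevice);
            MatchKernel<<<(nQrys + 255) / 256, 256>>>(d_qrys, nQrys,
                                                      d_subref, refChunk, d_results);
            cudaMemcpy(h_results.data(), d_results, nQrys * sizeof(int),
                       cudaMemcpyDeviceToHost);
            // Decompress(results) from the pseudocode would run here, on the CPU.
        }
        cudaFree(d_qrys); cudaFree(d_subref); cudaFree(d_results);
        return 0;
    }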
The core data structure
A massive number of queries and a long reference => pre-process the reference into an index.
Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])
• Search: O(qry_len) per query (efficient)
• Space: O(ref_len), but the constant is high, ~20x ref_len (expensive)
• Post-processing: DFS traversal per query, O(4^(qry_len - min_match_len)) (expensive)
A better matching data structure? Suffix array vs. suffix tree
• Impact 1: reduced communication (less data to transfer)
• Impact 2: better data locality (space for longer sub-references => fewer processing rounds), achieved at the cost of additional per-thread processing time (see the search sketch below)
• Impact 3: lower post-processing overhead
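For intuition on where the additional per-thread time comes from (a sketch only, not the SC'10 kernel, and it covers exact prefix lookup rather than the full maximal-match computation): matching a query against a suffix array is a binary search over suffix ranks, roughly O(qry_len * log ref_len) comparisons per query instead of the suffix tree's O(qry_len) walk.

    // Illustrative lower-bound binary search over a suffix array.
    // Callable from a kernel (one query per thread) or from host code.
    __host__ __device__
    int firstCandidate(const char *ref, int refLen, const int *sa,
                       const char *qry, int qryLen) {
        int lo = 0, hi = refLen;                           // search over suffix ranks [lo, hi)
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            const char *suf = ref + sa[mid];
            int sufLen = refLen - sa[mid];
            int cmp = 0;
            for (int i = 0; i < qryLen && cmp == 0; ++i)   // compare up to qry_len symbols
                cmp = (i >= sufLen) ? 1 : (int)qry[i] - (int)suf[i];
            if (cmp > 0) lo = mid + 1; else hi = mid;
        }
        return lo;  // rank of the first suffix >= qry; caller verifies the prefix match
    }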
Evaluation setup
• Testbed
  • Low-end: GeForce 9800 GX2 GPU (512MB)
  • High-end: Tesla C1060 (4GB)
• Baseline: suffix tree on the GPU (MUMmerGPU [Schatz 07, 09])
• Success metrics
  • Performance
  • Energy consumption
• Workloads (NCBI Trace Archive, http://www.ncbi.nlm.nih.gov/Traces)
Dissecting the overheads: significant reduction in data transfers and post-processing.
Workload: HS1, ~78M queries, ~238M reference length, on the GeForce.
Comparing with CPU performance [baseline: single-core CPU performance]
[Chart comparing the suffix tree and suffix array implementations against the single-core baseline]
Summary
• GPUs have drastically different performance characteristics
• Reconsidering the choice of data structure is necessary when porting applications to the GPU
• A good matching data structure ensures:
  • Low communication overhead
  • Data locality (possibly achieved at the cost of additional per-thread processing time)
  • Low post-processing overhead
Code, benchmarks and papers available at: netsyslab.ece.ubc.ca