Size Matters: Space/Time Tradeoffs to Improve GPGPU Application Performance
Abdullah Gharaibeh, Matei Ripeanu
NetSysLab, The University of British Columbia
GPUs offer different characteristics:
• High peak compute power, but high communication overhead
• High peak memory bandwidth, but limited memory space
Implication: careful tradeoff analysis is needed when porting applications to GPU-based platforms
Motivating question: How should we design applications to efficiently exploit GPU characteristics?
Context:
• A bioinformatics problem: sequence alignment
• A string matching problem
• Data intensive (~10^2 GB)
Past work: sequence alignment on GPUs
MUMmerGPU [Schatz 07, Trapnell 09]:
• A GPU port of the sequence alignment tool MUMmer [Kurtz 04]
• ~4x speedup (end-to-end) compared to the CPU version
• However, more than 50% of the run time is spent on overheads rather than matching
Hypothesis: mismatch between the core data structure (suffix tree) and GPU characteristics
Idea: trade off time for space
• Use a space-efficient data structure (though from a higher computational complexity class): the suffix array
• Result: 4x speedup compared to the suffix tree-based GPU implementation, with a significant reduction in overheads
Consequences:
• Opportunity to exploit multi-GPU systems, as I/O is less of a bottleneck
• Focus is shifted towards optimizing the compute stage
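A back-of-the-envelope sketch of why the index footprint matters; the constants are assumptions (the ~20 bytes per reference symbol the talk cites for the suffix tree, reference text plus 4-byte offsets for the suffix array, the two GPU memory sizes from the evaluation, and a guess that roughly half the device memory is left for queries and results):

    // Back-of-the-envelope: index footprint vs. number of GPU processing rounds.
    // All constants below are illustrative assumptions, not measured values.
    #include <cstdio>

    int main() {
      const double tree_bytes_per_symbol  = 20.0;       // ~20x ref_len (suffix tree)
      const double array_bytes_per_symbol = 1.0 + 4.0;  // text + 32-bit offsets (suffix array)
      const double ref_len = 1e9;                       // example: 10^9-symbol reference

      const double gpu_mem[] = {512e6, 4e9};            // GeForce 9800 GX2, Tesla C1060
      for (double mem : gpu_mem) {
        const double usable = 0.5 * mem;  // assume ~half the memory holds queries/results
        printf("GPU mem %.1f GB: ~%.0f rounds with the tree vs ~%.0f with the array\n",
               mem / 1e9, ref_len * tree_bytes_per_symbol / usable,
               ref_len * array_bytes_per_symbol / usable);
      }
      return 0;
    }

Under these assumptions, a smaller index means both less total data shipped over the PCIe bus and fewer processing rounds per query batch, which is what shifts the bottleneck away from I/O and towards the compute stage.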
Outline • Sequence alignment: background and offloading to GPU • Space/Time trade-off analysis • Evaluation
Background: the sequence alignment problem
Find where each query most likely originated from
• Queries: ~10^8 queries, 10^1 to 10^2 symbols per query
• Reference: 10^6 to 10^11 symbols
[Figure: short query fragments (e.g., ...GGCTA..., ...CCTA..., TTTGCGG...) aligned against a long reference sequence ..CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG..]
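To make the task concrete, a toy C++ sketch of the matching problem (illustrative only: queries and reference are the tiny fragments from the figure above, and real alignment tools report maximal exact matches above a minimum length rather than whole-query hits):

    // Toy version of the alignment task: report where each query occurs in the
    // reference. At realistic scale (~10^8 queries, reference up to ~10^11
    // symbols) this naive scan is infeasible, motivating an index over the
    // reference.
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
      const std::string ref = "CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG";
      const std::vector<std::string> queries = {"GGCTA", "CCTA", "TTTGCGG"};

      for (const auto& q : queries) {
        // Naive scan: O(ref_len * qry_len) per query.
        const std::size_t pos = ref.find(q);
        if (pos != std::string::npos)
          std::cout << q << " originates at offset " << pos << "\n";
        else
          std::cout << q << " not found\n";
      }
    }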
GPU Offloading: opportunity and challenges
Opportunity:
• Sequence alignment is easy to partition and memory intensive
• GPUs are massively parallel and offer high memory bandwidth
Challenges:
• Data intensive
• Large output size
• Limited memory space
• No direct access to other I/O devices (e.g., disk)
GPU Offloading: addressing the challenges
• Data-intensive problem and limited memory space => divide and compute in rounds
• Large output size => compressed output representation (decompressed on the CPU)
High-level algorithm (executed on the host):

    subrefs = DivideRef(ref)
    subqrysets = DivideQrys(qrys)
    foreach subqryset in subqrysets {
        results = NULL
        CopyToGPU(subqryset)
        foreach subref in subrefs {
            CopyToGPU(subref)
            MatchKernel(subqryset, subref)
            CopyFromGPU(results)
        }
        Decompress(results)
    }
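A minimal CUDA sketch of this round-based loop; the kernel body, chunk sizes, and data layout are hypothetical stand-ins (not MUMmerGPU's implementation), but the host-side structure mirrors the pseudocode above:

    // Minimal CUDA sketch of "divide and compute in rounds" (illustrative only).
    #include <cstdio>
    #include <cuda_runtime.h>
    #include <vector>

    // Placeholder kernel: the real tool matches queries against a pre-built
    // sub-reference index; here each thread naively counts exact occurrences
    // of its query and writes a compact per-query result.
    __global__ void MatchKernel(const char* subref, int subref_len,
                                const char* qrys, int qry_len, int num_qrys,
                                int* results) {
      int q = blockIdx.x * blockDim.x + threadIdx.x;
      if (q >= num_qrys) return;
      int count = 0;
      for (int i = 0; i + qry_len <= subref_len; ++i) {
        int j = 0;
        while (j < qry_len && subref[i + j] == qrys[q * qry_len + j]) ++j;
        if (j == qry_len) ++count;
      }
      results[q] = count;  // compact result; full output is decoded on the CPU
    }

    int main() {
      const int qry_len = 8, num_qrys = 1 << 10, subref_len = 1 << 20;
      const int num_subrefs = 4, num_subqrysets = 2;  // divide and compute in rounds

      // Toy host data standing in for DivideRef(ref) / DivideQrys(qrys).
      std::vector<char> h_subref(subref_len, 'A');
      std::vector<char> h_qrys(num_qrys * qry_len, 'A');
      std::vector<int>  h_results(num_qrys);

      char *d_subref, *d_qrys; int *d_results;
      cudaMalloc((void**)&d_subref, subref_len);
      cudaMalloc((void**)&d_qrys, h_qrys.size());
      cudaMalloc((void**)&d_results, num_qrys * sizeof(int));

      for (int s = 0; s < num_subqrysets; ++s) {            // foreach subqryset
        cudaMemcpy(d_qrys, h_qrys.data(), h_qrys.size(), cudaMemcpyHostToDevice);
        for (int r = 0; r < num_subrefs; ++r) {             // foreach subref
          cudaMemcpy(d_subref, h_subref.data(), subref_len, cudaMemcpyHostToDevice);
          MatchKernel<<<(num_qrys + 255) / 256, 256>>>(d_subref, subref_len,
                                                       d_qrys, qry_len, num_qrys,
                                                       d_results);
          cudaMemcpy(h_results.data(), d_results, num_qrys * sizeof(int),
                     cudaMemcpyDeviceToHost);
          // Decompress(results) would run here on the CPU in the real pipeline.
        }
      }
      printf("matches for query 0: %d\n", h_results[0]);
      cudaFree(d_subref); cudaFree(d_qrys); cudaFree(d_results);
      return 0;
    }

Note how every round pays for host-to-device transfers of the sub-reference and device-to-host transfers of the results, which is exactly the overhead the space/time trade-off targets.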
The core data structure
A massive number of queries and a long reference => pre-process the reference into an index
Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])
• Search: O(qry_len) per query
• Space: O(ref_len), but the constant is high: ~20x ref_len (for example, a 10^9-symbol reference needs on the order of 20 GB of index, far more than fits in GPU memory)
• Post-processing: O(4^(qry_len - min_match_len)), a DFS traversal per query
Mapping these costs onto the offloading algorithm above: the search kernel itself is efficient, but the suffix tree's large footprint makes the CopyToGPU transfers expensive, and the DFS-based post-processing (Decompress on the CPU) is expensive as well.
A better matching data structure: the suffix array (vs. the suffix tree)
• Impact 1: reduced communication, since there is less data to transfer
• Impact 2: better data locality; space for longer sub-references means fewer processing rounds, achieved at the cost of additional per-thread processing time
• Impact 3: lower post-processing overhead
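To show the shape of the array-based alternative, a minimal host-side C++ sketch (illustrative, not MUMmerGPU's code, and built naively): the index is just the reference text plus a sorted array of suffix offsets, so it is compact and trivial to ship to the device; the price is a binary search of roughly O(qry_len x log ref_len) per query instead of O(qry_len) with the tree.

    // Minimal suffix-array matching sketch (illustrative; not MUMmerGPU code).
    #include <algorithm>
    #include <iostream>
    #include <numeric>
    #include <string>
    #include <vector>

    // Build a suffix array naively: sort suffix start positions by suffix content.
    // Real implementations use O(ref_len) or O(ref_len log ref_len) construction.
    std::vector<int> BuildSuffixArray(const std::string& ref) {
      std::vector<int> sa(ref.size());
      std::iota(sa.begin(), sa.end(), 0);
      std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return ref.compare(a, std::string::npos, ref, b, std::string::npos) < 0;
      });
      return sa;
    }

    // Find one occurrence of qry via binary search over the sorted suffixes:
    // O(qry_len * log ref_len) per query, vs. O(qry_len) with a suffix tree.
    int FindOccurrence(const std::string& ref, const std::vector<int>& sa,
                       const std::string& qry) {
      auto it = std::lower_bound(sa.begin(), sa.end(), qry,
                                 [&](int pos, const std::string& q) {
                                   return ref.compare(pos, q.size(), q) < 0;
                                 });
      if (it != sa.end() && ref.compare(*it, qry.size(), qry) == 0) return *it;
      return -1;  // not found
    }

    int main() {
      const std::string ref = "CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG";
      const std::vector<int> sa = BuildSuffixArray(ref);
      std::cout << FindOccurrence(ref, sa, "GCGCCCTA") << "\n";  // prints 13
    }

The flat layout (one integer per symbol plus the text) is also what gives the better data locality noted above: each binary-search probe touches contiguous memory, whereas tree traversal chases pointers.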
Evaluation setup
• Testbed
  • Low-end GeForce 9800 GX2 GPU (512 MB)
  • High-end Tesla C1060 (4 GB)
• Baseline: suffix tree on the GPU (MUMmerGPU [Schatz 07, 09])
• Success metrics
  • Performance
  • Energy consumption
• Workloads (NCBI Trace Archive, http://www.ncbi.nlm.nih.gov/Traces)
Dissecting the overheads
Significant reduction in data transfers and post-processing
Workload: HS1, ~78M queries, ~238M-symbol reference, on the GeForce 9800 GX2
Summary
• GPUs have drastically different performance characteristics
• Reconsidering the choice of data structure is necessary when porting applications to the GPU
• A well-matched data structure ensures:
  • Low communication overhead
  • Data locality (possibly achieved at the cost of additional per-thread processing time)
  • Low post-processing overhead