A Case Study of Parallel I/O for Biological Sequence Search on Linux Clusters Hong Jiang (Yifeng Zhu, Xiao Qin, and David Swanson) Department of Computer Science and Engineering, University of Nebraska – Lincoln, April 21, 2004
Motivation (1): Cluster Computing Trends • Cluster computing is the fastest-growing computing platform. • 208 clusters were listed in the Top 500 as of Dec. 2003. • The power of cluster computing tends to be bounded by I/O.
Motivation (2) • Biological sequence search is bounded by I/O • Databases are exploding in size • Growing at an exponential rate • Exceeding memory capacity and causing thrashing • The widening speed gap between CPU and I/O performance leaves CPU utilization low when memory is scarce (figure: CPU utilization % vs. database size) • Goals • I/O characterization of biological sequence search • Faster search by using parallel I/O • Using inexpensive Linux clusters • An inexpensive alternative to SAN or high-end RAID
Talk Overview • Background: biological sequence search • Parallel I/O implementation • I/O characterizations • Performance evaluation • Conclusions
Biological Sequence Search • Unraveling the mysteries of life by screening databases • Retrieving similar sequences from existing databases is important: given new, unknown sequences, biologists are eager to find homologous sequences. • Used in AIDS and cancer studies (Figure: unknown sequences are searched against databases to retrieve homologous sequences; sequence similarity suggests structural and functional similarity.)
Parallel BLAST • BLAST (Basic Local Alignment Search Tool) • Most widely used similarity search tool • A family of sequence database search algorithms • Both queries and databases can be DNA or protein • Sequence comparison is embarrassingly parallel: divide and conquer • Replicate the database but split the query sequence • Examples: HT-BLAST, cBLAST, U. Iowa BLAST, TurboBLAST • Split the database but replicate the query sequence • Examples: mpiBLAST, TurboBLAST, wuBLAST
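As an aside, here is a minimal sketch of how the "split database" strategy might divide work across workers. The even split by sequence count and the function names are illustrative assumptions, not mpiBLAST's actual fragmentation code:

```c
/* Sketch: block-partition a database of nseq sequences across nworkers
 * workers ("split database, replicate query").  The even split and the
 * names below are illustrative, not mpiBLAST's real fragmentation logic. */
#include <stdio.h>

static void fragment_range(long nseq, int nworkers, int rank,
                           long *first, long *count)
{
    long base  = nseq / nworkers;   /* sequences every worker gets          */
    long extra = nseq % nworkers;   /* remainder spread over the low ranks  */
    *first = rank * base + (rank < extra ? rank : extra);
    *count = base + (rank < extra ? 1 : 0);
}

int main(void)
{
    long nseq = 1760000;            /* roughly the size of the nt database  */
    int  nworkers = 8;
    for (int rank = 0; rank < nworkers; rank++) {
        long first, count;
        fragment_range(nseq, nworkers, rank, &first, &count);
        printf("worker %d: sequences [%ld, %ld)\n", rank, first, first + count);
    }
    return 0;
}
```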
Large Biology Databases • Challenge: databases are large and doubling in size every 14 months. • Public domain • NCBI GenBank • 26 million sequences • Largest database 8 GBytes • EMBL • 32 million sequences • 120 databases • Commercial domain • Celera Genomics • DoubleTwist • LabBook • Incyte Genomics
Parallel I/O: Alleviate the I/O Bottleneck in Clusters (Figure: a compute node driving multiple disks in parallel.) The following bandwidths were measured on the PrairieFire Cluster at the University of Nebraska: • PCI bus: 264 MB/s peak (33 MHz × 64 bits); measured W: 209 MB/s, R: 236 MB/s • Memory: W: 592 MB/s, R: 464 MB/s, copy: 316 MB/s • Disk: W: 32 MB/s, R: 26 MB/s
Overview of CEFT-PVFS • Motivations • High throughput: exploit parallelism • High reliability: assuming the MTTF of one node (disk) is 5 years, the MTTF of a 128-node PVFS is MTTF of 1 node ÷ 128 ≈ 2 weeks (see the calculation below) • Key features • Extension of PVFS • RAID-10 style fault tolerance • Free storage service, without any additional hardware costs • Pipelined duplication • "Leases" used to ensure consistency • Four write protocols to strike different balances between reliability and write throughput • Doubled parallelism for read operations • Skip hot spots for reads
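A rough check of the two-week figure, assuming node failures are independent with exponentially distributed lifetimes (so the MTTF of a series system of N nodes is the single-node MTTF divided by N):

```latex
\mathrm{MTTF}_{\mathrm{PVFS}} = \frac{\mathrm{MTTF}_{\mathrm{node}}}{N}
  = \frac{5\ \text{years}}{128}
  \approx \frac{5 \times 365\ \text{days}}{128}
  \approx 14\ \text{days} \approx 2\ \text{weeks}
```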
Double Read Parallelism in CEFT-PVFS (Figure: a client node connects through a Myrinet switch to two mirrored server groups; the primary group, servers 1 and 2, and the backup group, servers 1' and 2', each hold stripes 0-11.) Interleaved data from both groups are read simultaneously to boost the peak read performance.
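A minimal sketch of the interleaving idea in the figure: stripes of a large read alternate between the primary and backup groups so that all four data servers serve the request at once. The round-robin policy and layout constants are assumptions for illustration, not CEFT-PVFS's actual client code:

```c
/* Sketch: for each stripe of a large read, pick which mirror group and
 * which server inside that group to fetch it from.  Layout and policy
 * are illustrative only. */
#include <stdio.h>

enum { STRIPES = 12, SERVERS_PER_GROUP = 2 };

int main(void)
{
    for (int stripe = 0; stripe < STRIPES; stripe++) {
        int server = stripe % SERVERS_PER_GROUP;        /* server holding the stripe */
        int group  = (stripe / SERVERS_PER_GROUP) % 2;  /* alternate primary/backup  */
        printf("stripe %2d -> %s server %d\n",
               stripe, group == 0 ? "primary" : "backup", server + 1);
    }
    return 0;
}
```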
Skip Heavily Loaded Server (Figure: the same mirrored layout; when one data server becomes heavily loaded, the client reads that server's stripes from its mirror in the other group instead.)
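A minimal sketch of the hot-spot skipping policy in the figure: if the server that would normally serve a stripe reports a high load, the client reads the same stripe from its mirror in the other group. The load numbers, threshold, and helper names are made up for illustration:

```c
/* Sketch: per-stripe fallback to the mirror copy when the preferred data
 * server looks overloaded.  Loads and threshold are hypothetical. */
#include <stdio.h>

#define LOAD_THRESHOLD 0.80               /* assumed cutoff, not from the paper */

/* load[group][server]: hypothetical utilization reported by each data server */
static const double load[2][2] = {
    { 0.95, 0.30 },                       /* primary group: server 1 is hot     */
    { 0.25, 0.40 },                       /* backup group                       */
};

static int pick_group(int preferred, int server)
{
    if (load[preferred][server] > LOAD_THRESHOLD)
        return 1 - preferred;             /* skip the hot spot, use the mirror  */
    return preferred;
}

int main(void)
{
    for (int stripe = 0; stripe < 12; stripe++) {
        int server    = stripe % 2;       /* server inside a group holding it   */
        int preferred = (stripe / 2) % 2; /* group picked by the interleaving   */
        int group     = pick_group(preferred, server);
        printf("stripe %2d -> %s server %d%s\n", stripe,
               group == 0 ? "primary" : "backup", server + 1,
               group != preferred ? "  (hot spot skipped)" : "");
    }
    return 0;
}
```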
Parallel BLAST over Parallel I/O (Software stack: mpiBLAST on top of the NCBI BLAST library on top of parallel I/O libraries.) • Avoid changing the search algorithms • Minor modification to the I/O interfaces of NCBI BLAST, for example: read(myfile, buffer, 10) becomes pvfs_read(myfile, buffer, 10) or ceft_pvfs_read(myfile, buffer, 10) • mpiBLAST was chosen arbitrarily • The scheme is general to any parallel BLAST as long as it uses NCBI BLAST (true in most cases ☺)
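A minimal sketch of what such a "minor modification" could look like: a thin wrapper around the read call that dispatches to the backend in use. Only the POSIX read() is guaranteed as written; pvfs_read() and ceft_pvfs_read() are stubbed here, and their real prototypes may differ from what the slide suggests:

```c
/* Sketch: route NCBI BLAST's reads to the local file system, PVFS, or
 * CEFT-PVFS.  The stubs stand in for the real client libraries. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

enum io_backend { IO_LOCAL, IO_PVFS, IO_CEFT_PVFS };

/* Placeholder stubs so the sketch links on a plain Linux box; in a real
 * build these would come from the PVFS / CEFT-PVFS client libraries. */
static ssize_t pvfs_read(int fd, void *buf, size_t count)      { return read(fd, buf, count); }
static ssize_t ceft_pvfs_read(int fd, void *buf, size_t count) { return read(fd, buf, count); }

/* One call site inside the BLAST I/O layer dispatches to the chosen backend. */
static ssize_t blast_read(enum io_backend backend, int fd, void *buf, size_t count)
{
    switch (backend) {
    case IO_PVFS:      return pvfs_read(fd, buf, count);
    case IO_CEFT_PVFS: return ceft_pvfs_read(fd, buf, count);
    default:           return read(fd, buf, count);
    }
}

int main(void)
{
    char buf[10];
    int fd = open("/etc/hostname", O_RDONLY);   /* any readable file will do */
    if (fd < 0) return 1;
    ssize_t n = blast_read(IO_LOCAL, fd, buf, sizeof buf);
    printf("read %zd bytes through the wrapper\n", n);
    close(fd);
    return 0;
}
```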
Original mpiBLAST (Figure: one mpiBLAST master and four mpiBLAST workers, each node with its own CPU and disk, connected by a switch.) • The master splits the database into segments • The master assigns and copies segments to the workers • Each worker searches its own disk using the entire query
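A minimal MPI sketch of this master/worker flow: rank 0 hands each worker a database fragment index, and each worker would then run NCBI BLAST on its local copy with the full query. The one-fragment-per-worker policy and the message tag are simplifications, not mpiBLAST's actual protocol:

```c
/* Sketch: master (rank 0) assigns one database fragment to each worker. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                         /* master: assign fragments     */
        for (int worker = 1; worker < size; worker++) {
            int fragment = worker - 1;       /* fragment i goes to worker i+1 */
            MPI_Send(&fragment, 1, MPI_INT, worker, 0, MPI_COMM_WORLD);
        }
    } else {                                 /* worker: receive assignment   */
        int fragment;
        MPI_Recv(&fragment, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("worker %d: searching database fragment %d with the full query\n",
               rank, fragment);
        /* ... call into NCBI BLAST on the local fragment here ... */
    }

    MPI_Finalize();
    return 0;
}
```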
mpiBLAST over PVFS/CEFT-PVFS (Figure: the mpiBLAST master and workers overlap with the metadata server and data servers on the same nodes, connected by a switch.) • The database is striped over the data servers • Each mpiBLAST worker receives parallel I/O service • Fair comparison: the I/O servers and mpiBLAST processes are overlapped on the same nodes, and the segment-copy time of the original mpiBLAST is not counted
mpiBLAST over CEFT-PVFS with one hot spot skipped (Figure: the same overlapped layout, with one data server acting as a hot spot.) • Doubles the degree of parallelism for reads • Hot spots may exist since the servers are not dedicated • Exploits redundancy to avoid performance degradation
Performance Measurement Environment • PrairieFire Cluster at UNL • Peak performance: 716 Gflops • 128 nodes, 256 processors • 128 IDE disks, 2.6 TeraBytes total capacity • Myrinet and Fast Ethernet • Linux 2.4 and GM-1.5.1 • Measured performance • TCP/IP: 112 MB/s • Disk writes: 32 MB/s • Disk reads: 26 MB/s
I/O Access Patterns • Database: nt • The largest database on the NCBI web site • 1.76 million sequences, 2.7 GBytes in size • Query sequence: a subset of ecoli.nt (568 bytes) • Statistics show typical query lengths of 300-600 bytes • I/O patterns of a specific case, 8 mpiBLAST workers on 8 fragments: • 90% of operations are reads • Large reads and small writes • I/O requests from different workers gradually drift out of phase; this asynchrony is great for parallel I/O
I/O Access Patterns (cont.) • We focus on read operations due to their dominance. • I/O patterns with 8 workers and varying numbers of fragments: dominated by large reads. • 90% of all accessed database data is read by only 40% of the requests. (Figure: read request size and count as the number of database fragments varies from 5 to 30.)
mpiBLAST vs. mpiBLAST-over-PVFS Comparison: assuming the SAME amount of resources (measured changes: -8%, 28%, 27%, 19%) • With 1 node, mpiBLAST-over-PVFS performs worse (-8%). • Parallel I/O benefits performance. • The gains of PVFS become smaller as the number of nodes increases.
mpiBLAST vs. mpiBLAST-over-PVFS (cont.) Comparisons: assuming DIFFERENT amounts of resources (Figure: e.g., 4 CPUs with 2 disks vs. 4 CPUs with 4 disks.) • Parallel I/O is better even with fewer resources: mpiBLAST-over-PVFS outperforms due to parallel I/O. • The gain from parallel I/O saturates eventually (Amdahl's Law). • As the database scales up, the gain becomes more significant.
mpiBLAST-over-PVFS vs. mpiBLAST-over-CEFT-PVFS Comparisons: varying the number of mpiBLAST workers (Configuration: PVFS with 8 data servers; CEFT-PVFS with 4 servers mirroring 4; both use 8 disks.) • mpiBLAST is read-dominated • PVFS and CEFT-PVFS provide the same degree of parallelism for reads • The overhead of CEFT-PVFS is negligible (measured slowdowns: 1.5%, 10%, 3.6%, 2.6%)
mpiBLAST-over-PVFS vs. mpiBLAST-over-CEFT-PVFS Comparisons when one disk is stressed • Configuration: PVFS with 8 servers; CEFT-PVFS with 4 servers mirroring 4 • The disk on one data server is stressed by repeatedly appending data, using the following stress loop (a runnable version follows this slide):
    M = allocate(1 MByte);
    create a file named F;
    while (1)
        if (size(F) > 2 GB)
            truncate F to zero bytes;
        else
            synchronously append the data in M to the end of F;
• While the original mpiBLAST and mpiBLAST-over-PVFS degraded by factors of 10 and 21, our approach degraded only by a factor of 2.
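For reference, a runnable C version of the stress loop above: it synchronously appends a 1 MB buffer to a file until the file exceeds 2 GB, then truncates it and starts over, running until killed. The file name and sizes follow the pseudocode; error handling is kept minimal:

```c
/* Disk stress loop: synchronous 1 MB appends to file "F", truncated back
 * to zero bytes whenever it grows past 2 GB.  Runs forever. */
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (1024 * 1024)                 /* 1 MB appended per iteration   */
#define LIMIT (2LL * 1024 * 1024 * 1024)    /* truncate once F exceeds 2 GB  */

int main(void)
{
    char *m = malloc(CHUNK);
    if (!m) return 1;
    memset(m, 'x', CHUNK);

    int fd = open("F", O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
    if (fd < 0) return 1;

    for (;;) {
        off_t size = lseek(fd, 0, SEEK_END);
        if (size > LIMIT) {
            if (ftruncate(fd, 0) != 0) return 1;        /* reset F to zero bytes    */
        } else {
            if (write(fd, m, CHUNK) != CHUNK) return 1; /* O_SYNC: synchronous write */
        }
    }
}
```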
Conclusions • Clusters are effective computing platforms for bioinformatics applications. • A higher degree of I/O parallelism may not lead to better performance unless the I/O demand remains high (larger databases). • Doubling the degree of I/O parallelism for read operations gives CEFT-PVFS read performance comparable to PVFS. • Load balancing in CEFT-PVFS is important and effective.
Thank you. Questions? http://rcf.unl.edu/~abacus