Apply Bioinformatics Applications on Parallel and Grid Computing Environment

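Chao-Tung Yang (楊朝棟), Yu-Lun Kuo (郭育倫), High-Performance Computing Laboratory, Department of Computer Science and Information Engineering, Tunghai University • Graduate Institute of Electrical Engineering, National Kaohsiung University of Applied Sciences • Presenter: Yu-Ming Wang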

Outline • 1 Introduction • 2 Bioinformatics, BioGrid • 3 Parallel Bioinformatics • 4 System Environment • 5 Experimental Results • 6 Conclusions

Introduction • Bioinformatics tools can speed up the analysis of large-scale sequence data, especially sequence alignment. • Hardware: • PC cluster: one master node and seven slave nodes (16 processors in total) • Sun Fire 6800 Server • Grid System • Bioinformatics tools: • mpiBLAST (MPI) • FASTA (MPI) • HMMs (PVM, Parallel Virtual Machine)

Bioinformatics • Creation of databases that allow storage and management of large biological data sets. • Development of algorithms and statistics to determine relationships between members of those data sets. • Use of the above tools for analysis and interpretation of biological data.

Grid Computing • Makes more effective use of computer resources. • Offers a way to solve problems that require an enormous amount of computing power. • The resources of many computers can be directed toward a common objective.

BioGrid • Figure labels: PC Cluster, Local BioGrid, Global BioGrid • Constructing a BioGrid system is necessary for research in order to reduce sequence alignment time.

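Parallel Bioinformatics I (BLAST) • Basic Local Alignment Search Tool: an alignment tool for nucleotide and protein sequences • [blastall] : nucleotide sequence alignment; protein sequence alignment; nucleotide sequence against a protein database; protein sequence against a translated nucleotide database; nucleotide sequence against a translated nucleotide database • [blastpgp] : runs PSI-BLAST (Position-Specific Iterated BLAST), a BLAST program that queries a protein database with an input protein sequence to find out whether it belongs to a protein family • [bl2seq] : aligns two nucleotide or two protein sequences with each other • [formatdb] : formats FASTA sequence data into a database that BLAST can search • mpiBLAST is based on MPI.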
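
mpiBLAST parallelizes BLAST by database segmentation: the formatted database is split into fragments, each worker searches one fragment, and the hits are merged. The following is a minimal sketch of the splitting step only, assuming a plain FASTA input. The function names `read_fasta` and `split_database` and the greedy balancing rule are illustrative assumptions, not mpiBLAST's own code.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_database(records, n_fragments):
    """Hypothetical helper: place each record, longest first, into the
    fragment that currently holds the fewest residues."""
    fragments = [[] for _ in range(n_fragments)]
    sizes = [0] * n_fragments
    for rec in sorted(records, key=lambda r: len(r[1]), reverse=True):
        i = sizes.index(min(sizes))
        fragments[i].append(rec)
        sizes[i] += len(rec[1])
    return fragments


# Toy database with four sequences of different lengths.
DB = """>seq1
ACGTACGTAC
>seq2
ACGT
>seq3
ACGTAC
>seq4
AC
"""

fragments = split_database(read_fasta(DB), 2)
for idx, frag in enumerate(fragments):
    print(idx, [header for header, _ in frag])
```

Balancing fragments by residue count rather than by record count keeps the per-worker search time roughly even when sequence lengths vary widely.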

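Parallel Bioinformatics II (FASTA) • FASTA is a suite of sequence search programs similar to the BLAST modes, with the exception of PSI-BLAST; it provides very fast searches of sequence databases (DNA and protein). • [fasta] uses the FASTA algorithm to compare a DNA sequence against a DNA database, or a protein sequence against a protein database • [ssearch] performs the same comparison using the Smith-Waterman algorithm • [fastx/fasty] compares a DNA sequence against a protein database, translating the DNA sequence • [tfastx/tfasty] compares a protein sequence against a DNA database, translating the DNA database sequences • [align] computes the alignment of two DNA or two protein sequences • [lalign] computes local alignments of two DNA or two protein sequences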
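
The Smith-Waterman algorithm named under [ssearch] can be sketched in a few lines. This is a simplified version that returns only the best local alignment score and assumes a flat match/mismatch score and a linear gap penalty. Those scoring values are illustrative assumptions, and real search programs typically use substitution matrices and affine gap penalties instead.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between strings a and b.

    Uses a linear gap penalty and keeps only two rows of the
    dynamic-programming matrix in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("GGACGTGG", "CCACGTCC"))  # shared core "ACGT"
```

The cost grows with the product of the two sequence lengths, which is why exhaustive Smith-Waterman searches benefit from being spread across many processors.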

Parallel Bioinformatics III (HMMs) • HMMs (Hidden Markov Models) can be used for database searching, using a statistical description of a sequence family. • [hmmpfam] searches a sequence against an HMM database and tries to annotate the unknown sequence • [hmmindex] builds a binary SSI index for an HMM database • [hmmsearch] searches a sequence database with an HMM to find more similar sequences • [hmmalign] aligns multiple sequences • [hmmbuild] builds an HMM from a multiple sequence alignment • [hmmcalibrate] reads an HMM and calibrates its search statistics • [hmmemit] generates a "consensus" sequence • [hmmfetch] retrieves an HMM from an HMM database
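
To illustrate how an HMM assigns a probability to a sequence, here is a minimal forward-algorithm sketch for a generic two-state HMM. The states, probabilities, and the function name `forward_probability` are toy assumptions chosen for illustration. The tools listed above work with profile HMMs (match, insert, and delete states), which this sketch does not model.

```python
def forward_probability(seq, start, trans, emit):
    """P(seq | model) for a discrete HMM, via the forward algorithm.

    seq must be non-empty; start, trans and emit are nested dicts of
    probabilities keyed by state name (and by symbol for emit).
    """
    states = list(start)
    alpha = {s: start[s] * emit[s][seq[0]] for s in states}
    for symbol in seq[1:]:
        alpha = {
            t: emit[t][symbol] * sum(alpha[s] * trans[s][t] for s in states)
            for t in states
        }
    return sum(alpha.values())


# Toy model (assumed values): an AT-rich state and a GC-rich state
# that each tend to persist from one position to the next.
start = {"AT": 0.5, "GC": 0.5}
trans = {
    "AT": {"AT": 0.9, "GC": 0.1},
    "GC": {"AT": 0.1, "GC": 0.9},
}
emit = {
    "AT": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "GC": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}

p_gc_rich = forward_probability("GGCC", start, trans, emit)
p_mixed = forward_probability("GATC", start, trans, emit)
print(p_gc_rich > p_mixed)
```

A database search repeats this kind of scoring for every sequence in the database, so the sequences can be divided among workers and scored independently.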

Our System Environment (I) • Linux PC Cluster • One server node • AMD Athlon MP 2000+ processors • 1 GB shared memory • Seven slave nodes • AMD Athlon MP 1800+ processors • 512 MB shared memory • 100 Mbps Ethernet switches • Sun Fire 6800 Server • 8 UltraSPARC III Cu 1.2 GHz processors • 8 GB main memory • Set up with the Solaris 8 operating system

Our System Environment (II) • The Grid System • Each cluster has one master node and two slave nodes. • 3COM 3C9051 10/100 Fast Ethernet Card • AC-EX3016B Switch HUB • Globus Toolkit v2.4

Experimental Results (I) • The Experimental Results on PC Cluster • The performance of mpiBLAST: nearly two times faster

HMMs • The performance of HMMs • saved about half the time • speedup: nearly two times

Experimental Results (II) • The Experimental Results on Sun Fire 6800 • The performance of FASTA • speedup: nearly two times

The performance of HMMs • saved about half the time • speedup: nearly two times

Experimental Results (III) • The Experimental Results on Grid System • Performance improves noticeably, saving about one-third of the time (chart values: 250 sec, 160 sec)
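
The relation between "time saved" and "speedup" in these results can be checked with a short calculation. The sketch below assumes that 250 sec is the baseline run and 160 sec is the grid run, which the slide does not state explicitly. The function names are illustrative.

```python
def speedup(t_baseline, t_parallel):
    """Speedup is the ratio of baseline time to parallel time."""
    return t_baseline / t_parallel


def fraction_saved(t_baseline, t_parallel):
    """Fraction of the baseline time that the parallel run saves."""
    return 1.0 - t_parallel / t_baseline


# Assumed reading of the chart values: 250 sec before, 160 sec on the grid.
s = speedup(250.0, 160.0)
f = fraction_saved(250.0, 160.0)
print(f"speedup = {s:.2f}x, time saved = {f:.0%}")

# A speedup of two corresponds to saving half the time.
print(fraction_saved(2.0, 1.0))
```

Under that reading the grid run saves 36% of the time, in line with the "about one-third" on the slide, and a speedup of nearly two matches "saved about half the time" on the earlier result slides.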

Conclusions • Parallel computers and grid systems can save time in sequence analysis. Parallel bioinformatics tools therefore reduce the waiting time for alignments and improve sequence alignment performance.

Life is too short & DNA is too long!!
