A Study of GeneWise with the Drosophila Adh Region Asta Gindulyte CMSC 838 Presentation Authors: Yi Mo, Moira Regelson, and Mike Sievers Paracel Inc., Pasadena, CA
Motivation • Genome annotation • Extraction of biologically relevant knowledge from raw genomic sequence data • Need faster genome annotation methods • DNA sequences are very long (millions of nucleotides) • Current methods are computationally too expensive • Approach/Solution • GeneMatcher2 hardware acceleration of GeneWise CMSC 838T – Presentation
Outline • Motivation • Genome annotation • GeneMatcher2 • Design • ASIC hardware • Comparison • GeneWise algorithm • HalfWise algorithm • Performance (time, precision) • Observations • Performance improvement • Cost effectiveness
Approach • Problem: make GeneWise run faster • “Embarrassingly parallel” algorithm • Computationally too expensive when run in parallel on PCs • Paracel’s solution: hardware acceleration • Don’t change the algorithm • Produce an implementation on the GeneMatcher2 supercomputer that works as much like the original software as possible • 6LITE algorithm, now also in Wise2
GeneMatcher Architecture
ASIC Hardware • ASIC: application-specific integrated circuit • Designed to speed up dynamic programming algorithms • (could be used for Smith-Waterman) • Each ASIC board has 3072 processors • System has up to 9 boards • Cost per board around $40K
GeneWise Algorithm • Perform a search of genomic DNA sequence data using a protein HMM • Build HMMs from protein families • Scan genome using HMM • Look for start codon • “GT” sequence signals possible 5’ splice site • “AG” sequence signals possible 3’ splice site • Dynamic programming used in the scanning process • Obtain probability of the most likely path in HMM generating the sequence • Obtain alignment by backtracking
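The "most likely path" and "backtracking" steps above can be illustrated with a generic Viterbi recursion. This is a minimal sketch under stated assumptions, not the GeneWise model itself: the two states (`exon`, `intron`), all probabilities, and the names `viterbi`, `STATES`, `LOG_START`, `LOG_TRANS`, and `LOG_EMIT` are invented for illustration, and the input sequence is assumed to be non-empty.

```python
import math

# Toy two-state HMM (hypothetical numbers, chosen only for illustration).
STATES = ("exon", "intron")
LOG_START = {s: math.log(0.5) for s in STATES}
LOG_TRANS = {a: {b: math.log(0.9 if a == b else 0.1) for b in STATES} for a in STATES}
LOG_EMIT = {
    "exon": {c: math.log(0.4 if c in "GC" else 0.1) for c in "ACGT"},
    "intron": {c: math.log(0.4 if c in "AT" else 0.1) for c in "ACGT"},
}


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (log-probability, state path) of the most likely path for obs."""
    # score[i][s]: best log-probability of any path ending in state s at position i.
    score = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    # back[i][s]: predecessor state on that best path (used for backtracking).
    back = [{}]
    for i in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: score[i - 1][p] + log_trans[p][s])
            score[i][s] = score[i - 1][prev] + log_trans[prev][s] + log_emit[s][obs[i]]
            back[i][s] = prev
    # Backtrack from the best final state to recover the full path.
    last = max(states, key=lambda s: score[-1][s])
    path = [last]
    for i in range(len(obs) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return score[-1][last], path


if __name__ == "__main__":
    best_score, best_path = viterbi("GGCCAATT", STATES, LOG_START, LOG_TRANS, LOG_EMIT)
    print(best_score, best_path)
```

The table has one cell per (sequence position, state) pair, and each cell depends only on the previous position. That regular structure is the kind of dynamic programming the ASIC slide says the hardware is designed to speed up.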
GeneWise model on GeneMatcher2
HalfWise Algorithm • Reduce cost by running BLAST to select HMMs with possible hits • Use these HMMs with GeneWise database search and sequence alignment algorithm • May miss some genes due to BLAST misses
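The two-stage idea above (cheap filter first, expensive search only on the survivors) can be sketched as follows. Everything here is a stand-in chosen for illustration: `shares_kmer` is a toy substitute for BLAST, `longest_common_substring_len` is a toy substitute for the GeneWise search, the "models" are plain DNA strings rather than protein HMMs, and all names are hypothetical.

```python
def shares_kmer(query, target, k=4):
    """Toy prefilter (stand-in for BLAST): True if any k-mer of query occurs in target."""
    return any(query[i:i + k] in target for i in range(len(query) - k + 1))


def longest_common_substring_len(a, b):
    """Toy expensive search (stand-in for GeneWise): O(len(a) * len(b)) dynamic programming."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0] * (len(b) + 1)
        for j, cb in enumerate(b, 1):
            if ca == cb:
                # Extend the common substring that ended at the previous characters.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def two_stage_search(models, genome, prefilter, full_search):
    """Run full_search only on the models that pass prefilter; return {name: result}."""
    results = {}
    for name, seq in models.items():
        if prefilter(seq, genome):
            results[name] = full_search(seq, genome)
    return results


if __name__ == "__main__":
    genome = "GGACGTACGG"
    models = {"famA": "ACGTAC", "famB": "TTTTTT", "famC": "ACGAAC"}
    print(two_stage_search(models, genome, shares_kmer, longest_common_substring_len))
```

In this toy data `famC` shares a 3-base match with the genome but no 4-mer, so the prefilter drops it and the full search never sees it. That is the trade-off the slide describes: fewer expensive searches, at the risk of missing weak hits.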
Evaluation • Test data set • A genomic DNA sequence contig of about 2.9 Mb from the Drosophila Adh region • Focus on finding all Pfam (Protein families database of alignments and HMMs) protein profile-HMMs that occur in the Adh genomic sequence
Evaluation: Speed
Evaluation: Score
Evaluation: Sensitivity and Specificity
Observations • Performance improvement • The speedup is several orders of magnitude. • Makes real target applications possible • Accuracy might be improved over HalfWise algorithm • Cost effectiveness • System used costs around $500K • $500K worth of Linux PCs (500 processors at $1K each) would run about 10 times slower • Weaknesses • Cannot modify the algorithm • Not enough data to assess scalability
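The cost-effectiveness figures above can be worked through in a few lines. This is a back-of-the-envelope sketch using only the numbers on the slide. It assumes PC throughput scales linearly with the number of processors, which is plausible for an embarrassingly parallel workload but is an assumption here, and the variable names are illustrative.

```python
# Figures taken from the slide.
accelerator_cost = 500_000   # dollars, approximate cost of the system used
pc_cost = 1_000              # dollars per Linux PC processor
pc_slowdown = 10             # the same-budget PC cluster is about 10x slower

# How many PC processors the same budget buys.
pcs_for_same_budget = accelerator_cost // pc_cost

# Assuming linear scaling, matching the accelerator's speed needs 10x as many PCs.
pcs_for_same_speed = pcs_for_same_budget * pc_slowdown
cluster_cost_for_same_speed = pcs_for_same_speed * pc_cost

print(pcs_for_same_budget, pcs_for_same_speed, cluster_cost_for_same_speed)
```

Under these assumptions a PC cluster of equal speed would need about 5,000 processors at roughly $5M, which is the roughly tenfold price/performance advantage the slide claims.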