Using cDNA sequence quality value to improve cDNA-genomic sequence alignment Chaochun Wei Lab Meeting 10/19/2005
Motivation
• High-quality spliced alignments are critical to many bioinformatics applications (e.g., MGC clone quality validation, TWINSCAN_EST).
• Currently, reliable spliced alignments require:
  • Same organism
  • Low sequencing error rate
pairHMM Using Quality Sequence
• Input:
  • Genomic sequence
  • cDNA sequence
  • cDNA quality value sequence
• Output:
  • Spliced alignment of the cDNA and genomic sequences
An Example of Quality Value Sequence
>gnl|ti|154040434 name:UI-H-EI1-ayz-b-20-0-UI.s1
8 9 11 18 19 29 29 34 29 32 32 27 27 26 20 26 27 33 33 39
39 33 33 29 32 29 29 37 37 37 40 40 40 40 40 40 37 51 51 35
35 35 35 35 33 29 29 30 30 30 35 35 45 40 40 37 37 46 46 46
56 40 40 40 40 51 51 51 51 51 51 56 56 56 56 56 56 56 56 56
56 56 56 56 56 56 56 56 56 56 40 40 35 35 35 35 35 35 35 35
37 37 37 42 40 40 40 46 46 42 42 56 48 48 56 42 46 46 40 35
40 45 37 40 51 51 51 51 51 42 51 56 56 42 40 40 40 40 35 42
31 31 14 15 23 33 40 46 40 40 40 40 40 40 42 42 56 48 44 44
44 42 40 37 37 37 40 42 42 35 35 35 42 42 42 42 42 44 56 42
42 40 35 26 25 15 15 25 27 42 44 48 37 37 35 35 35 29 37 33
33 33 32 29 29 27 27 29 24 22 25 40 40 29 29 29 25 25 29 29
40 40 40 40 34 34 33 34 40 40 40 40 32 32 18 18 18 31 24 20
13 18 18 25 31 40 29 29 29 29 13 11 9 11 11 12 12 25 27 27
28 24 25 23 29 29 29 29 28 32 23 19 22 23 27 27 32 32 29 29
32 34 40 40 40 40 40 40 40 46 44 44 40 27 27 25 19 19 23 28
36 40 31 29 22 22 25 19 19 16 16 16 25 22 21 21 29 30 29 29
29 32 27 25 22 22 25 27 29 22 25 27 24 20 13 4 0 4 13 13
13 11 13 13 13 19 15 15 24 24 16 22 29 29 25 29 27 27 27 27
29 25 29 34 29 29 25 26 36 26 25 18 13 13 18 27 13 12 9 9
8 16 20 24 29 23 30 21 24 24 29 25 24 16 20 21 11 11 17 18
24 25 19 14 21 11 7 6 6 7 6 6 10 12 10 15 15 9 9 10
18 16 18 14 12 20 20 21 14 11 10 8 12 13 15 15 10 11 15 8
14 13 10 10 10 9 9 8 9 8 8 8 6 6 9 9 9 8 6 6
4 0 4 6 8 8 8 8 15 4 0 4 13 13 9 8 8 7 6 6
6 7 7 8 8 8 9 8 7 7 10 7 7 9 10 8 6 6 6 7
6 6 7 7 4 0 4 6 6 4 0 4 6 4 7 7 8 7 7 7
4 0 4 4 4 4 8 4 0 4 7 7 7 7 8 7 7 8 8 8
7 7 8 4 0 4 6 6 6 8 6 4 6 4 0 4 7 7 7 7
4 0 4 7 7
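A quality value record like the one above has a FASTA-style header line followed by whitespace-separated integer scores, one per base call. The following is a minimal parsing sketch, not the tool's actual reader; the function name `parse_qual_record` is hypothetical, and the sample text is only the first two rows of the record shown on the slide.

```python
def parse_qual_record(text):
    """Parse one FASTA-style quality record into (header, list of int scores).

    Hypothetical helper for illustration; assumes a single record whose
    first non-empty line starts with '>'.
    """
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a header line starting with '>'")
    header = lines[0][1:]
    # Scores may wrap over many lines; flatten them into one list.
    quals = [int(tok) for ln in lines[1:] for tok in ln.split()]
    return header, quals


# First two rows of the example record from the slide.
record = """>gnl|ti|154040434 name:UI-H-EI1-ayz-b-20-0-UI.s1
8 9 11 18 19 29 29 34 29 32 32 27 27 26 20 26 27 33 33 39
39 33 33 29 32 29 29 37 37 37 40 40 40 40 40 40 37 51 51 35
"""

header, quals = parse_qual_record(record)
print(header.split()[0], len(quals), min(quals), max(quals))
```

Each score lines up with one base of the cDNA read, so the list can be indexed in parallel with the cDNA sequence during alignment.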
PairHMM using Quality Value Sequence
[Figure: pairHMM state diagram with Begin and End states]
Graphical Model for States in PairHMM with Quality Value Sequence
[Figure: graphical models for the Model and the Null-Model, each over the variables RG, EG, qual, and EC]
• RG: Genomic sequence
• EG: EST/cDNA sequence
• EC: EST base call
• qual: Quality value
Graphical Model of the States in PairHMM with Sequence Quality Value
[Figure: Model, Null-Model, and the resulting Score]
Initial Parameter Estimation
• From Phred paper:
• From dbSNP human data: Pr(RG|EG)
• From human genome: Pr(RG)
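As background for the Phred-derived parameter: a Phred quality value q is defined so that the base-call error probability is 10^(-q/10). The sketch below converts a quality value to an error probability and then builds one possible emission term Pr(EC | EG, qual). The emission form is an assumption made for illustration (errors spread uniformly over the three other bases, error capped at 0.75); the slides do not state the formula actually used, and the function names are hypothetical.

```python
import math


def phred_to_error_prob(q):
    """Standard Phred relation: error probability = 10^(-q/10)."""
    return 10.0 ** (-q / 10.0)


def base_call_prob(called, true_base, q):
    """Hypothetical Pr(EC = called | EG = true_base, qual = q).

    Assumption (not from the slides): a miscall is equally likely to be
    any of the three other bases. The error is capped at 0.75 so that
    q = 0 gives a uniform distribution over the four bases.
    """
    err = min(phred_to_error_prob(q), 0.75)
    return 1.0 - err if called == true_base else err / 3.0


for q in (0, 4, 20, 40):
    print(q, phred_to_error_prob(q), base_call_prob("A", "A", q))
```

Under this assumption, a mismatch at a low-quality position costs little in the alignment score, while a mismatch at a high-quality position is penalized heavily.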
Data Sets
• NCBI35 Chr20, 21, 22
• Human reads (9/15/2005) aligned to Chr20, 21, and 22 by BLAT (23,753 ESTs in total)
Results
Program  INTRON#  INTbases   GGAP#  GGbases  EGAP    EGbases   MATCHbase  MISMATCHbase  EXPLAINED
qpair    40286    138858553  14502  23940    25945   29386     16158984   249919        13376
est2gen  35785    229102950  68700  78753    282099  296870    16674021   384138        12591
sim4     40348    476722905  56903  292246   106106  27211366  15347251   254234        12536

Considering mismatches with quality value <= 5 as explained:
Program  INTRON#  INTbases   GGAP#  GGbases  EGAP    EGbases   MATCHbase  MISMATCHbase  EXPLAINED
qpair    40286    138858549  14502  23940    25945   29386     16158984   249919        39683
est2gen  35785    229102946  68700  78753    282099  296870    16674021   384138        57544
sim4     40348    476722901  56903  292246   106106  27211366  15347251   254234        38794
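To compare the three programs on a common scale, the raw counts can be turned into rates. The sketch below is an illustration rather than part of the original analysis. It assumes the two fused figures in the results split as 249919 / 13376 (qpair) and 384138 / 57544 (est2gen), and it assumes EXPLAINED counts a subset of the MISMATCHbase positions; the names `counts` and `summarize` are hypothetical.

```python
# (MATCHbase, MISMATCHbase, EXPLAINED, EXPLAINED with q <= 5) per program,
# copied from the results; the split of the fused figures is an assumption.
counts = {
    "qpair":   (16158984, 249919, 13376, 39683),
    "est2gen": (16674021, 384138, 12591, 57544),
    "sim4":    (15347251, 254234, 12536, 38794),
}


def summarize(match, mismatch, explained, explained_q5):
    """Derive per-program rates from the raw base counts."""
    return {
        # Share of aligned (non-gap) bases that disagree.
        "mismatch_rate": mismatch / (match + mismatch),
        # Share of mismatches counted as explained, before and after
        # treating quality <= 5 mismatches as explained.
        "explained_frac": explained / mismatch,
        "explained_frac_q5": explained_q5 / mismatch,
    }


summary = {name: summarize(*row) for name, row in counts.items()}
for name, s in summary.items():
    print(f"{name:8s} mismatch rate {s['mismatch_rate']:.4f}  "
          f"explained {s['explained_frac']:.3f}  "
          f"explained (q<=5) {s['explained_frac_q5']:.3f}")
```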