InCoB 2009
MapNext: a software tool for spliced and unspliced alignments and SNP detection of short sequence reads
Hua Bao
Sun Yat-sen University, Guangzhou, China
Evolution.sysu.edu.cn
2009-09-10
Next-generation sequencing
• High throughput (tens of millions of reads per lane)
• Short read length (25-50 bp)
• Higher sequencing error rate than Sanger sequencing
• Applications: genome sequencing, transcriptome sequencing, pooled population sequencing
The objective
1. Unspliced alignment of reads onto the genome
2. Spliced alignment of transcript reads over exon-intron boundaries
3. SNP detection from population sequences
Seed hash table

Reads
Read 1: TACACCACGGTCAGACTTGCATCACAACTGTTAAGC
Read 2: AGACTTGCATCACAACTGTTAAGCTACACCACGGTC
…
Read n: …

Seed hash table
TACACCACGGTC: Position 1, Read 1, +; Position 25, Read 2, +; …
GACCGTGGTGTA: Position 1, Read 1, -; Position 25, Read 2, -; …
AGACTTGCATCA: Position 13, Read 1, +; Position 1, Read 2, +; …
TGATGCAAGTCT: Position 25, Read 1, -; Position 13, Read 2, -; …
Other seeds (k-mers): …
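The table above can be sketched in a few lines of Python. This is an illustrative sketch, not MapNext's code: the function names `build_seed_table` and `reverse_complement` are hypothetical, seeds are assumed to be taken at non-overlapping offsets (1, 13, 25, as on the slide), and minus-strand entries are assumed to record the seed's 1-based start on the forward read, a convention the slide does not spell out.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def build_seed_table(reads, k=12, step=None):
    """Map every seed and its reverse complement to (position, read id, strand).

    Assumptions (not stated in the slides): seeds are taken every `step`
    bases (default: k, i.e. non-overlapping), and positions are 1-based
    starts on the forward read for both strands.
    """
    step = step or k
    table = defaultdict(list)
    for read_id, read in enumerate(reads, start=1):
        for start in range(0, len(read) - k + 1, step):
            seed = read[start:start + k]
            table[seed].append((start + 1, read_id, "+"))
            table[reverse_complement(seed)].append((start + 1, read_id, "-"))
    return table

reads = [
    "TACACCACGGTCAGACTTGCATCACAACTGTTAAGC",
    "AGACTTGCATCACAACTGTTAAGCTACACCACGGTC",
]
table = build_seed_table(reads)
print(table["TACACCACGGTC"])  # [(1, 1, '+'), (25, 2, '+')]
print(table["GACCGTGGTGTA"])  # [(1, 1, '-'), (25, 2, '-')]
```

A dictionary keyed by the seed string stands in here for the integer-keyed array that the next slide describes.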
Key computation of the seed
Coding: A: 0, T: 1, G: 2, C: 3
k-mer: CCGATT
key = 3*4^5 + 3*4^4 + 2*4^3 + 0*4^2 + 1*4^1 + 1*4^0
Seed hash table: [0] … [n], indexed by key (Key=n); each entry lists (read id, position, strand), e.g. (1,1,+) (2,13,-) …
Reads: [0] … [n], each entry holding a read sequence, e.g. [1] CCGATTGGCTAAA …
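The key computation is a base-4 encoding of the k-mer, which a short Python sketch makes concrete. The base coding is the one given on the slide; the function name `seed_key` is hypothetical.

```python
# 2-bit base coding taken from the slide: A=0, T=1, G=2, C=3.
BASE_CODE = {"A": 0, "T": 1, "G": 2, "C": 3}

def seed_key(kmer: str) -> int:
    """Read a k-mer as a base-4 number, leftmost base most significant."""
    key = 0
    for base in kmer:
        key = key * 4 + BASE_CODE[base]
    return key

# 3*4^5 + 3*4^4 + 2*4^3 + 0*4^2 + 1*4^1 + 1*4^0
print(seed_key("CCGATT"))  # 3973
```

Because every k-mer maps to a distinct integer in the range 0 to 4^k - 1, the key can index an array directly, which is what allows constant-time lookup.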
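Unspliced alignment
Seed hash table: [0] … [n], indexed by key; each entry lists (read id, position, strand), e.g. (1,1,+) (2,13,-) …
Reads: [0] … [n], each entry holding a read sequence
Genome: TACACCACGGTCAGACTTGCATCA … (k-mer: 8-12 bp; step size: 1 bp)
Each genome k-mer is converted to a key (Key=3 in the example) and looked up in the seed hash table in O(1); each hit is then extended along the read.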
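The seed-and-extend idea on this slide can be sketched as follows. This is a simplified illustration under stated assumptions, not MapNext's implementation: only the first k bases of each read are indexed, only the forward strand is searched, extension is an ungapped mismatch count, and the names `index_reads`, `align_unspliced` and `max_mismatches` are hypothetical.

```python
from collections import defaultdict

def index_reads(reads, k):
    """Hash the first k bases of every read (simplification: one seed per read)."""
    table = defaultdict(list)
    for read_id, read in enumerate(reads):
        table[read[:k]].append(read_id)
    return table

def align_unspliced(genome, reads, k=8, max_mismatches=2):
    """Slide a k-mer window over the genome (step size 1) and extend each seed hit."""
    table = index_reads(reads, k)
    hits = []
    for pos in range(len(genome) - k + 1):
        # Constant-time lookup of the genome k-mer in the seed table.
        for read_id in table.get(genome[pos:pos + k], ()):
            read = reads[read_id]
            window = genome[pos:pos + len(read)]
            if len(window) < len(read):
                continue  # the read would run off the end of the genome
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.append((read_id, pos + 1, mismatches))  # 1-based position
    return hits

genome = "GGTACACCACGGTCAGACTTGCATCATT"
reads = ["TACACCACGGTCAGACTTGC", "TACACCACGGTCAGTCTTGC"]
print(align_unspliced(genome, reads))  # [(0, 3, 0), (1, 3, 1)]
```

The second read carries one substitution and is still reported, with its mismatch count, because the seed itself matches exactly.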
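Spliced alignment
Reads: [0] … [n], each entry holding a read sequence, e.g. [1] TACACCACG …
Hash table: [0] … [n], indexed by key; each entry lists (read id, posi, strand), e.g. (1,H,+) (2,T,-) …, where posi is H or T (presumably the head or tail of the read)
Seed hit list: [0] … [n], one entry per read; each entry lists (genome posi, read posi, strand), e.g. [1] (1,H,+) (780,T,+) …; [2] (1,T,-) …
Genome: TACACCACGGTCAGACTTGCATCA … (k-mer: 6-10 bp; step size: 1 bp; Key=2 in the example; O(1) lookup)
Example:
Read:   TACACCACGGTCAGAGTGCCATGGCTAGT
Genome: TACACCACGGTCAGAgtac … ccagGTGCCATGGCTAGT (head hit at position 1, tail hit at position 780)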
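One way to picture how head and tail seed hits combine into a spliced alignment is the sketch below. It is an assumption-laden illustration rather than the method used by MapNext: seeds are located with a plain string search instead of the hash table, matching is exact, only the forward strand is handled, and the preference for a GT...AG intron boundary is an added assumption. The names `find_all`, `align_spliced` and `max_intron` are hypothetical.

```python
def find_all(text, pattern):
    """Return every 0-based start of pattern in text (overlaps allowed)."""
    starts, i = [], text.find(pattern)
    while i != -1:
        starts.append(i)
        i = text.find(pattern, i + 1)
    return starts

def align_spliced(genome, read, k=6, max_intron=10_000):
    """Anchor a read by its head and tail seeds and find one exact split point."""
    n = len(read)
    alignments = []
    for head in find_all(genome, read[:k]):        # head seed hits
        for tail in find_all(genome, read[-k:]):   # tail seed hits
            end = tail + k                         # exclusive genome end of the read
            intron = (end - head) - n              # genome bases skipped by the read
            if not 0 < intron <= max_intron:
                continue
            left = 0                               # exact extension from the head
            while left < n and genome[head + left] == read[left]:
                left += 1
            right = 0                              # exact extension from the tail
            while right < n and genome[end - 1 - right] == read[n - 1 - right]:
                right += 1
            # Split sizes s (read bases in the first exon) that explain the whole read.
            splits = range(max(n - right, 1), min(left, n - 1) + 1)
            if not splits:
                continue
            # Assumption: prefer an intron that starts with GT and ends with AG.
            canonical = [s for s in splits
                         if genome[head + s:head + s + 2] == "GT"
                         and genome[end - (n - s) - 2:end - (n - s)] == "AG"]
            s = canonical[0] if canonical else splits[0]
            # 1-based inclusive genome coordinates of the two exon segments.
            alignments.append(((head + 1, head + s), (end - (n - s) + 1, end)))
    return alignments

exon1, exon2 = "TACACCACGGTCAGA", "GTGCCATGGCTAGT"
genome = exon1 + "GTAC" + "T" * 20 + "CCAG" + exon2
print(align_spliced(genome, exon1 + exon2))  # [((1, 15), (44, 57))]
```

The toy genome reuses the two exon segments from the slide with a shortened 28 bp intron, so the tail lands at position 44 rather than 780.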
Accuracy of alignment
The query dataset consisted of 1893118 simulated reads (35 bp; 134274 spliced and 1758844 unspliced) generated from 5796 coding DNA sequences on chromosome I of Arabidopsis thaliana.
SNP detection from population sequences
…TACACACGGTCAGACTAGCATCAGTCCGTAATGCT…
     CACGGTCAGACGAGCATCAGTCC
   CACACGGTCAGACGAGCATCAGT
        GGTCAGACGAGCATCAGTCCGTA
           CAGACTAGCATCAGTCCGTAATG
   CACACGGTCAGACTAGCATCAGT
        GGTCAGACTAGCATCAGACCGTA
        GGTCAGACTAGCATCAGTCCGTA
       CGGTCAGACTAGCATCAGTCCG
Quality control: minimum quality score (MQS), minimum neighbour quality score (MNQS)
Significance control: minimum coverage (MC), minimum minor allele frequency (MMAF)
SNP detection from population sequences
Clustered short reads
→ Did the reads pass QC? (Y: continue; N: excluded)
→ Is the polymorphic site covered by at least MC reads? (Y: continue; N: excluded)
→ Is the minor allele frequency higher than MMAF? (Y: continue; N: excluded)
→ Candidate SNPs
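The filtering steps in this flowchart can be sketched for a single alignment column. This is a minimal illustration, not MapNext's code: the function name `call_snp` and the input layout (one `(base, quality, neighbour_quality)` tuple per read) are hypothetical, the neighbour quality is assumed to be the lowest quality among the flanking bases, and the default thresholds are the values quoted for the simulation in this talk.

```python
from collections import Counter

def call_snp(column, mqs=25, mnqs=20, mc=50, mmaf=0.01):
    """Apply the QC, coverage and minor-allele-frequency filters to one site.

    column: iterable of (base, quality, neighbour_quality), one tuple per read.
    Returns (major allele, minor allele, minor allele frequency) or None.
    """
    # Quality control: drop bases below MQS or with low-quality neighbours.
    passed = [base for base, q, nq in column if q >= mqs and nq >= mnqs]
    # Significance control 1: the site must be covered by at least MC reads.
    if len(passed) < mc:
        return None
    counts = Counter(passed).most_common()
    if len(counts) < 2:
        return None  # monomorphic site
    (major, _), (minor, minor_n) = counts[0], counts[1]
    # Significance control 2: minor allele frequency must exceed MMAF.
    maf = minor_n / len(passed)
    return (major, minor, maf) if maf > mmaf else None

column = [("T", 30, 30)] * 40 + [("G", 30, 30)] * 15 + [("A", 10, 30)] * 5
print(call_snp(column))
```

In this example the five low-quality A calls are discarded by the QC step, leaving 55 reads, so the site is reported as a T/G candidate with an estimated minor allele frequency of 15/55.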
Accuracy of SNP detection from population sequencing
The simulation contained 2162 true SNPs among 50 haploid individuals. Coverage is the sequencing depth per individual. MQS, MNQS, MMAF and MC were set to 25, 20, 0.01 and 50 (1X per individual), respectively.
Accuracy of MAF estimation from population sequencing
[Figure: estimated minor allele frequency (y-axis, 0.00-0.48) plotted against real minor allele frequency (x-axis, 0.00-0.48)]
Summary
1. MapNext supports both spliced and unspliced alignment of short reads. Spliced alignment requires no training process.
2. MapNext can detect SNPs and estimate minor allele frequencies from population sequences.
Thank you!