470 likes | 721 Views
Biostatistics-Lecture 15 High-throughput sequencing and sequence alignment. Ruibin Xi Peking University School of Mathematical Sciences. High-throughput sequencing. HTS platforms Roche 454 platforms Illumina / Solexa platforms (most widely used) Applied Biosystem (ABI) SOLiD
E N D
Biostatistics-Lecture 15High-throughput sequencing and sequence alignment Ruibin Xi Peking University School of Mathematical Sciences
High-throughput sequencing • HTS platforms • Roche 454 platforms • Illumina/Solexa platforms (most widely used) • Applied Biosystem (ABI) SOLiD • HelicosHeliSopeTM sequencer(single molecular sequencing) • Life Technologies platforms • The throughput is increasing and the price is dropping • Short read but high throughput
High-throughput sequencing • HTS platforms • Roche 454 platforms • Illumina/Solexa platforms (most widely used) • Applied Biosystem (ABI) SOLiD • HelicosHeliSopeTM sequencer(single molecular sequencing) • Life Technologies platforms • The throughput is increasing and the price is dropping • Short read but high throughput
What sequencing data can do • Detection of genomic variations (Whole genome sequencing or targeted sequencing) • Single nucleotide polymorphisms (SNP) • Copy number variations (CNV) • Structural variations (SV) • Analyze protein interactions with DNA (ChIP-seq) • Whole Transcriptome study (RNA-seq) • DNA methylation study • and many more …..
Sequencing (Illunima) Mardis NatureReivew Genetics (2010)
What the data look like? Fastq Format detail: 1st line: the name of a short read 2nd line: the read itself (a short sequence of A,C,G,T) 3rd line: the name of the short read or plus (+) sign 4th line: the quality score
General strategy for analyzing HTS data • Alignment-based • Assembly-based
How can we evaluate an alignment • Scores: • Mutation (mismatch): 0 • Match: 1 • Gap: -1
Find a best alignment • Exhaustively search all possible alignments • Computationally too expensive!!! • Observation: for a pair (i,j)
BLAST and BLAT • BLAST: Basic Local Alignment Search Tool • BLAT: BLAST Like Alignment Tool
BLAST • BLAST: • A search algorithm for finding local alignments of two sequences S and T • An associated theory for evaluating the statistical significant • Terminology and notation • S(a,b): scoring system • High-scoring Segment Pair (HSP): • Cannot be extended or shortened without dropping the score
BLAST • The number of HSPs with a score ≥ S approximately follows a Poisson Distribution (under the null hypothesis) with parameter • Assumptions Some probability to take positive score • Based on extreme value theory • E-value • Bit score
BLAST • Algorithm: seed and extend • Build an index for k-mers of the query sequence • Find the hits of the k-mers in the database sequence in the query sequence • Extend the seeds with a score ≥ a threshold and find the HSPs with a score ≥ S • Evaluate the statistical significance
BLAT • Strategy: seed and extend • In the seed stage, detects regions of two sequences that are likely to be homologous • In the extend stage, those regions are examined in detail and alignments are produced. • Index the non-overlapping K-mers of the database sequences instead of the query sequence
BLAT • Strategy: seed and extend • In the seed stage, detects regions of two sequences that are likely to be homologous • In the extend stage, those regions are examined in detail and alignments are produced. • Index the non-overlapping K-mers of the database sequences instead of the query sequence
BLAT • Seeding strategy • Single perfect K-mer matches • Single near perfect K-mer matches • Multiple perfect K-mer matches
BLAT • Some definitions
BLAT • Single perfect match • the probability that a specific K-merin a homologous region of the database matches perfectly the corresponding K-mer in the query • Sensitivity: the probability of a hit (at least one non-overlapping K-mers in the database matches perfectly with the corresponding K-mer in the query) • Specificity: the expected number of non-overlapping K-mers that matches, assuming all letters are equally likely
BLAT • Single perfect match • the probability that a specific K-merin a homologous region of the database matches perfectly the corresponding K-mer in the query • Sensitivity: the probability of a hit (at least one non-overlapping K-mers in the database matches perfectly with the corresponding K-mer in the query) • Specificity: the expected number of non-overlapping K-mers that matches, assuming all letters are equally likely
BLAT • Single imperfect match • The probability • The sensitivity • The specificity
BLAT • Single imperfect match • The probability • The sensitivity • The specificity
BLAT • Multiple perfect Matches • Probability • Sensitivity • Specificity
BLAT • Multiple perfect Matches • Probability • Sensitivity • Specificity
BLAT • Clumping the hits • The hit list L is sorted by database coordinate. • The list L is split into buckets of size 64 kb each, based on the database coordinate. • Each bucket is sorted along the diagonal, i.e. hits are sorted by the value of database position minus query position. • Hits that are within the gap limit are grouped together into proto-clumps. • Hits within proto-clumps are then sorted by their database coordinate and put into real clumps • Clumps within 300 bp or 100 amino acids of each other in the database are merged
BLAT • Nucleotide alignment • A hit list is generated between the query sequence q and the homologous region h in the database, looking for smaller, perfect K-mers. • If a K-mer w in q matches multiple K-mers in h, then w is repeatedly extended by one until the match is unique or exceeds a certain size. • The hits are extended as far as possible, without mismatches • Overlapping hits are merged. • Then extensions using indels followed by matches are considered.