450 likes | 463 Views
This article outlines the process of spectrum-based de novo repeat detection in genomic sequences, exploring the definition, function, and importance of repeats in studying genomic data. The SAGRI algorithm is introduced as a method for repeat finding, and the spectrum-based repeat finder approach is discussed.
E N D
Spectrum-based de novo repeat detection in genomic sequences Do Huy Hoang
Outline • Introduction • What is a repeat? • Why studying repeats? • Related work • SAGRI • Algorithm • Analysis • Evaluation
What is a repeat? (Definition) • [General]: Nucleotide sequences occurring multiply within a genome • [CompBio]: Given a genome sequence S, find a string P which occurs at least twice in S (allowing some errors).
What is a repeat? (Function) • Motifs • Very short repeats (10-20bp) • Transcription factor binding sites • Long and Short interspersed elements (SINE, LINE) • Jumping genes • Genes and Pseudogenes • Tandem repeats • Simple short sequence repeats An, CGGn
Why studying repeats? (1) • Eukaryotic genomes contain a lot of repeats • E.g. Human genome contains 50% repeats. • Repeats are believed to play an important role in evolution and disease. • E.g. Alu elements are particularly prone to recombination. Insertion of Alu repeats inactivate genes in patient with hemophilia and neurofibromatosis (Kazazian, 1998; Deininger and Batzer, 1999) • Repeats are important to chromatin structure. • Most TEs in mammals seem to be silenced by methylation. Alu sequences are major target for histone H3-Lys9 methylation in humans (Kondo and Issa, 2003). • It is known that heterochromatin have a lot of SINE and LINE repeats.
Why studying repeats? (2) • Repeats complicated sequence assembly and genome comparison • Many people remove repeats before they analyze the genome. • Repeats set hurdles on microarray probe signal analysis • The probe signal may be inaccurate if the probe sequence overlap with repeat regions. • Repeats may contribute to human diversity more than genes. • Repeats can be used as DNA fingerprint
Steps in Repeat finding • Repeat library (RepeatMasker) • De-novo repeat discovery (two steps): • Identification of repeats • Classification of repeats
Algorithm outline • Input: a text G • FindHitphase: finds all candidate of second occurrence of repeat regions • ACGACGCGATTAACCCTCGACGTGATCCTC • Validation phase: uses hits from phase 1 to find all pairs of repeats • ACGACGCGATTAACCCTCGACGTGATCCTC
Spectrum-based repeat finder • What is a spectrum? • Given a string G, its spectrum is the set of all k-mers. E.g. k=3, G= ACGACGCTCACCCT The spectrum is • ACC, ACG, CAC, CCC, CCT, CGA, CGC, CTC, GAC, GCT, TCA • CTC is a k-mer occurring at position 7. • ACG is a k-mer occurring at positions 1, 4.
Observation 1: How to find candidate regions containing repeats? • Two regions of repeats should share some k-mers. • E.g. the following repeats share CGA. ACGACGCGATTAACCCTCGACGTGATCCTC
Feasible extension (bud) i S = ACGACGTGATTAACCCTCGACGTGATCCTC • Given the spectrum S for G[1..i-1]: i CGA A X C Feasible extensions! G X T Note: T is called a fooling probe!
Observation 2 • A path of feasible extensions may be a repeat. • Example: S = ACGACGCTATCGATGCCCTC Spectrum S for G[1..10] is ACG, CGA, CGC, CTA, GAC, GCT, TAT Starting from position 11, there exists a path of feasible extensions: CGA-C-G-C This path corresponds to a length-6 substring in position 2. Also, this path has one mismatch compare with the length-6 substring for position 11 (CGATGC). 11
Phase 1: FindHit() Algorithm: Input: a text G • Initialize the empty spectrum S • For i = 1 to n /* we maintain the variant that S is a spectrum for G[1..i-1] */ • Let x be the k-mer at position i • If x exists in S, run DetectRepSeq(S,i); • Insert x into S • Note: DetectRepSeq(S,i) looks for repeat occurring at position i.
AAC AAG ACC ACG AGT ATT CCC CCT CGA CTC GAA GAT GTG TAA TCG TGA TTA ACGAAGTGATTAACCCTCGACGCGATCC 1 2 … … 18 19 20 21 22 23 24 25 26 27 28 18 19 20 21 22 23 24 25 26 27 28 CGA C G C G A T C T Ref Curr CGA DetectRepSeg(S(18), 18)
AAC AAG ACC ACG AGT ATT CCC CCT CGA CTC GAA GAT GTG TAA TCG TGA TTA ACGAAGTGATTAACCCTCGACGCGATCC 1 2 … … 18 19 20 21 22 23 24 25 26 27 28 18 19 20 21 22 23 24 25 26 27 28 CGA C G C G A T C T Ref Curr CGA-T1 DetectRepSeg(S(18), 18)
AAC AAG ACC ACG AGT ATT CCC CCT CGA CTC GAA GAT GTG TAA TCG TGA TTA ACGAAGTGATTAACCCTCGACGCGATCC 1 2 … … 18 19 20 21 22 23 24 25 26 27 28 18 19 20 21 22 23 24 25 26 27 28 CGA C G C G A T C T Ref Curr CGA-T1-T2-A3* A1-G1-T2-G2-A2-T2-T3* C2-C2-C3* G3* DetectRepSeg(S(18), 18)
AAC AAG ACC ACG AGT ATT CCC CCT CGA CTC GAA GAT GTG TAA TCG TGA TTA ACGAAGTGATTAACCCTCGACGCGATCC 1 2 … … 18 19 20 21 22 23 24 25 26 27 28 18 19 20 21 22 23 24 25 26 27 28 CGA C G C G A T C T Ref Curr CGA-T1-T2-A3* A1-G1-T2-G2-A2-T2-T3* C2-C2-C3* G3* DetectRepSeg(S(18), 18)
AAC AAG ACC ACG AGT ATT CCC CCT CGA CTC GAA GAT GTG TAA TCG TGA TTA ACGAAGTGATTAACCCTCGACGCGATCC 1 2 … … 18 19 20 21 22 23 24 25 26 27 28 18 19 20 21 22 23 24 25 26 27 28 CGA C G C G A T C T Ref Curr CGA-T1-T2-A3* A1-G1-T2-G2-A2-T2-T3* C2-C2-C3* G3* DetectRepSeg(S(18), 18)
AAC AAG ACC ACG AGT ATT CCC CCT CGA CTC GAA GAT GTG TAA TCG TGA TTA ACGAAGTGATTAACCCTCGACGCGATCC 1 2 … … 18 19 20 21 22 23 24 25 26 27 28 18 19 20 21 22 23 24 25 26 27 28 CGA C G C G A T C T Ref Curr CGA-T1-T2-A3* A1-G1-T2-G2-A2-T2-T3* C2-C2-C3* G3* DetectRepSeg(S(18), 18)
AAC AAG ACC ACG AGT ATT CCC CCT CGA CTC GAA GAT GTG TAA TCG TGA TTA ACGAAGTGATTAACCCTCGACGCGATCC 1 2 … … 18 19 20 21 22 23 24 25 26 27 28 18 19 20 21 22 23 24 25 26 27 28 CGA C G C G A T C T Ref Curr CGA-T1-T2-A3* A1-G1-T2-G2-A2-T2-T3* C2-C2-C3* G3* DetectRepSeg(S(18), 18)
AAC AAG ACC ACG AGT ATT CCC CCT CGA CTC GAA GAT GTG TAA TCG TGA TTA ACGAAGTGATTAACCCTCGACGCGATCC 1 2 … … 18 19 20 21 22 23 24 25 26 27 28 18 19 20 21 22 23 24 25 26 27 28 CGA C G C G A T C T Ref Curr CGA-T1-T2-A3* A1-G1-T2-G2-A2-T2-T3* C2-C2-C3* G3* DetectRepSeg(S(18), 18)
Other details • Extend backward • Stop backtracking after h steps
Validation phase • Decompose hits into set of k-mer and index all the locations of these k-mers. • Scan for each pair of locations of a k-mer w in the hits, do BLAST extension • Use some auxiliary data structure to avoid double checking • Report the pairs whose length exceed our threshold
Analysis • How to find most repeats? • Avoid false negative • How to get better speed? • Avoid false positive
How do we choose k? (1) • If k is too big, • k-mer is too specific and we may miss some repeat • If k is too small, • k-mer cannot help us to differentiate repeat from non-repeat • For repeat of length 50 and similarity>0.9, • we found that k log4n+2 is good enough.
m 0 How do we choose k? (2) • A random k-mer match with one of n chosen k-mer • Pr(a k-mer re-occurs by random in a sequence of length n) • (analog to throwing n balls into 4k bins) • 1-(1 – 4-k)m 1 – exp(-m/4k). • We requires 1-exp(-n/4k)1, • hence, k log4n + log41. • If we set 1=1/16, k log4n + 2
X X x1 x2 Xm+1 L The occurrence of false negative (missed repeat) (1) • A pair of repeats of length L, with m mismatches • Probability of a preserved k-mer in repeat is • M is the number of nonnegative integer solutions • to Subject to
The occurrence of false negative (missed repeat) (2) • It is easy to see that M is the coefficient of xL−m in • Hence
Criterion for path termination (1) • Instead of fixing the number of mismatches, we may want to fixed the percentage of mismatches, says, 10%. • Then, the pruning strategy is length dependent. • If the length of strings in is r, we allow (r) mismatches.
Criterion for path termination (2) • Let q be the mismatch probability and r be the length of the string. • Prob that a string has s mismatches = • For a threshold (says, 0.01), we set • (r) = max {2 s r-2 | Pq(s) > } + 2
Control of false positives (1) • Two typical cases • The probability of (case 1)/ (case 2) is 2*4- • P(case1 or case2) is small • For example: 4 errors, q=0.1, k = 12, P(case 1) = 1.77 * 10-8
Evaluation Compare with other programs
Programs • EulerAlign by Zhang and Waterman • PALS by Edgar and Myers • REPuter by Kurtz et al. • SARGRI
Measurement • Count Ratio (CR): the ratio of number of pairs of repeat share more than 50% with a reference pair to the number of reference pairs. • Shared Repeat Region (SRR): the ratio of the found region to the reference region.
Simulated data • Conclusion from simulated data • The result is consistent with the analysis
Genome data • M.gen (0.6 Mbp) • Organism with the smallest genome • Lives in the primate genital and respiratory tracts • C.tra (1 Mbp) • Live inside the cells of humans • A.ful (2.1 Mbp) • Found in high-temperature oil fields • E.coli (4 Mbp) • An import bacteria live inside lower intestines of mammals • Human chr22 p20M to p21M (1Mbp)
Use CR and SRR ratio to measure • Cross validation • G/H=1, H/G<1 G “outperforms” H • G/H<1, H/G=1 H “outperforms” G • G/H<1, H/G<1 G, H are complementary • G/H=1, H/G=1 G, H are similar
H. H. Do, K. P. Choi, F. P. Preparata, W. K. Sung, L. Zhang. Spectrum-based de novo repeat detection in genomic sequences. Journal of Computational Biology, 15(5):469-487, June 2008