Repeats!


Presentation Transcript


  1. Repeats!

  2. Introduction • A repeat family is a collection of repeats which appear multiple times in a genome. • Our objective is to identify all families of interspersed repeats within a single genome

  3. Challenges when identifying repeat families • Regions containing repeat occurrences are not known a priori • Repeat boundaries are not known a priori • Many repeat occurrences appear as partial copies

  4. Why are repeats important? • Repeats have been implicated in: • Genome rearrangements (Kazazian, 2004; Achaz et al., 2003) • Accelerated loss of gene order (Rocha et al., 2003) • Creation of novel biological functions (Lynch et al., 2002) • Increased rate of evolution under stress (Capy et al., 2000)

  5. Identifying repeats de novo • Given a new genome about which we know nothing, we can: • Use a database of known repeats (RepeatMasker/RepBase) • novel repeat elements may not be in the database • repetitive gene families are never in the database • Identify repeats de novo using sequence analysis

  6. Existing methods for detection of repeat families • Nearly all existing algorithms for de novo identification of repeat families rely on a set of pairwise similarities: • REPuter (Kurtz et al., 2000) • RepeatFinder (Volfovsky et al., 2001) • RECON (Bao and Eddy, 2002) • RepeatGluer (Pevzner et al., 2004) • PILER (Edgar and Myers, 2005) • RepeatScout (Price et al., 2005)

  7. Mutational forces at play • Over time, indels & substitutions will affect copies of repeat families: • AGGCTACCCCTTTAGGCTAGGGGGGAGGCTATCTCTCCTAGGCTATTTTTTAGCCTATT • AGGCTGCCCCTTTAGGCTDGGGGGGAGGCTATCTCTCCTAGGCTATTTTTTAGCCTATT • AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCCTATT • AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCDTATT • AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCTATT • AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCTATT • Require alignments (& gaps) to attempt to reconstruct true repeat boundaries

  8. de novo repeat detection • One approach: self-search with a pairwise local-alignment tool such as BLAST • Number of pairwise alignments grows as O(r^2) in the copy number r of the repeat • Inherent difficulty defining repeat boundaries among collections of pairwise alignments
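To make the quadratic growth concrete, here is a minimal Python sketch. It assumes (as a simplification not stated on the slide) that every unordered pair of copies of a repeat yields exactly one pairwise local alignment; the function name is illustrative.

```python
def pairwise_alignment_count(copy_number: int) -> int:
    """Number of unordered pairs among r copies of a repeat: r * (r - 1) / 2."""
    return copy_number * (copy_number - 1) // 2


if __name__ == "__main__":
    # Under the one-alignment-per-pair assumption, the count grows quadratically.
    for r in (10, 100, 1000):
        print(r, pairwise_alignment_count(r))
```

A family with 1000 copies would already imply roughly half a million pairwise alignments, which is the motivation for building a single multiple alignment per family instead.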

  9. An example local multiple alignment: • AACAAGCA-A-ACTTTTATCCATGGTCGTGGTACAGAGGGGTC • AACAAGCA-A-ACTTTTGTCCATGGTCGTGGTACAGAGTGGTC • AACATGCAGA-ACTTTTATCCATGGTCGTCGTACAGAGGGGT- • AACAAGCAGACACTTTTATCCATGGTCGTGGTAC--------- • AACAAGCA----CTTTTATCCATAGTCGTGGTA---------- • ------------CTTTTATCCATGGTCGTGGTACAGAGGGGTC Alternative methods? • Local multiple alignment A single local multiple alignment uses O(N) space for a genome of length N

  10. Local multiple alignment • Local multiple alignment has the inherent potential to avoid pitfalls associated with pairwise alignment. • But multiple alignment under the SP (sum-of-pairs) objective function remains intractable… • Progressive alignment heuristics offer excellent speed and accuracy (e.g. MUSCLE). • So why not directly construct a multiple alignment?

  11. Steps 1-3: Chaining seeds from the Input Sequence • The method incorporated three novel ideas: (1) palindromic spaced seed patterns to match both DNA strands simultaneously (2) seed extension (chaining) in order of decreasing multiplicity, and (3) procrastination when low multiplicity matches are encountered.
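Ideas (1) and (2) can be illustrated with a rough Python sketch; this is not the authors' implementation. The pattern `1101011`, the helper names, and the canonical-key trick are assumptions made for illustration. Because the pattern reads the same in both directions, masking the reverse complement of a window gives the reverse complement of the masked window, so one pass over the forward strand groups occurrences from both strands. Procrastination (idea 3) is not shown.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def apply_pattern(window, pattern):
    """Keep bases at '1' positions of the pattern; mask the rest with '.'."""
    return "".join(b if p == "1" else "." for b, p in zip(window, pattern))


def seed_matches(genome, pattern="1101011"):
    """Group window positions by strand-independent seed, most frequent first."""
    if pattern != pattern[::-1]:
        raise ValueError("pattern must be palindromic")
    k = len(pattern)
    groups = defaultdict(list)
    for i in range(len(genome) - k + 1):
        window = genome[i:i + k]
        fwd = apply_pattern(window, pattern)
        rev = apply_pattern(reverse_complement(window), pattern)
        key = min(fwd, rev)  # same key whichever strand the copy lies on
        groups[key].append((i, "+" if key == fwd else "-"))
    # Process seed matches in order of decreasing multiplicity.
    return sorted(groups.items(), key=lambda kv: -len(kv[1]))


if __name__ == "__main__":
    for seed, hits in seed_matches("ACAGGTGT", pattern="101"):
        print(seed, hits)
```

In the small example, `ACA` at position 0 and its reverse complement `TGT` at position 5 fall into the same group, so that seed has multiplicity 2 and is handled first.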

  12. Step 4: Gapped Extension • After chaining a seed match, we must perform gapped extension to approximate the true repeat boundaries • This is an essential step to consider, assuming we would like to improve repeat boundary predictions • But how can this be done efficiently?

  13. Our approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA GTTGCGGCCCCTGAGGCACTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA CTTAAGGCCCCTGAGGATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGGTAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG CCCGAGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA TAAGCGGCCCCTGAGGCACTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC ATTGGGGCCCCTGAGGATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGCCAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG CCCACGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . GATTCGGCCCCGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC

  14. HMM approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA GTTGCGGCCCCTGAGGCACTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA CTTAAGGCCCCTGAGGATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGGTAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG CCCGAGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA TAAGCGGCCCCTGAGGCACTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC ATTGGGGCCCCTGAGGATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGCCAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG CCCACGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . GATTCGGCCCCGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC Dynamically calculate the extension window length: l = 70 * e^(-0.01 * |Mi|). Example: |Mi| = 200 gives l ≈ 10
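The window calculation on this slide can be sketched in a few lines of Python. The constants 70 and 0.01 come from the slide; reading |Mi| as the multiplicity of the match and rounding up to a whole number of columns are assumptions made here so that the slide's example (|Mi| = 200, l = 10) is reproduced.

```python
import math


def extension_window(multiplicity, scale=70.0, decay=0.01):
    """Extension window length l = scale * exp(-decay * |Mi|), rounded up.

    Rounding up is an assumption; the slide only states the formula and
    one example value.
    """
    return math.ceil(scale * math.exp(-decay * multiplicity))


if __name__ == "__main__":
    print(extension_window(200))  # 70 * e^-2 is about 9.47, rounded up to 10
```

The exponential decay means high-multiplicity matches are extended in short steps, which keeps each multiple alignment of the window small when many sequences are involved.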

  15. HMM approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA GTTGCGGCCCCTGAGGCACTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA CTTAAGGCCCCTGAGGATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGGTAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG CCCGAGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA TAAGCGGCCCCTGAGGCACTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC ATTGGGGCCCCTGAGGATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGCCAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG CCCACGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . GATTCGGCCCCGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC Use MUSCLE to perform alignment of extension window

  16. HMM approach to gapped extension ACAAGGGCCC-TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA TACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA TTCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGAACGGCCC-TCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG AGGCCGGCCC-TGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA AACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC ATTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGAAAGCCCC-TGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG ATTCCGCCCC-TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . ATTCGGCCCC-CGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC Use HMM to detect & unalign unrelated sequence

  17. HMM approach to gapped extension ACAAGGGCCC-TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA TACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA TTCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGAACGGCCC-TCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG AGGCCGGCCC-TGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA AACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC ATTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGAAAGCCCC-TGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG ATTCCGCCCC-TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . ATTCGGCCCC-CGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC Extension successful, continue extending

  18. HMM approach to gapped extension ACAAGGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA ACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA TCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGAACGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG AGGCCGGCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA ACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC TTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG ATTCCGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . ATTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC

  19. HMM approach to gapped extension ACAAGGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA ACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA TCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGAACGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG AGGCCGGCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA ACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC TTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG ATTCCGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . ATTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC Use HMM to detect & unalign unrelated sequence

  20. HMM approach to gapped extension ACAAGGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA ACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA TCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGAACGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG AGGCCGGCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA ACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC TTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG ATTCCGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . ATTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC Finished leftward extension, now to the right…

  21. HMM approach to gapped extension ACAAGGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA ACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA TCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGAACGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG AGGCCGGCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA ACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC TTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG ATTCCGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . ATTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC

  22. HMM approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGA---GCAGCCA TACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGACA TTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAGAGCCCCCGT AGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCGAGACTAGGATGG CCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAATTA AACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCCAATTTGCTCTAT TTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCCGGCCCTTATAGG GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAAGAGCGCCCG CCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCCGACCGAATTAAT . . . -TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCGTTTCCCCCCGGC Perform MUSCLE alignment on window

  23. HMM approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGA---GCAGCCA TACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGACA TTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAGAGCCCCCGT AGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCGAGACTAGGATGG CCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAATTA AACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCCAATTTGCTCTAT TTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCCGGCCCTTATAGG GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAAGAGCGCCCG CCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCCGACCGAATTAAT . . . -TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCGTTTCCCCCCGGC Use HMM to detect & unalign unrelated sequence

  24. HMM approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGA---GCAGCCA TACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGACA TTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAGAGCCCCCGT AGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCGAGACTAGGATGG CCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAATTA AACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCCAATTTGCTCTAT TTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCCGGCCCTTATAGG GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAAGAGCGCCCG CCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCCGACCGAATTAAT . . . -TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCGTTTCCCCCCGGC Extension successful, continue extending

  25. HMM approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGAGCAGCCACCA TACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGACA TTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAGAGCCCCCGT AGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCGAGACTAGGATGG CCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAATTA AACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCCAATTTGCTCTAT TTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCCGGCCCTTATAGG GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAAGAGCGCCCG CCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCCGACCGAATTAAT . . . -TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCGTTTCCCCCCGGC Use MUSCLE to perform alignment of extension window

  26. HMM approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCC---GAGCAGCCAC- TACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGA---- TTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCC----AAGAGCCCCC AGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCG---AGACTAGGAT- CCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAAT---- AACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCC---AATTTGCTCT- TTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCC----GGCCCTTATA GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCC---AAAGAGCGCC- CCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCC----GACCGAATTA . . . -TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCG----TTTCCCCCCG Use HMM to detect & unalign unrelated sequence

  27. HMM approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCC---GAGCAGCCAC- TACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGA---- TTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCC----AAGAGCCCCC AGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCG---AGACTAGGAT- CCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAAT---- AACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCC---AATTTGCTCT- TTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCC----GGCCCTTATA GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCC---AAAGAGCGCC- CCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCC----GACCGAATTA . . . -TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCG----TTTCCCCCCG Extension failed, stop extending

  28. Wait a moment... • The MUSCLE alignment software reports a high-scoring global multiple alignment of the input sequences, regardless of common ancestry. • As a result, it is likely that this method forcibly aligns unrelated sequence. • Solution: use an HMM to detect alignments of unrelated sequence.

  29. Step 5: detecting unrelated sequence • The HMM consists of two hidden states, Homologous and Unrelated. • The observable states are the pairwise alignment columns, which are all possible pairs in {A,G,C,T,-} with strand and species symmetry • i.e. AG=GA=TC=CT. • The emission probabilities for each possible pair of aligned nucleotides were extracted from the HOXD substitution matrix presented by Chiaromonte et al.
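The strand and species symmetry can be sketched as a small canonicalization function. This is an illustrative Python sketch, not the authors' code: the function name and the choice of the lexicographically smallest member as class representative are assumptions. Swapping the two rows models species symmetry, and complementing both bases models strand symmetry.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "-": "-"}


def canonical_column(a, b):
    """Map a pairwise alignment column to one representative of its symmetry class.

    The class of (a, b) contains the column itself, the row-swapped column
    (species symmetry), and the complemented versions of both (strand symmetry).
    """
    ca, cb = COMPLEMENT[a], COMPLEMENT[b]
    return "".join(min((a, b), (b, a), (ca, cb), (cb, ca)))


if __name__ == "__main__":
    # AG, GA, TC and CT are one observable symbol, as on the slide.
    print({canonical_column(*col) for col in ("AG", "GA", "TC", "CT")})
```

Collapsing columns this way reduces the number of emission parameters the HMM needs, since each symmetry class shares a single probability.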

  30. Compute emission frequencies for the Unrelated state of our HMM using the background frequencies of G/C and A/T, assuming strand and species symmetry: • U_AA = U_AT = U_TA = U_TT = (f_AT/2) * (f_AT/2) • U_CC = U_CG = U_GC = U_GG = (f_GC/2) * (f_GC/2) • U_AC = U_AG = U_TC = U_TG = (f_AT/2) * (f_GC/2) • U_CA = U_CT = U_GA = U_GT = (f_GC/2) * (f_AT/2)
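A minimal Python sketch of these emission frequencies follows. It assumes the G+C fraction is the only input and that each base in a pair is drawn independently from the background, which is what the products on the slide express; gap columns are not covered, and the function name is illustrative.

```python
def unrelated_emissions(gc_content):
    """Emission frequency of each aligned nucleotide pair in the Unrelated state.

    Each base is drawn independently from the background: A and T each have
    frequency f_AT / 2, and G and C each have frequency f_GC / 2.
    """
    f_gc = gc_content
    f_at = 1.0 - gc_content
    base_freq = {"A": f_at / 2, "T": f_at / 2, "G": f_gc / 2, "C": f_gc / 2}
    return {a + b: base_freq[a] * base_freq[b] for a in "ACGT" for b in "ACGT"}


if __name__ == "__main__":
    emissions = unrelated_emissions(0.48)
    print(emissions["AA"], emissions["CG"], emissions["AG"])
```

The 16 values sum to 1 by construction, and symmetric columns such as AG and TC receive identical frequencies.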

  31. To empirically estimate gap-open and extend values for the unrelated state, we aligned a 10-kb, 48% G+C content region taken from E. coli CFT073 (Accession AF447814.1, coordinates 37,300-38,300) with an unrelated sequence.

  32. The alignment was computed with MUSCLE, and we counted the number of gap-open and gap-extend columns in the resulting alignment of unrelated sequences.

  33. Gap-open and extend frequencies for the homologous state were estimated by constructing an alignment of 10 kb of orthologous sequence shared among a pair of divergent organisms.

  34. [Figure: HMM diagram with states H and U and a 0.5 label; example state path showing a run of U columns followed by a run of H columns]
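A state path like the one pictured (a run of U followed by a run of H) is what Viterbi decoding of a two-state HMM would produce. The Python sketch below is a toy version under loud assumptions: columns are reduced to three symbols (M = match, X = mismatch, G = gap) instead of the full pair alphabet, and every probability in it is hypothetical rather than taken from the slides, including the 0.5 start probabilities.

```python
import math

STATES = ("H", "U")
# All numbers below are hypothetical and chosen for illustration only.
START = {"H": 0.5, "U": 0.5}
TRANS = {"H": {"H": 0.95, "U": 0.05}, "U": {"H": 0.05, "U": 0.95}}
EMIT = {
    "H": {"M": 0.80, "X": 0.15, "G": 0.05},
    "U": {"M": 0.25, "X": 0.50, "G": 0.25},
}


def viterbi(columns):
    """Return the most probable H/U state path for a string of M/X/G columns."""
    if not columns:
        return ""
    score = {s: math.log(START[s]) + math.log(EMIT[s][columns[0]]) for s in STATES}
    back = []  # back[t][s] = best predecessor of state s at column t + 1
    for col in columns[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][col]))
            pointers[s] = prev
        back.append(pointers)
        score = new_score
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    print(viterbi("XGXXGXXX" + "M" * 20))
```

Columns labeled U in the decoded path would be the ones to unalign, and an extension whose window decodes mostly to U would count as failed.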