Repeats!
Introduction • A repeat family is a collection of repeats which appear multiple times in a genome. • Our objective is to identify all families of interspersed repeats within a single genome
Challenges when identifying repeat families • Regions containing repeat occurrences are not known a priori • Repeat boundaries are not known a priori • Many repeat occurrences appear as partial copies
Why are repeats important? • Repeats have been implicated in: • Genome rearrangements (Kazazian, 2004; Achaz et al., 2003) • Accelerated loss of gene order (Rocha et al., 2003) • Creation of novel biological functions (Lynch et al., 2002) • Increased rate of evolution under stress (Capy et al., 2000)
Identifying repeats de novo • Given a new genome about which we know nothing, we can: • Use a database of known repeats (RepeatMasker/RepBase) • novel repeat elements may not be in the database • repetitive gene families are never in the database • Identify repeats de novo using sequence analysis
Existing methods for detection of repeat families • Nearly all existing algorithms for de novo identification of repeat families rely on a set of pairwise similarities: • REPuter (Kurtz et al., 2000) • RepeatFinder (Volfovsky et al., 2001) • RECON (Bao and Eddy, 2002) • RepeatGluer (Pevzner et al., 2004) • PILER (Edgar and Myers, 2005) • RepeatScout (Price et al, 2005)
Mutational forces at play • Over time, indels & substitutions will affect copies of repeat families: • AGGCTACCCCTTTAGGCTAGGGGGGAGGCTATCTCTCCTAGGCTATTTTTTAGCCTATT • AGGCTGCCCCTTTAGGCTDGGGGGGAGGCTATCTCTCCTAGGCTATTTTTTAGCCTATT • AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCCTATT • AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCDTATT • AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCTATT • AGGCTGCCCCTTTAGGCTGGGGGGAGGCTCTCTCTCCTAGCCTATTTTTTAGCTATT • Require alignments (& gaps) to attempt to reconstruct true repeat boundaries
de novo repeat detection • One approach: self-search with a pairwise local-alignment tool such as BLAST • The number of pairwise alignments grows as O(r^2) in the copy number r of the repeat • Inherent difficulty defining repeat boundaries among collections of pairwise alignments
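To make the O(r^2) growth concrete, here is a minimal sketch that counts the unordered pairs among r repeat copies. The function name is illustrative and not taken from the slides.

```python
from math import comb


def pairwise_alignment_count(r: int) -> int:
    """Number of unordered pairs among r repeat copies: r * (r - 1) / 2."""
    return comb(r, 2)


if __name__ == "__main__":
    # The count grows quadratically with the copy number.
    for r in (10, 100, 1000):
        print(r, pairwise_alignment_count(r))
```

A family with 1000 copies already implies close to half a million pairwise alignments, each with its own boundaries to reconcile.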
Alternative methods? • Local multiple alignment • A single local multiple alignment uses O(N) space for a genome of length N • An example local multiple alignment: • AACAAGCA-A-ACTTTTATCCATGGTCGTGGTACAGAGGGGTC • AACAAGCA-A-ACTTTTGTCCATGGTCGTGGTACAGAGTGGTC • AACATGCAGA-ACTTTTATCCATGGTCGTCGTACAGAGGGGT- • AACAAGCAGACACTTTTATCCATGGTCGTGGTAC--------- • AACAAGCA----CTTTTATCCATAGTCGTGGTA---------- • ------------CTTTTATCCATGGTCGTGGTACAGAGGGGTC
Local multiple alignment • Local multiple alignment has the inherent potential to avoid pitfalls associated with pairwise alignment. • But multiple alignment under the sum-of-pairs (SP) objective function remains intractable… • Progressive alignment heuristics offer excellent speed and accuracy (e.g. MUSCLE). • So why not directly construct a multiple alignment?
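For readers unfamiliar with the SP (sum-of-pairs) objective, the sketch below scores an existing multiple alignment by summing a pairwise score over every pair of rows in every column. The match, mismatch and gap values are placeholder assumptions, not parameters from the slides. Scoring a given alignment is cheap; it is finding the alignment that maximises this score that is intractable.

```python
from itertools import combinations


def pair_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Score one pair of aligned characters (scoring values are assumptions)."""
    if a == "-" and b == "-":
        return 0  # two gaps aligned to each other contribute nothing
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_score(rows: list[str]) -> int:
    """Sum-of-pairs score of a multiple alignment given as equal-length rows."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all alignment rows must have the same length")
    total = 0
    for column in zip(*rows):
        for a, b in combinations(column, 2):
            total += pair_score(a, b)
    return total


if __name__ == "__main__":
    print(sp_score(["ACG", "A-G", "ATG"]))
```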
Steps 1-3: Chaining seeds from the Input Sequence • The method incorporated three novel ideas: (1) palindromic spaced seed patterns to match both DNA strands simultaneously (2) seed extension (chaining) in order of decreasing multiplicity, and (3) procrastination when low multiplicity matches are encountered.
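Idea (1) can be illustrated with a small sketch. If the seed pattern reads the same forwards and backwards, the word sampled from a window and the word sampled from that window's reverse complement are reverse complements of each other, so one canonical key indexes both strands. The pattern `1101011` and all function names below are hypothetical; the slides do not give the actual pattern.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(s: str) -> str:
    """Reverse complement of a DNA string."""
    return s.translate(COMPLEMENT)[::-1]


def seed_word(window: str, pattern: str) -> str:
    """Keep only the characters at the '1' positions of the spaced seed."""
    return "".join(c for c, p in zip(window, pattern) if p == "1")


def seed_key(window: str, pattern: str) -> str:
    """Canonical key shared by a window and its reverse complement."""
    word = seed_word(window, pattern)
    return min(word, revcomp(word))


def index_seeds(seq: str, pattern: str) -> dict[str, list[int]]:
    """Map each canonical seed key to the positions where it occurs."""
    if pattern != pattern[::-1]:
        raise ValueError("pattern must be palindromic to cover both strands")
    index: dict[str, list[int]] = {}
    width = len(pattern)
    for i in range(len(seq) - width + 1):
        index.setdefault(seed_key(seq[i:i + width], pattern), []).append(i)
    return index


if __name__ == "__main__":
    pattern = "1101011"  # hypothetical palindromic pattern
    seq = "ACGTTGA" + "TT" + revcomp("ACGTTGA")
    print(index_seeds(seq, pattern)[seed_key("ACGTTGA", pattern)])
```

Sorting the resulting index entries by list length would give the order of decreasing multiplicity mentioned in idea (2).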
Step 4: Gapped Extension • After chaining a seed match, we must perform gapped extension to approximate the true repeat boundaries • This is an essential step to consider, assuming we would like to improve repeat boundary predictions • But how can this be done efficiently?
Our approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA GTTGCGGCCCCTGAGGCACTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA CTTAAGGCCCCTGAGGATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGGTAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG CCCGAGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA TAAGCGGCCCCTGAGGCACTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC ATTGGGGCCCCTGAGGATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGCCAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG CCCACGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . GATTCGGCCCCGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC
HMM approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA GTTGCGGCCCCTGAGGCACTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA CTTAAGGCCCCTGAGGATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGGTAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG CCCGAGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA TAAGCGGCCCCTGAGGCACTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC ATTGGGGCCCCTGAGGATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGCCAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG CCCACGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . GATTCGGCCCCGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC Dynamically calculate extension window: l = 70 * e^(-0.01 * |Mi|); for |Mi| = 200, l ≈ 10
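The window formula on this slide can be evaluated directly. This is a minimal sketch: the function and parameter names are illustrative, and the constants 70 and 0.01 are the ones shown on the slide. The slide does not say how the value is rounded, so the function returns the unrounded result.

```python
import math


def extension_window(multiplicity: int, scale: float = 70.0, decay: float = 0.01) -> float:
    """Unrounded extension window l = scale * exp(-decay * |Mi|)."""
    return scale * math.exp(-decay * multiplicity)


if __name__ == "__main__":
    # Windows shrink as the number of aligned copies |Mi| grows.
    for m in (2, 50, 200):
        print(m, round(extension_window(m), 2))
```

For |Mi| = 200 this evaluates to about 9.47, which agrees with the slide's value of 10 if the result is rounded up. Shrinking the window for high-multiplicity matches keeps each multiple alignment call small when many copies are aligned at once.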
HMM approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA GTTGCGGCCCCTGAGGCACTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA CTTAAGGCCCCTGAGGATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGGTAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG CCCGAGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA TAAGCGGCCCCTGAGGCACTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC ATTGGGGCCCCTGAGGATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGCCAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG CCCACGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . GATTCGGCCCCGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC Use MUSCLE to perform alignment of extension window
HMM approach to gapped extension ACAAGGGCCC-TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA TACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA TTCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGAACGGCCC-TCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG AGGCCGGCCC-TGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA AACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC ATTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGAAAGCCCC-TGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG ATTCCGCCCC-TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . ATTCGGCCCC-CGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC Use HMM to detect & unalign unrelated sequence
HMM approach to gapped extension ACAAGGGCCC-TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA TACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA TTCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGAACGGCCC-TCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG AGGCCGGCCC-TGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA AACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC ATTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGAAAGCCCC-TGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG ATTCCGCCCC-TGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . ATTCGGCCCC-CGAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC Extension successful, continue extending
HMM approach to gapped extension ACAAGGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA ACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA TCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGAACGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG AGGCCGGCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA ACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC TTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG ATTCCGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . ATTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC
HMM approach to gapped extension ACAAGGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA ACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA TCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGAACGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG AGGCCGGCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA ACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC TTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG ATTCCGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . ATTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC Use HMM to detect & unalign unrelated sequence
HMM approach to gapped extension ACAAGGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA ACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA TCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGAACGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG AGGCCGGCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA ACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC TTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG ATTCCGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . ATTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC Finished leftward extension, now to the right…
HMM approach to gapped extension ACAAGGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGACTAGCAGCCA ACGAGCCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCCTTTCCTTTAATTTGACATGA TCATCCCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCCTTTCCAAGAGCCCCCGTAGC AGAACGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGAGACAGGATGGACG AGGCCGGCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCCTTTCCTTAAAAAAATTAAAA ACCCGCCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCCTTTCCAATTTGCTCTATGTC TTTTGCCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCCTTTCCGGCCCTTATAGGTAC GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCCTTTCCAAAGAGCGCCCGCGG ATTCCGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTCCGACCGAATTAATTAA . . . ATTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCCTTTCGTTTCCCCCCGGCCCC
HMM approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGA---GCAGCCA TACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGACA TTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAGAGCCCCCGT AGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCGAGACTAGGATGG CCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAATTA AACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCCAATTTGCTCTAT TTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCCGGCCCTTATAGG GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAAGAGCGCCCG CCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCCGACCGAATTAAT . . . -TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCGTTTCCCCCCGGC Perform MUSCLE alignment on window
HMM approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGA---GCAGCCA TACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGACA TTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAGAGCCCCCGT AGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCGAGACTAGGATGG CCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAATTA AACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCCAATTTGCTCTAT TTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCCGGCCCTTATAGG GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAAGAGCGCCCG CCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCCGACCGAATTAAT . . . -TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCGTTTCCCCCCGGC Use HMM to detect & unalign unrelated sequence
HMM approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGA---GCAGCCA TACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGACA TTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAGAGCCCCCGT AGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCGAGACTAGGATGG CCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAATTA AACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCCAATTTGCTCTAT TTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCCGGCCCTTATAGG GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAAGAGCGCCCG CCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCCGACCGAATTAAT . . . -TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCGTTTCCCCCCGGC Extension successful, continue extending
HMM approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCCGAGCAGCCACCA TACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGACA TTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAGAGCCCCCGT AGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCGAGACTAGGATGG CCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAATTA AACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCCAATTTGCTCTAT TTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCCGGCCCTTATAGG GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCCAAAGAGCGCCCG CCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCCGACCGAATTAAT . . . -TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCGTTTCCCCCCGGC Use MUSCLE to perform alignment of extension window
HMM approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCC---GAGCAGCCAC- TACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGA---- TTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCC----AAGAGCCCCC AGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCG---AGACTAGGAT- CCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAAT---- AACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCC---AATTTGCTCT- TTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCC----GGCCCTTATA GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCC---AAAGAGCGCC- CCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCC----GACCGAATTA . . . -TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCG----TTTCCCCCCG Use HMM to detect & unalign unrelated sequence
HMM approach to gapped extension TAGTTGGCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCCTTTTTTCC---GAGCAGCCAC- TACGAGCCCCTGAGGCA-CTCTTTAACGT-TA-CTGGTCC---TTTCCTTTAATTTGA---- TTCATGCCCCTGAGG-ATCTCTTTAACGTTTCTCTGGTCC---TTTCC----AAGAGCCCCC AGAAAGGCCCTCAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCG---AGACTAGGAT- CCGATGCCCCTGAGGTATCTCTTTAACGTTTC-CTGGTCC---TTTCCTTAAAAAAAT---- AACCCGCCCCTGAGGCA-CTCTTTAACGTTTA-CTGGTCC---TTTCC---AATTTGCTCT- TTTTTGCCCCTGAGG-ATCTCTTTAACGTTTA-CTGGTCC---TTTCC----GGCCCTTATA GGAAAGCCCCTGAGGCATCTCTTTAACGTTTCTCTGGTCC---TTTCC---AAAGAGCGCC- CCTATGCCCCTGAGGCATCTCTTTAACGTTTC-CTGGTCC---TTTCC----GACCGAATTA . . . -TTCGGCCCC-GAGGCATCTCTTTAACGTTTA-CTGGTCC---TTTCG----TTTCCCCCCG Extension failed, stop extending
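The sequence of slides above (take a window, align it, let the HMM trim unrelated columns, then continue or stop) can be summarised as a loop. The sketch below is only a schematic under strong assumptions: `align` and `homologous_columns` are hypothetical callables standing in for the MUSCLE call and the HMM step, and the toy stand-ins exist only to make the example runnable. It also advances by ungapped window length, which a real implementation would need to handle more carefully.

```python
from typing import Callable, List


def extend_one_direction(
    flanks: List[str],
    window: int,
    align: Callable[[List[str]], List[str]],
    homologous_columns: Callable[[List[str]], int],
) -> int:
    """Return how many alignment columns were accepted before extension failed."""
    accepted, offset = 0, 0
    while True:
        chunk = [f[offset:offset + window] for f in flanks]
        if not all(chunk):
            break  # at least one flanking sequence is exhausted
        aligned = align(chunk)              # stand-in for the multiple aligner
        kept = homologous_columns(aligned)  # stand-in for the HMM trimming step
        accepted += kept
        if kept < len(aligned[0]):
            break  # unrelated sequence reached: extension failed, stop
        offset += window                    # extension successful, continue
    return accepted


def toy_align(rows: List[str]) -> List[str]:
    """Toy aligner: returns the rows unchanged (no gaps introduced)."""
    return rows


def toy_homologous_columns(rows: List[str]) -> int:
    """Toy detector: counts leading columns in which all rows agree."""
    count = 0
    for column in zip(*rows):
        if len(set(column)) != 1:
            break
        count += 1
    return count


if __name__ == "__main__":
    flanks = ["ACGTACGTTT", "ACGTACGTAA", "ACGTACGTCC"]
    print(extend_one_direction(flanks, 4, toy_align, toy_homologous_columns))
```

The same loop would be run once leftward and once rightward from the chained seed match, as the slides do.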
Wait a moment... • The MUSCLE alignment software reports the highest-scoring global multiple alignment of the input sequences, regardless of common ancestry. • As a result, this method is likely to forcibly align unrelated sequence. • We therefore use an HMM to detect alignments of unrelated sequence.
Step 5: detecting unrelated sequence • The HMM consists of two hidden states, Homologous and Unrelated. • The observable states are the pairwise alignment columns, which are all possible pairs in {A,G,C,T,-} with strand and species symmetry • i.e. AG=GA=TC=CT. • The emission probabilities for each possible pair of aligned nucleotides were extracted from the HOXD substitution matrix presented by Chiaromonte et al.
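The symmetry rule above (AG=GA=TC=CT) amounts to mapping each pairwise column to one canonical representative. A minimal sketch follows, assuming that species symmetry means swapping the two rows and strand symmetry means complementing both characters. The function name is illustrative.

```python
COMP = {"A": "T", "C": "G", "G": "C", "T": "A", "-": "-"}


def canonical_column(x: str, y: str) -> str:
    """Canonical symbol for the aligned pair (x, y) under both symmetries."""
    cx, cy = COMP[x], COMP[y]
    # Candidates: original, rows swapped, complemented, complemented and swapped.
    return min(x + y, y + x, cx + cy, cy + cx)


if __name__ == "__main__":
    print({canonical_column(x, y) for x, y in ("AG", "GA", "TC", "CT")})
```

Collapsing columns this way reduces the number of emission parameters the HMM needs for each hidden state.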
• Compute emission frequencies for the Unrelated state (U) of our HMM using the background frequencies of G/C and A/T, assuming strand and species symmetry: • U_AA = U_AT = U_TA = U_TT = (f_AT/2) * (f_AT/2) • U_CC = U_CG = U_GC = U_GG = (f_GC/2) * (f_GC/2) • U_AC = U_AG = U_TC = U_TG = (f_AT/2) * (f_GC/2) • U_CA = U_CT = U_GA = U_GT = (f_GC/2) * (f_AT/2)
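As a worked instance of these formulas, the sketch below builds the Unrelated-state emission table for ungapped columns from a single G+C fraction. The function name is illustrative, and the value 0.48 is used only because a 48% G+C region is mentioned elsewhere in the slides.

```python
def unrelated_emissions(f_gc: float) -> dict[str, float]:
    """Emission frequency of each ungapped nucleotide pair in the Unrelated state."""
    f_at = 1.0 - f_gc
    # Under strand symmetry, A and T each get f_AT / 2; C and G each get f_GC / 2.
    base = {"A": f_at / 2, "T": f_at / 2, "C": f_gc / 2, "G": f_gc / 2}
    # Unrelated positions are treated as independent draws from the background.
    return {x + y: base[x] * base[y] for x in "ACGT" for y in "ACGT"}


if __name__ == "__main__":
    table = unrelated_emissions(0.48)
    print(round(table["AA"], 4), round(table["CC"], 4), round(table["AC"], 4))
```

The sixteen values sum to 1 over ungapped columns; gap columns are covered separately by the gap-open and gap-extend estimates.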
• To empirically estimate gap-open and gap-extend values for the unrelated state, align a 10-kb, 48% G+C content region taken from E. coli CFT073 (Accession AF447814.1, coordinates 37,300-38,300) with an unrelated sequence.
• We aligned the unrelated sequences with MUSCLE and counted the number of gap-open and gap-extend columns in the resulting alignment.
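Counting gap-open and gap-extend columns in a pairwise alignment could look like the sketch below. It assumes that a gap column counts as an extension only when the previous column had a gap in the same row. That convention and the function name are assumptions, not details taken from the slides.

```python
def count_gap_columns(row_a: str, row_b: str) -> tuple[int, int]:
    """Return (gap_open, gap_extend) column counts for a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    opens = extends = 0
    prev = None  # row that carried a gap in the previous column, if any
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            continue  # all-gap columns carry no information; skip them
        cur = "a" if x == "-" else "b" if y == "-" else None
        if cur is not None:
            if cur == prev:
                extends += 1
            else:
                opens += 1
        prev = cur
    return opens, extends


if __name__ == "__main__":
    print(count_gap_columns("AC--GT-A", "A-TTGTCA"))
```

Dividing these counts by the total number of columns gives empirical gap-open and gap-extend frequencies for the state being estimated.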
• Gap-open and gap-extend frequencies for the homologous state were estimated by constructing an alignment of 10 kb of orthologous sequence shared among a pair of divergent organisms.
[Figure: two-state HMM (H = Homologous, U = Unrelated) with an example state path that labels alignment columns as U or H]
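A state path that labels each alignment column as U or H can be obtained by decoding a two-state HMM. The sketch below uses Viterbi decoding, which is one standard choice; the slides do not say which decoding method is used. Every probability in the example, and the reduction of columns to match (m) or mismatch (x), is a made-up placeholder rather than the HOXD-derived values.

```python
import math


def viterbi(obs, states, start, trans, emit) -> str:
    """Most probable state path for obs, computed in log space."""
    if not obs:
        return ""
    scores = {s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}
    backpointers = []
    for o in obs[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            # Best predecessor state for arriving in state s at this column.
            best = max(states, key=lambda p: scores[p] + math.log(trans[p][s]))
            new_scores[s] = scores[best] + math.log(trans[best][s]) + math.log(emit[s][o])
            pointers[s] = best
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))


if __name__ == "__main__":
    states = ("H", "U")
    start = {"H": 0.5, "U": 0.5}                              # placeholder values
    trans = {"H": {"H": 0.9, "U": 0.1}, "U": {"H": 0.1, "U": 0.9}}
    emit = {"H": {"m": 0.9, "x": 0.1}, "U": {"m": 0.25, "x": 0.75}}
    print(viterbi("xxxxxxmmmmmmmm", states, start, trans, emit))
```

Columns labelled U at the edge of an extension window are the ones that would be unaligned before deciding whether to keep extending.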