GTCAG ATGAGCAAAGTAGACACTCCAGTAACGCG GTGAGTACATTAA

Find Gene Structures in DNA
GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
[Figure: the sequence annotated with intergene, exon, and intron regions, alongside an HMM with First Exon, Intron, and Intergene states]

Hidden Markov Model for Gene Finding
• Intron, Exon, Intergenic states
• Exon frame is encoded in the architecture by defining more states
• Exon states have explicit duration density
• Intron states have geometric duration
• Parameters are trained separately for different levels of GC content (which correlates with the number of genes and with the lengths of exons and introns)

Comparison-based Methods

Cross-species gene finding
[Figure: gene structure drawn 5’ to 3’ as Exon1, Intron1, Exon2, Intron2, Exon3, above a human/mouse alignment]
[human] GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA
[mouse] C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-

Comparison of 1196 orthologous genes (Makalowski et al., 1996)
• Sequence identity between genes in human/mouse
  • exons: 84.6%
  • protein: 85.4%
  • introns: 35%
  • 5’ UTRs: 67%
  • 3’ UTRs: 69%
• 27 proteins were 100% identical.

Human-mouse homology
[Figure: human sequence compared against mouse sequence]

Not always: HoxA human-mouse

Twinscan
• Twinscan is an augmented version of the Genscan HMM.
[Figure: HMM with states I and E, labeled with transitions, duration, and emissions, over the sequence ACUAUACAGACAUAUAUCAU]

Twinscan Algorithm
• Align the two sequences (e.g. from human and mouse)
• Mark each human base as gap ( - ), mismatch ( : ), or match ( | )
New “alphabet”: 4 x 3 = 12 letters
Σ = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }

Twinscan Algorithm
• Run Viterbi using emissions ek(b), where b ∈ { A-, A:, A|, …, U| }
Note: Emission distributions ek(b) are estimated from real genes from human/mouse
• eI(x|) < eE(x|): matches favored in exons
• eI(x-) > eE(x-): gaps (and mismatches) favored in introns
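To make the Viterbi step concrete, here is a minimal sketch of a two-state (exon E, intron I) decoder over the 12-letter alphabet. Everything numeric in it is an assumption made up for illustration: the `MARK_PROB` table, the self-transition probability `STAY`, the uniform 1/4 base probability, and the uniform start. The numbers are chosen only so that eE(x|) > eI(x|) and eE(x-) < eI(x-). It is a plain HMM, so it ignores the explicit exon durations and the larger state set of the real model.

```python
import math

STATES = ("E", "I")  # exon, intron

# Hypothetical emission probabilities for the alignment mark only.
# They are NOT trained values; they just respect eE(x|) > eI(x|)
# and eE(x-) < eI(x-) from the slide.
MARK_PROB = {
    "E": {"|": 0.80, ":": 0.15, "-": 0.05},
    "I": {"|": 0.40, ":": 0.30, "-": 0.30},
}
STAY = 0.9  # assumed probability of remaining in the same state


def log_emission(state, symbol):
    """log e_state(symbol); the base itself is assumed uniform (1/4)."""
    return math.log(0.25 * MARK_PROB[state][symbol[1]])


def log_transition(prev, cur):
    return math.log(STAY if prev == cur else 1.0 - STAY)


def viterbi(symbols):
    """Return the most probable state path as a string such as 'EEII'."""
    score = {s: math.log(0.5) + log_emission(s, symbols[0]) for s in STATES}
    backpointers = []
    for sym in symbols[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(
                STATES, key=lambda p: score[p] + log_transition(p, cur)
            )
            pointers[cur] = best_prev
            new_score[cur] = (
                score[best_prev]
                + log_transition(best_prev, cur)
                + log_emission(cur, sym)
            )
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))
```

With these assumed numbers, a run of matches with a few mismatches decodes as exon, while a run of gaps decodes as intron; different parameters would move that boundary.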

Example
Human: ACGGCGACGUGCACGU
Mouse: ACUGUGACGUGCACUU
Alignment: ||:|:|||||||||:|
Input to Twinscan HMM: A| C| G: G| C: G| A| C| G| U| G| C| A| C| G: U|
Recall, eE(A|) > eI(A|) and eE(A-) < eI(A-)
Likely exon
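The conversion from an alignment to the 12-letter input can be sketched as below. The function name `twinscan_symbols` is hypothetical, and the handling of columns where the human row has a gap (they are skipped, since only human bases are marked) is an assumption about how such columns would be treated.

```python
def twinscan_symbols(human_row, mouse_row):
    """Turn two aligned rows (gaps written as '-') into Twinscan symbols.

    Each human base is paired with '|' (match), ':' (mismatch),
    or '-' (aligned to a gap in the mouse row).
    """
    if len(human_row) != len(mouse_row):
        raise ValueError("aligned rows must have equal length")
    symbols = []
    for h, m in zip(human_row, mouse_row):
        if h == "-":
            continue  # assumption: no human base here, so nothing to mark
        if m == "-":
            mark = "-"
        elif h == m:
            mark = "|"
        else:
            mark = ":"
        symbols.append(h + mark)
    return symbols


print(" ".join(twinscan_symbols("ACGGCGACGUGCACGU", "ACUGUGACGUGCACUU")))
# A| C| G: G| C: G| A| C| G| U| G| C| A| C| G: U|
```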

HMMs for simultaneous alignment and gene finding: Generalized Pair HMMs

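A Pair HMM for alignments
[Figure: state diagram with states BEGIN, M, I, J, END and labeled transition probabilities]
• M emits an aligned pair with probability P(xi, yj)
• I emits xi alone with probability P(xi)
• J emits yj alone with probability P(yj)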

Generalized Pair HMMs

Exon GPHMM
1. Choose exon lengths (d, e).
2. Generate alignment of length d + e.
[Figure: exon pair with lengths d and e]

Cross-species gene finding
[Figure: [human] and [mouse] gene structures drawn 5’ to 3’ as Exon1, Intron1, Exon2, Intron2, Exon3, with three CNS regions marked]

The SLAM hidden Markov model

Computational complexity
[Figure: complexity formula with its factors labeled length seq1, length seq2, no. states, and max duration]

Approximate alignment
• Reduces the TU factor to hT

Measuring Performance

Example: HoxA2 and HoxA3
[Figure: annotation tracks for TBLASTX, SLAM, SLAM CNS, SGP-2, VISTA, Twinscan, RefSeq, and Genscan]

Suffix Trees (a short break from biology)

Suffix Trees
• Suffix trees are a method to find all maximal matches between two strings (and much more)
Example: x = dabdac
[Figure: suffix tree of x = dabdac, with leaves numbered 1 to 6]

Definition of a Suffix Tree
Definition: For string x = x1…xm, a suffix tree is:
• A rooted tree with m leaves; leaf i corresponds to the suffix xi…xm
• Each edge is labeled with a substring of x
• No two edges out of a node start with the same letter
It follows that every substring corresponds to an initial part of a path from the root to a leaf

Naïve Algorithm to Construct a Suffix Tree
• Initialize tree T: a single root node r
• Insert special symbol $ at end of x
• For j = 1 to m
  • Find longest match of xj…xm to T, starting from r
  • Split edge where match stops: new node w
  • Create edge (w, j), and label with unmatched portion of xj…xm

Example of Suffix Tree Construction
x = d a b d a $
1. Insert d a b d a $
2. Insert a b d a $
3. Insert b d a $
4. Insert d a $
5. Insert a $
6. Insert $
[Figure: the tree after each insertion; the final tree has leaves 1 to 6]
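The naïve construction can be sketched in a few lines. This is an illustrative quadratic-time version, not one of the linear-time algorithms: the `Node` class, the choice to store edge labels as explicit strings rather than (i, j) index pairs, and the function names are all assumptions made for readability. It also assumes x does not already contain the symbol $.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}  # first letter of edge label -> (label, child)
        self.leaf = leaf    # 1-based start of the suffix, for leaves only


def build_suffix_tree(x):
    """Insert the suffixes of x + '$' one by one, longest first."""
    text = x + "$"
    root = Node()
    for j in range(len(text)):
        node, rest = root, text[j:]
        while True:
            first = rest[0]
            if first not in node.children:
                # Match stops at a node: hang a new leaf edge here.
                node.children[first] = (rest, Node(leaf=j + 1))
                break
            label, child = node.children[first]
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                # Whole edge matched: continue from the child node.
                node, rest = child, rest[k:]
                continue
            # Match stops inside the edge: split it at a new node w.
            w = Node()
            w.children[label[k]] = (label[k:], child)
            w.children[rest[k]] = (rest[k:], Node(leaf=j + 1))
            node.children[first] = (label[:k], w)
            break
    return root


def leaves(node):
    """Collect the leaf numbers below a node."""
    if node.leaf is not None:
        return [node.leaf]
    found = []
    for _, child in node.children.values():
        found.extend(leaves(child))
    return found
```

For x = dabda, the root ends up with four outgoing edges (starting with d, a, b, and $), and the edge starting with d carries the label "da" because suffixes 1 and 4 share that prefix.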

Memory to Store Suffix Tree
• Can store in O( N ) memory!
• Every edge is labeled with (i, j): (i, j) denotes xi…xj
• Tree has O( N ) nodes
Proof:
• Every internal node has at least two children, so # internal nodes ≤ # leaves – 1
• # leaves = |x|

Faster Construction
• Several algorithms: O( N ) time, O( N ) memory with a big constant (~15 bytes/char)
• Technical but not deep, outside the scope of this course
• Optional: Gusfield, chapter 6

Application: find all matches between x, y
• Build suffix tree for x, mark nodes with x
• Insert y in suffix tree, mark all nodes y passes through with y
• The path label of every node marked both x and y is a common substring

Example of Suffix Tree construction
x = d a b d a $
y = a b a d a $
1. Construct tree for x
2. Insert a b a d a $
3. Insert b a d a $
4. Insert a d a $
5. Insert d a $
6. Insert a $
7. Insert $
[Figure: the tree after each step, with nodes marked x, y, or both]
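The marking idea can be tried out with a much simpler structure. The sketch below swaps the suffix tree for an uncompressed suffix trie (one node per character), which uses quadratic memory but keeps the code short; the dictionary layout and function names are assumptions for illustration only. Every node reached while inserting a suffix of x is marked "x", likewise for y, and any node carrying both marks spells a common substring.

```python
def new_node():
    return {"children": {}, "marks": set()}


def add_suffixes(trie, s, mark):
    """Insert every suffix of s, marking each node passed through."""
    for start in range(len(s)):
        node = trie
        for ch in s[start:]:
            node = node["children"].setdefault(ch, new_node())
            node["marks"].add(mark)


def common_substrings(x, y):
    """Path labels of all nodes marked with both x and y."""
    trie = new_node()
    add_suffixes(trie, x, "x")
    add_suffixes(trie, y, "y")
    found = []
    stack = [(trie, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node["children"].items():
            # A node missing one mark cannot have doubly marked descendants.
            if child["marks"] == {"x", "y"}:
                found.append(path + ch)
                stack.append((child, path + ch))
    return found


def longest_common_substring(x, y):
    return max(common_substrings(x, y), key=len, default="")
```

For x = dabda and y = abada, the doubly marked nodes spell a, b, d, ab, and da, so the longest common substrings have length 2.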

Application: common substrings of k strings
To find the longest common substring of s1, s2, …, sn:
• Build suffix tree for s1, …, sn
• All nodes labeled {si1, …, sik} represent a match between si1, …, sik

Suffix Arrays
• Fast binary search for any specific string: O(log n) suffix comparisons
• Used for data compression such as bzip2
• Can be built in O(n) time by first building the suffix tree and then reading off the suffixes in sorted order with a lexicographic depth-first traversal
• Too much memory: ~15n bytes
• Difficult to implement
• Theoretical build in O(n log n) using O(n / sqrt(log n)) extra memory
• How to build fast in practice is a hot topic
Example: ABRACADABRA$
11 $
10 A$
7 ABRA$
0 ABRACADABRA$
3 ACADABRA$
5 ADABRA$
8 BRA$
1 BRACADABRA$
4 CADABRA$
6 DABRA$
9 RA$
2 RACADABRA$
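A suffix array and its binary search can be sketched as follows. The construction here simply sorts the suffixes, which is far slower than the methods mentioned above and is used only because it is short; the function names are hypothetical. It relies on $ sorting before the letters, which holds for ASCII.

```python
def suffix_array(text):
    """Start positions of all suffixes of text, in sorted suffix order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """All start positions of pattern, found by two binary searches."""
    m = len(pattern)

    def prefix(rank):
        start = sa[rank]
        return text[start:start + m]

    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # First suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "ABRACADABRA$"
sa = suffix_array(text)
print(sa)                                 # [11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
print(find_occurrences(text, sa, "BRA"))  # [1, 8]
```

Each binary search makes O(log n) comparisons, and each comparison looks at up to |pattern| characters.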
