BIOINFORMATICS

BIOINFORMATICS • Gene Finding with a Hidden Markov Model of Genomic Structure and Evolution • Jakob Skou Pedersen and Jotun Hein • Deepak Verghese, CS 6890

A number of models have incorporated evolutionary information • GPHMM • Conserved Exon Method • The two-step GLASS and ROSETTA approach • TWINSCAN, which extends GENSCAN • etc.

However, these methods • Do not exploit all the information in the evolutionary pattern • Are not easily extended to multiple genome sequences

EVOLUTIONARY HIDDEN MARKOV MODEL (EHMM) • A probabilistic model of both genome structure and evolution • Composed of: • Hidden Markov Model (HMM) • Phylogenetic tree

ADVANTAGES • Can handle any number of sequences in an alignment • Can have the properties of higher-order HMMs • Can handle variability in the sequences along the alignment • State-of-the-art evolutionary models can be incorporated later • Evolutionary events between different genomes are not treated independently

MODEL SCOPE • Not to compete with the existing gene finding methods on performance, but to illustrate the power of this approach • Relies on a pre-produced alignment

MARKOV CHAINS • A set of states • The transitions from one state to all other states, including itself, are governed by a probability distribution • First-order Markov chain: the transition probabilities depend solely on the current state • n-th order Markov chain: the transition probabilities depend on the n previous states
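The first-order property can be illustrated with a minimal sketch. The two state names and all transition probabilities below are hypothetical values chosen for illustration; they are not taken from the paper.

```python
import random

# Hypothetical two-state first-order chain; the numbers are illustrative only.
TRANSITIONS = {
    "exon":   {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.2, "intron": 0.8},
}

def step(state, rng):
    """Draw the next state using only the current state (first-order property)."""
    r = rng.random()
    cumulative = 0.0
    nxt = state
    for nxt, p in TRANSITIONS[state].items():
        cumulative += p
        if r < cumulative:
            return nxt
    return nxt  # guard against floating-point round-off

def simulate(start, n_steps, seed=0):
    """Generate a state sequence of n_steps transitions from `start`."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(n_steps):
        path.append(step(path[-1], rng))
    return path

def path_probability(path):
    """Probability of the transitions along `path`, given its first state."""
    prob = 1.0
    for current, following in zip(path, path[1:]):
        prob *= TRANSITIONS[current][following]
    return prob

example_path = simulate("exon", n_steps=10)
```

An n-th order chain could be emulated with the same code by using tuples of the n previous states as the dictionary keys.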

HIDDEN MARKOV MODEL • 5 components • A set of states • A matrix of transition probabilities (A) • An alphabet of symbols (C) • A set of emission distributions (e) • An initial state distribution (B)

Example of a hidden Markov model
A C A - - - A T G
T C A A C T A T C
A C A C - - A G C
A G A - - - A T C
A C C G - - A T C
Why the name "hidden"? There is no 1:1 correspondence between states and symbols.

Components • State k • Emits symbols (observables) from C • PROBABILISTIC MODEL • Emission distribution e • Initial state distribution B • Transition probabilities A

Path Π • Different paths are possible for the same sequence

In the EHMM, the emission distribution e of state k is specified by • An evolutionary model E_k • A phylogenetic tree T
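To make the notion of a path concrete, the sketch below computes the joint probability of a symbol sequence and a state path for two different paths over the same sequence. The two states and every number in B, A and e are hypothetical, chosen only for illustration.

```python
# Hypothetical two-state HMM over nucleotides; all numbers are illustrative.
B = {"coding": 0.5, "noncoding": 0.5}                         # initial state distribution
A = {"coding":    {"coding": 0.8, "noncoding": 0.2},
     "noncoding": {"coding": 0.3, "noncoding": 0.7}}          # transition probabilities
e = {"coding":    {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
     "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}   # emission distributions

def joint_probability(symbols, path):
    """P(x, path): probability of emitting `symbols` while following `path`."""
    if len(symbols) != len(path):
        raise ValueError("symbols and path must have the same length")
    prob = B[path[0]] * e[path[0]][symbols[0]]
    for i in range(1, len(symbols)):
        prob *= A[path[i - 1]][path[i]] * e[path[i]][symbols[i]]
    return prob

x = "GCA"
p_all_coding = joint_probability(x, ["coding", "coding", "coding"])
p_switch = joint_probability(x, ["coding", "coding", "noncoding"])
```

Both paths explain the same sequence with nonzero probability, which is why the path cannot be read directly off the observed symbols.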

PHYLOGENETIC TREES

In phylogenetic trees • Leaves represent present-day species • Interior nodes represent hypothesized ancestors • Character states of inner nodes are missing data • The lengths of the branches of a tree represent the evolutionary difference • Motivation: the problem of explaining the evolutionary history of today's species

Evolution is often modeled by continuous-time Markov chains • Here evolution along the branches of the phylogenetic tree is modelled by E_k • Transition probability P_k(t) for a branch of length t: $P_k(t) = \exp(t Q_k)$, where Q_k is the rate matrix of E_k • Increasing the number of sequences increases the amount of evolutionary information • THE ALIGNMENT COLUMN CORRESPONDS TO THE STATE OF EVOLUTION AT THE LEAVES OF THE PHYLOGENETIC TREE
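As a rough illustration of the matrix exponential, the sketch below computes exp(tQ) for a Jukes-Cantor rate matrix and compares it with the known closed form. Jukes-Cantor is used here only as a simple stand-in; it is an assumption and not the codon or nucleotide model of the paper, and the helper names are hypothetical.

```python
import math

def mat_mul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(M, terms=20, squarings=10):
    """exp(M) via scaling and squaring with a truncated Taylor series."""
    n = len(M)
    scale = 2 ** squarings
    S = [[v / scale for v in row] for row in M]          # scaled-down matrix
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]                    # current Taylor term S^k / k!
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, S)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):                           # undo the scaling
        result = mat_mul(result, result)
    return result

def jukes_cantor_rate_matrix(alpha):
    """Rate matrix with equal rate alpha between any two distinct nucleotides."""
    return [[-3.0 * alpha if i == j else alpha for j in range(4)] for i in range(4)]

def transition_matrix(Q, t):
    """P(t) = exp(t Q) for branch length t."""
    return mat_exp([[t * q for q in row] for row in Q])

Q = jukes_cantor_rate_matrix(alpha=0.25)
P = transition_matrix(Q, t=0.5)
# Closed form for Jukes-Cantor: P_ii(t) = 1/4 + 3/4 * exp(-4 * alpha * t)
closed_form_diagonal = 0.25 + 0.75 * math.exp(-4 * 0.25 * 0.5)
```

Each row of P is a probability distribution over the nucleotide at the end of the branch, given the nucleotide at its start.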

THE PROBABILITY OF GENERATING AN ALIGNMENT COLUMN IN STATE k EQUALS THE PROBABILITY OF OBSERVING A GIVEN CHARACTER PATTERN ON THE LEAVES OF T WHEN GIVEN E_k • Figure: phylogenetic tree of the entries of the 3 alignment columns
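One standard way to compute the probability of a character pattern on the leaves of a tree is Felsenstein's pruning algorithm; the slides do not name the method, so treat the following as a sketch under that assumption. The three-species tree, its branch lengths and the Jukes-Cantor substitution model are all hypothetical stand-ins for T and E_k. A gap is handled as missing data, in line with how the slides describe the treatment of gaps.

```python
import math

NUCS = "ACGT"

def jc_prob(a, b, t, alpha=0.25):
    """Jukes-Cantor probability of nucleotide b after time t, starting from a."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def partial_likelihoods(node, column):
    """Map each nucleotide a to P(leaf pattern below node | node carries a)."""
    if isinstance(node, str):                 # leaf: observed character
        observed = column[node]
        # A gap is treated as missing data: every nucleotide is compatible.
        return {a: 1.0 if observed in ("-", a) else 0.0 for a in NUCS}
    result = {a: 1.0 for a in NUCS}
    for child, t in node:                     # internal node: list of (child, branch length)
        child_like = partial_likelihoods(child, column)
        for a in NUCS:
            result[a] *= sum(jc_prob(a, b, t) * child_like[b] for b in NUCS)
    return result

def column_probability(tree, column):
    """Probability of one alignment column, summing over unobserved inner nodes."""
    root = partial_likelihoods(tree, column)
    return sum(0.25 * root[a] for a in NUCS)  # uniform equilibrium under Jukes-Cantor

# Hypothetical tree ((human, mouse), chicken) with made-up branch lengths.
TREE = [([("human", 0.1), ("mouse", 0.15)], 0.2), ("chicken", 0.5)]

conserved = column_probability(TREE, {"human": "A", "mouse": "A", "chicken": "A"})
divergent = column_probability(TREE, {"human": "A", "mouse": "C", "chicken": "G"})
```

In an EHMM each state would supply its own substitution model in place of `jc_prob`, so the same column receives a different emission probability in each state.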

A codon-based evolutionary model is used to calculate the emission probability of the columns of A • A nucleotide-based evolutionary model is used to calculate the emission probability of column B • The emission probability of C is obtained from the equilibrium distribution of the relevant evolutionary model

Parameter Estimation • The parameters of the EHMM are estimated by a combination of • Baum-Welch • Powell • The evolutionary model E is divided into E_equ and E_evo

The initial state distribution B can be estimated by Baum-Welch, but it is generally set to 0.00001 for all states except the intergenic state • The expectation step of Baum-Welch estimates • The expected number of nucleotides emitted from each state • The expected number of state transitions • The expected number of times a state is used • Powell, another optimization method, estimates • E_evo • The phylogenetic tree T • The Baum-Welch method is used to estimate • E_equ • A
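The expected counts of the expectation step can be sketched with the forward-backward recursions on a plain single-sequence HMM. The two-state model and its numbers are hypothetical, and probabilities are left unscaled, which is only adequate for short sequences; a real implementation would rescale or work in log space.

```python
# Hypothetical two-state HMM; every number below is illustrative.
STATES = ("coding", "noncoding")
B = {"coding": 0.5, "noncoding": 0.5}
A = {"coding":    {"coding": 0.8, "noncoding": 0.2},
     "noncoding": {"coding": 0.3, "noncoding": 0.7}}
e = {"coding":    {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
     "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def forward(x):
    """f[i][k] = P(x[0..i], state at position i is k)."""
    f = [{k: B[k] * e[k][x[0]] for k in STATES}]
    for sym in x[1:]:
        prev = f[-1]
        f.append({k: e[k][sym] * sum(prev[j] * A[j][k] for j in STATES)
                  for k in STATES})
    return f

def backward(x):
    """b[i][k] = P(x[i+1..] | state at position i is k)."""
    b = [{k: 1.0 for k in STATES}]
    for sym in reversed(x[1:]):
        nxt = b[0]
        b.insert(0, {k: sum(A[k][j] * e[j][sym] * nxt[j] for j in STATES)
                     for k in STATES})
    return b

def expected_counts(x):
    """Expected state usage and expected transition counts given x."""
    f, b = forward(x), backward(x)
    likelihood = sum(f[-1][k] for k in STATES)
    usage = {k: sum(f[i][k] * b[i][k] for i in range(len(x))) / likelihood
             for k in STATES}
    transitions = {
        j: {k: sum(f[i][j] * A[j][k] * e[k][x[i + 1]] * b[i + 1][k]
                   for i in range(len(x) - 1)) / likelihood
            for k in STATES}
        for j in STATES
    }
    return usage, transitions

usage, transitions = expected_counts("GCGCATTA")
```

In the maximization step, normalizing each row of `transitions` would give the re-estimated transition matrix A.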

Therefore, the likelihood of an alignment x given a parameterization θ of the EHMM can be found by the equation $P(x \mid \theta) = \sum_{\pi} P(x, \pi \mid \theta)$ • Here we are summing over all possible paths π • This can be done in time linear in the alignment length by dynamic programming
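The sum over paths can be checked on a toy example by comparing explicit enumeration with the forward algorithm, the usual dynamic-programming recursion for this sum. The two-state HMM below is hypothetical and emits single nucleotides; in an EHMM the symbols would be alignment columns with emission probabilities supplied by the evolutionary model.

```python
from itertools import product

# Hypothetical two-state HMM; every number below is illustrative.
STATES = ("coding", "noncoding")
B = {"coding": 0.5, "noncoding": 0.5}
A = {"coding":    {"coding": 0.8, "noncoding": 0.2},
     "noncoding": {"coding": 0.3, "noncoding": 0.7}}
e = {"coding":    {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
     "noncoding": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def likelihood_brute_force(x):
    """Sum P(x, path) over every path: cost grows as len(STATES) ** len(x)."""
    total = 0.0
    for path in product(STATES, repeat=len(x)):
        prob = B[path[0]] * e[path[0]][x[0]]
        for i in range(1, len(x)):
            prob *= A[path[i - 1]][path[i]] * e[path[i]][x[i]]
        total += prob
    return total

def likelihood_forward(x):
    """Same sum by the forward recursion: cost grows linearly with len(x)."""
    f = {k: B[k] * e[k][x[0]] for k in STATES}
    for sym in x[1:]:
        f = {k: e[k][sym] * sum(f[j] * A[j][k] for j in STATES) for k in STATES}
    return sum(f.values())

x = "GCATG"
brute = likelihood_brute_force(x)
fast = likelihood_forward(x)
```

Both functions return the same value up to round-off; only the forward recursion remains feasible for alignments of realistic length.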

EUKARYOTIC GENOME MODEL • Can be used to generate alignments • The reduced model produces only inner exons • The EHMM is fully probabilistic and can be used both to simulate data and to find genes • Figure: eukaryotic EHMM

Results • The benefits of modeling evolution with an EHMM were examined using a data set of orthologous mouse/human gene pairs • The benefit will depend on the divergence between the sequences compared • A key parameter for modelling the difference between exons and introns is the dN/dS ratio

Moreover, the evolutionary model shows a distinct difference between the intergenic/intron state and the codon state

Evaluations were performed on both single and aligned sequences

Graphical Representation

The simple model used here is not comparable to state-of-the-art methods • Any number of aligned sequences can be handled

Extensions of the model • GENSCAN can be extended into an EHMM • Splice site finders • Models of ribosome binding sites and promoter regions • Non-geometric length distributions of exons • Pseudo higher-order EHMMs can be constructed • The idea of pair HMMs can be extended to multiple sequences

Disadvantages of the present model • The existing framework does not model gaps but treats them as missing data • The optimal data for an EHMM is a multiple alignment of full-length genomes • The challenge in constructing the alignment is to reduce the noise-to-signal ratio • BUT ………..
