Bioinformatics

Bioinformatics Motif Detection Revised 27/10/06

Overview
• Introduction Multiple Alignments
• Multiple alignment based on HMM
• Motif Finding
  • Motif representation
  • Algorithm
  • Search Space
  • Word counting methods
  • Probabilistic methods
• Profile Searches
  • Introduction
• Exercises
http://www.esat.kuleuven.ac.be/~kmarchal/

Introduction
• Global multiple alignment (ClustalW)
  • Proteins, nucleotides
  • Long stretches of conservation are essential
  • Identification of protein family profiles
  • Gaps are scored
• Local multiple alignment (motif detection)
  • Proteins, nucleotides
  • Short stretches of conservation (12 nt, 6 aa)
  • Identification of regulatory motifs (DNA, protein)
  • No explicit gap scoring
  • Explicit use of a profile

HMM

Transcriptional regulation
[Figure: transcriptional regulation. Labels: signal, cell, chromosome, sigma, motif, Gene 1 to Gene 4, gene, transcription, mRNA, translation, protein]

Transcriptional regulation

Motif Representation
• Consensus sequence
  • Reductionist representation of a motif
  • The most frequent instance is used as the representative
  • Loss of information
• Regular expression
  • More complex representation that allows motif degeneracy
• Position-specific scoring matrix (PSSM)
  • Probabilistic representation

Motif Representation
• Consensus: CTTAATATTAACTTAAT
• Regular expression: CTTAAKRTTMAYTTAAT
• PSSM: [Figure: motif logo]
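The degenerate pattern on this slide uses IUPAC ambiguity codes (K = G/T, R = A/G, M = A/C, Y = C/T). A minimal sketch of how such a pattern could be turned into a regular expression and scanned against a sequence is given here. The function names `iupac_to_regex` and `find_sites` and the test sequence are hypothetical and chosen only for illustration; only one strand is scanned.

```python
import re

# Standard IUPAC nucleotide ambiguity codes, written as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "K": "[GT]", "M": "[AC]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(motif):
    """Translate a degenerate IUPAC motif into a compiled regular expression."""
    return re.compile("".join(IUPAC[base] for base in motif.upper()))

def find_sites(motif, sequence):
    """Return (start, matched_word) for each non-overlapping match on the given strand."""
    return [(m.start(), m.group()) for m in iupac_to_regex(motif).finditer(sequence)]

degenerate = "CTTAAKRTTMAYTTAAT"   # regular expression from the slide
consensus = "CTTAATATTAACTTAAT"    # consensus from the slide
pattern = iupac_to_regex(degenerate)

# Hypothetical sequence: the consensus flanked by two bases on each side.
sites = find_sites(degenerate, "GG" + consensus + "CC")
```

The consensus is one of the 16 words accepted by the degenerate pattern (four positions with two alternatives each), which illustrates the loss of information when only the consensus is kept.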

Motif Representation

Overview Algorithms
Search for motifs that are present more frequently in a set of sequences than in a set of unrelated sequences
• Methods based on word counting (regular expression)
  • NP-hard problems: heuristic methods and clever algorithms are needed
  • Motif of width w = 8: 4^8 = 65,536 possible exact DNA words
  • Jensen & Knudsen, 2000; van Helden, 2000; Vanet, 2000
• Probabilistic methods (weight matrix)
  • Multiple alignment by locally aligning small conserved regions in a set of unaligned sequences
  • Motif model represented by a probability matrix
  • EM, Gibbs sampler (optimization algorithms)
  • AlignACE: http://atlas.med.harvard.edu/
  • BioProspector: http://bioprospector.stanford.edu/
  • Motif Sampler: http://www.esat.kuleuven.ac.be/~dna/BioI/Software.html

Search space
• When are motifs statistically overrepresented?
• Set of coexpressed (coregulated) sequences
  • Literature searches
  • Microarrays, expression profiling
• Set of orthologous sequences (phylogenetic footprinting)
  • Comparative genomics
  • Orthologous sequences share an ancestral origin, which suggests a similar mechanism of transcriptional regulation

Search space: motif finding from coexpression
[Figure: workflow. Labels: cDNA arrays, preprocessing of the data, clustering, upstream regions, Gibbs sampling, EMBL, BLAST]

Phylogenetic footprinting Search space
• PhoPQ is a ubiquitous system
  • Salmonella
  • Escherichia
  • Yersinia
  • Vibrio
  • Pseudomonas
  • Providencia
  • Pectobacterium
• PhoPQ is autoregulated

Search space

Overview Algorithms
• Methods based on word counting
  • NP-hard problems: heuristic methods and clever algorithms are needed
  • Motif of width w = 8: 4^8 = 65,536 possible exact DNA words
  • Jensen & Knudsen, 2000; van Helden, 2000; Vanet, 2000
• Probabilistic methods
  • Optimisation problems, self learning, AI
  • Motif model represented by a probability matrix
  • Bayesian, Gibbs sampler
  • AlignACE: http://atlas.med.harvard.edu/
  • BioProspector: http://bioprospector.stanford.edu/
  • Motif Sampler: http://www.esat.kuleuven.ac.be/~dna/BioI/Software.html

Word Counting
Monad frequencies, i.e. single word counts (RSA tools; van Helden et al., 1998, J. Mol. Biol.)
• Enumerate all oligonucleotides
• Count the number of occurrences of every oligonucleotide of the selected size in a set of coregulated genes
• Compare the number of occurrences with the expected value in the background
http://bio.cigb.edu.cu/jvanheld/rsa-tools/RSA_home.shtml
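The three word-counting steps can be sketched in a few lines. This is a minimal illustration, not the RSA tools implementation: it assumes a background of independent single-nucleotide frequencies, counts overlapping words on one strand only, and uses hypothetical toy sequences and function names.

```python
from collections import Counter
from itertools import product

def count_words(sequences, k):
    """Count every overlapping word of length k in a list of sequences (one strand)."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def expected_counts(k, base_freq, n_positions):
    """Expected count of each word, assuming independent single-nucleotide frequencies."""
    expected = {}
    for letters in product("ACGT", repeat=k):
        p = 1.0
        for base in letters:
            p *= base_freq[base]
        expected["".join(letters)] = p * n_positions
    return expected

def over_represented(sequences, k, base_freq):
    """Rank observed words by their observed/expected ratio, highest first."""
    observed = count_words(sequences, k)
    n_positions = sum(max(len(s) - k + 1, 0) for s in sequences)
    expected = expected_counts(k, base_freq, n_positions)
    ratios = {word: observed[word] / expected[word] for word in observed}
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy "upstream regions" sharing the word GTGAC.
upstream = ["ACGTGACC", "TTGTGACA", "GGGTGACT"]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
ranking = over_represented(upstream, 4, uniform)
```

A plain ratio is only a first indication: rare words reach a high ratio with very few occurrences, which is why a significance measure is needed.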

Word Counting
Relevance of the motifs detected: p-value and sig score (string-based methods)
• Expected number of occurrences in the background
• Statistical significance
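As an illustration of how a p-value and a sig score could be computed for one word, the sketch below uses a binomial model for the number of occurrences. It assumes that the sig score is defined as -log10 of the E-value, with the E-value taken as the p-value multiplied by the number of words tested; this definition, the function names and the toy numbers are assumptions, not taken from the slides.

```python
import math

def binomial_tail(n_obs, n_trials, p):
    """P(X >= n_obs) for X ~ Binomial(n_trials, p), summed over the upper tail."""
    return sum(
        math.comb(n_trials, i) * p**i * (1 - p) ** (n_trials - i)
        for i in range(n_obs, n_trials + 1)
    )

def significance(n_obs, n_trials, p_word, n_words):
    """Return (p_value, e_value, sig) for a word seen n_obs times.

    n_trials: number of positions scanned
    p_word:   probability of the word at one position under the background
    n_words:  number of distinct words tested (multiple-testing correction)
    """
    p_value = binomial_tail(n_obs, n_trials, p_word)
    e_value = p_value * n_words
    sig = -math.log10(e_value)  # assumed definition of the sig score
    return p_value, e_value, sig

# Hypothetical numbers: a 4-letter word seen 3 times in 15 positions,
# uniform background (p = 1/256), all 256 words tested.
p_value, e_value, sig = significance(3, 15, 1 / 256, 256)
```

Under this definition a sig score above 0 means that fewer than one word with such a count is expected by chance among all words tested.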

Probabilistic Algorithms
Find common motifs, which represent regulatory elements, in the regions upstream of the translation start in a set of co-expressed DNA sequences
• Motifs are hidden in the background sequence

Probabilistic Algorithms
• Motif representation: probability matrix (PSSM)
• Background model
  • Single-nucleotide frequencies
  • Described by an mth-order Markov process, which can be represented by a transition matrix

Probabilistic Algorithms
• Step 1: Initialization of the alignment vector A (predictive update)
• Step 2: Calculate the motif model from all sequences except one
[Figure: alignment vector, with axes labelled j and i = 1 … n; example letters: G A A T T C A T G T C A C T T C A T T G]
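Step 2 can be sketched as follows: the motif model is a matrix of column-wise base probabilities, estimated from the sites currently selected in every sequence except the one left out. The pseudocount, the function name `motif_model` and the toy sequences are assumptions added for illustration.

```python
def motif_model(sequences, positions, width, leave_out, pseudocount=1.0):
    """Column-wise base probabilities from the current sites of all sequences but one.

    positions[i] is the current start of the site in sequences[i]
    (the alignment vector); sequences[leave_out] is skipped.
    """
    alphabet = "ACGT"
    # Start every cell at the pseudocount so that no probability is zero.
    counts = [{base: pseudocount for base in alphabet} for _ in range(width)]
    for i, (seq, start) in enumerate(zip(sequences, positions)):
        if i == leave_out:
            continue
        for col, base in enumerate(seq[start:start + width]):
            counts[col][base] += 1
    model = []
    for column in counts:
        total = sum(column.values())
        model.append({base: column[base] / total for base in alphabet})
    return model

# Hypothetical toy data: three sequences contain the site GAATT, the fourth is left out.
sequences = ["TTGAATT", "GAATTCC", "CGAATTA", "AAAAAAA"]
positions = [2, 0, 1, 0]
model = motif_model(sequences, positions, width=5, leave_out=3)
```

With three G's in the first column and a pseudocount of 1 per base, the estimate for G in that column is (3 + 1) / (3 + 4) = 4/7.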

Probabilistic Algorithms
• Step 3 (expectation):
  • Select the remaining sequence
  • For each window (site), calculate the probability that the sequence in the window is generated by the motif model versus the probability that it is generated by the background model
  • Assign a weight to the site based on this score
Example sequence: GAATTATCGTGAATGCGTGGT
P(S|M) = 0.0098 × 0.0097 × 0.495 × 0.0098 × 0.245
P(S|B) = …
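The expectation step can be illustrated with a short sketch: each window of the left-out sequence is scored by the ratio P(S|M) / P(S|B), and the ratios are normalised to weights. The sketch assumes a background of independent single-nucleotide frequencies; the three-column motif model, the sequence and the function names are hypothetical.

```python
def window_scores(sequence, model, base_freq):
    """For each window, return P(window | motif model) / P(window | background)."""
    width = len(model)
    ratios = []
    for start in range(len(sequence) - width + 1):
        p_motif = 1.0
        p_background = 1.0
        for col, base in enumerate(sequence[start:start + width]):
            p_motif *= model[col][base]       # product over motif columns
            p_background *= base_freq[base]   # independent background bases
        ratios.append(p_motif / p_background)
    return ratios

def site_weights(ratios):
    """Normalise the likelihood ratios so that the weights sum to 1."""
    total = sum(ratios)
    return [r / total for r in ratios]

# Hypothetical motif model that favours the word ACG.
model = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
ratios = window_scores("TTACGTT", model, uniform)
weights = site_weights(ratios)
best_start = max(range(len(weights)), key=weights.__getitem__)
```

In this toy case the window starting at index 2 (ACG) receives by far the largest weight, since 0.7^3 / 0.25^3 is about 22 while the other windows score below 1.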

Probabilistic Algorithms
• Step 4 (maximization):
  • Re-estimate new positions based on the weights calculated in step 3
  • Go to step 1
• Re-iterate until stable motifs are found

Probabilistic Algorithms
• Local optima: EM
  • Update of the alignment vector: select the positions with the highest score
  • Deterministic output, but may converge to a local optimum
• Towards the global optimum: Gibbs sampling
  • Select positions according to a probability distribution
  • Stochastic output, i.e. the result differs each time the algorithm runs
  • Makes it possible to detect stable motifs
  • Statistical analysis describes the quality of the motif detected
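The difference between the two update rules can be shown in one small sketch. With `stochastic=True` the new position is drawn in proportion to its weight (Gibbs sampling); with `stochastic=False` the highest-scoring position is always taken, which is a simplified, greedy stand-in for the deterministic update described on the slide rather than a full EM implementation. The uniform background, pseudocount, function names and toy sequences are assumptions, and this is not the Motif Sampler code.

```python
import random

ALPHABET = "ACGT"

def build_model(sequences, positions, width, leave_out, pseudocount=0.5):
    """Motif model from the current sites of all sequences except one."""
    counts = [{b: pseudocount for b in ALPHABET} for _ in range(width)]
    for i, (seq, start) in enumerate(zip(sequences, positions)):
        if i != leave_out:
            for col, base in enumerate(seq[start:start + width]):
                counts[col][base] += 1
    return [{b: c[b] / sum(c.values()) for b in ALPHABET} for c in counts]

def window_ratios(seq, model, background):
    """P(window | motif) / P(window | background) for every window of seq."""
    width = len(model)
    ratios = []
    for start in range(len(seq) - width + 1):
        ratio = 1.0
        for col, base in enumerate(seq[start:start + width]):
            ratio *= model[col][base] / background[base]
        ratios.append(ratio)
    return ratios

def find_motif(sequences, width, background, iterations=100, stochastic=True, seed=0):
    """Return one site start per sequence after iterative updating."""
    rng = random.Random(seed)
    # Random initialisation of the alignment vector.
    positions = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(iterations):
        for i, seq in enumerate(sequences):
            model = build_model(sequences, positions, width, leave_out=i)
            ratios = window_ratios(seq, model, background)
            if stochastic:
                # Gibbs sampling: draw the position according to its weight.
                positions[i] = rng.choices(range(len(ratios)), weights=ratios)[0]
            else:
                # Greedy update: always take the highest-scoring position.
                positions[i] = max(range(len(ratios)), key=ratios.__getitem__)
    return positions

# Hypothetical toy sequences, each containing the word TTGACA.
toy = ["ACGTTGACAGT", "TTGACATCCGA", "GGCATTGACAA", "CATGTTGACAC"]
uniform = {b: 0.25 for b in ALPHABET}
gibbs_positions = find_motif(toy, 6, uniform, stochastic=True, seed=1)
greedy_positions = find_motif(toy, 6, uniform, stochastic=False, seed=1)
```

Running the stochastic variant with several seeds and comparing the reported sites is one simple way to judge whether a motif is stable.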

Probabilistic Algorithms
• Influence of the background model, e.g. for a second-order model: p(ATCGT|Bm) = p(AT) p(C|AT) p(G|TC) p(T|CG)
• Compensates for motifs that occur frequently because of the general background composition
• Makes the outcome of the algorithm more robust
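The factorisation p(ATCGT|Bm) = p(AT) p(C|AT) p(G|TC) p(T|CG) corresponds to a second-order Markov background. A minimal sketch of estimating such a model from background sequences and evaluating a word is given below; the pseudocount, the function names and the toy background sequences are assumptions for illustration.

```python
from collections import Counter

def train_background(sequences, order):
    """Count contexts of length `order` and (context, next base) transitions."""
    start_counts = Counter()
    transition_counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - order + 1):
            start_counts[seq[i:i + order]] += 1
        for i in range(len(seq) - order):
            transition_counts[(seq[i:i + order], seq[i + order])] += 1
    return start_counts, transition_counts

def sequence_probability(word, order, start_counts, transition_counts, pseudocount=1.0):
    """p(word | background): p(first `order` bases) times the transition probabilities."""
    n_contexts = 4 ** order
    total_starts = sum(start_counts.values()) + pseudocount * n_contexts
    prob = (start_counts[word[:order]] + pseudocount) / total_starts
    for i in range(order, len(word)):
        context, base = word[i - order:i], word[i]
        context_total = sum(transition_counts[(context, b)] for b in "ACGT") + 4 * pseudocount
        prob *= (transition_counts[(context, base)] + pseudocount) / context_total
    return prob

# Hypothetical background sequences (in practice: intergenic regions of the organism).
background = ["ATCGTATCGATCGTTAGC", "GGATCGTACGATATCGTA"]
starts, transitions = train_background(background, order=2)

# p(ATCGT|Bm) = p(AT) p(C|AT) p(G|TC) p(T|CG)
p_word = sequence_probability("ATCGT", 2, starts, transitions)
```

A word that is frequent in the background receives a high p(S|B), so its motif-versus-background ratio drops, which is how the background model compensates for general sequence composition.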

Probabilistic Algorithms
[Figure panels: two organisms with a similar background model; two organisms with a different background model]

Probabilistic Algorithms
Motif scores for probabilistic motif finding algorithms
• Information content (consensus score)
• Entropy
• Relative entropy (information content)
• Log likelihood
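The scores listed above can be written down compactly. The sketch below uses common textbook forms (Shannon entropy per column, relative entropy against the background, and a log-likelihood ratio summed over sites); the exact definitions used by a given tool may differ, and the function names and toy model are hypothetical.

```python
import math

def column_entropy(column):
    """Shannon entropy (in bits) of one motif column: 0 = perfectly conserved."""
    return -sum(p * math.log2(p) for p in column.values() if p > 0)

def relative_entropy(model, background):
    """Sum over columns of sum_b p_b * log2(p_b / q_b), with q the background."""
    return sum(
        p * math.log2(p / background[base])
        for column in model
        for base, p in column.items()
        if p > 0
    )

def log_likelihood(sites, model, background):
    """Sum over sites of ln( P(site | motif) / P(site | background) )."""
    total = 0.0
    for site in sites:
        for col, base in enumerate(site):
            total += math.log(model[col][base] / background[base])
    return total

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
conserved = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}]  # one fully conserved column

entropy_bits = column_entropy(conserved[0])
info_bits = relative_entropy(conserved, uniform)
ll_one_site = log_likelihood(["A"], conserved, uniform)
ll_two_sites = log_likelihood(["A", "A"], conserved, uniform)
```

Entropy depends only on the column itself, relative entropy also depends on the background, and the log likelihood grows with the number of sites, so it combines conservation with the number of occurrences.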

Probabilistic Algorithms
Result: bacterial O2-responsive element FNR
• Takes only the degree of conservation into account
• Takes the background model into account
• Trade-off between the degree of conservation and the number of occurrences

Profile Search

Profile Search
[Figure: workflow]
• Inputs: genomics (genomic sequence data), experimental (high-throughput measurements), literature
• 1. Microarray datamining
  • Preprocessing
  • Clustering
• 2. Sequence datamining
  • Motif detection
• 3. Comparative genomics
  • Genomewide screening
  • Phylogenetic footprinting
• Other labels: clusters of coexpressed genes, summarized information, target identification, novel targets, novel conditions
