2010-2011. Bioinformatics. Lecture 3 Finding Motifs. Dr. Aladdin Hamwieh Khalid Al-shamaa Abdulqader Jighly. Aleppo University Faculty of technical engineering Department of Biotechnology. Main Lines. Definition Motif types Motifs problem Motifs: Profiles and Consensus
2010-2011 Bioinformatics Lecture 3 Finding Motifs Dr. Aladdin Hamwieh Khalid Al-shamaa Abdulqader Jighly Aleppo University Faculty of technical engineering Department of Biotechnology
Main Lines • Definition • Motif types • Motifs problem • Motifs: Profiles and Consensus • Motif Logo • Motif Search in Local Database
Definition • A motif is a short conserved sequence pattern associated with distinct functions of a protein or DNA.
Motif Types • Regulatory sequences
Combinatorial Gene Regulation • A microarray experiment showed that when gene X is knocked out, 20 other genes are not expressed • How can one gene have such drastic effects?
Combinatorial Gene Regulation • Gene X encodes a regulatory protein, a.k.a. a transcription factor (TF) • The 20 unexpressed genes rely on gene X’s TF to induce transcription • A single TF may regulate multiple genes
Regulatory Regions • Every gene contains a regulatory region (RR), typically stretching 100-1000 bp upstream of the transcriptional start site • Located within the RR are the transcription factor binding sites (TFBS), also known as motifs, specific for a given transcription factor • TFs influence gene expression by binding to a specific location in the respective gene’s regulatory region: the TFBS
Transcription Factor Binding Sites • A TFBS can be located anywhere within the Regulatory Region. • TFBS may vary slightly across different regulatory regions since non-essential bases could mutate
Motifs and Transcriptional Start Sites • [Figure: five genes, each drawn with one motif instance. The instances shown are ATCCCG, TTCCGG, ATCCCG, ATGCCG and ATGCCC]
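As a rough illustration of how slightly varying binding sites can still be recognised, the sketch below scans a sequence for windows that lie within a few mismatches of a reference pattern. The names `hamming` and `find_sites`, the mismatch threshold and the short test region are assumptions made for this example. The five sites are the ones shown on the slide, and ATCCCG is used as the reference only because it carries the most common base at each position of those five sites.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def find_sites(region: str, pattern: str, max_mismatches: int) -> list[tuple[int, str]]:
    """Return (offset, window) for every window within max_mismatches of pattern."""
    region, pattern = region.upper(), pattern.upper()
    k = len(pattern)
    hits = []
    for i in range(len(region) - k + 1):
        window = region[i:i + k]
        if hamming(window, pattern) <= max_mismatches:
            hits.append((i, window))
    return hits


# The five sites from the slide, compared with the reference ATCCCG.
slide_sites = ["ATCCCG", "TTCCGG", "ATCCCG", "ATGCCG", "ATGCCC"]
distances = [hamming(site, "ATCCCG") for site in slide_sites]
print(distances)

# A hypothetical regulatory region (not from the slides) with one mutated site.
print(find_sites("ggATGCCGtt", "ATCCCG", max_mismatches=1))
```

Allowing more mismatches finds more true sites but also more chance matches, which is the trade-off the later slides turn into a search problem.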
Consensus considerations • Promoter layout: -35 hexamer (TTGACA), spacer of 15-19 bases, -10 hexamer (TATAAT), interval of 5-9 bases, transcription start site • A weight matrix contains more information • Based on ~450 known promoters

-35 hexamer
Position  1    2    3    4    5    6
A         0.1  0.1  0.1  0.5  0.2  0.5
T         0.7  0.7  0.2  0.2  0.2  0.2
G         0.1  0.1  0.5  0.1  0.1  0.2
C         0.1  0.1  0.2  0.2  0.5  0.1

-10 hexamer
Position  1    2    3    4    5    6
A         0.1  0.7  0.2  0.6  0.5  0.1
T         0.7  0.1  0.5  0.2  0.2  0.8
G         0.1  0.1  0.1  0.1  0.1  0.0
C         0.1  0.1  0.2  0.1  0.1  0.1
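To make the point about weight matrices concrete, here is a minimal sketch that scores candidate hexamers against the -35 matrix from the slide. The log-odds scoring against a uniform background of 0.25 per base is an assumption chosen for illustration (the slide does not specify a scoring scheme), and `consensus` and `log_odds` are hypothetical helper names. A matrix containing a zero frequency would need a pseudocount before taking logarithms; the -35 matrix has none.

```python
import math

# -35 hexamer frequencies as given on the slide (rows A, T, G, C).
MINUS_35 = {
    "A": [0.1, 0.1, 0.1, 0.5, 0.2, 0.5],
    "T": [0.7, 0.7, 0.2, 0.2, 0.2, 0.2],
    "G": [0.1, 0.1, 0.5, 0.1, 0.1, 0.2],
    "C": [0.1, 0.1, 0.2, 0.2, 0.5, 0.1],
}


def consensus(matrix: dict[str, list[float]]) -> str:
    """Pick the most frequent base in each column."""
    length = len(matrix["A"])
    return "".join(max(matrix, key=lambda b: matrix[b][i]) for i in range(length))


def log_odds(site: str, matrix: dict[str, list[float]], background: float = 0.25) -> float:
    """Sum of log2(frequency / background) over the positions of site."""
    if len(site) != len(matrix["A"]):
        raise ValueError("site length must match the matrix width")
    return sum(math.log2(matrix[base][i] / background)
               for i, base in enumerate(site.upper()))


print(consensus(MINUS_35))
for candidate in ["TTGACA", "TTGACT", "CCCCCC"]:
    print(candidate, round(log_odds(candidate, MINUS_35), 2))
```

Unlike a yes/no comparison with the consensus string, the matrix score degrades gradually, so a site with one weakly conserved position changed still ranks well above an unrelated hexamer.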
Example • GAL4 in Yeast • Activator of galactose-induced genes (convert galactose to glucose) • Protein structure determines motif • DNA-protein interactions require certain bases at specified locations • Motif reflects homodimer structure
Motif Types • Motifs in protein structure
Importance • Functional relationships between proteins cannot be distinguished through simple BLAST or FASTA database searches. • Proteins often perform multiple functions that cannot be fully described using a single annotation. • To resolve these issues, identification of the motifs and domains becomes very useful.
Random Sample atgaccgggatactgataccgtatttggcctaggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatactgggcataaggtacatgagtatccctgggatgacttttgggaacactatagtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgaccttgtaagtgttttccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatggcccacttagtccacttataggtcaatcatgttcttgtgaatggatttttaactgagggcatagaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtactgatggaaactttcaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttggtttcgaaaatgctctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatttcaacgtatgccgaaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttctgggtactgatagca
Implanting Motif AAAAAAAAGGGGGGG atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGatgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttataggtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa
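The implanting step can be sketched as a simple overwrite: comparing this slide with the random sample, the 15-base pattern replaces 15 background bases, so the sequence length stays the same. The function name `implant` is hypothetical, and the background string is just the first 41 bases of the random sample.

```python
def implant(background: str, pattern: str, position: int) -> str:
    """Overwrite len(pattern) bases of background, starting at position."""
    end = position + len(pattern)
    if position < 0 or end > len(background):
        raise ValueError("pattern does not fit at this position")
    return background[:position] + pattern + background[end:]


pattern = "AAAAAAAAGGGGGGG"  # 8 A's followed by 7 G's, 15 bases in total
background = "atgaccgggatactgataccgtatttggcctaggcgtacac"

# Offset 17 is where the first implanted copy sits in the slide's sequence.
implanted = implant(background, pattern, 17)
print(implanted)
```

Writing the pattern in upper case over a lower-case background makes the unmutated copies trivial to spot by eye, which is exactly why the next slides add mismatches.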
The Challenge • Hard to identify: relatively short sequences (as small as 6 bases); many positions not well conserved • Factors improving identification: usually localized within a certain proximity of a gene (search within 3 kb upstream); some positions highly conserved; other data can be used (microarray?)
Challenge Problem • Find a motif in a sample of 20 “random” sequences (e.g. 600 nt long) • Each sequence contains an implanted pattern of length 15 • Each implanted copy differs from the pattern by 4 mismatches, i.e. a (15,4) motif
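A test set of this kind can be generated along the following lines. This is a sketch under stated assumptions: uniform random background bases, exactly 4 substitutions at distinct positions in each copy, a uniformly chosen implant position, and a fixed seed for repeatability. The names `mutate` and `make_challenge` are hypothetical. Mutated bases are written in lower case inside an upper-case copy so they can be told apart from the unchanged ones.

```python
import random


def mutate(pattern: str, d: int, rng: random.Random) -> str:
    """Return a copy of pattern with exactly d distinct positions substituted."""
    chars = list(pattern.upper())
    for i in rng.sample(range(len(chars)), d):
        # Choose a base different from the original; write it in lower case.
        chars[i] = rng.choice([b for b in "acgt" if b != chars[i].lower()])
    return "".join(chars)


def make_challenge(pattern: str, t: int = 20, n: int = 600, d: int = 4, seed: int = 0):
    """Build t sequences of length n, each with one d-mismatch copy of pattern."""
    rng = random.Random(seed)
    sequences, starts = [], []
    for _ in range(t):
        background = "".join(rng.choice("acgt") for _ in range(n))
        start = rng.randrange(n - len(pattern) + 1)
        copy = mutate(pattern, d, rng)
        sequences.append(background[:start] + copy + background[start + len(pattern):])
        starts.append(start)
    return sequences, starts


pattern = "AAAAAAAAGGGGGGG"
sequences, starts = make_challenge(pattern)
first = sequences[0][starts[0]:starts[0] + len(pattern)]
print(starts[0], first)
```

Keeping the true start positions alongside the sequences makes it possible to check whether a motif finder recovered the implanted copies.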
Where is the Motif??? atgaccgggatactgatagaagaaaggttgggggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacaataaaacggcgggatgagtatccctgggatgacttaaaataatggagtggtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgcaaaaaaagggattgtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatataataaaggaagggcttataggtcaatcatgttcttgtgaatggatttaacaataagggctgggaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtataaacaaggagggccaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttaaaaaatagggagccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatactaaaaaggagcggaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttactaaaaaggagcgga
Why is Finding a (15,4) Motif Difficult?
atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccgacccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGatgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccgagctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggagatcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttataggtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaacggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcataacttgagttAAAAAAtAGGGaGccctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgtattggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatActAAAAAGGaGcGGaccgaaagggaagctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttActAAAAAGGaGcGGa

AgAAgAAAGGttGGG
..|..|||.|..|||
cAAtAAAAcGGcGGG
Motifs: Profiles and Consensus • Line up the patterns by their start indexes s = (s1, s2, …, st) • Construct matrix profile with frequencies of each nucleotide in columns • Consensus nucleotide in each position has the highest score in column

Alignment    a G g t a c T t
             C c A t a c g t
             a c g t T A g t
             a c g t C c A t
             C c g t a c g G
             _______________
Profile   A  3 0 1 0 3 1 1 0
          C  2 4 0 0 1 4 0 0
          G  0 1 4 0 0 0 3 1
          T  0 0 0 5 1 0 1 4
             _______________
Consensus    A C G T A C G T
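The profile and consensus construction can be sketched in a few lines, using the five aligned patterns from this slide as input. The function names `profile` and `consensus` are assumptions for this example, and ties between bases are broken in the order A, C, G, T, which is an arbitrary choice the slide does not specify.

```python
from collections import Counter


def profile(alignment: list[str]) -> dict[str, list[int]]:
    """Count each nucleotide in every column of equal-length aligned patterns."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all patterns must have the same length")
    columns = zip(*(row.upper() for row in alignment))
    counts = [Counter(column) for column in columns]
    return {base: [c[base] for c in counts] for base in "ACGT"}


def consensus(prof: dict[str, list[int]]) -> str:
    """Take the most frequent nucleotide in each column (ties: A, C, G, T order)."""
    length = len(prof["A"])
    return "".join(max("ACGT", key=lambda b: prof[b][i]) for i in range(length))


alignment = ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]
prof = profile(alignment)
for base in "ACGT":
    print(base, *prof[base])
print(consensus(prof))
```

Given start indexes s = (s1, s2, …, st), the alignment rows would be the windows cut from each sequence at those positions; here they are typed in directly.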