
Motif Search




  1. Motif Search

  2. What are Motifs? • Motif (dictionary): a recurrent thematic element, a common theme

  3. Find a common motif in the text

  4. Find a short common motif in the text

  5. Motifs in biological sequences • A sequence motif is a short common sequence (length 4-20) that is highly represented in the data

  6. Challenges in biological sequences • Motifs are usually not exact words

  7. How to represent non-exact motifs?
     • Consensus string: NTAHAWT
       May allow "degenerate" symbols in the string, e.g., N = A/C/G/T; W = A/T; H = not G; S = C/G; R = A/G; Y = T/C, etc.
     • Position Weight Matrix (PWM): probability of each base at each position

            1    2    3    4    5    6
       A   0.1  0.7  0.2  0.6  0.5  0.1
       T   0.7  0.1  0.5  0.2  0.2  0.8
       G   0.1  0.1  0.1  0.1  0.1  0.0
       C   0.1  0.1  0.2  0.1  0.1  0.1

  8. Motifs in biological sequences: what can we learn from these motifs? • Regulatory motifs in DNA (transcription factor binding sites) • Functional sites in proteins (e.g., phosphorylation sites)

  9. DNA Regulatory Motifs • Transcription factors (TFs) are regulatory proteins that bind to regulatory motifs near a gene and act as an on/off switch • TF binding motifs are usually 6-20 nucleotides long • They are located near the target gene, mostly upstream of the transcription start site [Figure: TF1 and TF2 bound to the TF1 motif and TF2 motif near the transcription start site of Gene X]

  10. Can we find TF targets using a bioinformatics approach?

  11. p53 is a transcription factor involved in most human cancers. We are interested in identifying the genes regulated by p53.

  12. Finding TF targets using a bioinformatics approach • Scenario 1: binding motif is known (easier case) • Scenario 2: binding motif is unknown (hard case)

  13. Scenario 1 : Binding motif is known • Given a motif (e.g., consensus string, or weight matrix), find the binding sites in an input sequence

  14. Given a consensus: for each position l in the input sequence, check whether the substring starting at position l matches the motif.
      Example: find the consensus motif NTAHAWT in the promoter of a gene
      >promoter of gene A
      ACGCGTATATTACGGGTACACCCTCCCAATTACTACTATAAATTCATACGGACTCAGACCTTAAAA…….
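
The position-by-position check described above can be sketched in Python. This is a minimal illustration, not the presenter's code: the function name `find_consensus` is made up here, the `IUPAC` table uses the standard degenerate nucleotide codes, and the promoter string is the visible part of the slide's example (the elided remainder is not included).

```python
# Standard IUPAC degenerate nucleotide codes: symbol -> bases it may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def find_consensus(sequence, consensus):
    """Return 0-based start positions where `consensus` matches `sequence`."""
    sequence = sequence.upper()
    width = len(consensus)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Every base in the window must be allowed by the symbol at that position.
        if all(base in IUPAC[symbol] for base, symbol in zip(window, consensus)):
            hits.append(start)
    return hits

# Visible part of the "promoter of gene A" example (hypothetical use of the slide data).
promoter = "ACGCGTATATTACGGGTACACCCTCCCAATTACTACTATAAATTCATACGGACTCAGACCTTAAAA"
for start in find_consensus(promoter, "NTAHAWT"):
    print(start, promoter[start:start + 7])
```

This scans only the given strand; a real search would usually also scan the reverse complement.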

  15. Given a Position Weight Matrix (PWM): starting from a set of aligned motifs
      Seq 1  AAAGCCC
      Seq 2  CTATCCA
      Seq 3  CTATCCC
      Seq 4  CTATCCC
      Seq 5  GTATCCC
      Seq 6  CTATCCC
      Seq 7  CTATCCC
      Seq 8  CTATCCC
      Seq 9  TTATCTG

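  16. Given a Position Weight Matrix (PWM) W, obtained by converting the counts of each base in each column into the probability of each base in each column:
      • $W_{\beta k}$ = probability of base $\beta$ in column $k$
      • Given a string $s = s_1 s_2 \dots s_l$ of length $l = 7$
      • $\Pr(s \mid W) = \prod_{k=1}^{l} W_{s_k k}$
      • Example: $\Pr(\mathrm{CTATCCG} \mid W) = 0.67 \times 0.89 \times 1 \times 0.89 \times 1 \times 0.89 \times 0.11 \approx 0.052$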
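
As a rough sketch of how the counts become probabilities and how Pr(s | W) is then evaluated, the snippet below builds the PWM from the nine aligned sequences of the previous slide. The function names `build_pwm` and `site_probability` are illustrative, and the scored string CTATCCG is simply a 7-letter example chosen to fit the 7-column matrix.

```python
from collections import Counter

# The nine aligned sites shown on the previous slide.
ALIGNED_SITES = [
    "AAAGCCC", "CTATCCA", "CTATCCC", "CTATCCC", "GTATCCC",
    "CTATCCC", "CTATCCC", "CTATCCC", "TTATCTG",
]

def build_pwm(sites, alphabet="ACGT"):
    """Column-wise base frequencies: one dict {base: probability} per position."""
    n = len(sites)
    pwm = []
    for column in zip(*sites):
        counts = Counter(column)
        pwm.append({base: counts[base] / n for base in alphabet})
    return pwm

def site_probability(s, pwm):
    """Pr(s | W): product over positions k of the probability of base s_k in column k."""
    if len(s) != len(pwm):
        raise ValueError("string and PWM must have the same length")
    prob = 1.0
    for base, column in zip(s, pwm):
        prob *= column[base]
    return prob

W = build_pwm(ALIGNED_SITES)
print({base: round(p, 2) for base, p in W[0].items()})  # first column of W
print(site_probability("CTATCCG", W))
```

Because no pseudocounts are added, any base never seen in a column gets probability 0 and zeroes the whole product; practical tools usually smooth the counts to avoid this.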

  17. Given a Position Weight Matrix (PWM) • Given a sequence S (e.g., 1000 base pairs long) • For each substring s of S, compute Pr(s|W) • If Pr(s|W) > some threshold, call that a binding site • In DNA sequences we need to search both strands: AGTTACACCA and its reverse complement TGGTGTAACT
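
The scan over both strands might look like the following sketch. The 3-column `toy_pwm`, the test sequence and the threshold of 0.5 are all made-up values for illustration; only the reverse-complement pair comes from the slide.

```python
def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def scan(sequence, pwm, threshold):
    """Return (position, strand, probability) for every window with Pr(s|W) > threshold.

    Positions are 0-based and always given in forward-strand coordinates.
    """
    width = len(pwm)
    hits = []
    for strand, seq in (("+", sequence), ("-", reverse_complement(sequence))):
        for start in range(len(seq) - width + 1):
            prob = 1.0
            for base, column in zip(seq[start:start + width], pwm):
                prob *= column.get(base, 0.0)
            if prob > threshold:
                # Map minus-strand starts back onto the forward strand.
                pos = start if strand == "+" else len(seq) - width - start
                hits.append((pos, strand, prob))
    return hits

# Hypothetical PWM that strongly prefers the word AAC.
toy_pwm = [
    {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
    {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
    {"A": 0.01, "C": 0.97, "G": 0.01, "T": 0.01},
]
print(reverse_complement("AGTTACACCA"))
print(scan("TTAACTTGTTTT", toy_pwm, threshold=0.5))
```

For long motifs the raw product becomes very small, so implementations commonly sum log-probabilities (often as log-odds against a background model) instead of multiplying.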

  18. Scenario 2 : Binding motif is unknown “Ab initio motif finding”

  19. Ab initio motif finding: Expectation Maximization • Local search algorithm: start from a random PWM • Move from one PWM to another so as to improve the score measuring how well the motif fits the sequences • Keep doing this until no more improvement is obtained: convergence to a local optimum

  20. Expectation Maximization • Let W be a PWM . Let S be the input sequence . • Imagine a process that randomly searches, picks different strings matching W and threads them together to a new PWM

  21. Expectation Maximization • Find W so as to maximize Pr(S|W) • The “Expectation-Maximization” (EM) algorithm iteratively finds a new motif W that improves Pr(S|W)

  22. Expectation Maximization 1. Start from a random motif (PWM) 2. Scan the sequence for good matches to the current motif; build a new PWM out of these matches and make it the new motif 3. Repeat step 2 until no more improvement is obtained
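
The loop above can be sketched as a small EM routine. This is a simplified illustration under stated assumptions, not the presenter's or any published tool's implementation: it assumes exactly one motif occurrence per input sequence and a uniform background (so the weight of each start position is proportional to the PWM probability of its window), it runs a fixed number of iterations instead of testing convergence, and the names `em_motif`, `consensus` and the toy sequences are hypothetical.

```python
import random

BASES = "ACGT"

def window_likelihood(window, pwm):
    """Probability of one window under the PWM (product over columns)."""
    prob = 1.0
    for base, column in zip(window, pwm):
        prob *= column[base]
    return prob

def em_motif(sequences, width, iterations=50, pseudocount=0.1, seed=0):
    """Fit a PWM of the given width, assuming one site per sequence."""
    rng = random.Random(seed)
    # Step 1: start from a random PWM; each column is normalised to sum to 1.
    pwm = []
    for _ in range(width):
        weights = [rng.random() + 0.1 for _ in BASES]
        total = sum(weights)
        pwm.append({b: w / total for b, w in zip(BASES, weights)})

    for _ in range(iterations):
        # Expected base counts per column, seeded with pseudocounts so that
        # no probability ever becomes exactly zero.
        counts = [{b: pseudocount for b in BASES} for _ in range(width)]
        for seq in sequences:
            n_windows = len(seq) - width + 1
            # E-step: weight each start position by how well its window
            # matches the current PWM.
            likes = [window_likelihood(seq[i:i + width], pwm) for i in range(n_windows)]
            total = sum(likes)
            for i, like in enumerate(likes):
                for k in range(width):
                    counts[k][seq[i + k]] += like / total
        # M-step: the weighted counts become the new PWM.
        pwm = [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]
    return pwm

def consensus(pwm):
    """Most probable base in each column."""
    return "".join(max(BASES, key=column.get) for column in pwm)

# Hypothetical toy input: short sequences that all contain the word TATAAT.
toy = ["GCGTATAATCG", "CCTATAATGGC", "TATAATCGCGC", "GGCGCTATAAT"]
print(consensus(em_motif(toy, width=6)))
```

Since this is a local search, the result depends on the random starting PWM; running it from several seeds and keeping the best-scoring motif is a common precaution.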

  23. The final PWM represents the motif that is most enriched in the data. The PWM can also be represented as a sequence logo: - A letter's height indicates the information it contains - The top letter at each position can be read to obtain the consensus sequence (motif)
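
To make "the information it contains" concrete, the sketch below uses the usual sequence-logo convention for DNA: a column's information content is 2 bits minus its Shannon entropy, and each letter's height is its probability times that value. The function names are illustrative, and the small-sample correction applied by some logo tools is left out.

```python
import math

def column_information(column):
    """Information content (bits) of one DNA PWM column: log2(4) minus entropy."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return math.log2(4) - entropy

def letter_heights(column):
    """Height of each letter in the logo: probability times column information."""
    info = column_information(column)
    return {base: p * info for base, p in column.items()}

conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}      # fully conserved position
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}    # uninformative position
print(column_information(conserved), column_information(uniform))
print(letter_heights({"A": 0.5, "C": 0.0, "G": 0.0, "T": 0.5}))
```

A fully conserved position reaches the 2-bit maximum, while a position where all four bases are equally likely contributes nothing to the logo.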

  24. Are common motifs the right thing to search for ?

  25. ?

  26. Solutions: - Search for motifs that are enriched in one set but not in a random set - Use experimental information to rank the sequences according to their binding affinity, and search for motifs enriched at the top of the list

  27. Searching for enriched motifs in a ranked list: hypergeometric (HG) distribution test
      [Figure: sequences 1-4 ranked by binding affinity]
      • k = number of motifs in the top of the list
      • m = number of sequences in the top of the list
      • n = total number of motifs found
      • N = total number of sequences
      The p-value P reflects the surprise of seeing the observed density of motif occurrences at the top of the list compared to the rest of the list.
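
One common way to compute such a p-value is the upper tail of the hypergeometric distribution, sketched below with the slide's symbols k, m, n and N. The sketch assumes that "motifs" are counted as sequences containing the motif (at most one per sequence); the function name `hg_pvalue` is illustrative.

```python
from math import comb

def hg_pvalue(k, m, n, N):
    """P(X >= k): chance of seeing at least k motif-containing sequences among
    the top m, when n of the N sequences contain the motif overall."""
    upper = min(m, n)
    favourable = sum(comb(n, i) * comb(N - n, m - i) for i in range(k, upper + 1))
    return favourable / comb(N, m)

# Hypothetical numbers: 10 sequences, 5 contain the motif, all 5 are in the top 5.
print(hg_pvalue(k=5, m=5, n=5, N=10))
```

A small value means the motif is concentrated at the top of the ranking far more than a random ordering would produce.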

  28. Searching for enriched motifs in a ranked list: choosing the best way to cut the list (minimal HG score)
      [Figure: sequences 1-4 ranked by binding affinity]
      • k = number of motifs in the top of the list
      • m = number of sequences in the top of the list
      • n = total number of motifs found
      • N = total number of sequences
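
Choosing the cut can be sketched by trying every cutoff m and keeping the one with the smallest hypergeometric tail probability. This is a naive illustration, assuming the input is a list of 0/1 flags (motif present or absent) ordered from highest to lowest binding affinity; the names `hg_tail` and `minimal_hg` are hypothetical.

```python
from math import comb

def hg_tail(k, m, n, N):
    """Hypergeometric upper tail P(X >= k) for the top m of N sequences."""
    return sum(comb(n, i) * comb(N - n, m - i) for i in range(k, min(m, n) + 1)) / comb(N, m)

def minimal_hg(has_motif):
    """Return (smallest HG score, cutoff m) over all ways to cut the ranked list."""
    N = len(has_motif)
    n = sum(has_motif)
    best_score, best_cut = 1.0, 0
    k = 0
    for m, flag in enumerate(has_motif, start=1):
        k += flag  # motif occurrences seen in the top m so far
        score = hg_tail(k, m, n, N)
        if score < best_score:
            best_score, best_cut = score, m
    return best_score, best_cut

# Hypothetical ranked list: the three best binders carry the motif.
print(minimal_hg([1, 1, 1, 0, 0, 0]))
```

Because the cutoff is chosen after looking at the data, the minimal score is not itself a p-value and needs a correction for the multiple cutoffs tested.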

  29. Finding the p53 binding motif in a set of p53 target sequences ranked according to binding affinity
      >affinity = 5.962
      ACAAAAGCGUGAACACUUCCACAUGAAAUUCGUUUUUUGUCCUUUUUUUUCUCUUCUUUUUCUCUCCUGUUUCU
      >affinity = 5.937
      AAUAAAAAUAGAUAUAAUAGAUGGCACCGCUCUUCACGCCCGAAAGUUGGACAUUUUAAAUUUUAAUUCUCAUGA
      >affinity = 5.763
      UCACACUUGAAUGUGCUGCACUUUACUAGAAGUUUCUUUUUCUUUUUUUAAAAAUAAAAAAAGAGGAGAAAAAUGC
      >affinity = 5.498
      GCUGGUGCAAGUUUCCGGUAAAAAUAAUGAUGUUCUAGUCAUUCAUAUAUACGAUACAAAAAUAACA
      ...
      http://drimust.technion.ac.il/

  30. Protein Motifs • Protein motifs are usually 6-20 amino acids long and can be represented as a consensus/profile, e.g. P[ED]XK[RW][RK]X[ED], or as a PWM
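
A consensus pattern of this kind maps almost directly onto a regular expression: bracketed groups already mean "any one of these residues", and X can be read as "any residue". The sketch below assumes that reading of X; the helper names and the test protein string are made up for illustration.

```python
import re

def pattern_to_regex(pattern):
    """Compile a consensus pattern such as P[ED]XK[RW][RK]X[ED]; X means any residue."""
    return re.compile(pattern.upper().replace("X", "."))

def find_motif(protein, pattern):
    """Return (0-based start, matched substring) for each non-overlapping match."""
    regex = pattern_to_regex(pattern)
    return [(m.start(), m.group()) for m in regex.finditer(protein.upper())]

# Hypothetical protein fragment containing one match.
print(find_motif("AAPEAKRRGDAA", "P[ED]XK[RW][RK]X[ED]"))
```

`finditer` reports non-overlapping matches only; overlapping occurrences would need a lookahead-based pattern.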

  31. Protein Domains • In addition to short protein motifs, proteins are characterized by domains. • Domains are long motifs (30-100 aa) and are considered the building blocks of proteins (evolutionary modules). [Figure: the zinc-finger domain]

  32. Some domains can be found in many proteins with different functions:

  33. ... while other domains are found only in proteins with a certain function. MBD = Methylated DNA Binding Domain

  34. Varieties of protein domains • Extending along the length of a protein • Occupying a subset of a protein sequence • Occurring one or more times

  35. Pfam • A database that contains a large collection of multiple sequence alignments of protein domains • Based on profile hidden Markov models (HMMs) • Unlike a PWM, an HMM is a model that considers dependencies between the different columns of the matrix (different residues) and is thus much more powerful. http://pfam.sanger.ac.uk/

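  36. A profile HMM (Hidden Markov Model) can accurately represent an MSA
      [Figure: profile HMM over alignment columns 16-19, with match states M16-M19, insert states I16-I19 (emitting any residue X) and delete states D16-D19; transitions are labelled with probabilities of 100% or 50%]
      Aligned columns 16-19:
        D R T R
        D R T S
        S - - S
        S P T R
        D R T R
        D P T S
        D - - S
        D - - S
        D - - S
        D - - R
      Match-state emission probabilities:
        M16: D 0.8, S 0.2
        M17: P 0.4, R 0.6
        M18: T 1.0
        M19: R 0.4, S 0.6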
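
The match-state emission probabilities on this slide can be recomputed from the four alignment columns, as in the sketch below. It covers only that one ingredient of a profile HMM: the function names are illustrative, no pseudocounts are used, and the transition probabilities between match, insert and delete states are not estimated.

```python
from collections import Counter

# Alignment columns 16-19 from the slide, one string per sequence ("-" is a gap).
ALIGNMENT = ["DRTR", "DRTS", "S--S", "SPTR", "DRTR",
             "DPTS", "D--S", "D--S", "D--S", "D--R"]

def match_emissions(alignment, gap="-"):
    """Residue frequencies per column, ignoring gaps (match-state emissions)."""
    emissions = []
    for column in zip(*alignment):
        residues = [aa for aa in column if aa != gap]
        counts = Counter(residues)
        emissions.append({aa: c / len(residues) for aa, c in counts.items()})
    return emissions

def gap_fraction(alignment, gap="-"):
    """Fraction of sequences with a gap in each column (use of the delete state)."""
    return [sum(aa == gap for aa in column) / len(alignment) for column in zip(*alignment)]

for number, probs in zip(range(16, 20), match_emissions(ALIGNMENT)):
    print(f"M{number}", probs)
print(gap_fraction(ALIGNMENT))
```

A column made up entirely of gaps would cause a division by zero here; real profile-HMM builders treat such columns as insert columns and add pseudocounts.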
