390 likes | 1.22k Views
Lesson 9. Motif Search and RNA Structure Prediction. Instructions for the final project Introduction to Bioinformatics 2017-18. Key dates 26.12 lists of suggested projects published *
E N D
Lesson 9 Motif Search and RNA Structure Prediction
Instructions for the final project Introduction to Bioinformatics 2017-18 Key dates 26.12 lists of suggested projects published * *You are highly encouraged to choose a project yourself or find a relevant project which can help in your research 16.1- Final date to choose a project 22.1– meetings on projects (individual pairs with supervisor) 30.1- Submission project overview (one page) 16.3 Poster submission 21.3 Poster presentation (12-2PM)
Working on your project-step by step Projects are conducted in pairs ! 1. Choose a topic (either from the list or your own idea) 2. Preparing for the first meetings (10% of the final grade) After you have chosen the topic you should start planning the project • Make sure you understand the problem and read the necessary biological background • As most projects require working on a specific data set you should search for the most suitable data for your project . At this stage you should download the data and explore it • Now that you have the information you should formulate your working plan and think of the relevant tools you would need to use • Prepare questions Being prepared for the meeting is the first important step for succeeding in the project!!!
Write a proposal (10% of the final grade) Following the meeting with your supervisor you will need to write a short one page proposal , describing your project The proposal should be one page and include -Title -Main question -Major Tools you are planning to use to answer the questions 4. Working on your proposal Following the submission of the proposal and the feedback from your supervisor you can start working on the project following your proposed plan. Your initial results should guide you towards your next steps. During the work you are highly encouraged to consult with your supervisor. Important! make sure to summarize the results and extract the relevant information needed to answer your question, it is recommended to save the raw data for your records , but don't present raw data in your poster
5. Summarize the project When you feel you explored all tools you can apply to answer your question you should summarize your results and get to conclusions. At this stage it is highly recommended to set a meeting with your supervisor . ! Remember NO is also an answer as long as you are sure it is NO. .
6. Summarizing final project in a poster Prepare in PPT poster size 90-120 cm Title of the project Names and affiliation of the students presenting The poster should include 5 sections : Background should include description of your question (can add figure) Goal and Research Plan: Describe the main objective and the research plan Results (main section) : Present your results in 3-4 figures, describe each figure (figure legends) and give a title to each result Conclusions : summarized in points the conclusions of your project References : List the references of paper/databases/tools used for your project Examples of posters will be presented in class
Motif SearchScenario 2 : Binding motif is unknown “Ab initio motif finding” Why is it hard???
ChIP-Seq Chromatin Immunoprecipitation followed by sequencing Finding regions in the genome to which a DNA binding protein (transcription factor) binds to.
CLIP-Seq Cross-linking immunoprecipitation Identifying the regions in the transcriptome to which an RNA binding protein (e.g. splicing factor) binds to.
Finding the p53 binding motif in a set of p53 target sequences which are ranked according to binding affinity Best Binders ChIP –seq Weak Binders
a word search approach to search • for enriched motif in a ranked list CTGTGA CTGTGA CTGTGA CTGTGA CTGTGA CTGTGC CTGTGA CTGTGA CTGTGA Candidate k-mers CTGTGC CTATGC http://drimust.technion.ac.il/ CTACGC CTGTGA ACTTGA CTGTAC ACGTGA ATGTGC ACGTGC ATGTGA Ranked sequences list
Calculating the probability for enriched motif* base on the hypergeometric distribution CTGTGA CTGTGA CTGTGA CTGTGA CTGTGA CTGTGA CTGTGA CTGTGA *can be used to calculate probability for any kind of enrichment in data Ranked sequences list The number of sequences containing the motif among the top sequences The number of sequences containing the motif The number of sequences at the top of the list The total number of input sequences
Calculating the hyper geometric (HG) score =1000 = 100 Pb ? = 200 b= The number of sequences containing the motif at the top of the list = 80 Pb is calculated using the fisher extract test (hyper geometric distribution) https://en.wikipedia.org/wiki/Fisher's_exact_test What is the probability to find this motif in the top of the list B= The number of sequences containing the motif N=The total number of input sequences N=The number of sequences at the top of the list
Fisher Exact test (hyper geometric distribution) Top Not top Seq+motif b B-b B Seq-no motif s S-s S n N-n N B! S! n!(N-n)! P= b! (B-b)! s! (S-s)!N!
In our case Top Not top Seq+motif 80 100 20 Seq-no motif 780 900 120 800 200 1000 100! 900! 200!800! P= = 9*10-44 80! 20! 120! 780!100! This is the p-value for the enrichment of CTGTGC suggesting that this can be the binding motif we are looking for Hypergeometric calculator http://stattrek.com/online-calculator/hypergeometric.aspx
protein RNA DNA According to the central dogma of molecular biology the main role of RNA is to transfer genetic information from DNA to protein
RNA has many other biological functions • Protein synthesis (ribosome) • Control of mRNA stability (UTR) • Control of splicing (snRNP) • Control of translation (microRNA) • Control of transcription (long non-coding RNA) The function of the RNA molecule depends on its folded structure
Ribosome Nobel prize 2009
RNA Structural levels Sequence GGGGGTGTAGCTCAGTGGTAGAGCGCGTGCTTAGCATGCACGAGGcCCCGGGTTCAATCCCCGGCACCTCCA Tertiary Structure Secondary Structure
3’ G A U C U U G A U C RNA Secondary Structure • RNA bases are G, C, A, U • The RNA molecule folds on itself. • The base pairing is as follows: G C A U G U hydrogen bond. Loop U U C G U A A U G C 5’ 3’ Stem 5’
Predicting RNA secondary Structure Most common approach: Search for a RNA structure with a Minimal Free Energy (MFE) U U C G U A A U G U G A U C U U G A U C C A U U G U G Low energy High energy
Free energy model Free energy of a structure is the sum of all interactions energies Free Energy(E) = E(CG)+E(CG)+….. The aim: to find the structure with the minimal free energy (MFE)
Why is MFE secondary structure prediction hard? • MFE structure can be found by calculating free energy of all possible structures • BUT the number of potential structures grows exponentially with the number of bases Solution :Dynamic programming (Zucker and Steigler)
Simplifying assumptions for RNA Structure Prediction • RNA folds into one minimum free-energy structure. • The energy of a particular base can be calculated independently • Neighbors do not influence the energy.
Sequence dependent free-energy Nearest Neighbor Model U U C G U A A U G C A UCGAC 3’ U U C G G C A U G C A UCGAC 3’ 5’ 5’ Free Energy of a base pair is influenced by the previous base pair
Sequence dependent free-energy values of the base pairs (nearest neighbor model) U U C G U A A U G C A UCGAC 3’ U U C G G C A U G C A UCGAC 3’ 5’ 5’ These energies are estimated experimentally from small synthetic RNAs. Example values: GC GC GC GC AU GC CG UA -2.3 -2.9 -3.4 -2.1
Improvements to the MFE approach • Positive energy - added for destabilizing regions such as bulges, loops, etc.
Free energy computation U U A A G C G C A G C U A A U C G A U A3’ A 5’ +5.9 4 nt loop -1.1 mismatch of hairpin -2.9 stacking +3.3 1nt bulge -2.9 stacking -1.8 stacking -0.9 stacking -1.8 stacking 5’ dangling -2.1 stacking -0.3 G= -4.6 KCAL/MOL -0.3
Improvements to the MFE approach • Positive energy - added for destabilizing regions such as bulges, loops, etc. • Looking for an ensemble of structures with low energy and generating a consensus structure WHY? RNA is dynamic and doesn’t always fold to the lowest energy structure
RNA fold prediction based on Multiple Alignment Information from multiple sequence alignment (MSA) can help to predict the probability of positions i,j to be base-paired. G C C U U C G G G C G A C U U C G G U C G G C U U C G G C C
Compensatory Substitutions Mutations that maintain the secondary structure can help predict the fold U U C G U A A U G C A UCGAC 3’ C G 5’
RNA secondary structure can be revealed by identification of compensatory mutations U C U G C G N N’ G C G C C U U C G G G C G A C U U C G G U C G G C U U C G G C C
Insight from Multiple Alignment Information from multiple sequence alignment (MSA) can help to predict the probability of positions i,j to be base-paired. • Conservation – no additional information • Consistent mutations (GC GU) – support stem • Inconsistent mutations – does not support stem. • Compensatory mutations – support stem.
v ?
Combining Sequence and Structural information into a new 8 letter alphabet 8 letter {A, a, G, g, C, c, U, u) 4 letter {A, G, C, U) A-single =A ; A-double=a G-single= G ; G-double=g C-single= C ; C-double=c U-single= U ; U-double=u