
The ENCODE Project + After Party !!!!!

The ENCODE Project + After Party !!!!!. Alon “Alonzo” Sade & Harel “Hipoptam” Shein Advisor: Prof. Michal Linial (AKA “M”) 29.7.08. Outline. Estimating amount of functional sequences in the genome The ENCODE pilot project Research on ncRNAs Research on Alt.Splicing Fun in the sun….


Presentation Transcript


  1. The ENCODE Project + After Party !!!!! Alon “Alonzo” Sade & Harel “Hipoptam” Shein Advisor: Prof. Michal Linial (AKA “M”) 29.7.08

  2. Outline • Estimating amount of functional sequences in the genome • The ENCODE pilot project • Research on ncRNAs • Research on Alt.Splicing • Fun in the sun…

  3. The Challenge • Understanding how our genome encodes information • How that information underpins differences between individuals and species

  4. The Human Genome • Currently estimated number of protein-coding genes: • Human: ∼20,000-25,000 • Sea urchin: ∼23,000 • Nematode worm: ∼19,000 • Tetrahymena thermophila: ∼27,000 (because we have no yeast) • We are complex, so where is the information? • Protein-coding sequences account for <1.5% of the human genome • What is the function of the remainder?

  5. Where is the information ?!? • Alternative splicing ? • Non-protein-coding sequences contain large amounts of regulatory information ? • Recent discoveries say that the vast majority of the mammalian genome is transcribed • We’ll get back to that… Let’s define some things first

  6. #DEFINE ncRNA • Non-coding RNA • An RNA that is not translated into a protein • Many members in this family • It was assumed that leftover RNA was “junk” • 2001 – Mattick claims: “more than 97% of RNA is ncRNA!”

  7. #DEFINE ARs • Ancient Repeats • A CNE that was inserted into the early mammalian lineage • Primarily transposon-derived • Have since become dormant • Most are thought to be neutrally evolving

  8. #DEFINE functional • Required for replication & structural integrity of the chromosome • Encode functional products • Required for regulation or processing • Includes sequences that may act as spacers

  9. Transcription in the genome? • At least 70% of the mammalian genome is transcribed • Many of these transcripts are ncRNAs that show cell-specific or developmental regulation • Functionality? • Noise, or by-products available for later evolution • But all of this may also indicate functionality!

  10. ncRNA in the genome • Recent evidence implicates ncRNAs in the control of: • Chromatin structure • Epigenetic memory • Transcription • Translation • Splicing (possibly) • Most are evolving quickly but can retain highly conserved regions

  11. Let’s check out evolution… • ∼5% of small segments in mouse & human are under selection (may range between 3% and 8%) • Doesn’t include sequences that have diverged for reasons other than evolution • At the time, only ∼1.2% was thought to be protein-coding

  12. Conservation • Conservation is relative • Taken to be the substitution rate, measured under the assumption “functional evolving ⇔ neutral rate” • Requires an estimate of the “neutral rate of evolution” • The estimate comes from classes expected to be evolving free of constraint • Yes, everything is relative (5% · conservation · neutral evolution)

  13. Neutral rate of evolution • Classes chosen have included: • Mainly ARs • Lineage-specific nonexonic sequences • Synonymous sites in codons (5% · conservation · neutral evolution · 3 characteristics)

  14. Biased / Unbiased? • Estimates based on ARs may be biased: • The annotated and aligned ARs may comprise mainly a slowly evolving subset • ARs are under purifying selection • Lineage-specific nonexonic sequences and synonymous sites have also been found to be biased

  15. Neutral rate of evolution • The 5% study → • Conservation → • Neutral rate → • Classes chosen have included: • Lineage-specific nonexonic sequences • Synonymous sites in codons • ARs • None of which is unbiased (5% · conservation · neutral evolution · 3 characteristics)

  16. Different rates of evolution of functional sequences • Conclusions: functional RNAs illustrate: • Low conservation ↮ loss of functionality • Many functional transcripts have more relaxed structure-function constraints • Many functional elements are unconstrained • ⇒ biologically active but provide no specific benefit to the organism

  17. Conservation in the ENCODE CFTR locus • CFTR - cystic fibrosis transmembrane conductance regulator

  18. Figure 1. Conservation in the ENCODE CFTR locus

  19. What did we see so far? • Amount and function of the transcriptional output • Conservation ⇔ functionality estimates • The fraction of the genome under purifying selection may have been underestimated • May get to 11.8%

  20. The ENCODE Project • The ENCyclopedia Of DNA Elements

  21. What is a Gene ??? • From Wikipedia, the free encyclopedia: “A gene is a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions and/or other functional sequence regions”

  22. The ENCODE Project • A public research consortium • Launched by US National Human Genome Research Institute in September 2003 • Goal: identify all functional elements in the human genome sequence • Top-down research • Project Phases: • Pilot Phase • Technological Phase • Production Phase

  23. Pilot Phase • Goal: • Evaluate a variety of different methods for use in later stages • Use a number of existing techniques to analyse a portion of the genome equal to about 1% (30 Mb) • 35 groups provided more than 200 experimental & computational data sets

  24. Pilot Phase – Target Selection • 50% were selected manually • 50% were selected randomly • The two main criteria for manual selection: • The presence of well-studied genes or other known sequence elements • The existence of a substantial amount of comparative sequence data

  25. Pilot Phase – Target Selection • The randomly selected sequences: • Composed of 500 kb regions • Selected according to a stratified random-sampling strategy based on: • Gene density: #bases in genes / #other bases • Level of non-exonic conservation: 125-base windows, ≥75% base alignment with mouse, score (prediction), took the low score
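
As a rough illustration of the stratified random-sampling idea on this slide, here is a minimal Python sketch. Everything specific in it is an assumption made for the example: the field names `gene_density` and `conservation`, the tercile binning, and the choice of one window per stratum are hypothetical and are not taken from the ENCODE selection procedure.

```python
import random
from collections import defaultdict

def tercile(value, sorted_values):
    """Return 0, 1 or 2: which third of the distribution `value` falls in."""
    n = len(sorted_values)
    rank = sum(v < value for v in sorted_values)
    return min(2, 3 * rank // n)

def stratified_pick(windows, per_stratum=1, seed=0):
    """Pick windows at random within each (density, conservation) stratum.

    `windows` is a list of dicts with hypothetical keys
    'id', 'gene_density' and 'conservation'.
    """
    rng = random.Random(seed)
    dens = sorted(w["gene_density"] for w in windows)
    cons = sorted(w["conservation"] for w in windows)
    strata = defaultdict(list)
    for w in windows:
        key = (tercile(w["gene_density"], dens), tercile(w["conservation"], cons))
        strata[key].append(w)
    picked = []
    for key in sorted(strata):
        group = strata[key]
        picked.extend(rng.sample(group, min(per_stratum, len(group))))
    return picked

# Toy data: 90 made-up 500 kb windows with random feature values.
toy_rng = random.Random(1)
windows = [
    {"id": i, "gene_density": toy_rng.random(), "conservation": toy_rng.random()}
    for i in range(90)
]
chosen = stratified_pick(windows)
```

Sampling inside each stratum, rather than over the whole genome at once, keeps gene-poor and weakly conserved windows from being crowded out by the more common categories.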

  26. Technology Phase • The technology development phase is concurrent with the pilot phase • Goal: • Investigate and develop new, high throughput techniques and protocols suitable for the production phase

  27. Pilot Phase - Techniques

  28. Coverage of primary transcripts across ENCODE regions

  29. Relative proportion of different annotations among constrained sequences

  30. Overlap of constrained sequences and various experimental annotations

  31. Relationship between heterozygosity & polymorphic indel rate

  32. RACE technique & Transcript connectivity

  33. A major challenge in the project • One major challenge in the ENCODE project is annotating the large number of ncRNAs • They are difficult to find by computational or experimental means • Why?

  34. Computational difficulties • We must consider secondary structure as well as nucleotide sequence • Structure can be detected more reliably from a set of related sequences • RNA secondary structure is imperative when searching for structured ncRNAs • So RNA search algorithms are expensive…
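
To make the cost of structure-aware search concrete, here is a minimal sketch of single-sequence folding by base-pair maximization (a Nussinov-style dynamic program). This is a deliberately simplified stand-in, not the energy-based folding used by the tools discussed in these slides; the minimum loop length of 3 and the function names are assumptions for the example.

```python
def can_pair(a, b):
    """True for Watson-Crick and G-U wobble pairs."""
    return {a, b} in ({"A", "U"}, {"G", "C"}, {"G", "U"})

def max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs in `seq` (O(n^3) time, O(n^2) space)."""
    n = len(seq)
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case: position j is unpaired
            # case: j pairs with some k, leaving at least min_loop bases between them
            for k in range(i, j - min_loop):
                if can_pair(seq[k], seq[j]):
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1] if n else 0

hairpin_pairs = max_pairs("GGGAAAUCCC")
```

Even this simplified scoring is cubic in sequence length for one sequence, which hints at why combining folding with alignment across several sequences becomes expensive.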

  35. David Sankoff’s approach • In 1985 Sankoff suggested performing sequence alignment and minimum free energy folding simultaneously • For two sequences of length n it’s O(n^6) • Exponential in the number of sequences • Given the high cost, for many years it rested in oblivion... “You start as hard as you can and then, little by little, you turn it up!”
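
A quick back-of-the-envelope sketch of how that cost grows. It assumes the commonly quoted time bound of O(n^(3N)) for N sequences of length n, which reduces to the O(n^6) on the slide for N = 2; the function name is made up for the example, and the numbers are abstract operation counts, not measured runtimes.

```python
def sankoff_time_units(n, num_seqs=2):
    """Abstract operation count n^(3N) for simultaneous alignment and folding."""
    return n ** (3 * num_seqs)

# Two sequences of length 100 versus three sequences of the same length.
two_seqs = sankoff_time_units(100, 2)
three_seqs = sankoff_time_units(100, 3)
growth = three_seqs // two_seqs  # factor added by one extra sequence
```

Under this assumption each additional sequence multiplies the work by n^3, which is what "exponential in the number of sequences" means in practice.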

  36. David Sankoff’s approach • Several approximation attempts have been developed • FOLDALIGN • Dynalign • Stemloc • Consan • All trying to increase performance w/o sacrificing accuracy • They still remain relatively expensive

  37. Alternative approach • First align then fold • More attractive nowadays • RNAz & EvoFold use existing alignments • Thousands of new potential structured ncRNAs • restricted to highly conserved segments

  38. EvoFold & RNAz disadvantages • As sequence similarity drops, frequent compensating base changes cause misalignments • They assume RNA structure is present in all sequences in the alignment • Global alignments within a fixed-width sliding window
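
The notion of a compensating base change can be illustrated with a small sketch: for two alignment columns proposed to pair, count how many sequences can form a canonical pair and how many distinct pair types appear. This is only a toy covariation count written for this transcript; it is not the scoring used by RNAz, EvoFold, or CMfinder, and the toy alignment is invented.

```python
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def compensatory_support(alignment, i, j):
    """Return (distinct pair types, sequences able to pair) for columns i and j."""
    observed = [(seq[i], seq[j]) for seq in alignment]
    valid = [pair for pair in observed if pair in CANONICAL]
    return len(set(valid)), len(valid)

# Invented alignment: columns 0 and 6 covary, columns 1 and 5 are simply conserved.
toy_alignment = ["GCAAAGC", "ACAAAGU", "CCAAAGG"]
covarying = compensatory_support(toy_alignment, 0, 6)
conserved = compensatory_support(toy_alignment, 1, 5)
```

Several distinct pair types in the same column pair suggest that the structure, rather than the sequence, is being preserved. The same changes lower sequence identity, which is why sequence-based aligners tend to misplace such columns.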

  39. Our solution • CMFinder • Search set of orthologous, unaligned seq. for conservation • Doesn’t use external alignments (\orthology) • Doesn’t use sliding windows

  40. The candidates • We scanned 2 × 56,017 blocks from UCSC MULTIZ multiple alignment files • We restricted the analysis to blocks that don’t overlap exons or conserved elements • 8.68 Mb (of 30 Mb), including 3.87 Mb of repetitive sequence (RepeatMasker)

  41. The results • 10,106 predicted motifs meeting the cutoff: • Composite score > 5 • Free energy < -5 • Estimated false-positive rate of 50%
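
The cutoff on this slide amounts to a two-condition filter. A minimal sketch follows; the `Motif` record, its field names, and the example values are hypothetical and are not output from CMfinder.

```python
from dataclasses import dataclass

@dataclass
class Motif:
    name: str               # hypothetical identifier
    composite_score: float
    free_energy: float

def passes_cutoff(motif, min_score=5.0, max_energy=-5.0):
    """Keep motifs with composite score > 5 and free energy < -5, as on the slide."""
    return motif.composite_score > min_score and motif.free_energy < max_energy

# Made-up motifs: only the first satisfies both conditions.
motifs = [
    Motif("m1", 8.6, -31.4),
    Motif("m2", 5.0, -10.0),   # score is not strictly above 5
    Motif("m3", 6.0, -4.0),    # energy is not below -5
]
kept = [m.name for m in motifs if passes_cutoff(m)]
```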

  42. The results – Contd. • Some predicted motifs overlap • Some are sense/antisense to each other • Counting overlapping motifs as single candidates gives 6,587 candidate regions • Average region length: 80 nt • Covering 6.1% of the input • Denser in nonrepetitive regions (7.9% against 3.9%)
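
Collapsing overlapping motifs into candidate regions is an interval-merging step, sketched below together with a consistency check on the slide's numbers. The check assumes that the 3.87 Mb of repetitive sequence is part of the 8.68 Mb input, and that the 7.9% and 3.9% figures are coverage of the nonrepetitive and repetitive parts respectively; the coordinates in the merge example are invented.

```python
def merge_regions(intervals):
    """Merge overlapping (start, end) intervals, treating coordinates as inclusive."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(region) for region in merged]

regions = merge_regions([(40, 90), (10, 50), (200, 260)])

# Consistency check of the reported coverage figures (input sizes in Mb).
total_mb, repeat_mb = 8.68, 3.87
covered_mb = (total_mb - repeat_mb) * 0.079 + repeat_mb * 0.039
coverage_from_density = covered_mb / total_mb

# Same quantity from region count times average length (80 nt).
coverage_from_regions = 6587 * 80 / (total_mb * 1_000_000)
```

Both routes land close to the 6.1% quoted on the slide, which supports reading the numbers this way.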

  43. The results – Contd. • ENCODE regions are poor in known ncRNAs • Only one known ncRNA fully overlapped our input (hsa-miR-483) • It received a high score (composite score 8.6, free energy -31.4) • It also scored high as a miRNA with RNAmicro

  44. GENCODE overlaps • GENCODE annotations aim to identify all human protein-coding genes in the ENCODE regions • 40% of our candidates are intergenic • 60% overlap some non exonic part of a coding gene

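  45. Comparing to RNAz & EvoFold (Venn diagram of candidate counts) • CMfinder: 4933 total, 3861 unique • EvoFold: 3134 total, 2581 unique • RNAz: 3267 total, 2194 unique • CMfinder and EvoFold only: 223 • CMfinder and RNAz only: 743 • EvoFold and RNAz only: 224 • All three: 106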
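
The seven inner numbers of the Venn diagram can be assigned to regions so that each method's total is reproduced. The assignment below is inferred from that arithmetic rather than read off the original figure, so treat it as an assumption; the sketch simply verifies that it adds up.

```python
# Region -> count; keys are the sets of methods that share the candidates.
venn = {
    frozenset({"CMfinder"}): 3861,
    frozenset({"EvoFold"}): 2581,
    frozenset({"RNAz"}): 2194,
    frozenset({"CMfinder", "EvoFold"}): 223,
    frozenset({"CMfinder", "RNAz"}): 743,
    frozenset({"EvoFold", "RNAz"}): 224,
    frozenset({"CMfinder", "EvoFold", "RNAz"}): 106,
}

def method_total(method):
    """Sum every Venn region that includes `method`."""
    return sum(count for members, count in venn.items() if method in members)

totals = {m: method_total(m) for m in ("CMfinder", "EvoFold", "RNAz")}
shared_by_all = venn[frozenset({"CMfinder", "EvoFold", "RNAz"})]
```

Under this reading, only 106 candidates are found by all three methods, so the three approaches largely predict different regions.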

  46. Figure 3. Average pairwise sequence similarity of the predicted motifs versus the fraction that has been realigned compared to the original alignments Elfar Torarinsson et al. Genome Res. 2008; 18: 242-251

  47. Experimental verification • To explore the biological relevance of our prediction methods, we selected 11 high-confidence candidates • score>9, energy<-15, length>60, base change>5 • We tested expression of these 11 candidates and found that 8 of 11 candidates could be detected in human RNA by RT-PCR
