The ENCODE Project + After Party !!!!!. Alon “Alonzo” Sade & Harel “Hipoptam” Shein Advisor: Prof. Michal Linial (AKA “M”) 29.7.08. Outline. Estimating amount of functional sequences in the genome The ENCODE pilot project Research on ncRNAs Research on Alt.Splicing Fun in the sun….
The ENCODE Project +After Party !!!!! Alon “Alonzo” Sade & Harel “Hipoptam” Shein Advisor: Prof. Michal Linial (AKA “M”) 29.7.08
Outline • Estimating amount of functional sequences in the genome • The ENCODE pilot project • Research on ncRNAs • Research on Alt.Splicing • Fun in the sun…
The Challenge • Understanding how our genome encodes information • How that information underpins differences between individuals and species
The Human Genome • Currently estimated number of protein-coding genes: • Human: ∼20,000-25,000 • Sea urchin: ∼23,000 • Nematode worm: ∼19,000 • Tetrahymena thermophila: ∼27,000 (because we don’t have yeast) • We are complex, so where is the information? • Protein-coding sequences account for <1.5% of the human genome • What is the function of the remainder?
Where is the information ?!? • Alternative splicing? • Non-protein-coding sequences that contain large amounts of regulatory information? • Recent discoveries suggest that the vast majority of the mammalian genome is transcribed • We’ll get back to that… Let’s define some things first
#DEFINE ncRNA • Non-coding RNA • An RNA that is not translated into a protein • Many members in this family • It was assumed that leftover RNA was “junk” • 2001 – Mattick claims: “more than 97% of RNA is ncRNA!”
#DEFINE ARs • Ancient Repeats • A CNE that was inserted into the early mammalian lineage • Primarily transposon-derived • Has since become dormant • Most are thought to be neutrally evolving
#DEFINE functional • Required for replication & structural integrity of the chromosome • Encode functional products • Required for regulation or processing • Includes sequences that may act as spacers
Transcription in the genome? • At least 70% of the mammalian genome is transcribed • Many of these transcripts are ncRNAs that show cell-specific or developmental regulation • Functionality? • Noise, or by-products available to later evolution? • But all of these may also indicate functionality!
ncRNA in the genome • Recent evidence implicates ncRNAs in the control of: • Chromatin structure • Epigenetic memory • Transcription • Translation • Splicing (possibly) • Most are evolving quickly but can still contain highly conserved regions
Let’s check out evolution… • ∼5% of small segments in mouse & human are under selection (may range between 3% and 8%) • Doesn’t include sequences that have diverged for reasons other than evolution • At the time we thought only ∼1.2% was protein-coding
Conservation • Conservation is relative • Taken to be a substitution rate, measured under the assumption “functional ⇔ evolving slower than the neutral rate” • Requires an estimate of the “neutral rate of evolution” • That estimate comes from classes of sequence expected to be evolving free of constraint • Yes, everything is relative [5 · conservation · natural (neutral) evolution]
Neutral rate of evolution • Classes chosen have included: • Mainly ARs • Lineage-specific nonexonic sequences • Synonymous sites in codons [5 · conservation · natural (neutral) evolution · 3 characteristics]
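To make "conservation is relative" concrete, here is a minimal sketch that compares the substitution rate of a candidate window with the rate in a reference set assumed to be neutral (for example, aligned ARs). The function names, the toy sequences, and the plain mismatch-fraction rate (with no correction for multiple hits) are assumptions for illustration only, not the method of the studies behind these slides.

```python
def substitution_rate(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned, ungapped columns where the two sequences differ."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped aligned columns")
    return sum(a != b for a, b in pairs) / len(pairs)


def relative_rate(candidate, neutral_reference) -> float:
    """Candidate rate divided by the 'neutral' rate; below 1 suggests constraint."""
    return substitution_rate(*candidate) / substitution_rate(*neutral_reference)


# Toy aligned human/mouse pairs (hypothetical data).
candidate = ("ACGTACGTAC", "ACGTACGTAT")          # 1 difference in 10 columns
neutral_reference = ("ACGTACGTAC", "ACTTACGAAT")  # 3 differences in 10 columns
ratio = relative_rate(candidate, neutral_reference)
print(f"relative substitution rate: {ratio:.2f}")
```

The ratio depends entirely on the denominator: if the reference set is itself evolving more slowly than the true neutral rate, the same candidate looks less constrained than it really is.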
Biased / Unbiased? • Estimates based on ARs may be biased: • The annotated and aligned ARs may comprise mainly a slowly evolving subset • ARs are under purifying selection • Lineage-specific nonexonic sequences and synonymous sites have also been found to be biased
Neutral rate of evolution • The 5% study • Conservation • Neutral rate • Classes chosen have included: • Lineage-specific nonexonic sequences • Synonymous sites in codons • ARs • None of which is unbiased [5 · conservation · natural (neutral) evolution · 3 characteristics]
Different rates of evolution of functional sequences • Conclusions: functional RNAs illustrate that: • Low conservation does not imply loss of functionality • Many functional transcripts have more relaxed structure-function constraints • Many functional elements are unconstrained • They are biologically active but provide no specific benefit to the organism
Conservation in the ENCODE CFTR locus • CFTR - cystic fibrosis transmembrane conductance regulator
What did we see so far? • Amount and function of the transcriptional output • Conservation vs. functionality estimates • The fraction of the genome under purifying selection may have been underestimated • It may reach 11.8%
The ENCODE Project • The ENCyclopedia Of DNA Elements
What is a Gene ??? Gene From Wikipedia, the free encyclopedia ”A gene is a locatable region of genomic sequence, corresponding to a unit of inheritance, which is associated with regulatory regions, transcribed regions and/or other functional sequence regions”
The ENCODE Project • A public research consortium • Launched by US National Human Genome Research Institute in September 2003 • Goal: identify all functional elements in the human genome sequence • Top-down research • Project Phases: • Pilot Phase • Technological Phase • Production Phase
Pilot Phase • Goal: • Evaluate a variety of different methods for use in later stages • Use a number of existing techniques to analyse a portion of the genome equal to about 1% (30 Mb) • 35 groups provided more than 200 experimental & computational data sets
Pilot Phase – Target Selection • 50% were selected manually • 50% were selected randomly • The two main criteria for manual selection: • The presence of well-studied genes or other known sequence elements • The existence of a substantial amount of comparative sequence data
Pilot Phase – Target Selection • The randomly selected sequences: • Composed of 500 kb regions • Selected according to a stratified random-sampling strategy based on: • Gene density: #bases in genes / #other bases • Level of non-exonic conservation: 125-base windows, base alignment with mouse at 75% or more, score (prediction), taking the low score
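A stratified random-sampling strategy of this kind can be sketched as follows: each candidate region is binned by gene density and by non-exonic conservation, and regions are then drawn at random within each bin. The three-level binning, the cutoff values, the field names, and the toy regions are all assumptions made for this sketch; the slide does not give the actual strata or thresholds.

```python
import random
from collections import defaultdict


def level(value, cutoffs):
    """Return 0 (low), 1 (medium) or 2 (high) relative to two cutoffs."""
    low, high = cutoffs
    return 0 if value < low else (1 if value < high else 2)


def stratified_sample(regions, density_cutoffs, conservation_cutoffs,
                      per_stratum, seed=0):
    """Draw up to per_stratum regions at random from each (density, conservation) bin."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for region in regions:
        key = (level(region["gene_density"], density_cutoffs),
               level(region["conservation"], conservation_cutoffs))
        strata[key].append(region)
    picked = []
    for key in sorted(strata):
        members = strata[key]
        picked.extend(rng.sample(members, min(per_stratum, len(members))))
    return picked


# Hypothetical 500 kb candidate regions with made-up scores.
regions = [
    {"name": f"r{i}", "gene_density": (i % 10) / 10, "conservation": (i % 7) / 7}
    for i in range(70)
]
picked = stratified_sample(regions, (0.3, 0.6), (0.3, 0.6), per_stratum=2)
print(len(picked), "regions picked across", 9, "strata")
```

Sampling within bins, rather than from the pooled list, keeps gene-poor and weakly conserved regions represented even when they are rare among the candidates.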
Technology Phase • The technology development phase is concurrent with the pilot phase • Goal: • Investigate and develop new, high throughput techniques and protocols suitable for the production phase
Relative proportion of different annotations among constrained sequences
Overlap of constrained sequences and various experimental annotations
Relationship between heterozygosity & polymorphic indel rate
A major challenge in the project • One major challenge in the ENCODE project is annotating the large number of ncRNAs • They are difficult to find by computational or experimental means • Why?
Computational difficulties • We must consider secondary structure as well as nucleotide sequence • Structure can be detected more reliably from a set of related sequences • RNA secondary structure is imperative when searching for structured ncRNAs • So RNA search algorithms are expensive…
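To give a feel for why structure-aware search is expensive, here is a minimal sketch of Nussinov-style base-pair maximisation for a single sequence. It is a deliberately simplified stand-in for free-energy folding (it counts pairs instead of computing energies) and is not what RNAz, EvoFold or CMfinder implement; the function names and the minimum loop length of 3 are assumptions for illustration.

```python
def can_pair(a: str, b: str) -> bool:
    """Watson-Crick pairs plus the G-U wobble pair."""
    return (a, b) in {("A", "U"), ("U", "A"), ("G", "C"),
                      ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq: str, min_loop: int = 3) -> int:
    """Maximum number of nested base pairs, with at least min_loop unpaired bases per hairpin."""
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]          # best[i][j]: optimum for seq[i..j]
    for span in range(min_loop + 1, n):         # span = j - i
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]              # case 1: base j stays unpaired
            for k in range(i, j - min_loop):    # case 2: base j pairs with base k
                if can_pair(seq[k], seq[j]):
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAAUCC"))
```

Even this single-sequence version has three nested loops, so it runs in O(n^3) time; adding alignment across several sequences on top of folding is what drives the cost up further.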
David Sankoff’s approach • In 1985 Sankoff proposed performing sequence alignment and minimum free energy folding simultaneously • For two sequences of length n it’s O(n^6) • Exponential in the number of sequences • Given the high cost, for many years it rested in oblivion... “You start as hard as you can, and then slowly, slowly, you turn it up!”
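The two cost bullets can be written as one expression. The general form below is the commonly quoted complexity of Sankoff-style simultaneous alignment and folding for N sequences of length n; the symbols T (time) and M (memory) are introduced here for illustration and do not appear on the slides.

```latex
\[
  T(n, N) = O\!\left(n^{3N}\right), \qquad
  M(n, N) = O\!\left(n^{2N}\right)
  \quad\Longrightarrow\quad
  T(n, 2) = O\!\left(n^{6}\right)
\]
```

Under this form, doubling the sequence length for a pair of sequences multiplies the running time by 2^6 = 64, and every additional sequence raises the exponent by 3.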
David Sankoff’s approach • Several approximation attempts have been developed • FOLDALIGN • Dynalign • Stemloc • Consan • All trying to increase performance w/o sacrificing accuracy • They still remain relatively expensive
Alternative approach • First align then fold • More attractive nowadays • RNAz & EvoFold use existing alignments • Thousands of new potential structured ncRNAs • restricted to highly conserved segments
EvoFold & RNAz disadvantages • As sequence similarity drops, frequent compensating base changes cause misalignments • They assume RNA structure is present in all sequences in the alignment • They use global alignments within a fixed-width sliding window
Our solution • CMfinder • Searches a set of orthologous, unaligned sequences for conservation • Doesn’t use external alignments (only orthology) • Doesn’t use sliding windows
The candidates • We scanned 2*56,017 blocks from UCSC MULTIZ multiple alignment files • We restricted the analysis to blocks that don’t overlap exons or conserved elements • 8.68 Mb (of 30 Mb), of which 3.87 Mb are repetitive sequences (RepeatMasker)
The results • 10,106 predicted motifs meeting the cutoff • Composite score > 5 • Free energy < -5 • Estimated false-positive rate of 50%
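The cutoff amounts to a two-condition filter over the predicted motifs. The sketch below applies it to a few toy records; the `Motif` class, its field names, and the example values are hypothetical, apart from the pair (8.6, -31.4), which is the score reported on a later slide for the one known ncRNA in the input.

```python
from dataclasses import dataclass


@dataclass
class Motif:
    name: str
    composite_score: float
    free_energy: float  # more negative means a more stable predicted structure


def passes_cutoff(motif: Motif, min_score: float = 5.0,
                  max_energy: float = -5.0) -> bool:
    """Keep motifs with composite score > 5 and free energy < -5."""
    return motif.composite_score > min_score and motif.free_energy < max_energy


motifs = [
    Motif("known_miRNA", 8.6, -31.4),  # values from the slides
    Motif("toy_weak_score", 4.2, -12.0),   # fails on score
    Motif("toy_weak_energy", 6.1, -3.5),   # fails on energy
    Motif("toy_borderline", 5.0, -5.0),    # fails: both cutoffs are strict
]
kept = [m.name for m in motifs if passes_cutoff(m)]
print(kept)
```

With an estimated false-positive rate of 50%, passing this filter marks a motif as a candidate worth follow-up, not as a confirmed ncRNA.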
The results – Contd. • Some predicted motifs overlap • Some are sense/antisense to each other • Merging these into single candidates gives 6587 candidate regions • Average region length: 80 nt • Covering 6.1% of the input • Denser in nonrepetitive regions (7.9% against 3.9%)
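Collapsing overlapping and sense/antisense motifs into single candidate regions is an interval-merging step. A minimal sketch follows, assuming half-open (start, end) coordinates on a single chromosome with strand ignored; the function names and the toy coordinates are hypothetical.

```python
def merge_regions(motifs):
    """Merge overlapping half-open (start, end) intervals, ignoring strand."""
    merged = []
    for start, end in sorted(motifs):
        if merged and start < merged[-1][1]:
            # Overlaps the previous region: extend it instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def coverage(regions, input_length):
    """Fraction of the input covered by non-overlapping regions."""
    return sum(end - start for start, end in regions) / input_length


# Toy motif coordinates (hypothetical); the second pair mimics a sense/antisense hit.
motifs = [(100, 180), (150, 230), (400, 470), (405, 460), (900, 980)]
regions = merge_regions(motifs)
print(regions, f"{coverage(regions, 4000):.1%}")
```

Coverage has to be computed after merging; summing the raw motif lengths would count overlapping bases more than once.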
The results – Contd. • ENCODE regions are poor in known ncRNAs • Only one known ncRNA fully overlapped our input (hsa-miR-483) • It received a high score (8.6, -31.4) • It also scored high as a miRNA with RNAmicro
GENCODE overlaps • GENCODE annotations aim to identify all human protein-coding genes in the ENCODE regions • 40% of our candidates are intergenic • 60% overlap some non-exonic part of a coding gene
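Comparing to RNAz & EvoFold (Venn diagram of candidate overlaps) • Totals: CMfinder 4933, EvoFold 3134, RNAz 3267 • CMfinder only: 3861 • EvoFold only: 2581 • RNAz only: 2194 • CMfinder and EvoFold only: 223 • CMfinder and RNAz only: 743 • EvoFold and RNAz only: 224 • All three: 106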
Figure 3. Average pairwise sequence similarity of the predicted motifs versus the fraction that has been realigned compared to the original alignments Elfar Torarinsson et al. Genome Res. 2008; 18: 242-251
Experimental verification • To explore the biological relevance of our prediction methods, we selected 11 high-confidence candidates • score>9, energy<-15, length>60, base change>5 • We tested expression of these 11 candidates and found that 8 of 11 candidates could be detected in human RNA by RT-PCR