Tag-based expression/function analysis • Data files at webpage (link at today's date), and also: http://people.binf.ku.dk/albin/teaching/htbinf/tag_analysis/
Where are we now? • R to do statistics • Genome browsers and Galaxy to visualize genes and genomics data • Analyzing expression by microarrays + R and Bioconductor • Tag analysis • Proteomics
What we want in transcriptomics • Know which transcripts are transcribed, and how much they are transcribed • Implicitly also which transcripts exist in the cell, and what they look like! • Intuitively, we could get all this information by sequencing all mRNAs in one cell
General problems with cDNA sequencing: • Reverse transcriptase falls off • Hard to sequence long transcripts • Many cDNAs are identical, but some occur only once per cell (or less!), so we need to sequence MANY cDNAs • Very expensive if you want to sequence all molecules
Solutions: 1) Do not sequence: use probes and hybridization: microarrays and tiling arrays (this is where we are now!) 2) Only sequence parts of transcripts: tag sequencing (this is where we are heading)
Thought exercise • What are the pros/cons of hybridization (micro/tiling arrays) vs sequencing? 2 minutes with your neighbour
Albin’s take
• Hybridization: + cheap (per “gene”); + mature methods; + standardized; - complex normalization needed; - cross-hybridization; - highly dependent on annotation of probes; - dependent on designed probes for genes; - cannot deal with repeats; +/- integrative signal (more on next slide)
• Sequencing: - expensive (now, but changing); - “unbiased”: no designed probes; - non-standard computational methods; - more demanding processing (now); - much easier statistics in the end; + less noisy; + much higher resolution, up to nucleotide level; + location information; +/- sampled signal (more on next slides)
Hybridization: integrative • We have many identical probes • Each time a probe gets a hybridization event, we add a little to the signal • This includes non-optimal hybridization events: just something labeled that hybridizes will give some signal
Sequencing: sampling • The number of cDNAs in a library is VERY LARGE • We pick only some of them to do sequencing, randomly • Blind sampling (does not know anything about RNAs) • We map sequences back to the genome (a kind of quality check)
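The sampling idea can be mimicked in a few lines of R. This is only a minimal sketch, assuming a made-up library of three transcripts (geneA, geneB, geneC) with invented abundances; none of the numbers come from real data.

```r
# Hypothetical cDNA library: three transcripts with very different abundances
set.seed(1)  # make the random draw reproducible
library_counts <- c(geneA = 900000, geneB = 99000, geneC = 1000)
true_fraction <- library_counts / sum(library_counts)

# Blind sampling: sequence only 1000 cDNAs, picked at random.
# We sample a tiny part of the library, so a multinomial draw is a fair approximation.
n_sequenced <- 1000
tags <- rmultinom(1, size = n_sequenced, prob = true_fraction)[, 1]

# The observed tag fractions estimate the true fractions,
# but rare transcripts are noisy and can be missed entirely
observed_fraction <- tags / n_sequenced
print(rbind(true_fraction, observed_fraction))
```

Rerunning with a different seed or a smaller n_sequenced shows how unstable the estimate for the rare transcript is.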
Why is this interesting? • Sequencing approaches are generally better than hybridization in quality and you can also do more diverse experiments • New sequencers make it possible to do this almost as cheaply as with hybridization – normal research groups can now buy the capacity of an old sequencing centre • It is basically the technology of the future
5 types of sequencing data for expression and functional studies • Non-subtracted cDNA • ESTs • SAGE • CAGE • RNA-seq
Why so many techniques? • Historical reasons – technology development over time • Some of these technologies are only for expression – others also give other information (and different information) • Difference in costs - efficiency
Non-subtracted cDNA • Theoretically possible to sequence all cDNAs in a cell • Very, very expensive! • Hard to get true expression, since amplification is length-dependent • Not very necessary to have the whole cDNA for expression?
Expressed sequence tags (ESTs) • Sequence from 5’ and 3’ ends, until the reverse transcriptase falls off • Cheaper than full-length cDNAs • Problems: many ESTs are simply trash, the result of over-enthusiastic sequencing • For longer genes, no coverage of the middle part
How can we use ESTs? • View the EST as a random sample from a pool of transcripts: • The number of ESTs found from a transcript should be proportional to the concentration of that transcript in the cell = the expression • How do we know which transcript an EST comes from?
Unigene: clustering ESTs to “genes” • Back in the 90s, the idea was to use a lot of ESTs to find, and puzzle together, genes • The Unigene database is one of the outcomes of this. Slightly obsolete, but useful at times • Basically, it tries to cluster ESTs and cDNAs to functional units: “genes” • Bonus: we can use this to look at expression of these genes, because we can count ESTs from different libraries
Thought exercise: How? • Say that we have two lung EST libraries (= two collections of tags) from two patients, one of whom has lung cancer • How can we show that a given gene, like RARA, is significantly altered in expression in lung cancer? • Think R! What do we need, and what tests should we use? • 2 minutes with your neighbour
“Electronic Northern blot” • In a nutshell: fill in the following contingency table for a given gene • Fisher exact test situation! • We can do this within Unigene for single genes
Side-story for non-life-scientists: Northern what? • Northern blot is a classical method for detecting RNA molecules • Related to Southern and Western blot (DNA and protein detection methods)
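The contingency table itself did not survive in this text version of the slide. One plausible way to lay it out in R is sketched below, assuming invented EST counts for a single gene in a normal and a cancer library; the layout (gene vs all other ESTs, one column per library) is an assumption, not necessarily the table shown in the lecture.

```r
# Hypothetical EST counts (not real data): ESTs from the gene of interest
# and the total number of ESTs in each of the two libraries
gene_ests  <- c(normal = 5,     cancer = 20)
total_ests <- c(normal = 50000, cancer = 60000)

# 2 x 2 contingency table: rows = gene / all other ESTs, columns = library
tab <- rbind(gene = gene_ests, other = total_ests - gene_ests)
print(tab)

# Fisher's exact test asks whether the gene's share of ESTs differs between the libraries
res <- fisher.test(tab)
print(res$p.value)
```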
However… • An electronic Northern is just a clever name, although it has the same goals - finding RNAs • It is nothing more than a statistical over-representation test of mRNAs, by use of ESTs
Unigene: • http://www.ncbi.nlm.nih.gov/sites/entrez?db=unigene • …or just google for unigene
Let’s look at the tissue constraints of human RARA… • EST hits from different tissues • Public microarray data (nice for comparison, but not important now)
Note that the sample sizes are very different! 1 tag out of 282332 is not the same as 1 tag out of 131488
What is TPM? • TPM = Tags per million • A normalization to be able to compare libraries of different sizes. Used very often for tag-based expression • “How many tags would my gene have if the sample size were 1 million?” • …so, TPM = 10^6 * (#tags in my gene) / (#total tags)
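As a small illustration, the formula can be wrapped in an R function. The function name tpm is made up for this sketch; the two library sizes are the ones quoted on the previous slide.

```r
# Tags per million: scale a raw tag count to a library of one million tags
tpm <- function(gene_tags, total_tags) {
  1e6 * gene_tags / total_tags
}

# One tag in each of the two library sizes from the previous slide
tpm(1, 282332)  # about 3.5 TPM
tpm(1, 131488)  # about 7.6 TPM
```

The same single tag counts for more than twice as much in the smaller library, which is why raw counts cannot be compared directly.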
Challenge • Is the RARA gene significantly different in expression in eye vs blood?
> a <- matrix(c(12, 12, 124139 - 12, 210756 - 12), nrow = 2, byrow = TRUE)
> fisher.test(a)
Fisher's Exact Test for Count Data
data: a
p-value = 0.2078
# so, despite a roughly 1.7-fold difference in TPM (about 97 vs 57), not significant
So ESTs are fantastic? • …not really! Sometimes useful, but: • There are too few of them, and the libraries are very diverse • …and they are way too expensive to make routinely in a normal lab • Basically, ESTs are rarely used now, but the data is worth considering
Modern tag sequencing • SAGE, CAGE and RNA-seq
Underlying idea: • Only sequence as much as you need: 5', 3' or whole cDNA (in pieces) • Map tags to known cDNAs or the genome (Thought exercise: what is the difference?)
SAGE • After sequencing: • Mask out adapters and primers • Make a database of all possible hits in mRNAs following the restriction site (white board demo) • Map tags to this database, or the genome • Mapping is surprisingly tricky: we cannot use BLAST or BLAT alignments (the sequences are too short); sequencing errors exist, as well as RNA editing; some species have very few known mRNAs
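The "database of all possible hits" step can be sketched in R. The slide does not name the enzyme or the tag length, so the sketch assumes the classic short-SAGE setup: the anchoring site CATG (NlaIII) and a 10-base tag taken directly after the 3'-most site. The function virtual_tag and the toy mRNAs are made up for illustration.

```r
# Extract the "virtual" SAGE tag from an mRNA sequence.
# Assumptions (not stated on the slide): anchoring site CATG (NlaIII),
# tag = the 10 bases directly following the 3'-most site.
virtual_tag <- function(mrna, site = "CATG", tag_length = 10) {
  hits <- gregexpr(site, mrna, fixed = TRUE)[[1]]
  if (hits[1] == -1) return(NA_character_)          # no restriction site at all
  tag_start <- max(hits) + nchar(site)              # first base after the last site
  tag_end <- tag_start + tag_length - 1
  if (tag_end > nchar(mrna)) return(NA_character_)  # site too close to the 3' end
  substr(mrna, tag_start, tag_end)
}

# Hypothetical toy mRNAs, invented for this example
mrnas <- c(geneA = "GGCATGAAACCCGGGTTTACGT",
           geneB = "TTTTCATGCCCATGACGTACGTACGG",
           geneC = "AAAAAAAAAAAAAAAA")
tag_db <- sapply(mrnas, virtual_tag)
print(tag_db)
```

Observed tags can then be looked up in tag_db by exact match, which sidesteps BLAST-style alignment but does not tolerate sequencing errors.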
Common approach • First identify all unique tags, and how many times we have seen them:
AAAGATGCTGC 67
CAGTCGATCGAT 192
…
• Correlate these tags with our gene database • Sum up all the tags for each gene • Make expression analysis!
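The three steps can be sketched in base R. The tag sequences, the tag-to-gene mapping and the gene names below are all invented for illustration; a real mapping would come from a tag database built from known mRNAs.

```r
# Hypothetical raw tag sequences, as they might come off the sequencer
raw_tags <- c("AAAGATGCTG", "CAGTCGATCG", "AAAGATGCTG", "TTTGGGCCCA",
              "CAGTCGATCG", "AAAGATGCTG", "GGGGGGGGGG")

# Step 1: unique tags and how often each was seen
tag_counts <- as.data.frame(table(tag = raw_tags),
                            responseName = "count", stringsAsFactors = FALSE)

# Step 2: correlate tags with a gene database (made-up tag-to-gene mapping)
tag_to_gene <- data.frame(
  tag  = c("AAAGATGCTG", "CAGTCGATCG", "TTTGGGCCCA"),
  gene = c("geneA", "geneB", "geneA"),
  stringsAsFactors = FALSE
)
mapped <- merge(tag_counts, tag_to_gene, by = "tag")  # unmapped tags drop out here

# Step 3: sum the counts of all tags that belong to the same gene
gene_counts <- aggregate(count ~ gene, data = mapped, FUN = sum)
print(gene_counts)
```

Note that the tag GGGGGGGGGG silently disappears in the merge because it maps to no known gene, which is one reason results depend so heavily on annotation.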
How can we analyze count data? • The difference to microarrays is that we deal with integers • The more counts for a gene, the more expressed it is: theoretically a linear relation. We are theoretically counting actual RNA molecules • Very much like the EST case, we can make statistics based on contingency tables if we have two samples
Data flow for tags • …is a bit too complex for this course to do in real life: it takes time and requires programming (and a big computer) • Mapping of tags to genes is complex, and no standard solutions are adopted (yet) • Statistical analysis often involves making multiple Fisher exact tests, which involves some R programming • To get a feeling for the data, we will instead use a website to do these things for us
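To give a feeling for what "multiple Fisher exact tests" means in practice, here is a minimal R sketch with invented per-gene counts and library sizes. The multiple-testing correction at the end is an extra step not mentioned on the slide, added here as an assumption about what a careful analysis would include.

```r
# Hypothetical per-gene tag counts in two libraries (made-up numbers)
counts <- data.frame(
  gene   = c("geneA", "geneB", "geneC"),
  normal = c(10, 200, 3),
  cancer = c(60, 210, 4)
)
total_normal <- 50000  # total mapped tags per library, also made up
total_cancer <- 55000

# One Fisher exact test per gene: gene tags vs all other tags, normal vs cancer
counts$p_value <- mapply(function(n_tags, c_tags) {
  tab <- matrix(c(n_tags, c_tags,
                  total_normal - n_tags, total_cancer - c_tags),
                nrow = 2, byrow = TRUE)
  fisher.test(tab)$p.value
}, counts$normal, counts$cancer)

# Many tests were made, so adjust the p-values (Benjamini-Hochberg used here)
counts$p_adjusted <- p.adjust(counts$p_value, method = "BH")
print(counts)
```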
Typical data after mapping:
Tag         Frequency
AAAAAAAAAA  173
AAAAAAAAAG  1
AAAAAAAAAT  1
AAAAAAAATA  2
AAAAAAACAA  1
AAAAAAACTA  2
AAAAAAATAA  1
We want to go from here to actual counts per gene: we will let a web system do this for us
In the data directory, I have collected two such files: SAGE_Colon…, corresponding to normal and cancer colon • These are linked in the web page, also here: http://people.binf.ku.dk/albin/teaching/htbinf/tag_analysis/ • Then, go to http://cgap.nci.nih.gov/SAGE/ • This page has many SAGE-related analyses. We will try Digital Gene Expression Displayer (DGED)
Challenge • Using DGED • Use the “Two of your files” option to use the two colon samples. Select “short tags” • Try to understand what the statistical test does (accept defaults) • What types of genes are “over-expressed” in colon i) cancer tissue vs normal tissue, ii) normal tissue vs cancer tissue
Thought exercise • What are the limitations with SAGE?
Albin’s take • We can only measure expression: the location of tags in genes has no functional meaning • Dependent on gene annotation: we can map to the genome, but such data is hard to interpret (what genes?) • Compared to array data: very few standard analysis methods • Limited sequencing depth
5’ tagging • Three methods that really do the same thing. The differences lie in chemistry, throughput and length of tags • CAGE • 5’ SAGE • 5’ Oligo-capping • We will use CAGE as an example (“Cap Analysis of Gene Expression”)
CAGE Sequencing and mapping to the genome
CAGE vs … • SAGE • Conceptually the same thing, but you catch the 5’ end of the gene: the transcription start site and thereby the promoter, which is a functional entity • Higher number of tags • 5’ ends give functional data apart from expression
Issues • Only capped transcripts • Some real transcripts are not capped • Some capped transcripts are not full-length • Associating 5’ ends with gene products is sometimes problematic • We only know starts of genes, not the length • Tag length is borderline for mapping: 20-21 bp • Not clear how to define cutoffs: how many tags make a “real biological promoter”? • Under-sampling: we miss a lot of promoters because there are so many of them
Strengths • We are actually looking at promoters, not genes • Find novel promoters, sometimes within known genes • We can look at expression at promoter level, for instance define “tissue-specific” promoters • We can get a first unbiased look at where promoters are, and how much they are used in a given cell
CAGE concepts • The atomic unit in CAGE is the tag, mapped to the genome. The tag comes from a given experiment (and has a label) • What positional information is the most relevant for analysis? • (Figure: the tag, 20-21 bp)
Only 5’ ends are interesting! • …since the 20 bp length is only for mapping purposes. • What if we have many tags overlapping one another? How can we represent this?
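One possible representation, offered here as an assumption rather than as the lecture's own answer, is to reduce every mapped tag to its 5' end and count how many tags start at each genomic position. The coordinates and column names below are made up for illustration.

```r
# Hypothetical mapped CAGE tags (coordinates made up); start < end on the genome
tags <- data.frame(
  chr    = c("chr1", "chr1", "chr1", "chr1", "chr1"),
  start  = c(1000, 1000, 1002, 5000, 5001),
  end    = c(1019, 1020, 1021, 5019, 5020),
  strand = c("+", "+", "+", "-", "-"),
  stringsAsFactors = FALSE
)

# The 5' end is the start on the plus strand and the end on the minus strand
tags$five_prime <- ifelse(tags$strand == "+", tags$start, tags$end)

# Collapse overlapping tags: count how many tags begin at each position and strand
tss_counts <- aggregate(n_tags ~ chr + five_prime + strand,
                        data = transform(tags, n_tags = 1), FUN = sum)
print(tss_counts)
```

Overlapping tags with the same 5' end collapse into one position with a count, while tags that start a few bases apart stay separate, so the tag length no longer matters after mapping.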