Beyond the Human Genome: Transcriptomics

Dr Jen Taylor, Henry Wellcome Centre for Gene Function, Bioinformatics, Department of Statistics (taylor@stats.ox.ac.uk)

Beyond the Human Genome:
• 1995: Human Genome sequencing begins in earnest, “Mapping the Book of Life”
• 1999
• Human Genome 2000 - First Draft
• Human Genome 2003 - Essential Completion
• Human Genome = approx. 140,000 genes = 30,000 – 40,000 genes ?? = 24,195 genes !!!???
Image: Commemorative stained glass window for F.C. Crick, designed by Maria McClafferty (Photograph: Paul Forster). Gonville & Caius College, Cambridge, UK.

Beyond the Human Genome: Gene Number ≠ Complexity
[Diagram labels: Gene, Transcriptome, Regulation, Complexity]

Introduction: The scope of transcriptomics – a definition of the transcriptome
Part I: Observing the transcriptome
• Experimental methodology
• Data analysis
Part II: Using the transcriptome
• The regulation of the transcriptome
• The transcriptome and the genome
• The transcriptome and the proteome
Beyond the Human Transcriptome

Transcriptome: “transcriptome, the mRNAs expressed by a genome at any given time..” (Abbott, 1999)

Central Dogma of Molecular Biology
• mRNA – single-stranded RNA molecule
• Complementary to DNA
• Processed (spliced and polyadenylated) RNA transcript
• Carries the sequence of a gene out of the nucleus into the cytoplasm, where it can be translated into a protein
Image: Access Excellence, National Institutes of Health

Transcriptome: An evolving definition
• (the population of) mRNAs expressed by a genome at any given time (Abbott, 1999)
• The complete collection of transcribed elements of the genome (Affymetrix, 2004)
• mRNAs: 35,913 transcripts (including alternative spliced variants)
• Non-coding RNAs
• tRNAs (497 genes)
• rRNAs (243 genes)
• snmRNAs (small non-messenger RNAs)
• microRNAs and siRNAs (small interfering RNAs)
• snoRNAs (small nucleolar RNAs)
• snRNAs (small nuclear RNAs)
• Pseudogenes (~ 2,000)

The human transcriptome
• High density oligonucleotide arrays across 11 different cell lines
• ~ 70% of transcripts non-coding
• ~ 79-88% have multiple transcripts
• ~ 90% of transcribed nucleotides outside annotated exons
• The dimensions of the unique transcriptome?? >>> current 40,000 estimate
Kapranov et al., 2002
Kampa et al., Novel RNAs identified from an in-depth analysis of the transcriptome of human chromosomes 21 and 22. Genome Research. 2004

Transcriptomics
Scope
• the population of functional RNA transcripts
• the mechanisms that regulate the production of RNA transcripts
• dynamics of the transcriptome (time, cell type, genotype, external stimuli)
Definition
The study of the characteristics and regulation of the functional RNA transcript population of a cell/s or organism at a specific time.

Observing the transcriptome
[Diagram: Genome, Transcriptome, Proteome and Regulatory network, annotated “High-throughput friendly”, “Context dependent and dynamic” and “Predicts Biology **”]
**Li et al., 2004

Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Schena M, Shalon D, Davis RW, Brown PO. Stanford University Medical Center, CA.
“The challenge is no longer in the expression arrays themselves, but in developing experimental designs to exploit the full power of a global perspective.” Eric Lander
[Figure: Publications: Expression Profiling vs Proteomics. Data from PubMed]

Observing the transcriptome?
Classic Human Transcriptome Profiling Studies: Transcriptome reflects Biology
• Golub et al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999. (ALL – acute lymphoblastic leukemia; AML – acute myeloid leukemia)
• Scherf et al., A gene expression database for the molecular pharmacology of cancer. Nature Genetics 2000. (60 human cancer cell lines)

Observing the transcriptome
• Focussed Experimental Approaches:
  • Northern Blotting Analysis
  • Real time PCR (quantitative or semi-quantitative)
• High-throughput Approaches:
  • Closed System Profiling:
    • Microarray expression profiling
  • Open System Profiling:
    • Serial analysis of gene expression (SAGE)
    • Massively Parallel Signature Sequencing (MPSS)

Red – increase of Cy5 sample transcripts Green – increase of Cy3 sample transcripts Yellow – equal abundance Limit of Detection: 1 in 30,000 transcripts ~ 20 transcripts/cell

Experimental overview:
• Cell population A and cell population B
• RNA extraction
• Reverse transcription
• Klenow label incorporation: sample A labelled with Cy5 dye, sample B labelled with Cy3 dye
• Hybridisation
• Washing
• Scan Cy5 channel, scan Cy3 channel
• “Overlay images”
• Quantify pixel intensities

Platforms and Formats
• Isotope: Nylon – cDNA (300-900 nt)
• Two-colour: Glass – cDNA or Oligo (80 nt); 500 – 11,000 elements
• Affymetrix: Silicon – oligo (20 nt); 22,000 elements
• Tissue Arrays: Glass – Tissue Discs (20-150)

Affymetrix GeneChip®
Limits: 1:100,000 transcripts ~ 5 transcripts/cell

http://www.affymetrix.com

Affymetrix: Gene Expression Arrays (Transcripts/Genes)
• Arabidopsis Genome: 24,000
• C. elegans Genome: 22,500
• Drosophila Genome: 18,500
• E. coli Genome: 20,366
• Human Genome U133 Plus: 47,000
• Mouse Genome: 39,000
• Yeast Genome: 5,841 (S. cerevisiae) & 5,031 (S. pombe)
• Rat Genome: 30,000
• Zebrafish: 14,900
• Plasmodium/Anopheles: 4,300 (P. falciparum) & 14,900 (A. gambiae)
• Barley (25,500), Soybean (37,500 + 23,300 pathogen), Grape (15,700)
• Canine (21,700), Bovine (23,000)
• B. subtilis (5,000), S. aureus (3,300 ORFs), Xenopus (14,400)

Microarray and GeneChip Approaches
Advantages:
• Rapid
• Method and data analysis well described and supported
• Robust
• Convenient for directed and focussed studies
Disadvantages:
• Closed system approach
• Difficult to correlate with absolute transcript number
• Sensitive to alternative splicing ambiguities

Serial Analysis of Gene Expression (SAGE)
The principles (Velculescu et al., Science 1995):
• A transcript (known or novel) can be recognised by a small subset (e.g. 14) of its nucleotides – a tag
• Linking tags allows for rapid sequencing
• Open system for transcript profiling
Modified SAGE methods:
• LongSAGE (21 nt)
• SAGE-lite, micro-SAGE, mini-SAGE
• RASL/DASL methods (5’ and 3’ Tags)
[Figure: a 14 nt tag is taken from each polyadenylated transcript; the tags are linked and the linked tags are sequenced]
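
A minimal sketch of the tag principle, not the published protocol. It assumes the tag starts at the 3'-most CATG site (the NlaIII anchoring-enzyme site, which the slide does not name) and runs for 14 nt in total. The function names and toy sequences are hypothetical.

```python
from collections import Counter
from typing import Optional

ANCHOR = "CATG"  # assumed anchoring-enzyme site (NlaIII); not stated on the slide
TAG_LEN = 14     # total tag length in nt, as on the slide


def extract_tag(transcript: str) -> Optional[str]:
    """Return the tag starting at the 3'-most anchor site, or None if unavailable."""
    pos = transcript.rfind(ANCHOR)
    if pos == -1:
        return None  # transcript has no recognition sequence
    tag = transcript[pos:pos + TAG_LEN]
    # A site too close to the 3' end cannot give a full-length tag.
    return tag if len(tag) == TAG_LEN else None


def count_tags(transcripts: list) -> Counter:
    """Count tags across a pool of transcripts (a digital expression profile)."""
    tags = (extract_tag(t) for t in transcripts)
    return Counter(t for t in tags if t is not None)


# Toy pool: two copies of one transcript and one transcript without a site.
pool = ["GGACATGAAACCCGGGTTTAAAA", "GGACATGAAACCCGGGTTTAAAA", "GGGTTTAAACCC"]
profile = count_tags(pool)
```

The third toy transcript contributes nothing to the profile, which mirrors the limitation that a subset of transcripts lacks the enzyme recognition sequence.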

SAGE
Advantages:
• Potential ‘open’ system method – new transcripts can be identified
• Accuracy of unambiguous transcript observation
• Digital output of data
• Quantitative and qualitative information
Disadvantages:
• Characterising novel transcripts is often computationally difficult from short tag sequences
• Tag specificity (recently increased length to 21 bp)
• Length of tags can vary (restriction enzyme activity variable with temperature)
• A subset of transcripts do not contain the enzyme recognition sequence
• Sensitive to a subset of alternative splice variants

Analysis pipeline:
• Biological question
• Experimental design (Sample Attributes, Platform Choice)
• Microarray experiment
• 16-bit TIFF Files
• Image analysis: (Rspot, Rbkg), (Gspot, Gbkg)
• Normalization
• Statistical Analysis, Clustering, Data Mining, Pattern Discovery, Classification
• Biological verification and interpretation
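
A rough sketch of how the image-analysis outputs (Rspot, Rbkg) and (Gspot, Gbkg) might be turned into a normalised log ratio. The background floor and the median-centring step are assumptions chosen for illustration; the slide only names "Normalization" without specifying a method.

```python
import math
import statistics


def log_ratio(r_spot, r_bkg, g_spot, g_bkg, floor=1.0):
    """log2(Cy5 / Cy3) after background subtraction.

    `floor` (an assumption) stops a spot at or below background from
    producing log of zero or of a negative number.
    """
    red = max(r_spot - r_bkg, floor)
    green = max(g_spot - g_bkg, floor)
    return math.log2(red / green)


def median_normalise(ratios):
    """Centre the log ratios so that their median is 0 (simple global normalisation)."""
    med = statistics.median(ratios)
    return [m - med for m in ratios]


# Hypothetical spot: red 1100 over background 100, green 600 over background 100.
m = log_ratio(1100, 100, 600, 100)
```

A positive value corresponds to a red spot (more Cy5 sample transcript), a negative value to a green spot, and a value near 0 to yellow.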

Analysis
• Liver: 47,000 x 2 x 2 = 188,000 datapoints
• Brain: 47,000 x 2 x 2 = 188,000 datapoints
• Lymphocyte: 47,000 x 2 x 2 = 188,000 datapoints

Analysis
Essential problem: Given a large dataset with technical and biological noise, find:
A) Transcripts: patterns (common themes or differences), with measures of robustness or some idea of uncertainty
B) Samples: similarities or differences between samples on a global/multi-gene level

Analysis (Brain, Liver, Lymphocytes)
Which transcripts are different? What are the patterns?

Biologist's Nightmare: Statistician's Playground
Characteristics of the expression profiling data:
• High dimensionality
• Sample number (n) low and observation number (p) high
• Non-independence of observations
• Complex patterns: visualisation and extraction
• Incorporation of contextual information
• Standardisation and data sharing
• Integration of & with other data types

Analysis Methods
• Classical parametric & non-parametric statistical tests for hypothesis testing
• Unsupervised clustering algorithms: Hierarchical clustering, K-means and Self-Organising Maps
• Classification: e.g. Machine learning and Linear discriminant analysis
• Dimensionality Reduction or Principal Component Analysis: e.g. Gene Shaving and Multi-dimensional Scaling
• Probabilistic Modelling: Dynamic Bayesian Networks, Markov Models

Analysis Methods
Classical Parametric Statistical Analysis:
Tools:
• t-test
• ANOVA
• Mann-Whitney U test (the non-parametric alternative to the t-test)
• Fold change
Samples: Liver, Brain, Lymphocyte
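
As an illustration of two of these tools, the sketch below computes a log2 fold change and a t statistic for one transcript in two tissues. It assumes log2-scale intensities and uses Welch's unequal-variance form of the t-test, which is one choice among several. The replicate values are made up.

```python
import math
import statistics


def log2_fold_change(group_a, group_b):
    """Difference of mean log2 intensities (group_b minus group_a)."""
    return statistics.mean(group_b) - statistics.mean(group_a)


def welch_t(group_a, group_b):
    """Welch's t statistic for two groups with possibly unequal variances."""
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    std_err = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (statistics.mean(group_b) - statistics.mean(group_a)) / std_err


# Hypothetical log2 intensities for one transcript, three replicates per tissue.
liver = [8.0, 8.2, 7.8]
brain = [10.0, 10.2, 9.8]

fold = log2_fold_change(liver, brain)  # about 2, i.e. roughly 4-fold higher in brain
t_stat = welch_t(liver, brain)
```

Fold change ignores replicate variability, whereas the t statistic scales the same difference by its standard error, so the two can rank transcripts differently.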

Analysis Methods
Classical Parametric Statistical Analysis:
At P = 0.01, testing 20,000 transcripts is expected to give about 200 transcripts that pass by chance alone, even if no transcript truly differs.
Difficulties:
• Assumes that observations are normally distributed and independent
• ‘Statistical significance’ does not equal biological significance
• Appropriate multiple testing corrections are difficult ???
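
The arithmetic behind the 200 transcripts, plus two standard corrections, can be sketched as follows. Bonferroni and Benjamini-Hochberg are not named on the slide; they are brought in here only as common examples, and the p-values are invented.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of tests passing by chance when every null hypothesis is true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of tests called significant at the given false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its step-up threshold.
        if p_values[i] <= rank / m * fdr:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


by_chance = expected_false_positives(20_000, 0.01)
strict = bonferroni_threshold(20_000, 0.01)
hits = benjamini_hochberg([0.01, 0.02, 0.03, 0.5], fdr=0.05)
```

Bonferroni is very conservative with 20,000 tests, which is one reason false discovery rate methods are often preferred for expression data.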

Analysis Methods
Clustering Approaches:
• Divides or groups genes/samples into groups (“clusters”), based on similarities and differences
• Number of groups is user defined
Algorithms:
• Hierarchical clustering
• K-means clustering
• Self organising maps

Distance Metrics
Distance between 2 expression vectors (expression profiles over time):
Euclidean    Pearson (r*-1)
1.4          -0.90
4.2          -1.00
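
A small sketch of the two metrics, using the slide's convention of Pearson distance as r multiplied by -1. The expression vectors are hypothetical and chosen so that the two metrics disagree.

```python
import math


def euclidean(x, y):
    """Straight-line distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def pearson_distance(x, y):
    """r * -1: a value of -1 means the two profiles have the same shape."""
    return -pearson_r(x, y)


# Same shape over time, different magnitude (hypothetical values).
profile_a = [1.0, 2.0, 3.0]
profile_b = [2.0, 4.0, 6.0]

d_euclid = euclidean(profile_a, profile_b)
d_pearson = pearson_distance(profile_a, profile_b)
```

Here the Euclidean distance is large while the Pearson distance is at its minimum, because Pearson compares the shape of the profiles and ignores their magnitude.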

Distance Metric
[Figure: Pearson distance and Euclidean distance compared for a Transcription Factor Transcript, Target Transcript 1 and Target Transcript 2]

Hierarchical Clustering
[Figure: genes g1 to g8 reordered step by step as the most similar genes are grouped]
• g1 is most like g8
• g4 is most like {g1, g8}

Hierarchical Tree
[Figure: tree with leaf order g1, g8, g4, g5, g7, g2, g3, g6]
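
The grouping of g1 with g8, and then g4 with {g1, g8}, can be reproduced with a bare-bones agglomerative sketch. Euclidean distance and average linkage are assumptions, since the slides do not state which were used, and the profiles are invented.

```python
import math
from itertools import combinations


def average_linkage(cluster_a, cluster_b, profiles):
    """Mean Euclidean distance over all pairs drawn from the two clusters."""
    dists = [math.dist(profiles[a], profiles[b]) for a in cluster_a for b in cluster_b]
    return sum(dists) / len(dists)


def hierarchical_clustering(profiles):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        pair = min(
            combinations(clusters, 2),
            key=lambda p: average_linkage(p[0], p[1], profiles),
        )
        merges.append(pair)
        clusters = [c for c in clusters if c not in pair] + [pair[0] + pair[1]]
    return merges


# Hypothetical two-condition expression profiles.
profiles = {
    "g1": [1.0, 2.0],
    "g8": [1.1, 2.1],
    "g4": [1.5, 2.5],
    "g5": [8.0, 9.0],
}
merges = hierarchical_clustering(profiles)
```

The order of the merges defines the tree: the earliest merges sit at the bottom of the tree and the last merge is the root.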

Clustering: Case Study
Sorlie et al., 2001: Breast tissue subtypes, hierarchical clustering

K-means clustering
Partition or centroid algorithms
Step 1: User specifies K clusters (K = 3)
[Figure: expression level in Brain plotted against expression level in Liver, with three starting centroids marked x]

K-means clustering (K = 3)
Step 2 – Using Euclidean distance, nearest points are assigned to clusters (k)
Step 3 – New centroids calculated

K-means clustering (K = 3)
Step 4 – Points re-assigned to nearest centroid
Step 5 – New centroids calculated
Iterates until centroids don’t move
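
Steps 1 to 5 can be sketched in a few lines. This is an illustrative version with user-supplied starting centroids and made-up (Liver, Brain) expression values, not a tuned implementation.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Assign points to the nearest centroid, recompute centroids, repeat until stable."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Steps 2 and 4: assign each point to its nearest centroid (Euclidean).
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Steps 3 and 5: recompute each centroid as the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # centroids did not move
        centroids = new_centroids
    return labels, centroids


# Hypothetical (Liver, Brain) expression levels for nine transcripts, K = 3.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (1, 9), (2, 9), (1, 8)]
labels, centroids = kmeans(points, centroids=[(0, 0), (10, 10), (0, 10)])
```

The result depends on the starting centroids, so in practice the algorithm is often run from several starts.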

Classification
• K-nearest neighbour methods (KNN)
• Linear Discriminant Analysis (LDA)
• Machine Learning: Support Vector Machines, Neural Network Analysis
[Figure: samples plotted by Transcript A against Transcript B. Adapted from Florian Markowetz]

Classification
• Training Set: 2/3 of the sample set
• Test Set: 1/3 of the sample set
• Define Classification Rule: Linear Discriminant Analysis, KNN
[Figure: samples plotted by Gene A against Gene B]
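
One way the 2/3 training and 1/3 test split might be done, with a helper that scores any classification rule on the held-out test set. The function names, the fixed random seed and the sample data are all hypothetical.

```python
import random


def split_train_test(samples, train_fraction=2 / 3, seed=0):
    """Shuffle the samples reproducibly and split them into training and test sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def accuracy(classify, test_set):
    """Fraction of test samples whose predicted label matches the true label."""
    correct = sum(1 for features, label in test_set if classify(features) == label)
    return correct / len(test_set)


# Hypothetical samples: ((gene A level, gene B level), class label).
samples = [((float(i), float(i % 3)), "A" if i < 5 else "B") for i in range(9)]
train, test = split_train_test(samples)
```

The classification rule is defined on the training set only, so the accuracy on the test set estimates how the rule behaves on unseen samples.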

Classification
More complex classifiers
KNN – Voting scheme (k=3): use the three closest points to classify
[Figure: samples plotted by Gene A against Gene B. Adapted from Florian Markowetz]
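
The k=3 voting scheme can be written as a short sketch. Euclidean distance is assumed, the training values are invented, and the ALL/AML labels are used only as example class names.

```python
import math
from collections import Counter


def knn_classify(sample, training, k=3):
    """Label a sample by majority vote among its k nearest training points."""
    nearest = sorted(training, key=lambda item: math.dist(sample, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set: ((gene A level, gene B level), class label).
training = [
    ((1.0, 1.2), "ALL"),
    ((1.5, 0.8), "ALL"),
    ((0.9, 1.0), "ALL"),
    ((5.0, 5.2), "AML"),
    ((5.5, 4.8), "AML"),
    ((4.9, 5.1), "AML"),
]

prediction = knn_classify((1.1, 1.0), training, k=3)
```

An odd k avoids tied votes when there are two classes.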

Probabilistic Modelling
• Incorporate dependencies and prior knowledge into the identification of patterns/clusters:
  - relationships in time between samples
  - relationships between genes
• Handle measures of uncertainty well
• Conceptually simple, but implementation needs consideration
• Markov modelling
• Dynamic Bayesian networks

Analysis Methods
• Classical parametric & non-parametric statistical tests for hypothesis testing
• Unsupervised clustering algorithms: Hierarchical clustering, K-means and Self-Organising Maps
• Classification: Machine learning and Linear discriminant analysis
• Dimensionality Reduction or Principal Component Analysis: Gene Shaving and Multi-dimensional Scaling
• Probabilistic Modelling and Pattern recognition: Dynamic Bayesian Networks, Markov Models

Introduction: The scope of transcriptomics – a definition of the transcriptome
Part I: Observing the transcriptome
• Experimental methodology
• Data curation and analysis pipelines
Part II: Using the transcriptome
• The regulation of the transcriptome
• The transcriptome and the genome
• The transcriptome and the proteome
Beyond the Human Transcriptome

…. to be continued.
