CZ5211 Topics in Computational Biology • Lecture 2: Gene Expression Profiles and Microarray Data Analysis • Prof. Chen Yu Zong • Tel: 6874-6877 • Email: yzchen@cz3.nus.edu.sg • http://xin.cz3.nus.edu.sg • Room 07-24, level 7, SOC1, NUS
Biology and Cells • All living organisms consist of cells. • Humans have trillions of cells; yeast is a single cell. • Cells are of many different types (blood, skin, nerve), but all arose from a single cell (the fertilized egg). • Nearly every cell contains a complete copy of the genome (the program for making the organism), encoded in DNA.
DNA • DNA molecules are long double-stranded chains; 4 types of bases are attached to the backbone: adenine (A), guanine (G), cytosine (C), and thymine (T). A pairs with T, C with G. • A gene is a segment of DNA that specifies how to make a protein. • Human DNA has about 25-35K genes; Rice about 50-60K but shorter genes.
Exons and Introns • Exons are coding DNA (translated into a protein), which make up only about 2% of the human genome • Introns are non-coding DNA, which provide structural integrity and regulatory (control) functions • Exons can be thought of as program data, while introns provide the program logic • Humans have much more control structure than rice
Gene Expression • Cells are different because of differential gene expression. • About 40% of human genes are expressed at one time. • A gene is expressed by transcribing DNA into single-stranded mRNA • The mRNA is later translated into a protein • Microarrays measure the level of mRNA expression
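The transcription step mentioned above can be illustrated with a minimal Python sketch. The function names and the example sequences are made up for illustration, and real transcription involves promoters, splicing, and other steps that are ignored here.

```python
# Pairing used when RNA is built against a DNA template: A->U, T->A, C->G, G->C.
DNA_COMPLEMENT_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

def transcribe_coding_strand(dna):
    """mRNA has the same sequence as the coding strand, with U in place of T."""
    return dna.upper().replace("T", "U")

def transcribe_template_strand(dna):
    """mRNA is the reverse complement of the template strand (both written 5'->3')."""
    return "".join(DNA_COMPLEMENT_TO_RNA[base] for base in reversed(dna.upper()))

coding = "ATGGCC"    # hypothetical coding-strand fragment
template = "GGCCAT"  # its reverse complement (the template strand)
print(transcribe_coding_strand(coding))
print(transcribe_template_strand(template))
```

Both calls describe the same mRNA, since the two strands of the double helix carry the same information.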
Molecular Biology Overview • [Figure: cell, nucleus, chromosome, gene (DNA), gene (mRNA, single strand), cDNA, protein]
Gene Expression • Genes control cell behavior by controlling which proteins are made by a cell • Housekeeping genes vs. cell/tissue-specific genes • Regulation: • Transcriptional (promoters and enhancers) • Post-transcriptional (RNA splicing, stability, localization; small non-coding RNAs)
Gene Expression • Regulation: • Translational (3’ UTR repressors, poly-A tail) • Post-transcriptional (RNA splicing, stability, localization; small non-coding RNAs) • Post-translational (protein modification: carbohydrates, lipids, phosphorylation, hydroxylation, methylation, precursor protein)
Gene Expression Measurement • mRNA expression represents dynamic aspects of the cell • mRNA expression can be measured with current technology • mRNA is isolated and labeled with a fluorescent dye • Labeled mRNA is hybridized to the array; the level of hybridization corresponds to the light emitted when the array is scanned with a laser
Traditional Methods • Northern Blotting • Single RNA isolated • Probed with labeled cDNA • RT-PCR • Primers amplify specific cDNA transcripts
Microarray Technology • Microarray: • New technology (first paper: 1995) • Allows study of thousands of genes at the same time • Glass slide of DNA molecules • Molecule: string of bases (25 bp – 500 bp) • Uniquely identifies the gene or unit to be studied
Gene Expression Microarrays • The main types of gene expression microarrays: • Short oligonucleotide arrays (Affymetrix) • cDNA or spotted arrays (Brown/Botstein) • Long oligonucleotide arrays (Agilent Inkjet) • Fiber-optic arrays • ...
Fabrication of Microarrays • Size of a microscope slide • Images: http://www.affymetrix.com/
Differing Conditions • Ultimate Goal: • Understand expression level of genes under different conditions • Helps to: • Determine genes involved in a disease • Pathways to a disease • Used as a screening tool
Gene Conditions • Cell types (brain vs. liver) • Developmental (fetal vs. adult) • Response to stimulus • Gene activity (wild type vs. mutant) • Disease states (healthy vs. diseased)
Expressed Genes • Genes under a given condition • mRNA extracted from cells • mRNA labeled • Labeled mRNA is mRNA present in a given condition • Labeled mRNA will hybridize (base pair) with corresponding sequence on slide
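Hybridization by base pairing can be sketched in a few lines of Python. This is an idealized illustration that assumes exact matches and uses the DNA alphabet (as for labeled cDNA); the probe and target sequences are invented, and real hybridization tolerates mismatches and depends on temperature and salt conditions.

```python
# Watson-Crick pairing for DNA: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Return the strand that base-pairs with seq (both written 5'->3')."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

def hybridizes(probe, target):
    """Idealized test: the target contains the exact reverse complement of the probe."""
    return reverse_complement(probe) in target.upper()

probe = "ATGCGT"          # hypothetical probe fixed on the slide
target = "GGACGCATCC"     # hypothetical labeled molecule from the sample
print(reverse_complement(probe))
print(hybridizes(probe, target))
```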
Two Different Types of Microarrays • Custom spotted arrays (up to 20,000 sequences) • cDNA • Oligonucleotide • High-density (up to 100,000 sequences) synthetic oligonucleotide arrays • Affymetrix (25 bases) • [Speaker note: show Affymetrix layout]
Custom Arrays • Mostly cDNA arrays • 2-dye (2-channel) • RNA from two sources (cDNA created) • Source 1: labeled with red dye • Source 2: labeled with green dye
Two Channel Microarrays • Microarrays measure gene expression • Two different samples: • Control (green label) • Sample (red label) • Both are washed over the microarray • Hybridization occurs • Each spot is one of 4 colors
Microarray Image Analysis • Microarrays detect hybridization at each spot, which appears as one of 4 colors: • Green: high in control • Red: high in sample • Yellow: equal in both • Black: none in either • The problem is to quantify the image signals
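One common way to quantify a two-channel spot is the log ratio of its red and green intensities. The sketch below is only illustrative: the base-2 logarithm, the pseudocount, the background level, and the two-fold threshold are all assumptions chosen for the example, not values from the lecture.

```python
import math

def log_ratio(red, green, pseudocount=1.0):
    """log2(red / green); the pseudocount (an assumption) avoids taking log of zero."""
    return math.log2((red + pseudocount) / (green + pseudocount))

def spot_color(red, green, background=100.0, fold=2.0):
    """Map the two intensities of a spot to the four colors described above."""
    if red < background and green < background:
        return "black"                   # neither channel above background
    m = log_ratio(red, green)
    if m >= math.log2(fold):
        return "red"                     # higher in the sample
    if m <= -math.log2(fold):
        return "green"                   # higher in the control
    return "yellow"                      # roughly equal

# Hypothetical (red, green) intensities for four spots.
for red, green in [(4000, 500), (500, 4000), (1000, 1100), (50, 20)]:
    print(red, green, round(log_ratio(red, green), 2), spot_color(red, green))
```

Working with the log ratio rather than the raw ratio treats two-fold increases and two-fold decreases symmetrically around zero.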
Single Color Microarrays • Prefabricated • Affymetrix (25mers) • Custom • cDNA (500 bases or so) • Spotted oligos (70-80 bases)
Microarray Animations • Davidson College: • http://www.bio.davidson.edu/courses/genomics/chip/chip.html • Imagecyte: • http://www.imagecyte.com/array2.html
Basic idea of Microarray • Construction • Place array of probes on microchip • Probe (for example) is oligonucleotide ~25 bases long that characterizes gene or genome • Each probe has many, many clones • Chip is about 2cm by 2cm • Application principle • Put (liquid) sample containing genes on microarray and allow probe and gene sequences to hybridize and wash away the rest • Analyze hybridization pattern
Microarray analysis • Operation principle: samples are tagged with fluorescent material to show the pattern of sample-probe interaction (hybridization) • A microarray may have 60K probes
Gene Expression Data • Gene expression data on p genes for n mRNA samples (rows: genes; columns: mRNA samples)

Gene   sample1  sample2  sample3  sample4  sample5  …
1         0.46     0.30     0.80     1.51     0.90  ...
2        -0.10     0.49     0.24     0.06     0.46  ...
3         0.15     0.74     0.04     0.10     0.20  ...
4        -0.45    -1.03    -0.79    -0.56    -0.32  ...
5        -0.06     1.06     1.35     1.09    -1.09  ...

• Entry (i, j) is the expression level of gene i in mRNA sample j: Log(Red intensity / Green intensity) for two-channel arrays, or Log(Avg. PM - Avg. MM) for Affymetrix arrays (PM = perfect-match probes, MM = mismatch probes)
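For the Affymetrix-style measure, Log(Avg. PM - Avg. MM), a minimal sketch follows. The slide does not state the base of the logarithm or what to do when the mismatch signal exceeds the perfect-match signal; base 2 and a floor of 1.0 on the difference are assumptions made here so that the logarithm stays defined. The probe intensities are invented.

```python
import math

def pm_mm_expression(pm, mm, floor=1.0):
    """log2(avg PM - avg MM); the difference is floored (an assumption) to keep the log defined."""
    avg_pm = sum(pm) / len(pm)
    avg_mm = sum(mm) / len(mm)
    return math.log2(max(avg_pm - avg_mm, floor))

# Hypothetical probe-set intensities for one gene in one sample.
perfect_match = [1200, 1000, 1100, 1100]   # average 1100
mismatch = [70, 80, 76, 78]                # average 76
print(pm_mm_expression(perfect_match, mismatch))
```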
Some possible applications • Sample from specific organ to show which genes are expressed • Compare samples from healthy and sick host to find gene-disease connection • Probes are sets of human pathogens for disease detection
Huge amount of data from single microarray • If just two color, then amount of data on array with N probes is 2N • Cannot analyze pixel by pixel • Analyze by pattern – cluster analysis
Major Data Mining Techniques • Link Analysis • Associations Discovery • Sequential Pattern Discovery • Similar Time Series Discovery • Predictive Modeling • Classification • Clustering
Cluster Analysis: Grouping Similarly Expressed Genes, Cell Samples, or Both • Strengthens signal when averages are taken within clusters of genes (Eisen) • Useful (essential ?) when seeking new subclasses of cells, tumours, etc. • Leads to readily interpreted figures
Some clustering methods and software • Partitioning: K-Means, K-Medoids, PAM, CLARA … • Hierarchical: Cluster, HAC, BIRCH, CURE, ROCK • Density-based: CAST, DBSCAN, OPTICS, CLIQUE … • Grid-based: STING, CLIQUE, WaveCluster … • Model-based: SOM (self-organizing map), COBWEB, CLASSIT, AutoClass … • Two-way clustering • Block clustering
Assessment of various methods • Algorithmic Approaches to Clustering Gene Expression Data, Ron Shamir, School of Computer Science, Tel-Aviv University • http://citeseer.nj.nec.com/shamir01algorithmic.html • Conclusion: hierarchical clustering exceptional
Hierarchical Clustering: grouping similarly expressed genes • Gene Expression Profile Analysis (rows: samples A, B, C, …; columns: genes)

gene     1     2     3     4    ..    ..   1000
       0.4   0.9     0   0.5    ..    ..    0.8
       0.2   0.8   0.3   0.2    ..    ..    0.7
       0.6   0.2     0   0.7    ..    ..    0.3
         …     …     …     …     …     …      …
After Clustering • Gene Expression Profile Analysis (rows: samples A, B, C, …; columns: genes, reordered so that similarly expressed genes are adjacent)

gene    ..     3     1     4    ..     2   1000
        ..     0   0.4   0.5    ..   0.9    0.8
        ..   0.3   0.2   0.2    ..   0.8    0.7
        ..     0   0.6   0.7    ..   0.2    0.3
         …     …     …     …     …     …      …
[Figure: clustered data compared with the same data randomized by row, by column, and by both; axis: time. Source: Eisen et al. Proc. Natl. Acad. Sci. USA 95 (1998)]
Types of Similarity Measurements • Distance measurements • Correlation coefficients • Association coefficients • Probabilistic similarity coefficients
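Correlation Coefficients • The most popular correlation coefficient is the Pearson correlation coefficient (1892) • Correlation between X = {X1, X2, …, Xn} and Y = {Y1, Y2, …, Yn}: $s_{XY} = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}}$ • where $s_{XY}$ is the similarity between X and Y, and $\bar{X}$, $\bar{Y}$ are the means of X and Y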
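The Pearson correlation between two expression profiles can be computed directly from its definition. The sketch below uses only the standard library; the two profiles are the first five values of genes 1 and 4 from the example expression table, and the function name is chosen for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)

gene1 = [0.46, 0.30, 0.80, 1.51, 0.90]
gene4 = [-0.45, -1.03, -0.79, -0.56, -0.32]
print(pearson(gene1, gene4))
```

The result always lies between -1 and 1, and a profile correlated with itself gives 1, which is the normalization the tree-construction step relies on.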
Use of Similarity for Tree Construction • Normalize similarity so that $s_{XX}$ = 1 • Then have an n×n similarity matrix S whose diagonal elements are 1 • Define the distance matrix by (for example) D = 1 – S • Diagonal elements of D are 0 • Now use the distance matrix to build a tree (using some tree-building software; recall the lecture on Phylogeny)
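The conversion from similarities to distances can be sketched as follows, assuming the Pearson correlation is used as the similarity (one possible choice; the slide says "for example"). The function names are illustrative, and the profiles are the first five values of genes 1, 2, and 4 from the example expression table.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def similarity_matrix(profiles):
    """S[i][j] = correlation between profiles i and j; the diagonal is set to exactly 1."""
    n = len(profiles)
    return [[1.0 if i == j else pearson(profiles[i], profiles[j])
             for j in range(n)] for i in range(n)]

def distance_matrix(similarity):
    """D = 1 - S, so the diagonal is 0 and anti-correlated genes are farthest apart."""
    return [[1.0 - s for s in row] for row in similarity]

profiles = [
    [0.46, 0.30, 0.80, 1.51, 0.90],
    [-0.10, 0.49, 0.24, 0.06, 0.46],
    [-0.45, -1.03, -0.79, -0.56, -0.32],
]
D = distance_matrix(similarity_matrix(profiles))
for row in D:
    print([round(d, 3) for d in row])
```

With correlation as the similarity, the distances fall between 0 (perfectly correlated) and 2 (perfectly anti-correlated).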
A dendrogram (tree) for clustered genes • Let p = number of genes. • 1. Calculate within-class correlation. • 2. Perform hierarchical clustering, which will produce (2p-1) clusters of genes. • 3. Average within clusters of genes. • 4. Perform testing on averages of clusters of genes as if they were single genes. • E.g. p = 5: leaves 1, 2, 3, 4, 5; Cluster 6 = (1,2); Cluster 7 = (1,2,3); Cluster 8 = (4,5); Cluster 9 = (1,2,3,4,5)
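The count of 2p-1 clusters comes from the p single genes plus the p-1 merges made while building the tree. The sketch below uses average linkage, which is one common choice and an assumption here (the slide does not name a linkage rule). The 5×5 distance matrix is invented so that the merges reproduce the example clusters 6 to 9.

```python
from itertools import combinations

def average_linkage(dist):
    """Agglomerative clustering on a distance matrix.

    Returns a dict mapping cluster id -> tuple of 1-based gene numbers.
    Ids 1..p are single genes; ids p+1..2p-1 are created by successive merges.
    """
    p = len(dist)
    clusters = {i + 1: (i + 1,) for i in range(p)}
    active = set(clusters)
    next_id = p + 1

    def cluster_distance(a, b):
        # Average of all pairwise gene distances between the two clusters.
        pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
        return sum(dist[i - 1][j - 1] for i, j in pairs) / len(pairs)

    while len(active) > 1:
        # Merge the closest pair of currently active clusters.
        a, b = min(combinations(sorted(active), 2),
                   key=lambda pair: cluster_distance(*pair))
        clusters[next_id] = tuple(sorted(clusters[a] + clusters[b]))
        active -= {a, b}
        active.add(next_id)
        next_id += 1
    return clusters

# Hypothetical distances between 5 genes (symmetric, zero diagonal).
dist = [
    [0.0, 0.1, 0.2, 0.9, 0.9],
    [0.1, 0.0, 0.2, 0.9, 0.9],
    [0.2, 0.2, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.0, 0.3],
    [0.9, 0.9, 0.9, 0.3, 0.0],
]
for cluster_id, members in average_linkage(dist).items():
    print(cluster_id, members)
```

Each of the nine clusters can then be averaged and tested as if it were a single gene, as in steps 3 and 4.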
A real case • Nature, Feb 2000 • Paper by Alizadeh, A. A. et al. • Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling
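Validation Techniques: Hubert’s Γ Statistic • X = [X(i, j)] and Y = [Y(i, j)] are two n×n matrices • X(i, j): similarity of gene i and gene j • Y(i, j) = 1 if genes i and j are in the same cluster, 0 otherwise • Hubert’s Γ statistic represents the point serial correlation between X and Y: $\Gamma = \frac{1}{M}\sum_{i=1}^{n-1}\sum_{j=i+1}^{n}\frac{(X(i,j)-\bar{X})(Y(i,j)-\bar{Y})}{\sigma_X\,\sigma_Y}$ • where M = n (n - 1) / 2, and $\bar{X}$, $\bar{Y}$, $\sigma_X$, $\sigma_Y$ are the means and standard deviations of the off-diagonal entries of X and Y • A higher value of Γ represents better clustering quality.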
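A minimal sketch of Hubert’s Γ follows, assuming the normalized (correlation) form of the statistic, computed over the M = n(n-1)/2 gene pairs. The similarity matrix and the two candidate cluster assignments are invented for illustration, and the function name is hypothetical.

```python
import math
from itertools import combinations

def hubert_gamma(similarity, labels):
    """Normalized Hubert's Gamma: correlation between pairwise similarity X(i, j)
    and the indicator Y(i, j) = 1 if genes i and j share a cluster, else 0."""
    n = len(labels)
    pairs = list(combinations(range(n), 2))       # the M = n(n-1)/2 pairs with i < j
    m = len(pairs)
    xs = [similarity[i][j] for i, j in pairs]
    ys = [1.0 if labels[i] == labels[j] else 0.0 for i, j in pairs]
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / m)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / m)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("Gamma is undefined when X or Y is constant over all pairs")
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (m * sd_x * sd_y)

# Hypothetical similarities: genes 0 and 1 are alike, genes 2 and 3 are alike.
S = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
print(hubert_gamma(S, [0, 0, 1, 1]))   # clustering that matches the similarities
print(hubert_gamma(S, [0, 1, 0, 1]))   # clustering that splits the similar pairs
```

The assignment that groups the similar genes together scores close to 1, while the assignment that separates them scores below zero.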
Gene Expression is Time-Dependent • Time Course Data
Sample of time course of clustered genes • [Figure: expression profiles of clustered genes; horizontal axis: time]
Limitations • Cluster analyses: • Usually outside the normal framework of statistical inference • Less appropriate when only a few genes are likely to change • Needs lots of experiments • Single gene tests: • May be too noisy in general to show much • May not reveal coordinated effects of positively correlated genes. • Hard to relate to pathways
Useful Links • Affymetrix: www.affymetrix.com • Michael Eisen Lab at LBL (hierarchical clustering software “Cluster” and “Tree View” (Windows)): rana.lbl.gov/ • Review of Currently Available Microarray Software: www.the-scientist.com/yr2001/apr/profile1_010430.html • ArrayExpress at the EBI: http://www.ebi.ac.uk/arrayexpress/ • Stanford MicroArray Database: http://genome-www5.stanford.edu/ • Yale Microarray Database: http://info.med.yale.edu/microarray/ • Microarray DB: www.biologie.ens.fr/en/genetiqu/puces/bddeng.html