LSM3241: Bioinformatics and Biocomputing
Lecture 8: Gene Expression Profiles and Microarray Data Analysis
Prof. Chen Yu Zong
Tel: 6874-6877
Email: yzchen@cz3.nus.edu.sg
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, NUS
Biology and Cells • All living organisms consist of cells (a human has trillions of cells; yeast is a single cell). • Cells are of many different types (blood, skin, nerve), but all arise from a single cell (the fertilized egg). • Nearly every cell contains a complete copy of the genome (the program for making the organism), encoded in DNA.
Gene Expression • Cells are different because of differential gene expression. • About 40% of human genes are expressed at one time. • Gene is expressed by transcribing DNA into single-stranded mRNA • mRNA is later translated into a protein • Microarrays measure the level of mRNA expression
Overview of Molecular Biology [Figure: a cell and its nucleus, labeled with chromosome, gene (DNA), gene (mRNA, single strand), cDNA, and protein]
Gene Expression • Genes control cell behavior by controlling which proteins are made by a cell • Housekeeping genes vs. cell/tissue-specific genes • Regulation: • Transcriptional (promoters and enhancers) • Post-transcriptional (RNA splicing, stability, localization; small non-coding RNAs)
Gene Expression • Regulation (continued): • Translational (3' UTR repressors, poly-A tail) • Post-translational (protein modification: carbohydrates, lipids, phosphorylation, hydroxylation, methylation, precursor protein)
Gene Expression Measurement • mRNA expression reflects the dynamic state of the cell • mRNA expression can be measured with current technology • mRNA is isolated and labeled with a fluorescent dye • Labeled mRNA is hybridized to the array; the level of hybridization corresponds to the fluorescence emitted when the array is scanned with a laser
Traditional Methods • Northern Blotting • Single RNA isolated • Probed with labeled cDNA • RT-PCR • Primers amplify specific cDNA transcripts
Microarray Technology • Microarray: • New Technology (first paper: 1995) • Allows study of thousands of genes at same time • Glass slide of DNA molecules • Molecule: string of bases (25 bp – 500 bp) • uniquely identifies gene or unit to be studied
Gene Expression Microarrays The main types of gene expression microarrays: • Short oligonucleotide arrays (Affymetrix) • cDNA or spotted arrays (Brown/Botstein) • Long oligonucleotide arrays (Agilent Inkjet) • Fiber-optic arrays • ...
Fabrications of Microarrays • Size of a microscope slide Images: http://www.affymetrix.com/
Differing Conditions • Ultimate Goal: • Understand expression level of genes under different conditions • Helps to: • Determine genes involved in a disease • Pathways to a disease • Used as a screening tool
Gene Conditions • Cell types (brain vs. liver) • Developmental (fetal vs. adult) • Response to stimulus • Gene activity (wild type vs. mutant) • Disease states (healthy vs. diseased)
Expressed Genes • Genes under a given condition • mRNA extracted from cells • mRNA labeled • Labeled mRNA is mRNA present in a given condition • Labeled mRNA will hybridize (base pair) with corresponding sequence on slide
Two Different Types of Microarrays • Custom spotted arrays (up to 20,000 sequences) • cDNA • Oligonucleotide • High-density (up to 100,000 sequences) synthetic oligonucleotide arrays • Affymetrix (25 bases) • SHOW AFFYMETRIX LAYOUT
Custom Arrays • Mostly cDNA arrays • 2-dye (2-channel) • RNA from two sources (cDNA created) • Source 1: labeled with red dye • Source 2: labeled with green dye
Two Channel Microarrays • Microarrays measure gene expression • Two different samples: • Control (green label) • Sample (red label) • Both are washed over the microarray • Hybridization occurs • Each spot is one of 4 colors
Microarray Image Analysis • Each spot shows relative expression as one of 4 colors: • Green: higher in control • Red: higher in sample • Yellow: equal in both • Black: expressed in neither • The problem is to quantify the image signals
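The four spot colors listed above can be sketched as a simple rule on the two channel intensities. This is only an illustration: the function name `spot_color` and the cutoffs `floor` and `fold` are hypothetical, and real scanner images vary continuously between these colors rather than falling into four bins.

```python
def spot_color(red, green, floor=50.0, fold=2.0):
    """Label a two-channel spot as 'red', 'green', 'yellow', or 'black'.

    floor and fold are hypothetical cutoffs chosen for illustration only.
    """
    if red < floor and green < floor:
        return "black"    # neither channel shows signal above the floor
    if red >= fold * green:
        return "red"      # sample (red) clearly exceeds control (green)
    if green >= fold * red:
        return "green"    # control (green) clearly exceeds sample (red)
    return "yellow"       # comparable signal in both channels


for red, green in [(1000, 100), (100, 1000), (500, 520), (10, 5)]:
    print(red, green, spot_color(red, green))
```

Quantifying the image then amounts to replacing these coarse labels with a number per spot, such as a log ratio of the two intensities.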
Single Color Microarrays • Prefabricated • Affymetrix (25mers) • Custom • cDNA (500 bases or so) • Spotted oligos (70-80 bases)
Microarray Animations • Davidson College: • http://www.bio.davidson.edu/courses/genomics/chip/chip.html • Imagecyte: • http://www.imagecyte.com/array2.html
Basic idea of Microarray • Construction • Place array of probes on microchip • Probe (for example) is oligonucleotide ~25 bases long that characterizes gene or genome • Each probe has many, many clones • Chip is about 2cm by 2cm • Application principle • Put (liquid) sample containing genes on microarray and allow probe and gene sequences to hybridize and wash away the rest • Analyze hybridization pattern
Microarray analysis Operation Principle: Samples are tagged with fluorescent material to show the pattern of sample-probe interaction (hybridization). A microarray may have 60K probes.
Gene Expression Data
Gene expression data on p genes (rows) for n mRNA samples (columns):

Gene   sample1  sample2  sample3  sample4  sample5  …
1        0.46     0.30     0.80     1.51     0.90   ...
2       -0.10     0.49     0.24     0.06     0.46   ...
3        0.15     0.74     0.04     0.10     0.20   ...
4       -0.45    -1.03    -0.79    -0.56    -0.32   ...
5       -0.06     1.06     1.35     1.09    -1.09   ...

Entry (i, j) is the gene expression level of gene i in mRNA sample j: Log(Red intensity / Green intensity) for two-channel arrays, or Log(Avg. PM - Avg. MM) for Affymetrix arrays.
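A minimal sketch of the two log-transformed measures named above. It assumes base-2 logarithms (the slide does not state a base) and reads PM and MM as Affymetrix perfect-match and mismatch probe intensities; the function names and the intensity values are hypothetical.

```python
import math


def log_ratio(red, green):
    """Two-channel expression value: log2 of red over green intensity."""
    return math.log2(red / green)


def affy_signal(pm, mm):
    """Single-channel value: log2(average PM - average MM).

    Raises ValueError when the difference is not positive, because the
    logarithm is undefined there.
    """
    diff = sum(pm) / len(pm) - sum(mm) / len(mm)
    if diff <= 0:
        raise ValueError("average PM must exceed average MM")
    return math.log2(diff)


# Hypothetical intensities: red is twice green, so the log ratio is 1.
print(log_ratio(200.0, 100.0))
print(affy_signal([300.0, 500.0], [100.0, 200.0]))
```

With a log ratio, equal expression maps to 0 and a two-fold change in either direction maps to +1 or -1, which is why the matrix entries are centered around zero.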
Some possible applications • Sample from specific organ to show which genes are expressed and responsible for a functionality • Compare samples from healthy and sick host to find gene-disease connection • Analyze samples to differentiate sick and healthy, disease subtypes, drug response groups • Probe samples, including human pathogens, for disease detection
Huge amount of data from single microarray • If just two color, then amount of data on array with N probes is 2N • Cannot analyze pixel by pixel • Analyze by pattern – cluster analysis
Major Data Mining Techniques • Link Analysis • Associations Discovery • Sequential Pattern Discovery • Similar Time Series Discovery • Predictive Modeling • Classification (assigns genes into known classes) • Clustering (groups genes into unknown clusters)
Supervised vs. Unsupervised Learning • Supervised: there is a teacher, class labels are known • Support vector machines • Backpropagation neural networks • Unsupervised: No teacher, class labels are unknown • Clustering • Self-organizing maps
Cluster Analysis: Grouping Similarly Expressed Genes, Cell Samples, or Both • Strengthens signal when averages are taken within clusters of genes (Eisen) • Useful (essential?) when seeking new subclasses of cells, diseases, drug responses etc. • Leads to readily interpreted figures
Some clustering methods and software • Partitioning: K-Means, K-Medoids, PAM, CLARA, … • Hierarchical: Cluster, HAC, BIRCH, CURE, ROCK • Density-based: CAST, DBSCAN, OPTICS, CLIQUE, … • Grid-based: STING, CLIQUE, WaveCluster, … • Model-based: SOM (self-organizing map), COBWEB, CLASSIT, AutoClass, … • Two-way clustering • Block clustering
Expression Vectors • Gene expression vectors encapsulate the expression of a gene over a set of experimental conditions or sample types. [Figure: the same vector, 1.5 -0.8 1.8 0.5 -0.4 -1.3 1.5 0.8, shown as a numeric vector, as a line graph, and as a heat map with a scale from -2 to 2]
Expression Vectors As Points in ‘Expression Space’

Gene    t1     t2     t3
G1     -0.8   -0.3   -0.7
G2     -0.7   -0.8   -0.4
G3     -0.4   -0.6   -0.8
G4      0.9    1.2    1.3
G5      1.3    0.9   -0.6

G1, G2, and G3 show similar expression. [Figure: the genes plotted as points on three axes, Experiment 1, Experiment 2, and Experiment 3]
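To make the idea of "points in expression space" concrete, the sketch below treats each gene's three values as coordinates and finds every gene's nearest neighbour by straight-line distance. The values are as read from the slide; the helper name `nearest` is hypothetical, and `math.dist` needs Python 3.8 or later.

```python
import math

# Expression vectors as read from the slide (rows G1..G5, columns t1..t3).
vectors = {
    "G1": (-0.8, -0.3, -0.7),
    "G2": (-0.7, -0.8, -0.4),
    "G3": (-0.4, -0.6, -0.8),
    "G4": (0.9, 1.2, 1.3),
    "G5": (1.3, 0.9, -0.6),
}


def nearest(name):
    """Return the gene whose vector is closest to the named gene's vector."""
    others = (g for g in vectors if g != name)
    return min(others, key=lambda g: math.dist(vectors[name], vectors[g]))


for gene in vectors:
    print(gene, "->", nearest(gene))
```

G1, G2, and G3 pick each other as neighbours, which matches the "similar expression" annotation on the slide.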
Cluster Analysis • Group a collection of objects into subsets or “clusters” such that objects within a cluster are more closely related to one another than to objects assigned to different clusters.
How can we do this? • What is closely related? • Distance or similarity metric • What is close? • Clustering algorithm • How do we minimize distance between objects in a group while maximizing distances between groups?
Distance Metrics • Euclidean distance measures the straight-line distance between two points • Manhattan (city block) distance sums the absolute differences in each dimension • Correlation distance measures difference with respect to linear trends [Figure: two points, (3.5, 4) and (5.5, 6), plotted on axes Gene Expression 1 and Gene Expression 2]
Clustering Time Series Data • Measure gene expression on consecutive days • Gene Measurement matrix • G1= [1.2 4.0 5.0 1.0] • G2= [2.0 2.5 5.5 6.0] • G3= [4.5 3.0 2.5 1.0] • G4= [3.5 1.5 1.2 1.5]
Euclidean Distance • Distance is the square root of the sum of the squared differences between coordinates
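As a sketch of this definition, the code below computes the Euclidean distance for every pair of the four time series G1 to G4 from the gene measurement matrix. The dictionary layout and the function name `euclidean` are illustrative choices, not part of the lecture.

```python
import math
from itertools import combinations

# The four gene time series from the measurement matrix.
genes = {
    "G1": [1.2, 4.0, 5.0, 1.0],
    "G2": [2.0, 2.5, 5.5, 6.0],
    "G3": [4.5, 3.0, 2.5, 1.0],
    "G4": [3.5, 1.5, 1.2, 1.5],
}


def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


for g, h in combinations(genes, 2):
    print(g, h, round(euclidean(genes[g], genes[h]), 3))
```

Under this metric G3 and G4 are the closest pair, mainly because both series fall over time and end at similar levels.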
City Block or Manhattan Distance • G1= [1.2 4.0 5.0 1.0] • G2= [2.0 2.5 5.5 6.0] • G3= [4.5 3.0 2.5 1.0] • G4= [3.5 1.5 1.2 1.5] • Distance is the sum of the absolute differences between coordinates
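The same four series can be compared with the city block distance. This is a minimal sketch; the function name `manhattan` and the dictionary layout are illustrative.

```python
from itertools import combinations

# The four gene time series listed on the slide.
genes = {
    "G1": [1.2, 4.0, 5.0, 1.0],
    "G2": [2.0, 2.5, 5.5, 6.0],
    "G3": [4.5, 3.0, 2.5, 1.0],
    "G4": [3.5, 1.5, 1.2, 1.5],
}


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


for g, h in combinations(genes, 2):
    print(g, h, round(manhattan(genes[g], genes[h]), 3))
```

For G1 and G2 the coordinate differences are 0.8, 1.5, 0.5, and 5.0, so the distance is 7.8; a single large difference counts linearly here, whereas it is squared in the Euclidean distance.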
Correlation Distance • Pearson correlation measures the degree of linear relationship between variables, [-1,1] • Distance is 1-(pearson correlation), range of [0,2]
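A minimal sketch of the correlation distance, written out from the standard Pearson formula rather than taken from the lecture. The function names are illustrative, and the code assumes neither vector is constant (a constant vector has zero variance, which would cause a division by zero).

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_distance(x, y):
    """1 - Pearson correlation, which lies in [0, 2]."""
    return 1.0 - pearson(x, y)


g3 = [4.5, 3.0, 2.5, 1.0]
g4 = [3.5, 1.5, 1.2, 1.5]
rescaled = [2.0 * v + 1.0 for v in g3]   # same trend, different scale
mirrored = [-v for v in g3]              # opposite trend

print(round(correlation_distance(g3, g4), 3))
print(round(correlation_distance(g3, rescaled), 3))
print(round(correlation_distance(g3, mirrored), 3))
```

Unlike the Euclidean and Manhattan distances, this measure ignores scale and offset: a profile and a rescaled copy of it are at distance 0, while a profile and its mirror image are at the maximum distance of 2.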
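Similarity Measurements • Pearson Correlation • Two profiles (vectors) $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$:
$$r(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$
• $-1 \le r(x, y) \le +1$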
Hierarchical Clustering • IDEA: Iteratively combine genes into groups based on similar patterns of observed expression • By combining genes with genes, or genes with groups, the algorithm produces a dendrogram of the hierarchy of relationships • Display the data as a heat map and dendrogram • Cluster genes, samples, or both
Hierarchical Clustering [Figure: the same clustered data shown as a Venn diagram and as a dendrogram]
Hierarchical clustering • Merging (agglomerative): start with every measurement as a separate cluster then combine • Splitting: make one large cluster, then split up into smaller pieces • What is the distance between two clusters?
Distance between clusters • Single link: distance is the shortest distance from any member of one cluster to any member of the other cluster • Complete link: distance is the longest distance from any member of one cluster to any member of the other cluster • Average link: distance is the mean of the pairwise distances between members of the two clusters (using the distance between the cluster averages instead gives centroid linkage) • Ward: merges the two clusters whose union gives the smallest increase in the within-cluster sum of squares
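The linkage rules above can be sketched as small functions over two lists of points. Several choices here are assumptions rather than the lecture's own definitions: `average_link` takes the mean of all pairwise distances (the usual meaning of average linkage), `ward_cost` uses the standard formula for the increase in within-cluster sum of squares caused by a merge, and the example clusters are made up. `math.dist` needs Python 3.8 or later.

```python
import math


def single_link(a, b):
    """Shortest distance between any member of a and any member of b."""
    return min(math.dist(p, q) for p in a for q in b)


def complete_link(a, b):
    """Longest distance between any member of a and any member of b."""
    return max(math.dist(p, q) for p in a for q in b)


def average_link(a, b):
    """Mean of all pairwise distances between members of a and b."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if a and b are merged."""
    gap = math.dist(centroid(a), centroid(b))
    return len(a) * len(b) / (len(a) + len(b)) * gap ** 2


# Two hypothetical clusters on a line, chosen so the results are easy to check.
left = [(0.0, 0.0), (1.0, 0.0)]
right = [(3.0, 0.0), (5.0, 0.0)]
print(single_link(left, right), complete_link(left, right))
print(average_link(left, right), ward_cost(left, right))
```

For these clusters the single-link distance is 2 (from 1 to 3), the complete-link distance is 5 (from 0 to 5), and the average of the four pairwise distances 3, 5, 2, and 4 is 3.5.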
Hierarchical Clustering-Merging • Euclidean distance • Average linking [Figure: dendrogram of the gene expression time series; the scale shows the distance between clusters when combined]
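A minimal sketch of merging (agglomerative) clustering with Euclidean distance and average linking, applied to the four time series G1 to G4 from the earlier slides. The function names and the brute-force search over cluster pairs are illustrative choices for a tiny data set, not the method of any particular package; `math.dist` needs Python 3.8 or later.

```python
import math
from itertools import combinations

# The four gene time series from the earlier slides.
genes = {
    "G1": [1.2, 4.0, 5.0, 1.0],
    "G2": [2.0, 2.5, 5.5, 6.0],
    "G3": [4.5, 3.0, 2.5, 1.0],
    "G4": [3.5, 1.5, 1.2, 1.5],
}


def average_link(a, b, data):
    """Mean Euclidean distance over all pairs drawn from clusters a and b."""
    total = sum(math.dist(data[i], data[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerate(data):
    """Merge the closest pair of clusters until one remains.

    Returns the merges in order as (cluster_a, cluster_b, distance).
    """
    clusters = [(name,) for name in data]   # every gene starts alone
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_link(pair[0], pair[1], data))
        merges.append((a, b, average_link(a, b, data)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


merges = agglomerate(genes)
for a, b, distance in merges:
    print(a, "+", b, "at", round(distance, 3))
```

The recorded merge distances are the heights at which branches join in a dendrogram: G3 and G4 combine first, G1 joins them next, and G2 is added last.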