
Biology and Cells

LSM3241: Bioinformatics and Biocomputing Lecture 8: Gene Expression Profiles and Microarray Data Analysis Prof. Chen Yu Zong Tel: 6874-6877 Email: yzchen@cz3.nus.edu.sg http://xin.cz3.nus.edu.sg Room 07-24, level 7, SOC1, NUS. Biology and Cells.


Presentation Transcript


  1. LSM3241: Bioinformatics and Biocomputing • Lecture 8: Gene Expression Profiles and Microarray Data Analysis • Prof. Chen Yu Zong • Tel: 6874-6877 • Email: yzchen@cz3.nus.edu.sg • http://xin.cz3.nus.edu.sg • Room 07-24, level 7, SOC1, NUS

  2. Biology and Cells • All living organisms consist of cells (trillions of cells in a human; yeast has one cell). • Cells are of many different types (blood, skin, nerve), but all arose from a single cell (the fertilized egg). • Each* cell contains a complete copy of the genome (the program for making the organism), encoded in DNA.

  3. Gene Expression • Cells are different because of differential gene expression. • About 40% of human genes are expressed at one time. • A gene is expressed by transcribing its DNA into single-stranded mRNA. • The mRNA is later translated into a protein. • Microarrays measure the level of mRNA expression.

  4. Overview of Molecular Biology • [Figure labels: Nucleus; Cell; Chromosome; Protein; Gene (DNA); Gene (mRNA), single strand; cDNA]

  5. Gene Expression • Genes control cell behavior by controlling which proteins are made by a cell • House keeping genes vs. cell/tissue specific genes • Regulation: • Transcriptional (promoters and enhancers) • Post Transcriptional (RNA splicing, stability, localization -small non coding RNAs)

  6. Gene Expression • Regulation: • Translational (3’UTR repressors, poly A tail) • Post Transcriptional (RNA splicing, stability, localization - small non coding RNAs) • Post Translational (Protein modification: carbohydrates, lipids, phosphorylation, hydroxylation, methylation, precursor protein)

  7. Gene Expression Measurement • mRNA expression represents dynamic aspects of cell • mRNA expression can be measured by latest technology • mRNA is isolated and labeled with fluorescent protein • mRNA is hybridized to the target; level of hybridization corresponds to light emission which is measured with a laser

  8. Traditional Methods • Northern Blotting • Single RNA isolated • Probed with labeled cDNA • RT-PCR • Primers amplify specific cDNA transcripts

  9. Microarray Technology • Microarray: • New Technology (first paper: 1995) • Allows study of thousands of genes at same time • Glass slide of DNA molecules • Molecule: string of bases (25 bp – 500 bp) • uniquely identifies gene or unit to be studied

  10. Gene Expression Microarrays The main types of gene expression microarrays: • Short oligonucleotide arrays (Affymetrix) • cDNA or spotted arrays (Brown/Botstein). • Long oligonucleotide arrays (Agilent Inkjet); • Fiber-optic arrays • ...

  11. Fabrications of Microarrays • Size of a microscope slide Images: http://www.affymetrix.com/

  12. Differing Conditions • Ultimate Goal: • Understand expression level of genes under different conditions • Helps to: • Determine genes involved in a disease • Pathways to a disease • Used as a screening tool

  13. Gene Conditions • Cell types (brain vs. liver) • Developmental (fetal vs. adult) • Response to stimulus • Gene activity (wild vs. mutant) • Disease states (healthy vs. diseased)

  14. Expressed Genes • Genes under a given condition • mRNA extracted from cells • mRNA labeled • Labeled mRNA is mRNA present in a given condition • Labeled mRNA will hybridize (base pair) with corresponding sequence on slide

  15. Two Different Types of Microarrays • Custom spotted arrays (up to 20,000 sequences) • cDNA • Oligonucleotide • High-density (up to 100,000 sequences) synthetic oligonucleotide arrays • Affymetrix (25 bases) • SHOW AFFYMETRIX LAYOUT

  16. Custom Arrays • Mostly cDNA arrays • 2-dye (2-channel) • RNA from two sources (cDNA created) • Source 1: labeled with red dye • Source 2: labeled with green dye

  17. Two Channel Microarrays • Microarrays measure gene expression • Two different samples: • Control (green label) • Sample (red label) • Both are washed over the microarray • Hybridization occurs • Each spot is one of 4 colors

  18. Microarray Technology

  19. Microarray Image Analysis • Each spot on a two-channel microarray shows one of 4 colors: • Green: high in control • Red: high in sample • Yellow: equal in both • Black: none in either • The problem is to quantify the image signals
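
The four-color reading of a spot on slide 19 can be sketched as a simple rule on the two channel intensities. The function name `spot_color`, the background cutoff of 100 and the 2-fold threshold are illustrative assumptions, not values from the lecture; real image-analysis software also handles segmentation and background subtraction.

```python
def spot_color(red, green, background=100.0, fold=2.0):
    """Classify one spot from its red (sample) and green (control) intensity.

    background and fold are hypothetical thresholds chosen for illustration.
    """
    if red < background and green < background:
        return "black"   # no signal in either channel
    if red >= fold * green:
        return "red"     # higher in the sample
    if green >= fold * red:
        return "green"   # higher in the control
    return "yellow"      # roughly equal in both channels


# Hypothetical (red, green) intensities for four spots.
spots = [(5000.0, 400.0), (300.0, 4000.0), (2000.0, 2200.0), (20.0, 30.0)]
colors = [spot_color(r, g) for r, g in spots]
```

A hard threshold like this throws away the magnitude of the difference, which is one reason the later slides work with log ratios rather than color classes.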

  20. Single Color Microarrays • Prefabricated • Affymetrix (25mers) • Custom • cDNA (500 bases or so) • Spotted oligos (70-80 bases)

  21. Microarray Animations • Davidson College: • http://www.bio.davidson.edu/courses/genomics/chip/chip.html • Imagecyte: • http://www.imagecyte.com/array2.html

  22. Basic idea of Microarray • Construction • Place array of probes on microchip • Probe (for example) is oligonucleotide ~25 bases long that characterizes gene or genome • Each probe has many, many clones • Chip is about 2cm by 2cm • Application principle • Put (liquid) sample containing genes on microarray and allow probe and gene sequences to hybridize and wash away the rest • Analyze hybridization pattern

  23. Microarray analysis • Operation principle: samples are tagged with fluorescent material to show the pattern of sample-probe interaction (hybridization) • A microarray may have 60K probes

  24. Microarray Processing sequence

  25. Gene Expression Data • Gene expression data on p genes for n samples (rows: genes; columns: mRNA samples)
      Gene  sample1  sample2  sample3  sample4  sample5  …
      1       0.46     0.30     0.80     1.51     0.90   ...
      2      -0.10     0.49     0.24     0.06     0.46   ...
      3       0.15     0.74     0.04     0.10     0.20   ...
      4      -0.45    -1.03    -0.79    -0.56    -0.32   ...
      5      -0.06     1.06     1.35     1.09    -1.09   ...
      • Entry (i, j) is the gene expression level of gene i in mRNA sample j: Log(Red intensity / Green intensity) for two-channel arrays, or Log(Avg. PM - Avg. MM) for arrays with perfect-match (PM) and mismatch (MM) probes
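
As a rough illustration of how a two-channel entry of this matrix is obtained, the sketch below computes a log ratio from a red and a green intensity. The slide does not state the base of the logarithm; base 2 is assumed here, and the function name and intensities are made up for the example.

```python
import math


def log_ratio(red, green):
    """Return log2(red / green) for one spot (base 2 is an assumption)."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive to take a log ratio")
    return math.log2(red / green)


# Hypothetical (red, green) intensities: up 2-fold, down 2-fold, unchanged.
spots = [(2000.0, 1000.0), (500.0, 1000.0), (800.0, 800.0)]
ratios = [log_ratio(r, g) for r, g in spots]
```

With a base-2 log, a doubling and a halving give values of the same size and opposite sign, which is why the matrix entries are centered around zero.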

  26. Some possible applications • Sample from specific organ to show which genes are expressed and responsible for a functionality • Compare samples from healthy and sick host to find gene-disease connection • Analyze samples to differentiate sick and healthy, disease subtypes, drug response groups • Probe samples, including human pathogens, for disease detection

  27. Huge amount of data from single microarray • If just two color, then amount of data on array with N probes is 2N • Cannot analyze pixel by pixel • Analyze by pattern – cluster analysis

  28. Major Data Mining Techniques • Link Analysis • Associations Discovery • Sequential Pattern Discovery • Similar Time Series Discovery • Predictive Modeling • Classification (assigns genes into known classes) • Clustering (groups genes into unknown clusters)

  29. Supervised vs. Unsupervised Learning • Supervised: there is a teacher, class labels are known • Support vector machines • Backpropagation neural networks • Unsupervised: No teacher, class labels are unknown • Clustering • Self-organizing maps

  30. Cluster Analysis: Grouping Similarly Expressed Genes, Cell Samples, or Both • Strengthens signal when averages are taken within clusters of genes (Eisen) • Useful (essential?) when seeking new subclasses of cells, diseases, drug responses etc. • Leads to readily interpreted figures

  31. Some clustering methods and software • Partitioning: K-Means, K-Medoids, PAM, CLARA … • Hierarchical: Cluster, HAC, BIRCH, CURE, ROCK • Density-based: CAST, DBSCAN, OPTICS, CLIQUE … • Grid-based: STING, CLIQUE, WaveCluster … • Model-based: SOM (self-organized map), COBWEB, CLASSIT, AutoClass … • Two-way Clustering • Block clustering

  32. Partitioning

  33. Density-based clustering

  34. Hierarchical (used most often)

  35. Gene Expression Data • Gene expression data on p genes for n samples (rows: genes; columns: mRNA samples)
      Gene  sample1  sample2  sample3  sample4  sample5  …
      1       0.46     0.30     0.80     1.51     0.90   ...
      2      -0.10     0.49     0.24     0.06     0.46   ...
      3       0.15     0.74     0.04     0.10     0.20   ...
      4      -0.45    -1.03    -0.79    -0.56    -0.32   ...
      5      -0.06     1.06     1.35     1.09    -1.09   ...
      • Entry (i, j) is the gene expression level of gene i in mRNA sample j: Log(Red intensity / Green intensity) for two-channel arrays, or Log(Avg. PM - Avg. MM) for arrays with perfect-match (PM) and mismatch (MM) probes

  36. Expression Vectors • Gene expression vectors encapsulate the expression of a gene over a set of experimental conditions or sample types. • Example numeric vector: [1.5 -0.8 1.8 0.5 -0.4 -1.3 1.5 0.8] • [Figure: the vector shown as a numeric vector, a line graph, and a heat map with a color scale from -2 to 2]

  37. Expression Vectors As Points in ‘Expression Space’
      Gene    t1    t2    t3
      G1    -0.8  -0.3  -0.7
      G2    -0.7  -0.8  -0.4
      G3    -0.4  -0.6  -0.8
      G4     0.9   1.2   1.3
      G5     1.3   0.9  -0.6
      [Figure: genes plotted as points on axes Experiment 1, Experiment 2 and Experiment 3, with a "Similar Expression" annotation]

  38. Cluster Analysis • Group a collection of objects into subsets or “clusters” such that objects within a cluster are more closely related to one another than to objects assigned to different clusters.

  39. How can we do this? • What is closely related? • Distance or similarity metric • What is close? • Clustering algorithm • How do we minimize distance between objects in a group while maximizing distances between groups?
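
One possible answer to the questions on slide 39 is a partitioning method such as K-Means (listed on slide 31): pick a distance, then alternate between assigning each gene to its nearest cluster center and recomputing the centers. The sketch below applies this to the five expression vectors from slide 37. The function name, the choice of G1 and G4 as starting centers and the fixed iteration count are assumptions made for illustration.

```python
import math

# Expression vectors (t1, t2, t3) from slide 37.
profiles = {
    "G1": (-0.8, -0.3, -0.7),
    "G2": (-0.7, -0.8, -0.4),
    "G3": (-0.4, -0.6, -0.8),
    "G4": (0.9, 1.2, 1.3),
    "G5": (1.3, 0.9, -0.6),
}


def kmeans(data, centroids, iterations=10):
    """Plain K-Means with Euclidean distance and caller-supplied start centers."""
    assignment = {}
    for _ in range(iterations):
        # Assignment step: each gene goes to its nearest center.
        for name, point in data.items():
            assignment[name] = min(
                range(len(centroids)),
                key=lambda k: math.dist(point, centroids[k]),
            )
        # Update step: each center becomes the mean of its members.
        updated = []
        for k, old in enumerate(centroids):
            members = [data[n] for n, a in assignment.items() if a == k]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's center unchanged
        centroids = updated
    return assignment, centroids


assignment, centroids = kmeans(profiles, [profiles["G1"], profiles["G4"]])
```

With these starting centers, G1, G2 and G3 end up in one cluster and G4 and G5 in the other. K-Means is sensitive to the starting centers and to the chosen number of clusters, so different choices can give different partitions.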

  40. Distance Metrics • Euclidean Distance measures average distance • Manhattan (City Block) measures average in each dimension • Correlation measures difference with respect to linear trends • [Figure: points (3.5, 4) and (5.5, 6) plotted on axes Gene Expression 1 and Gene Expression 2]

  41. Clustering Time Series Data • Measure gene expression on consecutive days • Gene Measurement matrix • G1= [1.2 4.0 5.0 1.0] • G2= [2.0 2.5 5.5 6.0] • G3= [4.5 3.0 2.5 1.0] • G4= [3.5 1.5 1.2 1.5]

  42. Euclidean Distance • Distance is the square root of the sum of the squared differences between coordinates
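
A minimal sketch of the Euclidean distance, applied to the four gene time series from slide 41. The variable and function names are chosen for this example only.

```python
import math
from itertools import combinations

# Gene measurement matrix from slide 41 (one value per day).
genes = {
    "G1": [1.2, 4.0, 5.0, 1.0],
    "G2": [2.0, 2.5, 5.5, 6.0],
    "G3": [4.5, 3.0, 2.5, 1.0],
    "G4": [3.5, 1.5, 1.2, 1.5],
}


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Distance for every pair of genes, keyed by (gene, gene).
euclidean_distances = {
    (g, h): euclidean(genes[g], genes[h]) for g, h in combinations(genes, 2)
}
closest_pair = min(euclidean_distances, key=euclidean_distances.get)
```

On this data the closest pair is G3 and G4, at a distance of about 2.28.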

  43. City Block or Manhattan Distance • G1= [1.2 4.0 5.0 1.0] • G2= [2.0 2.5 5.5 6.0] • G3= [4.5 3.0 2.5 1.0] • G4= [3.5 1.5 1.2 1.5] • Distance is the sum of the absolute differences between coordinates
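
The same four genes can be compared with the Manhattan distance. This is a small sketch with names chosen for the example.

```python
from itertools import combinations

# Gene measurement matrix from slide 43.
genes = {
    "G1": [1.2, 4.0, 5.0, 1.0],
    "G2": [2.0, 2.5, 5.5, 6.0],
    "G3": [4.5, 3.0, 2.5, 1.0],
    "G4": [3.5, 1.5, 1.2, 1.5],
}


def manhattan(a, b):
    """Sum of absolute coordinate differences (city block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


manhattan_distances = {
    (g, h): manhattan(genes[g], genes[h]) for g, h in combinations(genes, 2)
}
closest_pair = min(manhattan_distances, key=manhattan_distances.get)
```

For example, G1 and G2 differ by 0.8 + 1.5 + 0.5 + 5.0 = 7.8, and the closest pair is again G3 and G4, at 4.3.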

  44. Correlation Distance • Pearson correlation measures the degree of linear relationship between variables, [-1,1] • Distance is 1-(pearson correlation), range of [0,2]
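
The correlation distance can be sketched directly from the definition on slide 44. The function names are illustrative, and G1 and G3 are taken from slide 41.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length profiles, in [-1, 1].

    Raises ZeroDivisionError if either profile is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_distance(x, y):
    """1 - Pearson correlation, in [0, 2]."""
    return 1.0 - pearson(x, y)


same_trend = correlation_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])      # close to 0
opposite_trend = correlation_distance([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])  # close to 2
g1_g3 = correlation_distance([1.2, 4.0, 5.0, 1.0], [4.5, 3.0, 2.5, 1.0])
```

Unlike the Euclidean and Manhattan distances, this measure ignores the absolute expression level: two genes whose profiles rise and fall together are close even if one is expressed far more strongly.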

  45. Similarity Measurements • Pearson Correlation between two profiles (vectors) • +1 ≥ Pearson Correlation ≥ -1

  46. Hierarchical Clustering • IDEA: Iteratively combines genes into groups based on similar patterns of observed expression • By combining genes with genes OR genes with groups algorithm produces a dendrogram of the hierarchy of relationships. • Display the data as a heat map and dendrogram • Cluster genes, samples or both (HCL-1)

  47. Hierarchical Clustering • [Figure panels: Venn diagram of clustered data; dendrogram]

  48. Hierarchical clustering • Merging (agglomerative): start with every measurement as a separate cluster then combine • Splitting: make one large cluster, then split up into smaller pieces • What is the distance between two clusters?

  49. Distance between clusters • Single-link: distance is the shortest distance from any member of one cluster to any member of the other cluster • Complete link: distance is the longest distance from any member of one cluster to any member of the other cluster • Average: distance is the average of the distances between all pairs of members, one from each cluster • Ward: merge the two clusters whose union gives the smallest increase in the within-cluster sum of squares
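
The first three linkage rules on slide 49 can be sketched as functions over two clusters of gene names. Average linkage is taken here as the mean of all pairwise distances (the usual UPGMA definition), which is an assumption of this sketch. The two example clusters are hypothetical groupings of the slide 41 genes, and Ward's method is left out for brevity.

```python
import math

# Gene measurement matrix from slide 41.
genes = {
    "G1": [1.2, 4.0, 5.0, 1.0],
    "G2": [2.0, 2.5, 5.5, 6.0],
    "G3": [4.5, 3.0, 2.5, 1.0],
    "G4": [3.5, 1.5, 1.2, 1.5],
}


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise(cluster_a, cluster_b):
    """Distances between every member of one cluster and every member of the other."""
    return [euclidean(genes[g], genes[h]) for g in cluster_a for h in cluster_b]


def single_link(cluster_a, cluster_b):
    return min(pairwise(cluster_a, cluster_b))


def complete_link(cluster_a, cluster_b):
    return max(pairwise(cluster_a, cluster_b))


def average_link(cluster_a, cluster_b):
    distances = pairwise(cluster_a, cluster_b)
    return sum(distances) / len(distances)


# Hypothetical clusters used only to exercise the three rules.
left, right = ("G1", "G2"), ("G3", "G4")
single = single_link(left, right)
complete = complete_link(left, right)
average = average_link(left, right)
```

For any pair of clusters the single-link value is the smallest and the complete-link value the largest, with the average in between; the choice changes which clusters look closest and so changes the shape of the dendrogram.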

  50. Hierarchical Clustering-Merging • Euclidean distance • Average linking • [Figure: dendrogram of gene expression time series; the axis shows the distance between clusters when combined]
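
A naive sketch of merging (agglomerative) clustering with Euclidean distance and average linking, run on the four time series from slide 41. It is written for readability rather than speed, and the function names and the pairwise-mean form of average linkage are assumptions of this example. Each recorded merge height corresponds to the "distance between clusters when combined".

```python
import math
from itertools import combinations

# Gene measurement matrix from slide 41.
genes = {
    "G1": [1.2, 4.0, 5.0, 1.0],
    "G2": [2.0, 2.5, 5.5, 6.0],
    "G3": [4.5, 3.0, 2.5, 1.0],
    "G4": [3.5, 1.5, 1.2, 1.5],
}


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_link(cluster_a, cluster_b):
    """Mean Euclidean distance over all cross-cluster pairs of genes."""
    distances = [euclidean(genes[g], genes[h]) for g in cluster_a for h in cluster_b]
    return sum(distances) / len(distances)


def agglomerate(names):
    """Start with one cluster per gene and repeatedly merge the closest pair."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: average_link(*pair))
        merges.append((a, b, average_link(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


merges = agglomerate(sorted(genes))
heights = [height for _, _, height in merges]
```

On this data G3 and G4 merge first, G1 then joins them, and G2 joins last. Reading the merges in order, together with their heights, gives the dendrogram.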
