This research seminar discusses the process of analyzing microarray data, including data acquisition, normalization, clustering, and pattern discovery. Topics covered include gene expression profiling, classification, and finding biologically related gene groups.
EECS 800 Research Seminar Mining Biological Data Instructor: Luke Huan Fall, 2006
Administrative • The book “Elements of Statistical Learning” is on reserve in the engineering library. • The other two books (Data Mining, Bioinformatics) have been recalled. • Presentation paper selection is due this Friday; three topics remain: • Data Mining in Systems Biology • Data Mining in Proteomics • Analyzing Bionetworks
Administrative • Paper presenter • Always keep in mind the following four “w” questions in your presentation • What is the problem • Why is the problem important (“who cares”) • What is the related work (“why bother”) • What are the impacts of the presented work (“so what”) • Define and explain your computational task • Give intuitions before discussing details • Present the pros and cons of the methods • Audience: • Ask at least one question
Outline • What is a Microarray? • Terms from molecular biology: this is NOT Bio101 • Goals • Raw data collection • Raw data analysis • Frequent pattern discovery in Microarray data analysis
Microarray • Microarrays are currently used to do many different things: • to detect and measure gene expression at the mRNA level • to find mutations and to genotype; • to sequence DNA; • to locate chromosomal changes and more. • There are many different types of microarrays • cDNA chips • Affymetrix chips
Goals of a Microarray Experiment • Find the genes that change expression between experimental and control samples • Classify samples based on a gene expression profile • Find patterns: Groups of biologically related genes that change expression together across samples/treatments
Microarray Procedure • In general there are two basic aspects of microarrays: • Data acquisition • Producing chips • preparing samples for detection; • hybridization; • scanning; • Data analysis • Low level analysis: normalization and significance test • High level analysis: clustering, classification, and pattern discovery • We are interested in the data analysis section. However, it is dangerous to go into data analysis without knowing how the data are collected
Microarrays are Popular • PubMed search "microarray"= 13,948 papers • 2005 = 4406 • 2004 = 3509 • 2003 = 2421 • 2002 = 1557 • 2001 = 834 • 2000 = 294
Necessary Background • Gene • Central Dogma • DNA • RNA • Nucleic acid hybridization
Genes • The human genome contains 23 pairs of chromosomes. • In each pair, one chromosome is paternally inherited, the other maternally inherited. • Chromosomes are made of compressed and entwined DNA. • A (protein-coding) gene is a segment of chromosomal DNA that directs the synthesis of a protein.
Central Dogma • The expression of the genetic information stored in the DNA molecule occurs in two stages: (i) transcription, during which DNA is transcribed into mRNA; (ii) translation, during which mRNA is translated to produce a protein. DNA → mRNA → protein
DNA • A deoxyribonucleic acid or DNA molecule is a double-stranded polymer composed of four basic molecular units called nucleotides. • There are four types of nucleotides: • adenine (A), • guanine (G), • cytosine (C), and • thymine (T). • Base-pairing occurs according to the following rule: G pairs with C, and A pairs with T.
RNA • A ribonucleic acid or RNA molecule is a nucleic acid similar to DNA, but • It is single-stranded; • uracil (U) replaces thymine (T) as one of the bases. • RNA plays an important role in protein synthesis and other chemical activities of the cell. • Several classes of RNA molecules • messenger RNA (mRNA), • transfer RNA (tRNA), • ribosomal RNA (rRNA), • and other small RNAs.
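The base-pairing rule and the T-to-U substitution from the DNA and RNA slides can be sketched in a few lines of Python. This is only an illustration: the function names and the example sequence are made up here, and transcription is simplified to "the mRNA carries the coding-strand sequence with U in place of T".

```python
# Watson-Crick pairing rule from the DNA slide: G pairs with C, A pairs with T.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement_strand(dna: str) -> str:
    """Return the base-paired partner of each nucleotide, in the same order."""
    return "".join(DNA_COMPLEMENT[base] for base in dna.upper())


def transcribe(coding_strand: str) -> str:
    """Simplified transcription: same sequence as the coding strand, U for T."""
    return coding_strand.upper().replace("T", "U")


coding = "ATGGCCTTA"  # hypothetical sequence, for illustration only
print(complement_strand(coding))  # TACCGGAAT
print(transcribe(coding))         # AUGGCCUUA
```

Nucleic acid hybridization on a chip relies on the same pairing rule: a labeled target binds the spot whose probe sequence is its complement.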
DNA Chip Microarrays • Put a large number (~100K) of DNA sequences or synthetic DNA oligomers onto a glass slide in known locations on a grid. • Measure amounts of RNA bound to each square in the grid • Make comparisons • Cancerous vs. normal tissue • Treated vs. untreated • Time course • Many applications in both basic and clinical research
Spot your own Chip • Robot spotter • Ordinary glass microscope slide
Affymetrix “Gene chip” system • Commercial product • Uses 25 base oligos synthesized in place on a chip (20 pairs of oligos for each gene) • RNA labeled and scanned in a single “color” • Currently mass produced arrays targeting 17 different organisms • More than 40 different array types/sets • Proprietary system: “black box” software, can only use their chips
Data Acquisition in Microarray • Scan the arrays • Quantitate each spot • Subtract background • Normalize • Export a table of fluorescent intensities for each gene in the array
Data Acquisition • [Figure: cDNA microarray workflow. cDNA clones (probes): PCR product amplification, purification, printing of the microarray. mRNA (target): hybridise target to microarray. Scanning: laser 1, laser 2, emission. Overlay images and normalize, then analysis.]
Streamlined Array Analysis • [Figure: analysis pipeline. Raw data (RMA) → Normalize → Filter (Present/Absent, Minimum value, Fold change) → Significance (t-test, SAM, Rank Product), Clustering (Hierarchical CL, Biclustering), Classification (PAM, Machine learning) → Gene lists → Function (Gene Ontology)]
Lower Level Data Analysis • Normalization: • when you have variability in measurements, you need replication and statistics to find real differences • Significance test: • It’s not just the genes with 2 fold increase, but those with a significant p-value across replicates
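A small Python sketch of why fold change alone is not enough. The replicate values are invented for illustration, and Welch's t statistic is used here as one common choice; the slides do not say which variant of the t-test was meant.

```python
from math import sqrt
from statistics import mean, variance


def fold_change(treated, control):
    """Ratio of mean treated intensity to mean control intensity."""
    return mean(treated) / mean(control)


def welch_t(treated, control):
    """Welch's t statistic: mean difference divided by its standard error."""
    standard_error = sqrt(
        variance(treated) / len(treated) + variance(control) / len(control)
    )
    return (mean(treated) - mean(control)) / standard_error


# Hypothetical replicate intensities for two genes.
noisy_gene = {"control": [100, 400, 250], "treated": [900, 200, 400]}
steady_gene = {"control": [100, 102, 98], "treated": [130, 133, 127]}

for name, gene in (("noisy", noisy_gene), ("steady", steady_gene)):
    fc = fold_change(gene["treated"], gene["control"])
    t = welch_t(gene["treated"], gene["control"])
    print(f"{name}: fold change = {fc:.2f}, t = {t:.2f}")
```

The noisy gene shows a 2-fold change but a small t statistic, while the steady gene changes only 1.3-fold yet is far more consistent across replicates.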
Sources of Variability in Raw Data • Biological variability • Sample preparation • Probe labeling • RNA extraction • Experimental condition • temperature, time, mixing, etc. • Scanning • laser and detector, chemistry of the fluorescent label • Image analysis • identifying and quantifying each spot on the array
Self-self hybridizations • [Figure: false colour overlay]
Variability • [Figure: scatter plot of all genes in a simple comparison of two controls (A) and two treatments (B: high vs. low glucose), showing changes in expression greater than 2.2-fold and 3-fold.]
Data Normalization • Can control for many of the experimental sources of variability (systematic, not random or gene specific) • Bring each image to the same average brightness • Can use simple math or fancy: • divide by the mean (whole chip or by sectors) • LOESS (locally weighted regression) • No sure biological standards
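The "divide by the mean" option can be sketched as follows, assuming whole-chip scaling and made-up intensity values; LOESS and sector-wise scaling are not shown. The function name and the `target` parameter are illustrative choices, not part of any microarray package.

```python
from statistics import mean


def normalize_to_mean(intensities, target=1.0):
    """Scale one chip so that its mean intensity equals `target`."""
    chip_mean = mean(intensities)
    if chip_mean == 0:
        raise ValueError("cannot normalize a chip whose mean intensity is zero")
    return [target * value / chip_mean for value in intensities]


# Two hypothetical scans of the same sample; chip_b was scanned twice as bright.
chip_a = [100.0, 200.0, 300.0]
chip_b = [200.0, 400.0, 600.0]

norm_a = normalize_to_mean(chip_a)
norm_b = normalize_to_mean(chip_b)
print(norm_a)  # [0.5, 1.0, 1.5]
print(norm_b)  # [0.5, 1.0, 1.5]
```

After scaling, both chips have the same average brightness, so the remaining differences between them are no longer due to overall scan intensity.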
Significance Test • In a microarray experiment, each gene (each probe or probe set) is really a separate experiment • Yet if you treat each gene as an independent comparison, you will always find some with significant differences • (the tails of a normal distribution)
False Discovery • Statisticians call a false positive a "Type I error" or a "false discovery" • The expected number of false discoveries is roughly the p-value cutoff of the t-test × the number of genes in the array (when no gene truly differs); the False Discovery Rate (FDR) is the fraction of genes called "different" that are false discoveries • For a p-value of 0.01 × 10,000 genes = 100 false “different” genes • You cannot eliminate false positives, but by choosing a more stringent p-value, you can keep them manageable (try p=0.001) • The expected number of false discoveries must be smaller than the number of real differences that you find, which in turn depends on the size of the differences and variability of the measured expression values
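The slide's arithmetic (p-value cutoff times number of genes) can be checked with a short Python sketch. The simulation assumes that no gene truly differs, in which case each gene's p-value behaves like a uniform random number between 0 and 1; the function names and the seed are illustrative.

```python
import random


def expected_false_positives(p_cutoff, n_genes):
    """Expected count of genes passing the cutoff when none truly differ."""
    return p_cutoff * n_genes


def simulate_null_calls(p_cutoff, n_genes, seed=0):
    """Draw one uniform p-value per gene and count how many pass the cutoff."""
    rng = random.Random(seed)
    return sum(rng.random() < p_cutoff for _ in range(n_genes))


for cutoff in (0.01, 0.001):
    expected = expected_false_positives(cutoff, 10_000)
    observed = simulate_null_calls(cutoff, 10_000)
    print(f"p < {cutoff}: expected {expected:.0f}, one simulated array gave {observed}")
```

Tightening the cutoff from 0.01 to 0.001 cuts the expected false calls on a 10,000-gene array from 100 to 10.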
Higher Level Data Analysis • Computational tasks: • Clustering • Classification • Statistical validation • Data visualization • Pattern detection • Biological problems: • Discovery of common sequences in co-regulated genes • Meta-studies using data from multiple experiments • Linkage between gene expression data and gene sequence/function/metabolic pathways databases
Types of Clustering • Hierarchical • Link similar genes, build up to a tree of all • Self Organizing Maps (SOM) • Split all genes into similar sub-groups • Finds its own groups (machine learning) • Principal Component • every gene is a dimension (vector), find a single dimension that best represents the differences in the data
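The "link similar genes, build up to a tree" idea can be sketched as a tiny agglomerative clustering in pure Python. The choices here are assumptions for illustration: single linkage, Euclidean distance, and four invented expression profiles; real analyses often use correlation-based distances and average linkage instead.

```python
from itertools import combinations
from math import dist


def single_linkage(profiles):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    clusters = {name: [name] for name in profiles}  # cluster label -> member genes
    merges = []
    while len(clusters) > 1:
        # Cluster distance = distance between the closest pair of members.
        first, second = min(
            combinations(clusters, 2),
            key=lambda pair: min(
                dist(profiles[g], profiles[h])
                for g in clusters[pair[0]]
                for h in clusters[pair[1]]
            ),
        )
        members = clusters.pop(first) + clusters.pop(second)
        clusters[f"({first},{second})"] = members
        merges.append((first, second))
    return merges


# Hypothetical expression profiles across three samples.
profiles = {
    "geneA": (1.0, 2.0, 3.0),
    "geneB": (1.1, 2.1, 2.9),
    "geneC": (5.0, 1.0, 0.5),
    "geneD": (5.2, 0.9, 0.4),
}
merges = single_linkage(profiles)
for step, (left, right) in enumerate(merges, start=1):
    print(f"merge {step}: {left} + {right}")
```

The order of merges is the tree: the earliest merges join the most similar genes, and the last merge joins the two top-level branches.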
Classification • How to sort samples into two classes based on gene expression data • Cancer vs. normal • Cancer sub-types: benign vs. malignant • Responds well to drug vs. poor response
Functional Genomics • Take a list of "interesting" genes and find their biological relationships • Gene lists may come from significance/classification analysis of microarrays, proteomics, or other high-throughput methods • Requires a reference set of "biological knowledge"
GO • Biologists got together a few years ago and developed a sensible system called Gene Ontology (GO) • 3 hierarchical sets of terminology • Biological Process • Cellular Component (location within cell) • Molecular Function • about 1000 categories of functions
Microarray Databases • Large experiments may have hundreds of individual array hybridizations • Core lab at an institution or multiple investigators using one machine - data archive and validate across experiments • Data-mining - look for similar patterns of gene expression across different experiments
Public Databases • Gene Expression data is an essential aspect of annotating the genome • Publication and data exchange for microarray experiments • Data mining/Meta-studies • Common data format - XML • MIAME (Minimum Information About a Microarray Experiment)
Are the Treatments Different? • Analysis of microarray data has tended to focus on making lists of genes that are up or down regulated between treatments • Before making these lists, ask the question: "Are the treatments different?" • Use standard statistical methods to evaluate expression profiles for each treatment (t-test or F-test) • If there are differences, find the genes most responsible • If there are no significant overall differences, then lists of genes with large fold changes may only reflect random variability.
Association Rules in Microarray Analysis • Row enumeration vs. column enumeration • Suppose we have m rows and I columns • The search space of column enumeration is 2^I • Row enumeration reduces the search space to 2^m when m << I • Supports rule set pruning using minimum support, minimum confidence, and chi-square
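To get a feel for the gap between the two search spaces, here is a quick Python calculation. The sizes (100 samples as rows, 10,000 genes as columns) are assumed for illustration and do not come from the slides.

```python
def enumeration_space(n):
    """Number of subsets of n rows (or n columns) that could be enumerated."""
    return 2 ** n


m_rows = 100         # hypothetical number of samples
n_columns = 10_000   # hypothetical number of genes/items

row_space = enumeration_space(m_rows)
column_space = enumeration_space(n_columns)

print(f"row enumeration:    2^{m_rows} has {len(str(row_space))} decimal digits")
print(f"column enumeration: 2^{n_columns} has {len(str(column_space))} decimal digits")
```

Both spaces are far too large to search exhaustively, which is why the pruning by minimum support, minimum confidence, and chi-square still matters after switching to row enumeration.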
Background – Rule Groups • Large number of rules contained in a microarray data set • Rule sets contain a lot of redundancy • Makes interpretation difficult • Ex: a single row {a, b, c, d, e} with class label Cancer can generate 31 rules (one per nonempty subset of its items), all with the same support • FARMER finds rule groups • 31 rules above would be grouped together • Only finds interesting rule groups. Consider abcd->Cancer with a confidence of 90% and ab->Cancer with a confidence of 95%. All rows covered by abcd must be covered by ab. Therefore, abcd->Cancer is not interesting.
Preliminaries and Definitions • D = dataset (rows), R = set of rows {r1,…,rn}, I = set of items {i1,…,im}, C = set of class labels {c1,…,ck} • Row support set: Given a set of items I’, R(I’) is the set of rows that contain I’ • Item support set: Given a set of rows R’, I(R’) is the set of items common to all rows in R’
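The count of 31 and the abcd-versus-ab comparison can be reproduced with a short Python sketch. The `is_dominated` check is a simplified reading of the slide's notion of "not interesting" (a more general rule with at least the same confidence exists); it is not FARMER's full definition, and the function names are illustrative.

```python
from itertools import combinations


def candidate_rules(items, label):
    """All rules of the form 'nonempty subset of items -> label'."""
    ordered = sorted(items)
    return [
        (frozenset(subset), label)
        for size in range(1, len(ordered) + 1)
        for subset in combinations(ordered, size)
    ]


rules = candidate_rules({"a", "b", "c", "d", "e"}, "Cancer")
print(len(rules))  # 31


def is_dominated(specific, general):
    """True if `general` uses a strict subset of the items of `specific`
    and has at least the same confidence. Rules are (items, confidence)."""
    specific_items, specific_conf = specific
    general_items, general_conf = general
    return general_items < specific_items and general_conf >= specific_conf


abcd = (frozenset("abcd"), 0.90)
ab = (frozenset("ab"), 0.95)
print(is_dominated(abcd, ab))  # True
```

Since every row covered by abcd is also covered by ab, and ab already predicts Cancer with higher confidence, reporting abcd->Cancer adds nothing.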
Example • I’ = {a,e,h} • R(I’) = {r2,r3,r4} • R’ = {r2,r3} • I(R’) = {a,e,h}
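The two support-set definitions can be sketched directly with Python sets. The table the example refers to did not survive in this transcript, so the `DATASET` below is a hypothetical five-row table chosen only so that it reproduces the example's answers; the function names are also illustrative.

```python
# Hypothetical rows (row id -> set of items); class labels are omitted.
DATASET = {
    "r1": set("abclos"),
    "r2": set("adehlpr"),
    "r3": set("acehoqt"),
    "r4": set("aefhpr"),
    "r5": set("bdfglqst"),
}


def row_support_set(items, dataset):
    """R(I'): the rows that contain every item in `items`."""
    return {row for row, row_items in dataset.items() if items <= row_items}


def item_support_set(rows, dataset):
    """I(R'): the items common to every row in `rows`."""
    if not rows:
        raise ValueError("item support set is undefined here for an empty row set")
    return set.intersection(*(dataset[row] for row in rows))


print(sorted(row_support_set({"a", "e", "h"}, DATASET)))  # ['r2', 'r3', 'r4']
print(sorted(item_support_set({"r2", "r3"}, DATASET)))    # ['a', 'e', 'h']
```

Row enumeration works with these two mappings: it grows sets of rows R’ and reads off the shared items I(R’), instead of growing item sets and counting their rows.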