EECS 800 Research Seminar Mining Biological Data

EECS 800 Research Seminar: Mining Biological Data • Instructor: Luke Huan • Fall 2006

Administrative • The book “Elements of Statistical Learning” is on reserve in the engineering library. • The other two books (Data Mining, Bioinformatics) have been recalled. • Presentation paper selection is due this Friday; 3 topics remain: • Data Mining in Systems Biology • Data Mining in Proteomics • Analyzing Bionetworks

Administrative • Paper presenter • Always keep in mind the following four “w” questions in your presentation • What is the problem • Why the problem is important (“who cares”) • What is the related work (“why bother”) • What are the impacts of the presented work (“so what”) • Define and explain your computational task • Give intuitions before discussing details • Present the pros and cons of the methods • Audience: • Ask at least one question

Outline • What is a Microarray? • Terms from molecular biology (this is NOT Bio101) • Goals • Raw data collection • Raw data analysis • Frequent pattern discovery in Microarray data analysis

Microarray • Microarrays are currently used to do many different things: • to detect and measure gene expression at the mRNA level • to find mutations and to genotype; • to sequence DNA; • to locate chromosomal changes and more. • There are many different types of microarrays • cDNA chips • Affymetrix chips

Goals of a Microarray Experiment • Find the genes that change expression between experimental and control samples • Classify samples based on a gene expression profile • Find patterns: Groups of biologically related genes that change expression together across samples/treatments

Microarray Procedure • In general there are two basic aspects of microarrays: • Data acquisition • Producing chips • preparing samples for detection; • hybridization; • scanning; • Data analysis • Low level analysis: normalization and significance test • High level analysis: clustering, classification, and pattern discovery • We are interested in the data analysis section. However, it is dangerous to go into data analysis without knowing how the data are collected

Microarrays are Popular • PubMed search "microarray"= 13,948 papers • 2005 = 4406 • 2004 = 3509 • 2003 = 2421 • 2002 = 1557 • 2001 = 834 • 2000 = 294

Necessary Background • Gene • Central Dogma • DNA • RNA • Nucleic acid hybridization

Genes • The human genome contains 23 pairs of chromosomes. • In each pair, one chromosome is paternally inherited, the other maternally inherited. • Chromosomes are made of compressed and entwined DNA. • A (protein-coding) gene is a segment of chromosomal DNA that directs the synthesis of a protein.

Central Dogma • The expression of the genetic information stored in the DNA molecule occurs in two stages: (i) transcription, during which DNA is transcribed into mRNA; (ii) translation, during which mRNA is translated to produce a protein. DNA → mRNA → protein
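The two stages can be sketched in a few lines of Python. This is a toy illustration, not a bioinformatics tool: the function names are made up for this example, the input is assumed to be the coding strand read in frame from position 0, and the codon table is deliberately partial (only the four codons used below, taken from the standard genetic code).

```python
# Partial codon table (standard genetic code); only the codons used below.
CODON_TABLE = {"AUG": "Met", "UUU": "Phe", "GGC": "Gly", "UAA": "Stop"}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA matching the coding strand (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> list:
    """Read codons from position 0 until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein


mrna = transcribe("ATGTTTGGCTAA")   # DNA -> mRNA
protein = translate(mrna)           # mRNA -> protein
print(mrna, protein)
```

A codon missing from the partial table raises a KeyError, which is acceptable for a sketch but not for real sequences.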

DNA • A deoxyribonucleic acid or DNA molecule is a double-stranded polymer composed of molecular units called nucleotides. • There are four types of nucleotides: • adenine (A), • guanine (G), • cytosine (C), and • thymine (T). • Base-pairing occurs according to the following rule: G pairs with C, and A pairs with T.
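The base-pairing rule is what lets a probe on a chip bind its target. A minimal sketch, with illustrative function names, that derives the complementary strand of a sequence:

```python
# Watson-Crick pairing rule from the slide: G pairs with C, A pairs with T.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Base-by-base partner of the given strand."""
    return "".join(PAIR[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Partner strand read in its own 5'-to-3' direction."""
    return complement(strand)[::-1]


print(complement("GATTACA"))
print(reverse_complement("GATTACA"))
```

Applying reverse_complement twice returns the original sequence, which is a quick sanity check on the pairing table.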

RNA • A ribonucleic acid or RNA molecule is a nucleic acid similar to DNA, but • It is single-stranded; • uracil (U) replaces thymine (T) as one of the bases. • RNA plays an important role in protein synthesis and other chemical activities of the cell. • Several classes of RNA molecules • messenger RNA (mRNA), • transfer RNA (tRNA), • ribosomal RNA (rRNA), • and other small RNAs.

Nucleic acid hybridization: here DNA-RNA

DNA Chip Microarrays • Put a large number (~100K) of DNA sequences or synthetic DNA oligomers onto a glass slide in known locations on a grid. • Measure amounts of RNA bound to each square in the grid • Make comparisons • Cancerous vs. normal tissue • Treated vs. untreated • Time course • Many applications in both basic and clinical research

GeneChip

Spot your own Chip • Robot spotter • Ordinary glass microscope slide

Affymetrix “Gene chip” system • Commercial product • Uses 25 base oligos synthesized in place on a chip (20 pairs of oligos for each gene) • RNA labeled and scanned in a single “color” • Currently mass produced arrays targeting 17 different organisms • More than 40 different array types/sets • Proprietary system: “black box” software, can only use their chips

Data Acquisition in Microarray • Scan the arrays • Quantitate each spot • Subtract background • Normalize • Export a table of fluorescent intensities for each gene in the array
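The "subtract background" and "export a table" steps might look like the sketch below. The spot values and gene names are invented, and clipping negative corrected intensities to zero is an assumption made here for simplicity; real image-analysis packages handle this in different ways.

```python
import csv
import io


def background_corrected(foreground: float, background: float) -> float:
    """Spot intensity minus local background, clipped at zero (assumption)."""
    return max(foreground - background, 0.0)


# Hypothetical quantitated spots: (gene, foreground, local background).
spots = [("gene1", 1500.0, 200.0), ("gene2", 180.0, 210.0)]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["gene", "intensity"])
for gene, fg, bg in spots:
    writer.writerow([gene, background_corrected(fg, bg)])

table = buffer.getvalue()  # tab-delimited table of intensities per gene
print(table)
```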

Hybridization to the Chip

The Chip is Scanned

Images

Data Acquisition [Figure: cDNA microarray workflow. Labels: cDNA clones (probes); PCR product amplification and purification; printing; microarray; mRNA target; hybridise target to microarray; scanning with laser 1 and laser 2; emission; overlay images and normalize; analysis]

Streamlined Array Analysis [Figure: analysis workflow] • Raw data • Normalize (RMA) • Filter: Present/Absent, Minimum value, Fold change • Significance: t-test, SAM, Rank Product • Clustering: Hierarchical CL, Biclustering • Classification: PAM, Machine learning • Gene lists • Function (Gene Ontology)

Lower Level Data Analysis • Normalization: • when you have variability in measurements, you need replication and statistics to find real differences • Significance test: • It’s not just the genes with 2 fold increase, but those with a significant p-value across replicates
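To see why fold change alone is not enough, compare two hypothetical genes that both show a 2-fold increase. The sketch computes Welch's t statistic with the standard library; the replicate values are invented, and turning the statistic into a p-value (which needs the t distribution) is left out.

```python
import math
import statistics


def fold_change(control, treated):
    """Ratio of mean treated intensity to mean control intensity."""
    return statistics.mean(treated) / statistics.mean(control)


def welch_t(control, treated):
    """Welch's t statistic (unequal variances) for two groups of replicates."""
    se = math.sqrt(
        statistics.variance(control) / len(control)
        + statistics.variance(treated) / len(treated)
    )
    return (statistics.mean(treated) - statistics.mean(control)) / se


# Hypothetical replicate intensities for two genes.
consistent_control, consistent_treated = [100.0, 110.0, 90.0], [210.0, 190.0, 200.0]
noisy_control, noisy_treated = [100.0, 20.0, 180.0], [200.0, 40.0, 360.0]

for name, c, t in [("consistent", consistent_control, consistent_treated),
                   ("noisy", noisy_control, noisy_treated)]:
    print(name, fold_change(c, t), round(welch_t(c, t), 2))
```

Both genes have the same fold change, but only the gene with consistent replicates yields a large t statistic.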

Sources of Variability in Raw Data • Biological variability • Sample preparation • Probe labeling • RNA extraction • Experimental condition • temperature, time, mixing, etc. • Scanning • laser and detector, chemistry of the fluorescent label • Image analysis • identifying and quantifying each spot on the array

Self-self hybridizations • False colour overlay

Variability • Scatter plot of all genes in a simple comparison of two controls (A) and two treatments (B: high vs. low glucose), showing changes in expression greater than 2.2 and 3 fold.

Data Normalization • Can control for many of the experimental sources of variability (systematic, not random or gene specific) • Bring each image to the same average brightness • Can use simple math or fancy: • divide by the mean (whole chip or by sectors) • LOESS (locally weighted regression) • No sure biological standards
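The "simple math" option (divide by the whole-chip mean) can be sketched as follows. The intensities are invented, and the function name and target parameter are illustrative; LOESS is not shown.

```python
def normalize_to_mean(intensities, target=1.0):
    """Scale one array so that its mean intensity equals `target`."""
    mean = sum(intensities) / len(intensities)
    return [x * target / mean for x in intensities]


# Hypothetical arrays: array_b was scanned twice as bright as array_a.
array_a = [100.0, 200.0, 300.0]
array_b = [200.0, 400.0, 600.0]

norm_a = normalize_to_mean(array_a)
norm_b = normalize_to_mean(array_b)
print(norm_a, norm_b)
```

After scaling, the two arrays are directly comparable, because the systematic brightness difference has been removed while the relative gene-to-gene differences are preserved.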

Significance Test • In a microarray experiment, each gene (each probe or probe set) is really a separate experiment • Yet if you treat each gene as an independent comparison, you will always find some with significant differences • (the tails of a normal distribution)

False Discovery • Statisticians call a false positive a "Type I error" or a "false discovery" • The expected number of false positives is roughly the p-value cutoff of the t-test × the number of genes in the array (strictly, the False Discovery Rate, FDR, is the fraction of reported genes that are false positives) • For a p-value cutoff of 0.01 × 10,000 genes = 100 false “different” genes • You cannot eliminate false positives, but by choosing a more stringent p-value, you can keep them manageable (try p=0.001) • The expected number of false positives must be smaller than the number of real differences that you find - which in turn depends on the size of the differences and variability of the measured expression values
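The arithmetic on this slide can be checked directly, along with a small simulation. The simulation assumes that no gene is truly different, so every p-value is uniform on [0, 1]; the function names and the seed are illustrative choices.

```python
import random


def expected_false_positives(p_cutoff: float, n_genes: int) -> float:
    """Expected count of genes passing the cutoff when nothing truly differs."""
    return p_cutoff * n_genes


def simulate_null_hits(p_cutoff: float, n_genes: int, seed: int = 0) -> int:
    """Draw one uniform p-value per gene and count those below the cutoff."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n_genes) if rng.random() < p_cutoff)


loose = expected_false_positives(0.01, 10_000)    # about 100 genes
strict = expected_false_positives(0.001, 10_000)  # about 10 genes
simulated = simulate_null_hits(0.01, 10_000)      # varies around 100
print(loose, strict, simulated)
```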

Higher Level Data Analysis • Computational tasks: • Clustering • Classification • Statistical validation • Data visualization • Pattern detection • Biological problems: • Discovery of common sequences in co-regulated genes • Meta-studies using data from multiple experiments • Linkage between gene expression data and gene sequence/function/metabolic pathways databases

Types of Clustering • Hierarchical • Link similar genes, build up to a tree of all • Self Organizing Maps (SOM) • Split all genes into similar sub-groups • Finds its own groups (machine learning) • Principal Component Analysis • every gene is a dimension (vector), find a single dimension that best represents the differences in the data
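"Link similar genes, build up to a tree of all" can be illustrated with a tiny agglomerative clustering. This sketch assumes Euclidean distance and single linkage, which are one choice among several; the expression profiles are invented, and a real analysis would use a library rather than this cubic-time loop.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def single_linkage(profiles):
    """Repeatedly merge the two closest clusters; return the tree as nested tuples."""
    # Each cluster is (subtree, list of member gene names).
    clusters = [(name, [name]) for name in profiles]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(euclidean(profiles[a], profiles[b])
                        for a in clusters[i][1] for b in clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical expression profiles across three samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [1.1, 2.1, 2.9],
    "geneC": [5.0, 1.0, 0.5],
    "geneD": [5.2, 0.9, 0.4],
}
tree = single_linkage(profiles)
print(tree)
```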

Cluster by Color Difference

GeneSpring

Classification • How to sort samples into two classes based on gene expression data • Cancer vs. normal • Cancer sub-types: benign vs. malignant • Responds well to drug vs. poor response

Functional Genomics • Take a list of "interesting" genes and find their biological relationships • Gene lists may come from significance/classification analysis of microarrays, proteomics, or other high-throughput methods • Requires a reference set of "biological knowledge"

GO • Biologists got together a few years ago and developed a sensible system called Gene Ontology (GO) • 3 hierarchical sets of terminology • Biological Process • Cellular Component (location within cell) • Molecular Function • about 1000 categories of functions

Gene Ontology

Biological Pathways

Microarray Databases • Large experiments may have hundreds of individual array hybridizations • Core lab at an institution or multiple investigators using one machine - data archive and validate across experiments • Data-mining - look for similar patterns of gene expression across different experiments

Public Databases • Gene expression data is an essential aspect of annotating the genome • Publication and data exchange for microarray experiments • Data mining/Meta-studies • Common data format - XML • MIAME (Minimum Information About a Microarray Experiment)

GEO at the NCBI

ArrayExpress at EMBL-EBI

Are the Treatments Different? • Analysis of microarray data has tended to focus on making lists of genes that are up or down regulated between treatments • Before making these lists, ask the question: "Are the treatments different?" • Use standard statistical methods to evaluate expression profiles for each treatment (t-test or F-test) • If there are differences, find the genes most responsible • If there are no significant overall differences, then lists of genes with large fold changes may only reflect random variability.

Association Rules in Microarray Analysis • Row enumeration vs column enumeration • Suppose we have m rows and I columns • The search space of column enumeration is 2^I • Row enumeration reduces the search space to 2^m when m << I • Supports rule set pruning using minimum support, minimum confidence, and chi-square
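A quick calculation shows how large the gap between the two search spaces is. The sizes below (50 samples, 10,000 genes) are hypothetical but typical of the "few rows, many columns" shape of microarray data.

```python
def column_search_space(n_columns: int) -> int:
    """Number of column (item) subsets: 2^I."""
    return 2 ** n_columns


def row_search_space(n_rows: int) -> int:
    """Number of row subsets: 2^m."""
    return 2 ** n_rows


m, n_columns = 50, 10_000  # hypothetical: 50 samples, 10,000 genes

row_space = row_search_space(m)
column_digits = len(str(column_search_space(n_columns)))
print(row_space)       # a 16-digit number
print(column_digits)   # number of decimal digits in 2^10000
```

Both spaces are exponential, so pruning is still needed; row enumeration simply starts from a far smaller exponent.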

Background – Rule Groups • Large number of rules contained in a microarray data set • Rule sets contain a lot of redundancy • Makes interpretation difficult • Ex: 31 rules could be generated from a row with items {a, b, c, d, e} and class label Cancer (one rule per non-empty subset of the five items, 2^5 - 1 = 31), all with the same support • FARMER finds rule groups • 31 rules above would be grouped together • Only finds interesting rule groups. Consider abcd->Cancer with a confidence of 90% and ab->Cancer with a confidence of 95%. All rows covered by abcd must be covered by ab. Therefore, abcd->Cancer is not interesting.
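The count of 31 comes from taking every non-empty subset of the five items as a rule antecedent. A short enumeration confirms it; the representation of a rule as an (antecedent, class label) tuple is just a convenience for this sketch.

```python
from itertools import combinations

items = ["a", "b", "c", "d", "e"]

# One rule "subset -> Cancer" per non-empty subset of the items.
rules = [
    (subset, "Cancer")
    for size in range(1, len(items) + 1)
    for subset in combinations(items, size)
]

print(len(rules))           # 2**5 - 1
print(rules[0], rules[-1])  # shortest and longest antecedent
```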

Preliminaries and Definitions • D = dataset (rows), R = set of rows {r1,…,rn}, I = set of items {i1,…,im}, C = set of class labels {c1,…,ck} • Row support set: Given a set of items I’, R(I’) is the set of rows that contain every item in I’ • Item support set: Given a set of rows R’, I(R’) is the set of items common to all rows in R’

Example • I’ = {a,e,h} • R(I’) = {r2,r3,r4} • R’ = {r2,r3} • I(R’) = {a,e,h}
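The two support sets can be computed with plain set operations. The table the slide's example refers to is not reproduced in this text, so the rows below are a hypothetical table chosen only so that it is consistent with the example values; the function names are likewise illustrative.

```python
# Hypothetical dataset: row id -> set of items (class labels omitted).
ROWS = {
    "r1": {"a", "b", "c", "l", "o", "s"},
    "r2": {"a", "d", "e", "h", "p", "l", "r"},
    "r3": {"a", "c", "e", "h", "o", "q", "t"},
    "r4": {"a", "e", "f", "h", "p", "r"},
    "r5": {"b", "d", "f", "g", "l", "q", "s", "t"},
}


def row_support_set(items):
    """R(I'): rows that contain every item in `items`."""
    return {r for r, row_items in ROWS.items() if set(items) <= row_items}


def item_support_set(rows):
    """I(R'): items common to all rows in `rows` (must be non-empty)."""
    return set.intersection(*(ROWS[r] for r in rows))


print(sorted(row_support_set({"a", "e", "h"})))
print(sorted(item_support_set({"r2", "r3"})))
```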
