220 likes | 615 Views
Analysis and Management of Microarray Data. Dr G. P. S. Raghava. Major Applications Identification of differentially expressed genes in diseased tissues (in presence of drug) Classification of differentially expressed (genes) or clustering/ grouping of genes having similar
E N D
Analysis and Management of Microarray Data Dr G. P. S. Raghava
Major Applications Identification of differentially expressed genes in diseased tissues (in presence of drug) Classification of differentially expressed (genes) or clustering/ grouping of genes having similar behaviour in different conditions Use expression profile of known disease to diagnosis and classify of unknown genes
Management of Microarray Data Magnitude of Data – Experiments 50 000 genes in human 320 cell types 2000 compunds 3 times points 2 concentrations 2 replicates – Data Volume 4*1011 data-points 1015= 1 petaB of Data
Gene expression database – a conceptual view Samples Sample annotations Gene expression matrix Genes Gene annotations Gene expression levels
Management of Microarray Data Major Issues Large volume of microarray data in last few years – Storage and efficient access – Comparison and integration of data Problem of data access and exchange – Data scattered around Internet – Supplementary material of publications – Difficult for user to access relivent data Problems with existing databases – Diverse purpose – Developed for specific purpose
Management of Microarray Data Specific Database – Platform (eg.Stanford MA Database; SMD) – Organism (Yeast MA global viewer) – Project (Life cycle database of Drosophila) Problem with Supplement and MA databases – Lack of direct access – Quality not checked – No standard format – Incomplete data
Comprehensive database server to manage massive amount of Microarray Data – Biomaterial Information – Raw Data & Images – Web Tools (normalization; data viewing; analysis) Run on local servers allows full management and permission to add and view data Minimum Information about Microarray Experiment (MIAME) BASE http://bioinformatics1.uams.edu:8081:/
Public Databases Gene Expression data is an essential aspect of annotating the genome Publication and data exchange for microarray experiments Data mining/Meta-studies Common data format - XML MIAME (Minimal Information About a Microarray Experiment)
Microarray Data Mining Challenges too few records (samples), usually < 100 too many columns (genes), usually > 1,000 Too many columns likely to lead to False positives for exploration, a large set of all relevant genes is desired for diagnostics or identification of
Analysis of Microarray Data Analysis of images Preprocessing of gene expression data Normalization of data – Subtraction of Background Noise – Global/local Normalization – House keeping genes (or same gene) – Expression in ratio (test/references) in log Differential Gene expression – Repeats and calculate significance (t-test) – Significance of fold used statistical method Clustering – Supervised/Unsupervised (Hierarchical, K-means, SOM) Prediction or Supervised Machine Learnning (SVM)
Low Level Analysis or Preprocessing of gene expression data Scale Transformation Normalization and Scaling Replicate Handling Missing value Handling Flat pattern filtering Pattern standardization
Normalization Techniques Global normalization – Divide channel value by means Control spots – Common spots in both channels – House keeping genes – Ratio of intensity of same gene in two channel is used for correction Iterative linear regression Parametric nonlinear nomalization – log(CY3/CY5) vs log(CY5)) – Fitted log ratio – observed log ratio General Non Linear Normalization – LOESS – curve between log(R/G) vs log(sqrt(R.G))
Classification Task: assign objects to classes (groups) on the basis of measurements made on the objects Unsupervised: classes unknown, want to discover them from the data (cluster analysis) Supervised: classes are predefined, want to use a (training or learning) set of labeled objects to form a classifier for classification of future observations
Cluster analysis Used to find groups of objects when not already known “Unsupervised learning” Associated with each object is a set of measurements (the feature vector) Aim is to identify groups of similar objects on the basis of the observed measurements
Unsupervised Learnning Hierarchical clustering: merging two branches at the time until all vari-ables(genes) are in one tree. [it does not answer the question of “howmany gene clusters there are”?] K-mean clustering: assuming there are K clusters. [what if this assumption is incorrect?] Self Organizing Maps (SOM) – Split all genes into similar sub-groups – Finds its own groups (machine learning) Principle Component – every gene is a dimension (vector), find a single dimension that best represents the differences in the data Model-based clustering: the number of clusters is determined dynamically [could be one of the most promising methods]
Average linkage hierarchical clustering, melanoma only unclustered ‘cluster’
Supervised Analysis Fisher’s linear discriminant analysis Quadratic discriminant analysis Logistic regression (a linear discriminant analysis) Neural networks Support vector machine
Example: Tumor Classification Reliable and precise classification essential for successful cancer treatment Current methods for classifying human malignancies rely on a variety of morphological, clinical and molecular variables Uncertainties in diagnosis remain; likely that existing classes are heterogeneous Characterize molecular variations among tumors by monitoring gene expression (microarray) Hope: that microarrays will lead to more reliable tumor classification (and therefore more appropriate treatments and better outcomes)
Higher Level Microarray data analysis Clustering and pattern detection Data mining and visualization Controls and normalization of results Statistical validatation Linkage between gene expression data and gene sequence/function/metabolic pathways databases Discovery of common sequences in co-regulated genes Meta-studies using data from multiple experiments