Lecture 5 MicroArray clustering and data analysis. BIO454 Dr. Alaa eldin Abdallah yassin. Outline. Microarray data analysis. Concatemer method. Supervised and unsupervised learning. What is clustering technique? Clustering technique and Microarray data analysis.
Lecture 5 MicroArray clustering and data analysis BIO454 Dr. Alaa eldin Abdallah yassin
Outline • Microarray data analysis. • Concatemer method. • Supervised and unsupervised learning. • What is clustering technique? • Clustering technique and Microarray data analysis. • Hierarchical Clustering. • K-means clustering
Microarray data analysis. • Microarray data analysis is the final step in reading and processing data produced by a microarray chip. • Microarray analysis techniques are used to interpret the data generated from experiments on DNA (gene chip analysis). • Data in such large quantities are difficult, if not impossible, to analyze without the help of computer programs. • Most microarray manufacturers provide commercial data analysis software alongside their microarray products. • There are also open source options that use a variety of methods for analyzing microarray data.
Microarray data analysis... • A common way to evaluate how well an array is normalized is to draw an MA plot of the data. MA plots can be produced using programs and languages such as R, MATLAB, and Excel. • MA plot: a visual representation of genomic data. • The plot visualizes the differences between measurements taken in two samples by transforming the data onto M (log ratio) and A (mean average) scales, then plotting these values.
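The transformation onto the M and A scales can be sketched in a few lines. This is a minimal illustration, assuming each sample is given as a list of positive intensities and using the usual definitions M = log2(R) - log2(G) and A = (log2(R) + log2(G)) / 2; the names `ma_values`, `red` and `green` and the intensities are hypothetical, not taken from the lecture.

```python
import math

def ma_values(red, green):
    """Return (M, A) lists for paired positive intensities from two samples."""
    m_vals, a_vals = [], []
    for r, g in zip(red, green):
        log_r, log_g = math.log2(r), math.log2(g)
        m_vals.append(log_r - log_g)        # M: log ratio of the two samples
        a_vals.append((log_r + log_g) / 2)  # A: mean of the log intensities
    return m_vals, a_vals

# Hypothetical intensities for three spots (illustrative values only)
m, a = ma_values([8.0, 4.0, 16.0], [2.0, 4.0, 4.0])
```

Plotting `m` against `a` gives the MA plot. For a well-normalized array, most points would be expected to scatter around M = 0.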
Expression Data Matrix • Gene expression data are usually presented in an expression matrix. Each column represents all the gene expression levels from a single experiment, and each row represents the expression of a gene across all experiments. Each element is a log ratio. The log ratio is defined as log2(T/R), where T is the gene expression level in the testing sample and R is the gene expression level in the reference sample.
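As a small sketch of how such a matrix could be built, the following computes log2(T/R) element by element. It assumes the testing and reference levels are stored as nested lists with genes as rows and experiments as columns; the function name `log_ratio_matrix` and the numbers are made up for illustration.

```python
import math

def log_ratio_matrix(test, reference):
    """Element-wise log2(T/R); rows are genes, columns are experiments."""
    return [
        [math.log2(t / r) for t, r in zip(t_row, r_row)]
        for t_row, r_row in zip(test, reference)
    ]

# Hypothetical expression levels: 2 genes x 3 experiments
T = [[4.0, 1.0, 2.0],
     [1.0, 8.0, 1.0]]
R = [[1.0, 1.0, 4.0],
     [2.0, 2.0, 1.0]]
expr = log_ratio_matrix(T, R)
```

A positive element means the gene is expressed more strongly in the testing sample than in the reference, zero means equal expression, and a negative element means weaker expression.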
The expression matrix can be presented as a matrix of colored rectangles. Each rectangle represents an element of the expression matrix
Supervised and unsupervised learning • Supervised. • Supervised learning deals with labelled data, where the output data patterns are known to the system. • Unsupervised. • Unsupervised learning deals with unlabelled data, where the output patterns are not known in advance; it generates moderate but reliable results. • Clustering is one of the unsupervised approaches.
Supervised learning • Suppose you have a basket filled with fruits, and your task is to put fruits of the same type together. • Suppose the fruits are apple, banana, cherry and grape. • You already know the shape of each fruit from your previous work, so it is easy to arrange fruits of the same type together. • Here, your previous work is called the training data in data mining. • You have already learned from your training data because it has a response variable, which tells you that a fruit with such-and-such features is a grape, and likewise for every other fruit. • This kind of information comes from the training data. • This type of problem solving falls under classification (supervised learning). • Because you have already learned these things, you can do your job confidently.
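The fruit example can be sketched as a tiny classifier. This is only an illustration under assumptions: the features (size in cm, redness between 0 and 1), the training values and the choice of a 1-nearest-neighbour rule are all hypothetical and are not part of the lecture.

```python
import math

# Hypothetical labelled training data: (size in cm, redness 0..1) -> fruit name.
# The label plays the role of the response variable.
train = [
    ((7.5, 0.9), "apple"),
    ((2.0, 0.9), "cherry"),
    ((1.5, 0.1), "grape"),
    ((18.0, 0.0), "banana"),
]

def classify(features):
    """Return the label of the closest training example (1-nearest neighbour)."""
    nearest = min(train, key=lambda item: math.dist(item[0], features))
    return nearest[1]

label = classify((7.0, 0.8))  # a new, unlabelled fruit
```

Here `label` is `"apple"`, because the new fruit lies closest to the apple in the training data.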
Unsupervised learning • Suppose you have a basket filled with fruits, and your task is to put fruits of the same type together. • This time you know nothing about the fruits; you are seeing them for the first time. How will you arrange fruits of the same type? • First you take a fruit and select some physical characteristic of it, for example its color. • Then the groups will be something like this. • RED COLOR GROUP: apples and cherries. GREEN COLOR AND SMALL SIZE: grapes. • This type of learning is known as unsupervised learning.
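A rough sketch of the same idea in code: the fruits carry no names, only observed features, and they are grouped by whichever feature is chosen. This is plain grouping by one feature, as in the analogy above, not a full clustering algorithm; the observations and the helper name `group_by` are hypothetical.

```python
from collections import defaultdict

# Hypothetical unlabelled observations: (color, size); no fruit names are given
fruits = [
    ("red", "large"),
    ("green", "small"),
    ("red", "small"),
    ("green", "small"),
]

def group_by(items, feature_index):
    """Group items by the value of one chosen feature."""
    groups = defaultdict(list)
    for item in items:
        groups[item[feature_index]].append(item)
    return dict(groups)

by_color = group_by(fruits, 0)  # group on color, the first feature
```

Choosing a different feature (for example `group_by(fruits, 1)` for size) produces different groups, which is why the choice of similarity measure matters in clustering.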
Aim Of Supervised Learning • The aim of supervised machine learning is to build a model that makes predictions based on evidence in the presence of uncertainty. • As adaptive algorithms identify patterns in data, a computer "learns" from the observations. When exposed to more observations, the computer improves its predictive performance.
Clustering technique • Clustering is one of the unsupervised approaches. • Clustering is a data mining technique used to group genes having similar expression patterns. Hierarchical clustering and k-means clustering are widely used techniques in microarray analysis. • Clustering groups genes or samples with similar expression profiles together. • Clustering is thus the process of partitioning the data (or objects) into classes (clusters), so that the data in one cluster are more similar to each other than to those in other clusters.
Clustering technique and Microarray data analysis • Clustering is used to build groups of genes with related expression patterns (also known as co-expressed genes), as in the HCS clustering algorithm. Sequence clustering is used to group homologous sequences into gene families. • Clustering analysis is commonly used for interpreting microarray data. • It provides both a visual representation of complex data and a method for measuring similarity between experiments (gene ratios). • The widely used methods for clustering microarray data are hierarchical clustering, K-means and self-organizing maps.
Hierarchical Clustering • Hierarchical Clustering is the most popular method for gene expression data analysis. • In hierarchical clustering, genes with similar expression patterns are grouped together and are connected by a series of branches (clustering tree or dendrogram). • Experiments with similar expression profiles can also be grouped together using the same method. • Hierarchical clustering consists of two separate phases: • Initially, a distance matrix containing all the pairwise distances between the genes is calculated (Euclidean method, …). • After calculation of the initial distance matrix, the hierarchical clustering algorithm is applied.
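The first phase, the pairwise distance matrix, can be sketched as follows. This assumes Euclidean distance and gene expression vectors stored as lists of equal length; the function name `distance_matrix` and the example vectors are illustrative only.

```python
import math

def distance_matrix(vectors):
    """Symmetric matrix of pairwise Euclidean distances between expression vectors."""
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(vectors[i], vectors[j])
            dist[i][j] = dist[j][i] = d  # distance is symmetric
    return dist

# Hypothetical expression vectors for three genes over two experiments
genes = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = distance_matrix(genes)
```

In this example the first and second genes are at distance 5 and the first and third at distance 10, so a clustering algorithm working on `D` would consider the first two genes (or the last two) for merging before the first and third.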
Hierarchical Clustering • The two questions: • How to determine the similarity between two genes? • How to determine the similarity between two clusters? • To answer the first question, we calculate the distance between two expression vectors. A gene expression vector consists of the expression of a gene over a set of experimental conditions.
Distance Measures • Some techniques to measure distance: • Euclidean distance • Pearson Correlation Coefficient (PCC) • Rank Correlation Coefficient (RCC)
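The three measures can be sketched in plain Python. Two assumptions are made here that the slide does not state: the rank correlation is taken to be Spearman's coefficient (the Pearson correlation of the ranks), and ties are ignored for simplicity. Since correlations measure similarity rather than distance, a common convention is to use 1 - r as the distance. All function names and vectors are illustrative.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Ranks starting at 1 for the smallest value; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = float(rank)
    return result

def rank_correlation(x, y):
    """Spearman rank correlation (assumed here): Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical expression vectors
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [2.0, 4.0, 6.0, 8.0]   # same pattern as g1 at twice the magnitude
g3 = [1.0, 4.0, 9.0, 16.0]  # rises with g1, but not linearly
```

With these vectors, `g1` and `g2` are far apart by Euclidean distance yet perfectly correlated, which is one reason correlation-based measures are often preferred when the shape of the expression pattern matters more than its magnitude. `g1` and `g3` have a rank correlation of 1 but a Pearson correlation below 1.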
Cluster-to-cluster distance • The second question is: how to determine the similarity between clusters? The method for determining cluster-to-cluster distance is called the linkage method. • Three linkage methods: • Single linkage. • Complete linkage. • Average linkage.
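The three linkage methods differ only in how they summarize the distances between the members of two clusters. The sketch below uses the standard definitions (single: closest pair, complete: farthest pair, average: mean over all pairs) and assumes Euclidean distance between members; the function name `linkage_distance` and the example clusters are hypothetical.

```python
import math
from itertools import product

def linkage_distance(cluster_a, cluster_b, method="average"):
    """Distance between two clusters of expression vectors under a linkage rule."""
    pair_dists = [math.dist(a, b) for a, b in product(cluster_a, cluster_b)]
    if method == "single":    # distance between the closest pair of members
        return min(pair_dists)
    if method == "complete":  # distance between the farthest pair of members
        return max(pair_dists)
    if method == "average":   # mean distance over all pairs of members
        return sum(pair_dists) / len(pair_dists)
    raise ValueError(f"unknown linkage method: {method}")

# Two hypothetical clusters, each holding two expression vectors
c1 = [[0.0, 0.0], [1.0, 0.0]]
c2 = [[4.0, 0.0], [6.0, 0.0]]

single = linkage_distance(c1, c2, "single")
complete = linkage_distance(c1, c2, "complete")
average = linkage_distance(c1, c2, "average")
```

For these clusters the pairwise distances are 4, 6, 3 and 5, so single linkage gives 3, complete linkage gives 6 and average linkage gives 4.5.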