Machine learning. Learning mode on. Bioinfo is great.
Clustering (of expression data). UPGMA is a direct clustering method: it receives a distance matrix as input and produces an ultrametric tree as output. It was suggested by Sokal and Michener (1958).
Clustering (of expression data). Often there is a one-to-one mapping between the data and points in space. For example, the expression of all genes under a specific condition is a point, (5, 7, 2, …, 54): a point in a space of dimension 20,000.
Clustering (of expression data). Another example: each gene's expression profile is a point in a space whose dimension is the number of conditions, e.g., (50, 20, 4, 33) is a point in a space of dimension 4.
In space. [Figure: gene g1 plotted as a single point in the plane spanned by Condition 1 and Condition 2]
Our goal will be to cluster. [Figure: points in the Condition 1 vs. Condition 2 plane, grouped into clusters] Genes that are in the same cluster (i.e., show similar patterns of expression) are likely to be functionally related.
Distance between two expression profiles. The Euclidean distance between profiles x = (x_1, …, x_n) and y = (y_1, …, y_n) is D(x, y) = sqrt( Σ_i (x_i − y_i)^2 ).
Distance between two expression profiles. We can compute the distance between each pair of expression profiles and obtain a distance matrix.
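To make this concrete, here is a minimal NumPy sketch that computes the Euclidean distance between two profiles and assembles the full pairwise distance matrix; the expression values and gene names are invented for illustration and are not from the original slides:

```python
import numpy as np

# Illustrative expression matrix: 4 genes measured under 4 conditions.
# Each row is one gene's expression profile, i.e., a point in 4-D space.
X = np.array([
    [50.0, 20.0,  4.0, 33.0],   # g1
    [48.0, 22.0,  5.0, 30.0],   # g2
    [ 5.0, 70.0, 44.0,  2.0],   # g3
    [ 6.0, 68.0, 40.0,  3.0],   # g4
])

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return np.sqrt(np.sum((x - y) ** 2))

# Pairwise distance matrix D, with D[i, j] = D[j, i] and D[i, i] = 0.
n = X.shape[0]
D = np.array([[euclidean(X[i], X[j]) for j in range(n)] for i in range(n)])
print(np.round(D, 2))
```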
UPGMA. The UPGMA clustering algorithm. Input: a distance matrix D, which is symmetric, i.e., D(i,j) = D(j,i). Variables: for each group of clustered genes we keep a count of how many genes are in that group; n(i) denotes the number of genes in group i. Initially, all groups have n = 1.
UPGMA • The algorithm: • 1. Find the i and j that have the smallest D(i,j) • 2. Create a new group (ij) which has n(ij) = n(i) + n(j) • 3. Connect i and j to a new node (which corresponds to the new group (ij)). Give the two branches connecting i to (ij) and j to (ij) each a length of D(i,j)/2.
UPGMA • The algorithm: • 4. Compute the distance between the new group and every other group k (except for i and j) by using: D((ij), k) = ( n(i)·D(i,k) + n(j)·D(j,k) ) / ( n(i) + n(j) )
UPGMA • The algorithm: • 5. Delete the rows and columns of the (modified) distance matrix that correspond to groups i and j, and add a row and column for group (ij). • 6. Go to step 1, unless only one group is left in the matrix. (A code sketch of the full procedure follows.)
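The following is a minimal Python sketch of steps 1–6, assuming NumPy is available; the function name `upgma`, the nested-tuple group labels, and the example matrix are illustrative choices, not part of the original slides:

```python
import numpy as np

def upgma(D, names):
    """Minimal UPGMA. D: symmetric distance matrix; names: labels of the
    initial singleton groups. Returns the root (a nested tuple recording
    the merge order) and each node's height (distance to its leaves)."""
    groups = {name: (1, 0.0) for name in names}  # label -> (n(i), height)
    labels = list(names)
    D = np.asarray(D, dtype=float).copy()

    while len(labels) > 1:
        m = len(labels)
        # Step 1: find the pair (i, j) with the smallest D(i, j).
        i, j = min(((a, b) for a in range(m) for b in range(a + 1, m)),
                   key=lambda ab: D[ab[0], ab[1]])
        li, lj = labels[i], labels[j]
        (ni, _), (nj, _) = groups[li], groups[lj]

        # Steps 2-3: merge into a new group whose node sits at height D(i,j)/2.
        new_label = (li, lj)
        groups[new_label] = (ni + nj, D[i, j] / 2.0)

        # Step 4: size-weighted average distance to every other group k.
        new_row = [(ni * D[i, k] + nj * D[j, k]) / (ni + nj)
                   for k in range(m) if k not in (i, j)]

        # Steps 5-6: drop rows/columns i and j, add the new group, repeat.
        keep = [k for k in range(m) if k not in (i, j)]
        D = D[np.ix_(keep, keep)]
        D = np.pad(D, ((0, 1), (0, 1)), mode="constant")
        D[-1, :-1] = D[:-1, -1] = new_row
        labels = [labels[k] for k in keep] + [new_label]

    root = labels[0]
    return root, {label: h for label, (size, h) in groups.items()}

# Hypothetical 4-gene example.
names = ["g1", "g2", "g3", "g4"]
D0 = np.array([[ 0, 10, 30, 32],
               [10,  0, 28, 30],
               [30, 28,  0,  6],
               [32, 30,  6,  0]], dtype=float)
root, heights = upgma(D0, names)
print(root)           # (('g3', 'g4'), ('g1', 'g2'))
print(heights[root])  # 15.0 — the root's distance to every leaf
```

In practice, SciPy's `linkage(..., method='average')` implements the same UPGMA update rule.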
UPGMA. Make sure that, in the resulting tree, the distance from each node to all the leaves descending from it is equal. [Figure: example ultrametric tree with branch lengths 3, 9, 6, 5, 1, 1]
Starting tree. The distance between g5 and g6 was 24, so each branch has a length of 12. [Figure: node g56 joining g5 and g6, each branch of length 12] We call the parent node of g5 and g6 "g56".
Removing the g5 and g6 rows and columns, and adding the g56 row and column
Computing distances. Here, i = g5, j = g6, k = g1. n(i) = n(j) = 1. D(g56, g1) = 0.5·D(g5, g1) + 0.5·D(g6, g1) = 49.
Building the tree, continued. The distance between g2 and g3 was 26, so each branch has a length of 13. [Figure: node g23 joining g2 and g3, each branch of length 13, alongside the earlier g56 subtree] We call the parent node of g2 and g3 "g23".
Computing distances. Here, i = g2, j = g3, k = g56. n(i) = n(j) = 1. D(g23, g56) = 0.5·D(g2, g56) + 0.5·D(g3, g56) = 37.5.
Tree. The distance between g23 and g56 was 37.5, so each branch has a length of 18.75. But this is the distance from g2356 to the leaves. The distance from g2356 to g56 is 18.75 − 12 = 6.75; the distance from g2356 to g23 is 18.75 − 13 = 5.75. [Figure: node g2356 joining g56 (branch 6.75) and g23 (branch 5.75)]
Computing distances. Here, i = g23, j = g56, k = g1. n(i) = n(j) = 2. D(g2356, g1) = 0.5·D(g23, g1) + 0.5·D(g56, g1) = 44.5.
Building the tree. The distance between g2356 and g4 was 39.5, so g23456 is mapped to height 19.75. Its distance to g2356 is thus 19.75 − 18.75 = 1. [Figure: tree with internal nodes at heights 12 (g56), 13 (g23), 18.75 (g2356), and 19.75 (g23456); leaves g4, g5, g6, g2, g3 at height 0]
Computing distances. Here, i = g2356, j = g4, k = g1. n(i) = 4, n(j) = 1. D(g23456, g1) = 0.8·D(g2356, g1) + 0.2·D(g4, g1) = 44.5·8/10 + 51·2/10 = (356 + 102)/10 = 45.8.
Constructing the tree. The distance between g23456 and g1 was 45.8, so g123456 is mapped to height 22.9. Its distance to g23456 is thus 22.9 − 19.75 = 3.15. [Figure: tree extended with g123456 at height 22.9; leaves g4, g5, g6, g2, g3, g1]
Reconstructing the tree. The distance between g123456 and g7 was 89.833, so g1234567 is mapped to height 44.9165. Its distance to g123456 is thus 44.9165 − 22.9 = 22.0165. [Figure: tree extended with g1234567 at height 44.9165; leaves g7, g4, g5, g6, g2, g3, g1]
Resulting tree. The distance between g1234567 and g8 was 144.2857, so the root is mapped to height 72.14285. Its distance to g1234567 is thus 72.14285 − 44.9165 = 27.22635. [Figure: complete tree with the root at height 72.14; leaves g7, g4, g5, g6, g2, g3, g1, g8]
From tree to clusters. If we want two clusters, we cut the tree just below the root and obtain g8 versus g1–g7. [Figure: tree with a horizontal cut separating g8 from the rest]
From tree to clusters. If we want 3 clusters, we cut lower and obtain g8, g7, and g1–g6. [Figure: tree with a horizontal cut yielding three clusters]
From tree to clusters. The 4 clusters are: g8, g7, g1, and g2–g6 (g23456). [Figure: tree with a horizontal cut yielding four clusters]
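Rather than cutting the tree by eye, libraries can perform the cut for a requested number of clusters. A small sketch using SciPy (the distance values are hypothetical; `method='average'` is SciPy's name for UPGMA):

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative symmetric distance matrix for 4 genes (hypothetical values).
D = np.array([[ 0, 10, 30, 32],
              [10,  0, 28, 30],
              [30, 28,  0,  6],
              [32, 30,  6,  0]], dtype=float)

# linkage expects the condensed (upper-triangle) form of the matrix.
Z = linkage(squareform(D), method="average")

# Cut the tree so that exactly k clusters remain.
for k in (2, 3):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, "clusters:", labels)   # cluster id assigned to each gene
```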
Classification. [Figure: red and yellow points in the Gene 1 vs. Gene 2 plane, with one unlabeled point marked "?"] If red = brain tumor and yellow = healthy, do I have a brain tumor?
SVM = support vector machine. [Figure: the same points with a separating line between the two colors] In SVM we find a (hyper)plane that divides the space in two.
SVM – confidence in classification. [Figure: points at varying distances from the separating line] The further a point is from the separating (hyper)plane, the more confident we are in the classification.
SVM – cannot always perfectly classify. [Figure: overlapping red and yellow points that no single line separates] Sometimes we cannot perfectly separate the training data; in this case, we find the best possible separation.
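A minimal scikit-learn sketch of these SVM slides, with a hypothetical two-gene dataset (values and labels invented for illustration); `decision_function` returns a signed score whose magnitude grows with the distance from the hyperplane:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical training data: expression of Gene 1 and Gene 2 per patient.
# 1 = brain tumor ("red"), 0 = healthy ("yellow").
X_train = np.array([[8.0, 9.0], [7.5, 8.0], [9.0, 7.0],   # tumor
                    [2.0, 1.0], [1.5, 2.5], [3.0, 2.0]])  # healthy
y_train = np.array([1, 1, 1, 0, 0, 0])

# A linear SVM finds a separating hyperplane; C controls how much
# misclassification is tolerated when perfect separation is impossible.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X_train, y_train)

query = np.array([[6.0, 6.0]])       # the "?" point
print(clf.predict(query))            # predicted class
print(clf.decision_function(query))  # signed score: farther from 0 = more confident
```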
KNN = k nearest neighbors. [Figure: the query point "?" among red and yellow points in the Gene 1 vs. Gene 2 plane] If red = brain tumor and yellow = healthy, do I have a brain tumor? KNN is another method for classification. For each point, it looks at that point's k nearest neighbors.
KNN = k nearest neighbors. [Figure: the query point with its 3 nearest neighbors highlighted] For each point, the method looks at the point's k nearest neighbors. For example, with k = 3 it looks at a point's 3 nearest neighbors to decide how to classify it: if the majority are red, it classifies the point as red.
KNN = k nearest neighbors. [Figure: a dataset that no single line separates well] KNN is better than SVM for the case shown above.
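A matching KNN sketch with k = 3, on the same invented two-gene data; `predict_proba` reports the fraction of the k neighbors in each class:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Same hypothetical training data as in the SVM sketch.
X_train = np.array([[8.0, 9.0], [7.5, 8.0], [9.0, 7.0],
                    [2.0, 1.0], [1.5, 2.5], [3.0, 2.0]])
y_train = np.array([1, 1, 1, 0, 0, 0])  # 1 = tumor, 0 = healthy

# k=3: classify a query by the majority label among its 3 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

query = np.array([[6.0, 6.0]])
print(knn.predict(query))        # majority vote of the 3 nearest points
print(knn.predict_proba(query))  # fraction of neighbors in each class
```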
KNN – exercise. [Figure: a query point "?" among red and yellow points] In the above example, how will the point be classified by KNN with k = 1? By SVM?
Training dataset. [Figure: labeled red and yellow training points] The red and yellow points are used to train the classifier. The more training data one has, the better the classifier will generally perform.
Test dataset. [Figure: held-out points whose labels are known but hidden from the classifier] Usually, some points for which we know the answer are not given to the classifier during training and are used to TEST its performance.
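A sketch of the train/test idea with scikit-learn's `train_test_split`; the synthetic two-gene dataset below is invented for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical labeled dataset: 100 patients, 2 gene-expression features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 1, (50, 2)),    # healthy
               rng.normal(7, 1, (50, 2))])   # tumor
y = np.array([0] * 50 + [1] * 50)

# Hold out 25% of the labeled points; the classifier never sees them
# during training, so their accuracy estimates real-world performance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```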
Decision tree. [Figure: a decision tree: if Gene 2 is low, Operation = no; if Gene 2 is high, test Age > 40: if yes, Operation = no; if no, Operation = yes] Decision trees are built automatically from training data and are used for classification. They also tell us which features are most important.
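A small scikit-learn sketch of a tree like the one on this slide; the data are invented so that the learned rules mirror the slide's tree (Gene 2 and Age as features), and `feature_importances_` quantifies which features matter most:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: [Gene 2 expression, Age]; label 1 = operate, 0 = don't.
X = np.array([[9.0, 30], [8.5, 25], [8.0, 35],   # Gene 2 high, age <= 40
              [9.5, 50], [8.2, 60],              # Gene 2 high, age > 40
              [1.0, 45], [2.0, 28], [1.5, 55]])  # Gene 2 low
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

# Print the learned rules and the relative importance of each feature.
print(export_text(tree, feature_names=["Gene2", "Age"]))
print("importances:", tree.feature_importances_)
```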
Voting. [Figure: training data are fed to KNN, SVM, and decision trees; on a new test datum the three classifiers answer Yes, No, Yes; the majority vote is YES] Voting uses an array of machine learning algorithms and chooses the classification suggested by the majority of the classifiers.
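A sketch of majority voting with scikit-learn's `VotingClassifier`, combining the same three classifier types on the invented data from the earlier sketches:

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Same hypothetical tumor/healthy training data as in the earlier sketches.
X_train = np.array([[8.0, 9.0], [7.5, 8.0], [9.0, 7.0],
                    [2.0, 1.0], [1.5, 2.5], [3.0, 2.0]])
y_train = np.array([1, 1, 1, 0, 0, 0])

# Hard voting: each classifier casts one vote; the majority label wins.
vote = VotingClassifier(estimators=[
    ("knn", KNeighborsClassifier(n_neighbors=3)),
    ("svm", SVC(kernel="linear")),
    ("tree", DecisionTreeClassifier(max_depth=2)),
], voting="hard")
vote.fit(X_train, y_train)

print(vote.predict(np.array([[6.0, 6.0]])))  # majority decision for the new datum
```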