Clustering Biological Data

Genes can be clustered according to their expression or their sequences
• Clustering genes according to their expression is used for identifying
  - Gene functions (common functions, genes that work together, genes that are mutually controlled by the same factor)
  - Diagnostics
• Clustering genes according to their sequence (DNA, RNA, protein)
• Useful to build phylogenetic trees
  - Study evolution
  - Identify new species
  - Gene function
  - Diagnostics

Different clustering approaches
• Supervised methods (supervised learning)
  - Support Vector Machine (SVM)
  In class we will show an example of using an SVM for diagnostics based on expression data.
• Unsupervised methods (unsupervised learning)
  - Hierarchical clustering
  - K-means
  In class we will learn a basic algorithm for hierarchical clustering of sequence data to build phylogenetic trees.

Clustering the genes according to expression
Gene cluster: a set of genes that have a similar expression pattern across tissues, that is, high correlation or low Euclidean distance between the expression vectors within the cluster.
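The two similarity measures named above can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the expression values and the names gene_1, gene_2 and gene_3 are hypothetical.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two expression vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_correlation(x, y):
    """Pearson correlation between two expression vectors of equal length."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical expression levels of three genes across five tissues.
gene_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_2 = [1.1, 2.1, 2.9, 4.2, 5.0]  # pattern similar to gene_1
gene_3 = [5.0, 4.0, 3.0, 2.0, 1.0]  # pattern opposite to gene_1

print(pearson_correlation(gene_1, gene_2), euclidean_distance(gene_1, gene_2))
print(pearson_correlation(gene_1, gene_3), euclidean_distance(gene_1, gene_3))
```

In this toy example gene_1 and gene_2 would fall in the same cluster (high correlation, small distance), while gene_3 would not.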

How can gene expression help in diagnostics?

A molecular signature of metastasis in primary solid tumors
Samples were taken from patients with adenocarcinoma. Hundreds of genes that differentiate between cancer tissues at different stages of the tumor were found. The arrow shows an example of tumor cells that were not detected correctly by histological or other clinical parameters.
Ramaswamy et al., 2003, Nat Genet 33:49-54

How can gene expression help in diagnostics?
RESEARCH QUESTION: Can we distinguish BRCA1 from BRCA2 cancers based solely on their gene expression profiles?
Here we want to cluster the patients, not the genes!
(Figure: microarray with axes labeled "Genes" and "Different patients (BRCA1 or BRCA2)". The microarray figure is only for illustration and is not based on real data.)

Supervised approaches for diagnostics based on expression data: Support Vector Machine (SVM)

How can gene expression help in diagnostics?
DATA: Microarray expression of all genes from two types of breast cancer patients (BRCA1 and BRCA2).
(Figure: microarray with axes labeled "Genes" and "Different patients (BRCA1 or BRCA2)". The microarray figure is only for illustration and is not based on real data.)

An SVM would begin with a set of samples from patients who have been diagnosed as either BRCA1 (red dots) or BRCA2 (blue dots). Each dot represents a vector of the expression pattern taken from the microarray experiment of one patient.

How do SVMs work with expression data?
The SVM is trained on data that were classified based on histology. After training the SVM to separate the BRCA1 from the BRCA2 tumors given the expression data, we can apply it to diagnose an unknown tumor for which we have the equivalent expression data.
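The train-then-diagnose workflow can be sketched as follows. The slides do not say which kind of SVM or which training procedure is used, so everything here is an assumption made for illustration: a linear SVM, trained by simple subgradient descent on the regularized hinge loss, on hypothetical two-gene expression vectors. The function names and parameter values are likewise hypothetical.

```python
def train_linear_svm(samples, labels, epochs=200, learning_rate=0.01, reg=0.01):
    """Fit weights w and bias b by subgradient descent on the hinge loss.

    Labels must be +1 or -1. This is a teaching sketch, not a tuned solver.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Regularization step: shrink the weights slightly.
            w = [wi - learning_rate * reg * wi for wi in w]
            # Hinge-loss step: only samples inside the margin push the boundary.
            if margin < 1:
                w = [wi + learning_rate * y * xi for wi, xi in zip(w, x)]
                b += learning_rate * y
    return w, b

def predict(w, b, x):
    """Return +1 (here: BRCA1) or -1 (here: BRCA2) for an expression vector."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical expression of two genes in tumors already diagnosed by histology.
brca1 = [[2.0, 0.5], [2.5, 1.0], [3.0, 0.8], [2.2, 0.3]]
brca2 = [[0.5, 2.0], [1.0, 2.5], [0.8, 3.0], [0.3, 2.2]]
samples = brca1 + brca2
labels = [1] * len(brca1) + [-1] * len(brca2)

w, b = train_linear_svm(samples, labels)

# Diagnose an unknown tumor from its (hypothetical) expression vector.
unknown_tumor = [2.8, 0.6]
print("BRCA1" if predict(w, b, unknown_tumor) == 1 else "BRCA2")
```

Real expression data have thousands of genes per patient rather than two, but the steps are the same: train on labeled tumors, then apply the trained classifier to the unlabeled one.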

Unsupervised approaches for building phylogenetic trees: Hierarchical Clustering

What are phylogenetic trees?

Phylogeny is the inference of evolutionary relationships. (Phylogeny, from Greek: the origin of the tribe.)
Traditionally, phylogeny relied on the comparison of morphological features between organisms. Today, molecular sequence data are mainly used for phylogenetic analyses.
One tree of life: A sketch Darwin made soon after returning from his voyage on HMS Beagle (1831–36) showed his thinking about the diversification of species from a single stock (see figure). This branching, extended by the concept of common descent, …

Classical phylogenetic trees: Haeckel (1879). Modern phylogenetic trees: Pace (2001).

What can we learn from phylogenetic trees?

Human Evolution (figure: Neanderthals and Modern Man)

Phylogenetic trees help to find the relationships between species and to identify new species.
Metagenomics / Microbiome: the aim of these fields is to study the genomes recovered from environmental samples. For example:
- Study the ecology of a specific environment (sea)
- Study the composition of bacteria in our guts

Discover new species in our own gut
The total number of genes in the various species represented in our internal microbial communities (microbiome) likely exceeds the number of our human genes by at least two orders of magnitude.
Suez et al., Nature 2014

How to discover new species?

Extracting phylogenetic trees of known species, then finding relationships between the unknown and known species.
(Figure: a tree of known species A, B, C and D, with an unknown species marked "?".)

Phylogenetic Tree Terminology
• Graph composed of nodes and branches
• Each branch connects two adjacent nodes
• Branch = edge; node = vertex
(Figure: example tree with nodes A, B, C, D, E, F and R.)

Phylogenetic Tree Terminology: rooted tree vs. un-rooted tree
(Figure: the same four species, Human, Chimp, Gorilla and Chicken, drawn once as a rooted tree and once as an un-rooted tree.)

Rooted vs. unrooted trees
(Figure: three taxa, 1, 2 and 3, drawn as a rooted tree and as an unrooted tree.)

How can we build a tree with molecular data?
- Trees based on DNA sequence (rRNA), e.g. atcgatcgtgatcgatcgtagcatcgatgcatcgtacg
- Trees based on protein sequences, e.g. MWRCPYCGKRQWCMWG
- Full genomes

Basic hierarchical clustering algorithm for constructing a phylogenetic tree
Unweighted Pair Group Method using Arithmetic Averages (UPGMA)
Assumption: the distance of all nodes to the root is equal.
Sequence a ACGCGTTGGGCGATGGCAAC
Sequence b ACGCGTTGGGCGACGGTAAT
Sequence c ACGCATTGAATGATGATAAT
Sequence d ACACATTGAGTGTGATAATA
(Figure: tree with leaves a, b, c and d.)
In the tutorial you will learn the Neighbor Joining (NJ) algorithm, which does not assume equal distance to the root.

Moving from Similarity to Distance
Sequences:
Sequence a ACGCGTTGGGCGATGGCAAC
Sequence b ACACATTGAGTGTGATCAAC
Sequence c ACACATTGAGTGAGGACAAC
Sequence d ACGCGTTGGGCGACGGTAAT
Distance table*:
Dab = 8
Dac = 7
Dad = 5
Dbc = 3
Dbd = 9
Dcd = 8
* Can be calculated using different distance metrics
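One of the simplest distance metrics is the number of mismatched positions between two aligned sequences. The sketch below computes it for the four sequences on this slide. It is an illustration of one possible metric only: as the footnote says, other metrics can be used, so these counts need not reproduce the slide's table for every pair. The name mismatch_count is hypothetical.

```python
from itertools import combinations

sequences = {
    "a": "ACGCGTTGGGCGATGGCAAC",
    "b": "ACACATTGAGTGTGATCAAC",
    "c": "ACACATTGAGTGAGGACAAC",
    "d": "ACGCGTTGGGCGACGGTAAT",
}

def mismatch_count(s, t):
    """Number of positions at which two aligned sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(s, t) if x != y)

# Pairwise distances, keyed by the two sequence names in alphabetical order.
distances = {
    (p, q): mismatch_count(sequences[p], sequences[q])
    for p, q in combinations(sorted(sequences), 2)
}

for (p, q), d in distances.items():
    print(f"D{p}{q} = {d}")
```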

Constructing a tree starting from a STAR model
(Figure: star with leaves a, b, c and d.)
Step 1: Choose the nodes with the shortest distance and fuse them.
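Applied to the distance table from the previous slide, Step 1 amounts to finding the smallest entry. A minimal sketch (the variable names are hypothetical):

```python
# Distance table from the previous slide.
distance_table = {
    ("a", "b"): 8, ("a", "c"): 7, ("a", "d"): 5,
    ("b", "c"): 3, ("b", "d"): 9, ("c", "d"): 8,
}

# The pair with the shortest distance is the one to fuse first.
closest_pair = min(distance_table, key=distance_table.get)
print(closest_pair, distance_table[closest_pair])
```

With these values the closest pair is b and c (distance 3), which is the pair fused into the new node e in the next step.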

Step 2: Recalculate the distance from each of the remaining sequences (a and d) to the new node (e), and remove the fused nodes (c and b) from the table.
D(ea) = (D(ac) + D(ab) - D(cb)) / 2
D(ed) = (D(dc) + D(db) - D(cb)) / 2
(Figure: a and d connected to the new node e, which replaces the fused pair c, b.)
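Plugging the values of the distance table into the two formulas of Step 2 gives the reduced table. The sketch below follows the formulas exactly as written on the slide; the function name is hypothetical.

```python
def distance_to_fused_node(d_xi, d_xj, d_ij):
    """Distance from a remaining node x to the node that replaces i and j,
    using the slide's rule (D(xi) + D(xj) - D(ij)) / 2."""
    return (d_xi + d_xj - d_ij) / 2

# Distance table from the earlier slide.
D = {"ab": 8, "ac": 7, "ad": 5, "bc": 3, "bd": 9, "cd": 8}

d_ea = distance_to_fused_node(D["ac"], D["ab"], D["bc"])  # (7 + 8 - 3) / 2 = 6.0
d_ed = distance_to_fused_node(D["cd"], D["bd"], D["bc"])  # (8 + 9 - 3) / 2 = 7.0

# Reduced table: c and b are removed, e is added.
reduced_table = {("a", "d"): D["ad"], ("a", "e"): d_ea, ("d", "e"): d_ed}
print(reduced_table)
```

Note that many textbook descriptions of UPGMA update the table with the plain arithmetic average, (D(ac) + D(ab)) / 2, without subtracting D(cb); the code above deliberately uses the formula given on the slide.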

Step 3: In order to get a tree, un-fuse c and b by calculating their distances to the new node (e).
(Figure: c and b attached to node e by branches labeled with their distances to e; a and d still attached to e.)
The distances Dce and Dbe are calculated assuming constant-rate evolution (c and b are equally distant from the root). This will be taught in the tutorial.
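The slides defer the actual calculation to the tutorial. One common reading of the constant-rate assumption, used here purely as an assumption, is that node e lies midway between c and b, so each of the two branches gets half of D(cb). The function name is hypothetical.

```python
def branch_lengths_constant_rate(d_ij):
    """Split the distance between two fused leaves equally between their
    branches (assumed rule: both leaves are equally far from the new node)."""
    half = d_ij / 2
    return half, half

# D(cb) = 3 in the distance table, so both branches would have length 1.5.
d_ce, d_be = branch_lengths_constant_rate(3)
print(d_ce, d_be)
```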

Next… We want to fuse the next closest nodes (a and d) into a new node (f).
(Figure: c and b attached to node e; a and d fused into the new node f.)

Finally, we need to calculate the distance between e and f:
D(ef) = (D(ea) + D(ed) - D(ad)) / 2
(Figure: the tree with internal nodes e and f and labeled branch lengths.)
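The whole walk from star to tree can be sketched as one loop that repeats the fuse-and-recalculate steps until only two nodes remain. This is an illustrative sketch that follows the update formula shown on the slides, not a reference UPGMA implementation; the function names and the way merges are recorded are assumptions.

```python
from itertools import combinations

def dist(table, x, y):
    """Look up the distance between nodes x and y."""
    return table[frozenset((x, y))]

def star_to_tree(pair_distances, new_names):
    """Repeatedly fuse the closest pair, using the slides' update rule.

    Returns the list of merges (new node, child, child, distance between
    the children) and the final pair of nodes with their distance.
    """
    table = {frozenset(pair): d for pair, d in pair_distances.items()}
    nodes = sorted({n for pair in table for n in pair})
    names = iter(new_names)
    merges = []
    while len(nodes) > 2:
        # Step 1: choose the two closest nodes.
        i, j = min(combinations(nodes, 2), key=lambda p: dist(table, *p))
        new = next(names)
        d_ij = dist(table, i, j)
        rest = [n for n in nodes if n not in (i, j)]
        # Step 2: distance from every remaining node to the new node.
        for x in rest:
            table[frozenset((x, new))] = (dist(table, x, i) + dist(table, x, j) - d_ij) / 2
        merges.append((new, i, j, d_ij))
        nodes = sorted(rest + [new])
    return merges, (nodes[0], nodes[1], dist(table, nodes[0], nodes[1]))

# Distance table from the earlier slide.
pair_distances = {
    ("a", "b"): 8, ("a", "c"): 7, ("a", "d"): 5,
    ("b", "c"): 3, ("b", "d"): 9, ("c", "d"): 8,
}

merges, final_pair = star_to_tree(pair_distances, new_names=["e", "f"])
print(merges)
print(final_pair)
```

With the slide's table, b and c are fused into e first, then a and d into f, and the last step gives D(ef) = (6 + 7 - 5) / 2 = 4.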

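From a Star to a tree
(Figure: the initial star with leaves a, b, c and d, and the final tree in which b and c join at node e and a and d join at node f.)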

Human Evolution Tree: UPGMA vs. Neighbor Joining (figure)

The downside of phylogenetic trees: using different regions of the same alignment may produce different trees.

Problems with phylogenetic trees
(Figure: different trees of the same taxa: Bacillus, Burkholderias, Aeromonas, Pseudomonas, Lechevaliera, E. coli and Salmonella.)

Problems with phylogenetic trees: what to do?

Bootstrapping
• We create new data sets by sampling N positions with replacement.
• We generate 100-1000 such pseudo-data sets.
• For each such data set we reconstruct a tree, using the same method.
• We note the agreement between the tree reconstructed from the pseudo-data set and the original tree.
• Note: we do not change the number of sequences!
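The resampling step can be sketched as follows, taking N to be the number of positions (columns) in the alignment, which is the usual convention. The example reuses the four sequences from the distance-table slide as a toy alignment; the function name, the fixed random seed and the choice of 100 replicates are illustrative assumptions. Tree reconstruction and comparison are left to whichever method is being evaluated.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Build one pseudo-data set by sampling alignment columns with replacement.

    The same sampled columns are used for every sequence, so the number of
    sequences and the alignment length stay unchanged.
    """
    n_positions = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_positions) for _ in range(n_positions)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}

alignment = {
    "a": "ACGCGTTGGGCGATGGCAAC",
    "b": "ACACATTGAGTGTGATCAAC",
    "c": "ACACATTGAGTGAGGACAAC",
    "d": "ACGCGTTGGGCGACGGTAAT",
}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
pseudo_datasets = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(pseudo_datasets[0])
```

Each pseudo-data set would then be passed to the same tree-building method, and the fraction of pseudo-trees that contain a given branch of the original tree serves as that branch's support.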

Bootstrapped tree (figure: one branch marked as less reliable and another as highly reliable)

Stimulating questions
• Do DNA and proteins from the same gene produce different trees?
• Can different genes have different evolutionary histories?
