Clustering Biological Data

bryantt

Presentation Transcript


  1. Clustering Biological Data

  2. Genes can be clustered according to their expression or their sequences
     • Clustering genes according to their expression is used for identifying:
       - Gene functions (common functions, genes that work together, genes that are controlled by the same factor)
       - Diagnostics
     • Clustering genes according to their sequence (DNA, RNA, protein)
       Useful for building phylogenetic trees:
       - Study evolution
       - Identify new species
       - Gene function
       - Diagnostics

  3. Different clustering approaches
     • Supervised methods (supervised learning)
       - Support Vector Machine (SVM)
       In class we will show an example of using an SVM for diagnostics based on expression data.
     • Unsupervised methods (unsupervised learning)
       - Hierarchical clustering
       - K-means
       In class we will learn a basic algorithm for hierarchical clustering of sequence data to build phylogenetic trees.

  4. Clustering the genes according to expression. Gene cluster: a set of genes that have a similar expression pattern across tissues, i.e. high correlation or low Euclidean distance between the expression vectors within the cluster.
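
The two similarity measures named on this slide can be sketched as follows. This is a minimal illustration only: the function names `euclidean` and `pearson` and the three toy expression vectors are assumptions made for the example, not data from the presentation.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def pearson(u, v):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((x - mean_u) * (y - mean_v) for x, y in zip(u, v))
    norm_u = math.sqrt(sum((x - mean_u) ** 2 for x in u))
    norm_v = math.sqrt(sum((y - mean_v) ** 2 for y in v))
    return cov / (norm_u * norm_v)

# Hypothetical expression levels of three genes across four tissues.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [1.5, 2.5, 3.5, 4.5]  # same pattern as gene_a, shifted by 0.5
gene_c = [4.0, 3.0, 2.0, 1.0]  # opposite pattern to gene_a

print(pearson(gene_a, gene_b), euclidean(gene_a, gene_b))
print(pearson(gene_a, gene_c), euclidean(gene_a, gene_c))
```

In this toy case gene_a and gene_b would fall in the same cluster (correlation close to 1, small distance), while gene_c would not.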

  5. How can gene expression help in diagnostics?

  6. A molecular signature of metastasis in primary solid tumors. Samples were taken from patients with adenocarcinoma. Hundreds of genes that differentiate between cancer tissues at different stages of the tumor were found. The arrow shows an example of tumor cells that were not detected correctly by histological or other clinical parameters. Ramaswamy et al., 2003, Nat Genet 33:49-54

  7. How can gene expression help in diagnostics? Research question: can we distinguish BRCA1 from BRCA2 cancers based solely on their gene expression profiles? Here we want to cluster the patients, not the genes. [Figure: microarray with axes labeled "Genes" and "Different patients (BRCA1 or BRCA2)"; the figure is only for illustration and is not based on real data.]

  8. Supervised approaches for diagnostics based on expression data: Support Vector Machine (SVM)

  9. How can gene expression help in diagnostics? Data: microarray expression of all genes from two types of breast cancer patients (BRCA1 and BRCA2). [Figure: microarray with axes labeled "Genes" and "Different patients (BRCA1 or BRCA2)"; the figure is only for illustration and is not based on real data.]

  10. An SVM would begin with a set of samples from patients who have been diagnosed as either BRCA1 (red dots) or BRCA2 (blue dots). Each dot represents the expression vector obtained from a patient's microarray experiment.

  11. How do SVMs work with expression data? The SVM is trained on data that were classified based on histology. After training the SVM to separate BRCA1 from BRCA2 tumors given the expression data, we can apply it to diagnose an unknown tumor for which we have the equivalent expression data.
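
The train-then-diagnose workflow described on this slide can be sketched with a small linear SVM. Everything here is an assumption made for illustration: the function names `train_linear_svm` and `predict`, the two-gene toy expression vectors, the choice of +1 for BRCA1 and -1 for BRCA2, and the use of subgradient descent on the regularized hinge loss as the training procedure. Real expression data would have thousands of genes per patient, and the method shown in class may differ.

```python
def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Fit weights w and bias b by subgradient descent on the hinge loss.

    labels must be +1 or -1; lam is the regularization strength.
    """
    dim = len(samples[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Misclassified or inside the margin: the hinge term is active.
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Correct with enough margin: only the regularizer acts on w.
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the separating hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical training set: expression of two genes per patient,
# labeled by histology as BRCA1 (+1) or BRCA2 (-1).
train_x = [[2.0, 0.5], [2.5, 1.0], [3.0, 0.5],
           [0.5, 2.0], [1.0, 2.5], [0.5, 3.0]]
train_y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(train_x, train_y)

# Diagnose tumors of unknown type from their expression vectors.
unknown_1 = [2.8, 0.7]
unknown_2 = [0.6, 2.4]
print(predict(w, b, unknown_1), predict(w, b, unknown_2))
```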

  12. Unsupervised approaches for building phylogenetic trees: Hierarchical Clustering

  13. What are phylogenetic trees?

  14. Phylogeny (Greek: "the origin of the tribe") is the inference of evolutionary relationships. Traditionally, phylogeny relied on the comparison of morphological features between organisms. Today, molecular sequence data are mainly used for phylogenetic analyses. One tree of life: a sketch Darwin made soon after returning from his voyage on HMS Beagle (1831–36) showed his thinking about the diversification of species from a single stock (see Figure, overleaf). This branching, extended by the concept of common descent, ...

  15. Classical phylogenetic trees: Haeckel (1879). Modern phylogenetic trees: Pace (2001).

  16. What can we learn from phylogenetic trees?

  17. Human evolution: Neanderthals and modern man

  18. Phylogenetic trees help to find the relationships between species and to identify new species. Metagenomics / Microbiome: the aim of these fields is to study the genomes recovered from environmental samples. For example:
      - Study the ecology of a specific environment (sea)
      - Study the composition of bacteria in our guts

  19. Discover new species in our own gut. The total number of genes in the various species represented in our internal microbial communities (microbiome) likely exceeds the number of our human genes by at least two orders of magnitude. Suez et al., Nature 2014

  20. How to discover new species?

  21. Extracting phylogenetic trees of known species (A, B, C, D) and finding relationships between the unknown species (?) and the known species

  22. Phylogenetic Tree Terminology
      • Graph composed of nodes and branches (node = vertex, branch = edge)
      • Each branch connects two adjacent nodes
      [Figure: tree with nodes labeled A, B, C, D, E, F and R]

  23. Phylogenetic Tree Terminology: rooted tree vs. un-rooted tree. [Figure: a rooted tree and an un-rooted tree, each with the leaves Human, Chimp, Gorilla and Chicken]

  24. Rooted vs. unrooted trees. [Figure: a rooted tree and an unrooted tree, each with the leaves 1, 2 and 3]

  25. How can we build a tree with molecular data?
      - Trees based on DNA sequences (rRNA), e.g. atcgatcgtgatcgatcgtagcatcgatgcatcgtacg
      - Trees based on protein sequences, e.g. MWRCPYCGKRQWCMWG
      - Full genomes

  26. Basic hierarchical clustering algorithm for constructing a phylogenetic tree: Unweighted Pair Group Method using Arithmetic Averages (UPGMA)
      Assumption: the distance of all leaves (sequences) to the root is equal
        Sequence a  ACGCGTTGGGCGATGGCAAC
        Sequence b  ACGCGTTGGGCGACGGTAAT
        Sequence c  ACGCATTGAATGATGATAAT
        Sequence d  ACACATTGAGTGTGATAATA
      [Figure: tree with the leaves a, b, c, d]
      In the tutorial you will learn the Neighbor Joining (NJ) algorithm, which does not assume equal distance to the root.

  27. Moving from Similarity to Distance
      Sequences:
        Sequence a  ACGCGTTGGGCGATGGCAAC
        Sequence b  ACACATTGAGTGTGATCAAC
        Sequence c  ACACATTGAGTGAGGACAAC
        Sequence d  ACGCGTTGGGCGACGGTAAT
      Distance table*:
        Dab = 8   Dac = 7   Dad = 5
        Dbc = 3   Dbd = 9   Dcd = 8
      * Can be calculated using different distance metrics
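
One of the simplest distance metrics is the number of mismatching positions between two aligned sequences (the Hamming distance). The sketch below applies it to the sequences of this slide; the function name `hamming` is an assumption. Because the slide notes that different metrics can be used, the counts produced here need not agree with every entry of the slide's distance table.

```python
from itertools import combinations

def hamming(seq1, seq2):
    """Number of positions at which two equal-length sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(seq1, seq2) if x != y)

# Sequences as listed on the slide.
sequences = {
    "a": "ACGCGTTGGGCGATGGCAAC",
    "b": "ACACATTGAGTGTGATCAAC",
    "c": "ACACATTGAGTGAGGACAAC",
    "d": "ACGCGTTGGGCGACGGTAAT",
}

# Mismatch count for every pair of sequences.
for (n1, s1), (n2, s2) in combinations(sequences.items(), 2):
    print(f"D{n1}{n2} = {hamming(s1, s2)}")
```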

  28. Constructing a tree starting from a STAR model. Step 1: Choose the nodes with the shortest distance and fuse them. [Figure: star with the leaves a, b, c, d]

  29. Step 2: Recalculate the distances from the remaining sequences (a and d) to the new node (e), and remove the fused nodes (c and b) from the table.
      D(ea) = (D(ac) + D(ab) - D(cb))/2
      D(ed) = (D(dc) + D(db) - D(cb))/2
      [Figure: star in which c and b are fused into the new node e, which is connected to a and d]
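
Steps 1 and 2 can be sketched in code as follows, using the distance table from slide 27 (Dab = 8, Dac = 7, Dad = 5, Dbc = 3, Dbd = 9, Dcd = 8). The function name `fuse_closest` and the dictionary-of-frozensets representation are assumptions made for this example; the update rule is the formula written on this slide.

```python
def fuse_closest(dist, new_name):
    """Fuse the closest pair of nodes and return (pair, updated table).

    dist maps frozenset({x, y}) to the distance between nodes x and y.
    """
    # Step 1: choose the pair of nodes with the shortest distance.
    pair = min(dist, key=dist.get)
    i, j = sorted(pair)
    d_ij = dist[pair]
    others = {name for key in dist for name in key} - pair
    # Step 2: drop the fused nodes, then add distances to the new node
    # using D(new, k) = (D(i, k) + D(j, k) - D(i, j)) / 2.
    updated = {key: d for key, d in dist.items() if not (key & pair)}
    for k in others:
        d_ik = dist[frozenset({i, k})]
        d_jk = dist[frozenset({j, k})]
        updated[frozenset({new_name, k})] = (d_ik + d_jk - d_ij) / 2
    return (i, j), updated

dist = {
    frozenset("ab"): 8, frozenset("ac"): 7, frozenset("ad"): 5,
    frozenset("bc"): 3, frozenset("bd"): 9, frozenset("cd"): 8,
}
pair, dist_after = fuse_closest(dist, "e")
print(pair)                         # ('b', 'c')
print(dist_after[frozenset("ae")])  # 6.0
print(dist_after[frozenset("de")])  # 7.0
```

Calling `fuse_closest` again on the updated table would fuse the next closest pair. Note that this follows the formula as written on the slide; textbook descriptions of UPGMA usually take the plain average of the two distances instead.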

  30. Step 3: To get a tree, un-fuse c and b by calculating their distances to the new node (e). The distances Dce and Dbe are calculated assuming constant-rate evolution (c and b are equally distant from the root); this will be taught in the tutorial. [Figure: c and b attached to node e, which is connected to a and d]

  31. Next, we want to fuse the next closest nodes, a and d, into a new node (f). [Figure: c and b attached to node e; a and d fused into node f]

  32. Finally, we need to calculate the distance between e and f:
      D(ef) = (D(ea) + D(ed) - D(ad))/2
      [Figure: tree in which node e joins c and b, node f joins a and d, and e is connected to f]
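
As a worked example, assume the distance table from slide 27 (Dab = 8, Dac = 7, Dad = 5, Dbc = 3, Dbd = 9, Dcd = 8), with b and c fused into e and then a and d fused into f. Under these assumptions, the formulas from slides 29 and 32 give:

```latex
\begin{align*}
D(ea) &= \frac{D(ac) + D(ab) - D(cb)}{2} = \frac{7 + 8 - 3}{2} = 6 \\
D(ed) &= \frac{D(dc) + D(db) - D(cb)}{2} = \frac{8 + 9 - 3}{2} = 7 \\
D(ef) &= \frac{D(ea) + D(ed) - D(ad)}{2} = \frac{6 + 7 - 5}{2} = 4
\end{align*}
```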

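  33. From a star to a tree. [Figure: the star with the leaves a, b, c, d is resolved into a tree with the internal nodes e (joining b and c) and f (joining a and d)]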

  34. Human evolution tree: UPGMA vs. Neighbor Joining

  35. The downside of phylogenetic trees: using different regions of the same alignment may produce different trees.

  36. Problems with phylogenetic trees

  37. Problems with phylogenetic trees. [Figure: alternative trees of the same taxa (Bacillus, Burkholderias, Aeromonas, Pseudomonas, Lechevaliera, E. coli, Salmonella) in which the taxa appear in different positions]

  38. Problems with phylogenetic trees: what to do?

  39. Bootstrapping
      • We create new data sets by sampling N positions (alignment columns) with replacement, where N is the length of the alignment.
      • We generate 100-1000 such pseudo-data sets.
      • For each such data set we reconstruct a tree, using the same method.
      • We note the agreement between the tree reconstructed from each pseudo-data set and the original tree.
      • Note: we do not change the number of sequences!
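
The resampling step of the bootstrap can be sketched as below. The function name `bootstrap_alignment`, the fixed random seed, and the use of the four sequences from slide 27 as the alignment are assumptions made for illustration. Reconstructing a tree from each pseudo-data set and comparing it with the original tree are not shown.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Return one pseudo-data set: N columns sampled with replacement."""
    n_positions = len(alignment[0])
    # Draw N column indices with replacement; the same column may repeat.
    columns = [rng.randrange(n_positions) for _ in range(n_positions)]
    # Apply the same column choice to every sequence, so the number of
    # sequences and their length stay unchanged.
    return ["".join(seq[c] for c in columns) for seq in alignment]

alignment = [
    "ACGCGTTGGGCGATGGCAAC",
    "ACACATTGAGTGTGATCAAC",
    "ACACATTGAGTGAGGACAAC",
    "ACGCGTTGGGCGACGGTAAT",
]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
replicates = [bootstrap_alignment(alignment, rng) for _ in range(100)]
print(replicates[0])
```

Sampling whole columns, rather than individual characters, keeps each position's pattern across the sequences intact, which is what the tree-building method relies on.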

  40. Bootstrapped tree. [Figure: bootstrapped tree with one branch marked as less reliable and one branch marked as highly reliable]

  41. Stimulating questions
      • Do DNA and protein sequences from the same gene produce different trees?
      • Can different genes have different evolutionary histories?
