A New Biclustering Algorithm for Analyzing Biological Data • Prashant Paymal • Advisor: Dr. Hesham Ali
Introduction • Microarray technology is used to study the expression of many genes at once • Microarray experiments produce large amounts of data • Proper analysis of the data is important for extracting meaningful information from it • There is a need for new analysis techniques
Data Analysis • From data to knowledge • We need to process data by grouping and synthesizing information into a “big picture” based on characteristics and relationships • One of the most widely used analysis techniques is traditional clustering
Traditional Clustering • Applied to either the rows or the columns of the data matrix separately • Each gene is defined using all the conditions • Each condition is characterized by the activity of all the genes that belong to it • [Figure: two data-matrix panels with axes Genes and Conditions]
Motivation • The large amount of data poses great challenges for analysis • Clustering algorithms consider all the conditions to group genes and all the genes to group conditions • Biological data may show similar behavior only in a subset of the conditions, not in all of them • Traditional clustering algorithms will very likely miss some important information
Biclustering • The term “biclustering” was first used in gene expression data analysis by Cheng and Church (2000) • Clusters do not need to include all parameters (genes, in bioinformatics) for all conditions • Data matrix • Each gene – one row • Each condition – one column • Each element – expression level of a gene under a specific condition
Biclustering (Cont.) • Performs clustering in these two dimensions simultaneously • Each gene is selected using only a subset of the conditions • Each condition is selected using only a subset of the genes • [Figure: data matrix with axes Genes and Conditions]
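A bicluster can be pictured as a submatrix of the expression matrix, picked out by a subset of genes (rows) and a subset of conditions (columns). The following is a minimal Python sketch of that idea; the toy matrix values and the helper name `submatrix` are invented for illustration and do not come from the slides.

```python
# Toy expression matrix: rows are genes, columns are conditions.
# All values are invented for illustration.
expression = [
    [5.0, 1.0, 5.1, 9.0],   # gene 0
    [5.0, 7.0, 5.1, 2.0],   # gene 1
    [0.5, 3.0, 8.0, 4.0],   # gene 2
]

def submatrix(matrix, genes, conditions):
    """Return the rows in `genes` restricted to the columns in `conditions`."""
    return [[matrix[g][c] for c in conditions] for g in genes]

# Over all four conditions, genes 0 and 1 look different...
full = submatrix(expression, [0, 1], [0, 1, 2, 3])
# ...but on the condition subset {0, 2} they behave identically.
bicluster = submatrix(expression, [0, 1], [0, 2])
print(bicluster)
```

Clustering on all columns would compare the rows of `full`, whereas a biclustering method is free to report the pair (genes {0, 1}, conditions {0, 2}).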
Goal of Biclustering • To identify subgroups of genes and subgroups of conditions by clustering the rows and columns of the gene expression matrix simultaneously, instead of clustering these two dimensions separately • Finding biclusters is an NP-hard problem; it is in fact a generalization of traditional clustering
Previous Work • A systematic comparison and evaluation of biclustering methods for gene expression data – Amela Prelic et al. (2006) • Algorithms: • Statistical-Algorithmic Method for Bicluster Analysis (SAMBA) • Order-Preserving Submatrix Algorithm (OPSM) • Iterative Signature Algorithm (ISA) • Cheng and Church algorithm • xMotif • Bimax
Previous Work (Cont.) • Comparative Analysis of Biclustering Algorithms – Doruk Bozdag … (2010) • Algorithms: • Correlated Pattern Bicluster Algorithm (CPB) • Cheng and Church algorithm • Order-Preserving Submatrix Algorithm (OPSM) • HARP algorithm • Minimum Sum-Squared Residue-based Co-Clustering Algorithm (MSSRCC) • Statistical-Algorithmic Method for Bicluster Analysis (SAMBA)
The Importance of Assessment • Different algorithms give different solutions for the same data • There is no agreed-upon guideline for choosing among them • Validation techniques • External validation measures • Evaluate a result based on knowledge of the correct class labels • Internal validation measures • Evaluate a result based on information intrinsic to the data alone
Validation • In most biclustering papers, external validation measures are used to assess the methods • It is not clear how to extend notions such as homogeneity and separation to the biclustering context (Gat-Viks et al., 2003) • Internal measures do not work well for biclustering, which is why Gat-Viks et al. (2003) and Handl et al. (2005) recommend external measures
Objectives of the Project • Comprehensive Assessment Technique • Internal measures as well as external measures • Customized Biclustering Method • Input domain
Validation using Synthetic Data • Testing using manufactured data • The portion of the implanted bicluster that the algorithm was able to return • The portion of the algorithm's output that is external or irrelevant to the implanted bicluster • Two metrics to evaluate cluster quality • U: uncovered portion of the implanted bicluster • E: portion of the output cluster external to the implanted bicluster
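The two metrics U and E can be sketched in Python as follows. The slides do not say how the "portion" is measured; this sketch assumes it is a fraction of (gene, condition) cells, and the function names `cells` and `uncovered_and_external` are hypothetical.

```python
from itertools import product

def cells(genes, conditions):
    """All (gene, condition) cells covered by a bicluster."""
    return set(product(genes, conditions))

def uncovered_and_external(implanted, output):
    """Return (U, E) for one implanted bicluster and one output bicluster.

    U: fraction of the implanted cells that the output misses.
    E: fraction of the output cells that lie outside the implanted bicluster.
    Measuring both as cell fractions is an assumption of this sketch.
    """
    truth = cells(*implanted)
    found = cells(*output)
    u = len(truth - found) / len(truth) if truth else 0.0
    e = len(found - truth) / len(found) if found else 0.0
    return u, e

implanted = (range(0, 10), range(0, 10))   # 10 x 10 implanted bicluster
output = (range(5, 15), range(0, 10))      # output shifted by five genes
u, e = uncovered_and_external(implanted, output)
print(u, e)
```

Under this reading, a perfect result has U = 0 and E = 0; the shifted output in the example misses half of the implanted cells and half of its own cells are external.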
Validation using Real Data • Testing using real (domain-specific) data, for example using the gene match score • Let M1, M2 be two sets of biclusters • Average of the maximum match scores for all biclusters in M1 with respect to the biclusters in M2 • Potential improvements • The gene match score does not consider samples / conditions • Specificity and sensitivity
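The gene match score can be sketched in Python as below. The sketch assumes that the match score between two biclusters is the Jaccard coefficient of their gene sets, following the usual definition in Prelic et al.; the function names and the toy biclusters are illustrative.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def gene_match_score(m1, m2):
    """Average, over biclusters in m1, of the best gene-set match in m2.

    Each bicluster is a (genes, conditions) pair; the conditions are
    ignored here, which is the limitation noted on the slide.
    """
    if not m1 or not m2:
        return 0.0
    return sum(max(jaccard(g1, g2) for g2, _ in m2) for g1, _ in m1) / len(m1)

# Toy bicluster sets (invented for illustration).
m1 = [({1, 2, 3}, {0, 1}), ({7, 8}, {2})]
m2 = [({1, 2, 3, 4}, {0, 1}), ({9}, {2})]
score = gene_match_score(m1, m2)
print(score)
```

Because the conditions never enter the computation, two biclusters with identical genes but disjoint conditions would still score 1.0.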
Proposed Assessment • Calculate sensitivity and specificity scores • Specificity: proportion of actual negatives that are correctly identified • Sensitivity: proportion of actual positives that are correctly identified • Improve existing measures: • Average of the maximum match scores for all biclusters in M1 with respect to the biclusters in M2 (considering both genes and samples) • Assessment based on knowledge of domain data • The resulting biclusters are evaluated based on the enrichment of Gene Ontology (GO) terms
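Sensitivity and specificity can be illustrated with a short Python sketch. It assumes that "positives" are the (gene, condition) cells belonging to the true bicluster and "negatives" are all other cells of the matrix; the slides do not fix this unit, and the function name is hypothetical.

```python
from itertools import product

def sensitivity_specificity(true_cells, predicted_cells, n_genes, n_conditions):
    """Cell-level sensitivity and specificity.

    Assumes every cell lies inside the n_genes x n_conditions matrix.
    """
    tp = len(true_cells & predicted_cells)   # true cells that were found
    fn = len(true_cells - predicted_cells)   # true cells that were missed
    fp = len(predicted_cells - true_cells)   # reported cells that are wrong
    tn = n_genes * n_conditions - tp - fn - fp
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity

# Toy example on a 100 x 100 matrix (sizes chosen for illustration).
true_cells = set(product(range(0, 10), range(0, 10)))
predicted_cells = set(product(range(0, 8), range(0, 12)))
sens, spec = sensitivity_specificity(true_cells, predicted_cells, 100, 100)
print(sens, spec)
```

Since biclusters usually cover a small part of the matrix, the negatives dominate and specificity tends to stay close to 1, so sensitivity is typically the more discriminating of the two.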
Experiments • Given two biclustering results • M1: result of a biclustering algorithm • M2: true result • (G1, C1) ∈ M1 and (G2, C2) ∈ M2 • Calculate the similarity score (Jaccard coefficient) between the gene sets and between the condition sets: |G1 ∩ G2| / |G1 ∪ G2| and |C1 ∩ C2| / |C1 ∪ C2| • Calculate the two scores: • Score 1: percentage of the algorithm's result that is included in the true result • Score 2: percentage of the true result that the algorithm can find
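One possible way to compute the two scores is sketched below in Python. The slides do not give the exact formulas, so this sketch makes two assumptions: the similarity of two biclusters is the Jaccard coefficient over their (gene, condition) cells, which accounts for both dimensions, and Score 1 and Score 2 are the averaged best-match scores taken in the two directions (M1 against M2, and M2 against M1). All function names are hypothetical.

```python
def cell_jaccard(b1, b2):
    """Jaccard coefficient of two biclusters over (gene, condition) cells."""
    (g1, c1), (g2, c2) = b1, b2
    g1, c1, g2, c2 = set(g1), set(c1), set(g2), set(c2)
    shared = len(g1 & g2) * len(c1 & c2)
    total = len(g1) * len(c1) + len(g2) * len(c2) - shared
    return shared / total if total else 0.0

def match_score(source, target):
    """Average, over biclusters in source, of the best match in target."""
    if not source or not target:
        return 0.0
    return sum(max(cell_jaccard(b, t) for t in target) for b in source) / len(source)

# Toy results (invented): the algorithm finds the true bicluster plus a spurious one.
m1 = [({0, 1, 2, 3}, {0, 1}), ({20, 21}, {5, 6})]   # algorithm result
m2 = [({0, 1, 2, 3}, {0, 1})]                        # true result
score1 = match_score(m1, m2)   # how much of the result is in the true result
score2 = match_score(m2, m1)   # how much of the true result was found
print(score1, score2)
```

In this example the spurious bicluster lowers Score 1 but leaves Score 2 untouched, which is why the two directions are reported separately.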
Results • Synthetic data: 100 genes and 100 samples • 10 implanted biclusters, each of size 10 × 10 (10 genes and 10 samples) • Used different publicly available biclustering algorithm implementations • Score 1: percentage of the algorithm's result that is included in the true result • Score 2: percentage of the true result that the algorithm can find
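A synthetic dataset of this shape could be generated along the following lines. The slides state only the matrix size and the number and size of the implanted biclusters; the uniform background noise, the constant signal value, the non-overlapping diagonal placement, and the function name `make_synthetic` are all assumptions of this Python sketch.

```python
import random

def make_synthetic(n_genes=100, n_conditions=100, n_biclusters=10, size=10, seed=0):
    """Build a noisy matrix with constant-valued biclusters on the diagonal."""
    rng = random.Random(seed)
    # Background: uniform noise in [0, 1) (an assumption of this sketch).
    matrix = [[rng.random() for _ in range(n_conditions)] for _ in range(n_genes)]
    implanted = []
    for k in range(n_biclusters):
        genes = range(k * size, (k + 1) * size)
        conditions = range(k * size, (k + 1) * size)
        for g in genes:
            for c in conditions:
                matrix[g][c] = 5.0   # constant signal, well above the background
        implanted.append((set(genes), set(conditions)))
    return matrix, implanted

matrix, implanted = make_synthetic()
print(len(matrix), len(matrix[0]), len(implanted))
```

The returned `implanted` list plays the role of the true result against which an algorithm's output is scored.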
Conclusion • Traditional clustering is too restrictive a technique for analyzing datasets in various application domains • We need new, flexible analysis techniques such as biclustering to deal with possible imperfections in the input datasets • Assessment of data analysis is critical and must be considered when selecting the right tool for each application domain • Biclustering is a powerful tool for data analysis in a variety of domains and is applicable to datasets beyond biology
References • Madeira, S.C., Oliveira, A.L.: Biclustering algorithms for biological data analysis: A survey • Amela Prelic et al.: A systematic comparison and evaluation of biclustering methods for gene expression data • http://cheng.ececs.uc.edu/biclustering • http://www.tik.ethz.ch/~sop/bicat/ • http://acgt.cs.tau.ac.il/expander/