A New Biclustering Algorithm for Analyzing Biological Data

Prashant Paymal • Advisor: Dr. Hesham Ali

Introduction • Microarray technology is used to study the expression of many genes at once • Microarray experiments produce large amounts of data • Proper analysis of the data is important to extract meaningful information from it • There is a need for new analysis techniques

Data Analysis • From data to knowledge • We need to process data by grouping and synthesizing information into a “big picture” based on characteristics and relationships • One of the most widely used analysis techniques is traditional clustering

Traditional Clustering • Applied to either the rows or the columns of the data matrix separately • Each gene is defined using all the conditions • Each condition is characterized by the activity of all the genes that belong to it • [Figure: two Genes × Conditions matrices]

Motivation • The large amount of data poses great challenges for analysis • Clustering algorithms consider all the conditions to group genes and all the genes to group conditions • Biologically, genes may not show similar behavior under all conditions, but only under a subset of them • Traditional clustering algorithms will very likely miss some important information

Biclustering • The term “biclustering” was first used in gene expression data analysis by Cheng and Church (2000) • Clusters do not need to include all parameters (genes in bioinformatics) for all conditions • Data Matrix • Each gene: one row • Each condition: one column • Each element: expression level of a gene under a specific condition

Biclustering (Cont.) • Performs clustering in these two dimensions simultaneously • Each gene is selected using only a subset of the conditions • Each condition is selected using only a subset of the genes • [Figure: Genes × Conditions matrix]
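To make the row-and-column selection concrete, the following is a minimal sketch of a bicluster as a submatrix of the data matrix. The matrix values, the gene and condition labels, and the helper name `extract_bicluster` are all made up for illustration; they are not taken from the presentation.

```python
# Toy expression matrix: rows are genes, columns are conditions.
# All values are invented for illustration only.
genes = ["g1", "g2", "g3", "g4"]
conditions = ["c1", "c2", "c3", "c4", "c5"]
matrix = [
    [5.0, 1.2, 5.1, 0.3, 4.9],
    [0.4, 2.2, 0.1, 3.3, 1.0],
    [5.2, 0.7, 4.8, 2.9, 5.0],
    [1.1, 3.0, 0.2, 0.5, 2.4],
]


def extract_bicluster(matrix, row_idx, col_idx):
    """Return the submatrix restricted to the chosen rows and columns."""
    return [[matrix[r][c] for c in col_idx] for r in row_idx]


# Hypothetical bicluster: g1 and g3 behave alike under c1, c3 and c5 only,
# so clustering on all five conditions could miss the pattern.
bicluster = extract_bicluster(matrix, row_idx=[0, 2], col_idx=[0, 2, 4])
for row in bicluster:
    print(row)
```

A traditional clustering would compare g1 and g3 across all five columns; the bicluster keeps only the three conditions under which they agree.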

Goal of Biclustering • To identify subgroups of genes and subgroups of conditions by clustering the rows and columns of the gene expression matrix simultaneously, instead of clustering the two dimensions separately • Finding biclusters is an NP-hard problem; it is a generalized version of traditional clustering

Previous Work • A systematic comparison and evaluation of biclustering methods for gene expression data – Amela Prelic (2006) • Algorithms: • Statistical-Algorithmic Method for Bicluster Analysis (SAMBA) • Order Preserving Submatrix Algorithm (OPSM) • Iterative Signature Algorithm (ISA) • Cheng and Church algorithm • xMotif • Bimax

Previous Work (Cont.) • Comparative Analysis of Biclustering Algorithms – Doruk Bozdag … (2010) • Algorithms: • Correlated Pattern Bicluster Algorithm (CPB) • Cheng and Church algorithm • Order Preserving Submatrix Algorithm (OPSM) • HARP Algorithm • Minimum Sum-Squared Residue-based CoClustering Algorithm (MSSRCC) • Statistical-Algorithmic Method for Bicluster Analysis (SAMBA)

The Importance of Assessment • Different algorithms give different solutions for the same data • There is no agreed-upon guideline for choosing among them • Validation Techniques • External Validation Measures: evaluate a result based on knowledge of the correct class labels • Internal Validation Measures: evaluate a result based on information intrinsic to the data alone

Validation • Most biclustering papers use external validation measures to assess the methods • It is not clear how to extend notions such as homogeneity and separation to the biclustering context (Gat-Viks et al. 2003) • Internal measures do not work well for biclustering, which is why Gat-Viks et al. (2003) and Handl et al. (2005) recommend external measures

Objectives of the Project • Comprehensive Assessment Technique • Internal measures as well as external measures • Customized Biclustering Method • Input domain

Validation using Synthetic Data • Testing using manufactured data • The portion of the implanted bicluster that the algorithm was able to return • The portion of the algorithm's output that is external or irrelevant to the implanted bicluster • Two metrics to evaluate cluster quality • U: Uncovered portion of the implanted bicluster • E: Portion of the output cluster external to the implanted bicluster
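The two metrics U and E can be sketched as set operations. This sketch assumes that "portion" is measured in matrix cells, i.e. (gene, condition) pairs, which the slide does not state explicitly; the function names are hypothetical.

```python
from itertools import product


def cells(genes, conditions):
    """All (gene, condition) cells covered by a bicluster."""
    return set(product(genes, conditions))


def uncovered_and_external(implanted, output):
    """Return (U, E) for one implanted bicluster and one output bicluster.

    Each bicluster is a (genes, conditions) pair of iterables.
    Assumption: both portions are counted over matrix cells.
    """
    imp = cells(*implanted)
    out = cells(*output)
    # U: fraction of the implanted bicluster that the output misses.
    u = len(imp - out) / len(imp) if imp else 0.0
    # E: fraction of the output that lies outside the implanted bicluster.
    e = len(out - imp) / len(out) if out else 0.0
    return u, e


# Implanted: 10 genes x 10 conditions. Output: 8 of those genes, plus 2 extra conditions.
u, e = uncovered_and_external(
    implanted=(range(10), range(10)),
    output=(range(8), range(12)),
)
print(f"U = {u:.3f}, E = {e:.3f}")
```

Lower is better for both values: a perfect recovery gives U = 0 and E = 0.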

Validation using Real Data • Testing using real (domain-specific) data, for example using the gene match score • Let M1, M2 be two sets of biclusters • Gene match score: average of the maximum match scores for all biclusters in M1 with respect to the biclusters in M2 • Potential improvements • The gene match score does not consider samples / conditions • Specificity and sensitivity
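The gene match score described above can be sketched as follows, assuming the match score between two biclusters is the Jaccard coefficient of their gene sets, as in the comparison by Prelic et al. listed in the references. The function names and the toy biclusters are illustrative only.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def gene_match_score(m1, m2):
    """Average, over biclusters in m1, of the best gene-set match in m2.

    m1 and m2 are lists of (genes, conditions) pairs. Conditions are
    ignored here, which is the limitation noted on the slide.
    """
    if not m1 or not m2:
        return 0.0
    best = [max(jaccard(g1, g2) for g2, _ in m2) for g1, _ in m1]
    return sum(best) / len(m1)


# Toy bicluster sets (invented values).
m1 = [({1, 2, 3}, {1, 2}), ({7, 8}, {3})]
m2 = [({1, 2, 3, 4}, {1, 2}), ({9}, {3})]
print(gene_match_score(m1, m2))
```

Note that the score is not symmetric in general: `gene_match_score(m1, m2)` and `gene_match_score(m2, m1)` answer different questions.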

Proposed Assessment • Calculate sensitivity and specificity scores • Specificity: proportion of actual negatives that are correctly identified • Sensitivity: proportion of actual positives that are correctly identified • Improve existing measures: • Average of the maximum match scores for all biclusters in M1 with respect to the biclusters in M2 (considering both genes and samples) • Assessment based on knowledge of domain data • The resulting biclusters are evaluated based on the enrichment of Gene Ontology (GO) terms
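A minimal sketch of the sensitivity and specificity calculation follows. It assumes that a "positive" is a matrix cell belonging to the true bicluster and a "negative" is any other cell of the data matrix; the presentation does not spell out this choice, and the function name is hypothetical.

```python
from itertools import product


def sensitivity_specificity(true_bic, found_bic, all_genes, all_conditions):
    """Cell-level sensitivity and specificity of a found bicluster.

    true_bic and found_bic are (genes, conditions) pairs.
    Assumption: positives are the cells of the true bicluster.
    """
    universe = set(product(all_genes, all_conditions))
    positives = set(product(*true_bic))
    predicted = set(product(*found_bic))
    negatives = universe - positives

    tp = len(positives & predicted)   # true cells that were found
    fn = len(positives - predicted)   # true cells that were missed
    fp = len(negatives & predicted)   # cells wrongly reported
    tn = len(negatives - predicted)   # cells correctly left out

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


sens, spec = sensitivity_specificity(
    true_bic=(range(10), range(10)),
    found_bic=(range(8), range(12)),
    all_genes=range(100),
    all_conditions=range(100),
)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.4f}")
```

Because a bicluster usually covers a small part of the matrix, specificity tends to stay close to 1, so it is most informative when read together with sensitivity.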

Experiments • Given two biclustering results • M1: Result of a biclustering algorithm • M2: True result • (G1, C1) ∈ M1 and (G2, C2) ∈ M2 • Calculate the similarity score (Jaccard coefficient) over the gene sets and over the condition sets • Calculate the two scores: • Score 1: % of an algorithm's result that is included in the true result • Score 2: % of the true result that an algorithm can find
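Score 1 and Score 2 can be sketched as overlap percentages. This sketch assumes the percentages are taken over the matrix cells covered by each set of biclusters; the slide does not say whether genes, conditions, or cells are counted, and the function names are hypothetical.

```python
from itertools import product


def covered_cells(biclusters):
    """Union of all (gene, condition) cells covered by a list of biclusters."""
    covered = set()
    for genes, conditions in biclusters:
        covered.update(product(genes, conditions))
    return covered


def inclusion_scores(m1, m2):
    """Return (score1, score2) as percentages.

    m1: result of a biclustering algorithm, m2: true result.
    score1: % of the algorithm's result that lies inside the true result.
    score2: % of the true result that the algorithm found.
    """
    found = covered_cells(m1)
    truth = covered_cells(m2)
    shared = len(found & truth)
    score1 = 100.0 * shared / len(found) if found else 0.0
    score2 = 100.0 * shared / len(truth) if truth else 0.0
    return score1, score2


m1 = [(range(8), range(12))]   # algorithm output (toy example)
m2 = [(range(10), range(10))]  # true bicluster (toy example)
score1, score2 = inclusion_scores(m1, m2)
print(f"Score 1 = {score1:.1f}%, Score 2 = {score2:.1f}%")
```

Under this reading, Score 1 behaves like precision and Score 2 like recall, so an algorithm that returns the whole matrix gets a high Score 2 but a low Score 1.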

Results • Synthetic Data: 100 genes and 100 samples • 10 implanted biclusters, each of size 10 × 10 (10 genes and 10 samples) • Used different publicly available implementations of biclustering algorithms • Score 1: % of an algorithm's result that is included in the true result • Score 2: % of the true result that an algorithm can find
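A synthetic data set of the stated shape could be generated as in the sketch below. The slide gives only the dimensions, so everything else here is an assumption: the Gaussian background, the near-constant bicluster values, the non-overlapping placement along the diagonal, and the function name `make_synthetic`.

```python
import random


def make_synthetic(n_genes=100, n_samples=100, n_biclusters=10, size=10, seed=0):
    """Build a noisy matrix with implanted square biclusters.

    Assumptions (not from the presentation): background values are
    standard Gaussian noise, implanted cells are close to 5.0, and the
    biclusters are placed as non-overlapping blocks on the diagonal.
    """
    rng = random.Random(seed)
    matrix = [[rng.gauss(0.0, 1.0) for _ in range(n_samples)]
              for _ in range(n_genes)]
    implanted = []
    for k in range(n_biclusters):
        rows = range(k * size, (k + 1) * size)
        cols = range(k * size, (k + 1) * size)
        for r in rows:
            for c in cols:
                matrix[r][c] = 5.0 + rng.gauss(0.0, 0.1)
        implanted.append((set(rows), set(cols)))
    return matrix, implanted


matrix, implanted = make_synthetic()
print(len(matrix), len(matrix[0]), len(implanted))
```

The returned `implanted` list plays the role of the true result M2 when an algorithm's output is scored.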

Conclusion • Traditional clustering is too restrictive a technique for analyzing datasets in various application domains • We need new, flexible analysis techniques such as biclustering to deal with possible imperfections in the input datasets • Assessment of data analysis is critical and must be considered when selecting the right tool for each application domain • Biclustering is a powerful tool for analyzing data in a variety of domains and can be applied to datasets outside biology

References • Madeira, S.C., Oliveira, A.L.: Biclustering algorithms for biological data analysis: A survey • Amela Prelic et al.: A systematic comparison and evaluation of biclustering methods for gene expression data • http://cheng.ececs.uc.edu/biclustering • http://www.tik.ethz.ch/~sop/bicat/ • http://acgt.cs.tau.ac.il/expander/

Thank you…
