University of Nebraska at Omaha: Innovative Database Models and Advanced Tools in Bioinformatics. Hesham H. Ali, UNO Bioinformatics Research Group, Department of Computer Science, College of Information Science and Technology.
Key Challenges Facing Bioinformatics Research • Significant gaps between tool developers and tool users • Different objectives • Different funding agencies • Different academic cultures • Significant problems with available Biological Data • Archival based • Lack of structure
Problems with Current Biological Data • The large volume of available biological data and the increasing rate at which new data is produced, whether in public data banks or from microarray experiments • The increasing pressure to maximize the use of the available data, particularly to impact key related industries (biotech companies, biotech drugs) • The large degree of heterogeneity in the quality of the available data
UNO Bioinformatics Research Group • Group Triangle • Research motivated by real biological problems • Innovative Database Models • Advanced tools
Biological Questions Addressed by our Group • Molecular diagnosis - Identification • Sequence based id • Enzyme (cutting order) based id • Instrumentation (Mass Spec, WAVE) based id • Basic Molecular Biology - Gene regulation • Microarray Analysis • Motif discovery/searching • Epidemiology and Clinical Research • Patient tracking system • Clinical expert system
Bioinformatics Solutions to These Problems • Develop new inventive database models • Custom database for specific domains • Centralized Structured integrated data • Develop innovative Bioinformatics tools • Clustering algorithms • Advanced motif finding approaches
Database Models • Customized (Private) Solution: custom database model with a high degree of quality and consistency • Centralized (Public) Solution: new curated, integrated, and structured database model
Model One: Custom Databases • Allowing researchers to create custom sets of genetic data suited to their specific needs. • Allowing researchers to control the quality of genetic data in their custom data sets through fine-tuning parameters. • Searching data using optimal alignment algorithms, rather than using heuristic methods. • Giving researchers/clinicians the ability to formulate sequence identification concepts and test their ideas against a validated database • Incorporating information from GenBank if needed
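The third bullet contrasts optimal alignment with heuristic search. As a minimal sketch of what an optimal method looks like, the following computes a Smith-Waterman local alignment score by dynamic programming and uses it to pick the best hit in a small custom database. The slides do not name the algorithm or scoring scheme the group uses, so the choice of Smith-Waterman, the score values, and the `best_hit` helper are assumptions for illustration only.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the DP table.
    The scoring values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def best_hit(query, database):
    """Score the query against every record (name -> sequence) and return
    the (name, score) pair with the highest score."""
    scored = ((name, smith_waterman_score(query, seq))
              for name, seq in database.items())
    return max(scored, key=lambda hit: hit[1])
```

Because every cell of the table is filled, the score is guaranteed to be the optimum for the chosen scoring scheme, at a cost proportional to the product of the two sequence lengths. That cost is manageable for a curated custom data set but is the reason large public databases are usually searched heuristically.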
The Sequence Identification Problem • Identification of organisms using obtained sequences is a very important problem • Relying on wet lab methods only is not enough • Identification algorithms that use signature motifs can complement the experimental approaches • Currently, no robust software tool is available for aiding researchers and clinicians in the identification process • Such a tool would have to utilize biological knowledge and databases to identify sequences • The size and quality of the available data are concerns that would need to be dealt with
Nebraska gets its very own organism • While trying to pinpoint the cause of a lung infection in local cancer patients, they discovered a previously unknown micro-organism. And they've named it "mycobacterium nebraskense," after the Cornhusker state. • It was discovered a few weeks ago using Mycoalign, a bioinformatics program developed at PKI. Source: Omaha World Herald, March 21, 2005
Model Two: Centralized Database - the Integrated Model A new integrated model based on: • Organized and curated database • True non-redundancy by having one record for each polymorphic set with pointer to the rest of the set if needed • Allowing advanced queries • Being user-friendly and employing true automation • Employing various algorithms with different levels of accuracy and speed for conducting homology searches.
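The non-redundancy bullet describes one stored record per polymorphic set, with a pointer to the remaining members. A minimal sketch of that data structure follows. The grouping rule used here (equal length and at most `max_mismatches` differences from the representative) is a hypothetical stand-in, since the slides do not say how equivalence classes are determined, and the names `PolymorphicSet` and `build_sets` are invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class PolymorphicSet:
    """One stored record standing in for a set of near-identical records."""
    representative_id: str
    sequence: str
    member_ids: list = field(default_factory=list)  # pointers to the rest of the set


def mismatches(a, b):
    """Number of differing positions, or None if the lengths differ."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def build_sets(records, max_mismatches=1):
    """Greedily assign (id, sequence) records to polymorphic sets.

    Assumption: a record joins the first set whose representative it
    differs from by at most max_mismatches positions.
    """
    sets = []
    for rec_id, seq in records:
        for s in sets:
            d = mismatches(seq, s.sequence)
            if d is not None and d <= max_mismatches:
                s.member_ids.append(rec_id)
                break
        else:
            # No existing set is close enough, so this record starts a new one.
            sets.append(PolymorphicSet(rec_id, seq))
    return sets
```

A homology search then only needs to scan the representatives, and the `member_ids` pointers recover the full set when a user asks for it.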
The Clean Gene Package • A set of integrated database and alignment tools: • Edited and curated • Web based • Of manageable size • Based on a hierarchical database model • Utilizes various alignment algorithms • Allows advanced automated queries • Allows fast and accurate searches
The Key Challenges • The New Structured relational database model • Identification of equivalence classes of records (polymorphic sets) • Identification of a good representative for each set • Curation and classification • Accurate annotation • Advanced data mining tools • A user-friendly interface that employs true automation for interfacing with the database
Tool I: Clustering Biological Data • Clustering is a fundamental technique in finding a structure in a collection of unlabeled data. • Basically, clustering is the process of organizing objects into groups whose members are similar in some way. • A good Clustering tool is a key component in analyzing microarray data
Message Passing Clustering (MPC) • Inspired by real-world situations: elements with similar attributes cluster together simultaneously • Advantages: • Easy to understand and use. • By taking advantage of communication among data objects, MPC can balance global and local structure and can be performed in parallel. • The “message” has a flexible structure, which allows further development to fit different research interests. • We have extended the basic MPC to • Weighted MPC • Stochastic MPC • Semi-supervised MPC
Basic MPC • Figure: phylogenetic trees of Mycobacterium (9 species, 34 strains), constructed by (a) the Neighbor Joining (NJ) method and (b) the MPC method.
Weighted MPC (WMPC) with Adaptive Feature Scaling • Add a weight associated with each cluster-feature pair. A single feature may have different weights in different clusters and, within one cluster, different features may have different weights. • Update the weights during the clustering process. If, on some dimension, the similarity between two clusters about to merge is high (or low), then the weight on that dimension is increased (or decreased) in the newly merged cluster. • Tested on Colon Cancer data (2000 genes in 40 tumor and 22 normal samples), WMPC gives a higher classification rate. • Two benefits: • Strengthens the signal features while reducing the noise features, making clustering results more accurate. • More importantly, reveals the contribution of the features (genes) to the clusters (samples), which helps identify the set of genes responsible for certain diseases.
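The weight update described above can be sketched roughly as follows. The slide gives only the direction of the update (raise the weight where the merging clusters agree, lower it where they differ), so the concrete rule here is an assumption: start from the average of the parents' weights, scale by a hypothetical `step` factor depending on whether the centroids differ by less than the mean difference, and renormalize so the weights average to 1. The weighted Euclidean distance is likewise an illustrative choice.

```python
def weighted_distance(x, y, w):
    """Weighted Euclidean distance between feature vectors x and y."""
    return sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)) ** 0.5


def merged_weights(c1, c2, w1, w2, step=0.1):
    """Feature weights for the cluster formed by merging two clusters.

    c1, c2: centroids of the merging clusters; w1, w2: their weights.
    Assumed rule: dimensions on which the centroids are closer than the
    mean difference gain weight, the others lose weight.
    """
    diffs = [abs(a - b) for a, b in zip(c1, c2)]
    mean_diff = sum(diffs) / len(diffs)
    raw = []
    for d, a, b in zip(diffs, w1, w2):
        base = (a + b) / 2
        raw.append(base * (1 + step) if d <= mean_diff else base * (1 - step))
    scale = len(raw) / sum(raw)  # keep the average weight at 1
    return [r * scale for r in raw]
```

After clustering finishes, the features carrying the largest weights in a cluster are the ones that contributed most to forming it, which is the property the second benefit relies on.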
Stochastic MPC (SMPC) Based on Kernel Functions • Probability density estimates based on little Gaussian kernel functions • Kernel density estimates using Gaussian kernels • [Figure: a target object and objects a to f placed by distance from 0, annotated with the possible outcomes: chance to merge, kick out, tie]
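One way to read the kernel-based idea is that each candidate's chance of merging with the target depends on a kernel evaluated at its distance. The sketch below is an assumption about how that could be computed, not the group's implementation: the function names, the normalization into probabilities, and the `bandwidth` parameter are all illustrative.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def rectangular_kernel(u):
    """Rectangular (uniform) kernel on [-1, 1]."""
    return 0.5 if abs(u) <= 1 else 0.0


def merge_probabilities(distances, bandwidth, kernel=gaussian_kernel):
    """Turn distances from the target object (name -> distance) into
    merge probabilities by normalizing kernel values."""
    weights = {name: kernel(d / bandwidth) for name, d in distances.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}
```

With a Gaussian kernel, nearer objects are more likely to merge but distant ones keep a small chance. With the rectangular kernel and the bandwidth set to the minimum distance, all probability falls on the nearest object (shared equally among ties), which behaves like a deterministic nearest-neighbor merge.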
Semi-supervised MPC • Clustering methods are considered unsupervised, meaning that the reduction is derived solely from the data rather than reflecting any previous knowledge. • Classification methods are considered supervised because, in the training phase, the sample classes are already known and we classify the objects into known groups. • Between clustering and classification: unlabeled data with prior knowledge, such as constraints and hypotheses. • The goal of semi-supervised clustering is to use the prior knowledge to guide the clustering toward better partitions.
Semi-supervised MPC Instance-level Constraints • Colon Cancer data (2000 genes in 40 tumor and 22 normal samples). • We cluster samples with genes as features. Since the labels (constraints) of the samples (instances) are known, these are called instance-level constraints. We want to see how well our method separates the normal and tumor tissues given different numbers of known sample labels as prior knowledge. • OP: Output partition after clustering. • IP: Input constraints presented before clustering. • Combining the power of clustering with background information achieves better performance than either in isolation.
Semi-supervised MPC Attribute-level Constraints
Gene clusters illustrating differentially expressed genes in tumor and normal samples a. Cluster 6 b. Cluster 8
Generalizations • WMPC extends the unweighted MPC to the weighted MPC. • If we initialize all entries in w to be 1 and never change the weights, MPC-AFS (MPC with adaptive feature scaling, i.e. WMPC) becomes a regular MPC. • SMPC extends the deterministic MPC to the stochastic MPC. • If we choose a particular kernel function (rectangular) and a particular bandwidth parameter (the minimum distance between the target cluster and all the others) to estimate the probability, SMPC reduces to a regular MPC. • Semi-supervised MPC extends unsupervised MPC to partially supervised MPC. • Unsupervised MPC can be considered a special case of semi-supervised MPC with no background information or constraints.
Tool II: Motif Finding/Data Mining Tool • Given a set of known binding sites, develop a representation of these binding sites that can be used to search for additional instances of those binding sites in the genome. • Given a set of sequences known to be co-regulated (i.e. by an expression array) determine the binding locations in the sequence and determine a representation for binding specificity.
Motif Representations • Static Sequences: tataat • Regular Expressions (RegEx): tat[at].t • Sequences with N errors: tataat:2 • RegEx with N errors: tat[at].t:2 • Mononucleotide Scoring Matrices • Dinucleotide Scoring Matrices (HMMs) • Multinucleotide Scoring Matrices • [Figure: the pattern a t [at] . t t shown above a scoring matrix with rows a, t, g, c and columns for positions 1 to 6]
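The first three representations can be illustrated with a short sketch. The regular expression uses Python's standard `re` module; the "N errors" case is interpreted here as at most N mismatched positions (Hamming distance), which is an assumption, since the slides do not say whether insertions and deletions also count as errors. The function names are illustrative.

```python
import re


def find_regex(pattern, sequence):
    """Start positions of all (possibly overlapping) regex matches."""
    # The lookahead makes each match zero-width so overlaps are reported.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", sequence)]


def find_with_errors(motif, sequence, max_errors):
    """(position, errors) for every window within max_errors mismatches
    of the static motif."""
    hits = []
    for i in range(len(sequence) - len(motif) + 1):
        window = sequence[i:i + len(motif)]
        errors = sum(a != b for a, b in zip(motif, window))
        if errors <= max_errors:
            hits.append((i, errors))
    return hits
```

For the sequence `ggtataatcctatttt`, the static motif `tataat` with zero errors finds only position 2, while both `tat[at].t` and `tataat:2` also find the weaker site at position 10. This shows how the looser representations trade specificity for sensitivity.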
Searching for Known Motifs 1. Obtain a multiple sequence alignment of known motifs (e.g. from gel shift assay): …atagtt… …aattat… …attatt… …ttactt… 2. Construct a representation [Figure: the pattern a t [at] . t t and a scoring matrix with rows a, t, g, c] 3. Score all possible windows in the data set against the representation 4. Output results from the data set that score above a specified threshold
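The four steps above can be sketched with a mononucleotide scoring matrix built from the four aligned sites on the slide. The log-odds scoring, the pseudocount of 1, and the uniform background of 0.25 are assumptions chosen for illustration; the slide's own scoring formula did not survive in the text.

```python
import math

BASES = "acgt"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds position weight matrix from equal-length aligned sites."""
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            b: math.log2((column.count(b) + pseudocount) / total / background)
            for b in BASES
        })
    return pwm


def score_window(pwm, window):
    """Sum of per-position log-odds scores for one window."""
    return sum(col[base] for col, base in zip(pwm, window))


def scan(pwm, sequence, threshold):
    """(position, score) for every window scoring at or above threshold."""
    n = len(pwm)
    hits = []
    for i in range(len(sequence) - n + 1):
        s = score_window(pwm, sequence[i:i + n])
        if s >= threshold:
            hits.append((i, round(s, 3)))
    return hits


SITES = ["atagtt", "aattat", "attatt", "ttactt"]  # sites shown on the slide
```

Scanning `ccccattattcccc` with `scan(build_pwm(SITES), ..., 2.0)` reports only the window starting at position 4, where the embedded site `attatt` sits. Raising or lowering the threshold trades false negatives against false positives.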
Finding Unknown Motifs 1. Input a set of co-expressed sequences that are related by a microarray experiment 2. Input the motif length n 3. Score all possible windows by first constructing a multiple sequence alignment of each window to all other possible matches in the other sequences 4. Rank the set of all possible scoring matrices of length n based on information content relative to background 5. Output an ordered list of motifs and corresponding scoring matrices
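The ranking step relies on information content relative to background. A minimal sketch of that measure follows, assuming a uniform background of 0.25 per base and no pseudocounts; the slides do not give the exact formula used, so this is the standard relative-entropy form, and `rank_candidates` is a hypothetical helper.

```python
import math
from collections import Counter


def information_content(windows, background=0.25):
    """Total information content, in bits, of a set of aligned windows:
    the sum over columns of sum_b p_b * log2(p_b / background)."""
    total = 0.0
    for column in zip(*windows):
        counts = Counter(column)
        for count in counts.values():
            p = count / len(column)
            total += p * math.log2(p / background)
    return total


def rank_candidates(candidates):
    """Order candidate alignments from most to least informative."""
    return sorted(candidates, key=information_content, reverse=True)
```

A perfectly conserved column contributes 2 bits and a column with all four bases equally frequent contributes 0, so alignments that look like a real motif rise to the top of the list.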
AGAST: Advanced Grammar Alignment Search Tool • Capitalize on the advantages of alignment. • Provide a formal and robust method for computing bio-relationships. • Provide optimum results based on the input. • Calculate relationships at the same time as alignment. • Allow for user knowledge and subsequence relationships. • Consider record attributes and sequence attributes simultaneously. • Dynamically construct the requisite algorithms in a user-friendly way, thus limiting development time and technical knowledge requirements.
AGAST: Advanced Grammar Alignment Search Tool Advantages: • It can evaluate regular expressions important to biology as well as traditional RegEx tools. • It can evaluate traditional alignments. • It can do any combination of RegEx and traditional alignments.
Example of an Advanced Query • Find a sequence that contains the annotated elements [Figure: the sequence tatatagcagcccatgagccggcccgcadtgctagttcag annotated with Transcription Start Site, 5-10 bases, Start Codon, Any Number of Bases, and Functional Unit] • Example Query: tatata.*{5,10}atg.*[gc]ca[at]gct[atgc]g:2.* • tatatagcaggggcccatgagccggcccccadagctcgttcag Score: 0 • tatatagcagcccatgagccggcccgcadtgagttcag Score: 2
Conventional Problem • Motif searching programs do not score combinatorial regulation modules; instead they score based on the probability of a single motif. • We have developed and tested a program that considers an ordered set of motifs (or sequence attributes) and searches based on the context of adjacent elements (a grammar).
Next Steps • Extract multiple motifs in the context of regulatory control networks. • Use phylogenetic footprinting and gene regulatory network information to compare and contrast gene regulation networks and extrapolate combinatorial control mechanisms and corresponding motifs upstream of genes.
Other Current Research Projects • Advanced tools for identifying splice sites • Using ab initio Bayesian networks based approaches • Using homology graph theory based approaches • Fast Recognition of Microorganisms using enzyme cutting sequence, mass spectrometry or sequence based approaches • Gene Prediction using Comparative Genomics • Reconstructing Gene Regulatory Networks • Clustering Techniques for Simplifying Protein Sequences
Acknowledgment • Kiran Bastola • Alexander Churbanov • Xutao Deng • Huimin Geng • Steven Hinrichs • Xiaolu Huang • Daniel Kuyper • Mark Pauley • Daniel Quest