Explore the KDD Cup 2001 report focusing on mining biological databases for drug design and genomics. Tasks include predicting gene/protein function, localization, and molecular bioactivity for drug design. Learn about the datasets, statistics, algorithms, winners, and Bayesian network approach used in the competition. Discover key insights and conclusions from this innovative challenge in bioinformatics.
KDD-2001 Cup: The Genomics Challenge • Advisor: Dr. Hsu • Graduate: Min-Hong Lin • IDSL seminar
Outline • Motivation • Objective • KDD Cup 2001 Report • Task 1: Thrombin Result • Task 2: Predicting Function • Task 3: Localization • Conclusions • Personal Opinion
Motivation • Interest in mining biological databases is growing rapidly. • Bioinformatics datasets are typically under-determined: • a very large number of features (complex domain) • a small number of instances (high cost per data point)
Objective • KDD Cup 2001 focused on mining biological databases in two areas: • drug design • genomics
Dataset 1: Prediction of Molecular Bioactivity for Drug Design: Binding to Thrombin • Dataset provided by DuPont Pharmaceuticals • Activity of compounds binding to thrombin • Library of compounds (training data): • 1909 known molecules (42 actively binding thrombin) • 139,351 binary features describing the 3-D structure of each compound • 634 new compounds with unknown capacity to bind thrombin (test data)
Dataset 2: Prediction of Gene/Protein Function and Localization • Yeast genome dataset • Data on protein-protein interactions from the MIPS database (Munich Information Centre for Protein Sequences) • The genes that encode 6449 yeast proteins are already known, but only 52% of these proteins have been characterized. • Relational dataset • Gene information • Interaction information • Predict the function and localization of unknown proteins
Statistics: I. Participation • 136 groups participated (200 submissions) • Almost 5-fold increase over previous years • More than half of the entries came from the commercial sector
Statistics: II. Data Mining Software • Mostly custom software was used • Especially for Task 1, where the number of features was too large for most commercial systems • This gap points to a need for commercial tools that can cope with bioinformatics datasets
Statistics: III. Algorithms • Feature selection was used in almost 70% of the entries for Task 1 • Ensemble classifiers based on more than one algorithm were used extensively • Decision trees were among the most commonly used algorithms, along with Naïve Bayes and k-NN • Cross-validation was used to deal with the small dataset size
KDD-2001 Cup Winners • Task 1: Jie Cheng, CIBC (Canadian Imperial Bank of Commerce) • Task 2: Mark-A. Krogel, Magdeburg Univ. • Task 3: Hisashi Hayashi, Jun Sese, and Shinichi Morishita, Univ. of Tokyo
Task 1: Thrombin Result • Objective • Prediction of molecular bioactivity for drug design: binding to thrombin • Data • Training: 1909 cases (42 positive), 139,351 binary features • Test: 634 cases • Challenge • Highly imbalanced, high-dimensional, training and test data drawn from different distributions • Approach • Bayesian network predictive model
Bayesian Network • A Bayesian network B = <N, A, Θ> consists of a directed acyclic graph (DAG) <N, A> and a set of conditional probability distributions Θ • Each node n ∈ N represents a domain variable • Each arc a ∈ A between nodes represents a probabilistic dependency • The dependencies are quantified using a conditional probability distribution (CP table) θi ∈ Θ for each node ni • A major advantage of BNs is that the network structure represents the inter-relationships among the dataset attributes.
Bayesian network structure of ‘Adult’ data • Two ways to view it: • Represents the joint probability distribution of the attributes • Encodes the conditional independence relationships among the nodes
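The first view, a Bayesian network as a compact joint distribution, can be illustrated with a minimal sketch. The three-node network, its variable names, and its CP table values below are all made up for illustration; they are not the 'Adult' network from the slide.

```python
from itertools import product

# Hypothetical three-node network: A -> C <- B (all variables binary).
parents = {"A": [], "B": [], "C": ["A", "B"]}

# CP tables: map a tuple of parent values to P(node = 1 | parents).
# The numbers are illustrative assumptions only.
cpt = {
    "A": {(): 0.3},
    "B": {(): 0.6},
    "C": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9},
}

def joint_probability(assignment):
    """P(assignment) as the product of each node's conditional probability."""
    prob = 1.0
    for node, node_parents in parents.items():
        parent_values = tuple(assignment[p] for p in node_parents)
        p_one = cpt[node][parent_values]
        prob *= p_one if assignment[node] == 1 else 1.0 - p_one
    return prob

# Summing over all 2^3 assignments should give 1 for a valid network.
total = sum(
    joint_probability(dict(zip("ABC", values)))
    for values in product([0, 1], repeat=3)
)
```

The absence of an arc between A and B is what encodes their independence here: the joint factorizes as P(A) P(B) P(C | A, B).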
Our approach to Thrombin Data • Pre-processing: Feature subset selection using mutual information (200 of 139,351 features)
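A minimal sketch of mutual-information feature filtering, assuming binary features stored column by column and a binary activity label. The function names and the toy data are hypothetical; the slides do not describe the winner's actual implementation.

```python
import math
from collections import Counter

def mutual_information(feature, labels):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    px = Counter(feature)
    py = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log2(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

def select_top_features(columns, labels, k):
    """Indices of the k columns with the highest mutual information."""
    scores = [(mutual_information(col, labels), j) for j, col in enumerate(columns)]
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]

# Toy data (hypothetical): 3 binary features over 6 compounds.
labels = [1, 1, 0, 0, 0, 0]
columns = [
    [1, 1, 0, 0, 0, 0],  # identical to the label: most informative
    [1, 0, 0, 0, 0, 0],  # partially informative
    [0, 0, 0, 0, 0, 0],  # constant: carries no information
]
top = select_top_features(columns, labels, k=2)
```

On the real data the same ranking would be applied with k = 200 over the 139,351 feature columns.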
Learning and evaluating BN models • The BN PowerPredictor system allows users to control the complexity of the learned network by adjusting a threshold value. • The system allows users to choose from two commonly used performance measures: • The prediction accuracy • The area under the ROC curve (AUC) • Five candidate models were generated from the preprocessed training data set • Each of the five candidates had from 2 to 12 features.
Learning and evaluating BN models • [Figure: Bayesian network; recovered node labels: Activity, 10695, 91839, 16794, 79651] • Each candidate was used to classify the training set, and its AUC score was measured • The simplest model that had a “decent” AUC score was then picked
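The selection step can be sketched as follows. AUC is computed here from its rank interpretation (the probability that a randomly chosen active case scores above a randomly chosen inactive one). The threshold for a "decent" AUC is not given in the slides, so `min_auc` is a hypothetical parameter, as are the function names.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def pick_simplest(candidates, min_auc):
    """Among (n_features, auc) pairs that reach min_auc, pick the fewest features."""
    decent = [c for c in candidates if c[1] >= min_auc]
    if not decent:
        return None
    return min(decent, key=lambda c: c[0])

example_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
chosen = pick_simplest([(12, 0.95), (4, 0.91), (2, 0.70)], min_auc=0.9)
```

The pairwise loop is quadratic and meant only to make the definition explicit; a sort-based version would be preferable on larger data.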
Classifying the testing set • Using the chosen model, computed the posterior probability of each instance in the test dataset • Decided on a cut point to classify the test cases as either active or inactive • 8 possible cut points to choose from: 32, 71, 72, 74, 75, 215, 223, 550 • Decided to classify 223 cases as active
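One plausible reading of the cut points is that a model with few binary features yields only a handful of distinct posterior values, so the test cases fall into groups and a cut can only be placed between groups. The sketch below works under that assumption; the function names and toy posteriors are hypothetical.

```python
from collections import Counter

def candidate_cut_points(posteriors):
    """Cumulative case counts at each distinct posterior value, highest first.

    The last entry always equals the total number of cases.
    """
    counts = Counter(posteriors)
    cuts, running = [], 0
    for value in sorted(counts, reverse=True):
        running += counts[value]
        cuts.append(running)
    return cuts

def classify_top_k(posteriors, k):
    """Label the k cases with the highest posterior as active (1), the rest 0."""
    order = sorted(range(len(posteriors)), key=lambda i: -posteriors[i])
    active = set(order[:k])
    return [1 if i in active else 0 for i in range(len(posteriors))]

posteriors = [0.9, 0.1, 0.5, 0.9, 0.5, 0.5]
cuts = candidate_cut_points(posteriors)
predicted = classify_top_k(posteriors, k=cuts[0])
```

Choosing k from `cuts` guarantees that cases with identical posteriors always receive the same label.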
Analyzing the result • Accuracy: 0.711 • Weighted accuracy: 0.684 • [Figure: ROC curve; axes: sensitivity vs. 1-specificity]
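The two scores can be computed as sketched below, assuming that "weighted accuracy" means the average of the accuracy on active cases (sensitivity) and the accuracy on inactive cases (specificity). The slide does not define the measure, so that definition is an assumption, and the labels are toy data.

```python
def accuracy_report(y_true, y_pred):
    """Plain accuracy and the mean of per-class accuracies.

    Assumes both classes occur in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "weighted_accuracy": (sensitivity + specificity) / 2,
    }

report = accuracy_report([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1])
```

On highly imbalanced data the two numbers can diverge sharply, since plain accuracy is dominated by the majority class.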
Conclusions • The combination of information-gain-based feature filtering and Bayesian-net-based feature selection is a novel, effective approach for analyzing high-dimensional data. • We gained awareness of the overfitting problem when out-of-sample validation is impossible, especially when the sample size is small. • When a well-defined cost function is not available, one should carefully choose performance measures that are independent of the cost function, such as AUC.
Task 2: Gene/Protein Function Prediction • RELAGGS (a multirelational learning algorithm) was developed at Magdeburg University • RELAGGS is intended to deal with relational data • RELAGGS had previously been tested on relational datasets from financial domains
Preprocessing with SQL • General: renormalize the data into multiple tables as a natural representation • The genes_relation table contained 862 training examples and 381 test examples. • Specific to KDD Cup tasks 2/3: consider only interactions with high correlations, assume transitivity, make symmetry explicit
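The task-specific steps can be sketched outside SQL as follows. This is one interpretation only: the correlation threshold `min_corr` is a hypothetical parameter, and "assume transitivity" is read here as taking the full transitive closure, which the slides do not confirm.

```python
def preprocess_interactions(interactions, min_corr):
    """Filter, symmetrize, and transitively close a list of (gene_a, gene_b, corr)."""
    pairs = set()
    # Keep only strongly correlated interactions and store both directions.
    for a, b, corr in interactions:
        if corr >= min_corr and a != b:
            pairs.add((a, b))
            pairs.add((b, a))
    # Transitive closure: if a-b and b-d interact, assume a-d interact as well.
    changed = True
    while changed:
        changed = False
        for a, b in list(pairs):
            for c, d in list(pairs):
                if b == c and a != d and (a, d) not in pairs:
                    pairs.add((a, d))
                    changed = True
    return pairs

# Toy interactions (hypothetical gene identifiers and correlations).
interactions = [("G1", "G2", 0.9), ("G2", "G3", 0.8), ("G3", "G4", 0.1)]
closed = preprocess_interactions(interactions, min_corr=0.5)
```

The nested loop is quadratic per pass and is only suitable for small toy inputs.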
Preprocessing with RELAGGS • It takes as input a description of the tables • RELAGGS uses the foreign link information to compute join definitions • Performs automatic transformation of multiple tables into a single table with the help of aggregate functions • Uses a propositional learner such as C4.5 or SVMlight
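The idea of collapsing a one-to-many relation into a single table with aggregate functions can be sketched as below. This is not RELAGGS itself; the table layout, column names, and choice of aggregates are hypothetical.

```python
from collections import defaultdict

def propositionalize(genes, interactions):
    """Join each gene row with aggregates over its interaction rows."""
    by_gene = defaultdict(list)
    for row in interactions:
        by_gene[row["gene_id"]].append(row["corr"])
    table = []
    for gene in genes:
        values = by_gene.get(gene["gene_id"], [])
        table.append({
            **gene,
            "n_interactions": len(values),
            "corr_min": min(values) if values else None,
            "corr_max": max(values) if values else None,
            "corr_avg": sum(values) / len(values) if values else None,
        })
    return table

# Toy tables (hypothetical identifiers and values).
genes = [{"gene_id": "G1", "essential": "yes"}, {"gene_id": "G2", "essential": "no"}]
interactions = [
    {"gene_id": "G1", "corr": 0.5},
    {"gene_id": "G1", "corr": 1.0},
]
flat = propositionalize(genes, interactions)
```

Each output row has a fixed set of columns regardless of how many interactions a gene has, which is what lets a propositional learner consume it.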
Data Mining with SVMlight • An SVMlight run on the RELAGGS output resulted in model files from the training genes and in prediction files for the test genes
Postprocessing with SQL • The predictions for single functions and localizations had to be integrated into a final solution
Conclusion • From 10-fold cross-validation: • Accuracies: 92.9% on task 2, 72.5% on task 3 • From the Cup organizers: • Accuracies: 93.6% on task 2 (rank 1), 69.8% on task 3 (rank 4)
Task 3: Localization • Task • Predict the localization of a given gene in a cell among 15 distinct positions • Data • Relation table with six categorical attributes: Essential, Class, Complex, Phenotype, Motif, Chromosome Number • Interaction matrix listing all the interactions between genes • Training: 862 genes • Test: 381 genes
Characteristics of the Dataset • Dataset 2 has three interesting features: • The dataset contains many missing values • The domain of the objective attribute contains 15 non-ordered values • The dataset is a mixture of two types of data
Coping with Missing Values • Class, Complex, and Motif are highly correlated with localization • With regard to the binary interaction relationship: • Genes that interact with the gene in question are usually located in the same part of the cell • Compensate for the missing information by using the three attributes and the binary interaction relationship
Different Test Approaches • Applied three independent approaches to the data analysis • Decision trees with correlated association rules • AdaBoost (boosting correlated association rules) • Nearest neighbor method • The nearest neighbor method worked best on the training dataset
Nearest Neighbor Analysis • Attribute Agreement of Records • Records r1 and r2 in R are said to agree on fi if r1[fi] and r2[fi] share some common elements • r1[fi] ∩ r2[fi] ≠ ∅
Neighbors • Two records are neighbors if they agree with respect to certain attributes. • [Figure: example neighbor relationships among Gene1, Gene2, Gene3, Gene4]
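Agreement and neighborhood can be sketched directly with set intersection, assuming each record stores every attribute as a set of values (an empty set standing for a missing value). The record layout and gene contents are hypothetical.

```python
def agree(r1, r2, attribute):
    """True if the two records share at least one value for the attribute."""
    return bool(r1[attribute] & r2[attribute])

def neighbors(r, records, attribute):
    """All records that agree with r on the given attribute."""
    return [other for other in records if agree(r, other, attribute)]

# Toy records (hypothetical attribute values).
gene1 = {"complex": {"c1", "c2"}, "motif": set()}
gene2 = {"complex": {"c2"}, "motif": {"m1"}}
gene3 = {"complex": {"c3"}, "motif": {"m1"}}

found = neighbors(gene1, [gene2, gene3], "complex")
```

A record with a missing value never agrees with anything on that attribute, because the intersection with an empty set is empty.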
Nearest Neighbor Assignment by Prioritizing Attributes • A single attribute may not be sufficient for accurate prediction, particularly in cases where the number of neighbors is large • Ex: Complex -> Class • G235065 agrees with G234126, which is located in the cytoplasm
Computing the nearest neighborhood • We denote the final answer Nm as: NN(r, Rtrain, [g1,…,gm])
Classification by Nearest Neighborhood Analysis • Let obj be an objective attribute, such as Localization, and let Dobj be its domain. • Calculate the objective value of r, r[obj], as the majority of the objective values of the nearest neighbors in NN(r, Rtrain, [g1,…,gm])
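A minimal sketch of prioritized nearest-neighbor classification. The slides do not spell out how NN(r, Rtrain, [g1,…,gm]) is computed, so the rule used here is an assumption: start from all training records and narrow the set attribute by attribute in priority order, skipping any attribute that would leave no neighbors. Records and attribute values are toy data.

```python
from collections import Counter

def agree(r1, r2, attribute):
    """True if the two records share at least one value for the attribute."""
    return bool(r1[attribute] & r2[attribute])

def nearest_neighborhood(r, train, priority):
    """Narrow the training set by each attribute in priority order."""
    current = list(train)
    for attribute in priority:
        narrowed = [t for t in current if agree(r, t, attribute)]
        if narrowed:  # keep the previous set if nothing agrees
            current = narrowed
    return current

def classify(r, train, priority, obj="localization"):
    """Majority objective value among the nearest neighbors."""
    votes = Counter(t[obj] for t in nearest_neighborhood(r, train, priority))
    return votes.most_common(1)[0][0]

train = [
    {"complex": {"c1"}, "class": {"k1"}, "localization": "nucleus"},
    {"complex": {"c1"}, "class": {"k2"}, "localization": "cytoplasm"},
    {"complex": {"c1"}, "class": {"k2"}, "localization": "cytoplasm"},
    {"complex": {"c2"}, "class": {"k1"}, "localization": "nucleus"},
]
query = {"complex": {"c1"}, "class": {"k1"}}

with_one = classify(query, train, ["complex"])
with_two = classify(query, train, ["complex", "class"])
```

In the toy data, Complex alone leaves three neighbors and the majority vote differs from the result after Class narrows the set to one record, which mirrors the Complex -> Class example.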
Computing Optimal Priority • Theorem: It is NP-hard to compute [g1,…,gm] that optimizes accuracy(Rtest, Rtrain, [g1,…,gm]) • Branch-and-bound search technique for solving the optimization problem. • Proof
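To make the optimization problem concrete, the sketch below searches priority lists exhaustively. This is plain enumeration, not the branch-and-bound technique named on the slide, and it is only feasible for a handful of attributes. The neighborhood rule is the same assumed one as before, and all data are hypothetical.

```python
from collections import Counter
from itertools import permutations

def classify(r, train, priority, obj):
    """Prioritized nearest-neighbor vote (assumed narrowing rule)."""
    current = list(train)
    for attribute in priority:
        narrowed = [t for t in current if r[attribute] & t[attribute]]
        if narrowed:
            current = narrowed
    return Counter(t[obj] for t in current).most_common(1)[0][0]

def accuracy(test, train, priority, obj="localization"):
    """Fraction of test records whose predicted objective value is correct."""
    correct = sum(classify(r, train, priority, obj) == r[obj] for r in test)
    return correct / len(test)

def best_priority(test, train, attributes):
    """Try every ordered subset of attributes and keep the most accurate."""
    best = None
    for m in range(1, len(attributes) + 1):
        for priority in permutations(attributes, m):
            score = accuracy(test, train, list(priority))
            if best is None or score > best[0]:
                best = (score, list(priority))
    return best

train = [
    {"complex": {"c1"}, "class": {"k1"}, "localization": "nucleus"},
    {"complex": {"c1"}, "class": {"k2"}, "localization": "cytoplasm"},
    {"complex": {"c1"}, "class": {"k2"}, "localization": "cytoplasm"},
    {"complex": {"c2"}, "class": {"k1"}, "localization": "nucleus"},
]
held_out = [
    {"complex": {"c1"}, "class": {"k1"}, "localization": "nucleus"},
    {"complex": {"c2"}, "class": {"k2"}, "localization": "nucleus"},
]
result = best_priority(held_out, train, ["complex", "class"])
```

The number of ordered subsets grows factorially with the number of attributes, which is why a pruning strategy such as branch-and-bound becomes necessary.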
Experimental Results • The priority list Pmax = [Complex, Class, Interaction, Motif] • Accuracy(Strain, Strain, Pmax) = 79% • Accuracy(Stest, Strain, Pmax) = 72%
Conclusions • Lessons for mining biological databases • It is very surprising that protein interaction information was not more useful in Tasks 2 and 3 • A second lesson is the issue of interacting with the laboratory • General lessons for data mining • Bayes nets should not be rejected out of hand for pure classification tasks • Propositionalization is often a good approach to a relational learning task • There is a need for improved human-computer interaction, and an open question of how to handle a changing distribution over data
Personal Opinion • Current tools and approaches do not adequately address the Genomics Challenge • Handling missing values was the most elaborate and time-consuming step