KDD-2001 Cup: The Genomics Challenge Advisor: Dr. Hsu Graduate: Min-Hong Lin IDSL seminar
Outline • Motivation • Objective • KDD Cup 2001 Report • Task 1: Thrombin Result • Task 2: Predicting Function • Task 3: Localization • Conclusions • Personal Opinion
Motivation • Interest in mining biological databases is growing rapidly. • Bioinformatics datasets are typically under-determined: • a very large number of features (complex domain) • a small number of instances (high cost per data point)
Objective • KDD Cup 2001 focused on mining biological databases. Its tasks related to • drug design • genomics
Dataset 1: Prediction of Molecular Bioactivity for Drug Design (Binding to Thrombin) • Dataset provided by DuPont Pharmaceuticals • Activity of compounds binding to thrombin • Library of compounds (training data): • 1909 known molecules (42 actively binding thrombin) • 139,351 binary features describing the 3-D structure of each compound • 634 new compounds with unknown capacity to bind thrombin (test data)
Dataset 2: Prediction of Gene/Protein Function and Localization • Yeast genome dataset • Data on protein-protein interactions from the MIPS database (Munich Information Centre for Protein Sequences) • The genes encoding 6449 yeast proteins are already known, but only 52% of these proteins have been characterized. • Relational dataset • Gene information • Interaction information • Predict the function and localization of unknown proteins
Statistics: I. Participation • 136 groups participated (200 submissions) • Almost a 5-fold increase over previous years • More than half of the entries came from the commercial sector
Statistics: II. Data Mining Software • Mostly custom software was used • Especially for Task 1, where the number of features was too large for most commercial systems • This gap points to a need for commercial tools that can cope with bioinformatics datasets
Statistics: III. Algorithms • Feature selection was used in almost 70% of the entries for Task 1 • Ensemble classifiers based on more than one algorithm were used extensively • Decision trees were among the most commonly used methods, along with Naïve Bayes and k-NN • Cross-validation was used to deal with the small dataset size
KDD-2001 Cup Winners • Task 1: Jie Cheng, CIBC (Canadian Imperial Bank of Commerce) • Task 2: Mark-A. Krogel, Magdeburg Univ. • Task 3: Hisashi Hayashi, Jun Sese, and Shinichi Morishita, Univ. of Tokyo
Task 1: Thrombin Result • Objective • Prediction of molecular bioactivity for drug design: binding to thrombin • Data • Training: 1909 cases (42 positive), 139,351 binary features • Test: 634 cases • Challenge • Highly imbalanced, high-dimensional, with training and test data drawn from different distributions • Approach • Bayesian network predictive model
Bayesian Network • A Bayesian network B = <N, A, Θ> consists of a directed acyclic graph (DAG) <N, A> and a set of parameters Θ • Each node n ∈ N represents a domain variable • Each arc a ∈ A between nodes represents a probabilistic dependency • Dependencies are quantified using a conditional probability distribution (CP table) θi ∈ Θ for each node ni • A major advantage of BNs is that the network structure represents the inter-relationships among the dataset attributes.
Bayesian network structure of ‘Adult’ data • Two ways to view it: • Represents the joint probability distribution of the attributes • Encodes the conditional independence relationships among the nodes
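The first view, the network as a joint distribution, can be made concrete with a toy example. The sketch below is only an illustration: the three-node network and its CP table values are hypothetical (they are not the 'Adult' network from the slide). It shows the joint probability computed as the product of each node's conditional probability given its parents.

```python
from itertools import product

# Hypothetical toy network (illustrative only): rain -> wet <- sprinkler.
parents = {"rain": (), "sprinkler": (), "wet": ("rain", "sprinkler")}

# CP tables: map a tuple of parent values to P(node = True | parents).
cpt = {
    "rain": {(): 0.2},
    "sprinkler": {(): 0.1},
    "wet": {(True, True): 0.99, (True, False): 0.9,
            (False, True): 0.8, (False, False): 0.05},
}

def joint(assignment):
    """P(assignment) as the product of each node's conditional probability."""
    p = 1.0
    for node, pars in parents.items():
        p_true = cpt[node][tuple(assignment[q] for q in pars)]
        p *= p_true if assignment[node] else 1.0 - p_true
    return p

# Sanity check: the joint distribution sums to 1 over all 2^3 assignments.
total = sum(joint(dict(zip(parents, values)))
            for values in product([True, False], repeat=len(parents)))
```

The missing arc between rain and sprinkler is what encodes their independence, which is the second view listed on the slide.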
Our approach to Thrombin Data • Pre-processing: feature subset selection using mutual information (200 of the 139,351 features were kept)
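A minimal sketch of mutual-information filtering, assuming each feature is scored independently against the class label and the top k are kept. The function names and the toy data are hypothetical; the slide does not say how the winning entry implemented this step.

```python
from math import log2
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

def top_k_features(rows, labels, k):
    """Indices of the k feature columns with the highest MI against the labels."""
    n_features = len(rows[0])
    scores = [mutual_information([r[j] for r in rows], labels)
              for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]

# Toy data: feature 0 copies the label, feature 1 is constant,
# feature 2 is statistically independent of the label.
rows = [(1, 0, 1), (1, 0, 0), (0, 0, 1), (0, 0, 0), (0, 0, 1), (1, 0, 1)]
labels = [1, 1, 0, 0, 0, 1]
selected = top_k_features(rows, labels, k=1)
```

On the real data the same idea would be applied with k = 200 over 139,351 binary columns, where a vectorized implementation would be preferable to this pure-Python loop.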
Learning and evaluating BN models • The BN PowerPredictor system allows users to control the complexity of the learned network by adjusting a threshold value. • The system allows users to choose between two commonly used performance measures: • Prediction accuracy • Area under the ROC curve (AUC) • Five candidate models were generated from the preprocessed training data set • Each of the five candidates had between 2 and 12 features.
Learning and evaluating BN models • We used each candidate to classify the training set and measured its AUC score • We then picked the simplest model that had a “decent” AUC score • [Figure: learned network with nodes Activity, 10695, 91839, 16794, 79651]
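The selection rule above can be sketched as follows. This is an illustration under stated assumptions: the `min_auc` threshold stands in for the unspecified notion of a "decent" score, and the candidate names and AUC values are invented for the example. AUC is computed by pairwise ranking of positives against negatives.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise ranking; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def pick_simplest(candidates, min_auc):
    """Return the candidate with the fewest features whose AUC reaches min_auc."""
    decent = [c for c in candidates if c["auc"] >= min_auc]
    return min(decent, key=lambda c: c["n_features"]) if decent else None

# Hypothetical candidates; these AUC values are made up for illustration.
candidates = [
    {"name": "m1", "n_features": 2, "auc": 0.71},
    {"name": "m2", "n_features": 4, "auc": 0.93},
    {"name": "m3", "n_features": 12, "auc": 0.95},
]
chosen = pick_simplest(candidates, min_auc=0.90)
```

Preferring the smaller model over the slightly better-scoring one is a guard against overfitting when the score is measured on the training set itself.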
Classifying the testing set • Using the chosen model, we computed the posterior probability of each instance in the test dataset. • We then chose a cut point to classify the test cases as either active or inactive • There were 8 possible cut points to choose from: 32, 71, 72, 74, 75, 215, 223, 550 • We decided to classify 223 cases as active
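A small sketch of the cut-point step. It assumes, as a plausible reading of the slide, that the limited set of cut points arises because a small network over binary features yields only a few distinct posterior values, so each distinct value defines one possible number of predicted actives. The posterior values below are invented.

```python
def candidate_cut_points(posteriors):
    """Number of instances at or above each distinct posterior value."""
    return [sum(p >= t for p in posteriors)
            for t in sorted(set(posteriors), reverse=True)]

def classify_top_k(posteriors, k):
    """Label the k instances with the highest P(active) as active (1), others 0."""
    ranked = sorted(range(len(posteriors)),
                    key=lambda i: posteriors[i], reverse=True)
    active = set(ranked[:k])
    return [1 if i in active else 0 for i in range(len(posteriors))]

# Hypothetical posteriors P(active | features) for six test compounds.
posteriors = [0.9, 0.2, 0.9, 0.5, 0.2, 0.05]
cuts = candidate_cut_points(posteriors)
predicted = classify_top_k(posteriors, k=3)
```

Choosing among the candidate counts is then a judgment about the expected fraction of actives in the test set, which the training distribution alone cannot settle when the two distributions differ.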
Analyzing the result • Accuracy: 0.711 • Weighted accuracy: 0.684 • [Figure: ROC curve, sensitivity vs. 1 - specificity]
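The two reported measures can be related with a short sketch. The slide does not define weighted accuracy; the code assumes it is the mean of sensitivity (accuracy on actives) and specificity (accuracy on inactives), which is a common definition for imbalanced data. The labels below are toy values, not the competition results.

```python
def accuracy_metrics(y_true, y_pred):
    """Plain accuracy plus the mean of sensitivity and specificity."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    n_pos = sum(t == 1 for t in y_true)
    n_neg = len(y_true) - n_pos
    sensitivity = tp / n_pos
    specificity = tn / n_neg
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Assumed definition: each class contributes equally.
        "weighted_accuracy": (sensitivity + specificity) / 2,
    }

# Toy labels: 2 actives and 6 inactives.
metrics = accuracy_metrics([1, 1, 0, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 1])
```

Under this definition a classifier that labels everything inactive scores 0.5 on weighted accuracy regardless of class imbalance, which is why it is a more informative measure here than plain accuracy.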
Conclusions • Combining information-gain-based feature filtering with Bayesian-network-based feature selection is a novel, effective approach for analyzing high-dimensional data. • We became aware of the overfitting problem that arises when out-of-sample validation is impossible, especially when the sample size is small. • When a well-defined cost function is not available, one should carefully choose performance measures that are independent of the cost function, such as AUC.
Task 2: Gene/Protein Function Prediction • RELAGGS (a multirelational learning algorithm) was developed at Magdeburg University • RELAGGS is intended to deal with relational data • RELAGGS had previously been tested on relational datasets from financial domains
Preprocessing with SQL • General: renormalize into multiple tables as a natural representation of the data • The genes_relation table contained 862 training examples and 381 test examples. • Specific to KDD Cup tasks 2/3: consider only interactions with high correlations, assume transitivity, make symmetry explicit
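The three task-specific steps can be sketched outside SQL as set operations on gene pairs. This is an illustration only: the 0.5 correlation threshold, the single-step reading of "assume transitivity", and the gene IDs are all assumptions, since the slide gives no such details.

```python
def strong_pairs(interactions, threshold):
    """Keep only pairs whose correlation magnitude reaches the threshold."""
    return {(a, b) for a, b, corr in interactions if abs(corr) >= threshold}

def symmetric_closure(pairs):
    """Make symmetry explicit: add (b, a) for every (a, b)."""
    return set(pairs) | {(b, a) for a, b in pairs}

def one_step_transitive(pairs):
    """Add (a, c) whenever (a, b) and (b, c) are present and a != c."""
    by_source = {}
    for a, b in pairs:
        by_source.setdefault(a, set()).add(b)
    extra = {(a, c) for a, b in pairs
             for c in by_source.get(b, ()) if a != c}
    return set(pairs) | extra

# Hypothetical interaction rows: (gene1, gene2, correlation).
interactions = [("G1", "G2", 0.9), ("G2", "G3", 0.8), ("G3", "G4", 0.1)]
pairs = one_step_transitive(symmetric_closure(strong_pairs(interactions, 0.5)))
```

Here G1 and G3 become linked through G2, while the weak G3-G4 interaction is dropped before any closure is computed.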
Preprocessing with RELAGGS • RELAGGS takes a description of the tables as input • It uses the foreign-key link information to compute join definitions • It automatically transforms multiple tables into a single table with the help of aggregate functions • The resulting table is passed to a propositional learner such as C4.5 or SVMlight
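The idea of collapsing a one-to-many relation into one row per target object can be sketched as below. This is not the RELAGGS implementation: the column names (`gene_id`, `type`, `corr`) and the choice of aggregates (count, min, max, average, and per-value counts) are assumptions made for the example.

```python
from collections import defaultdict

def aggregate(genes, interactions):
    """One row per gene: count, min/max/avg of 'corr', and a count per 'type' value."""
    by_gene = defaultdict(list)
    for row in interactions:
        by_gene[row["gene_id"]].append(row)
    types = sorted({row["type"] for row in interactions})
    table = []
    for g in genes:
        rows = by_gene[g]
        corrs = [r["corr"] for r in rows]
        out = {
            "gene_id": g,
            "n_interactions": len(rows),
            # Genes with no related rows get None (a missing value).
            "corr_min": min(corrs) if corrs else None,
            "corr_max": max(corrs) if corrs else None,
            "corr_avg": sum(corrs) / len(corrs) if corrs else None,
        }
        for t in types:
            out[f"type_{t}_count"] = sum(r["type"] == t for r in rows)
        table.append(out)
    return table

# Hypothetical rows from an interaction table keyed by gene_id.
interactions = [
    {"gene_id": "G1", "type": "physical", "corr": 0.8},
    {"gene_id": "G1", "type": "genetic", "corr": 0.4},
]
table = aggregate(["G1", "G2"], interactions)
```

Each output row has a fixed set of columns regardless of how many related rows a gene has, which is what makes the result usable by a propositional learner.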
Data Mining with SVMlight • An SVMlight run on the RELAGGS output produced model files from the training genes and prediction files for the test genes
Postprocessing with SQL • The predictions for single functions and localizations had to be integrated into a final solution
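One way such an integration could work is sketched below. The slide does not describe the actual rule, so everything here is an assumption: each label is taken to have its own binary classifier producing a signed score, a gene receives every function with a positive score (falling back to the best one), and exactly one localization (the highest-scoring one). The gene IDs, labels, and scores are invented.

```python
def integrate(scores, multi_label):
    """scores: {gene: {label: classifier_score}}. Returns {gene: [labels]}."""
    result = {}
    for gene, per_label in scores.items():
        best = max(per_label, key=per_label.get)
        if multi_label:
            # Keep every positively scored label; fall back to the best one.
            chosen = sorted(label for label, s in per_label.items() if s > 0)
            chosen = chosen or [best]
        else:
            chosen = [best]
        result[gene] = chosen
    return result

# Hypothetical per-label scores.
function_scores = {
    "G1": {"metabolism": 0.4, "transport": 0.7, "growth": -1.0},
    "G2": {"metabolism": -0.2, "transport": -0.9, "growth": -0.5},
}
localization_scores = {
    "G1": {"nucleus": 1.2, "cytoplasm": -0.3},
    "G2": {"nucleus": -0.5, "cytoplasm": -0.1},
}
functions = integrate(function_scores, multi_label=True)
localizations = integrate(localization_scores, multi_label=False)
```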
Conclusion • From 10-fold cross-validation: • Accuracies: 92.9% on task 2, 72.5% on task 3 • From the Cup organizers: • Accuracies: 93.6% on task 2 (rank 1), 69.8% on task 3 (rank 4)
Task 3: Localization • Task • Predict the localization of a given gene in a cell among 15 distinct positions • Data • Relation table with six categorical attributes: Essential, Class, Complex, Phenotype, Motif, Chromosome Number • Interaction matrix listing all the interactions between genes • Training: 862 genes • Test: 381 genes
Characteristics of the Dataset • Dataset 2 has three interesting features: • The dataset contains many missing values • The domain of the objective attribute contains 15 non-ordered values • The dataset is a mixture of two types of data (a relation table and an interaction matrix)
Coping with Missing Values • Class, Complex, and Motif are highly correlated with localization • With regard to the binary interaction relationship: • Genes that interacted with the gene in focus were usually located in the same part of the cell • We compensated for the missing information by using the three attributes and the binary interaction relationship
Different Test Approaches • Three independent approaches were applied to the data analysis: • Decision trees with correlated association rules • AdaBoost (boosting correlated association rules) • Nearest neighbor method • The nearest neighbor method worked best on the training dataset
Nearest Neighbor Analysis • Attribute Agreement of Records • Records r1 and r2 in R are said to agree on attribute fi if r1[fi] and r2[fi] share some common elements: • r1[fi] ∩ r2[fi] ≠ ∅
Neighbors • Two records are neighbors if they agree with respect to certain attributes. • [Figure: example neighbor relationships among Gene1, Gene2, Gene3, Gene4]
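Agreement and the neighbor relation can be expressed directly with set-valued attributes. The sketch below assumes each attribute holds a set of values, with a missing value represented as an empty set (so it never agrees with anything). The attribute values assigned to the four genes are hypothetical.

```python
def agree(r1, r2, attr):
    """True if the two records share at least one value of attr."""
    return bool(r1[attr] & r2[attr])

def neighbors(r, records, attr):
    """All records (other than r itself) that agree with r on attr."""
    return [t for t in records if t is not r and agree(r, t, attr)]

# Hypothetical records; an empty set stands for a missing value.
genes = [
    {"id": "Gene1", "complex": {"c1"}, "motif": {"m1", "m2"}},
    {"id": "Gene2", "complex": {"c1", "c2"}, "motif": set()},
    {"id": "Gene3", "complex": {"c2"}, "motif": {"m2"}},
    {"id": "Gene4", "complex": set(), "motif": {"m3"}},
]
complex_neighbors = [g["id"] for g in neighbors(genes[0], genes, "complex")]
motif_neighbors = [g["id"] for g in neighbors(genes[0], genes, "motif")]
```

The example shows that the neighbor set depends on which attribute is used: Gene1 is a neighbor of Gene2 by Complex but of Gene3 by Motif.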
Nearest Neighbor Assignment by Prioritizing Attributes • A single attribute may not be sufficient for accurate prediction • This is especially so in cases where the number of neighbors is large • Ex: Complex -> Class • G235065 agrees with G234126, located in the cytoplasm
Computing the nearest neighborhood • We denote the final answer Nm as NN(r, Rtrain, [g1,…,gm])
Classification by Nearest Neighborhood Analysis • Let obj be an objective attribute, such as Localization, and let Dobj be its domain. • Calculate the objective value of r, r[obj], from the majority of the objective values of the nearest neighbors in NN(r, Rtrain, [g1,…,gm])
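A minimal sketch of prioritized nearest-neighborhood classification follows. The slides do not spell out how NN(r, Rtrain, [g1,…,gm]) is computed, so the narrowing rule here is an assumption: the candidate set starts as the whole training set and is filtered by each attribute in priority order, skipping any attribute that would leave no candidates. The records are hypothetical.

```python
from collections import Counter

def agree(r1, r2, attr):
    """True if the two records share at least one value of attr."""
    return bool(r1[attr] & r2[attr])

def nearest_neighborhood(r, train, priority):
    """Narrow the candidates attribute by attribute, in priority order.

    Assumption: an attribute that would leave no candidates is skipped,
    so the result is never empty when train is non-empty.
    """
    current = list(train)
    for attr in priority:
        narrowed = [t for t in current if agree(r, t, attr)]
        if narrowed:
            current = narrowed
    return current

def classify(r, train, priority, obj="localization"):
    """Majority vote over the objective attribute of the nearest neighborhood."""
    votes = Counter(t[obj] for t in nearest_neighborhood(r, train, priority))
    return votes.most_common(1)[0][0]

# Hypothetical training records with set-valued attributes.
train = [
    {"complex": {"c1"}, "class": {"k1"}, "localization": "nucleus"},
    {"complex": {"c1"}, "class": {"k2"}, "localization": "cytoplasm"},
    {"complex": {"c1"}, "class": {"k2"}, "localization": "cytoplasm"},
    {"complex": {"c2"}, "class": {"k1"}, "localization": "nucleus"},
]
query = {"complex": {"c1"}, "class": {"k1"}}
by_complex_only = classify(query, train, ["complex"])
by_complex_then_class = classify(query, train, ["complex", "class"])
```

With Complex alone the query has three neighbors and the majority vote says cytoplasm; adding Class as a second priority narrows the neighborhood to one record and changes the answer to nucleus, which illustrates why the order of the priority list matters.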
Computing Optimal Priority • Theorem: It is NP-hard to compute the [g1,…,gm] that optimizes accuracy(Rtest, Rtrain, [g1,…,gm]) • A branch-and-bound search technique is used to solve the optimization problem. • Proof
Experimental Results • The priority list Pmax = [Complex, Class, Interaction, Motif] • Accuracy(Strain, Strain, Pmax) = 79% • Accuracy(Stest, Strain, Pmax) = 72%
Conclusions • Lessons for mining biological databases • It is very surprising that protein interaction information was not more useful in Tasks 2 and 3 • A second lesson is the issue of interacting with the laboratory • General lessons for data mining • Bayes nets should not be rejected out of hand for pure classification tasks • Propositionalization is often a good approach to a relational learning task • There is a need for improved human-computer interaction, and the question of how to handle a changing distribution over data remains open
Personal Opinion • Current tools and approaches do not adequately address the Genomics Challenge • Handling missing values was the most elaborate and time-consuming step