Predicting patterns of biological performance using chemical substructure features
Diego Borges-Rivera
08/04/08
Introduction
[Figure: two example bit strings, 10111010001010101000101101 and 01010001011011]
• cheminformatics allows us to describe similarity between compounds computationally
• synthetic chemists describe similarity through visual inspection
• we will describe compounds by the presence of chemical substructures
• we will attempt to identify sets of substructures that predict biological performance
Previous work
[Figure: axis labeled "substructures", ticks 10 to 60]
• Clemons/Kahne/Wagner et al.: disaccharide profiling in multiple cell states
• found sets of substructures relevant to biological activity patterns
• substructures were highly specific to disaccharides
Biological performance profile
• 400 compounds, 8 assays in duplicate
• tested for cell proliferation in 8 different cell lines
• class labels are active (A) or inactive (I)
[Figure label: active compound]
What are fingerprints?
• compound collection fed into commercial software
• each substructure = 1 bit
• the fingerprint shows which substructures are present
[Figure: example substructures #7017, #886, #1725]
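The "one bit per substructure" idea can be sketched in a few lines. This is only an illustration: the substructure matching itself was done by commercial software, so the sketch assumes the set of matched substructure IDs is already known. The function name `make_fingerprint`, the catalogue ordering, and the matches for the example compound are all hypothetical.

```python
from typing import Iterable, List, Sequence


def make_fingerprint(catalogue: Sequence[int], present: Iterable[int]) -> List[int]:
    """Return one bit per catalogue entry: 1 if that substructure is present."""
    present_set = set(present)
    return [1 if sub_id in present_set else 0 for sub_id in catalogue]


# Substructure IDs taken from the slide; their order here is an assumption.
catalogue = [886, 1725, 7017]
# Hypothetical matches for a single compound (not real assay data).
compound_hits = {886, 7017}

fingerprint = make_fingerprint(catalogue, compound_hits)
print(fingerprint)  # [1, 0, 1]
```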
Overview of cheminformatic methods
• produced fingerprints: 7700 total substructures
• filtered the set
• 2166 substructures left
Overview of computational methods
• two steps, independent of each other:
  • feature (substructure) selection to find predictive subsets
  • evaluate methods for predictive value
ReliefF: substructure selection
[Figure: 2166 ReliefF weights on a scale from -1 to +1, with the top 5 and bottom 5 marked]
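A minimal sketch of how ReliefF-style weights in the range -1 to +1 can arise, assuming binary fingerprints, two classes (A/I), and Tanimoto similarity for finding neighbors. It is a simplified textbook variant, not necessarily the implementation used in this work; the function names and the toy data are hypothetical.

```python
from typing import List, Sequence


def tanimoto(a: Sequence[int], b: Sequence[int]) -> float:
    """Shared on-bits divided by bits on in either fingerprint."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0  # assumed convention for all-zero pairs


def relieff_weights(fps: List[List[int]], labels: List[str], k: int) -> List[float]:
    """Reward bits that differ from nearest misses, penalize bits that differ from nearest hits."""
    n, n_feat = len(fps), len(fps[0])
    weights = [0.0] * n_feat
    for i in range(n):
        # Rank every other compound from most to least similar.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: tanimoto(fps[i], fps[j]), reverse=True)
        hits = [j for j in others if labels[j] == labels[i]][:k]
        misses = [j for j in others if labels[j] != labels[i]][:k]
        for f in range(n_feat):
            if hits:
                weights[f] -= sum(abs(fps[i][f] - fps[j][f]) for j in hits) / (len(hits) * n)
            if misses:
                weights[f] += sum(abs(fps[i][f] - fps[j][f]) for j in misses) / (len(misses) * n)
    return weights


# Toy data (made up): bit 0 tracks the class label, bit 2 is constant.
toy_fps = [[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 1]]
toy_labels = ["A", "A", "I", "I"]
weights = relieff_weights(toy_fps, toy_labels, k=1)
ranked = sorted(range(len(weights)), key=lambda f: weights[f], reverse=True)
print(weights, ranked)
```

Sorting the substructures by weight, as in the last lines, gives the "top" and "bottom" substructures; a bit that always agrees with same-class neighbors and always disagrees with other-class neighbors approaches +1.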
K nearest neighbors (knn): predictive accuracy
• examples: k = 2, 5
[Figure: compound being classified = ?]
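A small sketch of knn classification on fingerprints, assuming Tanimoto similarity and a majority vote among the k most similar training compounds. The tie-breaking rule (prefer the label of the closest neighbor) is an assumption needed for even k such as k = 2; the training data below are made up for illustration.

```python
from collections import Counter
from typing import List, Sequence


def tanimoto(a: Sequence[int], b: Sequence[int]) -> float:
    """Shared on-bits divided by bits on in either fingerprint."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0  # assumed convention for all-zero pairs


def knn_predict(query: Sequence[int], train_fps: List[List[int]],
                train_labels: List[str], k: int) -> str:
    """Label the query by majority vote of its k most similar training compounds."""
    ranked = sorted(range(len(train_fps)),
                    key=lambda i: tanimoto(query, train_fps[i]), reverse=True)
    neighbours = [train_labels[i] for i in ranked[:k]]
    counts = Counter(neighbours)
    top = max(counts.values())
    # Assumed tie-break: among tied labels, take the one seen closest to the query.
    for label in neighbours:
        if counts[label] == top:
            return label
    raise ValueError("no training compounds supplied")


# Made-up training set: two active (A) and two inactive (I) compounds.
train_fps = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
train_labels = ["A", "A", "I", "I"]
print(knn_predict([1, 1, 0, 0], train_fps, train_labels, k=3))  # A
```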
Similarity between compounds
• similarity between two fingerprints
• Tanimoto coefficient
• this is used twice:
  • in ReliefF
  • in knn
Example:
Compound a: 0 0 1
Compound b: 1 0 1
Tanimoto coefficient = (bits set in both) / (bits set in either) = 1 / 2 = 0.5
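The worked example can be checked with a short function. This is a minimal sketch for equal-length bit lists; returning 0.0 when neither fingerprint has any bit set is an assumed convention, since the slide does not cover that case.

```python
from typing import Sequence


def tanimoto(a: Sequence[int], b: Sequence[int]) -> float:
    """Tanimoto coefficient: bits set in both / bits set in either."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0  # assumed convention for all-zero pairs


compound_a = [0, 0, 1]
compound_b = [1, 0, 1]
print(tanimoto(compound_a, compound_b))  # 0.5
```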
Cross-validation: predictive accuracy
• 10 subsets
• test set: one of the subsets
• training set: the remaining subsets
[Figure: test set and training set]
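The split into 10 subsets can be sketched as below. Shuffling the compounds before splitting and the fixed random seed are assumptions; the slide only states that each subset serves once as the test set while the rest form the training set. The name `ten_fold_splits` is hypothetical.

```python
import random
from typing import Iterator, List, Tuple


def ten_fold_splits(n_items: int, n_folds: int = 10,
                    seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training indices, test indices), one pair per fold."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # assumed: random assignment to subsets
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for test in folds:
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


# 400 compounds, as in the biological performance profile.
for train, test in ten_fold_splits(400):
    print(len(train), len(test))  # 360 40 on every fold
```

Predictive accuracy would then be the fraction of test compounds whose predicted class matches the known class, collected over all 10 folds.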
Picking parameters for methods
• which parameters produce the best predictive accuracies?
• number of neighbors used in ReliefF {1, 2, 4, etc.}
• number of neighbors used in knn {1, 2, 4, etc.}
• number of ReliefF substructures used to predict classes in knn {1, 20, 100, etc.}
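One way to search these three parameters is an exhaustive grid, sketched below. The grids contain only the values listed on the slide (the "etc." is not filled in), and the scoring function is supplied by the caller; `placeholder_accuracy` is a made-up stand-in so the sketch runs, not a result from this work. In practice the score would be the cross-validated predictive accuracy.

```python
from itertools import product
from typing import Callable, Sequence, Tuple


def grid_search(evaluate: Callable[[int, int, int], float],
                relieff_ks: Sequence[int],
                knn_ks: Sequence[int],
                n_substructures: Sequence[int]) -> Tuple[Tuple[int, int, int], float]:
    """Try every parameter combination and keep the best-scoring one."""
    best_params, best_score = None, float("-inf")
    for params in product(relieff_ks, knn_ks, n_substructures):
        score = evaluate(*params)
        if score > best_score:  # first combination wins on ties
            best_params, best_score = params, score
    return best_params, best_score


def placeholder_accuracy(relieff_k: int, knn_k: int, n_sub: int) -> float:
    """Made-up score for illustration only; peaks at (2, 4, 20)."""
    return 1.0 / (1 + abs(relieff_k - 2) + abs(knn_k - 4) + abs(n_sub - 20))


best = grid_search(placeholder_accuracy, [1, 2, 4], [1, 2, 4], [1, 20, 100])
print(best)  # ((2, 4, 20), 1.0)
```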
Picking number of substructures
[Figure: predictive accuracy (0.0 to 1.0) versus number of substructures used to predict (1, 20, all)]
Future work
• multi-class
• different feature selection
Acknowledgements
Computational Chemical Biology: Joshua Gilbert, Paul Clemons, Hyman Carrinski
Summer Research Program in Genomics: Shawna Young, Lucia Vielma, Maura Silverstein