QSAR MODELLING OF THE BIODEGRADATION BY HOLISTIC MOLECULAR DESCRIPTORS
P. Gramatica (1), M. Pavan (1), F. Consolaro (1), V. Consonni (2) and R. Todeschini (2)
(1) QSAR Research Unit, Dept. of Structural and Functional Biology, University of Insubria, Varese, ITALY
(2) Milano Chemometrics & QSAR Research Group, Dept. of Environmental Sciences, University of Milano Bicocca, Milano, ITALY
E-mail: paola.gramatica@unimi.it
Web-site: http://fisio.dipbsf.uninsubria.it/dbsf/qsar/QSAR.html

INTRODUCTION
• The environmental fate of a chemical is strictly related to its biodegradability. A good prediction of biodegradation would greatly aid in planning the synthesis of chemicals for environmental uses. In recent years, many approaches have been developed to model biodegradation data for predictive purposes: most of them are based on quantitative structure-biodegradability relationships (QSBR), and mainly on a representation of the structure by molecular fragments (e.g. functional groups, atom counts).
• Our approach to predicting biodegradability is based on a holistic representation of a chemical: a set of molecular descriptors that account not only for local characteristics of a structure but also for its general aspects, which allows extension to multifunctional heterogeneous compounds. Because biodegradation data are highly variable and a single well-defined end-point is difficult to identify, we have applied our descriptors to different aspects of biodegradation: regression modelling of BOD (biochemical oxygen demand), ThOD (theoretical oxygen demand) and degradation rate constants, and classification according to various biodegradability criteria.
MOLECULAR DESCRIPTORS
• The molecular structure has been represented by a wide set of 657 molecular descriptors calculated by the software DRAGON [1]:
  • constitutional descriptors (56)
  • topological descriptors (69)
  • walk counts (20)
  • BCUT descriptors (7)
  • Galvez indices (21)
  • 2D autocorrelation descriptors
  • charge descriptors (7)
  • aromaticity descriptors (4)
  • molecular profiles (40)
  • geometrical descriptors (18)
  • 3D-MoRSE descriptors (160)
  • WHIM descriptors (99) [2]
  • GETAWAY descriptors (196)
  • empirical descriptors (3)
• [1] R. Todeschini and V. Consonni, DRAGON - Software for the calculation of molecular descriptors, Talete s.r.l., Milan (Italy), 2000. Download: http://www.disat.unimib.it/chm
• [2] R. Todeschini and P. Gramatica, 3D-modelling and prediction by WHIM descriptors. Part 5. Theory development and chemical meaning of the WHIM descriptors, Quant. Struct.-Act. Relat., 16 (1997) 113-119.

REGRESSION MODELS
• The regression models were developed on different data sets: 43 alcohols, ketones and aromatic compounds; 28 alcohols and ketones; 15 anilines and phenols; 17 PCBs; and 43 heterogeneous compounds.
• Our representation of a chemical is based on 670 molecular descriptors, thus an effective variable selection strategy is necessary. GA-VSS (Genetic Algorithm - Variable Subset Selection) was applied to the whole set of descriptors in order to select the most relevant variables for modelling the biodegradation end-points by Ordinary Least Squares (OLS) regression.
• The regression models obtained have satisfactory predictive power. All the models were also validated on an external test set, obtained by splitting the original data set into representative training and test sets using different approaches based on structural similarity.

BIODEGRADABILITY CLASSIFICATION
• Different chemometric methods (CART, K-NN and RDA) were used to classify 296 chemicals of environmental concern according to several literature biodegradability criteria, with satisfactory results.
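The variable selection step described under REGRESSION MODELS (a genetic algorithm searching descriptor subsets for an OLS model) can be illustrated with a minimal sketch. This is not the authors' GA-VSS implementation: the fitness function (leave-one-out Q2), the population size, the mutation rate, the cap of three variables per model, the function names `q2_loo` and `ga_vss`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def q2_loo(X, y):
    """Leave-one-out cross-validated Q2 of an OLS model with intercept."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n), X])
    H = A @ np.linalg.pinv(A.T @ A) @ A.T          # hat matrix
    resid = y - H @ y
    h = np.clip(np.diag(H), None, 1.0 - 1e-12)     # leverages, guarded against 1
    press = np.sum((resid / (1.0 - h)) ** 2)       # closed-form LOO residuals
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def ga_vss(X, y, max_vars=3, pop_size=40, generations=50, p_mut=0.05, seed=0):
    """Toy genetic algorithm over boolean descriptor masks (assumed settings)."""
    rng = np.random.default_rng(seed)
    n_desc = X.shape[1]

    def random_mask():
        mask = np.zeros(n_desc, dtype=bool)
        k = rng.integers(1, max_vars + 1)
        mask[rng.choice(n_desc, size=k, replace=False)] = True
        return mask

    def fitness(mask):
        return q2_loo(X[:, mask], y) if mask.any() else -np.inf

    pop = [random_mask() for _ in range(pop_size)]
    for _ in range(generations):
        scores = np.array([fitness(m) for m in pop])
        order = np.argsort(scores)[::-1]
        elite = [pop[i] for i in order[: pop_size // 2]]   # keep the best half
        children = []
        while len(elite) + len(children) < pop_size:
            i, j = rng.choice(len(elite), size=2, replace=False)
            cross = rng.random(n_desc) < 0.5               # uniform crossover
            child = np.where(cross, elite[i], elite[j])
            child ^= rng.random(n_desc) < p_mut            # bit-flip mutation
            on = np.flatnonzero(child)
            if on.size > max_vars:                         # repair oversized models
                drop = rng.choice(on, size=on.size - max_vars, replace=False)
                child[drop] = False
            children.append(child)
        pop = elite + children

    scores = np.array([fitness(m) for m in pop])
    best = int(np.argmax(scores))
    return np.flatnonzero(pop[best]).tolist(), float(scores[best])

# Synthetic stand-in for a descriptor matrix: only columns 3 and 11 matter.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(40, 20))
y = 2.0 * X[:, 3] - 1.5 * X[:, 11] + data_rng.normal(scale=0.1, size=40)

selected, q2 = ga_vss(X, y)
print("selected descriptor columns:", selected, "Q2(LOO) = %.3f" % q2)
```

Keeping the best half of each generation unchanged means the best Q2 found never decreases, and capping the model size keeps the OLS fit well conditioned on small data sets such as the ones listed above.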
• The best subset of variables was selected by Genetic Algorithm (GA-VSS) applied to logistic regression (Rlog), a regression method useful when the possible values of the dependent variable Y are restricted, and by PLS-DA, which confirmed the results previously obtained.
• It is important to point out that the literature criteria disagree in most cases, so we had to compare them in order to find a new general classification criterion for the compounds studied; the comparison was carried out as the scheme below shows.
• All the models, developed on a suitably selected training set, have been validated internally (ER) and externally (ERext).

Training set selection procedure
• Data set: 296 compounds
  • Biodegradability data available: 152 compounds. SPLITTING: training set 77 compounds, test set 75 compounds
  • Biodegradability data not available: 144 compounds. PREDICTION

BEST MODEL PARAMETERS
• HATS5v: leverage-weighted autocorrelation of lag 5 (weighted by atomic van der Waals volumes)
• R8m+: R maximal autocorrelation of lag 8 (weighted by atomic masses)

BEST MODEL PARAMETERS
• BENe6: negative Burden eigenvalue n. 6 (weighted by atomic Sanderson electronegativities)
• Ds: WHIM total accessibility index (weighted by atomic electrotopological states)

Linear Discriminant Analysis (LDA) model
• Variables: nX, nN, P1u, Ku, Dm, ATS2p, MEC, Mor04v
• No Model Error Rate (NOMER): 32.5%
  • nX: number of halogen atoms
  • nN: number of nitrogen atoms
  • P1u: 1st component shape directional WHIM index
  • Ku: global shape WHIM index
  • Dm: total accessibility WHIM index
  • ATS2p: autocorrelation index of a topological structure
  • MEC: molecular eccentricity
  • Mor04v: 3D-MoRSE signal 04 (weighted by atomic van der Waals volumes)
[Tables: confusion matrix in fitting; confusion matrix in prediction]

CONCLUSIONS
• Different kinds of holistic molecular descriptors appear relevant in modelling biodegradability.
In both the regression and the classification models, molecular descriptors that account for global structural properties of the molecules were selected by the Genetic Algorithm as correlated with biodegradability, in some cases together with local descriptors.
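As a note on the figures of merit quoted for the classification models, the error rate (ER) and the No Model Error Rate (NOMER) can both be read off a confusion matrix. The sketch below assumes the usual definitions: ER is the percentage of misclassified objects, and NOMER is the error rate obtained with no model at all, by assigning every object to the largest class. The function names and the example counts are hypothetical and are not the poster's data.

```python
import numpy as np

def error_rate(confusion):
    """Percentage of misclassified objects (rows: true class, columns: assigned class)."""
    c = np.asarray(confusion, dtype=float)
    return 100.0 * (c.sum() - np.trace(c)) / c.sum()

def no_model_error_rate(confusion):
    """Error rate (%) when every object is assigned to the largest true class."""
    c = np.asarray(confusion, dtype=float)
    class_sizes = c.sum(axis=1)
    return 100.0 * (c.sum() - class_sizes.max()) / c.sum()

# Hypothetical two-class confusion matrix (illustrative counts only).
example = [[40, 10],
           [5, 25]]

er = error_rate(example)
nomer = no_model_error_rate(example)
print("ER = %.2f%%, NOMER = %.2f%%" % (er, nomer))
```

A classification model is only informative when its ER, and in particular its external ERext, is clearly lower than NOMER.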