Selection of Molecular Descriptor Subsets for Property Prediction

Inga Paster (a), Neima Brauner (b) and Mordechai Shacham (a)
(a) Department of Chemical Engineering, Ben-Gurion University, Beer-Sheva, Israel
(b) School of Engineering, Tel-Aviv University, Tel-Aviv, Israel

The Needs
• Physicochemical and biological properties are needed for risk assessment, environmental impact assessment, and process design, analysis and optimization.
• About 100,000 compounds are currently used by industry or are of immediate interest to it. The compounds that are theoretically possible and may be of future interest number several tens of millions.
• The Toxic Substances Control Act (TSCA) inventory has 80,000 chemicals. Only 50% have some physicochemical property data, and only 15% have data from genotoxicity bioassays.
• The DIPPR 801 database contains 2101 compounds (33 constant properties, 15 temperature-dependent properties).

Property Prediction Methods
• “Group contribution” methods
• Methods based on the “corresponding-states principle”
• “Asymptotic behavior” correlations (ABCs)
• “Quantitative Structure Property Relationships” (QSPRs), based on the use of molecular descriptors
The existing methods cannot provide satisfactory predictions for certain properties (such as normal melting temperature) or for certain groups of compounds. Research and development of new prediction techniques is therefore essential.

Collinearity Between Vectors of Descriptors of Similar Compounds
Linear relationship between the descriptors. Figure: 99 normalized molecular descriptors of n-heptane versus those of n-hexane.
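The degree of collinearity between two descriptor vectors can be quantified with a correlation coefficient. The sketch below is only an illustration: the function name `pearson_r` and the five descriptor values per compound are made up for the example and are not Dragon output or the authors' code.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length of at least 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)


# Made-up normalized descriptor values for two similar compounds
# (illustrative only; NOT real descriptor values).
compound_1 = [0.10, 0.35, 0.50, 0.72, 0.90]
compound_2 = [0.12, 0.40, 0.57, 0.81, 1.00]

r = pearson_r(compound_1, compound_2)
print(f"correlation between the descriptor vectors: {r:.4f}")
```

A value of r close to 1 indicates that the two vectors are nearly collinear, which is the situation the figure illustrates for n-heptane and n-hexane.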

Collinearity Between Vectors of Properties of Similar Compounds
Linear relationship between the vectors of properties. Figure: selected properties of n-heptane versus those of n-hexane. This is the basis of the QS2PR method (Shacham et al., AIChE J. 50(10), 2481-2492, 2004).

Collinearity Between a Vector of Descriptors and a Vector of Properties for a Group of Similar Compounds
Figure: property values versus the descriptor VRD2 for a group of similar compounds, with the measured value for 3,3-dimethylhexane marked. Prediction error: 0.68 %.
VRD2: average Randic-type eigenvector-based index from distance matrix (eigenvalue-based indices).
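When a single descriptor is collinear with a property over a group of similar compounds, the property of a target compound can be estimated from a straight-line fit. The following is a minimal sketch under that assumption; the helper names `fit_line` and `percent_error` and all numbers are hypothetical and do not reproduce the 3,3-dimethylhexane result quoted on the slide.

```python
def fit_line(x, y):
    """Least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 points")
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("descriptor values must not all be equal")
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    b0 = my - b1 * mx
    return b0, b1


def percent_error(predicted, measured):
    """Absolute prediction error as a percentage of the measured value."""
    return abs(predicted - measured) / abs(measured) * 100.0


# Made-up training data: descriptor value and property value per compound.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
prop = [310.0, 331.0, 349.0, 371.0, 389.0]

b0, b1 = fit_line(descriptor, prop)

# Hypothetical target compound whose property is "unknown" to the model.
target_descriptor = 3.5
predicted = b0 + b1 * target_descriptor
hypothetical_measured = 360.0

print(f"predicted = {predicted:.2f}")
print(f"error = {percent_error(predicted, hypothetical_measured):.2f} %")
```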

Similarity Group (Training Set) of 3,3-dimethylhexane
A measure of the level of group similarity. A similarity group of 10 predictive compounds has been found to be sufficient in most cases. This is the basis of the Targeted QSPR method (Brauner et al., I&EC Research 45, 8430-8437, 2006).
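One way to assemble such a similarity group is to rank candidate compounds by how strongly their descriptor vectors correlate with the target's and keep the top few. The slide does not state which similarity measure is used, so the correlation coefficient here is an assumption, as are the function names and the made-up data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def similarity_group(target_vec, candidates, size=10):
    """Return the names of the `size` candidates most correlated with the target.

    `candidates` maps a compound name to its descriptor vector.
    """
    ranked = sorted(
        candidates,
        key=lambda name: pearson_r(target_vec, candidates[name]),
        reverse=True,
    )
    return ranked[:size]


# Made-up descriptor vectors (illustrative only).
target = [0.20, 0.50, 0.90, 0.40]
candidates = {
    "compound_A": [0.21, 0.52, 0.88, 0.41],
    "compound_B": [0.90, 0.10, 0.30, 0.80],
    "compound_C": [0.25, 0.45, 0.95, 0.35],
}

group = similarity_group(target, candidates, size=2)
print(group)
```

The default `size=10` mirrors the group size the slide reports as sufficient in most cases.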

Collinearity Between a Vector of Descriptors and a Vector of Properties for a Group of Similar Compounds
Figure: collinearity between the descriptor VEv1 and normal boiling temperature for the n-alkanoic acid homologous series.

Sources of Molecular Descriptors and Thermo-Physical Properties
• The molecular geometries were optimized using the CNDO (Complete Neglect of Differential Overlap) semi-empirical method implemented in the HyperChem package.
• The Dragon program (http://www.talete.mi.it) was used to calculate 1664 descriptors for the 340 compounds in the database from minimized-energy molecular models.
• Property data (measured and predicted) were taken from the DIPPR (http://dippr.byu.edu) and NIST (National Institute of Standards and Technology, http://webbook.nist.gov/chemistry) databases.

Descriptor Types Generated by the Dragon Program
3-D descriptors are very sensitive to molecular structure minimization.

Identifying Inaccuracy and Inconsistency Among 1600 Molecular Descriptors
Sources of inaccuracy and inconsistency:
• The descriptor cannot be calculated by DRAGON (reported as -999)
• The descriptor value is set at zero for certain compounds
• Sensitivity of 3-D descriptors to the structure minimization method
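The first two sources of inconsistency can be screened for mechanically. The sketch below assumes the descriptors are held in a dictionary that maps each descriptor name to a list of values (one per compound); that layout, the function name and the rule of rejecting a descriptor that is zero for only some compounds are assumptions for illustration. Sensitivity of 3-D descriptors to the minimization method is not addressed here.

```python
# Value the slide gives for a descriptor that DRAGON cannot calculate.
MISSING_FLAG = -999


def screen_descriptors(table):
    """Split descriptors into usable ones and rejected ones with a reason.

    `table` maps descriptor name -> list of values, one per compound.
    """
    usable, rejected = {}, {}
    for name, values in table.items():
        has_zero = any(v == 0 for v in values)
        all_zero = all(v == 0 for v in values)
        if any(v == MISSING_FLAG for v in values):
            rejected[name] = "not calculated for some compounds"
        elif has_zero and not all_zero:
            # Assumed rule: zero for some compounds but not others is suspect.
            rejected[name] = "zero for some compounds only"
        else:
            usable[name] = values
    return usable, rejected


# Made-up descriptor table with hypothetical descriptor names.
table = {
    "d_ok": [1.2, 2.4, 3.6],
    "d_missing": [1.0, -999, 3.0],
    "d_partly_zero": [0, 2.0, 0],
}

usable, rejected = screen_descriptors(table)
print(sorted(usable))
print(rejected)
```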

Presentation Outline
• Categorizing the molecular descriptors according to the trend of their change with nC for homologous series
• Identifying training sets from compounds belonging to the target compound's homologous series
• Predicting critical properties, normal boiling and melting temperatures, liquid molar volume and refractive index for five homologous series, with and without the use of 3-D descriptors
• Comparison of the results and conclusions

Checking Consistency of Molecular Descriptors – Consistent Change with nC for Homologous Series
The descriptor ADDD changes with nC for the 1-alkene series in a trend similar to the change of liquid molar volume.

Checking Consistency of Molecular Descriptors – Consistent Change with nC for Homologous Series
Figure: normalized values of the descriptors AGDD, ASP and H4m versus nC for the 1-alkene homologous series. The trend is similar to the trend of TC.

Checking Consistency of Molecular Descriptors – Consistent Change with nC for Homologous Series
The descriptor ICR changes with nC for the 1-alkene series in a trend similar to the change of normal melting temperature.

Checking Consistency of Molecular Descriptors – Inconsistent Change with nC for Homologous Series
The descriptor Gm changes with nC for the 1-alkene series in an apparently random manner.
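A simple way to tell consistent from irregular behaviour is to test how a descriptor changes with nC along a homologous series. The sketch below separates only three cases ("constant", "linear increase", "other"); it does not reproduce the presentation's full set of categories, and the function name and thresholds (`rel_tol`, `min_r2`) are arbitrary assumptions.

```python
def classify_trend(n_c, values, rel_tol=1e-6, min_r2=0.999):
    """Label a descriptor's change with nC as 'constant', 'linear increase' or 'other'."""
    if len(n_c) != len(values) or len(n_c) < 3:
        raise ValueError("need at least 3 (nC, value) pairs")
    spread = max(values) - min(values)
    scale = max(abs(v) for v in values) or 1.0
    if spread <= rel_tol * scale:
        return "constant"
    n = len(n_c)
    mx = sum(n_c) / n
    my = sum(values) / n
    sxx = sum((x - mx) ** 2 for x in n_c)
    if sxx == 0:
        raise ValueError("nC values must not all be equal")
    sxy = sum((x - mx) * (y - my) for x, y in zip(n_c, values))
    syy = sum((y - my) ** 2 for y in values)  # non-zero because spread > 0
    slope = sxy / sxx
    r2 = sxy * sxy / (sxx * syy)
    if slope > 0 and r2 >= min_r2:
        return "linear increase"
    return "other"


# Made-up descriptor values along a series with nC = 2..8 (illustrative only).
n_c = [2, 3, 4, 5, 6, 7, 8]
constant_like = [1.5] * 7
linear_like = [2.0 * n + 1.0 for n in n_c]
irregular = [0.3, 0.9, 0.1, 0.7, 0.2, 0.8, 0.4]

for label, vals in [("constant_like", constant_like),
                    ("linear_like", linear_like),
                    ("irregular", irregular)]:
    print(label, "->", classify_trend(n_c, vals))
```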

Trend of Change of Descriptors with nC for Homologous Series
Constant descriptors identify compounds belonging to the homologous series (HS) of the target compound, and linearly increasing descriptors are used to rank those compounds according to their distance from the target.
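The two-step selection described here can be sketched as a filter followed by a sort. Each compound is represented by one constant-type and one linearly increasing descriptor value; the dictionary keys, compound names, tolerance and data below are hypothetical and chosen only to illustrate the idea.

```python
def select_training_set(target, compounds, size=10, tol=1e-9):
    """Pick training compounds from the target's homologous series.

    Each compound is a dict with hypothetical keys:
      'name'     - compound label
      'constant' - value of a descriptor that is constant within a series
      'linear'   - value of a descriptor that increases linearly with nC
    """
    # Step 1: keep compounds whose constant descriptor matches the target's.
    same_series = [
        c for c in compounds
        if c["name"] != target["name"]
        and abs(c["constant"] - target["constant"]) <= tol
    ]
    # Step 2: rank by distance from the target along the linear descriptor.
    same_series.sort(key=lambda c: abs(c["linear"] - target["linear"]))
    return [c["name"] for c in same_series[:size]]


# Made-up data: series A shares constant = 1.0, series B has constant = 2.0.
target = {"name": "A_n6", "constant": 1.0, "linear": 6.0}
compounds = [
    {"name": "A_n4", "constant": 1.0, "linear": 4.0},
    {"name": "A_n5", "constant": 1.0, "linear": 5.0},
    {"name": "A_n7", "constant": 1.0, "linear": 7.0},
    {"name": "A_n9", "constant": 1.0, "linear": 9.0},
    {"name": "B_n6", "constant": 2.0, "linear": 6.0},
]

training = select_training_set(target, compounds, size=3)
print(training)
```

Because Python's sort is stable, compounds at equal distance from the target keep their input order.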

Prediction of TC, Tb and RI (Refractive Index) for n-alkanes, 1-alkenes, n-alkylbenzenes, 1-alcohols and n-alkanoic acids
In about 93 % of the cases, a descriptor of category IIIA is the dominant descriptor (the first to enter, out of one or two). Exception: 3-D descriptors for 1-alcohols (category IV).

Prediction of VC and Vm (Liquid Molar Volume) for n-alkanes, 1-alkenes, n-alkylbenzenes, 1-alcohols and n-alkanoic acids
In 90 % of the cases, descriptors of category II are used. Exception: 3-D descriptors for 1-alkenes and 1-alcohols (category IV).

Prediction of PC and Tm (Melting Point) for n-alkanes, 1-alkenes, n-alkylbenzenes, 1-alcohols and n-alkanoic acids
Descriptors of category IIIA are used in 40 % of the cases, category IV in 35 %, category V in 20 % and category II in 5 %.

Uncertainty (%) in Predicting Various Properties Without 3-D Descriptors
The large prediction errors in VC (and PC) are caused by the uncertainty of the DIPPR data. The errors in the melting point are caused by the irregular shape of the melting point curve (3-D descriptors are needed for this property).

Conclusions
• The Dragon descriptors were divided into seven categories according to the trend of their change as a function of nC in homologous series.
• It was observed that 3-D descriptors may exhibit very irregular (or even random) behavior.
• The exclusive use of descriptors of two categories, “Constant” and “Linear Increase”, enabled selection of training sets belonging to the target compound’s homologous series.
• Applying the proposed method to predict 7 properties for 5 homologous series has shown that most properties can be predicted at the experimental uncertainty level without using 3-D descriptors. This extends the method’s applicability, increases its reliability and reduces the probability of “chance correlations”.
