Data Mining Approaches in Atomistic Modeling
H. Aourag
URMER, University of Tlemcen
AMASS – 7/25/03
Outline
• Introduction
• Ex 1: Intergranular Embrittlement of Fe
• Ex 2: Catalytic Activity - Hydrogenation
• Ex 3: Stainless Steel Cr_xNi_yFe_(1-x-y)
• Ex 4: Conductivity T7 7xxx Al Alloys
• Ex 5: Boiling Points
• Ex 6: Crystal Structure Prediction – open questions…
Predicting Properties with Atomistic Modeling
[Figure: routes from atomistic modeling to macroscopic properties]
• Atomistic modeling: atom positions, electronic structure, energies
• Macroscopic properties: elastic properties, conductivity, toxicity
• Direct calculation: band gap, elastic constants
• Physical laws, constitutive relations: segregation energies, activation barriers → embrittlement, transport
• Data mining: atomic scale descriptors → weldability, toxicity
Power of Data Mining
Use known data to establish R: calculated atomistic properties database → R → measured macroscopic properties database
• Does not require complete and accurate multiscale theories
• New physics in relationships R
• Quick, cheap screening for desired properties, errors, etc.; can be qualitative
Use R to predict new data: calculated atomistic properties database → R → predicted macroscopic properties database
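The two steps on this slide (establish R from known data, then apply R to new systems) can be sketched with the simplest possible R, a one-descriptor least-squares line. This is only an illustration: the function names `fit_linear` and `make_R` and every number below are made-up placeholders, not data or methods from the talk.

```python
# Sketch: establish a relationship R from known data, then use it to predict.
# All values are illustrative placeholders, not data from the talk.

def fit_linear(xs, ys):
    """Least-squares fit of y = a*x + b for a single descriptor x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

def make_R(a, b):
    """Wrap the fitted coefficients as a callable relationship R."""
    return lambda x: a * x + b

# Step 1: known systems (calculated descriptor, measured property).
known_descriptor = [0.0, 1.0, 2.0, 3.0]
known_property = [1.0, 3.0, 5.0, 7.0]
a, b = fit_linear(known_descriptor, known_property)
R = make_R(a, b)

# Step 2: new systems where only the calculated descriptor is available.
new_descriptor = [4.0, 5.0]
predicted_property = [R(x) for x in new_descriptor]
```

In practice R could be any of the regression, neural-network, or clustering approaches listed later in the deck; the workflow stays the same.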
Key Issues
Atomic scale descriptors → data mining → macroscopic properties
• Descriptors accessible to modeling
• Descriptors optimally chosen
  • Use known relationships/physics
  • Optimize from large set of possibilities
• Descriptors→Property relationship is robust
  • Sensible choice of methods
  • Tested with cross validation, test sets
• Data
  • Large enough
  • Clean enough
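One common way to test robustness with cross validation is leave-one-out: refit the relationship with each sample held out in turn and score the prediction for the held-out sample. The sketch below assumes a one-descriptor linear relationship; `fit_line`, `loo_cv_rmse`, and the data are hypothetical and only illustrate the idea.

```python
# Sketch: leave-one-out cross-validation of a one-descriptor linear fit.
# Data values are illustrative placeholders.

def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def loo_cv_rmse(xs, ys):
    """Root-mean-square error of predictions for each held-out sample."""
    squared_errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_line(train_x, train_y)
        squared_errors.append((a * xs[i] + b - ys[i]) ** 2)
    return (sum(squared_errors) / len(squared_errors)) ** 0.5

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
clean = [1.0, 3.0, 5.0, 7.0, 9.0]      # exactly linear
noisy = [1.0, 3.0, 5.0, 7.0, 20.0]     # last point is an outlier

rmse_clean = loo_cv_rmse(xs, clean)
rmse_noisy = loo_cv_rmse(xs, noisy)
```

A large cross-validation error relative to the fit error on the full data set is a warning sign for overfitting or for data that is not clean enough.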
Ex 1: Intergranular Embrittlement of Fe
• Property: Fe embrittlement
• Descriptors→Property relationship: Embrittlement ~ [grain boundary segregation energy - free surface segregation energy] = (E_GB – E_FS) (Rice ’89)
• Descriptors: (E_GB – E_FS) (calculated ab initio)
• Data: Embrittling potency for B, C, P, S
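The relationship above reduces to the sign of a single energy difference. The sketch below assumes one particular sign convention (segregation energies are more negative for stronger binding, so a positive E_GB - E_FS means the impurity prefers the free surface and acts as an embrittler); check the convention of the source calculation before reusing it. The function names and the impurity energies are hypothetical placeholders, not calculated values.

```python
# Sketch: classify an impurity from the difference of two segregation energies.
# ASSUMED convention: more negative energy = stronger binding, so
# E_GB - E_FS > 0 means the free surface is preferred (embrittler).

def embrittling_potency(e_gb, e_fs):
    """Difference between grain-boundary and free-surface segregation energies."""
    return e_gb - e_fs

def classify(e_gb, e_fs):
    delta = embrittling_potency(e_gb, e_fs)
    if delta > 0:
        return "embrittler"
    if delta < 0:
        return "cohesion enhancer"
    return "neutral"

# Hypothetical impurities with placeholder energies (eV); not real data.
impurities = {
    "impurity_X": (-1.0, -2.0),
    "impurity_Y": (-2.0, -1.0),
}
labels = {name: classify(e_gb, e_fs) for name, (e_gb, e_fs) in impurities.items()}
```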
Ex 1: Intergranular Embrittlement of Fe (Wu, et al., Phys. Rev. B, ’96)
• Also correctly predicts effect of Mn and Mo on P embrittlement! (Zhong, et al., Phys. Rev. B, ’97; Geng, et al., Solid State Comm., ’01)
Ex 2: Catalytic Activity - Hydrogenation
• Property: Reaction rates (hydrogenation of ethene, benzene on 3d transition metal M)
• Descriptors→Property relationship: Adapted Brønsted-Evans-Polanyi free energy + Langmuir-Hinshelwood rate equations: Rate = R[E_MC, 12 fitting “constants” independent of M]
• Descriptors:
  • E_MC = M-C bond strength in bulk NaCl structure (calculated ab initio)
  • 12 fitting “constants” (fit to experimental data for each reaction)
• Data: 10-20 reaction rates for each of ethene and benzene
Ex 2: Catalytic Activity - Hydrogenation (Toulhoat, et al. ’02)
[Figure: two panels plotted against E_MC. Ethene: C2H4 + H2 → C2H6. Benzene: C6H6 + 3H2 → C6H12. Legend: cross-validation in black; cross-validation with alloys.]
Ex 3: Stainless Steel Cr_xNi_yFe_(1-x-y)
• Property: High hardness and ductility
• Descriptors→Property relationship:
  • Hardness ~ shear modulus = G
  • Ductility ~ bulk modulus/shear modulus = B/G
• Descriptors: B, G (from ab initio)
• Data: Not clearly defined
Hardness vs. Shear Modulus (Teter, MRS Bulletin, ’98)
[Figure: Vickers hardness [GPa] plotted against shear modulus [GPa]]
Ex 3: Stainless Steel Cr_xNi_yFe_(1-x-y) (Vitos, et al., Nature Materials, ’02)
[Figure: maps of shear modulus G and bulk modulus B versus Cr (at%) and Ni (at%), shaded from low to high]
• Conflict! High G (hard) vs. high B/G (ductile)
• Optimal at ~Cr18Ni24Fe58 (multiple patents)
• Predict improved mechanical properties for Ir, Os doping
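Because high G (hardness) and high B/G (ductility) pull in different directions, one simple way to screen compositions is to keep only those that are not beaten on both measures by another candidate (a Pareto front). This is a sketch of that screening idea, not the procedure used in the cited work; the candidate names and modulus values are hypothetical placeholders.

```python
# Sketch: screen candidates on two competing descriptors, G and B/G.
# Candidate names and moduli (GPa) are illustrative placeholders.

def pareto_front(candidates):
    """candidates: list of (name, B, G). Return names not dominated in (G, B/G)."""
    scored = [(name, g, b / g) for name, b, g in candidates]
    front = []
    for name, g, ratio in scored:
        dominated = any(
            (g2 >= g and r2 >= ratio) and (g2 > g or r2 > ratio)
            for _, g2, r2 in scored
        )
        if not dominated:
            front.append(name)
    return front

candidates = [
    ("alloy_A", 160.0, 80.0),   # high G, lower B/G
    ("alloy_B", 180.0, 60.0),   # lower G, high B/G
    ("alloy_C", 150.0, 60.0),   # beaten by alloy_B on B/G at equal G
]
front = pareto_front(candidates)
```

Choosing a single optimum from the front still requires weighting hardness against ductility, which is a design decision outside this sketch.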
Ex 4: Conductivity T7 7xxx Al Alloys
• Property: Electrical conductivity σ
• Descriptors→Property relationship:
  • Linear: σ = V*d (requires only fitting)
  • Neurofuzzy: σ = NF(d) (requires only fitting)
  • Physical: σ = P(d) (requires thermodynamic models of relevant phases, Rayleigh–Maxwell equation for resistivity with dispersed particles, Starink-Zahra equation for precipitation, 1D diffusion equation, Matthiessen’s rule for resistivity with dissolved elements)
• Descriptors: Concentrations, ageing time: d = (x_Zn, x_Mg, x_Cu, x_Zr, x_Fe, x_Si, t)
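Of the ingredients in the physical model, Matthiessen's rule is the easiest to show in isolation: in its dilute-solution form, each dissolved element adds a resistivity contribution proportional to its concentration, and conductivity is the reciprocal of the total. The sketch below covers only that one ingredient. The function names, the matrix resistivity, and the per-element coefficients are hypothetical placeholders in arbitrary units, not fitted values from the cited study.

```python
# Sketch: Matthiessen's rule for resistivity with dissolved elements
# (dilute-solution form). All numbers are placeholders in arbitrary units.

def resistivity(rho_matrix, solute_fractions, coefficients):
    """rho = rho_matrix + sum_i r_i * x_i over dissolved elements."""
    return rho_matrix + sum(
        coefficients[element] * x for element, x in solute_fractions.items()
    )

def conductivity(rho_matrix, solute_fractions, coefficients):
    """Conductivity is the reciprocal of the total resistivity."""
    return 1.0 / resistivity(rho_matrix, solute_fractions, coefficients)

rho_matrix = 2.0
coefficients = {"Zn": 10.0, "Mg": 50.0}   # hypothetical r_i values
in_solution = {"Zn": 0.02, "Mg": 0.01}    # hypothetical dissolved fractions

rho_alloy = resistivity(rho_matrix, in_solution, coefficients)
sigma_alloy = conductivity(rho_matrix, in_solution, coefficients)
sigma_pure = conductivity(rho_matrix, {}, coefficients)
```

During ageing, solute leaves the matrix to form precipitates, so the dissolved fractions fall and the conductivity rises; tracking that change is what the precipitation and diffusion models in the list are for.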
Ex 4: Conductivity T7 7xxx Al Alloys
• σ measured for 36 concentration/ageing time samples (Starink, et al., ’00)
Ex 5: Boiling Points (Quantitative Structure-Property Relationships: QSPR)
• Property: Boiling point T_B
• Descriptors→Property relationship: Neural network (10:18:1, sigmoid, backpropagation)
• Descriptors: Electrostatic and structural properties (calculated with semiempirical VAMP – AM1)
• Data: T_B for 6629 molecules containing elements H, B, C, N, O, F, Al, Si, P, S, Cl, Zn, Ge, Br, Sn, I, Hg
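A 10:18:1 network means 10 input descriptors, one hidden layer of 18 sigmoid neurons, and a single output. The sketch below shows that architecture trained by plain backpropagation on synthetic data. It makes several assumptions not stated on the slide: a linear output neuron, a squared-error loss, per-sample gradient steps, and a made-up target function. It is not the network from the cited paper.

```python
# Sketch: 10:18:1 feed-forward network, sigmoid hidden layer, backpropagation.
# ASSUMPTIONS: linear output, squared-error loss, synthetic training data.
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class Net:
    def __init__(self, n_in=10, n_hid=18, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hid)]
        self.b1 = [0.0] * n_hid
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hid)]
        self.b2 = 0.0

    def forward(self, x):
        """Return hidden activations and the scalar output."""
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = sum(w * hj for w, hj in zip(self.w2, h)) + self.b2
        return h, y

    def train_step(self, x, target, lr=0.05):
        """One gradient step on the loss 0.5 * (y - target)^2."""
        h, y = self.forward(x)
        err = y - target
        for j, hj in enumerate(h):
            # Gradient w.r.t. hidden pre-activation, using w2[j] before its update.
            grad_z = err * self.w2[j] * hj * (1.0 - hj)
            self.w2[j] -= lr * err * hj
            self.b1[j] -= lr * grad_z
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * grad_z * xi
        self.b2 -= lr * err

def mean_loss(net, data):
    return sum(0.5 * (net.forward(x)[1] - t) ** 2 for x, t in data) / len(data)

# Synthetic data: 10 random descriptors per sample, made-up target.
rng = random.Random(1)
data = []
for _ in range(40):
    x = [rng.uniform(-1.0, 1.0) for _ in range(10)]
    data.append((x, 2.0 + sum(x) / 10.0))

net = Net()
loss_before = mean_loss(net, data)
for epoch in range(60):
    for x, t in data:
        net.train_step(x, t)
loss_after = mean_loss(net, data)
```

For real boiling-point data the descriptors and targets would normally be scaled first, and the error would be monitored on a held-out test set, as on the results slide.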
Data Mining Descriptors→Property Relationships
• Many general approaches
  • Graphical
  • Linear regressions (normal least squares, principal component regression, partial least squares, …)
  • Neural networks (perceptrons, feed-forward, radial-basis, …)
  • Clustering (k-means, nearest-neighbor, …)
• Many choices in each approach
  • Neural networks:
    • Number of neurons/layers, e.g. 3:4:1 [Figure: network diagram, In → Out]
    • Transfer functions: step, sigmoid, tansig, etc.
    • Training method: backpropagation algorithms
• Thousands of possible approaches!
  • Many yield similar results
  • Appropriate for different situations
  • Problem dependent - much art!!
Descriptors
Charged partial surface area descriptors, Accelrys QSAR module (http://www.accelrys.com/cerius2/descriptor.html#list)
1. Partial positive surface area (sum of the surface area of positive atoms)
2. Partial negative surface area (sum of the surface area of negative atoms)
3. Total charge weighted positive surface area (descriptor 1 multiplied by the total positive charge)
4. Total charge weighted negative surface area (descriptor 2 multiplied by the total negative charge)
5. Atomic charge weighted positive surface area (sum of sasa*charge for all positive atoms)
6. Atomic charge weighted negative surface area (sum of sasa*charge for all negative atoms)
7. Difference in charged surface areas (descriptor 1 - descriptor 2)
8. Difference in total charge weighted surface areas (descriptor 3 - descriptor 4)
9. Difference in atomic charge weighted surface areas (descriptor 5 - descriptor 6)
10-15. Fractional charged partial surface areas (6 descriptors divided by total surface area)
16-21. Surface weighted charged partial surface areas (6 descriptors multiplied by total surface area)
22. Relative positive charge (charge of most positive atom divided by total positive charge)
23. Relative negative charge (charge of most negative atom divided by total negative charge)
24. Relative positive charge surface area (surface area of most positive atom divided by descriptor 22)
25. Relative negative charge surface area (surface area of most negative atom divided by descriptor 23)
26. Total hydrophobic surface area (sum of surface areas of atoms with |charge| < 0.2)
27. Total polar surface area (sum of surface areas of atoms with |charge| > 0.2)
28. Relative hydrophobic surface area (descriptor 26 divided by total surface area)
29. Relative polar surface area (descriptor 27 divided by total surface area)
30. Total solvent-accessible surface area
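Several of the descriptors in this list follow directly from two per-atom quantities: the partial charge and the solvent-accessible surface area (sasa). The sketch below computes descriptors 1 to 9, 26, 27, and 30 as they are worded in the list. It assumes the per-atom charges and areas are already available from a prior calculation; the function name, the dictionary keyed by list position, and the three-atom input are illustrative and are not the Accelrys API.

```python
# Sketch: charged partial surface area descriptors from per-atom data.
# Keys are the positions of the descriptors in the list above.
# Input values are illustrative placeholders.

def cpsa_descriptors(charges, areas, threshold=0.2):
    pos = [(q, a) for q, a in zip(charges, areas) if q > 0]
    neg = [(q, a) for q, a in zip(charges, areas) if q < 0]
    d = {}
    d[1] = sum(a for _, a in pos)                 # partial positive surface area
    d[2] = sum(a for _, a in neg)                 # partial negative surface area
    d[3] = d[1] * sum(q for q, _ in pos)          # total charge weighted positive
    d[4] = d[2] * sum(q for q, _ in neg)          # total charge weighted negative
    d[5] = sum(q * a for q, a in pos)             # atomic charge weighted positive
    d[6] = sum(q * a for q, a in neg)             # atomic charge weighted negative
    d[7] = d[1] - d[2]
    d[8] = d[3] - d[4]
    d[9] = d[5] - d[6]
    d[26] = sum(a for q, a in zip(charges, areas) if abs(q) < threshold)  # hydrophobic
    d[27] = sum(a for q, a in zip(charges, areas) if abs(q) > threshold)  # polar
    d[30] = sum(areas)                            # total solvent-accessible area
    return d

charges = [0.5, -0.5, 0.1]
areas = [10.0, 20.0, 5.0]
descriptors = cpsa_descriptors(charges, areas)
```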
Descriptors
• Many broad categories: composition, topological, electronic, physical-chemical properties, …
• Thousands of possible descriptors
• Use physical knowledge to choose relevant ones (e.g., QSAR principle)
• Use numerical methods to choose important descriptors
Ex 5: Boiling Point Descriptors (Chalk, et al., J. Chem. Inf. Comput. Sci., ’01)
Ex 5: Atomistic Modeling Methods
Use VAMP – AM1 and PM3 Hamiltonians
• Semi-empirical, molecular orbital based
• Quantum mechanical, but matrix elements are fit to experimental data
• Can calculate optimized geometries, electronic structure (charge properties)
• Fairly accurate (known failings) and fast
Ex 5: Boiling Points (Chalk, et al., J. Chem. Inf. Comput. Sci., ’01)
[Figure: panels for test set (629) and training set (6000); annotated error values 17 (max -119) and 19 (max -94)]
• Large errors often due to
  • Incorrect experimental measurements of T_B (low pressure)
  • Incorrect experimental structures (tautomer misidentification)
  • Failure of atomistic modeling method (approximation errors)
Ex 6: Crystal Structure Prediction
• Property: Stable crystal structure
• Descriptors→Property relationship: Neighbor clustering algorithm (Euclidean metric)
• Descriptors: Chemical scale (empirically assigned value for each element) (Pettifor, J. Phys. C, ’86)
• Data: All intermetallic binary alloys (thousands)
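The idea of predicting a structure from its neighbors on a map can be illustrated with a k-nearest-neighbor vote under the Euclidean metric. This is a stand-in for the neighbor clustering algorithm named on the slide, whose details are not given here. Each binary compound is represented by a pair of chemical-scale values (one per element); the coordinates below are hypothetical placeholders, not Pettifor's values, and the CsCl/NaCl labels are attached to them only for illustration.

```python
# Sketch: k-nearest-neighbor structure prediction on a 2D structure map.
# Coordinates are made-up chemical-scale pairs, not real element values.
import math
from collections import Counter

def predict_structure(query, known, k=1):
    """known: list of ((scale_A, scale_B), structure). Majority vote of k nearest."""
    ranked = sorted(known, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

known = [
    ((0.1, 0.9), "CsCl"),
    ((0.2, 0.8), "CsCl"),
    ((0.9, 0.1), "NaCl"),
    ((0.8, 0.3), "NaCl"),
]

prediction_a = predict_structure((0.15, 0.85), known)
prediction_b = predict_structure((0.85, 0.20), known)
prediction_c = predict_structure((0.15, 0.85), known, k=3)
```

The prediction is only as good as the coverage of known neighbors, which is why sparse data for multicomponent systems limits this approach.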
Structure Maps (Rodgers, CRYSTMET, ’03)
[Figure: structure map with CsCl and NaCl structure types labeled]
Ex 6: Crystal Structure Prediction (Villars, Intermetallic Compounds, ’94)
• Powerful: structure maps can give 90-95% predictive accuracy
• Many descriptors: ~50 have been tried based on size, atomic number, cohesive energy, electrochemistry, valence electrons
• Can’t be extended: accurate maps require ~40% of the possible systems to be known (~80% binaries known, ~0.1% quaternaries)
• Can atomistic modeling help?
  • Fill in data for multicomponent systems
  • Provide optimal descriptors
Conclusions
• Atomistic modeling and data mining can provide valuable predictive ability when physical theories are incomplete
• Key issues are data quality, descriptors, and descriptor→properties relationship
• Dangers of overfitting and tuning
Bible Code
Are these words closer than by chance? Can the Bible predict future events?
• Some say yes (Witztum, et al., Stat. Sci., ’94)
• Some say no (McKay, et al., Stat. Sci., ’99)
• Many articles
• >60 books on Bible Codes on Amazon
• 1 major motion picture (Omega Code)
Be careful with your statistics!
The First and Greatest Example of Atomic Level Data Mining
END