Learn about the concise linear notation SMILES, widely used for representing atoms, bonds, and connectivity. Discover how to input SMILES and generate canonical forms. Explore examples and algorithms for canonicalizing SMILES. Gain insights on representing reactions and generic structures using SMILES and related concepts like SMIRKS and SMARTS.
SMILES 2 C371 Lecture Based on Dr. David Wild’s C571 Presentations Fall 2004
Linear Notations • Represent the atoms, bonds, and connectivity as a linear text string • SMILES • Concise • Originally designed for manual command-line entry into text-only systems • Now widely used • Can be input to a spreadsheet cell, on one line of a text file, or in an Oracle database text field • System to generate canonical form of SMILES
Review of SMILES • Atoms represented by normal chemical symbols (uppercase for aliphatics, lowercase for aromatic) • Adjacent atoms imply single bonds • Use = for double, # for triple bonds • Hydrogens usually implicit • Parentheses imply branching • Ring closure indicated by numbers
SMILES Review (cont’d) • Can make Hydrogens explicit • Atoms outside the organic subset are put in square brackets, e.g., [Xe] • Charged species also in square brackets with a + or -, e.g., [Na+] or [O-] • Unknown atoms indicated by a * • Stereochemistry represented by @ and @@
SMILES for Tyrosine NC(Cc1ccc(O)cc1)C(=O)O
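The review rules can be checked against the tyrosine string above with a small tokenizer. This is a minimal sketch, not a full SMILES parser: it assumes only the features listed in the review slides (organic-subset atoms, bracket atoms, `*`, `=` and `#` bonds, parentheses, single-digit ring closures), and the names `TOKEN_RE`, `tokenize`, and `count_atoms` are made up for illustration.

```python
import re

# One alternative per SMILES feature from the review slides (assumed subset).
TOKEN_RE = re.compile(r"""
    \[[^\]]+\]      # bracket atom, e.g. [Na+], [nH], [Xe]
  | Br|Cl           # two-letter organic-subset atoms
  | [BCNOPSFI]      # one-letter aliphatic atoms (uppercase)
  | [bcnops]        # aromatic atoms (lowercase)
  | \*              # unknown atom
  | [=#]            # double and triple bonds
  | [()]            # branch open / close
  | \d              # ring-closure digit
""", re.VERBOSE)


def tokenize(smiles):
    """Split a SMILES string into tokens; raise ValueError on anything else."""
    tokens = []
    pos = 0
    while pos < len(smiles):
        match = TOKEN_RE.match(smiles, pos)
        if match is None:
            raise ValueError(f"unexpected character {smiles[pos]!r} at position {pos}")
        tokens.append(match.group())
        pos = match.end()
    return tokens


def count_atoms(smiles):
    """Count atom tokens (hydrogens are implicit, so these are heavy atoms)."""
    return sum(1 for t in tokenize(smiles) if t[0].isalpha() or t[0] in "[*")


tyrosine = "NC(Cc1ccc(O)cc1)C(=O)O"
print(tokenize(tyrosine))
print(count_atoms(tyrosine))  # 13 heavy atoms: 9 C, 1 N, 3 O
```

Bond symbols such as `-`, `/`, `\` and two-digit ring closures (`%10`) are deliberately left out; a string that uses them raises `ValueError` here.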
SMILES for Acetaminophen (Tylenol) CC(=O)Nc1ccc(O)cc1
SMILES for Isatin O=c2[nH]c1ccccc1c2=O
Canonicalizing SMILES – Morgan Algorithm • Each atom has a connectivity value: how many atoms it is connected to • That value is replaced by the sum of the connectivity values of its neighbors • Continues iteratively, until the number of different values is maximized • Atoms are numbered in decreasing order of connectivity value • In case of a tie, other properties are used (e.g., atomic number, bond order, etc.)
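The iteration described on this slide can be sketched in a few lines of Python. This is an illustrative sketch only: the molecule is given as a hypothetical adjacency dictionary (atom index to list of neighbor indices), the function names are invented here, and tie-breaking by atomic number or bond order is not implemented.

```python
def morgan_values(neighbors):
    """Return the extended connectivity value of each atom.

    neighbors maps an atom index to the list of atoms it is bonded to.
    """
    # Initial value: how many atoms each atom is connected to.
    values = {atom: len(nbrs) for atom, nbrs in neighbors.items()}
    n_classes = len(set(values.values()))
    while True:
        # Replace each value by the sum of the neighbors' current values.
        summed = {atom: sum(values[n] for n in nbrs)
                  for atom, nbrs in neighbors.items()}
        n_new = len(set(summed.values()))
        if n_new <= n_classes:
            # No more distinct values than before: keep the previous values.
            return values
        values, n_classes = summed, n_new


def morgan_order(neighbors):
    """Atoms in decreasing order of connectivity value (ties left unresolved)."""
    values = morgan_values(neighbors)
    return sorted(values, key=lambda atom: -values[atom])


# n-pentane heavy-atom chain: 0-1-2-3-4
pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(morgan_values(pentane))  # {0: 2, 1: 3, 2: 4, 3: 3, 4: 2}
print(morgan_order(pentane))   # [2, 1, 3, 0, 4]
```

For pentane the starting values 1, 2, 2, 2, 1 give two classes; one round of summing gives 2, 3, 4, 3, 2 (three classes), and a second round would drop back to two classes, so the iteration stops. Atoms 1 and 3, and atoms 0 and 4, remain tied because they are symmetry-equivalent.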
Canonicalizing SMILES – CANGEN • Two-stage procedure used by Daylight • First stage, CANON, generates a canonical connection table using a modified version of the Morgan Algorithm that produces a tree structure • Second stage, GENES, creates a unique SMILES using a depth-first search of the molecular graph tree output by CANON • More information – JCICS 29, 1989, 97-101
Representing reactions CH4 + 2O2 → CO2 + 2H2O • Need to identify the 2D arrangement of products and reagents and distinguish them • Possibly record which starting material atoms map to which product atoms • Other information (e.g., yield, equilibrium constants, conditions) generally stored separately • Not all reactions specified stoichiometrically
Simple Reaction SMILES • Each reagent and product represented as SMILES • Reagents on the left of a “>>”; products on the right • Individual reagents and products are separated by a “.” CH4 + 2O2 → CO2 + 2H2O Reaction SMILES: C.O=O>>O=C=O.O
Reaction SMILES example • Agents specified between the two “>” characters Reaction SMILES: C.O=O>O=[O+]-[O-]>O=C=O.O
Reaction SMILES example • Note implicit hydrogens Reaction SMILES: C(=O)Cl.NC>>C(=O)NC.Cl
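The layout of the two reaction SMILES examples above (reagents, then optional agents, then products, with “.” between molecules) can be illustrated with a naive splitter. This is a sketch under simple assumptions: it treats every “.” as a molecule separator and does no chemistry checking, and `split_reaction` is a name invented here, not a Daylight function.

```python
def split_reaction(rxn):
    """Split a reaction SMILES into (reagents, agents, products) lists."""
    parts = rxn.split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' characters")
    # An empty middle part (the '>>' form) yields an empty agent list.
    reagents, agents, products = (
        [mol for mol in part.split(".") if mol] for part in parts
    )
    return reagents, agents, products


print(split_reaction("C.O=O>O=[O+]-[O-]>O=C=O.O"))
# (['C', 'O=O'], ['O=[O+]-[O-]'], ['O=C=O', 'O'])
print(split_reaction("C(=O)Cl.NC>>C(=O)NC.Cl"))
# (['C(=O)Cl', 'NC'], [], ['C(=O)NC', 'Cl'])
```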
Atom-mapping SMIRKS representation • Each reactant atom gets a tag (e.g., “C” becomes “[C:1]”) which maps to the same tag on the product side • Hydrogens are explicit SMIRKS: [C:1](=[O:2])[Cl:3].[H:99][N:4]([H:100])[C:0]>>[C:1](=[O:2])[N:4]([H:100])[C:0].[Cl:3][H:99]
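One quick sanity check on a mapped reaction like the one above is that every tag on the reactant side also appears on the product side. The sketch below does this with a regular expression; it is an assumption-laden illustration (it only looks at `:n` tags inside square brackets and ignores the chemistry), and `map_numbers` and `maps_balanced` are hypothetical helper names.

```python
import re

# Matches a bracket atom carrying a map tag, e.g. [C:1] or [H:100].
MAP_RE = re.compile(r"\[[^\]]*:(\d+)\]")


def map_numbers(side):
    """Sorted list of atom-map numbers found on one side of the reaction."""
    return sorted(int(n) for n in MAP_RE.findall(side))


def maps_balanced(smirks):
    """True if reactant and product sides carry exactly the same map tags."""
    reactants, _agents, products = smirks.split(">")
    return map_numbers(reactants) == map_numbers(products)


smirks = ("[C:1](=[O:2])[Cl:3].[H:99][N:4]([H:100])[C:0]"
          ">>[C:1](=[O:2])[N:4]([H:100])[C:0].[Cl:3][H:99]")
print(map_numbers(smirks.split(">")[0]))  # [0, 1, 2, 3, 4, 99, 100]
print(maps_balanced(smirks))              # True
```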
Daylight RS/SMIRKS Sites • Basic reaction representation (Reaction SMILES) • http://www.daylight.com/dayhtml_tutorials/languages/smiles/index.html • SMIRKS introduction • http://www.daylight.com/dayhtml_tutorials/languages/smirks/index.html • SMIRKS theory • http://www.daylight.com/dayhtml/doc/theory/theory.rxn.html • SMIRKS depicter • http://www.daylight.com/daycgi_tutorials/react.cgi
Representing generic structures • A generic structure is one which, by ambiguity, represents a (possibly infinite) set of possible structures • Ambiguity usually takes the form of “R” groups • Originally used for representing patents • Now used for representing combinatorial libraries too • Also known as Markush Structures
Specifying a substructure query with SMARTS • SMARTS: an extension of SMILES that allows partial structures (substructures) and optional parts of molecules to be represented • Simple example *C(=O)O where the * matches any atom and so marks the attachment point to the rest of the molecule • More information: • http://www.daylight.com/meetings/summerschool01/course/basics/smarts.html • http://www.daylight.com/dayhtml/doc/theory/theory.smarts.html
Try out a SMARTS search • DepictMatch: • http://www.daylight.com/cgi-bin/contrib/depictmatch.cgi • Enter a set of SMILES and a SMARTS, and any part of each structure that matches the SMARTS is highlighted • As an example, we’ll use the sample dataset described on the following two slides, with *C(=O)O (carboxyl group) and RC(=O)O (carboxyl attached to a ring) as our SMARTS
Sample dataset Acetaminophen Alprenolol Amphetamine Captopril Chlorpromazine Diclofenac Gabapentin Salicylate
Sample Dataset SMILES file
CC(=O)Nc1ccc(O)cc1 Acetaminophen
CC(C)NCC(O)COc1ccccc1CC=C Alprenolol
CC(N)Cc1ccccc1 Amphetamine
CC(CS)C(=O)N1CCCC1C(=O)O Captopril
CN(C)CCCN1c2ccccc2Sc3ccc(Cl)cc13 Chlorpromazine
OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl Diclofenac
NCC1(CC(=O)O)CCCCC1 Gabapentin
COC(=O)c1ccccc1O Salicylate
Web / Oracle Systems • Advantages • Single database for structures and data • No software to install on client machines (except maybe plug-ins like Chime) • Not dependent on (expensive) contract with MDL • Highly customizable • Disadvantages • Requires extensive web-based interface software to be written, for registration, searching, etc • Company will have to maintain system internally • Requires current ISIS system to be abandoned
Chemistry Cartridges • Daylight DayCart • http://www.daylight.com/products/daycart.html • Tripos Auspyx • http://www.tripos.com/sciTech/inSilicoDisc/chemInfo/auspyx.html • Accelrys Accord for Oracle • http://www.accelrys.com/accord/oracle.html • MDL Direct • http://www.mdl.com/products/framework/rel_chemistry_server/index.jsp • IDBS ActivityBase • http://www.id-bs.com/products/abase/ • JChem Cartridge • http://www.jchem.com
Example - DayCart • Store SMILES as string (VARCHAR2) in Oracle database • Cartridge provides extra functions and extensions to functions for searching based on chemical structures • Structure search implemented by EXACT function • Substructure search implemented by MATCHES function • Similarity search implemented by TANIMOTO and EUCLID functions
Measuring similarity between molecules • Similar Property Principle: “Molecules with similar structure are likely to have similar biological activity” • Generally the Tanimoto Coefficient or Euclidean Distance between fingerprints is used
Fingerprint Similarity – Tanimoto • Also known as Jaccard Coefficient • Tanimoto Similarity = c / (#a + #b - c), where c = ‘1’s in common, #a = ‘1’s in fingerprint A, #b = ‘1’s in fingerprint B • That is, ‘1’s in common / ‘1’s set in either fingerprint • 0’s are treated as not significant • Similarity is between 0 (dissimilar) and 1 (same) • A commonly used cutoff for likely biologically similar molecules is 0.7 or 0.8 • Example: A = 101101011, B = 011101101; c = 4, #a = 6, #b = 6; Tanimoto Similarity = 4 / (6 + 6 - 4) = 0.5
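The Tanimoto example on this slide can be reproduced directly. The sketch below works on fingerprints written as strings of ‘0’ and ‘1’; the function name and the choice to return 0.0 for two all-zero fingerprints are assumptions made here for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity of two equal-length bit strings."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    a = fp_a.count("1")  # '1's in fingerprint A
    b = fp_b.count("1")  # '1's in fingerprint B
    # '1's in common: positions where both fingerprints have a '1'.
    c = sum(x == "1" and y == "1" for x, y in zip(fp_a, fp_b))
    if a + b - c == 0:
        return 0.0  # assumed convention when neither fingerprint has any '1's
    return c / (a + b - c)


print(tanimoto("101101011", "011101101"))  # 4 / (6 + 6 - 4) = 0.5
```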
Fingerprint similarity – Euclidean • Pythagorean distance • For binary dimensions, equivalent to the square root of the Hamming distance (i.e., square root of the number of bits that are different) • 0’s are treated as significant • Smaller values mean more similar • Example: A = 101101011, B = 011101101; the fingerprints differ at 4 bit positions (1, 2, 7 and 8), so Euclidean distance = sqrt(4) = 2.0
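The Euclidean example works out the same way in code. This is a small sketch for bit-string fingerprints; `hamming` and `euclidean` are illustrative names rather than DayCart functions.

```python
import math


def hamming(fp_a, fp_b):
    """Number of bit positions at which two equal-length fingerprints differ."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    return sum(x != y for x, y in zip(fp_a, fp_b))


def euclidean(fp_a, fp_b):
    """Euclidean distance between binary fingerprints: sqrt of the Hamming distance."""
    return math.sqrt(hamming(fp_a, fp_b))


print(hamming("101101011", "011101101"))    # 4
print(euclidean("101101011", "011101101"))  # 2.0
```

Unlike Tanimoto, positions where both fingerprints are ‘0’ count as agreement here, which is why the slide says 0’s are treated as significant.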
Oracle table Test for sample dataset
Smiles                            Name            LogP
------                            ----            ----
CC(=O)Nc1ccc(O)cc1                Acetaminophen   0.27
CC(C)NCC(O)COc1ccccc1CC=C         Alprenolol      2.81
CC(N)Cc1ccccc1                    Amphetamine     1.76
CC(CS)C(=O)N1CCCC1C(=O)O          Captopril       0.84
CN(C)CCCN1c2ccccc2Sc3ccc(Cl)cc13  Chlorpromazine  5.20
OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl    Diclofenac      4.02
NCC1(CC(=O)O)CCCCC1               Gabapentin      -1.37
COC(=O)c1ccccc1O                  Salicylate      2.60
DayCart structure search using SQL
select * from Test where exact(Smiles, 'CC(N)Cc1ccccc1') = 1;
Smiles                            Name            LogP
------                            ----            ----
CC(N)Cc1ccccc1                    Amphetamine     1.76
DayCart substructure search
select * from Test where matches(Smiles, '*C(=O)O') = 1;
Smiles                            Name            LogP
------                            ----            ----
CC(CS)C(=O)N1CCCC1C(=O)O          Captopril       0.84
OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl    Diclofenac      4.02
NCC1(CC(=O)O)CCCCC1               Gabapentin      -1.37
COC(=O)c1ccccc1O                  Salicylate      2.60
Substructure search for carboxylic acid Acetaminophen Alprenolol Amphetamine Captopril Chlorpromazine Diclofenac Gabapentin Salicylate
DayCart substructure / value search
select * from Test where (matches(Smiles, '*C(=O)O') = 1) AND (LogP > 1.0);
Smiles                            Name            LogP
------                            ----            ----
OC(=O)Cc1ccccc1Nc2c(Cl)cccc2Cl    Diclofenac      4.02
COC(=O)c1ccccc1O                  Salicylate      2.60
DayCart similarity search (query structure: Aspirin)
select * from TEST where tanimoto(SMILES, 'CC(=O)Oc1ccccc1C(=O)O') > 0.6;
SMILES                            NAME            LOGP
------                            ----            ----
COC(=O)c1ccccc1O                  Salicylate      2.60
CC(=O)Nc1ccc(O)cc1                Acetaminophen   0.27
CC(N)Cc1ccccc1                    Amphetamine     1.76
Similarity search for carboxylic acid Acetaminophen Alprenolol Amphetamine Captopril Chlorpromazine Diclofenac Gabapentin Salicylate
More examples of DayCart http://www.daylight.com/meetings/summerschool02/course/admin/daycart_hints.html