SMILES. Simplified molecular input line entry specification. The simplified molecular input line entry specification or SMILES is a specification for unambiguously describing the structure of chemical molecules using short ASCII strings
Simplified molecular input line entry specification • The simplified molecular input line entry specification, or SMILES, is a specification for unambiguously describing the structure of chemical molecules using short ASCII strings • SMILES strings can be imported by most molecule editors for conversion back into two-dimensional drawings or three-dimensional models of the molecules
SMILES • Simplified Molecular Input Line Entry System (SMILES) • Widely used AND computationally efficient • Uses atomic symbols and a set of intuitive rules • Uses hydrogen-suppressed molecular graphs (HSMG)
Canonical SMILES and Isomeric SMILES • The term Canonical SMILES refers to the version of the SMILES specification that includes rules for ensuring that each distinct chemical molecule has a single unique SMILES representation • A common application of Canonical SMILES is for indexing and ensuring uniqueness of molecules in a database • The term Isomeric SMILES refers to the version of the SMILES specification that includes extensions to support the specification of isotopes, chirality, and configuration about double bonds • A notable feature of these rules is that they allow rigorous partial specification of chirality.
Graph-based definition • In terms of a graph-based computational procedure, SMILES is a string obtained by printing the symbol nodes encountered in a depth-first tree traversal of a chemical graph • The chemical graph is first trimmed to remove hydrogen atoms and cycles are broken to turn it into a spanning tree • Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes • Parentheses are used to indicate points of branching on the tree
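The traversal described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `write_smiles` and its input format (a list of heavy-atom symbols plus a list of index pairs) are assumptions made for the example, every bond is treated as single, the graph is assumed to be connected, and at most nine rings are supported.

```python
def write_smiles(atoms, bonds):
    """Return a SMILES string for a connected, single-bonded heavy-atom graph.

    atoms: list of organic-subset symbols, hydrogens already removed.
    bonds: list of (i, j) index pairs; every bond is treated as single.
    (Hypothetical input format, chosen only for this sketch.)
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    visited = set()
    children = {i: [] for i in neighbors}   # spanning-tree edges
    closures = {i: [] for i in neighbors}   # ring digits attached to each atom
    broken = set()                          # ring bonds left out of the tree

    def build_tree(node, parent):
        # Depth-first search: unvisited neighbors become tree children,
        # already-visited neighbors mark a ring bond that gets a digit label.
        visited.add(node)
        for nxt in neighbors[node]:
            if nxt == parent:
                continue
            if nxt not in visited:
                children[node].append(nxt)
                build_tree(nxt, node)
            elif frozenset((node, nxt)) not in broken:
                broken.add(frozenset((node, nxt)))
                digit = len(broken)
                closures[node].append(digit)
                closures[nxt].append(digit)

    def emit(node):
        # Print the atom, its ring digits, then its subtrees; every subtree
        # except the last one is a branch and goes inside parentheses.
        text = atoms[node] + "".join(str(d) for d in closures[node])
        kids = children[node]
        for kid in kids[:-1]:
            text += "(" + emit(kid) + ")"
        if kids:
            text += emit(kids[-1])
        return text

    build_tree(0, None)
    return emit(0)


# 2-butanol and cyclohexane as hydrogen-suppressed graphs
butanol = write_smiles(["C", "C", "O", "C", "C"], [(0, 1), (1, 2), (1, 3), (3, 4)])
cyclohexane = write_smiles(["C"] * 6, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)])
```

The string produced depends on the start atom and on the order in which neighbors are listed, which is why a separate canonicalization step is needed to obtain a unique SMILES.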
SMILES Bonds • Single: - (can be omitted) • Double: = • Triple: # • Aromatic: : (can be omitted)
SMILES Branches • Represented by enclosure in parentheses • Can be nested or stacked • Examples: CC(O)CC is 2-Butanol OCC(C)C is iso-Butanol OC(C)(C)C is tert-Butanol
SMILES Bonds • Ethene: C=C • Chloroethene: ClC=C • 1,1-Dichloroethene: ClC(Cl)=C • cis-1,2-Dichloroethene: ClC=CCl (this form does not encode the cis configuration) • Trichloroethene: ClC(Cl)=CCl • Perchloroethene: ClC(Cl)=C(Cl)Cl
SMILES Symbols • String of alphanumeric characters and certain punctuation symbols • Terminates at the first space encountered when read left to right • The ORGANIC SUBSET: B, C, N, O, P, S, F, Cl, Br, I
Other SMILES Atoms • Aliphatic or nonaromatic carbon: C • Atom in aromatic ring: lowercase letter • Designate ring closure with pairs of matching digits, e.g. c1ccccc1 is Benzene, whereas C1CCCCC1 is Cyclohexane
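As a rough illustration of the symbol rules on the last two slides, the sketch below splits a SMILES string into tokens. The name `tokenize` and the token pattern are assumptions for this example; the pattern covers only the organic subset, lowercase aromatic atoms, bracket atoms, bond symbols, parentheses, and single ring digits.

```python
import re

# Two-letter symbols (Cl, Br) must be tried before one-letter ones (C, B).
TOKEN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]|\[[^\]]+\]|[-=#:()./\\]|\d")


def tokenize(text):
    """Split a SMILES string into tokens (hypothetical helper for this sketch)."""
    # The notation terminates at the first space read left to right.
    smiles = text.split(" ", 1)[0]
    tokens = []
    pos = 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"unexpected character {smiles[pos]!r} at position {pos}")
        tokens.append(match.group())
        pos = match.end()
    return tokens


benzene_tokens = tokenize("c1ccccc1 benzene")
chloro_tokens = tokenize("ClC(Cl)=CCl")
```

Anything after the first space, such as the name "benzene" above, is ignored by the tokenizer.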
SMILES Charges • Specify attached hydrogens and charges in square brackets • Number of attached hydrogens is the symbol H followed by optional digit
SMILES Charges • [H+]: proton • [OH-]: hydroxide anion • [OH3+]: hydronium cation • [Fe++]: iron(II) cation • [NH4+]: ammonium cation
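The bracket notation can be unpacked with a small regular expression. The sketch below is an assumption-laden illustration: the helper name `parse_bracket_atom` and the returned tuple layout are made up for this example, and isotopes and chirality marks inside the brackets are deliberately not handled.

```python
import re

# Element symbol, optional H count, optional charge ("+", "++", "-", "+2", ...).
BRACKET = re.compile(
    r"\[(?P<symbol>[A-Z][a-z]?)(?P<h>H\d?)?(?P<charge>[+-]\d|\++|-+)?\]"
)


def parse_bracket_atom(token):
    """Return (symbol, attached_hydrogens, charge) for a bracket atom."""
    match = BRACKET.fullmatch(token)
    if match is None:
        raise ValueError(f"not a supported bracket atom: {token!r}")

    # "H" alone means one hydrogen; "H3" means three.
    h = match.group("h")
    hydrogens = 0 if h is None else (int(h[1:]) if len(h) > 1 else 1)

    # "++" means +2; "+2" means +2; a missing charge means 0.
    charge = match.group("charge") or ""
    if not charge:
        net = 0
    elif charge[1:].isdigit():
        net = int(charge)
    else:
        net = len(charge) if charge[0] == "+" else -len(charge)

    return match.group("symbol"), hydrogens, net


parsed = [parse_bracket_atom(t) for t in ["[H+]", "[OH-]", "[OH3+]", "[Fe++]", "[NH4+]"]]
```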
SMILES Cyclic Structures • Break one single or one aromatic bond in each ring • Number in any order • Designate ring-breaking atoms by the same digit following the atomic symbol
Cyclic Structures • Numbers indicate start and stop of ring • Same number indicates start and end of the ring, entered immediately following the start/end atoms • Only numbers 1 – 9 are used • A number should appear only twice • An atom can be associated with 2 consecutive numbers, e.g., Naphthalene: c12ccccc1cccc2
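To make the digit-matching rule concrete, the following sketch recovers which atoms are joined by each ring-closure digit. The function name `ring_closure_bonds` is hypothetical, atoms are numbered from 0 in the order they appear, and the sketch assumes each digit directly follows its atom, as the slide requires.

```python
import re

# Atom tokens: organic subset, aromatic lowercase atoms, and bracket atoms.
ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]|\[[^\]]+\]")


def ring_closure_bonds(smiles):
    """Return (start_atom, end_atom) index pairs for every ring-closure digit."""
    open_rings = {}   # digit -> index of the atom that opened the ring
    pairs = []
    atom_index = -1
    pos = 0
    while pos < len(smiles):
        match = ATOM.match(smiles, pos)
        if match:
            atom_index += 1
            pos = match.end()
            continue
        ch = smiles[pos]
        if ch.isdigit():
            if ch in open_rings:
                # Second occurrence of the digit closes the ring.
                pairs.append((open_rings.pop(ch), atom_index))
            else:
                # First occurrence opens the ring at the current atom.
                open_rings[ch] = atom_index
        pos += 1
    if open_rings:
        raise ValueError(f"unclosed ring digits: {sorted(open_rings)}")
    return pairs


benzene_rings = ring_closure_bonds("c1ccccc1")
naphthalene_rings = ring_closure_bonds("c12ccccc1cccc2")
```

For naphthalene, the first atom carries both digits, so it appears as the start of both ring-closure bonds.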
SMILES Conventions • Avoid two consecutive left parentheses if possible • Strive for the fewest number of possible branches • Tautomeric bonds are not designated; enter the appropriate form
Further Restrictions • A branch cannot begin a SMILES notation • A branch cannot immediately follow a double- or triple-bond symbol • Example: C=(CC)C is invalid, but • C(=CC)C or C(CC)=C are valid SMILES
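The two restrictions on this slide are easy to check mechanically. The sketch below is only an illustration of those two rules plus a parenthesis balance check; `branch_rule_violations` is a made-up name, and passing this check does not make a string a valid SMILES.

```python
def branch_rule_violations(smiles):
    """Return a list of human-readable problems with branch placement."""
    problems = []

    # Rule 1: a branch cannot begin a SMILES notation.
    if smiles.startswith("("):
        problems.append("a branch cannot begin a SMILES notation")

    # Rule 2: a branch cannot immediately follow a double- or triple-bond symbol.
    for symbol in ("=", "#"):
        if symbol + "(" in smiles:
            problems.append(f"a branch cannot immediately follow '{symbol}'")

    # Extra sanity check (not on the slide): parentheses must balance.
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                break
    if depth != 0:
        problems.append("unbalanced parentheses")

    return problems


invalid = branch_rule_violations("C=(CC)C")
valid_forms = [branch_rule_violations(s) for s in ("C(=CC)C", "C(CC)=C")]
```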
SMILES Fragments • Nitro: N(=O)(=O) • Nitrate: ON(=O)(=O) • Nitrite: ON(=O) • Sulfonic acid: S(=O)(=O)O • Cyanide/Nitrile: C#N • Azide: N=N#N • Azido: N+=N-
SMILES Metals [Al] [As] [Au] [Be] [Bi] [Cd] [Ca] [Fe] [Hg] [K] [Li] [Mg] [Na] [Ni] [Pt] [Sb] [Sn] [Zn] [Zr]
Disconnected Structures • Components are separated by a period • Tetramethylammonium bromide: C[N+](C)(C)C.[Br-]
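A disconnected structure such as a salt can be split into its components at the period, and the bracket charges can be summed to confirm the salt is neutral overall. This is a small sketch under assumptions: `split_components` and `net_charge` are illustrative names, and tetramethylammonium bromide is written here as C[N+](C)(C)C.[Br-], with all four methyl groups attached to the nitrogen.

```python
import re

# A run of + or - signs, optionally followed by one digit, just before "]".
CHARGE = re.compile(r"(\++|-+)(\d?)\]")


def split_components(smiles):
    """Split a SMILES string into its disconnected components."""
    return smiles.split(".")


def net_charge(component):
    """Sum the formal charges written inside square brackets."""
    total = 0
    for signs, digit in CHARGE.findall(component):
        magnitude = int(digit) if digit else len(signs)
        total += magnitude if signs[0] == "+" else -magnitude
    return total


parts = split_components("C[N+](C)(C)C.[Br-]")
charges = [net_charge(p) for p in parts]
```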
Isomeric and Chiral SMILES • Isomeric configuration indicated by forward and backward slashes: / \ • Examples: • trans-1,2-dibromoethene: Br/C=C/Br • cis-1,2-dibromoethene: Br/C=C\Br • Chirality indicated by the “@” symbol
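For the simple pattern shown on this slide, where one substituent is written before the first carbon and one after the second, the two slashes can be compared directly: matching slashes give trans, opposite slashes give cis. The sketch below covers only that pattern; `double_bond_geometry` is a hypothetical helper, and branched or ring-containing forms need a full parser.

```python
import re

# Matches only the linear form  X/C=C/Y  or  X/C=C\Y  (and the \ variants).
SIMPLE_ALKENE = re.compile(r"([/\\])C=C([/\\])")


def double_bond_geometry(smiles):
    """Return 'cis', 'trans', or None when no simple slash pattern is found."""
    match = SIMPLE_ALKENE.search(smiles)
    if match is None:
        return None
    first, second = match.groups()
    # Same direction on both sides: substituents on opposite sides (trans).
    return "trans" if first == second else "cis"


trans_form = double_bond_geometry("Br/C=C/Br")
cis_form = double_bond_geometry("Br/C=C\\Br")   # "\\" is one backslash in Python
unspecified = double_bond_geometry("BrC=CBr")
```

A string without slashes, such as BrC=CBr, leaves the configuration unspecified, which is what makes partial specification possible.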
Another Application • SMILESCAS Database http://esc.syrres.com/interkow/smilecas.htm • Over 103,000 SMILES notations • Input CAS Registry Number • Leads to SMILES and thence to a structure search
Example 1 CC(C(C)(C)(Br))C