Computer Structure Codes (after lectures by Dr. J.M. Barnard)

Computer Structure Codes (after lectures by Dr. J.M. Barnard) • How do you store chemical structures on a computer? • What can you do with them there? • How do the computer systems used in chemical informatics work?

Representing a chemical structure • How much information do you want to include? • atoms present • connections between atoms • bond types (aromatic ring identification) • stereochemical configuration • charges • isotopes • 3D coordinates for atoms

2D structure diagram • chemists’ “natural language” • used by most computer systems for display • shows topology, optionally stereochemistry • several commonly used computer programs allow input/editing of structure diagrams • ISIS/Draw (MDL) http://www.mdl.com • ChemDraw (CambridgeSoft) http://www.cambridgesoft.com/products/ • GRINS/JavaGRINS (Daylight) http://www.daylight.com/products/javatools.html

2D structure diagram • provides 2D pictorial representation of chemical structure • display on screen • cut/paste/embed in Word document etc. • inter-convert with other forms for further processing • database searching • structure analysis • property prediction • database analysis

Registry Numbers • unique identifiers for compounds or substances • catalog number • most chemical databases have them • Chemical Abstracts • Beilstein • private compound registries in pharmaceutical companies • usually just “idiot numbers” • no chemical information • may have hierarchical structure: parent compound -> stereoisomer -> salt -> batch • need to decide what is a separate compound

Line Notations • represent structures as compact linear string of alphanumeric symbols • easily handled by computer • compact storage • easily transmitted over a network • allow rapid manual coding/decoding by trained users • much faster for input than using a structure drawing program

Line Notations: SMILES • Simplified Molecular Input Line Entry System • developed by Dave Weininger (Daylight) • example (tyrosine): OC(=O)C(N)CC1=CC=C(O)C=C1

Other line notations • ROSDAL (Beilstein): Representation Of Structure Diagram Arranged Linearly • 1O-2=3O,2-4-5N,4-6-7=-12-7,10-13O • Sybyl Line Notation (Tripos) • OHC(=O)CH(NH2)CH2C[1]=CHCH=C(OH)CH=CH@1 • Wiswesser Line Notation (WLN) (obsolete) • QVYZ1R DQ

Connection Tables (CTs) • main form of structure representation in computer systems • list atoms and bonds (and other data) as a table • many different formats • “internal” CTs (in memory) • algorithmic processing • “external” CTs (disk files) • archival storage • data exchange between programs

Internal Connection Table • usually “redundant” • every bond shown twice, once for each atom • implemented as array of records • record for each atom might store • atomic type • hydrogen count • formal charge • 2D display co-ordinates • bonds to neighboring atoms • etc.

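“Redundant” Connection Table (tyrosine; each row gives the element, its hydrogen count, then neighbour atom number / bond order pairs; atoms are numbered 1 to 13 by row)
• O 1 2 1
• C 0 1 1 3 2 4 1
• O 0 2 2
• C 1 2 1 5 1 6 1
• N 2 4 1
• C 2 4 1 7 1
• C 0 6 1 8 2 12 1
• C 1 7 2 9 1
• C 1 8 1 10 2
• C 0 9 2 11 1 13 1
• C 1 10 1 12 2
• C 1 11 2 7 1
• O 1 10 1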
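
The redundant table above can be held in memory as an array of records. The following is a minimal Python sketch, assuming each row is read as element, hydrogen count, then (neighbour, bond order) pairs; the names `Atom`, `TYROSINE`, `is_consistent` and `unique_bonds` are illustrative and not taken from any particular system.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    element: str
    h_count: int
    # (neighbour atom number, bond order) pairs; atom numbers are 1-based
    bonds: list = field(default_factory=list)


# Tyrosine, one record per row of the slide's table (assumed interpretation)
TYROSINE = [
    Atom("O", 1, [(2, 1)]),
    Atom("C", 0, [(1, 1), (3, 2), (4, 1)]),
    Atom("O", 0, [(2, 2)]),
    Atom("C", 1, [(2, 1), (5, 1), (6, 1)]),
    Atom("N", 2, [(4, 1)]),
    Atom("C", 2, [(4, 1), (7, 1)]),
    Atom("C", 0, [(6, 1), (8, 2), (12, 1)]),
    Atom("C", 1, [(7, 2), (9, 1)]),
    Atom("C", 1, [(8, 1), (10, 2)]),
    Atom("C", 0, [(9, 2), (11, 1), (13, 1)]),
    Atom("C", 1, [(10, 1), (12, 2)]),
    Atom("C", 1, [(11, 2), (7, 1)]),
    Atom("O", 1, [(10, 1)]),
]


def is_consistent(table):
    """True if every bond i->j of order k is also listed as j->i of order k."""
    for i, atom in enumerate(table, start=1):
        for j, order in atom.bonds:
            if (i, order) not in table[j - 1].bonds:
                return False
    return True


def unique_bonds(table):
    """Collapse the redundant table to one (low, high, order) entry per bond."""
    return sorted({(min(i, j), max(i, j), order)
                   for i, atom in enumerate(table, start=1)
                   for j, order in atom.bonds})
```

The redundancy costs memory but lets an algorithm find all neighbours of an atom directly from that atom's record, which is why internal tables usually accept it.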

MDL Connection Table • proprietary file format developed by MDL • http://www.mdl.com/downloads/latest_releases/index.jsp • de facto standard for exchange of datasets • several different flavours and versions • Molfile (single molecule) • SDfile (set of molecules and data) • RGfile (Markush structure) • Rxnfile (single reaction) • RDfile (set of reactions with data) • separates atoms, bonds into separate blocks
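
To illustrate the separate atom and bond blocks, here is a rough Python sketch that writes and reads a very reduced Molfile-like text in the fixed-column V2000 layout. It is an approximation under stated assumptions: only the counts line, atom symbol, 2D coordinates and bond order are handled, the second header line is a placeholder, and the function names `write_molfile` and `read_molfile` are hypothetical. The vendor's format documentation remains the authority on the exact columns.

```python
def write_molfile(name, atoms, bonds):
    """atoms: list of (symbol, x, y); bonds: list of (atom1, atom2, order), 1-based."""
    lines = [name, "  sketch", ""]  # three header lines (placeholder content)
    # Counts line: atom count and bond count in the first two 3-character fields
    lines.append(f"{len(atoms):3d}{len(bonds):3d}  0  0  0  0  0  0  0  0999 V2000")
    # Atom block: x, y, z in 10-character fields, then the element symbol
    for symbol, x, y in atoms:
        lines.append(f"{x:10.4f}{y:10.4f}{0.0:10.4f} {symbol:<3s} 0" + "  0" * 11)
    # Bond block: first atom, second atom, bond type in 3-character fields
    for a, b, order in bonds:
        lines.append(f"{a:3d}{b:3d}{order:3d}  0  0  0  0")
    lines.append("M  END")
    return "\n".join(lines) + "\n"


def read_molfile(text):
    """Read back only the fields that write_molfile produces."""
    lines = text.splitlines()
    n_atoms, n_bonds = int(lines[3][0:3]), int(lines[3][3:6])
    atoms = [(line[31:34].strip(), float(line[0:10]), float(line[10:20]))
             for line in lines[4:4 + n_atoms]]
    bonds = [(int(line[0:3]), int(line[3:6]), int(line[6:9]))
             for line in lines[4 + n_atoms:4 + n_atoms + n_bonds]]
    return lines[0], atoms, bonds


# Ethanol heavy atoms with arbitrary display coordinates
ETHANOL_ATOMS = [("C", 0.0, 0.0), ("C", 1.5, 0.0), ("O", 2.25, 1.3)]
ETHANOL_BONDS = [(1, 2, 1), (2, 3, 1)]
ETHANOL_TEXT = write_molfile("ethanol", ETHANOL_ATOMS, ETHANOL_BONDS)
```

Note that, unlike the redundant internal table, each bond appears only once in the bond block.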

Standard Connection Table Formats • different vendors have proprietary CT formats • many attempts to establish agreed “standard” formats • no real general success • different user communities have failed to coordinate efforts • some standards exist in restricted areas • SMILES and MDL CT formats widely used • most popular programs read/write several different formats

Standard Connection Table Formats • Standard Molecular Data (SMD) format • never gained wide acceptance • Protein Data Bank (PDB) format • Crystallographic Information File (CIF) • Molecular Information File (MIF) • developed from SMD and compatible with CIF • Chemical Exchange Format (CXF) • Chemical Abstracts Service • Chemical Markup Language (CML) • for data exchange using the Internet • INChI (IUPAC/NIST Chemical Identifier)

Conclusions • There are lots of ways of storing a chemical structure in a computer • including different amounts of information • Most important ones are • line notations (e.g. SMILES) • connection tables (e.g. MDL Molfile) • nomenclature • Structure diagrams used for input/output

Topological Graph Theory • branch of mathematics • particularly useful in chemical informatics and in computer science generally • study of “graphs” which consist of • a set of “nodes” • a set of “edges” joining pairs of nodes

Properties of graphs • graphs are only about connectivity • spatial position of nodes is irrelevant • length of edges is irrelevant • crossing edges are irrelevant

Structure Diagrams as Graphs • 2D structure diagrams very like topological graphs • atoms <=> nodes • bonds <=> edges • terminal hydrogen atoms are not normally shown as separate nodes (“implicit” H) • reduces number of nodes by ~50% • “hydrogen count” information used to colour neighbouring “heavy atom” • separate nodes sometimes used for “special” hydrogens • deuterium, tritium • hydrogen bonded to more than one other atom • hydrogens attached to stereocentres
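
As a small illustration of implicit hydrogens, the Python sketch below stores the heavy-atom graph of tyrosine as coloured nodes and edges and derives each hydrogen count from an assumed table of normal valences. The valence table and the function names are illustrative assumptions; real systems also have to handle charges, radicals and elements with several allowed valences.

```python
# Assumed "normal" valences for neutral atoms (illustrative, far from complete)
NORMAL_VALENCE = {"C": 4, "N": 3, "O": 2}

# Heavy-atom graph of tyrosine: node colours (elements) and edge colours (bond orders)
ELEMENTS = {1: "O", 2: "C", 3: "O", 4: "C", 5: "N", 6: "C", 7: "C",
            8: "C", 9: "C", 10: "C", 11: "C", 12: "C", 13: "O"}
EDGES = {(1, 2): 1, (2, 3): 2, (2, 4): 1, (4, 5): 1, (4, 6): 1, (6, 7): 1,
         (7, 8): 2, (8, 9): 1, (9, 10): 2, (10, 11): 1, (11, 12): 2,
         (7, 12): 1, (10, 13): 1}


def implicit_h_count(element, bond_orders):
    """Hydrogens needed to fill the normal valence, given bonds to heavy atoms."""
    return max(NORMAL_VALENCE[element] - sum(bond_orders), 0)


def h_counts(elements, edges):
    """Return {atom number: implicit hydrogen count} for a heavy-atom graph."""
    orders = {node: [] for node in elements}
    for (a, b), order in edges.items():
        orders[a].append(order)
        orders[b].append(order)
    return {node: implicit_h_count(elements[node], orders[node])
            for node in elements}


COUNTS = h_counts(ELEMENTS, EDGES)
TOTAL_H = sum(COUNTS.values())
```

For tyrosine this gives 11 implicit hydrogens on 13 heavy atoms, so dropping the hydrogen nodes removes 11 of 24 nodes, in line with the ~50% figure.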

Advantages of using graphs • mathematical theory is well understood • graphs can be easily represented in computers • many useful algorithms are known • identical graphs <=> identical molecules • different graphs <=> different molecules

Disadvantages of graphs • analogy between chemical structures and graphs is not perfect • identical graphs <=/=> identical molecules • different graphs <=/=> different molecules • realities of chemical structures cause problems • aromaticity • stereochemistry • tautomerism • coordination compounds • multi-centre bonds • inorganic compounds • macromolecules • polymers • incompletely-defined substances • many graph algorithms are inherently slow

Aromaticity • electronic property of certain ring systems, giving enhanced chemical stability • bonds in aromatic rings have properties that are distinct from those of single and double bonds • generally accepted definition is the Hückel rule • 4n+2 pi-electrons (n is a small non-negative integer) • there are borderline cases • aromaticity causes problems for computer representation • different systems deal with it in different ways
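
The arithmetic part of the Hückel rule is easy to state in code. The Python sketch below only tests whether a given pi-electron count has the form 4n+2; counting the pi-electrons of a real ring system is the hard part and is not attempted here. The function name is illustrative.

```python
def satisfies_huckel(pi_electrons):
    """True if pi_electrons equals 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


# Pi-electron counts supplied by hand for three monocyclic systems
BENZENE = satisfies_huckel(6)             # 6 = 4*1 + 2
CYCLOBUTADIENE = satisfies_huckel(4)      # 4 is not of the form 4n + 2
CYCLOOCTATETRAENE = satisfies_huckel(8)   # 8 is not of the form 4n + 2
```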

Aromaticity problems • using single and double bonds can give different topological graphs for the same compound • one solution is to use an aromatic bond type
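
A short Python sketch of the problem, using a hypothetical 1,2-disubstituted benzene: the two Kekulé forms give different edge labels for the same compound, and relabelling the ring bonds with an aromatic bond type makes the two graphs identical. Which bonds count as aromatic is passed in by hand here; perceiving that automatically is a separate problem.

```python
RING = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (1, 6)]
AROMATIC = "ar"  # label for the extra bond type (an arbitrary choice)


def kekule_form(double_bonds):
    """Bond table for a benzene ring with substituents 7 and 8 on atoms 1 and 2."""
    bonds = {bond: (2 if bond in double_bonds else 1) for bond in RING}
    bonds[(1, 7)] = 1
    bonds[(2, 8)] = 1
    return bonds


def normalise(bonds, aromatic_bonds):
    """Replace the order of every listed aromatic bond by the aromatic label."""
    return {bond: (AROMATIC if bond in aromatic_bonds else order)
            for bond, order in bonds.items()}


FORM_A = kekule_form({(1, 2), (3, 4), (5, 6)})
FORM_B = kekule_form({(2, 3), (4, 5), (1, 6)})
```

Comparing `FORM_A` with `FORM_B` directly reports a difference, while `normalise(FORM_A, RING)` and `normalise(FORM_B, RING)` are equal.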

Alternating bonds and aromaticity • Chemical Abstracts Registry System uses a “normalised” bond type for all rings with alternating single and double bonds • this includes some systems that are not aromatic • and omits some that are

Representing aromaticity • some systems represent aromaticity as an atom property • SMILES allows use of lower-case atomic symbols for aromatic atoms (adjacent aromatic atoms are assumed to be joined by aromatic bonds) • problem: aromaticity is really a ring property

Tautomerism • dynamic equilibrium between positional isomers (labile H) • are they different compounds? • answer depends on what you want to do with them • can use normalised bonds to represent them by a single graph • gets mixed up with ring alternating bonds • some tautomers may be aromatic, when others are not

Tautomerism • tautomerism is a matter of degree • tautomers can be defined in different ways • HQ–X=R <=> Q=X–RH • only certain elements can be Q, X or R • keto-enol tautomers are not recognised by Chemical Abstracts • mono-unsaturated carbon chains are not distinguished by Daylight

Structure conventions • sometimes called “business rules” • some chemical groups can be shown in different but equally valid ways • conventions are needed to determine which is preferred • software may be needed to convert to preferred form

Stereochemistry • different compounds with identical connectivity • same topology, different topography • example: S-tyrosine and R-tyrosine

Stereochemistry • configuration is often unknown • or partially known (relative stereochemistry) • or you may have a mixture of stereoisomers • in which one isomer may occur in enantiomeric excess • many different descriptors used by chemists • wedge (up) and hatched (down) bonds in structure diagrams • Cahn, Ingold, Prelog (CIP) designators (R, S, E, Z) • text-based descriptors (stereoparent, or optical rotation)

Stereochemistry: up/down bonds • can be used as additional “colours” for graph edges • many connection table formats have special codes for up and down bonds • need to know which end of bond is which • useful for re-generating diagrams for display • can be used to calculate other stereo descriptors

Up/down bond problems • different patterns of up/down bonds can show the same stereoisomer • different graphs, same molecule • some patterns of up and down bonds actually convey no useful information about configuration

Stereochemistry: CIP designators • R.S. Cahn, C. Ingold and V. Prelog, Angewandte Chemie Intl. Ed. in English 1966, 5, 385-551 • one-letter designator for stereocenters • based on rules assigning priorities to groups around it • tetrahedral carbons (R, S) • double bonds (E, Z) • additional colors for graph nodes or edges • useful for distinguishing stereoisomers when absolute configuration is known • less useful for matching parts of structures (substructure search) as priority rules can cause designator to change when remote part of structure is changed

Double bond stereo in SMILES • / and \ used as “directional” single bonds • only meaningful when used on both atoms of a double bond • several ways of showing same configuration
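
For the simplest case, the directional bonds can be read off with a few lines of Python. The sketch below only handles strings of the exact form X/C=C/Y (one substituent written on each side, no branches), where equal directions mean the two written substituents are on opposite sides (trans) and unequal directions mean the same side (cis). It is not a SMILES parser, and the function name is illustrative.

```python
import re

# Matches e.g. F/C=C/F or Cl\C=C/Cl; anything more complex is rejected
_SIMPLE = re.compile(r"^([A-Z][a-z]?)([/\\])C=C([/\\])([A-Z][a-z]?)$")


def simple_double_bond_config(smiles):
    """Return 'trans' or 'cis' for X/C=C/Y style SMILES, otherwise None."""
    match = _SIMPLE.match(smiles)
    if match is None:
        return None
    first, second = match.group(2), match.group(3)
    return "trans" if first == second else "cis"
```

With this reading, `F/C=C/F` and `F\C=C\F` both come out as trans, and `F/C=C\F` and `F\C=C/F` both come out as cis, which illustrates the point that the same configuration can be written in several ways.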

Other complications • Organometallic and co-ordination compounds • complex stereochemistry • special bond types may be needed (dative bonds etc.) • ambiguity over covalent/ionic character of bonds • “business rules” usually needed • Inorganic compounds • topological representation often not possible • composition may not involve integral ratios between elements

Macromolecules • in principle can represent all atoms, as for small molecules • some systems use “shortcuts” or “superatoms” for subunits (e.g. amino acids)

Macromolecules • Each shortcut is defined with appropriate attachment points • ordinary atoms can be mixed with shortcuts • system can expand shortcuts when needed

Polymers • special problems are presented because properties of polymer can be affected by polymerisation conditions • average number of subunits • extent of cross-linking • ratio between different subunits • random / block sequences of subunits • etc. • Two main approaches • monomer representation • structural repeating unit (SRU) representation

Incompletely-defined substances • unknown stereochemistry • unknown attachment position • unknown repetition

Markush (“Generic”) structures • structures with R-groups • shorthand for describing sets of structures with common features

Markush structures • also called “generic” structures • very important in chemical patents • inventor claims whole class of related compounds • can be used to describe combinatorial libraries • can be used as queries in database searches
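
The link between a Markush structure and a combinatorial library can be sketched as plain enumeration. In the Python example below, the core and the R-group lists are invented for illustration, and simple string substitution into a SMILES template is assumed to be adequate because each R-group fragment is written starting from its attachment atom. Real Markush structures (variable attachment positions, nested or open-ended definitions) need much more than this.

```python
from itertools import product


def enumerate_markush(template, r_groups):
    """Substitute every combination of R-group fragments into the template."""
    names = sorted(r_groups)
    members = []
    for combo in product(*(r_groups[name] for name in names)):
        members.append(template.format(**dict(zip(names, combo))))
    return members


# Hypothetical generic structure: a para-disubstituted benzene
TEMPLATE = "{R1}c1ccc({R2})cc1"
R_GROUPS = {"R1": ["C", "CC"], "R2": ["O", "N", "Cl"]}
LIBRARY = enumerate_markush(TEMPLATE, R_GROUPS)
```

Two choices for R1 and three for R2 give six specific structures; the count grows multiplicatively with each added R-group, which is why generic structures are usually searched without full enumeration.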

Canonicalization • a given chemical structure (or graph) can have many valid and unambiguous representations • different order of rows in connection table • different order of atoms in SMILES • for comparison purposes it would be useful to have a single unique or “canonical” representation • process of converting input representation to canonical form is called “canonicalization” or “canonization” • process of applying “rules” (i.e. an algorithm)

Canonicalization • an obvious approach: • generate all possible valid SMILES • choose the one that comes first alphabetically • this would be effective in principle but very slow, and there is a danger of missing one of the possible SMILES • principle was used for canonicalizing Wiswesser Line Notation
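
The same "generate everything, keep the first" principle can be shown on connection tables rather than SMILES, which avoids needing a SMILES writer. The Python sketch below tries every atom numbering and keeps the encoding that sorts first; the encoding (element tuple followed by sorted bond list) is an arbitrary choice made for this example. The cost grows as n! with the number of atoms, which is why this is only practical for tiny graphs.

```python
from itertools import permutations


def canonical_code(elements, bonds):
    """elements: list of symbols, atom i is elements[i] (0-based);
    bonds: dict {(i, j): order}. Returns the smallest encoding over all numberings."""
    n = len(elements)
    best = None
    for perm in permutations(range(n)):  # perm[old number] = new number
        new_elements = [None] * n
        for old, new in enumerate(perm):
            new_elements[new] = elements[old]
        new_bonds = sorted((min(perm[i], perm[j]), max(perm[i], perm[j]), order)
                           for (i, j), order in bonds.items())
        code = (tuple(new_elements), tuple(new_bonds))
        if best is None or code < best:
            best = code
    return best


# Ethanol heavy atoms entered in two different orders, and dimethyl ether
ETHANOL_1 = canonical_code(["C", "C", "O"], {(0, 1): 1, (1, 2): 1})
ETHANOL_2 = canonical_code(["O", "C", "C"], {(0, 1): 1, (1, 2): 1})
ETHER = canonical_code(["C", "O", "C"], {(0, 1): 1, (1, 2): 1})
```

Both ethanol inputs give the same code and dimethyl ether gives a different one, so comparing canonical codes is enough to decide whether two inputs describe the same graph.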

Canonicalization • most methods in use today involve renumbering the atoms in some unique and reproducible way • can be used to number rows in connection table • can determine order of atoms in SMILES • normally involve a node labelling technique called “relaxation” • example is Morgan’s algorithm (1965)

Symmetry perception • if ties between label values cannot be resolved on basis of atom/bond types, the atoms are symmetrically equivalent, and it doesn’t matter which is chosen next • Morgan’s algorithm is thus also useful for identifying symmetry in molecules

Morgan’s algorithm • Works by taking more of the graph into account at each iteration • essence of “relaxation” technique is iteratively updating a value by looking at its immediate neighbours • It is not infallible • graphs (“isospectral” graphs) are known where the algorithm cannot distinguish nodes that are not symmetrically equivalent • There are many variations on it • and several theoretical papers analysing it mathematically
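
The relaxation step can be sketched in Python as follows. This is a simplified reading of the extended-connectivity stage only, under these assumptions: each atom starts with its number of heavy-atom neighbours, each iteration replaces a value by the sum of the neighbours' values, and iteration stops as soon as the number of distinct values no longer increases. The later numbering stage and the use of atom and bond types to break ties are not shown, and the function names are illustrative.

```python
def morgan_values(neighbours):
    """neighbours: dict {atom: list of neighbouring atoms}.
    Returns the last set of values before the class count stopped increasing."""
    values = {atom: len(nbrs) for atom, nbrs in neighbours.items()}
    n_classes = len(set(values.values()))
    while True:
        updated = {atom: sum(values[nbr] for nbr in nbrs)
                   for atom, nbrs in neighbours.items()}
        updated_classes = len(set(updated.values()))
        if updated_classes <= n_classes:
            return values
        values, n_classes = updated, updated_classes


def tie_classes(values):
    """Group atoms that ended up with the same value."""
    groups = {}
    for atom, value in values.items():
        groups.setdefault(value, []).append(atom)
    return sorted(sorted(group) for group in groups.values())


# Carbon skeleton of 3-methylpentane: chain 1-2-3-4-5 with atom 6 on atom 3
SKELETON = {1: [2], 2: [1, 3], 3: [2, 4, 6], 4: [3, 5], 5: [4], 6: [3]}
VALUES = morgan_values(SKELETON)
CLASSES = tie_classes(VALUES)
```

For this skeleton the tied atoms are exactly the symmetry-equivalent ones. Equivalent atoms always receive equal values, but equal values do not guarantee equivalence, which is one way of seeing why the method is not infallible.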

Ring perception • How many rings are there in these structures and which ones are they? • rings are important features of chemical structures • nomenclature generation • aromaticity perception • synthetic significance • fragment descriptor generation
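
One standard answer to "how many rings" is the cyclomatic number of the graph: bonds minus atoms plus the number of connected components. The Python sketch below computes it with a small union-find; it says nothing about which particular rings should be chosen, which is the harder part of the question. The function names and the example numberings are illustrative.

```python
def count_components(n_atoms, bonds):
    """Number of connected components; atoms are numbered 1..n_atoms."""
    parent = list(range(n_atoms + 1))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in bonds:
        parent[find(a)] = find(b)
    return len({find(atom) for atom in range(1, n_atoms + 1)})


def ring_count(n_atoms, bonds):
    """Cyclomatic number: bonds - atoms + connected components."""
    return len(bonds) - n_atoms + count_components(n_atoms, bonds)


# Naphthalene skeleton: ten-membered perimeter plus the fusion bond 5-10
NAPHTHALENE = [(i, i + 1) for i in range(1, 10)] + [(10, 1), (5, 10)]
# Cubane skeleton: two squares joined by four vertical bonds
CUBANE = [(1, 2), (2, 3), (3, 4), (4, 1), (5, 6), (6, 7), (7, 8), (8, 5),
          (1, 5), (2, 6), (3, 7), (4, 8)]
```

The formula gives 2 for naphthalene and 5 for cubane, even though a cube has six faces; that mismatch is a small example of why "which ones are they?" has no single obvious answer.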
