1. Chemical Structure Representation and Search Systems John Barnard
Barnard Chemical Information Ltd
Chemical Informatics Software & Consultancy Services
Sheffield, UK
3. Graph Terminology degree of a node
number of edges meeting at it
leaf node
a node of degree 1
path
connected sequence of edges between two nodes
4. Graph Terminology cycle
path which returns to its starting node
tree
graph with no cycles
subgraph
graph containing a subset of the nodes and edges of another graph
5. Graph Terminology spanning tree
a tree subgraph that contains all the nodes (but not necessarily all the edges) of a graph
6. Graph Terminology connected graph
graph in which there is a path between every pair of nodes
fully-connected graph
graph in which there is an edge between every pair of nodes (all nodes have degree n-1)
7. Graph Terminology disconnected graph
graph in which some pairs of nodes have no path between them
component
subgraph in which all pairs of nodes are linked by a path, but no node has a path to a node in another component
8. Graph Terminology forest
graph containing two or more components that are trees
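The terms above can be tried out on a small graph. The following is a minimal Python sketch, assuming the graph is held as a dictionary mapping each node to the set of its neighbours; the example graph and the function names are illustrative only and do not come from the lecture.

```python
from collections import deque

# Hypothetical example: a 4-membered cycle (1-2-3-4) with node 5 attached
# to node 1, plus a separate two-node component (6-7).
graph = {
    1: {2, 4, 5},
    2: {1, 3},
    3: {2, 4},
    4: {1, 3},
    5: {1},
    6: {7},
    7: {6},
}

def degree(g, node):
    """Number of edges meeting at the node."""
    return len(g[node])

def leaf_nodes(g):
    """Nodes of degree 1."""
    return sorted(n for n in g if degree(g, n) == 1)

def components(g):
    """Group nodes so that every pair within a group is linked by a path."""
    seen, result = set(), []
    for start in sorted(g):
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), []
        while queue:
            node = queue.popleft()
            comp.append(node)
            for nb in sorted(g[node]):
                if nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        result.append(sorted(comp))
    return result

def is_acyclic(g):
    """True for a tree or a forest: edges == nodes - components."""
    n_edges = sum(len(nbs) for nbs in g.values()) // 2
    return n_edges == len(g) - len(components(g))

print(degree(graph, 1))      # 3
print(leaf_nodes(graph))     # [5, 6, 7]
print(components(graph))     # [[1, 2, 3, 4, 5], [6, 7]]
print(is_acyclic(graph))     # False, because of the 4-membered cycle
```

The graph is disconnected (two components), and it is not a forest because one of its components contains a cycle.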
9. Canonicalisation a given chemical structure (or graph) can have many valid and unambiguous representations
different order of rows in connection table
different order of atoms in SMILES
for comparison purposes it would be useful to have a single unique or “canonical” representation
process of converting input representation to canonical form is called “canonicalisation” or “canonisation”
process of applying “rules” (i.e. an algorithm)
10. Canonicalisation an obvious approach:
generate all possible valid SMILES
choose the one that comes first alphabetically
this would be effective but very slow, and there is a danger of missing one of the valid SMILES
principle was used for canonicalising Wiswesser Line Notation
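The "generate everything, keep the first" principle can be illustrated on a connection table instead of SMILES, which avoids needing a SMILES writer. This is only a sketch under that assumption: the function name `canonical_form` and the tuple representation are hypothetical, hydrogens and bond orders are ignored, and the cost grows factorially with the number of atoms, which shows why the approach is very slow.

```python
from itertools import permutations

def canonical_form(atoms, bonds):
    """atoms: list of element symbols; bonds: list of (i, j) index pairs.

    Tries every possible renumbering of the atoms and returns the
    representation that sorts first.
    """
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):
        # perm[old_index] gives the new index of that atom
        new_atoms = [None] * n
        for old, new in enumerate(perm):
            new_atoms[new] = atoms[old]
        new_bonds = sorted(tuple(sorted((perm[i], perm[j]))) for i, j in bonds)
        candidate = (tuple(new_atoms), tuple(new_bonds))
        if best is None or candidate < best:
            best = candidate
    return best

# The heavy atoms of ethanol, entered in two different orders.
a = canonical_form(["C", "C", "O"], [(0, 1), (1, 2)])
b = canonical_form(["O", "C", "C"], [(0, 1), (1, 2)])
print(a == b)   # True: both inputs give the same canonical form

# Dimethyl ether has the same atoms but a different graph.
c = canonical_form(["C", "O", "C"], [(0, 1), (1, 2)])
print(a == c)   # False
```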
11. Canonicalisation most methods in use today involve renumbering the atoms in some unique and reproducible way
can be used to number rows in connection table
can determine order of atoms in SMILES
normally involve a node labelling technique called “relaxation”
example is Morgan’s algorithm (1965)
12. Morgan’s algorithm Label each node with its degree
Count number of different values
13. Morgan’s algorithm Recalculate labels by summing label values at neighbour nodes
Count number of different values
14. Morgan’s algorithm Recalculate labels by summing label values at neighbour nodes
Count number of different values
Repeat from step 3
18. Morgan’s algorithm Recalculate labels by summing label values at neighbour nodes
Count number of different values
Repeat from step 3 until there is no increase in the number of different values
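The labelling loop described on these slides might be sketched in Python as follows. Two assumptions are made here that the slides do not spell out: the graph is a dictionary of neighbour sets, and when the count of different values stops increasing, the labels from the last iteration that did increase it are the ones kept. The example graph (a five-node chain with a one-node branch on node 2) is hypothetical.

```python
def morgan_labels(g):
    """Relaxation step of Morgan's algorithm on a neighbour-set graph."""
    labels = {n: len(g[n]) for n in g}        # label each node with its degree
    n_distinct = len(set(labels.values()))    # count number of different values
    while True:
        # recalculate labels by summing label values at neighbour nodes
        new_labels = {n: sum(labels[nb] for nb in g[n]) for n in g}
        new_distinct = len(set(new_labels.values()))
        if new_distinct <= n_distinct:
            return labels                     # no increase: keep previous labels
        labels, n_distinct = new_labels, new_distinct

# Chain 1-2-3-4-5 with an extra node 6 attached to node 2.
g = {1: {2}, 2: {1, 3, 6}, 3: {2, 4}, 4: {3, 5}, 5: {4}, 6: {2}}
print(morgan_labels(g))
# {1: 3, 2: 4, 3: 5, 4: 3, 5: 2, 6: 3}
```

In this example nodes 1 and 6, which are symmetrically equivalent, end with the same label; node 4 also shares that label, so some ties remain to be resolved in the numbering stage.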
19. Morgan’s algorithm most nodes now have different labels
choose node with highest label as node 1
number its neighbours in order of label values
21. Morgan’s algorithm move to node 2
number its remaining neighbours in order of label values
because label values are tied, choose one with higher bond order (green) first
move to node 3
22. Morgan’s algorithm continue till all nodes are numbered
we now have a numbering for the rows of the connection table
“breadth-first” trace
nodes are dealt with in a “queue” (first in, first out)
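The breadth-first numbering stage could look roughly like this in Python. It is a sketch with stated simplifications: the graph is assumed to be connected, neighbours are taken in descending order of label, and ties are broken by the original node identifier purely to keep the example short. A real implementation would break ties on atom and bond types, as the slides describe, so that the result does not depend on the input numbering. The graph and label values are hypothetical example data.

```python
from collections import deque

def bfs_numbering(g, labels):
    """Number nodes breadth-first, starting from the highest-labelled node."""
    start = max(g, key=lambda n: labels[n])
    number = {start: 1}
    queue = deque([start])                    # first in, first out
    while queue:
        node = queue.popleft()
        unnumbered = [nb for nb in g[node] if nb not in number]
        # highest label first; node id is only an arbitrary tie-break here
        for nb in sorted(unnumbered, key=lambda n: (-labels[n], n)):
            number[nb] = len(number) + 1
            queue.append(nb)
    return number

# Chain 1-2-3-4-5 with node 6 attached to node 2, and example labels.
g = {1: {2}, 2: {1, 3, 6}, 3: {2, 4}, 4: {3, 5}, 5: {4}, 6: {2}}
labels = {1: 3, 2: 4, 3: 5, 4: 3, 5: 2, 6: 3}
print(bfs_numbering(g, labels))
# {3: 1, 2: 2, 4: 3, 1: 4, 6: 5, 5: 6}
```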
24. Morgan’s algorithm “depth-first” trace is also possible
nodes are dealt with in a “stack” (last in, first out)
more suitable for assigning atom numbers in SMILES where we want consecutive numbers to form a path
OC(=O)C(N)CC1C=CC(O)=CC=1
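For comparison, a depth-first trace replaces the queue with a stack. The sketch below makes the same simplifying assumptions as any toy version would: a connected graph stored as neighbour sets, hypothetical label values, and ties broken by node identifier rather than by atom and bond types.

```python
def dfs_numbering(g, labels):
    """Number nodes depth-first, starting from the highest-labelled node."""
    start = max(g, key=lambda n: labels[n])
    number = {}
    stack = [start]                           # last in, first out
    while stack:
        node = stack.pop()
        if node in number:
            continue
        number[node] = len(number) + 1
        unnumbered = [nb for nb in g[node] if nb not in number]
        # push the lowest-priority neighbour first, so that the
        # highest-labelled neighbour is popped (and numbered) next
        for nb in sorted(unnumbered, key=lambda n: (labels[n], -n)):
            stack.append(nb)
    return number

# Chain 1-2-3-4-5 with node 6 attached to node 2, and example labels.
g = {1: {2}, 2: {1, 3, 6}, 3: {2, 4}, 4: {3, 5}, 5: {4}, 6: {2}}
labels = {1: 3, 2: 4, 3: 5, 4: 3, 5: 2, 6: 3}
print(dfs_numbering(g, labels))
# {3: 1, 2: 2, 1: 3, 6: 4, 4: 5, 5: 6}
```

Here numbers 1, 2, 3 follow the path 3-2-1 before the trace backs up to the branch, whereas a breadth-first trace numbers both neighbours of the start node before going any deeper.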
25. Symmetry perception if ties between label values cannot be resolved on basis of atom/bond types, the atoms are symmetrically equivalent, and it doesn’t matter which is chosen next
Morgan’s algorithm is thus also useful for identifying symmetry in molecules
26. Morgan’s algorithm Provides canonical numbering for the nodes in a graph that doesn’t depend on any original numbering
Works by taking more of the graph into account at each iteration
essence of “relaxation” technique is iteratively updating a value by looking at its immediate neighbours
It is not infallible
some graphs are known where the algorithm cannot distinguish nodes that are not symmetrically equivalent
There are many variations on it
and several theoretical papers analysing it mathematically
O. Ivanciuc, “Canonical numbering and constitutional symmetry”, in J. Gasteiger (Ed.) Handbook of Chemoinformatics, Vol 1, pp. 139-160. Wiley, 2003
27. Canonicalisation Algorithms are applied to graphs not chemical structures
Issues such as aromaticity, tautomerism and stereochemistry need to be addressed before canonical numbering of the graph
Daylight’s canonicalisation algorithm for SMILES perceives aromatic rings (using its own definition of aromaticity) as first step
28. Ring perception How many rings are there in these structures and which ones are they?
rings are important features of chemical structures
nomenclature generation
aromaticity perception
synthetic significance
fragment descriptor generation
29. Rings and ring systems A ring system is a subgraph in which every edge is part of a cycle
30. Ring perception Euler Relationship
nodes + rings = edges + components
where rings is the number of edges that must be removed from the graph to turn it into a tree
rings is also called the Frèrejacque number or nullity
this is the minimum possible number of rings; it may be useful to identify others
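Rearranged, the relationship gives rings = edges - nodes + components, which is easy to check numerically. The following Python sketch assumes a neighbour-set dictionary for the graph; the helper names and the naphthalene-like skeleton (a 10-membered perimeter with one fusion bond) are illustrative assumptions.

```python
def build_graph(edges):
    """Turn a list of (a, b) edges into a dict of neighbour sets."""
    g = {}
    for a, b in edges:
        g.setdefault(a, set()).add(b)
        g.setdefault(b, set()).add(a)
    return g

def ring_count(g):
    """rings = edges - nodes + components (from the Euler relationship)."""
    nodes = len(g)
    edges = sum(len(nbs) for nbs in g.values()) // 2
    seen, comps = set(), 0
    for start in g:
        if start in seen:
            continue
        comps += 1                  # found a node in a new component
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for nb in g[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
    return edges - nodes + comps

# Naphthalene-like skeleton: 10 nodes, 11 edges, 1 component.
perimeter = [(i, (i + 1) % 10) for i in range(10)]
naphthalene = build_graph(perimeter + [(0, 5)])
print(ring_count(naphthalene))                        # 2

# An unbranched chain of four nodes has no rings.
print(ring_count(build_graph([(0, 1), (1, 2), (2, 3)])))   # 0
```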
31. Which rings to perceive? Usually the smallest set of smallest rings
two 6-membered rather than one 6- and one 10-membered
two 5-membered rather than one 5- and one 6-membered
But there may be more than one SSSR
C-S-C-C-C-C
C-C-C-C-O-C
C-S-C-C-O-C
three different 6-membered rings
32. Which rings to perceive? Sometimes a large envelope ring may be aromatic, when smaller rings are not
Ring perception is a complex area where there are no right answers
there is a lot of literature on the subject
33. Ring perception by spanning tree start at an arbitrary node
“grow a spanning tree”
add neighbours of current node to a queue
provided they are not already in it
move to the next node in the queue
repeat until queue is empty
those edges from original graph not in the spanning tree are ring closures
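The spanning-tree procedure above can be sketched in Python as follows, assuming a connected graph held as neighbour sets. The function and helper names are hypothetical, and the naphthalene-like skeleton is just an example; which edges come out as ring closures depends on the arbitrary start node and on the order in which neighbours are visited.

```python
from collections import deque

def build_graph(edges):
    """Turn a list of (a, b) edges into a dict of neighbour sets."""
    g = {}
    for a, b in edges:
        g.setdefault(a, set()).add(b)
        g.setdefault(b, set()).add(a)
    return g

def ring_closures(g, start):
    """Grow a breadth-first spanning tree; return the edges left out of it."""
    in_tree = {start}
    tree_edges = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nb in sorted(g[node]):
            if nb not in in_tree:             # not already reached
                in_tree.add(nb)
                tree_edges.add(frozenset((node, nb)))
                queue.append(nb)
    all_edges = {frozenset((a, b)) for a in in_tree for b in g[a]}
    return sorted(tuple(sorted(e)) for e in all_edges - tree_edges)

# 10-membered perimeter plus a fusion bond between nodes 0 and 5.
perimeter = [(i, (i + 1) % 10) for i in range(10)]
naphthalene = build_graph(perimeter + [(0, 5)])
print(ring_closures(naphthalene, 0))   # [(3, 4), (7, 8)]
```

The number of ring-closure edges found (two here) agrees with the ring count given by the Euler relationship, although the closures alone do not say which rings to report.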
34. Substructure Fragments Subgraphs can be identified in a structure graph corresponding to functional groups, rings etc.
–OH
–NH2
–COOH
phenyl
this can be done by tracing appropriate paths in the graph
subgraphs may overlap
35. Substructure Fragments More systematic subgraphs can also be identified (easier to do algorithmically)
paths of connected atoms
every atom and its immediate neighbours
rings
Subgraphs can overlap
(it’s difficult to show pictures with atoms in several colours at once!)
36. Substructure fragments fragments provide “index terms” for a chemical structure
analogous to keywords in a text document
they can be used in searching for structures
retrieved structures must contain the same fragments as the query
“ambiguous” representations
many different structures can have the same fragments, connected together in different ways
fragments to be used may be a closed list
controlled “vocabulary” (dictionary) of structural features
or an open-ended list (like free text searching)
e.g. all unbranched paths of up to 6 atoms
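An open-ended fragment list of the "all unbranched paths" kind might be generated along these lines. This Python sketch is an assumption-laden illustration: atoms are identified only by element symbol, bond types are ignored, and each path is written in whichever direction sorts first so that the same path is not recorded twice. The function name and the example molecule (the heavy atoms of 2-aminoethanol) are not from the lecture.

```python
def path_fragments(atoms, g, max_atoms=6):
    """atoms: dict index -> element symbol; g: index -> set of neighbours.

    Returns the set of unbranched atom paths of up to max_atoms atoms.
    """
    fragments = set()

    def extend(path):
        symbols = [atoms[i] for i in path]
        forward = "-".join(symbols)
        backward = "-".join(reversed(symbols))
        fragments.add(min(forward, backward))   # direction-independent form
        if len(path) == max_atoms:
            return
        for nb in g[path[-1]]:
            if nb not in path:                  # never revisit an atom
                extend(path + [nb])

    for start in g:
        extend([start])
    return fragments

# Heavy atoms of 2-aminoethanol: N-C-C-O
atoms = {0: "N", 1: "C", 2: "C", 3: "O"}
g = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(sorted(path_fragments(atoms, g)))
# ['C', 'C-C', 'C-C-N', 'C-C-O', 'C-N', 'C-O', 'N', 'N-C-C-O', 'O']
```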
37. Fragment codes many early chemical information systems were based on identifying fragments of this sort
originally the fragments were identified manually
and represented on punched cards
special fragment codes (dictionaries of fragments) were devised for different systems
some of these are still in use, though with automated encoding of structures
particularly important are the systems for “Markush” structures in patents (e.g. Derwent WPI code)
38. Fingerprints the fragments present in a structure can be represented as a sequence of 0s and 1s
00010100010101000101010011110100
0 means fragment is not present in structure
1 means fragment is present in structure (perhaps multiple times)
each 0 or 1 can be represented as a single bit in the computer (a “bitstring”)
for chemical structures often called structure “fingerprints”
39. Fingerprints fingerprints are typically 150-2500 bits long
where a fixed dictionary of fragments is used there can be a 1:1 relationship between fragment and bit position in fingerprint
sometimes several related fragments will “set” the same bit
disadvantage is that if structure contains no fragments from the dictionary, no bits are set
can be avoided if “generalised” fragments are used (involving e.g. “any atom”, “any ring bond” types)
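A dictionary-based fingerprint with a 1:1 fragment-to-bit mapping, and its use as a search screen, can be sketched as below. The four-entry dictionary is a hypothetical toy (real dictionaries hold hundreds of fragments), a Python integer stands in for the bitstring, and the fragment lists for each molecule are assumed to have been identified already.

```python
# Hypothetical closed dictionary: fragment name -> bit position.
DICTIONARY = {"OH": 0, "NH2": 1, "COOH": 2, "phenyl": 3}

def fingerprint(fragments):
    """Set one bit per dictionary fragment present (however many times)."""
    bits = 0
    for frag in fragments:
        if frag in DICTIONARY:
            bits |= 1 << DICTIONARY[frag]
    return bits

def passes_screen(query_fp, structure_fp):
    """Every bit set in the query must also be set in the candidate."""
    return (query_fp & structure_fp) == query_fp

tyrosine = fingerprint(["OH", "NH2", "COOH", "phenyl"])
phenol = fingerprint(["OH", "phenyl"])
query = fingerprint(["COOH"])

print(format(tyrosine, "04b"))          # 1111
print(format(phenol, "04b"))            # 1001
print(passes_screen(query, tyrosine))   # True
print(passes_screen(query, phenol))     # False
print(fingerprint(["nitro"]))           # 0: no dictionary fragment, no bits set
```

The last line shows the disadvantage noted above: a structure containing none of the dictionary fragments sets no bits at all.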
40. Fingerprints if fragment set is open-ended, the fragment description (e.g. C-C-N-C-C-O) can be “hashed” to a number in fixed range (e.g. 1 to 1024) and this is the bit number to be set
disadvantages:
different and unrelated fragments may “collide” at the same bit position
difficult to work back from bit position to fragment
this usually causes only slight degradation in search performance (false hits), but can be more of a problem in other applications of fingerprints
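Hashing a fragment description to a bit position can be sketched as follows. The choice of `zlib.crc32` is an assumption made only because it gives the same value on every run; it is not the hash function of any particular chemical software package. Bit positions here run from 0 to n_bits - 1 rather than from 1 to 1024.

```python
import zlib

def hashed_fingerprint(fragments, n_bits=1024):
    """Hash each fragment string to a bit position and set that bit."""
    bits = 0
    for frag in fragments:
        position = zlib.crc32(frag.encode("utf-8")) % n_bits
        bits |= 1 << position
    return bits

structure = ["C-C-N-C-C-O", "C-C-N", "C-N-C", "C-C-O", "N-C-C"]
query = ["C-C-N", "C-C-O"]

fp_structure = hashed_fingerprint(structure)
fp_query = hashed_fingerprint(query)

# The query's fragments are a subset of the structure's, so the screen passes.
print((fp_query & fp_structure) == fp_query)     # True

# With only 4 bit positions, 5 different fragments must collide somewhere.
tiny = hashed_fingerprint(structure, n_bits=4)
print(bin(tiny).count("1") <= 4)                 # True
```

Because several fragments can share a bit, a set bit cannot be traced back to a single fragment, which is the second disadvantage listed above.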
41. Fingerprints Hashed fingerprints
typically used in software from Daylight Chemical Information Systems Inc.
Dictionary fingerprints
Chemical Abstracts Service
MDL Information Systems Inc
ISIS or MACCS keys (166 and 960 bits)
Barnard Chemical Information Ltd
customised dictionaries
42. 2D structure depiction if structures are stored without 2D display coordinates, we need to generate them
SMILES
“depiction” algorithms are used for this
identify and lay out ring systems first
complications over orientation of some systems
Chemical Abstracts stores “standard depictions” of all ring systems it has encountered
then add side chains, avoiding collisions
many features can be added to improve appearance
43. 3D structure depiction much more complicated than 2D
need to store standard bond lengths and angles
need to distinguish atoms in different hybridisation states (sp2 vs sp3 carbon)
need to rotate single bonds to avoid “bumps”
sophisticated “conformation generation” programs identify low-energy conformers
very useful for identifying molecules with the correct shape to fit into biological receptor sites
J. Sadowski, “3D structure generation”, in J. Gasteiger (Ed.) Handbook of Chemoinformatics, Vol 1, pp. 231-261. Wiley, 2003
44. Nomenclature generation most systematic nomenclature is based on ring systems
need to identify/prioritise ring systems first
identify standard numbering for system
frequently need to store this
add side chains and substituents with appropriate locants
J. L. Wisniewski, “Chemical nomenclature and structure representation: algorithmic generation and conversion”, in J. Gasteiger (Ed.) Handbook of Chemoinformatics, Vol 1, pp. 139-160. Wiley, 2003
45. Conclusions from Lecture 3 there are several important jargon terms used in graph theory, which crop up in chemical informatics
canonicalisation provides a unique numbering for the atoms in a molecule
Morgan algorithm can be used to achieve it
it’s not always obvious how many rings there are, or which ones they are
fingerprints represent the presence or absence of substructure fragments in a molecule
they are ambiguous representations of structure
46. Topic for Lecture 4: Structure searching two main varieties of search
full structure search
query is a complete molecule
is this molecule in the database?
or tautomers, stereoisomers etc. of it?
substructure search
query is a pattern of atoms and bonds
does this pattern occur as a substructure (subgraph) of any of the molecules in my database?