1. Chemical Structure Representation and Search Systems Lecture 4. Nov 11, 2003 John Barnard
Barnard Chemical Information Ltd
Chemical Informatics Software & Consultancy Services
Sheffield, UK
2. New MDL File Formats Since lecture on Oct 28, MDL have published details of a new file format “XDfile”
XML-based data format for transferring structure/reaction information with associated data
built around existing MDL connection table formats
can incorporate Chime strings (encrypted format used to render structures and reactions on a Web page)
can incorporate SMILES strings
New download site:
http://www.mdl.com/downloads/public/ctfile/ctfile.jsp
3. Lecture 4: Topics to be Covered full structure search
structure registration systems
graph isomorphism
algorithm complexity and NP-complete problems
substructure search
subgraph isomorphism
screening searches and fingerprints
substructure query formulation
SMARTS
commercial systems
4. Full Structure and Substructure Search full structure search
query is a complete molecule
is this molecule in the database?
or tautomers, stereoisomers etc. of it,
substructure search
query is a pattern of atoms and bonds
does this pattern occur as a substructure of any of the molecules in my database?
superstructure search
query is a complete molecule
are any of the molecules in the database substructures of it?
N.B. Some Daylight Chemical Information Systems Inc. documentation uses “substructure” and “superstructure” search in the opposite sense to those given here
5. Full Structure Search Many databases contain millions of structures, so search speed is important
Simplest approach uses a canonical representation for query and database structures (e.g. canonical SMILES)
could sort database SMILES into alphanumerical order
search sorted list for match with query
“Hash table” lookup can improve search speed
calculate hash-code (“idiot number” in predefined range) from SMILES for each database structure
this is address (disk file or memory) at which full representation is stored
only SMILES which have same hash code need to be compared
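The hash-table lookup described above can be sketched in a few lines. This is only an illustration: the SMILES strings are assumed to be canonical already, and the bucket count `N_BUCKETS` and the use of CRC-32 as the hash function are assumptions made for the example, not details of any real registry system.

```python
import zlib

N_BUCKETS = 1024  # hypothetical size of the predefined hash-code range


def hash_code(smiles: str) -> int:
    """Map a SMILES string to a bucket number in the range 0..N_BUCKETS-1."""
    return zlib.crc32(smiles.encode("ascii")) % N_BUCKETS


def build_table(database):
    """Store each database SMILES at the 'address' given by its hash code."""
    table = {}
    for smiles in database:
        table.setdefault(hash_code(smiles), []).append(smiles)
    return table


def is_registered(table, query: str) -> bool:
    """Compare the query only with SMILES that share its hash code."""
    return query in table.get(hash_code(query), [])


database = ["CCO", "c1ccccc1", "CC(=O)O", "CN"]  # assumed canonical
table = build_table(database)
print(is_registered(table, "CC(=O)O"))  # True
print(is_registered(table, "CCN"))      # False
```

Two different SMILES may share a bucket, which is why the final string comparison is still needed.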
6. Structure Registration Systems Many chemical and pharmaceutical companies maintain compound “registry” systems
database of all compounds worked on internally
may include many compounds never published elsewhere (i.e. not in Chemical Abstracts, Beilstein)
links to company reports, biological screening data, stock number in compounds store etc.
links to electronic lab notebooks, LIMS (Lab. Info. Management System), ORACLE database etc.
7. Structure Registration Systems new compounds need to be added regularly
used to be done by chemical information specialists
now frequently done directly by bench chemists
registration system must
check consistency of input data
e.g. compare molecular formula with structure
check that compound is really new
different ways of handling tautomers, salts, stereoisomers etc.
assign registry number
add supplementary data (melting point etc.)
make data immediately available for search
8. Structure Registration Systems “Public” databases use same principles, adding compounds from published literature
Chemical Abstracts Registry file
links to document where data on molecule was published
Beilstein Registry file
lots of data may be stored with compound, from different data sources; existing records may need updating
Updates for searching may be made available at regular intervals (weekly, monthly, annually, etc.)
9. Graph Isomorphism In graph theory terms, when two full structures match, their graphs are said to be isomorphic
each node N1 in G1 must be mapped to a node N2 in G2
neighbours of N1 must map to neighbours of N2
10. Graph isomorphism by brute force for each node in G1
map it against an unmapped node in G2
check that neighbours of each node map appropriately in the two graphs
if each graph has n nodes there are n! ways of doing this
n × (n-1) × (n-2) × (n-3) … × 3 × 2 × 1
this is a big number if n is anything non-trivial
9! = 362 880
10! = 3 628 800
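A minimal sketch of the brute-force approach, assuming unlabelled graphs stored as adjacency dictionaries (atom and bond types are ignored here to keep the example short). It tries each of the n! node mappings in turn and checks that every edge of the first graph lands on an edge of the second.

```python
from itertools import permutations
from math import factorial


def brute_force_isomorphic(adj1, adj2):
    """adj1, adj2: dict mapping each node to the set of its neighbours."""
    nodes1, nodes2 = list(adj1), list(adj2)
    if len(nodes1) != len(nodes2):
        return False
    edges1 = {(a, b) for a in adj1 for b in adj1[a]}
    edges2 = {(a, b) for a in adj2 for b in adj2[a]}
    if len(edges1) != len(edges2):
        return False
    # Try every one of the n! possible assignments of G1 nodes to G2 nodes.
    for perm in permutations(nodes2):
        mapping = dict(zip(nodes1, perm))
        if all((mapping[a], mapping[b]) in edges2 for a, b in edges1):
            return True
    return False


path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
relabelled_path = {0: {2}, 2: {0, 3}, 3: {2, 1}, 1: {3}}
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}

print(brute_force_isomorphic(path, relabelled_path))  # True
print(brute_force_isomorphic(path, star))             # False
print(factorial(9), factorial(10))                    # 362880 3628800
```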
11. Computational complexity a measure of how long a computational algorithm will take to run, depending on the size of input
if you give it twice as much data will it take twice as long to run?
e.g. comparing a word sequentially against each member of a list of words of length n
time taken depends directly on length of list
algorithm is O(n) [“order-n”]
e.g. comparing each word in a list of length n with every other word of the same list
algorithm is O(n^2) [“order-n-squared”]
12. Computational complexity some algorithms may have complexity O(n^3), O(n^4), O(log n), O(n log n) etc.
these are all “polynomial” time algorithms
some algorithms have exponential complexity, e.g. O(2^n)
this is much slower than polynomial
brute-force graph isomorphism is O(n!)
this is even slower than simple exponential
13. Computational complexity for some problems you can find more efficient algorithms (lower order of complexity) to do the same thing
e.g. searching a sorted list
simple “sequential” search is O(n)
“binary chop” search is O(log n)
for some problems there are no known polynomial-time algorithms
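The difference between sequential and "binary chop" search on a sorted list can be made concrete by counting comparisons. The list of zero-padded keys below is a made-up stand-in for a sorted list of canonical SMILES; both functions return whether the target was found and how many list entries were examined.

```python
def sequential_search(items, target):
    """O(n): examine entries one by one from the start of the list."""
    steps = 0
    for item in items:
        steps += 1
        if item == target:
            return True, steps
    return False, steps


def binary_chop(sorted_items, target):
    """O(log n): halve the remaining search range at every step."""
    lo, hi = 0, len(sorted_items)
    steps = 0
    while lo < hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return True, steps
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return False, steps


keys = [f"{i:04d}" for i in range(1024)]  # hypothetical sorted keys
print(sequential_search(keys, "1023"))    # examines all 1024 entries
print(binary_chop(keys, "1023"))          # needs at most 11 steps for 1024 entries
```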
14. NP-complete problems a class of problems for which no polynomial-time algorithms are known
problems in this class are mathematically “equivalent”
if a polynomial-time algorithm could be found for one of them, it could be adapted to solve all of them
well-known example is “travelling salesman problem” (shortest path visiting each of several cities)
it is suspected (but not proven) that no polynomial-time algorithms can exist for this class of problems
15. NP-complete problems graph isomorphism is in NP, but it is not known either to be NP-complete or to have a polynomial-time algorithm
subgraph isomorphism is a generalisation of graph isomorphism
nodes in G1 (query structure) must be mapped to subset of nodes in G2 (database structure)
i.e. G1 is a subgraph of G2
subgraph isomorphism has been proven to be NP-complete
substructure searching is inherently slow
16. Subgraph isomorphism NP-completeness of problem means that worst-case match times are exponential in number of atoms involved
but average-case match times can be better than this
much effort has been expended on this problem over the past 40+ years
closely-related problems remain an active area of research
17. Speeding up subgraph isomorphism use a faster computer
use tricks to avoid exploring potential solutions that are bound to fail
do most of the work in a pre-processing of the database structures, independently of the query
18. Speeding up subgraph isomorphism chemical graphs have several characteristics that allow heuristics (“tricks”) to be used to speed up isomorphism identification
several different node and edge labels
low connectivity of each node
using hydrogen-suppressed graphs reduces size of problem (number of nodes)
these tricks would be of less use for general graphs
additional tricks and algorithms may be used in special cases (e.g. if graphs are trees)
19. Backtracking modification of the brute-force approach
abandons partial solutions part-way through when it can be seen they are bound to fail
worst-case is still exponential in number of nodes, but doesn’t arise very often
first map an arbitrary pair of nodes
then map neighbours of these nodes
if successful, map neighbours of each neighbour, etc.
if not, backtrack one step, and try a different mapping
20. Backtracking algorithm will terminate
when all query nodes are mapped [MATCH]
when all alternative mappings for first query node have been tried, and have failed [NO MATCH]
extra tricks can be used for further improvement
only map nodes with same element type and charge, and compatible bonding patterns
start with unusual atom types, and nodes with lots of neighbours
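The backtracking procedure on the last two slides might look roughly like the sketch below. It is a simplified illustration rather than any published algorithm: molecules are hydrogen-suppressed graphs given as an element dictionary plus an adjacency dictionary, bond types and charges are ignored, and the only ordering heuristic used is "query atoms with most neighbours first".

```python
def find_substructure(q_labels, q_adj, t_labels, t_adj):
    """Return a query-atom -> target-atom mapping, or None if there is no match."""
    # Heuristic: place the most highly connected query atoms first.
    order = sorted(q_labels, key=lambda q: -len(q_adj[q]))
    mapping, used = {}, set()

    def extend(depth):
        if depth == len(order):
            return True  # all query nodes are mapped: MATCH
        q = order[depth]
        for t in t_labels:
            if t in used or t_labels[t] != q_labels[q]:
                continue  # only map atoms of the same element
            if len(t_adj[t]) < len(q_adj[q]):
                continue  # target atom has too few neighbours
            # Already-mapped neighbours of q must map to neighbours of t.
            if any(n in mapping and mapping[n] not in t_adj[t] for n in q_adj[q]):
                continue
            mapping[q] = t
            used.add(t)
            if extend(depth + 1):
                return True
            del mapping[q]  # bound to fail: backtrack one step
            used.discard(t)
        return False

    return dict(mapping) if extend(0) else None


# Query: carboxyl-like pattern C(O)O. Targets: acetic acid and ethanol skeletons.
query = ({0: "C", 1: "O", 2: "O"}, {0: {1, 2}, 1: {0}, 2: {0}})
acetic_acid = ({0: "C", 1: "C", 2: "O", 3: "O"}, {0: {1}, 1: {0, 2, 3}, 2: {1}, 3: {1}})
ethanol = ({0: "C", 1: "C", 2: "O"}, {0: {1}, 1: {0, 2}, 2: {1}})

print(find_substructure(*query, *acetic_acid))  # {0: 1, 1: 2, 2: 3}
print(find_substructure(*query, *ethanol))      # None
```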
21. Partitioning and Relaxation often used as an adjunct to backtracking
start by partitioning the nodes into sets of possible correspondents
e.g. nitrogens can only match nitrogens
iteratively refine the partition on basis of other possible correspondences
e.g. if F6 is only possible correspondent for Q1 then F6 cannot be a correspondent for Q2
if the list of possible correspondents for a query node becomes empty, there is no isomorphism
22. Partitioning and Relaxation can also reduce lists of possible correspondents by looking at neighbours
if F6 is to remain a valid correspondent of Q1, then the neighbours of F6 must be possible correspondents of the neighbours of Q1
as this check is repeated for each node, we are bringing in information from further away, but only ever looking at immediate neighbours
this technique is the same as Morgan’s algorithm for node labelling in canonicalisation
it is called relaxation
backtracking can be used as a fallback when no further reductions can be made in the lists of possible correspondents
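A rough sketch of partitioning and relaxation, under the same simplifying assumptions as before (element labels only, no bond types). It is not Sussenguth's, Ullmann's or von Scholley's algorithm as published; it only shows the two refinement rules described above. A non-empty result is a necessary condition for a match, not a proof of one, so a backtracking search would still be needed as the fallback.

```python
def refine_candidates(q_labels, q_adj, t_labels, t_adj):
    """Return {query atom: set of possible target atoms}, or None if some set empties."""
    # Initial partition: a query atom can only correspond to atoms of the same element.
    cand = {q: {t for t in t_labels if t_labels[t] == q_labels[q]} for q in q_labels}
    if any(not c for c in cand.values()):
        return None
    changed = True
    while changed:
        changed = False
        for q in q_labels:
            for t in list(cand[q]):
                # Relaxation: each neighbour of q needs a candidate among t's neighbours.
                if any(not (cand[qn] & t_adj[t]) for qn in q_adj[q]):
                    cand[q].discard(t)
                    changed = True
            if not cand[q]:
                return None  # empty list: no isomorphism is possible
        for q in q_labels:
            if len(cand[q]) == 1:
                (only,) = cand[q]
                # A sole correspondent of q cannot also correspond to another query atom.
                for other in q_labels:
                    if other != q and only in cand[other]:
                        cand[other].discard(only)
                        changed = True
                        if not cand[other]:
                            return None
    return cand


cn_query = ({0: "C", 1: "N"}, {0: {1}, 1: {0}})
ethylamine = ({0: "C", 1: "C", 2: "N"}, {0: {1}, 1: {0, 2}, 2: {1}})
propane = ({0: "C", 1: "C", 2: "C"}, {0: {1}, 1: {0, 2}, 2: {1}})

print(refine_candidates(*cn_query, *ethylamine))  # {0: {1}, 1: {2}}
print(refine_candidates(*cn_query, *propane))     # None
```

In the ethylamine case the terminal carbon is dropped as a candidate for the query carbon because none of its neighbours can correspond to the query nitrogen.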
23. Subgraph isomorphism algorithms Ray and Kirsch’s algorithm (1957)
basic backtracking
Sussenguth’s partitioning algorithm (1965)
relaxation technique called “connectivity property”, with backtracking as fall-back
Figueras’s set reduction algorithm (1972)
Ullmann’s algorithm (1976)
efficient relaxation and backtracking
von Scholley’s relaxation algorithm (1984)
24. Screening so far we’ve considered matching one query substructure against one database full structure
each structure from the database needs to be compared against the query in turn
many will fail because they don’t contain the query substructure
“screening” allows many of these to be eliminated before we get to this stage
uses structure “fingerprints” discussed in lecture 3
25. Fingerprints the fragments present in a structure can be represented as a sequence of 0s and 1s
00010100010101000101010011110100
0 means fragment is not present in structure
1 means fragment is present in structure (perhaps multiple times)
each 0 or 1 can be represented as a single bit in the computer (a “bitstring”)
for chemical structures often called structure “fingerprints”
26. Screening build a fingerprint for the query substructure
only those database structures that contain all the fragments in the query can possibly match the query
Query: 00000100010101000001010011010100
DB struct 1: 00010100010101000101010011110100 MATCH
DB struct 2: 00000000100101001001000011100000 NO MATCH
comparing fingerprint bitstrings is very fast (logical AND operation)
only those structures that pass the screening stage need to be considered as candidates for atom-by-atom isomorphism search
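The screening test on this slide reduces to one AND and one comparison per database structure. The sketch below uses the three bitstrings shown above, held as Python integers; the function name `passes_screen` is made up for the example.

```python
def passes_screen(query_fp: int, structure_fp: int) -> bool:
    """True if every bit set in the query is also set in the structure."""
    return (query_fp & structure_fp) == query_fp


query_fp = int("00000100010101000001010011010100", 2)
db_struct_1 = int("00010100010101000101010011110100", 2)
db_struct_2 = int("00000000100101001001000011100000", 2)

print(passes_screen(query_fp, db_struct_1))  # True  (candidate for atom-by-atom search)
print(passes_screen(query_fp, db_struct_2))  # False (eliminated by the screen)
```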
27. Screening can be made faster by “inverting” the bit strings (actually, turning them on their side)
instead of a bitstring of fragments for each structure ...
store a bitstring of structures for each fragment
each bit represents a database structure
1=structure contains fragment; 0=structure does not
search by ANDing together the bitstrings for the fragments present in the query
this will list those structures that contain all the query’s fragments
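One way to picture the "inverted" arrangement is the sketch below, which reuses the 32-bit fingerprints from the previous slide. The helper names `invert` and `screen` are illustrative only; each column is an integer in which bit i stands for database structure i.

```python
N_BITS = 32


def invert(fingerprints, n_bits=N_BITS):
    """Turn one bitstring per structure into one bitstring of structures per fragment."""
    columns = [0] * n_bits
    for s_index, fp in enumerate(fingerprints):
        for bit in range(n_bits):
            if (fp >> bit) & 1:
                columns[bit] |= 1 << s_index
    return columns


def screen(columns, query_fp, n_structures):
    """AND together the columns for the fragments present in the query."""
    hits = (1 << n_structures) - 1  # start with every structure as a candidate
    for bit, column in enumerate(columns):
        if (query_fp >> bit) & 1:
            hits &= column
    return [i for i in range(n_structures) if (hits >> i) & 1]


structures = [
    int("00010100010101000101010011110100", 2),  # DB struct 1
    int("00000000100101001001000011100000", 2),  # DB struct 2
]
query = int("00000100010101000001010011010100", 2)

columns = invert(structures)
print(screen(columns, query, len(structures)))  # [0], i.e. only DB struct 1 survives
```

Only the columns for fragments present in the query are read, so the work depends on the number of query fragments rather than on the fingerprint length.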
28. Screening Effectiveness Ideally we want to eliminate as many structures as possible at the screening stage
99% screenout or more would be good
Fingerprint construction can help in this
frequency distributions of fragments in a large database are very “skewed”
a few fragments occur in almost all compounds
will therefore give little or no screenout
many fragments occur in very few compounds
need very long fingerprint (lots of fragments) to ensure that we will have some in the query
29. Fingerprint construction best fragments are medium frequency ones
fragments also need to be independent of each other
dictionary used for CA Registry search was constructed on basis of analysis of fragment frequency distributions
30. Daylight fingerprints each fragment is used to generate a hash code, which specifies the bit position to be set
actually most fragments set several bits
small fragments (more frequent) set fewer bits
larger fragments (less frequent) set more bits
several different fragments may set the same bit
in principle this can reduce screening effectiveness
it may allow a fingerprint to match when structure does not contain the same fragment as the query
in practice this is not a serious problem
it will never cause a structure to be rejected when it is actually a match
31. Daylight fingerprints fingerprint can be “folded” to reduce length of fingerprint
again this increases the chances of false matches, but with “sparse” (low-density) fingerprints this is more than offset by the increase in search speed
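The idea of hashing fragments to bit positions and then folding can be illustrated as follows. This is not Daylight's actual procedure: the fragment strings are invented, CRC-32 stands in for the hash function, and every fragment sets the same number of bits (a simplification of the size-dependent rule described above). The point is that the screen can produce false matches but never rejects a true one.

```python
import zlib


def hashed_fingerprint(fragments, n_bits=64, bits_per_fragment=2):
    """Set a few hash-derived bit positions for each fragment (illustrative only)."""
    fp = 0
    for frag in fragments:
        for k in range(bits_per_fragment):
            position = zlib.crc32(f"{frag}#{k}".encode("ascii")) % n_bits
            fp |= 1 << position
    return fp


def fold(fp, n_bits):
    """Halve the fingerprint length by OR-ing the top half onto the bottom half."""
    half = n_bits // 2
    return (fp >> half) | (fp & ((1 << half) - 1))


# Hypothetical fragment sets: the query's fragments are a subset of the structure's.
structure_frags = ["C-C", "C-O", "C=O", "C-C-O", "C-C=O"]
query_frags = ["C-O", "C=O"]

s_fp = hashed_fingerprint(structure_frags)
q_fp = hashed_fingerprint(query_frags)

print((q_fp & s_fp) == q_fp)  # True: a true match is never screened out
print((fold(q_fp, 64) & fold(s_fp, 64)) == fold(q_fp, 64))  # True: still holds after folding
```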
32. Hardware solutions “use a faster computer”
cheaper memory means that a lot of operations can be performed in memory
in Daylight, fingerprints are stored and matched in memory
parallel processing
database parallel (split database over several machines)
algorithm-parallel (different operations on same structure on different processors)
33. Parallel processing Chemical Abstracts Registry File
different machines search different parts of the database
results are collated for presentation to user
other research work has looked at various algorithms and various processors
speedup declines as more processors added
overheads in controlling them become dominant
34. Parallel processing subgraph isomorphism algorithms are not very suitable for algorithm parallelisation
individual operations are very simple
von Scholley algorithm designed for parallel implementation
each processor handles one atom in relaxation step
problem is distributing data to processors
most processors spend time waiting for next data
35. Preprocessing the database Do the time-consuming work in advance
Full structure search provides an example of this
Canonicalisation is a slow process (at least as hard as graph isomorphism; no polynomial-time algorithm is known for general graphs)
but it can be done in a pre-processing of the file, independently of the query
then store the canonical representations
can do rapid matches against a canonicalised query structure
this is faster than using a graph isomorphism algorithm on non-canonical representations
36. Preprocessing the database Similar principles are used in some substructure search systems
A tree structure is built, classifying all the atoms found in all the structures in the database
first level based on atom type
second level based on number of connections
third level based on type of first neighbour
fourth level based on type of second neighbour
etc.
lower levels based on classifications applied to neighbours (relaxation)
bottom of tree lists structures that contain this class of atom
37. Tree-structured fragment searches
38. Tree-structured fragment search search can be done by tracing tree, looking for atom classes found in query
combine lists of structures found at the bottom
a backtracking atom-by-atom search may be needed to check hits found
best-known example is Beilstein’s Crossfire
main problem is updating the trees when new structures are added to the database
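A toy version of such a tree, restricted to the first two levels (atom type, then number of connections), is sketched below. It is an assumption-laden illustration and not a description of how Crossfire is implemented; the structure names and helper functions are invented for the example.

```python
def build_atom_tree(structures):
    """structures: {name: (labels, adjacency)}. Returns tree[element][degree] = set of names."""
    tree = {}
    for name, (labels, adj) in structures.items():
        for atom, element in labels.items():
            tree.setdefault(element, {}).setdefault(len(adj[atom]), set()).add(name)
    return tree


def candidate_structures(tree, q_labels, q_adj):
    """Trace the tree for each query atom class and combine the lists at the bottom."""
    result = None
    for atom, element in q_labels.items():
        names = set()
        # A database atom may have more connections than the query atom, never fewer.
        for degree, leaf in tree.get(element, {}).items():
            if degree >= len(q_adj[atom]):
                names |= leaf
        result = names if result is None else result & names
    return result if result is not None else set()


structures = {
    "ethanol": ({0: "C", 1: "C", 2: "O"}, {0: {1}, 1: {0, 2}, 2: {1}}),
    "acetic acid": ({0: "C", 1: "C", 2: "O", 3: "O"}, {0: {1}, 1: {0, 2, 3}, 2: {1}, 3: {1}}),
    "ethylamine": ({0: "C", 1: "C", 2: "N"}, {0: {1}, 1: {0, 2}, 2: {1}}),
}
tree = build_atom_tree(structures)

# Query C(O)O: a carbon with two connections and two oxygens with one each.
query = ({0: "C", 1: "O", 2: "O"}, {0: {1, 2}, 1: {0}, 2: {0}})
print(sorted(candidate_structures(tree, *query)))  # ['acetic acid', 'ethanol']
```

Ethanol survives this coarse two-level tree even though it has only one oxygen, which is why an atom-by-atom check of the hits may still be needed.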
39. Substructure queries queries for substructure search systems may be more complicated than simple subgraphs
different systems provide different capabilities
variable atom and bond types
specification of allowed substitution
40. Substructure queries some systems provide very complex query options
41. Substructure queries: SMARTS Daylight uses an extension of SMILES to describe structure queries (SMARTS)
can attach various properties to each atom
[CX3] carbon with 3 connections
[Nr5] nitrogen in a ring of size 5
properties can be combined with logical operators
! (NOT)
& (AND – high precedence)
, (OR)
; (AND – low precedence)
42. SMARTS complex patterns can be specified this way:
[F,Cl,Br,I] any of the halogen atoms
[!#6;!R0] heteroatom in a ring ([!C;!R0] would also match aromatic carbon, since C means aliphatic carbon only)
$(smarts_string) can also be used as an atom property
this is called recursive SMARTS
e.g. $(NC=*)
nitrogen single-bond carbon double-bond any-atom
(i.e. an amide)
43. SMARTS recursive SMARTS can be used to describe very complex patterns
e.g. primary or secondary amine, but not amide
44. Commercial systems Several software companies provide structure registration and search systems to the chemical/pharmaceutical industry
MDL Information Systems Inc.
MACCS, ISIS
Daylight Chemical Information Systems Inc.
THOR, MERLIN, DayCart (Oracle cartridge)
IDBS ActivityBase
Accelrys Accord Enterprise Informatics
replaces Oxford Molecular RS3
45. Conclusions from Lecture 4 substructure matching (subgraph isomorphism) is an NP-complete problem
worst-case time requirements rise exponentially with number of atoms involved
heuristics (tricks) can be used to improve average search speed
several algorithms have been published
most use partitioning and relaxation techniques
fingerprint screening can rapidly eliminate the bulk of non-matching structures
different systems allow different degrees of sophistication in formulating search queries
46. Further Reading J. M. Barnard, “Substructure searching methods: old and new”, J. Chem. Inf. Comput. Sci., 1993, 33, 532-538
J. Xu. “Two dimensional structure and substructure searching.” In J. Gasteiger (ed.) Handbook of Chemoinformatics: From Data to Knowledge, Vol. 2, pp. 868-884, Wiley-VCH, 2003
47. Lecture 5: More structure searching Searching Markush structures in patents
nature and origin of Markush structures
fragment codes
topological systems (MARPAT, Markush DARC)
Reaction searching
atom-atom mapping
Maximal Common Substructure search
what is the largest substructure common to two molecules?