Chemical Structure Representation and Search Systems Lecture 4. Nov 11, 2003

1. Chemical Structure Representation and Search SystemsLecture 4. Nov 11, 2003 John Barnard Barnard Chemical Information Ltd Chemical Informatics Software & Consultancy Services Sheffield, UK

2. New MDL File Formats Since lecture on Oct 28, MDL have published details of a new file format �XDfile� XML-based data format for transferring structure/reaction information with associated data built around existing MDL connection table formats can incorporate Chime strings (encrypted format used to render structures and reactions on a Web page) can incorporate SMILES strings New download site: http://www.mdl.com/downloads/public/ctfile/ctfile.jsp

3. Lecture 4: Topics to be Covered full structure search structure registration systems graph isomorphism algorithm complexity and NP-complete problems substructure search subgraph isomorphism screening searches and fingerprints substructure query formulation SMARTS commercial systems

4. Full Structure and Substructure Search full structure search query is is complete molecule is this molecule in the database? or tautomers, stereoisomers etc. of it, substructure search query is a pattern of atoms and bonds does this pattern occur as a substructure of any of the molecules in my database? superstructure search query is a complete molecule are any of the molecules in the database substructures of it? N.B. Some Daylight Chemical Information Systems Inc. documentation uses �substructure� and �superstructure� search in the opposite sense to those given here

5. Full Structure Search Many databases contain millions of structures, so search speed is important Simplest approaches uses canonical representation for query and database structures (e.g. canonical SMILES) could sort database SMILES into alphanumerical order search sorted list for match with query �Hash table� lookup can improve search speed calculate hash-code (�idiot number� in predefined range) from SMILES for each database structure this is address (disk file or memory) at which full representation is stored only SMILES which have same hash code need to be compared

6. Structure Registration Systems Many chemical and pharmaceutical companies maintain compound �registry� systems database of all compounds worked on internally may included many compounds never published elsewhere (i.e. not in Chemical Abstracts, Beilstein) links to company reports, biological screening data, stock number in compounds store etc. links to electronic lab notebooks, LIMS (Lab. Info. Management System), ORACLE database etc.

7. Structure Registration Systems new compounds need to be added regularly used to be done by chemical information specialists now frequently done directly by bench chemists registration system must check consistency of input data e.g. compare molecular formula with structure check that compound is really new different ways of handling tautomers, salts, stereoisomers etc. assign registry number add supplementary data (melting point etc.) make data immediately available for search

8. Structure Registration Systems �Public� databases use same principles, adding compounds from published literature Chemical Abstracts Registry file links to document where data on molecule was published Beilstein Registry file lots of data may be stored with compound, from different data sources; existing records may need updating Updates for searching may be made available at regular intervals (weekly, monthly, annually, etc.)

9. Graph Isomorphism In graph theory terms, when two full structures match, their graphs are said to be isomorphic each node N1 in G1 must be mapped to a node N2 in G2 neighbours of N1 must map to neighbours of N2

10. Graph isomorphism by brute force for each node in G1 map it against an unmapped node in G2 check that neighbours of each node map appropriately in the two graphs if each graph has n nodes there are n! ways of doing this n � (n-1) � (n-2) � (n-3) � � 3 � 2 � 1 this is a big number if n is anything non-trivial 9! = 362 880 10! = 3 628 800

11. Computational complexity a measure of how long a computational algorithm will take to run, depending on the size of input if you give it twice as much data will it take twice as long to run? e.g. comparing a word sequentially against each member of a list of words of length n take taken depends directly on length of list algorithm is O(n) [�order-n�] e.g. comparing each word in a list of length n with every other word of the same list algorithm is O(n2) [�order-n-squared�]

12. Computational complexity some algorithms may have complexity O(n3), O(n4), O(log n), O(n log n) etc. these are all �polynomial� time algorithms some algorithms have exponential complexity, e.g. O(2n) this is much slower than polynomial brute-force graph isomorphism is O(n!) this is even slower than simple exponential

13. Computational complexity for some problems you can find more efficient algorithms (lower order of complexity) to do the same thing e.g. searching a sorted list simple �sequential� search is O(n) �binary chop� search is O(log n) for some problems there are no known polynomial-time algorithms

14. NP-complete problems a class of problems for which no polynomial-time algorithms are known problems in this class are mathematically �equivalent� if a polynomial time algorithm could be found for one of them, it would fit all of them well-known example is �travelling salesman problem� (shortest path visiting each of several cities) it is suspected (but not proven) that no polynomial-time algorithms can exist for this class of problems

15. NP-complete problems graph isomorphism is probably NP-complete (not rigorously proven) subgraph isomorphism is a generalisation of graph isomorphism nodes in G1 (query structure) must be mapped to subset of nodes in G2 (database structure) i.e. G1 is a subgraph G2 subgraph isomorphism has been proven to be NP-complete substructure searching is inherently slow

16. Subgraph isomorphism NP-completeness of problem means that worst-case match times are exponential in number of atoms involved but average-case match times can be better than this much effort has been expended on this problem over the past 40+ years closely-related problems remain an active area of research

17. Speeding up subgraph isomorphism use a faster computer use tricks to avoid exploring potential solutions that are bound to fail do most of the work in a pre-processing of the database structures, independently of the query

18. Speeding up subgraph isomorphism chemical graphs have several characteristics that allow heuristics (�tricks�) to be used to speed up isomorphism identification several different node and edge labels low connectivity of each node using hydrogen-suppressed graphs reduces size of problem (number of nodes) these tricks would be of less use for general graphs additional tricks and algorithms may be used in special cases (e.g. if graphs are trees)

19. Backtracking modification of the brute-force approach abandons partial solutions part-way through when it can be seen they are bound to fail worst-case is still exponential in number of nodes, but doesn�t arise very often first map an arbitrary pair of nodes then map neighbours of these nodes if successful, map neighbours of each neighbour, etc. if not, backtrack one step, and try a different mapping

20. Backtracking algorithm will terminate when all query nodes are mapped [MATCH] when all alternative mappings for first query node have been tried, and have failed [NO MATCH] extra tricks can be used for further improvement only map nodes with same element type and charge, and compatible bonding patterns start with unusual atom types, and nodes with lots of neighbours

21. Partitioning and Relaxation often used as an adjunct to backtracking start by partitioning the nodes into sets of possible correspondents e.g. nitrogens can only match nitrogens iteratively refine the partition on basis of other possible correspondences e.g. if F6 is only possible correspondent for Q1 then F6 cannot be a correspondent for Q2 if the list of possible correspondents for a query node becomes empty, there is no isomorphism

22. Partitioning and Relaxation can also reduce lists of possible correspondents by looking at neighbours if F6 is to remain a valid correspondent of Q1, then the neighbours of F6 must be possible correspondents of the neighbours of Q1 as this check is repeated for each node, we are bringing in information from further away, but only ever looking at immediate neighbours this technique is the same as Morgan�s algorithm for node labelling in canonicalisation it is called relaxation backtracking can be used as a fallback when no further reductions can be made in the lists of possible correspondents

23. Subgraph isomorphism algorithms Ray and Kirsch�s algorithm (1957) basic backtracking Sussenguth�s partitioning algorithm (1965) relaxation technique called �connectivity property�, with backtracking as fall-back Figueras�s set reduction algorithm (1972) Ullmann�s algorithm (1976) efficient relaxation and backtracking von Scholley�s relaxation algorithm (1984)

24. Screening so far we�ve considered matching one query substructure against one database full structure each structure from the database needs to be compared against the query in turn many will fail because they don�t contain the query substructure �screening� allows many of these to be eliminated before we get to this stage uses structure �fingerprints� discussed in lecture 3

25. Fingerprints the fragments present in a structure can be represented as a sequence of 0s and 1s 00010100010101000101010011110100 0 means fragment is not present in structure 1 means fragment is present in structure (perhaps multiple times) each 0 or 1 can be represented as a single bit in the computer (a �bitstring�) for chemical structures often called structure �fingerprints�

26. Screening build a fingerprint for the query substructure only those database structures that contain all the fragments in the query can possibly match the query Query: 00000100010101000001010011010100 DB struct 1: 00010100010101000101010011110100 MATCH DB struct 2: 00000000100101001001000011100000 NO MATCH comparing fingerprint bitstrings is very fast (logical AND operation) only those structures that pass the screening stage need to be considered as candidates for atom-by-atom isomorphism search

27. Screening can be made faster by �inverting� the bit strings (actually, turning them on their side) instead of a bitstring of fragments for each structure ... store a bitstring of structures for each fragment each bit represents a database structure 1=structure contains fragment; 0=structure does not search by ANDing together the bitstrings for the fragments present in the query this will list those structures that contain all the query�s fragments

28. Screening Effectiveness Ideally we want to eliminate as many structures as possible at the screening stage 99% screenout or more would be good Fingerprint construction can help in this frequency distributions of fragments in a large database are very �skewed� a few fragments occur in almost all compounds will therefore give little or no screenout many fragments occur in very few compounds need very long fingerprint (lots of fragments) to ensure that we will have some in the query

29. Fingerprint construction best fragments are medium frequency ones fragments also need to be independent of each other dictionary used for CA Registry search was constructed on basis of analysis of fragment frequency distributions

30. Daylight fingerprints each fragment is used to generate a hash code, which specifies the bit position to be set actually most fragments set several bits small fragments (more frequent) set fewer bits larger fragments (less frequent) set more bits several different fragments may set the same bit in principle this can reduce screening effectiveness it may allow a fingerprint to match when structure does not contain the same fragment as the query in practice is not a serious problem it will never cause a structure to be rejected when it is actually a match

31. Daylight fingerprints fingerprint can be �folded� to reduce length of fingerprint again this increases the chances of false matches, but with �sparse� (low-density) fingerprints this is more than offset by the increase in search speed

32. Hardware solutions �use a faster computer� cheaper memory means that a lot of operations can be performed in memory in Daylight, fingerprints are stored and matched in memory parallel processing database parallel (split database over several machines) algorithm-parallel (different operations on same structure on different processors)

33. Parallel processing Chemical Abstracts Registry File different machines search different parts of the database results are collated for presentation to user other research work has looked at various algorithms and various processors speedup declines as more processors added overheads in controlling them become dominant

34. Parallel processing subgraph isomorphism algorithms are not very suitable for algorithm parallelisation individual operations are very simple von Scholley algorithm designed for parallel implementation each processor handles one atom in relaxation step problem is distributing data to processors most processors spend time waiting for next data

35. Preprocessing the database Do the time-consuming work in advance Full structure search provides an example of this Canonicalisation is a slow process (NP-complete) but it can be done in a pre-processing of the file, independently of the query then store the canonical representations can do rapid matches against a canonicalised query structure this is faster than using a graph isomorphism algorithm on non-canonical representations

36. Preprocessing the database Similar principles are used in some substructure search systems A tree structure is built, classifying all the atoms found in all the structures in the database first level based on atom type second level based on number of connections third level based on type of first neighbour fourth level based on type of second neighbour etc. lower levels based on classifications applied to neighbours (relaxation) bottom of tree lists structures that contain this class of atom

37. Tree-structured fragment searches

38. Tree-structured fragment search search can be done by tracing tree, looking for atom classes found in query combine lists of structures found at the bottom a backtracking atom-by-atom search may be needed to check hits found best-known example is Beilstein�s Crossfire main problem is updating the trees when new structures are added to the database

39. Substructure queries queries for substructure search systems may be more complicated than simple subgraphs different systems provide different capabilities variable atom and bond types specification ofallowed substitution

40. Substructure queries some systems provide very complex query options

41. Substructure queries: SMARTS Daylight uses an extension of SMILES to describe structure queries (SMARTS) can attach various properties to each atom [CX3] carbon with 3 connections [Nr5] nitrogen in a ring of size 5 properties can be combined with logical operators ! (NOT) & (AND � high precedence) , (OR) ; (AND � low precedence)

42. SMARTS complex patterns can be specified this way: [F,Cl, Br, I] any of the halogen atoms [!C;!R0] heteroatom in a ring $(smarts_string) can also be used as an atom property this is called recursive SMARTS e.g. $(NC=*) nitrogen single-bond carbon double-bond any-atom (i.e. an amide)

43. SMARTS recursive SMARTS can be used to describe very complex patterns e.g. primary or secondary amine, but not amide

44. Commercial systems Several software companies provide structure registration and search systems to the chemical/pharmaceutical industry MDL Information Systems Inc. MACCS, ISIS Daylight Chemical Information Systems Inc. THOR, MERLIN, DayCart (Oracle cartridge) IDBS ActivityBase Accelrys Accord Enterprise Informatics replaces Oxford Molecular RS3

45. Conclusions from Lecture 4 structure matching is an NP-complete problem worst-case time requirements rise exponentially with number of atoms involved heuristics (tricks) can be used to improve average search speed several algorithms have been published most use partitioning and relaxation techniques fingerprint screening can rapidly eliminate the bulk of non-matching structures different systems allow different degrees of sophistication in formulating search queries

46. Further Reading J. M. Barnard, �Substructure searching methods: old and new�, J. Chem. Inf. Comput. Sci., 1993, 33, 532-538 J. Xu. �Two dimensional structure and substructure searching.� In J. Gasteiger (ed.) Handbook of Chemoinformatics: From Data to Knowledge, Vol. 2, pp. 868-884, Wiley-VCH, 2003

47. Lecture 5: More structure searching Searching Markush structures in patents nature and origin of Markush structures fragment codes topological systems (MARPAT, Markush DARC) Reaction searching atom-atom mapping Maximal Common Substructure search what is the largest substructure common to two molecules?

Chemical Structure Representation and Search Systems Lecture 4. Nov 11, 2003

Chemical Structure Representation and Search Systems Lecture 4. Nov 11, 2003

Presentation Transcript

chemical structure representation and search systems lecture 3 ...

chemical structure representation and search systems lecture 2. oct 30, 2003

Chemical Structure Representation and Search Systems Lecture 5. Nov 13, 2003

Chemical Structure Representation and Search Systems Lecture 6. Nov 18, 2003

Data Representation 1 Lecture 4

Crystal Structure Lecture 4

Chapter 4. Chemical Structure and Polymer Properties

Lecture 11 Market Structure

Lecture #11, Nov. 4, 2002

Signals and Systems Fall 2003 Lecture #11 9 October 2003

Nov 4, 2003

VISTAS Modeling Overview Nov 4, 2003

Signals and Systems Fall 2003 Lecture #3 11 September 2003

Structure Analysis and Representation

Signals and Systems Fall 2003 Lecture #17 4 November 2003

Data Structure Lecture-4

Signals and Systems Lecture 4

TOPIC 4 CHEMICAL BONDING AND STRUCTURE

TOPIC 4 CHEMICAL BONDING AND STRUCTURE

TOPIC 4 CHEMICAL BONDING AND STRUCTURE

Signals and Systems Fall 2003 Lecture #17 4 November 2003