Maximizing the Use of the Lawson Number in Beilstein Searching
Gary Wiggins and Usha Coca, School of Informatics, Indiana University
ACS CERM, June 4, 2004
Abstract • In the Beilstein database, the Lawson Number is based on the Beilstein System for classifying organic substances. Every substance in the Beilstein file has at least one Lawson Number, and the smaller the Lawson Number, the more common the fragment it represents. Although the Lawson Number is a searchable field, searching with Lawson Numbers is not equivalent to substructure or Markush searching. Because Lawson Numbers represent certain structural fragments, they can be used for structural similarity searches. Searches that include the Lawson Number are effective when combined with other search keys, such as molecular formula or element ranges. The Lawson Number is also useful when combined with NOT in substructure searches. Thus, the Lawson Number could serve as an effective index search key if its meaning were known. We have developed a prototype system that could be interfaced with the CrossFire system for effective use of Lawson Numbers in searching. The system will be described and demonstrated.
Beilstein Handbook of Organic Chemistry in 27 “volumes”
Series            Abbrev.     Coverage
Basic             H           up to 1910
Sup Ser I         E I         1910-1919
Sup Ser II        E II        1920-1929
Sup Ser III       E III       1930-1949 (v. 1-16 only)
Sup Ser III/IV    E III/IV    1930-1959 (v. 17-27 only)
Sup Ser IV        E IV        1950-1959 (v. 1-16 only)
Sup Ser V         E V         1960-1979 (English)
Psst! Want a good, cheap set of Beilstein? We have finally decided that our cramped Chemistry Library can no longer afford the luxury of retaining our Beilstein print collection (which has probably not been touched for several years now, since we acquired the online version). We hope we can find a new home for the collection (all 437 volumes, plus a handful of how-to-use-it texts), otherwise it must be discarded. Any organization willing to pay the shipping costs is welcome to this collection. If interested, please contact me directly. Howard M. Dess Chemistry and Physics Librarian Library of Science and Medicine Rutgers University Piscataway, NJ 08854-8009 Source: CHMINF-L, June 1, 2004
Beilstein Handbook: Arrangement of Compounds • Beilstein: a collection of critically evaluated data on organic compounds arranged in a classified manner • Arrangement: • Acyclic Compounds, Volumes 1-4 • Isocyclic Compounds, Volumes 5-16 • Heterocyclic Compounds, Volumes 17-27 • Divided into System Numbers 1-4720 • Each Supplementary Series (E) volume contains the same classes of compounds as the corresponding Basic (H) volume
System Number Meaning • Beilstein Institute never published the meanings of the System Numbers • System Number 3691 means “heterocyclic carbon frameworks with exactly 2 N ring atoms with a combination of exactly 2 hydroxy groups and 1 carboxylic acid group”
Placement of Info in Beilstein: Registry (Index) Compounds • Stem nuclei: Hydrocarbons, saturated followed by unsaturated • Oxy = Hydroxy compounds: alcohols (OH) • Oxo = Carbonyl compounds: aldehydes and ketones (C=O) • Carboxylic Acids (COOH) • Sulfinic Acids (SO2H) • Sulfonic Acids (SO3H) • Chalcogen Oxoacids (XO2H, XO2OH); X = S, Se, Te • Amines (NH2) • Hydroxylamines (NHOH) & Dihydroxylamines (N(OH)2) • Hydrazines (NHNH2) • Azo compounds (N=NH) • More complex N functionalities • Group containing other elements (P, As, Si, Mg, etc.)
Beilstein System Algorithm 1 • Beilstein “hydrolysis” scheme based on an instinctive chemical classification as perceived by an organic chemist • Carbons with more than one (non-ring) heteroatom attached are always regarded as derived from carbonyl groups, if: • at least one of the heteroatoms is other than the attachment atom of a substituent (halogen, nitro, nitroso, azide)
Beilstein System Algorithm 2 • Splits any molecule into a set of fragments • Splitting points are C-Q bonds, where Q is a heteroatom that does not belong to a ring in common with the C in question • Fragments then classified and coded using • skeletal features • type and multiplicity of chemical functional groups (including masked groups) • degree of unsaturation • carbon number (See "Notes for Users" at the start of each Beilstein volume published from about 1992 onwards.)
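The splitting rule on this slide can be sketched in a few lines of Python. This is an illustration only, not the Beilstein implementation: the function name `splitting_bonds`, the atom numbering, and the toy molecule (tetrahydrofuran-2-ylmethanol with implicit hydrogens) are assumptions made for the example, and the carbonyl rule from Algorithm 1 is not applied.

```python
def splitting_bonds(elements, bonds, rings):
    """Return (carbon, heteroatom) bonds where the two atoms share no ring."""
    result = []
    for a, b in bonds:
        # Try both orientations so that `c` is the carbon and `q` the heteroatom.
        for c, q in ((a, b), (b, a)):
            if elements[c] != "C" or elements[q] in ("C", "H"):
                continue
            shares_ring = any(c in ring and q in ring for ring in rings)
            if not shares_ring:
                result.append((c, q))
    return result


# Hypothetical toy input: ring atoms 1-5 (atom 5 is the ring oxygen),
# exocyclic CH2 (atom 6) carrying a hydroxy oxygen (atom 7).
elements = {1: "C", 2: "C", 3: "C", 4: "C", 5: "O", 6: "C", 7: "O"}
bonds = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (1, 6), (6, 7)]
rings = [{1, 2, 3, 4, 5}]

print(splitting_bonds(elements, bonds, rings))  # [(6, 7)]
```

The ring C-O bonds are left intact because the oxygen shares a ring with its carbon neighbours; only the exocyclic C-OH bond qualifies as a splitting point.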
Source of Ambiguity • In the physical Beilstein Handbook, the end of one system number and the beginning of another sometimes occur on the same physical page. • Leads to bleed-over from the previous section (e.g., alkyl hydrocarbons linked to the simplest alcohol, methanol)
Lawson Number • Originally used in the program SANDRA • Algorithmic expression of the System Numbers in the printed work • System Numbers: 1-4720 • Lawson Numbers: 8-32759 • System Number = Lawson Number divided by 8 (roughly) • Inherited the ambiguity of the page number placement
Lawson Number: Purpose • To divide the total virtual structure universe of published and unpublished compounds into approximately equal sections (virtual pages) of related compounds
Lawson Number Occurrence 1 • Any compound may have several LNs; most have 2 to 3. • In 1991 (1.8 million compounds in the file at that time): • 25.1% had 1 • 39.4% had 2 • 24.0% had 3 • 8.5% had 4 • 3.0% had > 4 • The average LN occurred in about 70 compounds in 1991
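As a quick check on the 1991 figures above, the shares can be summed and turned into a lower bound on the mean number of LNs per compound. This is a sketch under one stated assumption: every compound in the "> 4" group is counted as having exactly 5 LNs, so the true mean is at least the value computed here.

```python
# Shares from the 1991 figures, in tenths of a percent so the sums stay exact.
# Key 5 stands in for the "> 4" group (assumption: counted as exactly 5 LNs).
share_by_ln_count = {1: 251, 2: 394, 3: 240, 4: 85, 5: 30}

total = sum(share_by_ln_count.values())
weighted = sum(count * share for count, share in share_by_ln_count.items())

mean_lower_bound = weighted / total
two_or_three = (share_by_ln_count[2] + share_by_ln_count[3]) / 10

print(total / 10)         # 100.0 (the shares cover the whole file)
print(two_or_three)       # 63.4 (percent of compounds with 2 or 3 LNs)
print(mean_lower_bound)   # 2.249
```

The result is consistent with the statement that most compounds have 2 to 3 LNs.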
Lawson Number Occurrence 2 • Occasionally a LN will represent a unique structure; e.g., LN 12 retrieves only BRN 4736629.
What governs the value of the LN? In order of influence: • Cyclic class (number and type of heteroatoms) • Chemical functions (amine, hydroxy, etc.) • Degree of unsaturation of the carbon framework with respect to multiple bonds at carbon + ring closures • Carbon count of the carbon-complete fragment framework • Degree of carbon branching • Degree of halogen and nitro substitution • Chalcogen exchange • Ring sizes
Beilstein Handbook of Organic Chemistry: SANDRA • SANDRA, Structure AND Reference Analyzer • Program that interpreted a graphical structure of a compound and predicted where it should be found in printed Beilstein • Developed in 1987 by Alexander Lawson for use on a local microcomputer • SANDRA fragment screens had a heavy chemical bias: classified according to chemical structure
Beilstein Handbook of Organic Chemistry: SANDRA • 12-digit code linked information to page ranges
Beilstein Handbook of Organic Chemistry: SANDRA • This compound belongs in v. 13 Syst. 1823 H p. 348 • Hashcodes: • Ethylamine 000500010002 • Phenol 800100010906 • Non-localized amino-cyclohexanol 800510010306
Beilstein Handbook of Organic Chemistry: SANDRA • Each 12-digit hash code had a corresponding 4-digit code, e.g., the number 1849 linked 800510010306 to System No. 1823, H-page 348. • The four-digit number retained the sortability of the 12-digit code but gave a hash code for each fragment that can be stored in 2 bytes: 7392-28C1-1610
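The 2-byte storage claim can be illustrated with Python's standard `struct` module. This is a sketch, not SANDRA's actual storage format: the big-endian signed 16-bit layout and the helper names `pack_code` and `unpack_code` are assumptions. A signed 16-bit integer holds values up to 32767, which is enough for the Lawson Number range 8-32759 given earlier.

```python
import struct


def pack_code(code: int) -> bytes:
    """Pack a code into 2 bytes (assumed layout: big-endian signed 16-bit)."""
    return struct.pack(">h", code)


def unpack_code(raw: bytes) -> int:
    """Inverse of pack_code."""
    return struct.unpack(">h", raw)[0]


packed = pack_code(1849)             # the 4-digit code quoted on the slide
print(len(packed), packed.hex())     # 2 0739

# For non-negative codes, big-endian bytes sort in the same order as the integers.
codes = [31459, 289, 1849, 12]
by_bytes = [unpack_code(raw) for raw in sorted(pack_code(c) for c in codes)]
print(by_bytes == sorted(codes))     # True
```

Values above 32767 do not fit this layout; `struct.pack(">h", 40000)` raises `struct.error`.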
Lawson Number Planned Enhancements (around 1990) • A second phase of the LN implementation, covering LNs greater than 32767, never materialized • It was to include 8000 shape discriminators to help avoid false drops, with LN values in the range 32776-40951 • Ring skeletal shapes for all mono- and bicyclic systems (including fused, bridged, and spiro rings) of 3-10 ring atoms, containing 0, 1, or 2 heteroatoms of the set (O, N, S) in any combination and in any ring position, would each get a unique LN • Rings with 11-17 atoms, including O, N, S ring atoms, would get a LN • Another LN for those with heteroatoms other than N, O, S • All mono- and bicyclic systems with 18 or more ring atoms were to get one LN • A single LN for tricyclic and larger ring systems (further discrimination could be based on whether a particular skeleton is present or not, such as steroid, morphane, or adamantane skeletons)
Lawson Number Uses • Most effectively used when combined with other search elements, e.g.: • Molecular Formula • Element Ranges • Boolean operator NOT in combination with substructures
Lawson Number Search Tool • http://mypage.iu.edu/~ucoca/begperl/formFetch.html
Lawson Number Search in Usha’s DB for COOH/O-R/(O4) • Retrieves (among seven LN ranges):
LN Range       Function
31456-31471    COOH/O-R/(O4)
Beilstein CrossFire Search for LN Range 31456-31471 • Yielded 10,467 hits on 4/15/2004 • One of those was BRN 18833 with LNs 31459 and 289:
Lawson Number Search in Usha’s DB • Revealed that LN 289 is O-R(*1) • Combining the previous Beilstein CrossFire search with LN 289 yielded 4910 hits on 4/15/2004.
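The kind of lookup the Web tool performs can be sketched as a small range table. This is not the tool's actual code or schema: the table layout and the function name `describe_ln` are hypothetical, and only the two entries quoted on these slides are included. LN 289 is modelled as a one-value range, which is an assumption.

```python
# (low, high, meaning) entries taken from the slides; layout is hypothetical.
LN_MEANINGS = [
    (31456, 31471, "COOH/O-R/(O4)"),
    (289, 289, "O-R(*1)"),  # assumption: stored as a single-value range
]


def describe_ln(ln):
    """Return the recorded meaning for a Lawson Number, or None if unknown."""
    for low, high, meaning in LN_MEANINGS:
        if low <= ln <= high:
            return meaning
    return None


# The two LNs reported for BRN 18833:
print(describe_ln(31459))  # COOH/O-R/(O4)
print(describe_ln(289))    # O-R(*1)
```

A LN outside the table returns `None`, which mirrors the situation the talk describes: without a published key, the number alone tells the searcher nothing.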
Lawson Number Search in Beilstein CrossFire • Find a compound with a cyclopentane ring with three free sites (over 440,000 substances) and with both LN 31459 and LN 289 • Result: 10 substances on 4/15/2004
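The narrowing effect of combining LNs with AND and NOT can be shown with plain set operations. This is a toy sketch, not CrossFire's query engine: the function `search` is hypothetical, and apart from BRN 18833 (whose two LNs are quoted above) every record is made up for illustration.

```python
# BRN -> set of Lawson Numbers. Only BRN 18833 comes from the slides;
# BRNs 900001 and 900002 are invented placeholders.
RECORDS = {
    18833: {31459, 289},
    900001: {31459},
    900002: {289, 24059},
}


def search(require=frozenset(), exclude=frozenset()):
    """Return BRNs having every LN in `require` and no LN in `exclude`."""
    return sorted(
        brn
        for brn, lns in RECORDS.items()
        if set(require) <= lns and not (set(exclude) & lns)
    )


print(search(require={31459}))                 # [18833, 900001]
print(search(require={31459, 289}))            # [18833]
print(search(require={289}, exclude={24059}))  # [18833]
```

Each added requirement or exclusion can only shrink the hit list, which is the pattern seen in the hit counts reported for the CrossFire searches.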
Lawson Number Range Search #2 on CrossFire • 23369-25200 • Yielded 668,065 substances on 6/3/2004 • When combined with the chemical name segment Aziridin* in proximity to Propion*, the search yielded 142 substances.
Lawson Number 24059 • Parent Heterocycles N(1)
Possible to Link CrossFire to Usha’s Web Tool • Hop in feature • Allows users to jump into CrossFire Commander and run a search from a link on the Web (or from an external package)
Conclusion While the Lawson Number was originally developed as a tool to aid in finding the correct place for a given compound in the printed Beilstein, it clearly has utility in online searches of the Beilstein database. Having a Web supplement that defines the meaning of the Lawson Numbers will enhance the usefulness of the search field.
Bibliography and Acknowledgement The generous input from Dr. Alexander Lawson is much appreciated! • Lawson, Alexander J. “Structure graphics in: pointers to Beilstein out.” in: Warr, Wendy A., ed. Graphics for chemical structures: integration with text and data. (ACS Symposium Series; 341) American Chemical Society: Washington, 1987, 80-87. • Lawson, Alexander J. “Chemical structure browsing.” in: Warr, Wendy A., ed. Chemical structure information systems: Interfaces, communication, and standards. (ACS Symposium Series; 400) American Chemical Society: Washington, 1989, 41-49. • Lawson, Alexander J. “The Lawson similarity number (LN). Offline generation and online use.” in: Heller, Stephen R., ed. The Beilstein online database: implementation, content, and retrieval. (ACS Symposium Series; 436) American Chemical Society: Washington, 1990, 143-155.
Bibliography • Sunkel, J.; Hoffman, E.; Luckenbach, R. “Straightforward procedure for locating chemical compounds in the Beilstein Handbook.” Journal of Chemical Education 1981, 58(12), 982-986. • “A powerful tool for chemists: The Lawson-Number.” [brochure] Springer-Verlag, Berlin: 1989?. • Lawson, Alexander. Personal communication. 22 June 2001. • Meehan, Paul; Schofield, Helen. “CrossFire; a structural revolution for chemists.” Online Information Review 2001, 25(4), 241-249.
MIMAS (Manchester Information & Associated Services) • JISC-supported UK national data center • Run by Manchester Computing at the University of Manchester • Provides access to ISI Web of Knowledge, JSTOR, CrossFire, etc. • http://www.mimas.ac.uk/
MIMAS CrossFire Services • Very useful documentation • http://www.mimas.ac.uk/crossfire/docs.html • Introductory guides • Training materials • Manuals
UW-Madison CrossFire Site • Links to a locally-produced help file • http://chemistry.library.wisc.edu/beilstein/home.htm • Quick Guide • http://chemistry.library.wisc.edu/beilstein/quickguide.htm
Beilstein on STN • Beilstein on STN (Workshop Manual). FIZ Karlsruhe: Eggenstein-Leopoldshafen, 2003. • http://www.stn-international.com/training_center/chemistry/beilstein/beilstein_wsm.pdf
MDL Web Site • Replaces the former Beilstein site • MDL Knowledge Base • http://www.mdl.com/support/knowledgebase/index.jsp