Introducing Chemoinformatics

Introducing Chemoinformatics BCCE Chemoinformatics Workshop July 2006 Gary Wiggins, David Wild Indiana University School of Informatics David Wild – Research Overview July 2006. Page 1

Chemoinformatics is … • Also known as cheminformatics or chemical informatics • Very differently defined, reflecting its cross-disciplinary nature • Librarian • Chemist (synthetic, medicinal, theoretical) • Biologist / Bioinformatician • Molecular modeler • Pharmaceutical or Chemical Engineer • Computer Scientist / Informatician

A working definition of chemoinformatics: Chemoinformatics (a.k.a. chemical informatics) is the branch of informatics dealing with all aspects of the representation and use of chemical structures, proteins, and related information, on computer. … It is an interdisciplinary field that regularly pushes the boundaries of computer science, statistics, visualization methods, computing power and scientific technique. The subject covers a wide variety of applications and specialties, particularly in the pharmaceutical industry, where the rapid increase in new technologies in drug discovery puts chemoinformatics at the forefront of drug design. It is foundational to such diverse applications as 3D molecular modeling, artificial intelligence methods for biological activity prediction, patent and chemical database searching, and high throughput screening data analysis.

More definitions • Computational Chemistry – The application of mathematical and computational methods, particularly to theoretical chemistry • Molecular Modeling – Using 3D graphics and optimization techniques to help understand the nature and action of compounds and proteins • Computer-Aided Drug Design – The discipline of using computational techniques (including chemical informatics) to assist in the discovery and design of drugs.

Chemoinformatics hits on Google • Dec 2005: 348,100 • April 2005: 125,600 • July 2000: 723 Number of word occurrences on Google, taken from http://www.molinspiration.com/chemoinformatics.html

Hits on Chemoinf.com, August 15 – 29, 2005 (sitemeter.com)

Traditional areas of application • Pharmaceutical & life science industry • particularly in early stage drug design • Databases of available chemicals • Electronic publishing • including searchable chemical structure information in journals, etc. • Government and patent databases

The theory so far (1960s to present) … • How do you represent 2D and 3D chemical structures? • Not just a pretty picture • How do you search databases of chemical structures? • Google doesn’t help (much, but it might do soon…) • How do you organize large amounts of chemical information? • How do you visualize chemical structures & proteins? • Can computers predict how chemicals are going to behave? • … in the test tube? • … in the body?

Current trends & hot topics • The move of chemical informatics into the public domain (PubChem, MLI, eScience, open source) • Service-oriented architectures • Packaging & processing large volumes of complex information for human consumption • Integration with other –ics (bioinformatics, genomics, proteomics, systems biology)

What does it mean for the bench chemist? • An increasing number of web tools and databases available which can aid in compound acquisition, synthesis, and biological profiling • A trend towards more (and more effective) use of computers in the lab - not just for email • A need for most synthetic chemists (and all medicinal chemists) to be aware of computational techniques and how they can assist in the compound synthesis and drug discovery processes • An opportunity to combine an interest in chemistry with an interest in computers

Chemical Informatics Programs at IU • Graduate Certificate in Chemical Informatics • I571 Chemical Information Technology • I572 Computational Chemistry & Molecular Modeling • I573 Programming for Chemical and Life Science Informatics • Independent Study in Chemical Informatics • M.Sc. in Chemical Informatics • Ph.D. in Informatics (Chemical Informatics Track)

Chemoinformatics software vendors • Accelrys - large chemoinformatics company • ACD/Labs - analytical informatics & predictions • Digital Chemistry - 2D fingerprinting, clustering toolkits & software • CambridgeSoft - 2D drawing tools & E-notebooks • CAS - produce SciFinder Scholar searching software • ChemAxon - Java based toolkits and software • Daylight - 2D representation & searching software • Leadscope - 2D structure and property tools • Lion Bioscience - produce LeadNavigator • MDL - large chemoinformatics company • Mesa Analytics and Computing - educational & statistical tools • OpenEye - fast 3D docking, structure generation, toolkits • Quantum Pharmaceuticals - prediction, docking, screening • Sage Informatics - ChemTK 2D analysis software • Tripos - large chemoinformatics company

Main academic sites • “Pure” Chemoinformatics • University of Sheffield, UK (Willett / Gillet) • http://www.shef.ac.uk/uni/academic/I-M/is/research/cirg.html • Erlangen, Germany (Gasteiger) • http://www2.chemie.uni-erlangen.de/ • Cambridge Unilever Center • http://www-ucc.ch.cam.ac.uk/ • Indiana University School of Informatics • http://www.informatics.indiana.edu/ • Related (computational chemistry, etc.) • UCSF (Kuntz) • http://mdi.ucsf.edu/ • University of Texas (Pearlman) • http://www.utexas.edu/pharmacy/divisions/pharmaceutics/faculty/pearlman.html • Yale (Jorgensen) • http://zarbi.chem.yale.edu/ • University of Michigan (Crippen) • http://www.umich.edu/~pharmacy/MedChem/faculty/crippen/

“Traditional” Journals • Journal of Chemical Information & Modeling (formerly JCICS) • http://pubs.acs.org/journals/jcisd8/index.html • Journal of Computer-Aided Molecular Design • http://www.kluweronline.com/issn/0920-654X • Journal of Molecular Graphics and Modeling • http://www.elsevier.com/inca/publications/store/5/2/5/0/1/2/ • Journal of Computational Chemistry • http://www3.interscience.wiley.com/cgi-bin/jhome/33822 • Journal of Chemical Theory and Computation • http://pubs.acs.org/journals/jctcce/ • Journal of Medicinal Chemistry • http://pubs.acs.org/journals/jmcmar/

“Informal” publications • Network Science (online) • http://www.netsci.org/Science/index.html • Chemical & Engineering News • http://pubs.acs.org/cen/ • Drug Discovery Today • http://www.drugdiscoverytoday.com/ • Scientific Computing World • http://www.scientific-computing.com/ • Bio-IT World • http://www.bio-itworld.com/

CHMINF-L Distribution List • Chemical Information Sources Discussion List • Created by Gary Wiggins at IUB • http://listserv.indiana.edu/archives/chminf-l.html

Yahoo! Chemoinformatics Discussion List, for: • Job postings • Ideas exchange • Questions • Industry – Student connections To join, go to http://groups.yahoo.com/group/chemoinf or send an email to chemoinf-subscribe@yahoogroups.com

Impacting Industry

Example 1: High-Throughput Screening. Testing perhaps millions of compounds in a corporate collection to see if any show activity against a certain disease protein

High-Throughput Screening • Traditionally, small numbers of compounds were tested for a particular project or therapeutic area • About 10 years ago, technology developed that enabled large numbers of compounds to be assayed quickly • High-throughput screening can now test 100,000 compounds a day for activity against a protein target • Maybe tens of thousands of these compounds will show some activity for the protein • The chemist needs to intelligently select the 2 - 3 classes of compounds that show the most promise for being drugs to follow up

Informatics Implications • Need to be able to store chemical structure and biological data for millions of data points • Computational representation of 2D structure • Need to be able to organize thousands of active compounds into meaningful groups • Group similar structures together and relate to activity • Need to learn as much information as possible (data mining) • Apply statistical methods to the structures and related information
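One way to picture "group similar structures together" is sketched below. It is only an illustration under stated assumptions: each compound is assumed to be already reduced to a set of integer feature identifiers (hypothetical fingerprints), similarity is measured with the Tanimoto coefficient, and grouping uses a simple greedy "leader" scheme. The slides do not name a particular similarity measure or clustering method, so these are stand-in choices.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient: shared features divided by total distinct features."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_by_similarity(fingerprints: dict, threshold: float = 0.5) -> list:
    """Greedy leader grouping: a compound joins the first group whose
    leader is at least `threshold` similar, otherwise it starts a new group."""
    groups = []  # list of (leader name, member names)
    for name, fp in fingerprints.items():
        for leader, members in groups:
            if tanimoto(fingerprints[leader], fp) >= threshold:
                members.append(name)
                break
        else:
            groups.append((name, [name]))
    return [members for _, members in groups]


# Hypothetical fingerprints: the feature IDs are made up for illustration.
fingerprints = {
    "cmpd_A": {1, 2, 3, 4},
    "cmpd_B": {1, 2, 3, 5},
    "cmpd_C": {7, 8, 9},
    "cmpd_D": {7, 8, 9, 10},
}
groups = group_by_similarity(fingerprints)
print(groups)
```

Each resulting group could then be related to the measured activity of its members, which is the step the slide calls relating structure to activity.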

Tools for mining the data: Tripos Benchware HTS Dataminer (formerly SAR Navigator), www.tripos.com

Example 2: 3D Visualization & Docking • 3D Visualization of interactions between compounds and proteins • “Docking” compounds into proteins computationally

3D Visualization • X-ray crystallography and NMR spectroscopy can reveal the 3D structure of a protein and its bound compounds • Visualization of these “complexes” of proteins and potential drugs can help scientists understand the mechanism of action of the drug and improve its design • Visualization uses computational “ball and stick” models of atoms and bonds, as well as surfaces • Stereoscopic visualization available

Accelrys Discovery Studio

Docking algorithms • Require 3D atomic structure for protein, and 3D structure for compound (“ligand”) • May require initial rough positioning for the ligand • Will use an optimization method to try to find the best rotation and translation of the ligand in the protein, for optimal binding affinity

Genetic Algorithms • Create a “population” of possible solutions, encoded as “chromosomes” • Use “fitness function” to score solutions • Good solutions are combined together (“crossover”) and altered (“mutation”) to provide new solutions • The process repeats until the population “converges” on a solution
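The loop described above can be sketched on a toy problem. This is not how any docking program scores poses: the chromosome here is just three numbers (think of them as a hypothetical ligand translation), and the fitness function is an invented stand-in that peaks at an arbitrary target. The population size, mutation rate, and fixed generation count are likewise assumptions; a real run would stop on a convergence test.

```python
import random

TARGET = (1.0, -2.0, 0.5)  # hypothetical "best pose", made up for the demo


def fitness(chromosome):
    """Toy score, higher is better: negative squared distance to TARGET."""
    return -sum((c - t) ** 2 for c, t in zip(chromosome, TARGET))


def crossover(a, b, rng):
    """Single-point crossover: head of one parent, tail of the other."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(chromosome, rng, rate=0.3, step=0.5):
    """Perturb each gene with probability `rate` by Gaussian noise."""
    return [c + rng.gauss(0, step) if rng.random() < rate else c
            for c in chromosome]


def run_ga(pop_size=30, generations=60, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-5, 5) for _ in range(3)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]  # keep the fitter half
        children = [population[0]]             # elitism: best survives unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        population = children
    return max(population, key=fitness), history


best, history = run_ga()
print(best, fitness(best))
```

Because the best chromosome is carried over unchanged each generation, the best score can never get worse, which makes the "converging" behaviour easy to watch in `history`.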

Sample GOLD output: GMP into RNaseT1

Something fun… Screensaver that docks molecules while your computer is idle at http://www.grid.org/projects/cancer/

Representing 2D structures with SMILES

Historical ways of representing chemicals • Trivial name, e.g. Baking Soda, Aspirin, Citric Acid, etc. Identifies the compound, but gives no (or little) information about what it consists of • Chemical formula, e.g. C6H12O6. Specifies the type and quantity of the atoms in the compound, but not its structure (i.e. how the atoms are connected by bonds) • Systematic name, e.g. 1,2-dibromo-3-chloropropane. Identifies the atoms present and how they are connected by bonds.
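To make the point about chemical formulas concrete, here is a minimal sketch that reads a simple formula such as C6H12O6 into atom counts. It assumes plain element-plus-count formulas with no parentheses, charges, or hydrates; the function name is illustrative only.

```python
import re
from collections import Counter


def parse_formula(formula: str) -> Counter:
    """Count atoms in a simple formula like 'C6H12O6' (no parentheses)."""
    counts = Counter()
    # An element is an uppercase letter plus an optional lowercase letter,
    # followed by an optional count (a missing count means 1).
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return counts


glucose = parse_formula("C6H12O6")
print(dict(glucose))
```

The result says which atoms are present and how many, but nothing about how they are bonded, which is why a formula alone cannot distinguish isomers.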

Trivial and Systematic Names Trivial name: • tyrosine Systematic names: • β-(p-hydroxyphenyl)alanine • α-amino-p-hydroxyhydrocinnamic acid

Historical ways of representing chemicals • 2D structure diagram, shows atoms present and how they are connected by bonds • 3D structure diagram, shows how atoms are related to each other in 3D space. Can take a variety of forms. Accurate models only really possible since X-ray crystallography and computers… but ball and stick models have been around a long time!

Early computer representations • How do we communicate structural information between humans and the computer? • Line notations, e.g. Wiswesser Line Notation (and later SMILES) • How do we represent the atoms and bonds in a molecule internally in a computer? • Atom lookup and connection tables
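An atom lookup table and a connection table can be sketched with two small lists. The example molecule (ethanol, heavy atoms only, hydrogens left implicit) and the `Bond` and `neighbors` names are illustrative assumptions, not a format taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class Bond:
    a: int      # index of the first atom in the atom table
    b: int      # index of the second atom
    order: int  # 1 = single, 2 = double, 3 = triple


# Atom lookup table: position in the list is the atom's index.
atoms = ["C", "C", "O"]

# Connection table for ethanol: C0-C1 and C1-O2, both single bonds.
bonds = [Bond(0, 1, 1), Bond(1, 2, 1)]


def neighbors(index: int, bonds: list) -> list:
    """Return indices of atoms bonded to the atom at `index`."""
    result = []
    for bond in bonds:
        if bond.a == index:
            result.append(bond.b)
        elif bond.b == index:
            result.append(bond.a)
    return result


for i, symbol in enumerate(atoms):
    print(i, symbol, "bonded to", neighbors(i, bonds))
```

Keeping atoms and bonds in separate tables is what lets a program answer questions such as "what is attached to this carbon?" without interpreting a drawing.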

Linear notations • Represent the atoms, bonds and connectivity of a molecule in a linear text string • Concise representation • Originally designed for manual command line entry into text-only systems • Now an excellent format for file and database storage (e.g. can be held in a spreadsheet cell, on one line of a text file, or in an Oracle database text field)

Wiswesser Line Notation (obsolete) • WLN for this structure is QVYZ1R DQ • Uses text symbolic representation of functional groups, e.g.: • Q = OH, V = -CO-, Z = -NH2, R = benzene • Other symbols represent branching, e.g. Y

SMILES (Dave Weininger, Daylight, www.daylight.com) • (one possible) SMILES for this structure is OC(=O)C(N)CC1=CC=C(O)C=C1 • Can identify any chemical structure • There can be several ways of writing the same structure in SMILES (although a system for generating canonical SMILES exists)

SMILES – Atoms & Bonds • Atoms represented by their chemical symbol (C, N, S, O, Br, etc). Uppercase for aliphatic, lowercase for aromatic • Adjacent atoms implicitly single bonded, or = for double bond, or # for triple bond • Hydrogens usually implicit • Example: propane, CCC

SMILES – Atoms & Bonds • Example: 1-propanol, CCCO or, equally valid, OCCC

SMILES – Atoms & Bonds • Example: propene, C=CC or, equally valid, CC=C

SMILES – Branching & Rings • Parentheses represent branching • Ring closures represented by using numbers to signify attachment points • Example: 2-propanol, CC(O)C

SMILES – Branching & Rings • Example: cyclohexane, C1CCCCC1

SMILES – Branching & Rings • Example: benzene, c1ccccc1

SMILES – Branching & Rings • Example: bromobenzene, c1cc(Br)ccc1
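The atom, bond, branching and ring rules can be tied together with a small parser that turns a SMILES string into an atom list and a connection table. This is a sketch for a deliberately limited subset (unbracketed atoms, the bond symbols -, = and #, parentheses, single-digit ring numbers), not the Daylight implementation; the function name and the choice to store aromatic bonds with order 1 are simplifying assumptions.

```python
def parse_smiles(smiles: str):
    """Parse a small SMILES subset into (atoms, bonds).

    atoms: list of symbols as written (lowercase means aromatic)
    bonds: list of (i, j, order) tuples; aromatic bonds are stored as 1
    """
    bond_orders = {"-": 1, "=": 2, "#": 3}
    atoms, bonds = [], []
    previous = None      # atom the next atom will be bonded to
    pending_order = 1    # order of the next bond (default single)
    branch_stack = []    # atoms to return to when a ")" is reached
    ring_open = {}       # ring digit -> (atom index, bond order)
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        symbol = None
        if smiles[i:i + 2] in ("Cl", "Br"):
            symbol, i = smiles[i:i + 2], i + 2
        elif ch in "BCNOPSFIbcnops":
            symbol, i = ch, i + 1
        else:
            i += 1
        if symbol is not None:
            atoms.append(symbol)
            current = len(atoms) - 1
            if previous is not None:
                bonds.append((previous, current, pending_order))
            previous, pending_order = current, 1
        elif ch in bond_orders:
            pending_order = bond_orders[ch]
        elif ch == "(":
            branch_stack.append(previous)   # remember the branch point
        elif ch == ")":
            previous = branch_stack.pop()   # go back to the branch point
        elif ch.isdigit():
            if ch in ring_open:             # second use of a digit closes the ring
                start, order = ring_open.pop(ch)
                bonds.append((start, previous, max(order, pending_order)))
            else:                           # first use of a digit opens the ring
                ring_open[ch] = (previous, pending_order)
            pending_order = 1
        else:
            raise ValueError(f"unsupported character {ch!r}")
    return atoms, bonds


for text in ["CC(O)C", "C1CCCCC1", "c1cc(Br)ccc1"]:
    atoms, bonds = parse_smiles(text)
    print(text, len(atoms), "atoms,", len(bonds), "bonds")
```

A ring shows up as one more bond than a chain of the same length would have: cyclohexane has six atoms and six bonds, whereas hexane (CCCCCC) would have six atoms and five bonds.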

SMILES – Acetaminophen (Tylenol) • Acetaminophen: c1c(O)ccc(NC(=O)C)c1

SMILES – multiple ring structure • Indole: c1ccc2[nH]ccc2c1

Other SMILES notes • All hydrogen atoms are implicit unless declared otherwise • Atoms outside the organic subset (B, C, N, O, P, S, F, Cl, Br, I), explicit hydrogens and modified atoms need to be placed in square brackets, e.g. [Pb], [Xe] • Charged species indicated by a + or - (and square brackets), e.g. [Na+], [N+], [O-], [Ca++] • Unknown atoms can be represented by a * (but watch out for confusion with SMARTS!) • Stereochemistry can be indicated using @ and @@ • “Canonical SMILES” can be created
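The bracket and charge conventions can be illustrated with a small tokenizer. It is a sketch only: the token names, the `tokenize` and `bracket_charge` helpers, and the decision to ignore stereochemistry markers and atom classes are assumptions made for the example, and it covers far less than the official syntax.

```python
import re

# Token patterns, tried in order; bracket atoms come first so "[" is never
# mistaken for anything else.
TOKEN_SPEC = [
    ("bracket", r"\[[^\]]+\]"),               # e.g. [Na+], [nH], [Pb]
    ("atom", r"Cl|Br|[BCNOPSFI]|[bcnops]"),   # unbracketed organic-subset atoms
    ("wildcard", r"\*"),                      # unknown atom
    ("bond", r"[-=#]"),
    ("branch", r"[()]"),
    ("ring", r"\d"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def tokenize(smiles: str) -> list:
    """Split a SMILES string into (kind, text) tokens."""
    tokens, pos = [], 0
    while pos < len(smiles):
        match = TOKEN_RE.match(smiles, pos)
        if match is None:
            raise ValueError(f"unsupported character {smiles[pos]!r} at {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


def bracket_charge(token: str) -> int:
    """Net charge of a bracket atom such as [Na+], [O-], [Ca++] or [Fe+3]."""
    match = re.search(r"(\++|-+)(\d*)$", token[1:-1])
    if match is None:
        return 0
    signs, digits = match.groups()
    magnitude = int(digits) if digits else len(signs)
    return magnitude if signs[0] == "+" else -magnitude


print(tokenize("c1ccc2[nH]ccc2c1"))
for atom in ["[Na+]", "[O-]", "[Ca++]", "[Pb]"]:
    print(atom, bracket_charge(atom))
```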

SMILES Homepage http://www.daylight.com/smiles/ • Official Syntax Guide • Tutorial • Examples • Resources

Other Line Notations • ROSDAL (Representation Of Structure Diagram Arranged Linearly) - Beilstein: 1O-2=3O,2-4-5N,4-6-7=-12-7,10-13O • Sybyl Line Notation (SLN) - Tripos: OHC(=O)CH(NH2)CH2C[1]=CHCH=C(OH)CH=CH@1

Example free online web resources. For more links, see http://www.chemoinf.com/
