2. Molecular Representations
Communicating Chemical Data • Chemical data: Text, numbers, and molecules • Standard valence model of chemistry • Discrete bonds represent shared electrons • Codify into a reproducible representation • Graph of atoms and bonds is the most commonly understood representation 2
2D Graph of Atoms / Bonds • Labeled graph • Nodes = Atoms, Symbols = {C, N, O, H, …} • Edges = Bonds, Order = {1, 2, 3, aromatic, …} • Organic compound shorthand • Assumed carbons • Implicit hydrogens • Standard valence rules 3
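The shorthand above leaves hydrogens out of the graph and recovers them from standard valence rules. A minimal sketch of that bookkeeping follows, assuming a small valence table and a four-atom example graph that are illustrative choices, not part of the slides; charged atoms, aromatic bonds, and unusual valences are not handled.

```python
# Assumed lowest standard valences for a few common elements (illustrative table).
STANDARD_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2, "F": 1, "Cl": 1, "Br": 1, "I": 1}


def implicit_hydrogens(atoms, bonds):
    """Return the implicit hydrogen count for each atom.

    atoms: list of element symbols (the labeled nodes)
    bonds: list of (atom index, atom index, bond order) edges
    """
    used = [0] * len(atoms)
    for a, b, order in bonds:
        used[a] += order   # every bond consumes valence on both of its atoms
        used[b] += order
    return [STANDARD_VALENCE[symbol] - u for symbol, u in zip(atoms, used)]


# Heavy-atom graph C-C(=O)-N: single bond 0-1, double bond 1-2, single bond 1-3.
atoms = ["C", "C", "O", "N"]
bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]
hydrogens = implicit_hydrogens(atoms, bonds)
print(hydrogens)
```

For this graph the methyl carbon picks up three hydrogens and the nitrogen two, which is why neither needs to be drawn.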
Tractability • Small, Tree-Like Graphs • Number of vertices is small (e.g. fewer than 50) • Number of edges is small (average degree ~2.3) • Tree-like 4
2D Data Formats • Bond matrix formats exist, but size ~ nAtoms² • Connection table • List of nodes • C1, C2, O3, N4 • List of edges • 1-2, 2=3, 2-4 • SDFile, Mol2 formats • Not human-writable 5
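The connection table on this slide can be written down directly as two lists. The sketch below stores the slide's nodes (C1, C2, O3, N4) and edges (1-2, 2=3, 2-4), then expands them into a bond matrix to show the nAtoms² cost. The function names are hypothetical, and this is not the SDFile or Mol2 layout.

```python
# Connection table for the slide's example: nodes C1, C2, O3, N4; edges 1-2, 2=3, 2-4.
atoms = {1: "C", 2: "C", 3: "O", 4: "N"}        # atom number -> element symbol
bonds = [(1, 2, 1), (2, 3, 2), (2, 4, 1)]       # (atom, atom, bond order)


def bond_matrix(atoms, bonds):
    """Expand the edge list into a dense nAtoms x nAtoms bond-order matrix."""
    index = {atom: i for i, atom in enumerate(sorted(atoms))}
    n = len(atoms)
    matrix = [[0] * n for _ in range(n)]
    for a, b, order in bonds:
        matrix[index[a]][index[b]] = order      # the matrix is symmetric
        matrix[index[b]][index[a]] = order
    return matrix


def average_degree(atoms, bonds):
    """Each bond adds one to the degree of two atoms."""
    return 2 * len(bonds) / len(atoms)


matrix = bond_matrix(atoms, bonds)
for row in matrix:
    print(row)
print("matrix cells:", len(matrix) ** 2, "vs. edge-list entries:", len(bonds))
print("average degree:", average_degree(atoms, bonds))
```

Because molecular graphs are sparse, the edge list grows roughly linearly with the number of atoms while the matrix grows quadratically.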
1D Line Notations • Should be human-parseable to facilitate communication without a computer • Nomenclature • IUPAC system: 2-amino-3-phenyl-propanoic acid • Common names: phenylalanine • SMILES: C(C(O)=O)(Cc1ccccc1)N • Widely used, non-standardized • InChI • Recent, IUPAC-supported official standard • Ex. 1/C9H11NO2/c10-8(9(11)12)6-7-4-2-1-3-5-7/h1-5,8H,6,10H2,(H,11,12) 6 http://www.iupac.org/inchi/
IUPAC Nomenclature • IUPAC Standard Naming Conventions • propane • propanoic acid • 3-hydroxy-propanoic acid • 2-amino-3-hydroxy-propanoic acid • Unwieldy standard and inconsistent adoption • “Common” names and abbreviations (Serine) • Systematic bidirectional translation unreliable 7
SMILES Basics • Connection tables as a character string • Atoms: Atomic symbols {C, N, O, S, …} • Bonds: single “-” (implicit), double “=”, triple “#” • Examples • CBr • C=O • C#N • O=CC#N 8 http://www.daylight.com/smiles/
SMILES Basics • Branching: Parentheses • Cycles: Numerical annotations • CCC(O)C • CC(N)(N)O • C1CCCC1 • N12CCCCC1CCCC2 • N#CC(C#N)N1C=CC=C1 • Extensions for • Inorganic atoms, unusual valence, formal charges, stereochemistry, aromaticity, reactions, etc. 9
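To make the "connection table as a character string" idea concrete, here is a minimal sketch of a reader for the subset of SMILES shown on these two slides: unbracketed atoms, the bonds "-", "=", "#", parenthesised branches, and single-digit ring closures. The function name `parse_smiles` is hypothetical. The listed extensions (aromatic lowercase atoms, brackets, charges, stereo marks) are deliberately rejected rather than guessed at.

```python
def parse_smiles(smiles):
    """Parse a small SMILES subset into (atoms, bonds).

    atoms: list of element symbols in string order
    bonds: list of (atom index, atom index, bond order)
    """
    two_letter = ("Cl", "Br")
    one_letter = set("BCNOPSFI")
    order_of = {"-": 1, "=": 2, "#": 3}

    atoms, bonds = [], []
    stack = []        # atoms to return to when a branch closes with ')'
    ring_open = {}    # ring digit -> (atom index, bond order given at opening)
    prev = None       # atom that the next atom will bond to
    pending = None    # explicit bond order waiting for its second atom
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if smiles[i:i + 2] in two_letter:
            symbol, step = smiles[i:i + 2], 2
        elif ch in one_letter:
            symbol, step = ch, 1
        else:
            symbol, step = None, 1

        if symbol is not None:
            atoms.append(symbol)
            if prev is not None:
                bonds.append((prev, len(atoms) - 1, pending or 1))
            prev, pending = len(atoms) - 1, None
        elif ch in order_of:
            pending = order_of[ch]
        elif ch == "(":
            stack.append(prev)            # remember the branch point
        elif ch == ")":
            prev = stack.pop()            # resume from the branch point
        elif ch.isdigit():
            if ch in ring_open:           # second occurrence closes the ring
                start, start_order = ring_open.pop(ch)
                bonds.append((start, prev, pending or start_order or 1))
            else:                         # first occurrence opens the ring
                ring_open[ch] = (prev, pending)
            pending = None
        else:
            raise ValueError(f"unsupported SMILES character: {ch!r}")
        i += step
    if ring_open or stack:
        raise ValueError("unclosed ring or branch")
    return atoms, bonds


for example in ["CBr", "O=CC#N", "CC(N)(N)O", "C1CCCC1", "N12CCCCC1CCCC2"]:
    atoms, bonds = parse_smiles(example)
    print(example, len(atoms), "atoms,", len(bonds), "bonds")
```

A ring closure adds a bond without adding an atom, so the bicyclic example ends up with one more bond than atoms.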
Canonical Representations • Unique representation needed for rapid DB lookup and to check uniqueness • Need to uniquely order the atoms of a molecule • nAtoms! atom orderings possible • Morgan Algorithm • Label nodes by connectivity (heavy-atom degree) • Relax iteratively towards extended connectivity (EC) using neighbor values • Use EC magnitude to decide on atom order • EC “tie-breaking” by atom, bond distinctions 10
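The relaxation step of the Morgan algorithm can be sketched in a few lines. The version below assumes the commonly described stopping rule (stop once another round no longer increases the number of distinct EC values) and leaves out the final tie-breaking by atom and bond type. The example molecule, 2-methylpentane written as CC(C)CCC, is an illustrative choice rather than one from the slides.

```python
def morgan_extended_connectivity(adjacency):
    """Return an extended-connectivity (EC) value for every atom.

    adjacency maps each atom index to the list of its heavy-atom neighbours.
    """
    ec = {atom: len(nbrs) for atom, nbrs in adjacency.items()}   # start from degree
    n_classes = len(set(ec.values()))
    while True:
        # Relax: each atom's new value is the sum of its neighbours' values.
        relaxed = {atom: sum(ec[n] for n in nbrs) for atom, nbrs in adjacency.items()}
        n_relaxed = len(set(relaxed.values()))
        if n_relaxed <= n_classes:      # no extra discrimination gained, keep previous
            return ec
        ec, n_classes = relaxed, n_relaxed


# Heavy-atom graph of CC(C)CCC, atoms numbered in SMILES order.
adjacency = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1, 4], 4: [3, 5], 5: [4]}
ec = morgan_extended_connectivity(adjacency)
order = sorted(ec, key=lambda atom: -ec[atom])   # highest EC first; ties still unresolved
print(ec)
print(order)
```

Atoms 0 and 2 (the two equivalent methyl groups) keep the same EC value, as they should. Atom 4 also lands on that value, which is the kind of remaining tie that the atom and bond distinctions are meant to resolve.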
Stereochemistry / Isomers • Chemical “handedness” • Same connectivity, but not superimposable • Atoms with at least 4 distinct substituents • Double bonds with distinct substituents at each end • Specification by atom / bond labels • e.g. O/C=C/N vs. O/C=C\N • e.g. C[C@H](N)O vs. C[C@@H](N)O 11
3D Atomic Coordinates • 2D graph only specifies connections • 3D spatial coordinates (center, radius, surface) • Largely unavailable • Usually predicted 12
4D Conformers • Molecules are relatively rigid w.r.t. • Bond length • Bond angles • Single bonds are very flexible w.r.t. rotation • More information with a collection of multiple, static 3D conformations 13
Molecular Surfaces • For intermolecular interactions, externally “visible” surface is most important • Representations: Orbitals, VDW Radii, Accessibility, Tessellations 14 http://www.netsci.org/Science/Compchem/feature14e.html
Valence Model Limitations • “Bonds” are non-existent • Model of shared electron orbitals • Difficulty modeling • Aromaticity • Resonance • Tautomers • Etc. 15
Structural Keys • Motivated in part by rapid screening for “functional group” substructures • Pre-compute presence of common / important substructures up front and record in bit-vector • Examples of structural keys • Presence of atoms (C, N, O, S, Cl, Br, etc.) • Ring systems • Functional Groups • Aromatic, Phenol, Alcohol (ROH), Amine (RNH2), Acid (RC(=O)OH), Ester, … 16
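A toy version of a structural key shows how the pre-computed bit-vector supports rapid screening. The key set below (six element keys plus one "has a ring" key) is a hypothetical, much-reduced stand-in for a real key dictionary. The ring test assumes a connected molecule, and functional-group keys would need real substructure matching, which is not attempted here.

```python
KEYS = ["C", "N", "O", "S", "Cl", "Br", "ring"]   # hypothetical, tiny key set


def structural_key(atoms, bonds):
    """Return one bit per entry in KEYS for a connected molecular graph.

    atoms: list of element symbols
    bonds: list of (atom index, atom index) pairs
    """
    present = set(atoms)
    bits = [1 if key in present else 0 for key in KEYS[:-1]]
    n_rings = len(bonds) - len(atoms) + 1          # cycle count of a connected graph
    bits.append(1 if n_rings > 0 else 0)
    return bits


def may_contain(molecule_bits, query_bits):
    """Screen: every bit set in the query must also be set in the molecule."""
    return all(m >= q for m, q in zip(molecule_bits, query_bits))


cyclopentane = structural_key(["C"] * 5, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)])
nitrile = structural_key(["O", "C", "C", "N"], [(0, 1), (1, 2), (2, 3)])
query = structural_key(["C", "N"], [(0, 1)])       # a C-N fragment

print(cyclopentane, may_contain(cyclopentane, query))
print(nitrile, may_contain(nitrile, query))
```

The screen only rules molecules out: a molecule that passes still needs a full substructure match, but one that fails can be skipped without any graph search.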
Generalized Fingerprints • Structural Keys • Generalizes only in proportion to knowledge • Sparsely populated • Good screening filter will have thousands of keys, but each molecule generally sets only a few dozen • Generalized Fingerprints (Spectral Representations) • No pre-defined patterns • Record counts or presence/absence of “substructures” (e.g. labeled paths, trees, etc.) • Fixed-length (binary) vectors • Fast algorithms • Abstract, hard to trace back the meaning of individual bits 21
Systematic Graph Features • For chemical compounds • atom/node labels: A = {C,N,O,H, … } • bond/edge labels: B = {s, d, t, ar, … } • Trace Paths • Depth First Search (CsNsCdO) 22
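Path tracing by depth-first search can be sketched as follows. Paths are written in the slide's style, with atom symbols joined by bond labels (s, d, t, ar), so a path reads like CsNsCdO. The function name, the choice to keep the alphabetically smaller of the two reading directions, and the example graph O=C-C-N are assumptions made for illustration.

```python
def enumerate_paths(atoms, bonds, max_bonds):
    """Collect every labeled path with at most max_bonds bonds.

    atoms: list of element symbols
    bonds: list of (atom index, atom index, bond label) with labels like 's', 'd'
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b, label in bonds:
        neighbours[a].append((b, label))
        neighbours[b].append((a, label))

    found = set()

    def extend(path_atoms, tokens):
        # A path can be read from either end; keep one spelling only.
        forward = "".join(tokens)
        backward = "".join(reversed(tokens))
        found.add(min(forward, backward))
        if len(path_atoms) - 1 == max_bonds:
            return
        for nxt, label in neighbours[path_atoms[-1]]:
            if nxt not in path_atoms:              # a path never revisits an atom
                extend(path_atoms + [nxt], tokens + [label, atoms[nxt]])

    for start in range(len(atoms)):
        extend([start], [atoms[start]])
    return found


# O=C-C-N: one double bond followed by two single bonds.
paths = enumerate_paths(["O", "C", "C", "N"],
                        [(0, 1, "d"), (1, 2, "s"), (2, 3, "s")],
                        max_bonds=3)
print(sorted(paths, key=lambda p: (len(p), p)))
```

Every start atom launches its own search, so each path is found from both ends, and the `min` of the two spellings collapses them into a single feature.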
Fingerprint Flowchart • Graph feature extractor enumerates paths • 0 Bonds: O, C, N • 1 Bond: O=C, C–C, C–N • 2 Bonds: O=C-C, C-C-N • 3 Bonds: O=C-C-N • Hash function turns each path into an integer RNG seed • Random number generator produces several integers per seed • Each integer, modulo FP size, selects a bit to set • [ 0 0 0 0 0 0 0 0 … 0 0 0 0 0 0 0 0 0 0 ] → [ 0 0 1 0 1 0 0 0 … 0 1 1 1 0 1 0 1 0 0 ] 23
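The flowchart's pipeline (feature, hash, RNG seed, several integers, modulo fingerprint size) can be sketched with the standard library. The specific choices here are assumptions for illustration: SHA-256 as the hash, Python's `random.Random` as the generator, a 64-bit fingerprint, and two bits per feature. The feature strings are the paths listed on the slide for O=C-C-N.

```python
import hashlib
import random


def fingerprint(features, fp_size=64, bits_per_feature=2):
    """Fold a collection of feature strings into a fixed-length bit vector."""
    bits = [0] * fp_size
    for feature in features:
        # Hash function: feature string -> integer RNG seed.
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        seed = int.from_bytes(digest[:8], "big")
        # Random number generator: seed -> several integers.
        rng = random.Random(seed)
        for _ in range(bits_per_feature):
            # Modulo FP size: integer -> bit position to switch on.
            bits[rng.getrandbits(32) % fp_size] = 1
    return bits


features = ["O", "C", "N", "O=C", "C-C", "C-N", "O=C-C", "C-C-N", "O=C-C-N"]
fp = fingerprint(features)
print("".join(str(bit) for bit in fp))
print("bits set:", sum(fp))
```

Because different features can land on the same bit, a set bit cannot be mapped back to a single substructure. The compensating property is that a fragment's fingerprint is always a subset of the fingerprint of any molecule containing it.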
Other “Fingerprint” Representations • Derived Representations • Information Compression • Example: Locality-Sensitive Hashing (LSH) • Choose K random lines in high-dimensional space • Project data points • Bin coordinates 24
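The three LSH steps on this slide (choose K random lines, project, bin) translate almost directly into code. This is a minimal random-projection sketch: Gaussian directions, a random offset per line, and a fixed bin width are assumed details, and `make_lsh` is a hypothetical name.

```python
import math
import random


def make_lsh(dim, k, bin_width, seed=0):
    """Build a signature function from k random lines in dim-dimensional space."""
    rng = random.Random(seed)
    lines = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]
    offsets = [rng.uniform(0.0, bin_width) for _ in range(k)]

    def signature(point):
        # Project the point onto each line, then record which bin it falls in.
        return tuple(
            math.floor((sum(w * x for w, x in zip(line, point)) + offset) / bin_width)
            for line, offset in zip(lines, offsets)
        )

    return signature


signature = make_lsh(dim=4, k=3, bin_width=1.0, seed=1)
point = [0.2, 1.0, 0.0, 3.5]
nearby = [0.21, 1.0, 0.0, 3.5]
print(signature(point))
print(signature(nearby))
```

Points that are close in the original space tend to share bins on most lines, so comparing short signatures can stand in for comparing the full vectors, at the cost of occasional misses when a bin boundary falls between two neighbours.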
Summary • Rich set of representations • 1D: SMILES, Fingerprints • 2D: Graph of Bonds • 2.5D: Surfaces • 3D: Coordinates • 3.5D: Conformers • 4D: Isomers, temporal evolution, etc 25
Chemical Informatics • Informatics must be able to deal with variable-size structured data or convert data to “standard” vectorial format • Graphical Models • (Recursive) Neural Networks • ILP • GA • SGs • Kernels 26