3. Chemical Data and Data Bases
Datasets and Databases
• Many small datasets are available
• Several commercial databases of compounds and reactions (e.g. CAS)
• Large but not comprehensive public databases of compounds are just starting to become available
• As of today, there is no large public database of reactions

Data: Small Datasets (examples)
• Mutag (Mutagenicity)
  • About 200 compounds (125 positive / 63 negative), mutagenicity in Salmonella
• PTC (Predictive Toxicity Challenge)
  • A few hundred compounds, carcinogenicity in female mice, male mice, female rats, and male rats (FM, MM, FR, MR)
• NCI (Anti-cancer activity)
  • 70,000 compounds screened for ability to inhibit growth in 60 human tumor cell lines
• Alkanes (Boiling points)
  • All 150 non-cyclic alkanes (CₙH₂ₙ₊₂) with n < 11 and their boiling points (range [-164, 174] °C)
• Benzodiazepines (QSAR)
  • 79 1,4-benzodiazepin-2-ones, affinity for the GABA-A receptor
• Solubility (Delaney and XLogP)
  • 1440 compounds (Delaney); 1991 compounds (XLogP)

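The figure of 150 alkanes can be checked with simple arithmetic: summing the number of structural isomers of CnH2n+2 for n = 1 to 10 gives 150. The sketch below does this in Python. The per-n isomer counts are standard reference values and are not listed on the slide; stereoisomers are assumed to be ignored.

```python
# Number of structural (constitutional) isomers of the non-cyclic alkane
# CnH2n+2 for n = 1..10. These counts are standard reference values, not
# taken from the slide; stereoisomers are not distinguished.
ALKANE_ISOMERS = {1: 1, 2: 1, 3: 1, 4: 2, 5: 3, 6: 5, 7: 9, 8: 18, 9: 35, 10: 75}


def total_alkanes(max_carbons):
    """Total number of distinct non-cyclic alkanes with at most max_carbons carbons."""
    return sum(count for n, count in ALKANE_ISOMERS.items() if n <= max_carbons)


print(total_alkanes(10))  # 150
```

Half of the dataset (75 of 150 compounds) consists of the decane isomers alone, so the data are heavily weighted toward the largest molecules.
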
Large Databases
• Private / Commercial
  • Example: ACS Chemical Registry (CAS) [tens of millions of compounds]
  • Expensive and cannot be “mined”
  • Cambridge Structural DB (CSD) [crystallographic structures, ~350K]
• More recent trends
  • Example: eMolecules (formerly Chmoogle)
  • Free search engine but cannot be “mined”

Large “Public” Databases
• ZINC (UCSF)
• ChemBank (Harvard)
• PubChem (NIH)
• ChemDB (UCI), http://cdb.ics.uci.edu

J. Chen, S. J. Swamidass, Y. Dou, J. Bruand, and P. Baldi. ChemDB: A Public Database of Small Molecules and Related Chemoinformatics Resources. Bioinformatics, 21, 4133-4139 (2005).

Example of Large Public DB: ChemDB
• ~5M unique compounds
• Commercially available compounds
• PostgreSQL/Oracle
• Annotation (experimental, computational)
• Searchable
• Web interface
• Similarity, in silico reactions, …

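To make "annotated and searchable" concrete, here is a minimal sketch of how a relational compound store with annotations might be queried. This is not the ChemDB schema: the table names, column names, vendors, and the helper function are all hypothetical, and an in-memory SQLite database stands in for PostgreSQL/Oracle.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL/Oracle; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE compound (
    id     INTEGER PRIMARY KEY,
    smiles TEXT UNIQUE NOT NULL,   -- one row per unique structure
    vendor TEXT                    -- commercial source (illustrative)
);
CREATE TABLE annotation (
    compound_id INTEGER REFERENCES compound(id),
    name        TEXT NOT NULL,     -- e.g. 'mol_weight'
    value       REAL NOT NULL,
    source      TEXT CHECK (source IN ('experimental', 'computational'))
);
""")

conn.executemany("INSERT INTO compound VALUES (?, ?, ?)", [
    (1, "CCO", "vendor_x"),        # ethanol
    (2, "c1ccccc1", "vendor_y"),   # benzene
])
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?)", [
    (1, "mol_weight", 46.07, "computational"),
    (2, "mol_weight", 78.11, "computational"),
])


def search_by_annotation(name, low, high):
    """Return SMILES of compounds whose annotation `name` lies in [low, high]."""
    rows = conn.execute(
        "SELECT c.smiles FROM compound c "
        "JOIN annotation a ON a.compound_id = c.id "
        "WHERE a.name = ? AND a.value BETWEEN ? AND ? ORDER BY c.id",
        (name, low, high),
    )
    return [smiles for (smiles,) in rows]


print(search_by_annotation("mol_weight", 40, 50))
```

Keeping annotations in a separate table lets experimental and computational values coexist for the same compound without changing the compound table.
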
[Diagram; labels: R, M, ChemDB, RChemDB, Filters, Experiments, NM]

Chemo/Bio Informatics: Two Key Ingredients
1. Data
2. Similarity Measures

Bioinformatics analogy and differences:
• Data (GenBank, Swissprot, PDB)
• Similarity (BLAST)
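
For small molecules, one widely used similarity measure is the Tanimoto coefficient on binary fingerprints. The slides do not say which measure is meant here, so the following is only an illustrative sketch; the fingerprints and molecule names are made up.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty fingerprints are identical
    return len(a & b) / len(a | b)


# Hypothetical fingerprints: each set holds indices of structural features present.
query = {1, 4, 7, 9}
database = {
    "mol_A": {1, 4, 7, 9},
    "mol_B": {1, 4, 8},
    "mol_C": {2, 3},
}

# Rank database entries by similarity to the query, most similar first.
ranked = sorted(database, key=lambda name: tanimoto(query, database[name]), reverse=True)
for name in ranked:
    print(name, round(tanimoto(query, database[name]), 3))
```

Ranking a database by similarity to a query molecule plays roughly the role that a BLAST search plays for sequences.
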