Benchmarking JChem Oracle and Instant-JChem (and more). Tobias Kind FiehnLab at UC Davis Genome Center November 2006. Free Academic Licenses for JChem and Instant JChem provided by. ChemAxon product suite. Source: Chemaxon.com. We have free academic licenses for all products.
Metabolomics @ Fiehnlab: the science of the small molecules
• Compound classes: sugars, amino acids, steroids, fatty acids, lipids, phospholipids, organic acids ...
• Molecules under investigation (shown with ChemAxon Marvin)
• 3D model of a molecule with surface plot (shown with ChemAxon MarvinSpace)
• Visit us @ fiehnlab.ucdavis.edu
Metabolomics is a truly emerging science ... it tries to identify all small molecules (< 2000 Da) in all life forms in a comprehensive manner. Life Science Tree: Genomics (DNA), Transcriptomics (RNA), Proteomics (Proteins), Metabolomics (Small Molecules)
Techniques and tools
• Analytical techniques (LC-MS, GC-MS, NMR, IR)
• Bioinformatics, cheminformatics
Shown: LTQ-FT-MS; gas chromatography (GC-TOF-MS); liquid chromatography (LC-MS); bioinformatics, cheminformatics, and statistics (Statistica Dataminer); open source + commercial software
We use cheminformatics tools for mass spectrometry based structure elucidation. See our BMC Bioinformatics paper: "Metabolomic database annotations via query of elemental compositions: Mass accuracy is insufficient even at less than 1 ppm"; http://www.biomedcentral.com/1471-2105/7/234
What are JChem and Instant-JChem?
JChem and Instant-JChem are cheminformatics tools for handling small molecule structures together with substance data (logP, fingerprint, pKa, toxicity, meta-information), plus searches, filters, web connections, and more.
Difference: JChem is a complex package; Instant-JChem is one single tool. (Pictures: ChemAxon; JChem and Instant-JChem)
Benchmarking Instant-JChem and JChem Oracle (and more)
• Myth 1: JChem+Oracle is faster than Instant-JChem+Apache Derby. Reality: let's see...
• Myth 2: JAVA is slow. Reality: It's fast (70% of C++).
• Myth 3: Old Intel Xeons (Netburst) are slow. Reality: Yes.
• Myth 4: Oracle is a hassle-free and handsome DB for beginners. Reality:
• Myth 6: 2 CPUs are better than one. Reality: Yes.
• Myth 7: Comparing apples with oranges (in Germany: pears) is unfair. C'mon...
Only the first myth is left.
A bit of Oracle Reality
(Pictures: a 1st time Oracle user; a happy Oracle Ace who paid 10K for the certificate)
• Oracle works; lots of people invested lots of money (ORCL market cap = 92 billion dollars)
• It's good for large data (TByte); it's overkill for a small DB.
• If you plan to install it on your production workstation (a big no-no):
  • It will eat 600-800 MB of your valuable RAM (for nothing, on WINXP 32-bit)
  • It will create 15,049 files in 2,029 folders (for what?)
  • It will create a lot of hassle with certain network setups (DHCP)
• RTFM (read the … manual) is no joke, and you need to learn SQL (try the free Aqua Data Studio)
• Complete learning will take you 1-2 years, but gives you extreme flexibility
• If you plan to install JChem + Oracle you need:
  • JChem (includes the cartridge for Oracle)
  • Oracle
  • Apache Tomcat
  • 1-2 days of time (the ChemAxon documentation is good, but too many things can go wrong with Oracle)
A bit of Instant JChem Reality v1.0
A) Download http://www.chemaxon.com/instantjchem/
B) Install
C) It runs instantly
• built-in Apache Derby DB
• JAVA engine included
• complete JChem included
• out-of-the-box tool
• can connect to other DBs
Importing Structures into Instant JChem
During import into Instant JChem only one CPU works. The fingerprint calculation is probably not multi-threaded. (Solution: a work pool, i.e. a pool of workers for n CPUs.) A short import time is critical for user convenience, but not for long-term database projects.
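The "work pool for n CPUs" idea can be sketched in plain Java as follows. This is only an illustration under stated assumptions: `ImportWorkPool` and `placeholderFingerprint` are hypothetical names, the placeholder is a simple hash and not the JChem fingerprint, and nothing here reflects how Instant JChem is actually implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ImportWorkPool {
    // Hypothetical stand-in for the per-structure work; NOT the JChem fingerprint.
    static int placeholderFingerprint(String smiles) {
        return smiles.hashCode();
    }

    // Runs the per-structure work on a fixed pool with one worker per CPU core.
    static List<Integer> processAll(List<String> smilesList) {
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (String smiles : smilesList) {
                tasks.add(() -> placeholderFingerprint(smiles));
            }
            List<Integer> results = new ArrayList<>();
            // invokeAll returns the futures in the same order as the tasks.
            for (Future<Integer> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("import worker failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> demo = Arrays.asList("CCO", "c1ccccc1", "NC1=CC=NC2=C1C=CC(Cl)=C2");
        System.out.println(processAll(demo));
    }
}
```

With such a pool the per-structure work is spread over all cores, while the results still come back in input order.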
Importing Structures into Instant-JChem: influence of the JAVA HotSpot compiler
The JAVA VM runs in two modes: with the client compiler or with the server compiler (directories under JRE). If you run any calculation-intensive program, always use server mode; in a batch file call java -server XYZ. Server: good and fast. Client: bad and slow.
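To see which mode a program is actually running in, one option is to read the standard `java.vm.name` system property. A minimal sketch follows; the class name `JvmModeCheck` is hypothetical, and the assumption that the name contains "Server" or "Client" holds for HotSpot VMs but may not for other JVMs.

```java
class JvmModeCheck {
    // Name of the running VM as reported by the standard system property.
    static String vmName() {
        return System.getProperty("java.vm.name");
    }

    // Assumption: HotSpot VM names contain "Server" when the server compiler is active.
    static boolean looksLikeServerVm(String name) {
        return name != null && name.contains("Server");
    }

    public static void main(String[] args) {
        String name = vmName();
        System.out.println(name + " -> server mode: " + looksLikeServerVm(name));
    }
}
```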
Influence of the JAVA HotSpot compiler: importing structures into Instant-JChem
Import of 250k structures (NCI99.smi) into Instant-JChem: the server JVM is 20% faster! (Chart: import time, lower is better)
Test system: Dual Opteron 254 (2.8 GHz); WINXP 32-bit; 2.88 GByte RAM (10 GByte/s transfer rate); ARECA-1120 RAID5 (read/write 200 MByte/s and burst rate 500 MByte/s); QSOFT Ramdisk Enterprise 1.2 GByte (read/write 1 GByte/s transfer)
Influence of the JAVA HotSpot compiler with Instant-JChem
Task: Search for a substructure in a 3 million compound database and calculate the Lipinski Rule of 5 on all the 4632 results.
• JAVA server mode: 15 seconds (30% faster)
• JAVA client mode: 21 seconds
SMILES: NC1=CC=NC2=C1C=CC(Cl)=C2
If you want to speed up this query, you need to pre-calculate all descriptors and include them in the database.
Lipinski's Rule of Five (http://en.wikipedia.org/wiki/Lipinski's_Rule_of_Five): (mass() <= 500) && (logP() <= 5) && (donorCount() <= 5) && (acceptorCount() <= 10) (acceptor count for C and H)
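The filter expression on this slide translates directly into a plain Java predicate. The sketch below assumes the four descriptor values have already been calculated elsewhere; `LipinskiFilter` is a hypothetical name, no ChemAxon API is called, and the numbers in `main` are illustrative inputs only.

```java
class LipinskiFilter {
    // Same four conditions as the slide's expression, applied to precomputed values.
    static boolean passes(double mass, double logP, int donorCount, int acceptorCount) {
        return mass <= 500.0
                && logP <= 5.0
                && donorCount <= 5
                && acceptorCount <= 10;
    }

    public static void main(String[] args) {
        // Illustrative descriptor values, not taken from the benchmark.
        System.out.println(passes(180.2, 1.2, 1, 4));
        System.out.println(passes(720.0, 6.3, 2, 12));
    }
}
```

If the four values are stored in the database beforehand, a check like this is all that remains at query time.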
Influence of number of CPUs with Instant-JChem
Task: Search for a substructure in a 3 million compound database and calculate the Lipinski Rule of 5 on all the 4632 results.
Doing the Lipinski calculation utilizes both CPU cores! Try Intel Quad! Try Opteron 8x!
Test system: Dual Opteron 254 (2.8 GHz); WINXP 32-bit; 2.88 GByte RAM (10 GByte/s transfer rate); ARECA-1120 RAID5 (read/write 200 MByte/s and burst rate 500 MByte/s); QSOFT Ramdisk Enterprise 1.2 GByte (read/write 1000 MByte/s transfer)
Influence of number of CPUs with Instant-JChem
Task: Search for a substructure in a 3 million compound database and calculate the Lipinski Rule of 5 on all the 4632 results (on the fly).
Doing the Lipinski calculation utilizes multiple CPU cores! However, a single logP calculation depends on CPU speed, not on the number of CPU cores. Use AMD Opteron 8x CPU systems (or better). For cheaper setups use Intel Core 2 Quad (QX6700).
Test system*: Dual Opteron 254 (2.8 GHz); WINXP 32-bit; 2.88 GByte RAM (10 GByte/s transfer rate)
Test system**: 4 x Dual-Core Opteron 870 2.0 GHz; CentOS 64-bit; 32 GByte RAM; 3.5 GB set for JAVA heap space
Influence of number of CPUs on complex calculations with Instant-JChem
Task: Search in 1000 compounds from PubChem-1000-demo and calculate on the fly: [table of calculations not reproduced]
Take-home message: The more complex the request, the more CPUs you need. The lead likeness has 7 filters and reaches a 5-8 times speed-up with more CPUs.
Test system*: Dual Opteron 254 (2.8 GHz); WINXP 32-bit; 2.88 GByte RAM (10 GByte/s transfer rate)
Test system**: 4 x Dual-Core Opteron 870 2 GHz; CentOS 64-bit; 32 GByte RAM; 3.5 GB set for JAVA heap space
Scaling complex calculations to larger DBs with Instant-JChem
Task: Now search in 250,000 compounds from NCI2000 and calculate on the fly: [table of calculations not reproduced]
Take-home message: Do not extrapolate calculation times from different or smaller DBs. The speed-ups here are 2-35 times larger than expected. Pre-calculate values once, store them in the DB, and query the values later.
Test system**: 4 x Dual-Core Opteron 870 2 GHz; CentOS 64-bit; 32 GByte RAM; 3.5 GB max set for JAVA heap space; 1.5 GByte JAVA heap space used.
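The "pre-calculate once, query later" advice can be illustrated with a small in-memory sketch. Everything here is an assumption for illustration: `DescriptorCache` is a hypothetical class, a `HashMap` stands in for the database column, and the calculator passed in is a placeholder rather than a real descriptor such as logP.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToDoubleFunction;

class DescriptorCache {
    // Stands in for a stored database column: structure -> descriptor value.
    private final Map<String, Double> stored = new HashMap<>();
    private final ToDoubleFunction<String> calculator;
    int calculations = 0; // counts how often the expensive calculation ran

    DescriptorCache(ToDoubleFunction<String> calculator) {
        this.calculator = calculator;
    }

    // Expensive step, done once per distinct structure.
    void precalculate(Iterable<String> structures) {
        for (String s : structures) {
            if (!stored.containsKey(s)) {
                stored.put(s, calculator.applyAsDouble(s));
                calculations++;
            }
        }
    }

    // Cheap step at query time: a lookup only; returns null if not pre-calculated.
    Double query(String structure) {
        return stored.get(structure);
    }

    public static void main(String[] args) {
        // Placeholder calculator (string length), NOT a real descriptor.
        DescriptorCache cache = new DescriptorCache(s -> s.length());
        cache.precalculate(Arrays.asList("CCO", "CCCC", "CCO"));
        System.out.println(cache.query("CCCC") + " after " + cache.calculations + " calculations");
    }
}
```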
Derby database file sizes for Instant-JChem+Apache Derby (compounds only)
• 100k structures: ~30 MByte
• 1 Mio structures: ~300 MByte
• 10 Mio structures: ~3 GByte
• 20 Mio structures: ~6 GByte
If you have dual or quad cores, turn drive compression on. You can save almost 50% space, and the speed overhead is low.
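The sizes listed on this slide scale linearly at roughly 30 MByte per 100k structures. A rough estimator under that assumption might look like this; `DerbySizeEstimate` is a hypothetical helper, and real file sizes will depend on the data and on any stored extra fields.

```java
class DerbySizeEstimate {
    // From the slide: about 30 MByte per 100,000 structures (compounds only).
    static final long MBYTE_PER_100K = 30;

    // Linear estimate of the Derby file size in MByte.
    static long estimateMByte(long structures) {
        return structures * MBYTE_PER_100K / 100_000;
    }

    public static void main(String[] args) {
        System.out.println(estimateMByte(3_000_000) + " MByte for 3 Mio structures");
    }
}
```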
Instant-JChem on disk-based and RAMDISK-based systems
People who said the OS has efficient disk caching lied. A large RAMDISK can speed up your system extremely.
A) If you have money: buy a solid state disk. RAMSAN-400; 128 GByte; price $252,720; 3,000 MB/s random sustained external throughput.
B) If you have some money: buy a RAID5 card. ARECA ARC-1120 for 8 HDs; price $500; 200-400 MB/s read and write access.
C) If you have little money: buy a RAMDISK and stuff in as much RAM as possible (take a 64-bit OS); 500-1000 MB/s read and write access.
...a normal hard drive has a ~30-50 MB/s transfer rate
Instant-JChem on disk-based and RAMDISK-based systems
A) Heap memory max 800 MByte (OK)
• Load 3 Mio compound DB from RAMDISK: 2 seconds
• Load 3 Mio compound DB from RAID5 disk: 11 seconds (factor 5)
• Search substructure from RAMDISK DB: instant (memory buffered)
• Search substructure from RAID5 DB: instant (memory buffered)
B) Heap memory max 200 MByte (too low)
• Load 3 Mio compound DB from RAMDISK: 19 seconds
• Load 3 Mio compound DB from RAID5 disk: 25 seconds (factor 1.3)
• Search substructure from RAMDISK DB: 22 seconds
• Search substructure from RAID5 DB: 38 seconds (factor 1.7)
Without enough heap memory, performance degrades: everything must be read from disk. My RAID5 is already extremely fast, still the RAMDISK is even faster.
Take-home message: Give JAVA (JChem) as much heap memory as you can. For 3 million structures you need a minimum of 300 MByte heap space.
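A program can check how much heap the JVM was actually given with `Runtime.getRuntime().maxMemory()`. The sketch below compares that value against the slide's figure of 300 MByte for 3 million structures; scaling that figure linearly to other database sizes is an assumption of this example, and `HeapCheck` is a hypothetical name.

```java
class HeapCheck {
    // Maximum heap the running JVM may use, in MByte.
    static long maxHeapMByte() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    // Slide figure: 300 MByte minimum for 3 Mio structures.
    // Assumption: the requirement scales linearly with the number of structures.
    static long minimumHeapMByte(long structures) {
        return structures * 300 / 3_000_000;
    }

    static boolean heapIsSufficient(long structures) {
        return maxHeapMByte() >= minimumHeapMByte(structures);
    }

    public static void main(String[] args) {
        System.out.println("max heap: " + maxHeapMByte() + " MByte");
        System.out.println("enough for 3 Mio structures: " + heapIsSufficient(3_000_000));
    }
}
```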
JChem+Oracle DB on Xeon vs. Instant-JChem+Apache Derby DB on Opteron (apples vs. oranges)
Task: Import and index 3 million compounds (NCI2000 duplicated to 3 Mio)
• 3 GHz Dual Xeon with 2 GB system memory, JChem+Oracle DB: 5801 seconds (about 97 minutes)
• 2.8 GHz Dual Opteron with 2.88 GB memory, Instant-JChem+Apache Derby: 5333 seconds (about 89 minutes)
Take-home message: A (modestly) modern computer can handle JChem and Instant-JChem, and a local database can be faster than a remote database.
Source of Xeon data: Oracle Cartridge Benchmark http://www.chemaxon.com/jchem/FAQ.html#benchmark3
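The timing comparisons in these slides come down to simple arithmetic, sketched here with the numbers quoted above (5801 s vs. 5333 s for the import, 21 s vs. 15 s for client vs. server JVM). `TimingMath` and its method names are hypothetical helpers, not part of any benchmark tool.

```java
class TimingMath {
    static double minutes(long seconds) {
        return seconds / 60.0;
    }

    // Percentage of run time saved by the faster run relative to the slower one.
    static double percentLessTime(double slowSeconds, double fastSeconds) {
        return (slowSeconds - fastSeconds) / slowSeconds * 100.0;
    }

    // How many times faster the fast run is.
    static double speedupFactor(double slowSeconds, double fastSeconds) {
        return slowSeconds / fastSeconds;
    }

    public static void main(String[] args) {
        System.out.printf("Oracle import: %.1f min, Derby import: %.1f min%n",
                minutes(5801), minutes(5333));
        System.out.printf("Derby import needs %.1f%% less time%n",
                percentLessTime(5801, 5333));
        System.out.printf("Server JVM needs %.1f%% less time (factor %.2f)%n",
                percentLessTime(21, 15), speedupFactor(21, 15));
    }
}
```

Note that "percent less time" and "times faster" are different measures: 21 s down to 15 s is about 29% less time but a factor of 1.4 in throughput.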
Instant-JChem+Apache Derby DB on Socrates* vs. Instant-JChem+Apache Derby DB on Dual Opteron 2.8 GHz (WIN-XP)** vs. JChem+Oracle DB on Dual Xeon 3 GHz (W2003 Server)*** (more apples vs. oranges)
Task: Search for a substructure in a 3 million compound database (NCI2000x12)
Take-home message: Instant-JChem is fast (nothing more).
Source: Instant-JChem (own system), JChem (ChemAxon website)
Socrates*: 4x Dual Opteron 870 2 GHz; CentOS 64-bit; 32 GByte RAM; 4 GB set for JAVA
Opteron**: Dual Opteron 254 (2.8 GHz); WINXP 32-bit; 2.88 GByte RAM (10 GByte/s transfer); ARECA-1120 RAID5 (read/write 200 MByte/s and burst rate 500 MByte/s); QSOFT Ramdisk Enterprise 1.2 GByte (read/write 1000 MByte/s transfer)
Xeon***: Dual Intel Xeon 3 GHz; 2 GB memory; 160 GB IDE hard drive; Windows 2003 5.2; Oracle 9.2.0.7.0, DB buffer 1 GB; JAVA 1.5.0_06-b05; Apache Tomcat/5.5.12
A 20 million compound DB with Instant-JChem in a local Derby DB (WinXP 32-bit)
• Import is heavily disk dependent
• several hundred million read/write operations to disk (JAVA writes in 4 KB chunks)
• JAVA heap space used during import is around 600 MByte
• import time is not linear anymore
• WIN XP 32-bit + NTFS desperately try to cache the 6 GByte database file, even though only 3 GByte memory maximum is available (1 GByte max for cache)
• index creation (import SMILES): 20 h (too long)
• open index for search: 1 min
• substructure search: > 1 min (too long)
• 20 Mio is currently too large for Instant-JChem v1.0; use JChem+Oracle (or MySQL, MS SQL)
• Aim: full PubChem data (15-20 Mio) locally
Some general JAVA + JChem speed advice
• Always use the server JVM (check the directories bin\client and bin\server); check the batch or sh file options for java -server xyz xyz.jar
• Use 64-bit systems; the JAVA maximum heap space for LINUX or WIN as a 32-bit system is only 1.6 GByte (-Xmx1600m)
• Use only multicore machines (AMD Opterons, Intel Quad)
• Use the fastest disks you can buy (WD Raptor) or use RAID5 or RAID6 for large files (PubChem SDF data for 5 Mio compounds = 30 GByte)
• Give Instant-JChem as much memory as you have, minimum 500 MByte, for extreme speed (no wait time for searches)
Let’s not forget competitors
Many good systems exist: MDL (ISIS Base), ACDLabs (ACD/ChemFolder Enterprise), Tripos (Sybyl+Auspyx), Molecular Networks (Carol), CDK and Taverna, Accelrys (Accord), Daylight (Thor and Merlin), CambridgeSoft (ChemOffice Enterprise), Molsoft (ICM+MolCart)
Why is ChemAxon better? Two reasons:
• The programs work under WINDOWS and LINUX
• ChemAxon has the best and most responsive public forum: criticism is taken seriously, requested features are implemented ASAP, and there is a public response within 1-3 days. Why? Many commercial licensees.
• Remember, for academics it is all free.
Results and conclusion: JChem Oracle vs. Instant-JChem
• Instant-JChem+Derby is as fast as or faster than JChem+Oracle for DBs < 3 Mio
• If you want to have fun and results at your fingertips: Instant-JChem
• If you want extreme flexibility and you know JAVA+SQL: JChem-Oracle
• We are far away from handling billions of structures in a DB (with modest effort). We will handle such large numbers of structures file-stream based with cluster support.
• Software producers (in general) need to put more effort into software development for multi-core CPUs + clusters under Windows and LINUX.