What’s new in JKlustor. Miklós Vargyas. UGM 2006. Overview. An introduction to JKlustor Brief history of the product Main features Usage examples Performance LibMCS, an alternative approach to clustering chemical structures Concepts, motivation Features Performance
What’s new in JKlustor Miklós Vargyas UGM 2006
Overview • An introduction to JKlustor • Brief history of the product • Main features • Usage examples • Performance • LibMCS, an alternative approach to clustering chemical structures • Concepts, motivation • Features • Performance • Future of JKlustor
Brief history of JKlustor • First discovery tool in the JChem package • Jarp released in version 1.5.2 (March 22, 2001) • Compr 1.5.7 (May 27, 2001) • Ward 1.5.9 (Jun 25, 2001) • API released in JChem 1.6.2 (May 16, 2002) • Experimental LibMCS first released in JChem 3.0 (Dec 1, 2004) • New JKlustor GUI to be released in JChem 3.?
JKlustor features • Similarity based clustering • ChemAxon’s topological fingerprint • External data points, arbitrary dimension • Tanimoto, weighted Euclidean • Hierarchical clustering: Ward • Reciprocal nearest neighbor algorithm • Kelley method • Non-hierarchical clustering: Jarvis-Patrick • Diversity calculation: Compr • Structure based clustering: LibMCS
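The two metrics named on this slide can be sketched from their textbook definitions. This is only an illustration, not JKlustor's implementation: representing a fingerprint as a set of on-bit positions, and the function names `tanimoto` and `weighted_euclidean`, are assumptions made for the example.

```python
import math


def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    if not fp_a and not fp_b:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 1.0
    common = len(fp_a & fp_b)
    return common / (len(fp_a) + len(fp_b) - common)


def weighted_euclidean(x, y, weights) -> float:
    """Euclidean distance where each dimension has its own non-negative weight."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))


# Two fingerprints sharing 2 of 4 distinct on-bits.
print(tanimoto({1, 2, 3}, {2, 3, 4}))
# A 3-4-5 triangle with unit weights.
print(weighted_euclidean([0, 0], [3, 4], [1, 1]))
```

Tanimoto is a similarity (1 means identical bit sets), while the weighted Euclidean value is a distance (0 means identical points), so a clustering threshold has to be read in the right direction for each.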
JKlustor usage • Command line tools • Pipelining commands • Option flags • Structure file/database input • Manual creation of cluster views Input SDFile GenerateMD NNeib JarvisPatrick CreateView MarvinView Picture
JKlustor usage
• Prepare data and run clustering
    generatemd c input.sdf -k CF -c cfp.xml -D -o fingerprints.txt
    nneib -f 512 -t 0.1 -g -i fingerprints.txt -o neighborlists.txt
    jarp -c 0.2 -y -i neighborlists.txt -o clusters.txt
• View first cluster
    crview -i id -c "clid=1" -s input.sdf -t clusters.txt -o jarp_cluster1.sdf
    mview -c 3 -r 3 jarp_cluster1.sdf
• View centroids, display cluster id and size
    crview -i "centr:2" -c "size>=20" -d "clid:size" -s input.sdf -t clusters.txt -o jarp_centroids.sdf
    mview -c 3 -r 3 -f "clid:size" jarp_centroids.sdf
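The neighbor-list and clustering stages of this pipeline follow the general Jarvis-Patrick idea, which can be sketched as below. This is the textbook form of the algorithm, not the code behind `nneib` or `jarp`, and it makes no claim about what their option flags mean. The parameter names `k` and `min_common`, and the use of a plain distance matrix as input, are assumptions for illustration.

```python
def nearest_neighbors(dist, k):
    """For each item, return the set of its k nearest other items.

    dist is a square matrix (list of lists) of pairwise distances.
    """
    n = len(dist)
    result = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        result.append(set(others[:k]))
    return result


def jarvis_patrick(neighbors, min_common):
    """Merge items that list each other as neighbors and share enough neighbors."""
    n = len(neighbors)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in neighbors[i]:
            mutual = i < j and i in neighbors[j]
            if mutual and len(neighbors[i] & neighbors[j]) >= min_common:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Six points on a line forming two well-separated groups.
points = [0, 1, 2, 50, 51, 52]
dist = [[abs(p - q) for q in points] for p in points]
print(jarvis_patrick(nearest_neighbors(dist, k=2), min_common=1))
```

Because clusters are built from neighbor lists rather than from the full distance matrix, the clustering step only needs the output of the neighbor search, which mirrors the separation of the two command-line stages.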
JKlustor performance • Memory: O(n) • Time: Jarvis-Patrick O(n^1.5), Ward O(n^2)
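Reading the time complexities on this slide as n to the power 1.5 for Jarvis-Patrick and n to the power 2 for Ward, the practical difference can be shown with a little arithmetic. The sketch below only compares growth factors. It says nothing about absolute run times, and the helper name `growth_factor` is made up for the example.

```python
def growth_factor(size_ratio: float, exponent: float) -> float:
    """How much longer a run takes when the input grows by size_ratio,
    assuming run time is proportional to n ** exponent."""
    return size_ratio ** exponent


for ratio in (2, 10):
    jp = growth_factor(ratio, 1.5)   # Jarvis-Patrick, exponent 1.5
    ward = growth_factor(ratio, 2)   # Ward, exponent 2
    print(f"{ratio}x more structures: Jarvis-Patrick x{jp:.1f}, Ward x{ward:.1f}")
```

Doubling the input costs roughly 2.8 times as long for Jarvis-Patrick against 4 times for Ward, and the gap widens as the library grows.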
What is MCS? • The Maximum Common Substructure of two chemical structures
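To make the definition concrete, here is a minimal brute-force sketch that finds a largest connected common substructure of two tiny molecules. It is not the LibMCS algorithm: it matches atoms by element symbol only, ignores bond orders, treats the substructure as an induced subgraph, and takes exponential time, so it is suitable only for toy inputs. The molecule representation (a list of element symbols plus a list of bonded index pairs) and the name `max_common_substructure` are assumptions for illustration.

```python
def max_common_substructure(atoms_a, bonds_a, atoms_b, bonds_b):
    """Return a largest atom mapping {index in A: index in B} that preserves
    element symbols and bonding, and keeps the matched part connected."""
    adj_a = {frozenset(bond) for bond in bonds_a}
    adj_b = {frozenset(bond) for bond in bonds_b}
    best = {}

    def bonded(adj, u, v):
        return frozenset((u, v)) in adj

    def extend(mapping):
        nonlocal best
        if len(mapping) > len(best):
            best = dict(mapping)
        used_b = set(mapping.values())
        for a in range(len(atoms_a)):
            if a in mapping:
                continue
            # Grow only from atoms bonded to the current match (connectivity).
            if mapping and not any(bonded(adj_a, a, m) for m in mapping):
                continue
            for b in range(len(atoms_b)):
                if b in used_b or atoms_a[a] != atoms_b[b]:
                    continue
                # Bonds to every already matched atom must agree in both molecules.
                if all(bonded(adj_a, a, m) == bonded(adj_b, b, mapping[m])
                       for m in mapping):
                    mapping[a] = b
                    extend(mapping)
                    del mapping[a]

    extend({})
    return best


# Ethanol (C-C-O) against 1-propanol (C-C-C-O), hydrogens omitted.
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
propanol = (["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
mapping = max_common_substructure(*ethanol, *propanol)
print(len(mapping), mapping)
```

In this example the whole of ethanol maps onto the C-C-O end of 1-propanol, so the common substructure has three heavy atoms.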
Clustering by MCS? • Find the MCS of a group of structures
Very brief history of LibMCS • Reaction automapper, based on Maximum Common Subgraph Search • MCS class API made public • Customer requested MCS based clustering • More intuitive than similarity based • Focused set analysis • screens: 2000 – 10000 structures • lead optimization: 3000 – 5000 structures • Should be hierarchical (outliers) • Ultimate goal: cluster 5000 compounds in 5 seconds
LibMCS features • MCS based hierarchical clustering • Flexible search options • Hierarchy browser • Filtering by chemical properties • Cluster statistics • No size limitation • Fast operation
LibMCS – Output files
CCCN1CC(=O)SCC(C)C1=O CC1CSC(=O)CN(C2CCCC2)C1=O 0 21 0
CCCN1CC(=O)SCC(C)C1=O CC1CSC(=O)C2CCCN2C1=O 0 21 0
OC(=O)C1CCCN1C(=O)CCS CC(CS)C(=O)N1CCCC1C(O)=O 0 19 0
OC(=O)C1CCCN1C(=O)CCS [H]C1(CCCN1C(=O)CCS)C(O)=O 0 19 0
OC(=O)C1CCCN1C(=O)CCS OC(=O)C1CCCN1C(=O)C2CCCC2SC(=O)C3=CC=CC=C3 0 19 0
OC(=O)C1CCCN1C(=O)CCS OC(=O)C1CCCN1C(=O)C2CCCCC2S 0 19 0
CCC(=O)N(CC1=CC=CC=C1)C(C)C=O CC1SC(=O)C2(C)CC3=CC=CC=C3CN2C1=O 0 20 0
CCC(=O)N(CC1=CC=CC=C1)C(C)C=O CC1CSC(=O)C2CC3=C(CN2C1=O)C=CC=C3 0 20 0
CC1SC(=O)C2CCCN2C1=O CC1SC(=O)C2CCCN2C1=O 0 30 0
CC1SC(=O)CNC1=O CC1SC(=O)CNC1=O 0 29 0
OC(=O)C1CSCCCCCCCCC(CS)C(=O)N1 OC(=O)C1CSCCCCCCCCC(CS)C(=O)N1 0 31 0
CC(S)C(=O)NCC(O)=O CC(S)C(=O)NCC(O)=O 0 24 0
CCC1=CC=CC=C1 CC(NC(CCC1=CC=CC=C1)C(O)=O)C(=O)N2CCCC2C(O)=O 0 22 0
CCC1=CC=CC=C1 CCOC(=O)C(CC1=CC=CC=C1)NC(=O)NC(CC2=CC=CC=C2)C(=O)OCC 0 22 0
OC(=O)C1CCCN1C(=O)NC2=CC=CC=C2 OC(=O)C1CCCN1C(=O)NC2=CC=CC=C2 0 23 0
C\C(Cl)=N/OC(N)=O C\C(Cl)=N/OC(N)=O 0 27

> <Cluster_ID>
1163
> <Element_count>
1
> <Parent_ID>
1
$$$$
Marvin 05290619172D
23 24 0 0 0 0 999 V2000
2.4230 -0.3587 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3.1375 0.0538 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.1375 0.8788 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-0.4349 -1.1837 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-1.1494 -1.5962 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.8638 -1.1837 0.0000 N 0 0 3 0 0 0 0 0 0 0 0 0
LibMCS – Performance • Depends on • average structure size • total diversity • minimal required MCS size • atom/bond constraints • Scales linearly • Maximum speed achieved • 1 000 structures in 3 seconds • Memory requirements • 100 000 structures occupy 200MB
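The figures on this slide can be turned into rough per-structure numbers. The sketch below is back-of-the-envelope arithmetic only: it assumes the quoted linear scaling holds across the whole range, that the quoted best-case rate is representative, and that 1 MB means 10^6 bytes. The helper names are hypothetical.

```python
def bytes_per_structure(total_bytes: float, structures: int) -> float:
    """Average memory footprint per structure."""
    return total_bytes / structures


def projected_seconds(structures: int, ref_structures: int, ref_seconds: float) -> float:
    """Linear extrapolation of run time from one reference measurement."""
    return ref_seconds * structures / ref_structures


# 100 000 structures occupy 200 MB (taking 1 MB = 10**6 bytes).
print(bytes_per_structure(200 * 10**6, 100_000))   # bytes per structure

# 1 000 structures in 3 seconds, extrapolated linearly to 5 000 structures.
print(projected_seconds(5_000, 1_000, 3.0))        # seconds
```

Under these assumptions each structure costs about 2 kB, and 5 000 structures would take about 15 seconds at the quoted rate.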
LibMCS – Further applications • Find the MCS of existing clusters • Data retrieval • Assay analysis • Compound acquisition • Combinatorial library profiling
Development plans • Disconnected MCS • Multi-group clustering • More chemical sense (e.g. avoid opening rings, consider chirality) • Performance tuning (e.g. NN) • Integrate Ward/Jarp into new GUI • Additive clustering • Clustering million compound libraries • Integrate Chemical Terms • Integrate molecular descriptors, optimized metrics
Summary • New tool in JKlustor based on MCS • More plausible grouping • Hierarchical with dendrogram browser • Statistics • Filtering, coloring, selection
Acknowledgements • Developers • Ferenc Csizmadia, Árpád Tamási, András Volford, Szilárd Doránt • Péter Vadász, Nóra Máté • Special thanks