SLRITools Project: Providing a Platform for Bioinformatics Research • Michel Dumontier • Bioinformatics Technology Conference • February 3 - 6, 2003
SLRITools Outline • Introduction • Open-source • Toolkit Foundation • Toolkit Projects • Future Prospects
Christopher W.V. Hogue Lab: An Engineering Approach Towards Cellular Simulation [Diagram labels: GRID Computing Layer; Cell Geometry; Whole Cell Visualization; Modular Cell Simulation Software Layer; Data Access Layer; Molecules; SeqHound; Interactions, Reactions, Kinetics, PTMs; Initial Conditions; Expression, Concentration, Localization/distributions; microscopy; NCBI/EBI/DDBJ; PDB; BIND; Proteomics/Genomics]
SLRITools Purpose • Make our sequence and structure manipulation and analysis infrastructure and tool software freely available, for the greater benefit of the bioinformatics community
SLRITools Description • Mainly C-based cross-platform toolkit for dealing with biological information, especially protein structure/function. • Extends the freely available NCBI C/C++ Toolkits and forms the basis for a number of powerful applications • GPL/LGPL/PAL licenses • Currently hosted at http://sourceforge.net/projects/slritools • Training tutorials http://bioinfo.mshri.on.ca/tkcourse/ • Canadian Bioinformatics Workshops http://bioinformatics.ca
SLRITools Projects • SLRI lib - common library that extends NCBI Toolkit • SeqHound - Sequence and Structure Database Management System • BIND - Biomolecular Interaction Network Database • Text Indexer - ASN.1 indexer • NBLAST – Cluster variant of BLAST for NxN comparisons • Kangaroo – Regular expression search of DNA/protein/CDR
Hogue Lab - Source Code • Hogue Lab projects: BIND, TraDES, MoBiDiCK, SeqHound, SLRI database • 450,000 lines of source code, 22 person-years of work • http://sourceforge.net/projects/slritools • NCBI C and C++ Toolkits • 2.6M lines of source code, 160 person-years of work • http://ncbi.nlm.nih.gov/IEB/ • Industry standard: 65 lines/day
Going Open Source • Subject to the Intellectual Property Policy of Mt. Sinai Hospital • Does the software have the potential to improve patient care? • Does the software have economic benefits that will fund new research and development? • Patents, Licenses & Publications
Software Licenses Stage 1) “Not Released” • “No license”: internal use only • Protects commercial interest of MSH • distributedfolding Stage 2) “Free to Academics” • Executables provided free, source upon request • Publication • Companies must license from MSH • MCODE, TRADES, SSSF Stage 3) “Public Use License” • GNU General Public License • SeqHound, BIND Data Manager, BIND specification • Perl Artistic License/GNU Lesser General Public License • SeqHound Remote Interfaces for BioPerl/C, C++ API [Diagram labels: SLRI Industrial Liaison; Tech Transfer Office; Patent; IP; MSH Board subcommittee on commercialization]
Open Source Issues • Software Releases • Support
SLRITools Foundation • National Center for Biotechnology Information (NCBI) • NCBI Toolbox - Information Engineering Branch • http://www.ncbi.nlm.nih.gov/IEB/ • GenBank, Entrez, BLAST, Sequin, OMIM, RefSeq • Data Model – An explicit, complete data model of biological sequences, structures, bibliographic data, and associated annotations • Data Encoding - A formal specification and encoding rules. The telecommunications standard, ASN.1, has been used for this. Recently it has been mapped to a similar language, XML. Provides automatic code generators.
SLRITools Foundation II • Programming Libraries • Originally written in a portable dialect of C. Recently a new generation is being written in C++. • Compiled and occasionally tested over 14 OS • Linux, HPUX, MacOS 9/X, Irix, Solaris, Windows 3.1/95/NT/2000/XP, BeOS, QNX, alpha, BSD, AIX, parisc-Linux, Sony PlayStation2 Linux • 16/32/64 bit hardware • Open Source – Free License • ftp://ftp.ncbi.nih.gov/toolbox/
SeqHound • SeqHound is a sequence and structure database management system that inherits the NCBI data model and mirrors the NCBI core biological sequence and structure information • Why did we develop SeqHound? • Too many hits to NCBI server -> banned IP! • Data transmission & network connection issues • Generate more sophisticated API to access data currently only available within the NCBI • Faster, local or remote access with a variety of programming languages • Provide functionality necessary to retrieve specialized subsets of sequences, structures and structural domains.
SeqHound (updated daily) • Nucleic Acids • Proteins • 3D Structures • Domains • PubMed Links • Taxonomy • Identifiers • Coding Regions • Genome Sets • Redundancy • Neighbors • GO Annotation • LocusLink • Fielded Text Index • Medline [Diagram labels: XML/DB2; GFF; FASTA; Clustal; PDB; XML; ASN.1] • 150+ functions • http://seqhound.mshri.on.ca
SeqHound Resources • SeqHound is accessible via • http://seqhound.mshri.on.ca • Simple web interface (under development) • C, C++, Java (new!), Perl remote API or an optimized local API. (->SOAP?) • Timeline • Redundant fail-over server mid-summer • Concurrent with Bioperl release • Freely available article published in BMC Bioinformatics 2002, 3:32 • http://www.biomedcentral.com/1471-2105/3/32/
BIND: Biomolecular Interaction Network Database Motivation: • Massive influx of biomolecular interaction data requires repository, standards and access Goals: • Provide a standard, comprehensive and integrated interaction resource to the scientific community • Define protein function and mechanisms • Recover and integrate biomolecular interaction knowledge (backfilling) • Discover new knowledge through data mining
http://bind.ca Result: • Database to archive and exchange molecular assembly information • Describes interactions, complexes and pathways • BIND has an extensive data model and GNU software tools, and is based on the NCBI toolkit • Recently funded for a 3-year effort at 25M CDN • CIHR (1M), OGI/Genome Canada (12.5M), Ontario R&D Challenge Fund (5.2M) • IBM, MDS Proteomics and Foundry Networks • Sun
BIND Data Policies GenBank Policy: • BIND data is freely available for any purpose Direct Submission: • Submitters cannot limit the intended use of submitted BIND data • Submitters have the right to edit/alter their records over time • Suggestions made by a third party will be forwarded by us to the submitters to seek approval for any changes or corrections Availability: • ftp://ftp.bind.ca • ASN.1/XML data + specification
Molecular Complex Detection (MCODE) • Assume densely connected regions of a heterogeneous interaction network represent molecular complexes • MCODE finds densely connected regions of a graph • Weight nodes by local density (scoring function) • From the highest-weighted node, recursively add neighbours scoring above a threshold to the complex • Evaluation (Yeast): • 88/221 CellZome hand-annotated complexes • 64/208 MIPS complexes (166 predicted) • 200 complexes predicted in 15,143 protein interactions from yeast • Published: BMC Bioinformatics 2003, 4:2 • http://www.biomedcentral.com/1471-2105/4/2
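The two algorithmic steps on this slide (weight nodes by local density, then grow a complex outward from the highest-weighted node) can be sketched roughly as follows. This is a simplified illustration, not the published MCODE implementation. The scoring function (neighbourhood edge density times degree), the `cutoff` parameter, the `min_size` filter and all function names are assumptions made for the example.

```python
from collections import deque


def local_density_scores(graph):
    """Score each node by the edge density of its closed neighbourhood,
    scaled by its degree (an assumed stand-in for MCODE's weighting)."""
    scores = {}
    for node, nbrs in graph.items():
        members = set(nbrs) | {node}
        n = len(members)
        if n < 2:
            scores[node] = 0.0
            continue
        # Count each undirected edge inside the neighbourhood exactly once.
        edges = sum(1 for u in members for v in graph[u]
                    if v in members and u < v)
        density = edges / (n * (n - 1) / 2)
        scores[node] = density * len(nbrs)
    return scores


def find_complexes(graph, cutoff=0.2, min_size=3):
    """Grow candidate complexes from seeds in decreasing score order.

    A neighbour joins a complex when its score is at least
    seed_score * (1 - cutoff); `cutoff` is a hypothetical parameter.
    """
    scores = local_density_scores(graph)
    seen = set()
    complexes = []
    for seed in sorted(graph, key=lambda v: (-scores[v], v)):
        if seed in seen:
            continue
        threshold = scores[seed] * (1.0 - cutoff)
        members = {seed}
        queue = deque([seed])
        while queue:  # breadth-first expansion from the seed
            u = queue.popleft()
            for v in graph[u]:
                if v not in members and v not in seen and scores[v] >= threshold:
                    members.add(v)
                    queue.append(v)
        seen |= members
        if len(members) >= min_size:
            complexes.append(sorted(members))
    return complexes


# Toy undirected network: a 4-clique (A-D) with a sparse tail (E, F).
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C", "E"},
    "E": {"D", "F"},
    "F": {"E"},
}
print(find_complexes(graph))
```

On the toy network the dense clique is reported as one candidate complex, while the sparse tail falls below the size filter.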
[Figure: 9-core from ~15,000 yeast interactions; labelled regions: Dense Fibrillar Center, Fibrillar Center, Granular Component]
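Assuming "9-core" here means the graph-theoretic k-core with k = 9 (the largest subgraph in which every node has at least k neighbours inside the subgraph), a k-core can be computed by repeatedly peeling off low-degree nodes. The sketch below is illustrative only. The function name `k_core` and the toy graph are not from the talk, and the input is assumed to be a simple undirected graph stored as adjacency sets.

```python
def k_core(graph, k):
    """Return the node set of the k-core of an undirected graph.

    `graph` maps each node to the set of its neighbours (no self-loops).
    Nodes with fewer than k surviving neighbours are removed until none remain.
    """
    alive = set(graph)
    degree = {v: len(graph[v]) for v in graph}
    stack = [v for v in graph if degree[v] < k]
    while stack:
        v = stack.pop()
        if v not in alive:
            continue  # already removed via an earlier stack entry
        alive.discard(v)
        for u in graph[v]:
            if u in alive:
                degree[u] -= 1
                if degree[u] < k:
                    stack.append(u)
    return alive


# Toy network: a 4-clique (A-D) with a sparse tail (E, F).
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"A", "B", "C", "E"},
    "E": {"D", "F"},
    "F": {"E"},
}
print(sorted(k_core(graph, 3)))
```

Raising k keeps only the more tightly connected parts of the network, which is why a high-k core of an interaction network highlights dense regions.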
FAST = “parallel” RPS BLAST • Used to spot domain similarities in a protein interaction cluster • Server-generated, scalable Flash graphics: zoomable and printable • Follow up by zooming in on FASTA-formatted sequences to see domain superposition and links to SMART/PFAM
NBLAST Description: • NBLAST is a cluster computer variant of BLAST • It performs the minimum number of sequence comparisons and stores sequence alignments and the list of similar sequences (neighbours) as binary ASN.1 (XML) • NBLAST is written in C using the NCBI C Toolkit. • Separate function and database layers Accessibility: via SeqHound • http://seqhound.mshri.on.ca Neighbours DB (codebase) • ftp://ftp.mshri.on.ca/pub/nblast Published: BMC Bioinformatics 2002, 3:13 • http://www.biomedcentral.com/1471-2105/3/13/
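One way to read "the minimum number of sequence comparisons" for an NxN problem is that each unordered pair of sequences is compared once, giving N(N-1)/2 comparisons instead of N². The sketch below assumes that reading and shows how such pairs could be dealt out across cluster workers. It is not NBLAST's actual scheduler, and `pair_count`, `partition_pairs` and the sequence identifiers are hypothetical.

```python
from itertools import combinations


def pair_count(n):
    """Number of unordered pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


def partition_pairs(ids, workers):
    """Deal every unordered pair of ids round-robin into `workers` job lists."""
    jobs = [[] for _ in range(workers)]
    for i, pair in enumerate(combinations(ids, 2)):
        jobs[i % workers].append(pair)
    return jobs


ids = ["seq1", "seq2", "seq3", "seq4"]
print(pair_count(len(ids)))
for worker, job in enumerate(partition_pairs(ids, 3)):
    print(worker, job)
```

Round-robin assignment keeps the job lists within one pair of each other in size. A real cluster scheduler would also need to account for sequence length and database locality.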
Ookpik • CFI/ORDCF funded • 216 P-III 450 • 64 GB • 1.2 TB disk • NBLAST • RPS-BLAST • TRADES • MoBiDiCK • http://bioinfo.mshri.on.ca/yac/ • http://sourceforge.net/projects/slritools/
Kangaroo Description: • Kangaroo is implemented to facilitate a wide range of queries with no restriction on the length or complexity of the query expression • Uses regular expressions • Searches DNA, protein, or coding regions • Web-based form and results • Links to SeqHound Accessibility: • http://bioinfo.mshri.on.ca/kangaroo • Currently supports searches on 10 organisms (including human and mouse) Published: • BMC Bioinformatics 2002, 3:20 • http://www.biomedcentral.com/1471-2105/3/20
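To illustrate the kind of regular-expression query described on this slide, here is a minimal sketch using Python's standard `re` module. It does not use Kangaroo itself or its query syntax. The function name `find_motif` and the two short protein sequences are invented for the example, and the pattern is the PROSITE N-glycosylation motif N-{P}-[ST]-{P} written as a regular expression.

```python
import re


def find_motif(pattern, sequences):
    """Return (sequence_id, start, end, matched_text) for every match.

    `sequences` maps an identifier to a DNA or protein string.
    Positions are 0-based and `end` is exclusive.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for seq_id, seq in sequences.items():
        for m in regex.finditer(seq):
            hits.append((seq_id, m.start(), m.end(), m.group()))
    return hits


# Made-up sequences for demonstration only.
sequences = {
    "prot1": "MKTNGSAVLLNQTPE",
    "prot2": "MAAPLKKNPSG",
}

# N, then any residue except P, then S or T, then any residue except P.
print(find_motif(r"N[^P][ST][^P]", sequences))
```

Note that `finditer` reports non-overlapping matches only. Overlapping hits would need a lookahead pattern such as `(?=(...))`.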
Summary • Robust tools and services based on the NCBI data model • Flexible licensing Future Prospects • BIND/SeqHound Web Services (SOAP) • SeqHound • Web Interface • InterPro|COG • Larger & more sophisticated BIND (JAVA) • Grid Engine & Cell Simulation
Christopher W.V. Hogue Lab Projects/Graduate Students • BIND: Gary Bader, Doron Betel • SeqHound: Katerina Michalickova • Protein Folding/CASP Predictions: Howard Feldman • Species Specific Protein Scoring Functions: Michel Dumontier • Cell Simulation/Systems Biology: Adrian Heilbut, Ken Lau • FPGA Hardware Database Search Engines: Ruth Isserlin
Database Curation: Vicki Lay, Susan Moore, Brigitte Tuekam, Cheryl Wolting • Software Engineering: Neil Bahroos, Ian Donaldson, Marc Dumontier, Vladimir Grytsan, Hao Lieu, Greg Pintile, John Salama • Administration: Eric Andrade, Marianne Rukavina, Sue Sroka, Greg Van Volkenburg • IT: Greg Clark, Edward Lee • BIND = “Blueprint Initiative”