Using Metacomputing Tools to Facilitate Large Scale Analyses of Biological Databases
Vinay D. Shet, CMSC 838 Presentation
Authors: Allison Waugh, Glenn A. Williams, Liping Wei, and Russ B. Altman
Motivation
• Biological databases are growing at a very high rate
  • The Protein Data Bank (PDB) grew from 5811 entries to 12110 in three years
• Computational tools are required to access and analyze this data efficiently
• Typical data analyses
  • Linear scans across the database for a feature of interest
  • “All-versus-all” comparisons within the database
• High-performance distributed computing resources can play an important role in these analyses
• The authors use a distributed computing environment, LEGION, to enable large-scale analysis of the PDB
CMSC 838T – Presentation
Motivation
• Similar to the evaluation of the threaded BLAST project
  • We run threaded BLAST on a Sun SMP with 24 processors
  • The authors run a program called FEATURE on the LEGION framework
• LEGION can access hundreds of CPUs worldwide
• It can spawn sequential versions of FEATURE on all of them
Talk Overview
• Motivation
• Background
  • LEGION
  • FEATURE
• Methods
  • Experiments
• Results
• Discussion
• Related work
• Observations
Background
• LEGION (Worldwide Virtual Computer)
  • Metacomputing environment composed of geographically distributed, heterogeneous collections of workstations and supercomputers
  • Connects resources to make up a single, worldwide, virtual computer
  • Coordinates a large number of parallel jobs on a mixture of processors (SMPs, MPPs, PCs) on any network
  • Provides the software infrastructure so that a system of heterogeneous, geographically distributed, high-performance machines can interact seamlessly
  • No manual installation of binaries over multiple platforms (LEGION does it automatically)
Background
• LEGION
  • LAM: MPI implementation for workstation clusters
  • LEGION supports transparent and efficient scheduling, data management, fault tolerance, site autonomy, a single file name space, comprehensive resource management, and a wide range of security options
Background
• FEATURE
  • Site characterization and recognition system
  • A site is a microenvironment distinguished by some structural or functional role
  • Identifies functional or structural sites of interest in a query protein
Background
• FEATURE
  • Measures spatial distributions of chemical and physical properties to create a statistical model of the microenvironment
  • Compares regions of the query protein with known sites and control non-sites, and assigns scores indicating the likelihood that a region is a site
  • Produces a list of potential site locations with corresponding scores
  • Has been used to recognize ion, ligand, and enzyme binding sites
  • FEATURE is a typical data-driven algorithm requiring large data storage and efficient data analysis
  • Requires 12 hours on a single processor to evaluate 580 non-redundant PDB entries
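To put the single-processor figure in perspective, the short calculation below turns it into a per-entry cost and extrapolates to the 10,996 proteins scanned in the comprehensive experiment described later in the talk. It is only a back-of-the-envelope sketch: it assumes run time grows linearly with the number of entries, which the slides do not claim, and the function names are invented for illustration.

```python
# Back-of-the-envelope estimate. ASSUMPTION: run time scales linearly
# with the number of PDB entries (not stated in the slides).
SAMPLE_HOURS = 12.0        # slide: single-processor run time
SAMPLE_ENTRIES = 580       # slide: non-redundant PDB entries evaluated
FULL_SCAN_ENTRIES = 10996  # slide: proteins in the comprehensive scan


def seconds_per_entry(hours: float, entries: int) -> float:
    """Average wall-clock seconds spent on one entry."""
    return hours * 3600.0 / entries


def estimated_hours(entries: int) -> float:
    """Linear extrapolation of the sample timing to `entries` proteins."""
    return entries * SAMPLE_HOURS / SAMPLE_ENTRIES


if __name__ == "__main__":
    print(f"{seconds_per_entry(SAMPLE_HOURS, SAMPLE_ENTRIES):.1f} s per entry")  # 74.5
    print(f"{estimated_hours(FULL_SCAN_ENTRIES):.1f} h for the full scan")       # 227.5
```

Under that linear assumption a sequential scan of the whole database would take roughly nine and a half days, which is the kind of workload that motivates spreading the runs over many processors.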
Methods
• FEATURE was run on all protein entries in the May 2000 PDB
  • Searched for potential calcium binding sites
  • FEATURE has 90% sensitivity and 100% specificity for this task
• Three experiments conducted
  • Sequential scan of a PDB subset using a single processor
  • Comprehensive scan of the PDB using the LEGION system with up to 50 processors
  • Set of LEGION runs using a constant PDB subset but a varying number of processors
• Input parameters to FEATURE and the statistical model for Ca remained constant
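The sensitivity and specificity quoted above follow the standard definitions: sensitivity is the fraction of true sites that are recognized, and specificity is the fraction of non-sites that are correctly rejected. The sketch below computes both from confusion-matrix counts. The counts in the usage example are hypothetical, chosen only to reproduce the 90% and 100% figures; they are not the authors' evaluation data.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real sites that were recognized: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of non-sites that were correctly rejected: TN / (TN + FP)."""
    return true_neg / (true_neg + false_pos)


if __name__ == "__main__":
    # HYPOTHETICAL counts for illustration only:
    # 10 known calcium sites, 9 found; 20 control non-sites, none flagged.
    print(sensitivity(true_pos=9, false_neg=1))    # 0.9
    print(specificity(true_neg=20, false_pos=0))   # 1.0
```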
Methods
• Experiments
  • Sequential scan of 726 arbitrarily chosen proteins from the PDB
    • Runs made on a single-processor Sun E450 machine with a 300 MHz UltraSPARC CPU
  • Comprehensive scan of all proteins (10,996 total) in the PDB
    • Maximum number of processors: 50
    • FEATURE code compiled for various platforms so binaries can run on different machines across LEGION
  • Scan of a subset of proteins with a varying number of processors
    • Arbitrarily selected 4997 proteins for each run
    • Varied the number of processors using values 20, 40, 60, and 80
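The second and third experiments amount to dealing the protein list out to a pool of workers, each running an unmodified sequential scan. A minimal sketch of that pattern follows. It is not the LEGION API, which these slides do not show: a local thread pool stands in for the remote processors, and `scan_entry` is a hypothetical placeholder for one sequential FEATURE run on one PDB entry.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(entries, n_workers):
    """Deal entries round-robin into n_workers chunks of near-equal size."""
    return [entries[i::n_workers] for i in range(n_workers)]


def scan_entry(pdb_id):
    """HYPOTHETICAL stand-in for one sequential FEATURE run on one entry.

    A real worker would launch the FEATURE binary on the entry; here we
    just return a placeholder value so the sketch is runnable.
    """
    return (pdb_id, len(pdb_id))


def scan_chunk(chunk, scan_fn):
    """Scan every entry of one chunk sequentially, as a single worker would."""
    return [scan_fn(entry) for entry in chunk]


def distributed_scan(entries, n_workers, scan_fn=scan_entry):
    """Run scan_fn over all entries using n_workers concurrent workers."""
    chunks = partition(entries, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        per_chunk = pool.map(lambda chunk: scan_chunk(chunk, scan_fn), chunks)
        # Flatten the per-worker result lists into one list.
        return [result for chunk_results in per_chunk for result in chunk_results]
```

Calling `distributed_scan(ids, n)` with `n` set to 20, 40, 60, and 80 on a fixed list of identifiers mirrors the shape of the third experiment. Because each entry is scanned independently, no communication between workers is needed.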
Results
• In the sequential run, FEATURE reported six run-time failures caused by non-standard PDB file formats
• In the second experiment, FEATURE also hit run-time assertion failures, illegal instructions, or segmentation faults
Discussion
• FEATURE performance deteriorates once the number of processors exceeds 60
• The optimal maximum number of processors is constrained by
  • the client’s process table, which keeps track of each LEGION process spawned
  • the amount of memory available to support spawned processes
• Thus, even if LEGION contains hundreds of nodes, a user cannot put them all to use
• LEGION also provides minimal fault tolerance: if any instance fails, the user must wait until everything has finished before re-spawning it
• The authors maintained a local copy of the database but concede that this is not a realistic setup, because
  • updates to the PDB occur frequently
  • a local copy consumes a lot of disk space
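The fault-tolerance behavior described above (wait for the whole batch, then re-spawn whatever failed) can be sketched as a simple retry loop. This is an illustration of the pattern only, not LEGION code: a local thread pool stands in for the remote processors, and the names `run_batch`, `run_with_respawn`, and `max_rounds` are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(tasks, worker, n_workers):
    """Run every task and return (results, failed) after ALL have finished."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = {task: pool.submit(worker, task) for task in tasks}
    # Leaving the `with` block waits for every future to complete.
    results, failed = {}, []
    for task, future in futures.items():
        if future.exception() is None:
            results[task] = future.result()
        else:
            failed.append(task)
    return results, failed


def run_with_respawn(tasks, worker, n_workers, max_rounds=3):
    """Re-spawn failed tasks only once the current batch is completely done."""
    results, pending = {}, list(tasks)
    for _ in range(max_rounds):
        if not pending:
            break
        done, pending = run_batch(pending, worker, n_workers)
        results.update(done)
    return results, pending  # `pending` holds tasks that never succeeded
```

The cost of this scheme is visible in the structure: one slow or crashed instance delays its own retry until every other instance in the batch has finished. Tasks must be hashable (for example PDB identifiers) because they are used as dictionary keys.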
Related Work
• Threaded BLAST and MPI BLAST
  • The authors’ work is similar to threaded BLAST
  • MPI BLAST is a parallelized version of BLAST, so a single query can be split across multiple processors
  • FEATURE is not truly parallelized
Observations
• Running CPU-intensive tasks over many processors is definitely useful
• However, LEGION does not scale well: performance degrades beyond 60 processors
• The authors have not exploited true parallelism in FEATURE
• It seems to me that there is a lot of potential to parallelize FEATURE, given that many potential sites can be examined simultaneously
• What would the performance gain be in a parallelized version?
Questions