Biomedical Computing Requirements for HPCS Kay Howell, Federation of American Scientists khowell@fas.org Gerry Higgins, SimQuest, LLC higgins@simquest.com
Biomedical Computing Requirements for HPCS • Examine broad range of application areas • Identify key applications driving computing demand • Identify hardware/software challenges for important classes of applications • Highlight HPCS areas critical to advances in biomedical computing • Identify technology gaps common to biomedical, national security, and other nationally important applications • Demonstrate market potential of HPCS in biomedical computing
Requirements Analysis • System architecture requirements, including: • processors, memory, interconnects, system software, and programming environments • Bandwidth requirements • System robustness • Application development and maintenance • System management, operation, and maintenance
Focus Areas • Resources for managing, analyzing, interpreting data • Extending the time scale & complexity of simulations • Combined classical/quantum chemical simulations • Simulations of large systems • Protein structure prediction • Diagnostic imaging and image-guided interventions
Work Plan • Survey existing information and materials • Interview researchers, sponsors and industrial representatives • Produce preliminary report summarizing findings and distribute for review and comment • Deliver initial requirements one year after project award • Update the report one year later
Biomedical Computing: What we’d like to be able to do… (Static / Dynamic / Functional) • Mouse/Human Genome Correlation • Individual Pharmacogenomic analysis using Gene Expression Arrays • Multi-modal Radiology Image Fusion • Millisecond Structural Biology enabled by Synchrotron X-ray Sources and 900 MHz NMR • Physiologically competent Digital Human Simulations • your additions to the list…
Challenges in Biomedical Computing • Non-linear - current models are simplified linear approximations • System Complexity - need to span multiple scales of biological organization • Time Scales • Exponential increases in data
Biomedical Computing Problems [Figure (ORNL): biomedical computing problems plotted by size scale (atoms → biopolymers → cells → organs → organisms → ecosystems) against timescale (10⁻¹⁵ s up to geologic and evolutionary timescales), with the applicable model classes at each scale: ab initio quantum chemistry and first-principles molecular dynamics for enzyme mechanisms; empirical force-field molecular dynamics for protein folding; homology-based protein modeling; electrostatic continuum models for cell signalling and DNA replication; finite element models for organ function; discrete automata models for ecosystems, epidemiology, and evolutionary processes]
Biomedical Computing Requirements for HPCS Application Areas
Biological Research Requiring ultra-HPC Resources • Structure of the proteasome, ribozymes, ribosomes, ATPases, viruses, and membrane protein complexes • Whole-genome comparison • Combined quantum/classical simulations • Protein folding/threading • Microsecond time-scale simulations • Self-organization and self-assembly • Protein-protein and protein-DNA recognition and assembly Your additions….
Sequencing and Analysis • Key Attributes: • Integer intensive • Significant research into new kinds of statistical models: hybrids of HMMs and neural nets, dynamic Bayesian nets, factorial HMMs, Boltzmann trees • Clusters typically used • Large-scale database infrastructure common • Cluster can be dedicated to a single task/local data control • Cycle requirements can be substantial because of data volume • Systems often in excess of 1 Tflop (range 1-5 Tflops)
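The statistical models listed above can be illustrated with a minimal hidden Markov model decoder. The sketch below runs Viterbi decoding over a toy two-state (coding/noncoding) DNA model; the states, transition probabilities, and GC-rich emission probabilities are made-up illustrations, not parameters from any real gene-finding model:

```python
# Minimal Viterbi decoding for a two-state HMM over a DNA sequence.
# All probabilities are illustrative, not from a real gene model.

def viterbi(seq, states, start_p, trans_p, emit_p):
    """Return the most likely state path for an observed sequence."""
    # V[t][s] = (best probability of ending in state s at step t, predecessor)
    V = [{s: (start_p[s] * emit_p[s][seq[0]], None) for s in states}]
    for obs in seq[1:]:
        V.append({
            s: max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s][obs], prev)
                for prev in states
            )
            for s in states
        })
    # Backtrack from the most probable final state.
    state = max(V[-1], key=lambda s: V[-1][s][0])
    path = [state]
    for col in reversed(V[1:]):
        state = col[state][1]
        path.append(state)
    return list(reversed(path))

states = ("coding", "noncoding")
start_p = {"coding": 0.5, "noncoding": 0.5}
trans_p = {"coding": {"coding": 0.9, "noncoding": 0.1},
           "noncoding": {"coding": 0.1, "noncoding": 0.9}}
# Toy assumption: coding regions are GC-rich.
emit_p = {"coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
          "noncoding": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}}

print(viterbi("GCGCGCATAT", states, start_p, trans_p, emit_p))
```

Production tools use log-space arithmetic and far richer state topologies (the hybrid HMM/neural-net and factorial variants noted above), but the dynamic-programming core is the same.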
Protein Structure Prediction: Summary of Computational Characteristics • Pipeline processing (network of interrelated tasks) • Clustering: • Computationally intensive • Algorithms easier to implement using shared-memory parallelism due to tight coupling and a fine-grained, non-uniform workload • Generation of sequence fragments: • ANN algorithm may be ideal for this and for clustering purposes • Fragment library written to a database • Compute-intensive algorithms are clustering (ANN) and optimization (GA) • Optimization easier to implement using a loosely coupled distributed compute cluster
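The GA-based optimization stage above maps well onto loosely coupled clusters because each candidate is evaluated independently. As a sketch of the idea only (not the actual fragment-assembly code), here is a minimal genetic algorithm with truncation selection, one-point crossover, and bit-flip mutation over a toy fitness function:

```python
import random

# Toy genetic algorithm: evolve a bit-string toward a target.
# Stand-in for the fragment-assembly optimization stage; the fitness
# function (count of matching bits) is purely illustrative.

TARGET = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]

def fitness(genome):
    return sum(g == t for g, t in zip(genome, TARGET))

def evolve(pop_size=50, generations=200, mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in TARGET] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]           # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, len(TARGET))  # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation (bool XOR int yields 0/1).
            child = [g ^ (rng.random() < mutation_rate) for g in child]
            children.append(child)
        pop = parents + children                 # elitist replacement
    return max(pop, key=fitness)

best = evolve()
print(fitness(best), best)
```

In a distributed setting, only the fitness evaluations need to be farmed out; the population bookkeeping stays on one node, which is why this stage tolerates loose coupling so well.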
Protein Structure Prediction: Wish List • Hardware/software to map the processing pipeline efficiently • Tools to schedule such a pipeline and checkpoint it • Well-balanced hardware pipeline from archival storage to the compute elements, without bottlenecks • Easily programmable FPGA coprocessor boards to handle the integer and other DSP branches of the pipeline • Hardware and software that can handle truly asynchronous computing, as it is the key to scalability (overlapped computation, communication and I/O) • Efficient ANN and GA libraries similar to LAPACK • Efficient skeleton/template codes for common computation/communication/I/O ("patterns", in OO jargon) across all platforms • Standardized framework, libraries, and database providing the computational characteristics of the underlying hardware/software environment Source: G. Chukkapalli, UCSD
Protein Structure Prediction: Future Requirements • Combine knowledge-based prediction with ab initio methods to improve prediction accuracy • Execute the whole pipeline on demand in an automated fashion • Generate predicted structures for whole genomes • Protein design: the inverse problem All of these are prohibitively expensive at present
Molecular Level Modeling • Biochemical analysis • Protein binding /drug target evaluation • Dynamics of molecules • Very large systems with physics
Computational Biology HPC Challenges Source: S. Burke, NIH
Data Management • Data management issues will be critically important • Growth rate of biological data is estimated to be doubling every 6 months • GenBank grew from 680,338 base pairs in 1982 to 22 billion base pairs in 2002 (compared to 13.5 billion base pairs as of August 2001) • Rate of data acquisition is 100X higher than originally anticipated due to improved sequencing technology and methods • Redundancies and database asynchrony are increasing - database-to-database comparisons are required for analysis and validation • To look at long-range patterns of expression, synthetic regions on the order of tens of megabases become reasonable lengths for consideration What other data issues should be highlighted?
Data Management Issues • New Types of Data Support to extend existing RDBMS: • Sequences and Strings • Trees and Clusters • Networks and Pathways • Deep Images • 3D Models and Shapes • Molecules and Coordinate Structures • Hierarchical Models and Systems Descriptions • Time Series and Sets • Probabilities and Confidence Factors • Visualizations Source: Davidson, Bristol-Myers Squibb Pharm. Res. Institute
Systems Biology – Modeling the Cellular System • Combine cell signaling, gene regulatory and metabolic networks to simulate cell behavior • Hybrid information- and physics-based model integrating computational/experimental data at all levels • Modeling of network connectivity (sets of reactions: proteins, small molecules, stochastic, MD) • Difficult to handle computationally: • importance of spatial location within the cell • instability associated with reactions between small numbers of molecular species • combinatorial explosion of large numbers of different species • >Petaflop problem
Systems Biology • Need to simulate gene expression, metabolism and signal transduction for single and multiple cells • Algorithms need to be designed specifically for biological research - the parameter optimizer needs to find as many local minima (including the global minimum) as possible, because there are multiple possible solutions of which only one is actually used • Must be able to simulate both high concentrations of proteins, which can be described by differential equations, and low concentrations of proteins, which need to be handled by stochastic process simulation • Stochastic methods are being used (STOCHSIM and the Gillespie algorithm) • individual molecules are represented rather than concentrations of molecular species; Monte Carlo methods are used to predict interactions • rate equations are replaced by individual reaction probabilities
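The Gillespie algorithm mentioned above can be sketched in a few lines: each step draws an exponentially distributed waiting time from the total reaction propensity, then picks which reaction fired in proportion to its propensity. The sketch below simulates a toy birth-death process (∅ → X at rate k1; X → ∅ at rate k2·X) with illustrative rate constants, not a real pathway model:

```python
import random

# Minimal Gillespie stochastic simulation of a toy birth-death process.
# Rate constants and horizon are illustrative; steady-state mean is k1/k2.

def gillespie(x0=0, k1=10.0, k2=0.1, t_end=100.0, seed=1):
    rng = random.Random(seed)
    t, x = 0.0, x0
    trajectory = [(t, x)]
    while t < t_end:
        a1, a2 = k1, k2 * x              # per-reaction propensities
        a0 = a1 + a2                     # total propensity
        t += rng.expovariate(a0)         # exponential waiting time
        if rng.random() * a0 < a1:       # choose reaction ∝ propensity
            x += 1                       # birth: ∅ -> X
        else:
            x -= 1                       # death: X -> ∅
        trajectory.append((t, x))
    return trajectory

traj = gillespie()
print(len(traj), traj[-1])
```

This tracks individual molecule counts rather than concentrations, exactly the regime where rate equations break down; the "combinatorial explosion" noted above comes from running many such coupled reaction channels at once.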
Digital Imaging • Used for monitoring of disease progression, diagnosis, preoperative planning and intraoperative guidance and monitoring • Algorithms are computationally demanding • Key issues are segmentation and registration • Signal processing techniques are used to enhance features and generate the desired segmentation • Results of the segmentation are aligned to other data acquisitions and to the actual patient during procedures • Results of the segmentation are visualized using different rendering methods
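As a toy illustration of the segmentation step above (real clinical pipelines use far more sophisticated methods such as level sets and atlas registration), the sketch below thresholds a synthetic phantom image and scores the result against ground truth with a Dice overlap; the phantom, threshold, and noise level are all invented for the example:

```python
import numpy as np

# Toy intensity-threshold segmentation of a synthetic 2-D "scan":
# a bright disk on a noisy darker background.

def make_phantom(size=64, radius=12, seed=0):
    """Synthetic image plus its ground-truth mask."""
    rng = np.random.default_rng(seed)
    y, x = np.mgrid[:size, :size]
    disk = (x - size // 2) ** 2 + (y - size // 2) ** 2 <= radius ** 2
    image = np.where(disk, 200.0, 50.0) + rng.normal(0, 10, (size, size))
    return image, disk

def segment(image, threshold=125.0):
    """Binary mask of voxels at or above the intensity threshold."""
    return image >= threshold

image, truth = make_phantom()
mask = segment(image)
# Dice coefficient: 2|A ∩ B| / (|A| + |B|), a standard overlap score.
dice = 2 * np.logical_and(mask, truth).sum() / (mask.sum() + truth.sum())
print(f"Dice overlap with ground truth: {dice:.3f}")
```

The registration step mentioned above would then align this mask to other acquisitions (and to the patient intraoperatively), which is where most of the computational demand arises.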
Digital Imaging Source: R. Kikinis, Brigham and Women's Hospital and Harvard Medical School