10 likes | 140 Views
mmiller:. Baldridge, K.; Baru, C.; Bourne, P.; Clingman, E.; Cotofana, C.; Ferguson, C.; Fountain, A.; Greenberg, J.; Jermanis, D.; Li, W.; Matthews, J.; Miller, M.; Mitchell, J.; Mosley, M.; Pekurovsky, D.; Quinn, G.B.; Rowley, j.; Shindyalov, I.; Smith, C.; Stoner, D.; Veretnik, S.
E N D
mmiller: Baldridge, K.; Baru, C.; Bourne, P.; Clingman, E.; Cotofana, C.; Ferguson, C.; Fountain, A.; Greenberg, J.; Jermanis, D.; Li, W.; Matthews, J.; Miller, M.; Mitchell, J.; Mosley, M.; Pekurovsky, D.; Quinn, G.B.; Rowley, j.; Shindyalov, I.; Smith, C.; Stoner, D.; Veretnik, S. San Diego Supercomputer Center, MC 0505, 9500 Gilman Drive, La Jolla, CA 92093-0505, USA The Need for Protein Annotation The EOL Model Innovative Data Access The Encyclopedia of Life (EOL) ProjectAn initiative to analyze and provide annotation for putative protein sequences from all publicly available genome data • Accompanying the massive supply of genomic data is a need to annotate proteins from structural and functional points of view. Questions that researchers look to answer using the massive amount of new genomic data include: • What other genomic proteins are similar to the protein that I am researching? • What level of conservation is there for a particular protein sequence across species? • Which protein domains are common to various protein sequences? • What is the likely cellular location of a specific protein or class of proteins? • On a limited basis, researchers are able to manually perform BLAST searches, sequence analysis and data collation for small collections of protein sequences of interest, but for the very large numbers of sequences (10,000 to 15,000 or greater) coded for in an individual genome, this becomes impractical. • Therefore, key to large-scale genomic sequence analysis is the creation of a reliable and automated software “pipeline” to handle both the analysis functions and then the collation of output data from the analysis. TeraGrid Sequence data from genomic sequencing projects Ported pipeline applications Pipeline data Load/update scripts MySQL DataMart(s) Data warehouse Domain location prediction Structure assignment by 123D Structure assignment by PSI-BLAST Publish Web Services & API Application server SOAP/Web Server UDDI directory The Sequence Analysis Pipeline EOL Web pages served via JSP EOL Notebook Data incorporated into third party web pages Automated data downloads to mirrors and researchers WWW mmiller: This should read “all available genome sequences” Genomic Pipeline Figure 3 Book Metaphor Web Interface Arabidopsis Protein sequences sequence info structure info NR, PFAM Prediction of : signal peptides (SignalP, PSORT) transmembrane (TMHMM, PSORT) coiled coils (COILS) low complexity regions (SEG) SCOP, PDB Figure 2 The EOL Data Analysis and Delivery Model An unique aspect of the EOL model is its ability to deliver data through multiple routes.One arm of this data delivery system is the Web interface, driven by Java Server Pages (JSP). Building on the “Encyclopedia of Life” concept, the interface provides fast access to EOL data through a book metaphor design. Data is cataloged alphabetically by species, and the user is provided with multiple additional tools to search sequence data, including: Building FOLDLIB: PDB chains SCOP domains PDP domains CE matches PDB vs. SCOP 90% sequence non-identical minimum size 25 aa coverage (90%, gaps <30, ends<30) Create PSI-BLAST profiles for Protein sequences The EOL model (Figure 2), applies the iGAP pipeline (proven by the PAT project) al available (cuurently 800+) genomes. It is a key goal of the the project to provide the computational and storage resources necessary to accommodate the analysis of this magnitude of sequnce data (current esitmates are 300 cpu years with available hardware). Ongoing efforts are aimed at obtaining more cpu resources, and improving the efficiency of computational resource utilization. Structural assignment of domains by PSI-BLAST on FOLDLIB Only sequences w/out A-prediction • BLAST search with a protein query sequence to one or more specific species data. • Keyword search. • Natural Language Query search. • Sequence identifier (accession ID) search. • SCOP Fold browser. • Putative function browser. Structural assignment of domains by 123D on FOLDLIB • Stages in EOL Data Processing and Delivery • Publicly available genomic sequence data are obtained via a high-speed Internet 2 connection from NCBI to the San Diego Supercomputer Center. • Sequence data is distributed to several large-scale computing resources such as at partner institutions, such as the BII in Singapore; and the TeraGrid at SDSC (see below), to which the PAT software pipeline has been ported. • Data from the pipeline is deposited into a DB2-based multi-species version of the PAT data warehouse schema, and federated with data from a number of other local database projects. • Multiple complex queries on the data are run and the results are stored in the data base. • Data is loaded into multiple data marts for fast, read-only query access/distribution to both end-users (via a Web interface and a SOAP-based Web services paradigm), and to EOL data mirror sites. • Researchers throughout the world are able to access the data by pointing their Web browser to the EOL data Web site or one of its mirrors. Additionally, the World Wide Web Consortium (W3C) standards-based Web Service protocol allows for peer-to-peer automated computer data access for a variety of uses. Only sequences w/out A-prediction Functional assignment by PFAM, NR, PSI-Pred assignments Domain location prediction by sequence FOLDLIB Store assigned regions in the DB Figure 1 Genomic analysis pipeline used to analyze Arabidopsis thaliana sequence data Query results will be returned in multiple forms, including a Web page summary at the genome, sequence, and structure data levels; as well as by links to the same information in XML, a PDF printer-friendly output, EOL notebook version (see below), and a narrated summary in Flash. The Web interfaces make extensive use of Scalable Vector Graphics (SVG) components to deliver fast, client-side graphical data renderings using XML encapsulated data accoroding to W3C standards. An example is the SVG “chromosome mapper” shown in Figure 4. SVG molecular rendering is used at the client side to provide fast, interactive, and visually informative molecular graphics. mmiller: Strike out arabidopsis The Proteins of Arabidopsis thaliana (PAT) project was a prototype initiative to establish a reliable and accurate pipeline for genome annotation (iGAP) (Figure 1). Using homology modeling, the iGAPprovides functional annotations and predicts three-dimensional structures (where possible) for proteins encoded in the Arabidopsis thaliana genome. The results from iGAP (BLAST-WU, PSI-BLAST, 123D+, COILS, TMHMM, SignalP) were combined and organized into a relational database with a web-based GUI. Steps in Protein Annotation The end-user experience of accessing data processed in this manner is fast, comprehensive and flexible. • Structural assignment by sequence similarity and fold recognition. • Fold assignment. • Function assignment. • Modeling by aligning with template. Large-Scale Computing Resources and Data Storage • Functional assignment by sequence similarity. • Assignment of special classes (filtering). Key to the success of the EOL project has been the ability to partner with computing projects that will provide the resources to drive the software pipeline to process over 800 available genomes. Large-scale computing resources being recruited for the EOL project include the TeraGrid, the world's largest, fastest, most comprehensive, distributed infrastructure for open scientific research (http://www.teragrid.org), PRAGMA, an open organization in which Pacific Rim institutions formally collaborate to develop grid-enabled applications and to deploy Grid infrastructure throughout the Pacific region (http://pragma.ucsd.edu), and NRAC reources, including SDSC’s Blue Horizon; University of Michigans AMD cluster, and the University of Wisconsin Condor Flock. Another factor in the development of EOL has been the ability to deploy large-scale, mass storage to handle the enormous amount of data generated by iGAP analyses and loaded into EOL data warehouse schema and data marts. Ultimately more than 10 terabytes of storage will be deployed for genome annotation alone. • Assignment of protein features. An important issue in this process is automation and its associated automated quality assessment. In the pipeline model, this was addressed by: • Introduction of six reliability categories. • Introduction of benchmark based on 1000 non-redundant SCOP folds [Murzin, AG; Brenner, SE; Hubbard, T; Chothia, C. J. Mol. Biol., 1995, 247:536]. • Testing a variety of search conditions and methods within this benchmark. Further information about the PAT project may be found at the PAT web site: http://arabidopsis.sdsc.edu Figure 4 Client-side data rendering using SVG Reliability Categories (based on selectivity benchmark): A. Certain (99.9% of true positives among predicted positives) Web Services and the EOL Notebook B. Reliable (99%) C. Probable (90%) Sensitivity = tp/(tp+fn) Selectivity = tp/(tp+fp) D. Possible (50%) E. Potential (10%) In addition to obtaining access to EOL data via the web, other components of data delivery include publication of Web Services-based API; and the SDSC Blue Titan web services network direction system. Through Web Services, any researcher or data service is able to access EOL data automatically and with minimal programmatic effort. The EOL notebook is a subproject within EOL (and bioinformatics.org) to create a Java-based application, distributed via JNLP, that will act as a local repository for EOL data. In addition to being able to store and search data locally, the EOL notebook will also be a consumer of EOL Web Services and, via automation, will ensure locally kept data (stored in XML format for interoperability) is kept in sync with data in the main EOL repository. F. No annotation Multiple EOL Data Mirror Sites Data mirrors will be a major component in the EOL data distribution system. A software package can be downloaded from the EOL interface that allows researchers to store selected EOL data on local machines, and, if desired, the software makes it possible to act as a public EOL data mirror. This mirror package software will be based upon a freely available relational database management system (MySQL) and application server (JBoss). This ensures the widest possible deployment of an EOL mirror data repository, from major university and biotech sites to the smallest research institutions, even including high schools, For further information about EOL, please visit us online at: http://www.eolproject.info or contactMark Miller at mmiller@sdsc.edu, +1-858-822-0866