Building CryptoDB using GUS • Mark Heiges, Center for Tropical and Emerging Global Diseases, University of Georgia • mheiges@uga.edu
Diagram labels: Genomic Data • Analysis Results • GUS • Plugins • Tomcat • WDK • Apache
Pipeline diagram • External Resources: NCBI Taxonomy (SRes), SO (SRes), NRDB (DoTS), our data (DoTS) • Plugins • Web Development Kit • GUS • helper scripts • Analysis Input: contigs, proteins, NRDB • Plugins • Analysis Results
Site Design Considerations • data types we wanted to warehouse • additional analyses desired • how to load data into GUS • how to visualize data • tables • text • graphics (interactive, static) • what types of questions will be asked of the data
Deciding Factors • What data was available. • What the research community needed. • What we could accomplish by the contractual deadline for our first release.
Crypto External Resource Data • Genomic sequence and gene annotations for two species (GenBank) • sequence • CDS translations • gene product descriptions • exon coordinates • RNA type (mRNA, tRNA, snoRNA, rRNA) • other features • EST/mRNA (GenBank)
Auxiliary Data Required • NRDB • NCBI Taxonomy Reference • Sequence Ontology Definitions
GUS Plugins • Perl modules for loading data into GUS • facilities to connect to the GUS Perl object layer and the database • process command line arguments • create tracking information in the database • log and handle errors
GUS Plugins • Supported and Community plugins bundled with GUS • Plugins are versioned • Each plugin version must be registered with GUS before use • records CVS version and MD5 checksum • auditing
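The checksum part of plugin registration can be illustrated with a small sketch. This is not GUS's registration code: the helper name `plugin_checksum` and the demo file are hypothetical, and the example only shows why an MD5 digest pins down one exact version of a plugin's source file.

```python
import hashlib
import os
import tempfile


def plugin_checksum(path):
    """Return the MD5 hex digest of a file (hypothetical helper, not a GUS API)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo():
    """Checksum a made-up plugin file before and after a one-line edit."""
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "DemoPlugin.pm")
        with open(path, "w") as handle:
            handle.write("package DemoPlugin;\n1;\n")
        before = plugin_checksum(path)
        with open(path, "a") as handle:
            handle.write("# local edit\n")
        after = plugin_checksum(path)
    return before, after
```

Any edit to the file changes the digest, so a stored checksum can reveal that the code on disk no longer matches the registered version.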
Data Loading at CryptoDB • Install GUS • Register selected plugins • Load Controlled Vocabularies • NCBI Taxonomy • Sequence Ontology Definitions • Load Crypto annotated sequences from GenBank records • Load NRDB from FASTA file
Data Loading at CryptoDB • Load Crypto mRNA GenBank records • Load ESTs from U Penn's database of NCBI's dbEST
CryptoDB Analyses • BLASTP - compare annotated proteins to NRDB • BLASTX - compare whole genome to NRDB • BLASTN - synteny comparison of the two Crypto species we host • EST/mRNA clustering and alignment • signal peptide predictions • transmembrane predictions
Analysis Workflow • Load Source Data into GUS (NRDB, genomic seqs) • Dump same data from GUS with GUS Ids • Perform analysis with this data (BLASTX) • Load results into GUS • GUS Ids allow results to be linked back to analysis input data
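The role of the GUS Ids in this workflow can be sketched with toy data. Everything here is an assumption for illustration: the dictionaries stand in for database tables, the ids and names are made up, and `hits` stands in for parsed analysis output. None of it reflects the real GUS schema or plugin behavior.

```python
def dump_with_ids(table):
    """Dump toy records as FASTA text whose defline starts with the database id."""
    return "".join(
        f">{row_id}\n{row['sequence']}\n" for row_id, row in sorted(table.items())
    )


def ids_in_fasta(fasta_text):
    """Recover the database ids from the deflines of a FASTA dump."""
    return [
        int(line[1:].split()[0])
        for line in fasta_text.splitlines()
        if line.startswith(">")
    ]


def link_hits(hits, queries, subjects):
    """Join (query_id, subject_id, score) hits back to their source records."""
    return [
        {
            "query": queries[query_id]["name"],
            "subject": subjects[subject_id]["name"],
            "score": score,
        }
        for query_id, subject_id, score in hits
    ]


# Made-up stand-ins for loaded source data.
CONTIGS = {101: {"name": "contig_A", "sequence": "ATGCATGC"}}
NRDB = {336: {"name": "tubulin alpha", "sequence": "TIGGGDDSFN"}}
```

Because the ids travel with the sequences through the dump and the analysis, a result row needs only the two ids to be tied back to its input records.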
Data Analysis - BLASTP • Dump NRDB records from GUS to FASTA file - with GUS Ids
>336 source_id=0703290B secondary_identifier=223280 tubulin alpha length=411
TIGGGDDSFNTFFSETGAGKHVPRAVFVDLEPTVIDEVRTGTYRQLFHPEQLITGKEDAA
NNYARGHYTIGKEIIDLVLDRIRKLADQCTGLQGFSVFHSFGGGTGSGFTSLLMERLSVD
YGKKSKLEFSIYPARQVSTAVVEPYNSILTTHTTLEHSDCAFMVDNEAIYDICRRNLDIE
RQVSTAVVEPYNSILTTHTTLEHSDCAFMVDNEAIYDICRRNLDIE
• Dump annotated protein sequences from GUS to FASTA file - with GUS Ids
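A defline like the one above could be split into its parts as follows. This is a sketch under an assumption: the layout (leading GUS Id, `key=value` tokens, free text as the description) is inferred from the single example on the slide, and `parse_defline` is a hypothetical helper rather than part of GUS.

```python
def parse_defline(defline):
    """Split a '>id key=value ... free text' defline into a dictionary."""
    tokens = defline.lstrip(">").split()
    parsed = {"gus_id": int(tokens[0])}
    description = []
    for token in tokens[1:]:
        if "=" in token:
            key, value = token.split("=", 1)
            parsed[key] = value
        else:
            # Tokens without '=' are treated as part of the description.
            description.append(token)
    parsed["description"] = " ".join(description)
    return parsed


EXAMPLE = ">336 source_id=0703290B secondary_identifier=223280 tubulin alpha length=411"
```

Only the leading id is needed to link a BLAST hit back to its GUS record; the other fields are convenience for a human reading the file.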
Data Analysis - BLASTP • Run BLASTP algorithm with these two GUS Id labeled datasets • used a Perl wrapper to BLAST executable, included with GUS... plugin compatible output • Load BLAST results with plugin
ga GUS::Common::Plugin::LoadBlastSimFast --file blastSimilarity.out --restartAlgInvs "" --queryTable DoTS::ExternalNASequence --subjectTable DoTS::ExternalAASequence --commit
Post Data Loading • Find where the results were loaded • read documentation • ga GUS::Common::LoadBLAST --help • looked in plugin source code • asked other users • gusdb.org schema browser • fishing expeditions in GUS tables
Web Development Kit (WDK) • provides accelerated development of database driven web sites • define questions and records in model XML file • default JavaServer Pages (JSP) views provided • not specific to GUS • can be used with any RDBMS
WDK Question - Summary - Record Paradigm • Users supply parameter values to a canned question on the website • "Which genes have at least __ exons?" • The result is returned in summary pages that list links to the record pages • Record page - detailed view of data object • text • graphics • tables
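The question, summary and record steps can be mimicked in a few lines. This is a toy sketch, not WDK code: the gene data is made up, the function names are hypothetical, and an in-memory dictionary replaces the database.

```python
# Made-up gene data standing in for the database.
GENES = {
    "gene_1": {"product": "hypothetical protein", "exons": 1},
    "gene_2": {"product": "tubulin alpha", "exons": 3},
    "gene_3": {"product": "protein kinase", "exons": 5},
}


def genes_with_min_exons(min_exons):
    """Question: 'Which genes have at least __ exons?' Returns matching ids."""
    return sorted(
        gene_id for gene_id, gene in GENES.items() if gene["exons"] >= min_exons
    )


def summary(gene_ids):
    """Summary: one row per result, each with a link to its record page."""
    return [
        {
            "id": gene_id,
            "product": GENES[gene_id]["product"],
            "link": f"showRecord.do?id={gene_id}",
        }
        for gene_id in gene_ids
    ]


def record(gene_id):
    """Record: the detailed view of a single data object."""
    return {"id": gene_id, **GENES[gene_id]}
```

The user only ever fills in the blank (`min_exons`); the query itself is fixed in advance, which is what makes the question "canned".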
Questions → Summary → Record
WDK Model - View - Controller architecture • Model XML configuration defines • questions • answer summaries • records • View • displays the model • defined in customizable JavaServer Pages (JSP) • Controller • internal, not configurable
WDK Setup • build • write WDK model (WDK comes with a Toy site; we spent some time with that beforehand) • test model from command line • install WDK into Tomcat • customize the view (JSP) pages • integrate Tomcat with Apache - personal preference
WDK Model: Defining Questions
<question name="GeneByContig"
          displayName="Genes by Contig"
          queryRef="GeneFeatureIds.GeneByContig"
          summaryAttributesRef="source_id,product,organism,contig"
          recordClassRef="GeneRecordClasses.GeneRecordClass">
  <description>Find gene located on a given contig</description>
</question>
<sqlQuery name="GeneByContig" displayName="By Contig" isCacheable='true'>
  <description>
    Find Genes By Contig ID.
  </description>
  <paramRef ref="params.contig"/>
  <column name="source_id" isInternal="false"/>
  <sql>
    <!-- use CDATA because query includes angle brackets -->
    <![CDATA[
      select g.source_id
      from dots.genefeature g,
           dots.naentry nae,
           dots.sequencetype st,
           dots.externalNAsequence enas
      where nae.na_sequence_id = g.na_sequence_id
        and enas.sequence_type_id = st.sequence_type_id
        and enas.na_sequence_id = nae.na_sequence_id
        and st.name = 'contig'
        and nae.source_id = '$$contig$$'
      ORDER BY g.source_id
    ]]>
  </sql>
</sqlQuery>
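The join logic of this query can be tried out on a toy database. The sketch below is an assumption-laden stand-in: it uses SQLite with unqualified table names instead of the `dots` schema, only the columns the query touches, made-up rows, and a bound `?` parameter where the model uses the `$$contig$$` macro.

```python
import sqlite3

SCHEMA = """
CREATE TABLE genefeature (source_id TEXT, na_sequence_id INTEGER);
CREATE TABLE naentry (na_sequence_id INTEGER, source_id TEXT);
CREATE TABLE sequencetype (sequence_type_id INTEGER, name TEXT);
CREATE TABLE externalnasequence (na_sequence_id INTEGER, sequence_type_id INTEGER);
"""

QUERY = """
select g.source_id
from genefeature g, naentry nae, sequencetype st, externalnasequence enas
where nae.na_sequence_id = g.na_sequence_id
  and enas.sequence_type_id = st.sequence_type_id
  and enas.na_sequence_id = nae.na_sequence_id
  and st.name = 'contig'
  and nae.source_id = ?
order by g.source_id
"""


def build_toy_db():
    """Create an in-memory database with a handful of made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO sequencetype VALUES (?, ?)", [(1, "contig"), (2, "EST")]
    )
    conn.executemany(
        "INSERT INTO externalnasequence VALUES (?, ?)", [(10, 1), (20, 2)]
    )
    conn.executemany(
        "INSERT INTO naentry VALUES (?, ?)", [(10, "CONTIG_1"), (20, "EST_1")]
    )
    conn.executemany(
        "INSERT INTO genefeature VALUES (?, ?)",
        [("gene_b", 10), ("gene_a", 10), ("gene_c", 20)],
    )
    return conn


def genes_by_contig(conn, contig):
    """Run the toy version of the GeneByContig query for one contig id."""
    return [row[0] for row in conn.execute(QUERY, (contig,))]
```

The `st.name = 'contig'` condition is what keeps features on non-contig sequences (the EST row here) out of the result, even when the supplied id matches an entry.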
WDK Model - Record
<recordClass idPrefix="" name="GeneRecordClass" type="Gene"
             attributeOrdering="source_id,exoncount,overview, product,linkout,dnaContext,genomeCompare,tmdata,blastpgraphic, translation,sequence,reference">
  <attributeQueryRef ref="GeneAttributes.GeneAttrs"/>
  <attributeQueryRef ref="GeneAttributes.ExonCount"/>
  <attributeQueryRef ref="GeneAttributes.TMCount"/>
  <tableQueryRef ref="GeneTables.BlastP"/>
  <textAttribute name="overview" displayName="Overview">
    <text>
      <![CDATA[
        This <b><i>$$organism$$</i></b> gene spans positions
        <b>$$start_max$$</b> - <b>$$end_min$$</b> of contig
        <a href="showRecord.do?id=$$contig$$"><b>$$contig$$</b></a>
        which maps to chromosome <b>$$chromosome$$</b>
      ]]>
    </text>
  </textAttribute>
</recordClass>
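How `$$name$$` macros in a text attribute might be filled from attribute values can be sketched as below. This is an illustration of the idea only, not WDK's implementation: the `render` helper, the HTML escaping choice and the sample values are all assumptions.

```python
import html
import re

# Matches $$name$$ where name is letters, digits or underscores.
MACRO = re.compile(r"\$\$(\w+)\$\$")


def render(template, attributes):
    """Replace each $$name$$ macro with the escaped attribute value."""

    def replace(match):
        name = match.group(1)
        # A missing attribute raises KeyError rather than rendering a blank.
        return html.escape(str(attributes[name]))

    return MACRO.sub(replace, template)


# Shortened version of the overview text, with made-up attribute values.
TEMPLATE = (
    'gene spans positions <b>$$start_max$$</b> - <b>$$end_min$$</b> of contig '
    '<a href="showRecord.do?id=$$contig$$"><b>$$contig$$</b></a>'
)
VALUES = {"start_max": 1200, "end_min": 3400, "contig": "CONTIG_1"}
```

Failing loudly on an unknown macro name is a design choice for the sketch; it surfaces typos in the template instead of silently producing an incomplete sentence.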
Testing the Model: command line tools • wdkXml - check XML syntax • wdkSummary - test a summary • wdkQuery - run specific query • wdkRecord - test a record • wdkSanityTest - exercises all queries and records • wdkCache
Install WDK into Tomcat • follow the installation instructions carefully • relies on symbolic links from Tomcat webapp to $GUS_HOME • disallowed by default Tomcat configuration • keep an eye on Tomcat logs for troubleshooting • reload the webapp when model changes • retest on command line • don't forget about the cache
CryptoDB Custom View • Made style changes, added site branding • Added additional form elements • radio buttons, check boxes • 'Flattened out' the questions
CryptoDB Custom View • Record pages - alterations to achieve the desired ordering and placement of text, tables and graphics • Standard JSP tags to embed external objects • GBrowse graphic
Integrate Tomcat with Apache • Apache front end answers all web requests • Serves the static pages and cgi tools • BLAST interface • motif search • BLASTX keyword search • Calls to the WDK are passed to Tomcat
Pipeline • External Resources: NCBI Taxonomy (SRes), SO (SRes), NRDB (DoTS), our data (DoTS) • Plugins • Web Development Kit • GUS • helper scripts • Analysis Input: contigs, proteins, NRDB • Plugins • Analysis Results