INSDC Sequencing Project Registry: NCBI web service protocol. Use and step-by-step description. National Center for Biotechnology Information, NIH, Bethesda, MD, USA.
Project definition
• A project is defined as a collection of INSDC records originating from a single organization, or from a consortium of coordinating organizations.
• The collective database records from a project make up a complete genome of a single organism or a metagenome comprising communities of organisms.
• A project may contain genomic sequences, EST libraries and any other sequences that contribute to the assembly and annotation of the genome or metagenome.
Field definitions
• Assigned by INSDC: project ID; locus-tag prefix
• Mandatory fields: submitter contact info; submitting organization; project type (single organism or metagenomic); project name (for metagenomic); organism name (for single organism); strain/isolate/breed (for single organism); physical source of material (for single organism)
• Optional fields: project description; project URL; replicon names, estimated sizes; sequencing method; sequencing depth; estimated/calculated genome size
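The mandatory-field rules above can be sketched as a small validation routine. This is only an illustration: the class name, attribute names and project-type strings are assumptions made for the example, not part of the NCBI protocol.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProjectSubmission:
    # Mandatory for every project (attribute names are hypothetical).
    submitter_contact: str
    submitting_organization: str
    project_type: str  # assumed values: "single_organism" or "metagenomic"
    # Mandatory for metagenomic projects only.
    project_name: Optional[str] = None
    # Mandatory for single-organism projects only.
    organism_name: Optional[str] = None
    strain: Optional[str] = None
    physical_source: Optional[str] = None
    # Two of the optional fields, shown for completeness.
    description: Optional[str] = None
    url: Optional[str] = None


def missing_mandatory_fields(p: ProjectSubmission) -> List[str]:
    """Return the names of mandatory fields that are empty for this project type."""
    missing = [n for n in ("submitter_contact", "submitting_organization")
               if not getattr(p, n)]
    if p.project_type == "metagenomic":
        required = ("project_name",)
    elif p.project_type == "single_organism":
        required = ("organism_name", "strain", "physical_source")
    else:
        # Unknown project type: report it and check nothing further.
        missing.append("project_type")
        required = ()
    missing.extend(n for n in required if not getattr(p, n))
    return missing
```

A submitter could run such a check before calling the service, so that incomplete records are caught locally.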
[Figure: Schematic diagram of a generic eukaryotic genome project. Components: 1 Genomic sequencing (WGS) and assembly and annotation (complete), Center B; 2 Genomic sequencing (WGS) (complete), Center A; 3 Assembly and annotation (incomplete), Center E; 4 BAC-ends sequencing (incomplete), Center F; 5 Large-scale EST sequencing (complete), Center D; 6 Large-scale cDNA sequencing (incomplete), Center B. Linked resources: nucleotide data at NCBI (GenBank, dbEST), genomic data at NCBI (RefSeq), organism-specific overview, links to third-party sites. Legend: project overview, external data, NCBI data.]
New field in INSDC (International Nucleotide Sequence Database Collaboration) records: locus-tag prefix for annotated genes
[Figure: The NCBI, EMBL and DDBJ genome project submission CGIs each communicate with the INSDC project NCBI server, which is backed by the NCBI Project Database. Service URL: http://www.ncbi.nlm.nih.gov/projects/gpws]
Web services are web-based enterprise applications that use open, XML-based standards and transport protocols to exchange data with calling clients.
• WSDL is an XML-based description of how to communicate using the web service; namely, the protocol bindings and message formats required to interact with the web services listed in its directory.
• WSDL is often used in combination with SOAP and XML Schema to provide web services over the internet. A client program connecting to a web service can read the WSDL to determine what functions are available on the server. Any special datatypes used are embedded in the WSDL file in the form of XML Schema. The client can then use SOAP to actually call one of the functions listed in the WSDL.
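To make the SOAP step concrete, here is a minimal sketch of how a client might build a SOAP 1.1 request envelope with the Python standard library. The envelope namespace is the standard SOAP 1.1 one; the service namespace, the operation name `CheckStatus` and the parameter `ProjectId` are placeholders assumed for illustration. The real names would come from the service's WSDL.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:gpws"  # hypothetical; the real namespace is defined in the WSDL


def build_soap_request(operation: str, params: dict) -> str:
    """Wrap an operation call and its parameters in a SOAP 1.1 envelope."""
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        # Each parameter becomes a child element of the operation element.
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Operation and parameter names below are illustrative only.
request_xml = build_soap_request("CheckStatus", {"ProjectId": 123})
```

In practice a SOAP toolkit generates this envelope from the WSDL, so hand-built XML like this is mainly useful for understanding what travels over the wire.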
NCBI Web service implementation
• Web service methods: Submit Project; Update Project; Delete Project; Check Status; Get Document ID; Get Document
• Others: Bulk dump; Conflict resolution
Submitting a new project - eSubmit (example of a successful submission)
• Collab CGI to NCBI Server: eSubmit (names, Locus_tag_prefix)
• NCBI Server to Collab CGI: eOK, eNone
• Normal case with requested Locus_tag prefix
New project submission - eSubmit (inconsistent request)
• Collab CGI to NCBI Server: eSubmit (names, flag to auto-assign locus_tag AND Locus_tag_prefix)
• NCBI Server to Collab CGI: eError, eProvidedLocusTagPrefixWillBeIgnored
• Data error: returned if the CSubmission.AutoAssignment flag is set and pLocusTagPrefix is also provided by the submitter.
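The two eSubmit outcomes shown on these slides can be summarized in a short sketch. The function name and its signature are assumptions; only the status and error names come from the slides. Treating every other combination of inputs as a success is also an assumption, since the slides document only these two cases.

```python
from typing import Optional, Tuple


def check_esubmit(auto_assign: bool, locus_tag_prefix: Optional[str]) -> Tuple[str, str]:
    """Return a (status, error) pair for a new-project request (illustrative only)."""
    if auto_assign and locus_tag_prefix:
        # Inconsistent request: auto-assignment was requested AND a prefix was supplied.
        return ("eError", "eProvidedLocusTagPrefixWillBeIgnored")
    # Assumption: any other combination is accepted as in the normal case.
    return ("eOK", "eNone")
```

A collaborator's CGI could apply the same check before submitting, to avoid a round trip for a request that is known to be inconsistent.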
Providing Reliability
• NCBI is providing dual middleware and SQL servers.
• Data are stored redundantly on two SQL servers.
• Having both API servers, or either one, available provides full functionality.
• When both middleware servers are available, the API server that handles a request is chosen by load balancing.
[Figure: two NCBI API servers connected to SQL Server 1 and SQL Server 2.]
Normal handling of a conflicting request
• When both servers are up, there is no problem: both get the new request.
• So, in this state, if a conflicting request incompatible with the new request is made, it can be rejected, as it should be.
[Figure: the NCBI API server passes the new request to SQL Server 1 and SQL Server 2; the later conflicting request is rejected.]
There are multiple RARE reasons why a valid request could have problems.
• The connection to NCBI could be down, anywhere between the Collab CGI and NCBI. Among rare events, we expect this to be the most common problem.
• The entire NCBI site could be down. Historically, this has been extremely rare.
• One or both of the database servers hosting the service could be down. (See later slides for the partial service provided with one server up.)
• (If any NCBI middleware API server is up, the request is handled.)
Benefits of Redundant SQL Servers
• If any server is up, requests for information can be handled.
• If any server is up, submissions for project IDs and locus_tags can be accepted.
• Normally, a server going down and coming back requires only the minimal action of checking back to confirm that the state is now OK.
Expected transient case that can be handled automatically
• When a server is down but then comes back up, the request would normally be propagated by NCBI maintenance tasks.
• The Collab API would receive the status “eReceived” until the maintenance task completed; after that, it would receive the “eConfirmed” status for the new request.
• So, in this corrected state, if a conflicting request incompatible with the new request is made, it can still be rejected, as it should be.
[Figure: the NCBI API server stores the new request on SQL Server 1 (received, not confirmed) while SQL Server 2 is down; an NCBI maintenance task later copies the new request to SQL Server 2; the conflicting request is rejected.]
Collab handling of “eReceived”
• The following slides provide more information about why “eReceived” is necessary as a possible return.
• To handle it, the collaborators can check back to confirm that the status has matured to “eConfirmed”, or to see if a problem was detected.
• The possible EXTREMELY RARE and UNLIKELY problems will be presented in the following slides.
Two Phase Commit
• Computer scientists might recognize the problem as a natural consequence of a two phase commit.
• Normally, the two phases are hidden from submitters.
• If the second phase is blocked by a server being down, then this complexity is revealed by the receipt of the “eReceived” status.
Unavoidable Complexity Caused by Redundant SQL Servers
• Redundant SQL servers both prevent data loss and maximize uptime for queries. That is why we choose to accept the complexity of the two phase commit.
• Even when one server is down, the request can be “accepted”, but confirmation comes only after a delay.
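The transient case described on these slides can be sketched as a toy simulation. This is not NCBI's implementation: the class names, the in-memory dictionaries standing in for SQL servers, and the "rejected" return value are assumptions for illustration. Only the “eReceived” and “eConfirmed” status names come from the slides.

```python
class SqlServer:
    """Toy stand-in for one SQL server: maps a locus-tag prefix to its owner."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.records = {}


class Registry:
    """Toy registry storing each request redundantly on two servers."""

    def __init__(self):
        self.servers = [SqlServer("sql1"), SqlServer("sql2")]

    def submit(self, prefix, owner):
        live = [s for s in self.servers if s.up]
        if not live:
            raise RuntimeError("no SQL server available")
        # Reject if a reachable server already holds the prefix for someone else.
        if any(s.records.get(prefix, owner) != owner for s in live):
            return "rejected"  # placeholder label, not a documented status
        for s in live:
            s.records[prefix] = owner
        return self.status(prefix, owner)

    def status(self, prefix, owner):
        # Confirmed only once BOTH servers hold the same record.
        if all(s.records.get(prefix) == owner for s in self.servers):
            return "eConfirmed"
        return "eReceived"

    def run_maintenance(self):
        # Propagate records to any live server that missed them.
        live = [s for s in self.servers if s.up]
        for src in live:
            for dst in live:
                if src is not dst:
                    for prefix, owner in src.records.items():
                        dst.records.setdefault(prefix, owner)


registry = Registry()
registry.servers[1].up = False                 # server 2 goes down
first = registry.submit("ABC", "centerA")      # stored on server 1 only
registry.servers[1].up = True                  # server 2 comes back
registry.run_maintenance()                     # record is propagated
final = registry.status("ABC", "centerA")
```

Here `first` is "eReceived" and `final` is "eConfirmed", mirroring the maturing status that collaborators are asked to check for.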
Why bother with the two phase commit at all?
• The two phase commit prevents a sequence of events that, although expected to be EXTREMELY RARE and UNLIKELY, would leave the data in an unacceptable state.
• The following slide shows what would happen WITHOUT the two phase commit.
• This would NOT HAPPEN in the proposed system because of the two phase commit.
Illustration of what we will not allow and must protect against
• But when one server is down, then comes back while the first goes down, watch what can happen!
• In this state, if a conflicting request incompatible with the new request is made, it could be accepted, leading to an unacceptable data state.
• This unacceptable state is prevented by the two phase commit.
[Figure: the new request is stored on SQL Server 1 only; the conflicting request then reaches SQL Server 2 only and is accepted.]
Why this event is expected to be so rare
This event requires the following sequence:
• Server “2” going down.
• Server “1” accepting a request, then going down, while
• Server “2” comes back up to accept the conflicting request.
How this event would be handled
• Should this unlikely event happen, instead of the status maturing from eReceived to eConfirmed, it would degrade to eConflict for both requests.
• The desired correction would be decided among the collaborators, by dialog.
• The database would be patched to reflect the desired outcome.
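The detection step can be illustrated with a small comparison of the two servers' contents. This is a sketch under stated assumptions: each server is modeled as a plain dictionary from locus-tag prefix to requesting center, and the function name is hypothetical. The status names are those used on the slides.

```python
def reconcile(server1: dict, server2: dict) -> dict:
    """Compare two servers' records and return a status for each (prefix, owner) request."""
    statuses = {}
    for prefix in set(server1) | set(server2):
        a, b = server1.get(prefix), server2.get(prefix)
        if a is not None and b is not None and a != b:
            # Each server accepted a different owner: both requests degrade.
            statuses[(prefix, a)] = "eConflict"
            statuses[(prefix, b)] = "eConflict"
        else:
            owner = a if a is not None else b
            # Same record on both servers is confirmed; on one only, still pending.
            statuses[(prefix, owner)] = "eConfirmed" if a == b else "eReceived"
    return statuses


# The rare sequence: server 1 accepted centerA, then server 2 accepted centerB.
result = reconcile({"ABC": "centerA"}, {"ABC": "centerB"})
```

In this example both requests for prefix "ABC" end up as "eConflict", which is the signal for the collaborators to resolve the case by dialog.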
Illustration of rare event
• The following slide illustrates what would happen should this rare sequence of events occur.
• It may never happen, but the two phase commit makes it possible, so we want to be clear from the beginning about what would happen and how it would be handled.
Proposed responses should this happen
• When one server is down, then comes back while the first goes down, this is what we should do:
• Both requests are received, but not confirmed.
• Processes running on our servers will detect this for manual attention.
[Figure: the new request is stored on SQL Server 1 (received, not confirmed); the conflicting request is stored on SQL Server 2 (received, not confirmed).]
The previous should be a rare event
Then why bother handling it? Because:
• The cost of automatically making this mistake would be high, and
• The more normal, typical, frequent and expected recoveries, as on previous slides, are handled automatically.
• It will be noticed by an eReceived state degrading to eConflict.
Handling of eReceived
• All of the rare cases are noticed by the receipt of eReceived.
• Collaborators need to check back for changes to the status of eReceived projects.
• If the status matures to eConfirmed, no further action is needed.
• If the status degrades to eConflict, then discussion will be needed. This will be rare!
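The check-back procedure above could look like the following sketch on the collaborator side. The function name, the retry count and the delay are assumptions. `check_status` stands for whatever call the collaborator uses to query the service's Check Status method, and is passed in rather than invented here.

```python
import time
from typing import Callable


def wait_for_final_status(check_status: Callable[[], str],
                          attempts: int = 5,
                          delay_seconds: float = 60.0,
                          sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the status leaves eReceived or the attempts run out."""
    status = check_status()
    for _ in range(attempts - 1):
        if status != "eReceived":
            break  # matured to eConfirmed or degraded to eConflict
        sleep(delay_seconds)
        status = check_status()
    return status
```

A caller would treat "eConfirmed" as done, escalate "eConflict" for discussion among the collaborators, and schedule another check later if the result is still "eReceived".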
Non-collab users of the NCBI Genome Project data
• Public data is available in Entrez
• Use eUtils (also implemented as an NCBI web service)
• Discussion on the data elements in the Entrez Genome Project Docsum
• FTP dumps