COMP 5115 Programming Tools in Bioinformatics Week 4 Detail Investigation of the Bioinformatics Functions*: getpdb.
COMP 5115 Programming Tools in Bioinformatics
Week 4
Detail Investigation of the Bioinformatics Functions*: getpdb

The Protein Data Bank (PDB) (http://www.pdb.org) is an archive of experimentally determined three-dimensional structure data for proteins and other biological macromolecules. Note that the new beta site of the PDB replaced the previous RCSB PDB portal on January 1, 2006. As of today (23/10/2006) the archive holds 39464 structures, an increase of 413 in 3 weeks.

* See http://www.mathworks.com/access/helpdesk/help/toolbox/bioinfo/ref for further details.
Detail Investigation of the function codes: getpdb
• getpdb retrieves sequence information for a protein from the PDB
• Syntax
  • data = getpdb('PDBid', 'PropertyName', PropertyValue...)
  • data = getpdb(..., 'ToFile', ToFileValue)
  • data = getpdb(..., 'MirrorSite', MirrorSiteValue)
Detail Investigation of the function codes: getpdb
• Arguments
  • PDBid: Unique identifier for a protein structure record. Each structure in the PDB is represented by a 4-character alphanumeric identifier. For example, 4hhb is the identification code for hemoglobin.
  • ToFile: Property to specify the location and filename for saving data. Enter either a filename or a path and filename supported by your system (ASCII text file).
  • MirrorSite: Property to select the Web site. Enter http://rutgers.rcsb.org/pdb to use the Rutgers University Web site, or http://nist.rcsb.org/pdb for the National Institute of Standards and Technology site.
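The 4-character alphanumeric ID format described above can be checked before any network request is made. The following is a minimal sketch in base MATLAB; isPdbId is a hypothetical helper, not part of the Bioinformatics Toolbox, and it only tests the shape of the ID, not whether the entry exists in the PDB.

```matlab
% Hypothetical helper (not a toolbox function): true when id is a
% 4-character alphanumeric string, the PDB ID format described above.
isPdbId = @(id) ischar(id) && ~isempty(regexp(id, '^[0-9A-Za-z]{4}$', 'once'));

okLower = isPdbId('4hhb');    % hemoglobin ID from the slide
okUpper = isPdbId('1A04');    % IDs may be written in upper case
tooLong = isPdbId('4hhbx');   % five characters, rejected
notChar = isPdbId(4);         % numeric input, rejected by ischar
```

A check like this gives an immediate error for a mistyped ID instead of a failed download.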
Detail Investigation of the function codes: getpdb
• data = getpdb('PDBid', 'PropertyName', PropertyValue...) searches for the ID in the PDB database and returns a MATLAB structure containing the following fields:
• Header, Title, Compound, Source, Keywords, ExperimentData, Authors, Journal, Remark1, Remark2, Remark3, Sequence, HeterogenName, HeterogenSynonym, Formula, Site, Atom, RevisionDate, Superseded, Remark4, Remark5, Heterogen, Helix, Turn, Cryst1, OriginX, Scale, Terminal, HeterogenAtom, Connectivity
Detail Investigation of the function codes: getpdb
• getpdb(..., 'ToFile', ToFileValue)
  • saves the data returned from the database to a file
  • the saved PDB-formatted file can be read back into MATLAB with the function pdbread
• getpdb(..., 'MirrorSite', MirrorSiteValue) allows a user to choose a mirror site for the PDB database.
  • The default site is the San Diego Supercomputer Center, http://www.rcsb.org/pdb.
  • See http://www.rcsb.org/pdb/mirrors.html for a full list of PDB mirror sites (e.g., www.pdb.org).
Detail Investigation of the function codes: getpdb
• Related Bioinformatics Toolbox functions: getembl, getgenbank, getgenpept, getpir, pdbdistplot, pdbplot, pdbread
• Examples
  • Retrieve the structure information for Nitrate/Nitrite Response Regulator Protein NarL with PDB ID 1A04.
  • pdbstruct = getpdb('1A04')
Detail Investigation of the function codes: (getpdb.m)

function pdbstruct=getpdb(pdbID,varargin)
%
if ~usejava('jvm')
    error('Bioinfo:getpdb:NeedJVM','%s requires Java.',mfilename);
end
tofile = false;
seqonly = false;
mirrorsite = 'http://www.rcsb.org/pdb';
if nargin > 1
    if rem(nargin,2) == 0
        error('Bioinfo:getpdb:IncorrectNumberOfArguments',...
            'Incorrect number of arguments to %s.',mfilename);
    end
    okargs = {'tofile','mirrorsite','sequenceonly'};
    for j=1:2:nargin-2
        pname = varargin{j};
        pval = varargin{j+1};
        k = strmatch(lower(pname), okargs);%#ok
        if isempty(k)
            error('Bioinfo:getpdb:UnknownParameterName',...
                'Unknown parameter name: %s.',pname);
        elseif length(k)>1
            error('Bioinfo:getpdb:AmbiguousParameterName',...
                'Ambiguous parameter name: %s.',pname);
        else
            switch(k)
                case 1 % tofile
                    if ischar(pval)
                        tofile = true;
                        filename = pval;
                    end
                case 2 % mirrorsite
                    if ischar(pval)
                        mirrorsite = pval;
                        if isempty(strfind(mirrorsite,'/pdb'))
                            error('Bioinfo:getpdb:BadMirrorSite',...
                                'MIRROR string does not appear to be a PDB mirror site.');
                        end
                    end
                case 3 % sequenceonly
                    seqonly = opttf(pval);
                    if isempty(seqonly)
                        error('Bioinfo:getpdb:InputOptionNotLogical','%s must be a logical value, true or false.',...
                            upper(char(okargs(k))));
                    end
            end
        end
    end
end
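The property-name matching on this slide (case-insensitive, with unambiguous prefixes accepted) can be tried outside getpdb. The sketch below is an illustration, not the toolbox code: it uses strncmpi in place of strmatch, which MATLAB no longer recommends, and the args cell array and the opts structure are hypothetical stand-ins for varargin and the local variables of the real function.

```matlab
% Hypothetical inputs standing in for varargin in a real function.
args   = {'ToFile', '1a04.txt', 'seq', true};
okargs = {'tofile', 'mirrorsite', 'sequenceonly'};

% Defaults; the mirror site default is the one shown on the slide.
opts = struct('tofile', '', ...
              'mirrorsite', 'http://www.rcsb.org/pdb', ...
              'sequenceonly', false);

if rem(numel(args), 2) ~= 0
    error('Sketch:IncorrectNumberOfArguments', 'Arguments must come in pairs.');
end

for j = 1:2:numel(args)
    pname = args{j};
    pval  = args{j+1};
    % Case-insensitive prefix match: 'seq' selects 'sequenceonly'.
    k = find(strncmpi(pname, okargs, length(pname)));
    if isempty(k)
        error('Sketch:UnknownParameterName', 'Unknown parameter name: %s.', pname);
    elseif numel(k) > 1
        error('Sketch:AmbiguousParameterName', 'Ambiguous parameter name: %s.', pname);
    end
    opts.(okargs{k}) = pval;   % store under the full property name
end
```

Storing the values in a structure keyed by the full property name avoids the numeric switch over k, at the cost of skipping the per-property type checks that the original performs.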
Detail Investigation of the function codes: (getpdb.m)

% error if ID isn't a string
if ~ischar(pdbID)
    error('Bioinfo:getpdb:NotString','Access Number is not a string.')
end
% get sequence from pdb.fasta if SEQUENCEONLY is true, otherwise full pdb
if seqonly == true
    searchurl = [mirrorsite '/cgi/getSequence.cgi/' pdbID '.fasta?chId=' pdbID '&format=fasta'];
    [header, pdb] = fastaread(searchurl);%#ok
else
    searchurl = [mirrorsite '/cgi/explore.cgi?job=download&pdbId=' pdbID '&opt=show&format=PDB&pre=1'];
    % get the html file that is returned as a string
    s=urlread(searchurl);
    % replace the html version of &
    s=strrep(s,'&amp;','&');
    % Find first line of the actual data
    start = regexp(s,'\nHEADER');

Functions used:
• fastaread: reads a FASTA-formatted file
• urlread: returns the contents of a URL as a string
• strrep: replaces one substring with another
• regexp: matches a regular expression
Detail Investigation of the function codes: (e.g., getpdb.m)

    if isempty(start)
        % search for text indicating that there weren't any files found
        notfound=regexp(s,'Your query found .*NO.* structures');
        % string was found, meaning no results were found
        if ~isempty(notfound),
            error('Bioinfo:getpdb:PDBIDNotFound','The ID you were searching for, %s, was not found in the PDB database.',pdbID) ;
        end
        error('Bioinfo:getpdb:PDBIDAccessProblem','Unknown problem accessing entry %s in the PDB database.',pdbID);
    end
    [dummy, endOfFile] = regexp(s,'\nEND.*?\n');%#ok
    % shorten string, to search for uid info
    s=s(start+1:endOfFile);
    %make each line a separate row in string array
    pdbdata = char(strread(s,'%s','delimiter','\n','whitespace',''));
    %pass to PDBREAD to create structure
    pdb=pdbread(pdbdata);
end

Functions used:
• char: creates a character array (string)
• strread: reads formatted data from a string
• pdbread: reads a Protein Data Bank file into a structure
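The extraction step on the last two slides (undo the HTML escaping, locate HEADER and END, split into rows) can be reproduced offline. The sketch below is an illustration under stated assumptions: the string s is a hypothetical stand-in for what urlread would return, and the split uses regexp with the 'split' option rather than strread, which newer MATLAB releases discourage.

```matlab
% Hypothetical page text standing in for the string urlread would return.
s = sprintf(['<html><pre>\n' ...
             'HEADER    EXAMPLE RECORD\n' ...
             'TITLE     A &amp; B\n' ...
             'END\n' ...
             '</pre></html>\n']);

s = strrep(s, '&amp;', '&');                       % undo the HTML escaping of '&'
start = regexp(s, '\nHEADER', 'once');             % index of the newline before HEADER
[~, endOfFile] = regexp(s, '\nEND.*?\n', 'once');  % index of the newline after END

s = s(start+1:endOfFile);                          % keep HEADER ... END only

% One record per row; char pads shorter lines with trailing blanks.
textLines = regexp(strtrim(s), '\n', 'split');
pdbdata   = char(textLines);
```

The resulting character array has the same one-record-per-row layout that the original code passes to pdbread.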
Detail Investigation of the function codes: (e.g., getpdb.m)

if nargout
    pdbstruct = pdb;
    if ~seqonly
        % add URL
        pdbstruct.SearchURL = searchurl;
    end
else
    if seqonly || ~usejava('desktop')
        disp(pdb);
    else
        disp(pdb);
        disp([char(9) 'SearchURL: <a href="' searchurl '"> ' pdbID ' </a>']);
    end
end
% write out file
if tofile == true
    writefile = 'Yes';
    % check to see if file already exists
    if exist(filename,'file')
        % use dialog box to display options
        writefile=questdlg(sprintf('The file %s already exists. Do you want to overwrite it?',filename), ...
            '', ...
            'Yes','No','Yes');
    end
    switch writefile,
        case 'Yes',
            if exist(filename,'file')
                disp(['File ' filename ' overwritten.']);
            end
            savedata(filename,pdbdata);
        case 'No',
            disp(['File ' filename ' not written.']);
    end
end

Functions used:
• questdlg(Question): creates a modal dialog box that displays Question (a string or a cell array of strings), automatically wrapped to fit an appropriately sized window
Detail Investigation of the function codes: (e.g., getpdb.m)

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function savedata(filename,pdbtext)
fid=fopen(filename,'wb');
rows = size(pdbtext,1);
for rcount=1:rows-1,
    fprintf(fid,'%s\n',pdbtext(rcount,:));
end
fprintf(fid,'%s',pdbtext(rows,:));
fclose(fid);

Functions used:
• fopen: opens a file (here with 'wb', for writing in binary mode)
• fprintf: writes formatted data to a file
• fclose: closes a file opened with fopen
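The row-by-row write in savedata can be exercised with a small character array and a scratch file. This is a minimal sketch, not the toolbox code: the two-row pdbtext is hypothetical, the file goes to a temporary location from tempname, and a check on the fopen result is added as an assumption about good practice (the original does not test fid).

```matlab
% Hypothetical two-record character array, one PDB line per row.
pdbtext  = char('HEADER    EXAMPLE RECORD', 'END');
filename = [tempname '.txt'];          % scratch file in the system temp folder

fid = fopen(filename, 'w');            % MATLAB opens files in binary mode by default
if fid == -1
    error('Sketch:CannotOpenFile', 'Could not open %s for writing.', filename);
end
rows = size(pdbtext, 1);
for rcount = 1:rows-1
    fprintf(fid, '%s\n', pdbtext(rcount, :));
end
fprintf(fid, '%s', pdbtext(rows, :));  % last row is written without a trailing newline
fclose(fid);

written = fileread(filename);          % read the text back for inspection
delete(filename);
```

Because pdbtext is a padded character array, every written line keeps its trailing blanks, so all lines in the file have the same length.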
Detail Investigation of the functions related to getpdb.m code:
• pdbread reads data from a PDB-formatted file into MATLAB
• Syntax
  • PDBData = pdbread('File')
• Arguments
  • File: Protein Data Bank (PDB) formatted file (ASCII text file). Enter a filename, a path and filename, or a URL pointing to a file. File can also be a MATLAB character array that contains the text for a PDB file.
• The data stored in each record of the PDB file is converted, where appropriate, to a MATLAB structure. For example, the ATOM records in a PDB file are converted to an array of structures with the following fields: AtomSerNo, AtomName, altLoc, resName, chainID, resSeq, iCode, X, Y, Z, occupancy, tempFactor, segID, element, and charge.
• The sequence information from the PDB file is stored in the Sequence field of PDBData. The sequence information is itself a structure with the fields NumOfResidues, ChainID, ResidueNames, and Sequence. The field ResidueNames contains the three-letter codes for the sequence residues. The field Sequence contains the single-letter codes for the sequence. If the sequence has modified residues, the ResidueNames might not correspond to the standard three-letter amino acid codes, in which case the field Sequence contains a ? in the position corresponding to the modified residue.
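To make the ATOM fields listed above concrete, the sketch below pulls a few of them out of a single record using the fixed column positions of the PDB file format. It is an illustration only and makes no claim about how pdbread is implemented; the record is a hypothetical example built with sprintf so that the column widths are exact, and only the field names come from the slide.

```matlab
% Hypothetical ATOM record; sprintf keeps every field in its fixed columns.
rec = sprintf('ATOM  %5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f%6.2f%6.2f', ...
              1, ' N', ' ', 'MET', 'A', 1, ' ', 27.340, 24.430, 2.614, 1.00, 9.67);

% Column ranges follow the PDB format description of the ATOM record.
atom.AtomSerNo  = str2double(rec(7:11));
atom.AtomName   = strtrim(rec(13:16));
atom.altLoc     = strtrim(rec(17));
atom.resName    = strtrim(rec(18:20));
atom.chainID    = rec(22);
atom.resSeq     = str2double(rec(23:26));
atom.iCode      = strtrim(rec(27));
atom.X          = str2double(rec(31:38));
atom.Y          = str2double(rec(39:46));
atom.Z          = str2double(rec(47:54));
atom.occupancy  = str2double(rec(55:60));
atom.tempFactor = str2double(rec(61:66));
```

Fixed-column indexing is used instead of splitting on whitespace because adjacent PDB fields are allowed to touch, for example when a coordinate fills its full eight-character width.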
Detail Investigation of the function codes: (pdbread)
Examples:
• Get information for Nitrate/Nitrite Response Regulator Protein NarL with PDB ID 1A04 from the Protein Data Bank and store it in the file 1a04.txt
  • getpdb('1A04', 'ToFile', '1a04.txt')
• See the content of the file 1a04.txt (5310-line text)
• Now read the file back into MATLAB
  • pdbdata = pdbread('1a04.txt')
• Let's try this with PDB ID 1a14
  • Now, we will see a 5680-line text file
Detail Investigation of the functions related to getpdb.m code:
• The fastaread function reads data from a FASTA-formatted file into a MATLAB structure with Header and Sequence fields
• Syntax
  • FASTAData = fastaread('File')
  • [Header, Sequence] = fastaread('File')
  • fastaread(..., 'PropertyName', PropertyValue,...)
  • fastaread(..., 'IgnoreGaps', IgnoreGapsValue)
• Arguments
  • File: FASTA-formatted file (ASCII text file). Enter a filename, a path and filename, or a URL pointing to a file. File can also be a MATLAB character array that contains the text of a FASTA file.
  • IgnoreGapsValue: Property to control removing gap symbols.
  • FASTAData: MATLAB structure with the fields Header and Sequence
• Example
  • Reading the human mitochondrion genome in FASTA format

entrezSite = 'http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?';
textOptions = '&txt=on&view=fasta';
genbankID = '&list_uids=NC_001807';
mitochondrion = fastaread([entrezSite textOptions genbankID])
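The Header and Sequence fields that fastaread returns can be illustrated without the toolbox or a network connection. The sketch below parses a single-record FASTA string in base MATLAB; it is a simplified stand-in, not the fastaread implementation, and the txt string is a hypothetical example. Multi-record files and the IgnoreGaps property are not handled.

```matlab
% Hypothetical single-record FASTA text with a wrapped sequence.
txt = sprintf('>seq1 example header\nACGTAC\nGTTA\n');

textLines = regexp(strtrim(txt), '\r?\n', 'split');
if isempty(textLines{1}) || textLines{1}(1) ~= '>'
    error('Sketch:NotFasta', 'FASTA text must start with a ''>'' header line.');
end

header   = textLines{1}(2:end);      % drop the leading '>'
sequence = [textLines{2:end}];       % join the wrapped sequence lines

FASTAData = struct('Header', header, 'Sequence', sequence);
```

The structure mirrors the two-field layout described above, so code written against FASTAData.Header and FASTAData.Sequence would read the same way with the real function.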