1 / 17

Sequence File Parsing using Biopython

Sequence File Parsing using Biopython. BCHB524 2012 Lecture 10. Review. Modules in the standard-python library: sys, os, os.path – access files, program environment zipfile, gzip – access compressed files directly urllib – access web-resources (URLs) as files

conway
Download Presentation

Sequence File Parsing using Biopython

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Sequence File Parsing using Biopython BCHB5242012Lecture 10 BCHB524 - 2012 - Edwards

  2. Review • Modules in the standard-python library: • sys, os, os.path – access files, program environment • zipfile, gzip – access compressed files directly • urllib – access web-resources (URLs) as files • csv – read delimited line based records from files • Plus lots, lots more. BCHB524 - 2012 - Edwards

  3. BioPython • Additional modules that make many common bioinformatics tasks easier • File parsing (many formats) & web-retrieval • Formal biological alphabets, codon tables, etc • Lots of other stuff… • Have to install separately • Not part of standard python, or Enthought • biopython.org BCHB524 - 2012 - Edwards

  4. Biopython: Fasta format • Most common biological sequence data format • Header/Description line • >accession description • Multi-accession sometimes represented • accession1|accession2|accession3 • lots of variations, no standardization • No prescribed format for the description • Other lines • sequence, one chunk per line. • Usually all lines, except the last, are the same length. BCHB524 - 2012 - Edwards

  5. BioPython: Bio.SeqIO import Bio.SeqIOimport sys# Check the inputiflen(sys.argv) < 2:print >>sys.stderr, "Please provide a sequence file"    sys.exit(1)# Get the sequence filenameseqfilename = sys.argv[1]# Open the FASTA file and iterate through its sequencesseqfile = open(seqfilename)for seq_record in Bio.SeqIO.parse(seqfile, "fasta"):# Print out the various elements of the SeqRecordprint"\n------NEW SEQRECORD------\n"print"seq_record.id:\n\t", seq_record.idprint"seq_record.description:\n\t",seq_record.descriptionprint"seq_record.seq:\n\t",seq_record.seqseqfile.close() BCHB524 - 2012 - Edwards

  6. Biopython: Other formats • Genbank format • From NCBI, also format for RefSeq sequence • UniProt/SwissProtflat-file format • From UniProt for SwissProt and TrEMBL • UniProt-XML format: • From UniProt for SwissProtand TrEMBL • Use the gzip module to handle compressed sequence databases BCHB524 - 2012 - Edwards

  7. BioPython: Bio.SeqIO import Bio.SeqIOimport sys# Check the inputiflen(sys.argv) < 2:print >>sys.stderr, "Please provide a sequence file"    sys.exit(1)# Get the sequence filenameseqfilename = sys.argv[1]# Open the FASTA file and iterate through its sequencesseqfile = open(seqfilename)for seq_record in Bio.SeqIO.parse(seqfile, "genbank"):# Print out the various elements of the SeqRecordprint"\n------NEW SEQRECORD------\n"print"seq_record.id:\n\t", seq_record.idprint"seq_record.description:\n\t",seq_record.descriptionprint"seq_record.seq:\n\t",seq_record.seqseqfile.close() BCHB524 - 2012 - Edwards

  8. BioPython: Bio.SeqIO import Bio.SeqIOimport sys# Check the inputiflen(sys.argv) < 2:print >>sys.stderr, "Please provide a sequence file"    sys.exit(1)# Get the sequence filenameseqfilename = sys.argv[1]# Open the FASTA file and iterate through its sequencesseqfile = open(seqfilename)for seq_record in Bio.SeqIO.parse(seqfile, "swiss"):# Print out the various elements of the SeqRecordprint"\n------NEW SEQRECORD------\n"print"seq_record.id:\n\t", seq_record.idprint"seq_record.description:\n\t",seq_record.descriptionprint"seq_record.seq:\n\t",seq_record.seqseqfile.close() BCHB524 - 2012 - Edwards

  9. BioPython: Bio.SeqIO import Bio.SeqIOimport sys# Check the inputiflen(sys.argv) < 2:print >>sys.stderr, "Please provide a sequence file"    sys.exit(1)# Get the sequence filenameseqfilename = sys.argv[1]# Open the FASTA file and iterate through its sequencesseqfile = open(seqfilename)for seq_record in Bio.SeqIO.parse(seqfile, "uniprot-xml"):# Print out the various elements of the SeqRecordprint"\n------NEW SEQRECORD------\n"print"seq_record.id:\n\t", seq_record.idprint"seq_record.description:\n\t",seq_record.descriptionprint"seq_record.seq:\n\t",seq_record.seqseqfile.close() BCHB524 - 2012 - Edwards

  10. BioPython: Bio.SeqIO and gzip import Bio.SeqIOimport sysimport gzip# Check the inputiflen(sys.argv) < 2:print >>sys.stderr, "Please provide a sequence file"    sys.exit(1)# Get the sequence filenameseqfilename = sys.argv[1]# Open the FASTA file and iterate through its sequencesseqfile = gzip.open(seqfilename)for seq_record in Bio.SeqIO.parse(seqfile, "fasta"):# Print out the various elements of the SeqRecordprint"\n------NEW SEQRECORD------\n"print"seq_record.id:\n\t", seq_record.idprint"seq_record.description:\n\t",seq_record.descriptionprint"seq_record.seq:\n\t",seq_record.seqseqfile.close() BCHB524 - 2012 - Edwards

  11. What about the other "stuff" • BioPython makes it easy to get access to non-sequence information stored in "rich" sequence databases • Annotations • Cross-References • Sequence Features • Literature BCHB524 - 2012 - Edwards

  12. BioPython: Bio.SeqIO import Bio.SeqIOimport sysimport gzip# Check the inputiflen(sys.argv) < 2:print >>sys.stderr, "Please provide a sequence file"    sys.exit(1)# Get the sequence filenameseqfilename = sys.argv[1]# Open the FASTA file and iterate through its sequencesseqfile = gzip.open(seqfilename)for seq_record in Bio.SeqIO.parse(seqfile, "uniprot-xml"):# What else is available in the SeqRecord?print"\n------NEW SEQRECORD------\n"print"repr(seq_record)\n\t",repr(seq_record)print"dir(seq_record)\n\t",dir(seq_record)breakseqfile.close() BCHB524 - 2012 - Edwards

  13. BioPython: Bio.SeqRecord import Bio.SeqIOimport sysimport gzip# Check the inputiflen(sys.argv) < 2:print >>sys.stderr, "Please provide a sequence file"    sys.exit(1)# Get the sequence filenameseqfilename = sys.argv[1]# Open the FASTA file and iterate through its sequencesseqfile = gzip.open(seqfilename)for seq_record in Bio.SeqIO.parse(seqfile, "uniprot-xml"):# Print out the various elements of the SeqRecordprint"\n------NEW SEQRECORD------\n"print"seq_record.annotations\n\t",seq_record.annotationsprint"seq_record.features\n\t",seq_record.featuresprint"seq_record.dbxrefs\n\t",seq_record.dbxrefsprint"seq_record.format('fasta')\n",seq_record.format('fasta')breakseqfile.close() BCHB524 - 2012 - Edwards

  14. BioPython: Random access • Sometimes you want to access the sequence records "randomly"… • …to pick out the ones you want (by accession) • Why not make a dictionary, with accessions as keys, and SeqRecord values? • Use SeqIO.to_dict(…) • What if you don't want to hold it all in memory • Use SeqIO.index(…) BCHB524 - 2012 - Edwards

  15. BioPython: Bio.SeqIO.to_dict(…) import Bio.SeqIOimport sys# Check the inputiflen(sys.argv) < 2:print >>sys.stderr, "Please provide a sequence file"    sys.exit(1)# Get the sequence filenameseqfilename = sys.argv[1]# Open the sequence databaseseqfile = open(seqfilename)# Use to_dict to make a dictionary of sequence recordssprot_dict = Bio.SeqIO.to_dict(Bio.SeqIO.parse(seqfile, "uniprot-xml"))# Close the fileseqfile.close()# Access and print a sequence recordprint sprot_dict['Q6GZV8'] BCHB524 - 2012 - Edwards

  16. BioPython: Bio.SeqIO.index(…) import Bio.SeqIOimport sys# Check the inputiflen(sys.argv) < 2:print >>sys.stderr, "Please provide a sequence file"    sys.exit(1)# Get the sequence filenameseqfilename = sys.argv[1]# Use to_dict to make a dictionary of sequence recordssprot_dict = Bio.SeqIO.index(seqfilename, "uniprot-xml")# Access and print a sequence recordprint sprot_dict['Q6GZV8'] BCHB524 - 2012 - Edwards

  17. Exercises • Read through and try the examples from Chapters 2-5 of BioPython's Tutorial. • Download human proteins from RefSeq and compute amino-acid frequencies for the (RefSeq) human proteome. • Which amino-acid occurs the most? The least? • Hint: access RefSeq human proteins fromftp://ftp.ncbi.nih.gov/refseq • Download human proteins from SwissProt and compute amino-acid frequencies for the SwissProt human proteome. • Which amino-acid occurs the most? The least? • Hint: access SwissProt human proteins from http://www.uniprot.org/downloads -> “Taxonomic divisions” • How similar are the human amino-acid frequencies of in RefSeq and SwissProt? BCHB524 - 2012 - Edwards

More Related