310 likes | 631 Views
Python. What is Biopython ?. Biopython is a python library of resources for developers of Python-base software for bioinformatics and research. can parse bioinformatics files into local data structures Fasta , GenBank , Blast output Clustalw etc.
E N D
What is Biopython? • Biopython is a python library of resources for developers of Python-base software for bioinformatics and research. • can parse bioinformatics files into local data structures • Fasta, GenBank, Blast output Clustalw etc. • Can access many files directly ( web database, NCBI) from within the script. • Works with sequences and records • Many search algorithms, comparative algorithms and format options.
Installing BioPython • Comes with Anaconda. You don’t even have to type in the import commands! • If you use the standard IDLE environment you will need to download BioPython and place it in the proper directory. • Bioinformatics has become so important in recent years that almost every programming environment, C++, Perl, etc has its own Bioinfo libraries.
Sequence objects • Biological sequences represent the main point of interest in Bioinformatics processing. Python includes a special datatype called a Sequence. • Sequence objects are not the same as Python strings. They are really strings together with additional information, such as an alphabet, and a variety of methods such as translate(), reverse_complement() and so on. • dna = ‘AGTACACTGGT ‘ this is a pure string • // Here is how you create a sequence object. • seqdna = Seq(‘AGTACACTGGT ‘, Alphabet()) sequence obj • Note that seqdna is a sequence object not just a string.
Alphabets - See IUPAC (international union of pure and applied chemistry) • Alphabets are just the set of allowable characters that are used in the string. • IUPAC.unambiguous_dna is really just the set {A,C,G, T} of nucleotides. • IUPAC.unambiguous_rna is {A,C,G,U} • IUPAC.protein is just the 20 standard amino acids {A,R,N,D,C,Q,E,H,I,L,K,M,F,P,S,T,W,Y,V} • and others • We will use mainly the {A,C,G,T} DNA set. • Nice for type checking our sequences.
Dumping Alphabets • from Bio.Alphabet import IUPAC • print IUPAC.unambiguous_dna.letters • print IUPAC.ambiguous_dna.letters • print IUPAC.unambiguous_rna.letters • print IUPAC.protein.letters • OUTPUT • GATC • GATCRYWSMKHBVDN • GAUC • ACDEFGHIKLMNPQRSTVWY
Can work with Sequence objects like strings • from Bio.Seq import Seq • from Bio.Alphabet import IUPAC • my_seq= Seq("GATCG", IUPAC.unambiguous_dna) • print my_seq[0] prints first letter • print len(my_seq) print length of string in sequence • print Seq(“AAAA”).count(“AA”) non overlapping count ie 2 • print GC(my_seq) Gives the GC % of the sequence. • print my_seq[2:5] We can even slice them. Returns a Seq. • #convert seqobj to a pure string obj • dna_string = str(my_seq)
MutaableSeq • >>>from Bio.Seq import Seq • >>>from Bio.Alphabet import IUPAC • my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna) • Observe what happens if you try to edit the sequence: • >>> my_seq[5] = "G" Traceback (most recent call last): ... TypeError: ’Seq’ object does not support item assignment However, you can convert it into a mutable sequence (a MutableSeq object) and do pretty much anything you want with it: • >>> mutable_seq= my_seq.tomutable() • >>> mutable_seqMutableSeq(’GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA’, IUPACUnambiguousDNA())
We can modify these • >>> mutable_seqMutableSeq(’GCCATTGTAATGGGCCGCTGAAAGGGTGCCC’, IUPACUnambiguousDNA()) • >>> mutable_seq[5] = "C" >>> mutable_seqMutableSeq(’GCCATCGTAATGGGCCGCTGAAAGGGTGCCC’, IUPACUnambiguousDNA()) >>> mutable_seq.remove("T") >>> mutable_seqMutableSeq(’GCCACGTAATGGGCCGCTGAAAGGGTGCCC’, IUPACUnambiguousDNA()) • >>> mutable_seq.reverse() • >>> mutable_seqMutableSeq(’CCCGTGGGAAAGTCGCCGGGTAATGCACCG’, IUPACUnambiguousDNA())
Nucleotide sequences and (reverse) complements • >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC • >>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna) • >>> my_seq • Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA()) • >>> my_seq.complement() Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA()) • >>> my_seq.reverse_complement() Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())
Reversing a Sequence • an easy way to just reverse a Seq object (or a Python string) is slice it with -1 step • # FORWARD • >>> my_seqSeq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA()) • #BACKWARD ( Using a -1 step slice ) • >>> my_seq[::-1] Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA())
Double Stranded DNA DNA coding strand (aka Crick strand, strand +1) 5’ ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG 3’ ||||||||||||||||||||||||||||||||||||||| 3’ TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC 5’ DNA template strand (aka Watson strand, strand −1)
Transcription • 5’ ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG 3’ • ||||||||||||||||||||||||||||||||||||||| • 3’ TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC 5’ • Transcription • 5’ AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG 3’ Single stranded messenger RNA
Lets do some Reverse Comp • from Bio.Seq import Seq • from Bio.Alphabet import IUPAC • coding_dna = Seq(“ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna) • template_dna= coding_dna.reverse_complement() • print template_dna • CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT
Transcribe ( T->U ) • from Bio.Seq import Seq • from Bio.Alphabet import IUPAC • coding_dna = Seq(“ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna) • messenger_rna = coding_dna.transcribe() • print messenger_rna • AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG • //or you can do both • messenger_rna = coding_dna.reverse_complement().transcribe()
Translate into protein • from Bio.Seq import Seq • from Bio.Alphabet import IUPAC • messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna) • print messenger_rna • print messenger_rna.translate() # I added the spaces • AUG GCC AUU GUA AUG GGC CGC UGA AAG GGU GCC CGA UAG • MAIVMGR*KGAR* • # the * represents stop codons.
Printing Tables • from Bio.Seq import Seq • from Bio.Alphabet import IUPAC • from Bio.Data import CodonTable • stdTable = CodonTable.unambiguous_dna_by_id[1] • print stdTable • mitoTable= CodonTable.unambiguous_dna_by_id[2] • print mitoTable
Table 1 Standard, SGC0 | T | C | A | G | --+---------+---------+---------+---------+-- T | TTT F | TCT S | TAT Y | TGT C | T T | TTC F | TCC S | TAC Y | TGC C | C T | TTA L | TCA S | TAA Stop| TGA Stop| A T | TTG L(s)| TCG S | TAG Stop| TGG W | G --+---------+---------+---------+---------+-- C | CTT L | CCT P | CAT H | CGT R | T C | CTC L | CCC P | CAC H | CGC R | C C | CTA L | CCA P | CAA Q | CGA R | A C | CTG L(s)| CCG P | CAG Q | CGG R | G --+---------+---------+---------+---------+-- A | ATT I | ACT T | AAT N | AGT S | T A | ATC I | ACC T | AAC N | AGC S | C A | ATA I | ACA T | AAA K | AGA R | A A | ATG M(s)| ACG T | AAG K | AGG R | G --+---------+---------+---------+---------+-- G | GTT V | GCT A | GAT D | GGT G | T G | GTC V | GCC A | GAC D | GGC G | C G | GTA V | GCA A | GAA E | GGA G | A G | GTG V | GCG A | GAG E | GGG G | G --+---------+---------+---------+---------+-- Table 2 Vertebrate Mitochondrial, SGC1 | T | C | A | G | --+---------+---------+---------+---------+-- T | TTT F | TCT S | TAT Y | TGT C | T T | TTC F | TCC S | TAC Y | TGC C | C T | TTA L | TCA S | TAA Stop| TGA W | A T | TTG L | TCG S | TAG Stop| TGG W | G --+---------+---------+---------+---------+-- C | CTT L | CCT P | CAT H | CGT R | T C | CTC L | CCC P | CAC H | CGC R | C C | CTA L | CCA P | CAA Q | CGA R | A C | CTG L | CCG P | CAG Q | CGG R | G --+---------+---------+---------+---------+-- A | ATT I(s)| ACT T | AAT N | AGT S | T A | ATC I(s)| ACC T | AAC N | AGC S | C A | ATA M(s)| ACA T | AAA K | AGA Stop| A A | ATG M(s)| ACG T | AAG K | AGG Stop| G --+---------+---------+---------+---------+-- G | GTT V | GCT A | GAT D | GGT G | T G | GTC V | GCC A | GAC D | GGC G | C G | GTA V | GCA A | GAA E | GGA G | A G | GTG V(s)| GCG A | GAG E | GGG G | G --+---------+---------+---------+---------+--
The SeqRecord Object • A SeqRecord is a structure that allows the storage of additional information with a sequence. This includes the usual information found in standard genbank files. The following is a sampleof the fields in SeqRecord. • .seq - The sequence • .id - The primary ID used to identify the sequence (String) • .name – The common name of the sequence • .annotations – A dictionary of additional information about the sequence • .features –A list of SeqFeatureobjects • and others.
Build From Scratch • >>> from Bio.Seq import Seq • >>> simple_seq = Seq("GATC") • >>> from Bio.SeqRecord import SeqRecord • >>> simple_seq_r= SeqRecord(simple_seq) • or pass in the id, description etc. • >>> simple_seq_r.id = "AC12345" • >>> simple_seq_r.description = "Made up sequence I wish I could write a paper about" • >>> print(simple_seq_r.description) • Made up sequence I wish I could write a paper about • >>> simple_seq_r.seq • Seq(’GATC’, Alphabet())
Fill SeqRecord from Fasta file • >gi|45478711|ref|NC_005816.1| Yersinia pestisbiovar Microtus ... pPCP1, complete sequence TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCC • ------------------------------------------------------------------------------------------ • >>> from Bio import SeqIO • # Note that SeqIO.read will only read one record. • >>> record = SeqIO.read("NC_005816.fna", "fasta") • >>> record SeqRecord(seq=Seq(’TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG’, SingleLetterAlphabet()), id=’gi|45478711|ref|NC_005816.1|’, name=’gi|45478711|ref|NC_005816.1|’, description=’gi|45478711|ref|NC_005816.1| Yersinia pestisbiovar Microtus ... sequence’, dbxrefs=[])
ACCEss the fields Individually • >>> record.seqSeq(’TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG’, SingleLetterAlphabet()) • >>> record.id • ’gi|45478711|ref|NC_005816.1|’ • >>> record.description • ’gi|45478711|ref|NC_005816.1| Yersinia pestisbiovar Microtus ... pPCP1, complete sequence’
These are missing • >>> record.dbxrefs [] • >>> record.annotations {} • >>> record.letter_annotations {} • >>> record.features [] • Note which is a dict and which is a list!!
Reading Genbank Files • LOCUS NC_005816 9609 bp DNA circular BCT 21-JUL-2008 • DEFINITION Yersinia pestisbiovar Microtus str. 91001 plasmid pPCP1, complete sequence. • ACCESSION NC_005816 • VERSION NC_005816.1 GI:45478711 • PROJECT GenomeProject:10638
Read the File • >>> from Bio import SeqIO • >>> record = SeqIO.read("NC_005816.gb", "genbank") • >>> record SeqRecord(seq=Seq(’TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG’, IUPACAmbiguousDNA()), id=’NC_005816.1’, name=’NC_005816’, description=’Yersinia pestisbiovar Microtus str. 91001 plasmid pPCP1, complete sequence.’, dbxrefs=[’Project:10638’])
Read a record • from Bio import SeqIO • record = SeqIO.read("micoplasmaGen.gb","genbank") • print record.description • ct=0 • for f in record.features: • if f.type=='gene': • ct+=1 • print ct