Python

Python

What is Biopython? • Biopython is a python library of resources for developers of Python-base software for bioinformatics and research. • can parse bioinformatics files into local data structures • Fasta, GenBank, Blast output Clustalw etc. • Can access many files directly ( web database, NCBI) from within the script. • Works with sequences and records • Many search algorithms, comparative algorithms and format options.

Installing BioPython • Comes with Anaconda. You don’t even have to type in the import commands! • If you use the standard IDLE environment you will need to download BioPython and place it in the proper directory. • Bioinformatics has become so important in recent years that almost every programming environment, C++, Perl, etc has its own Bioinfo libraries.

Sequence objects • Biological sequences represent the main point of interest in Bioinformatics processing. Python includes a special datatype called a Sequence. • Sequence objects are not the same as Python strings. They are really strings together with additional information, such as an alphabet, and a variety of methods such as translate(), reverse_complement() and so on. • dna = ‘AGTACACTGGT ‘  this is a pure string • // Here is how you create a sequence object. • seqdna = Seq(‘AGTACACTGGT ‘, Alphabet())  sequence obj • Note that seqdna is a sequence object not just a string.

Alphabets - See IUPAC (international union of pure and applied chemistry) • Alphabets are just the set of allowable characters that are used in the string. • IUPAC.unambiguous_dna is really just the set {A,C,G, T} of nucleotides. • IUPAC.unambiguous_rna is {A,C,G,U} • IUPAC.protein is just the 20 standard amino acids {A,R,N,D,C,Q,E,H,I,L,K,M,F,P,S,T,W,Y,V} • and others • We will use mainly the {A,C,G,T} DNA set. • Nice for type checking our sequences.

Dumping Alphabets • from Bio.Alphabet import IUPAC • print IUPAC.unambiguous_dna.letters • print IUPAC.ambiguous_dna.letters • print IUPAC.unambiguous_rna.letters • print IUPAC.protein.letters • OUTPUT • GATC • GATCRYWSMKHBVDN • GAUC • ACDEFGHIKLMNPQRSTVWY

Can work with Sequence objects like strings • from Bio.Seq import Seq • from Bio.Alphabet import IUPAC • my_seq= Seq("GATCG", IUPAC.unambiguous_dna) • print my_seq[0]  prints first letter • print len(my_seq)  print length of string in sequence • print Seq(“AAAA”).count(“AA”)  non overlapping count ie 2 • print GC(my_seq)  Gives the GC % of the sequence. • print my_seq[2:5]  We can even slice them. Returns a Seq. • #convert seqobj to a pure string obj • dna_string = str(my_seq)

MutaableSeq • >>>from Bio.Seq import Seq • >>>from Bio.Alphabet import IUPAC • my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna) • Observe what happens if you try to edit the sequence: • >>> my_seq[5] = "G" Traceback (most recent call last): ... TypeError: ’Seq’ object does not support item assignment However, you can convert it into a mutable sequence (a MutableSeq object) and do pretty much anything you want with it: • >>> mutable_seq= my_seq.tomutable() • >>> mutable_seqMutableSeq(’GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA’, IUPACUnambiguousDNA())

We can modify these • >>> mutable_seqMutableSeq(’GCCATTGTAATGGGCCGCTGAAAGGGTGCCC’, IUPACUnambiguousDNA()) • >>> mutable_seq[5] = "C" >>> mutable_seqMutableSeq(’GCCATCGTAATGGGCCGCTGAAAGGGTGCCC’, IUPACUnambiguousDNA()) >>> mutable_seq.remove("T") >>> mutable_seqMutableSeq(’GCCACGTAATGGGCCGCTGAAAGGGTGCCC’, IUPACUnambiguousDNA()) • >>> mutable_seq.reverse() • >>> mutable_seqMutableSeq(’CCCGTGGGAAAGTCGCCGGGTAATGCACCG’, IUPACUnambiguousDNA())

Nucleotide sequences and (reverse) complements • >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC • >>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna) • >>> my_seq • Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA()) • >>> my_seq.complement() Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA()) • >>> my_seq.reverse_complement() Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())

Reversing a Sequence • an easy way to just reverse a Seq object (or a Python string) is slice it with -1 step • # FORWARD • >>> my_seqSeq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA()) • #BACKWARD ( Using a -1 step slice ) • >>> my_seq[::-1] Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA())

Double Stranded DNA DNA coding strand (aka Crick strand, strand +1) 5’ ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG 3’ ||||||||||||||||||||||||||||||||||||||| 3’ TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC 5’ DNA template strand (aka Watson strand, strand −1)

Transcription • 5’ ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG 3’ • ||||||||||||||||||||||||||||||||||||||| • 3’ TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC 5’ • Transcription • 5’ AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG 3’ Single stranded messenger RNA

Lets do some Reverse Comp • from Bio.Seq import Seq • from Bio.Alphabet import IUPAC • coding_dna = Seq(“ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna) • template_dna= coding_dna.reverse_complement() • print template_dna • CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT

Transcribe ( T->U ) • from Bio.Seq import Seq • from Bio.Alphabet import IUPAC • coding_dna = Seq(“ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna) • messenger_rna = coding_dna.transcribe() • print messenger_rna • AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG • //or you can do both • messenger_rna = coding_dna.reverse_complement().transcribe()

Translate into protein • from Bio.Seq import Seq • from Bio.Alphabet import IUPAC • messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna) • print messenger_rna • print messenger_rna.translate() # I added the spaces • AUG GCC AUU GUA AUG GGC CGC UGA AAG GGU GCC CGA UAG • MAIVMGR*KGAR* • # the * represents stop codons.

StandardTranslation Table

Printing Tables • from Bio.Seq import Seq • from Bio.Alphabet import IUPAC • from Bio.Data import CodonTable • stdTable = CodonTable.unambiguous_dna_by_id[1] • print stdTable • mitoTable= CodonTable.unambiguous_dna_by_id[2] • print mitoTable

Codon - Amino Acids

The SeqRecord Object • A SeqRecord is a structure that allows the storage of additional information with a sequence. This includes the usual information found in standard genbank files. The following is a sampleof the fields in SeqRecord. • .seq - The sequence • .id - The primary ID used to identify the sequence (String) • .name – The common name of the sequence • .annotations – A dictionary of additional information about the sequence • .features –A list of SeqFeatureobjects • and others.

Build From Scratch • >>> from Bio.Seq import Seq • >>> simple_seq = Seq("GATC") • >>> from Bio.SeqRecord import SeqRecord • >>> simple_seq_r= SeqRecord(simple_seq) • or pass in the id, description etc. • >>> simple_seq_r.id = "AC12345" • >>> simple_seq_r.description = "Made up sequence I wish I could write a paper about" • >>> print(simple_seq_r.description) • Made up sequence I wish I could write a paper about • >>> simple_seq_r.seq • Seq(’GATC’, Alphabet())

Fill SeqRecord from Fasta file • >gi|45478711|ref|NC_005816.1| Yersinia pestisbiovar Microtus ... pPCP1, complete sequence TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCC • ------------------------------------------------------------------------------------------ • >>> from Bio import SeqIO • # Note that SeqIO.read will only read one record. • >>> record = SeqIO.read("NC_005816.fna", "fasta") • >>> record SeqRecord(seq=Seq(’TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG’, SingleLetterAlphabet()), id=’gi|45478711|ref|NC_005816.1|’, name=’gi|45478711|ref|NC_005816.1|’, description=’gi|45478711|ref|NC_005816.1| Yersinia pestisbiovar Microtus ... sequence’, dbxrefs=[])

ACCEss the fields Individually • >>> record.seqSeq(’TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG’, SingleLetterAlphabet()) • >>> record.id • ’gi|45478711|ref|NC_005816.1|’ • >>> record.description • ’gi|45478711|ref|NC_005816.1| Yersinia pestisbiovar Microtus ... pPCP1, complete sequence’

These are missing • >>> record.dbxrefs [] • >>> record.annotations {} • >>> record.letter_annotations {} • >>> record.features [] • Note which is a dict and which is a list!!

Reading Genbank Files • LOCUS NC_005816 9609 bp DNA circular BCT 21-JUL-2008 • DEFINITION Yersinia pestisbiovar Microtus str. 91001 plasmid pPCP1, complete sequence. • ACCESSION NC_005816 • VERSION NC_005816.1 GI:45478711 • PROJECT GenomeProject:10638

Read the File • >>> from Bio import SeqIO • >>> record = SeqIO.read("NC_005816.gb", "genbank") • >>> record SeqRecord(seq=Seq(’TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG’, IUPACAmbiguousDNA()), id=’NC_005816.1’, name=’NC_005816’, description=’Yersinia pestisbiovar Microtus str. 91001 plasmid pPCP1, complete sequence.’, dbxrefs=[’Project:10638’])

Read a record • from Bio import SeqIO • record = SeqIO.read("micoplasmaGen.gb","genbank") • print record.description • ct=0 • for f in record.features: • if f.type=='gene': • ct+=1 • print ct

Python

Python

Presentation Transcript

Python

Python

Python

Python

Python

Python

Python

Python

Python

Python

Python

Python

Python

Python Course | Python Programming | Python Tutorial | Python Training | Edureka

Python Tutorial | What is Python? | Python Programming