Introduction to Python
BCHB524 2008 Lecture 1
BCHB524 - 2008 - Edwards

Outline
• Why Python?
• Installation
• Basic Data Types
• Variables
• Functions
• Control Flow
• Useful References
• Reverse Complement

Why Python?
• Free
• Portable
• Object-oriented
• Clean syntax
• Dynamic
• Scientific, Commercial
• Support libraries
• Extensible
• Interactive
• Modern
http://xkcd.com/353/

Why Python for Bioinformatics?
• Good with
  • Strings
  • Files and Formats
  • Web and Databases
  • Objects and Concepts
• BioPython
  • www.biopython.org

Installation
• Python Homepage
  • www.python.org
  • >> Download >> Select Operating System
• We’ll install version 2.5.x on Windows
• OS X & Linux versions also readily available.
• Integrated development environment – IDLE

Basic Data Types
• Strings
• Integers
• Floats
• Booleans
• None
• Tuples

Basic Data Types: Integers
>>> 3
3
>>> 3*4
12
>>> 3/4
0
>>> abs(-10)
10
>>> 3%4
3
>>> 2**32
4294967296L
>>> 2**64
18446744073709551616L
>>> 2**128
340282366920938463463374607431768211456L
>>> print 2**128
340282366920938463463374607431768211456
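
The `3/4` result of `0` in the session above is integer division as performed by the Python 2.x interpreter used in these slides. The following is a small sketch, not part of the original lecture, of ways to make the intent explicit. It assumes only that the `//` operator and the `divmod` built-in are available (Python 2.2 or later), and the variable names are illustrative.

```python
# Floor division is explicit with //, regardless of Python version
whole = 3 // 4             # 0
remainder = 3 % 4          # 3

# Converting one operand to float gives a fractional result
fraction = float(3) / 4    # 0.75

# divmod returns the quotient and remainder together as a tuple:
# a 38-nucleotide sequence holds 12 complete codons with 2 bases left over
codons, leftover = divmod(38, 3)

print(whole)
print(fraction)
print(codons)
print(leftover)
```

Writing `//` when a whole-number result is wanted keeps a script's behavior the same if it is later run under an interpreter where `/` returns a float.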

Basic Data Types: Floats
>>> 3.0
3.0
>>> 3.0*4.0
12.0
>>> 3.0/4.0
0.75
>>> abs(-10.0)
10.0
>>> 2.0**32
4294967296.0
>>> 2.0**64
1.8446744073709552e+019
>>> 2.0**128
3.4028236692093846e+038
>>> print 2.0**128
3.40282366921e+038

Basic Data Types: Strings
>>> 'gcatgacgttattacgactctgtgtggcgtctgctggg'
'gcatgacgttattacgactctgtgtggcgtctgctggg'
>>> 'gcatgacgttattacgactctgtgtggcgtctgctggg'[0]
'g'
>>> 'gcatgacgttattacgactctgtgtggcgtctgctggg'[-1]
'g'
>>> 'gcatgacgttattacgactctgtgtggcgtctgctggg'[0:4]
'gcat'
>>> 'ATTCG'+'ATTCG'
'ATTCGATTCG'
>>> 'ATTCG'*6
'ATTCGATTCGATTCGATTCGATTCGATTCG'
>>> len('gcatgacgttattacgactctgtgtggcgtctgctggg')
38
>>> 'gcatgacgttattacgactctgtgtggcgtctgctggg'.upper()
'GCATGACGTTATTACGACTCTGTGTGGCGTCTGCTGGG'
>>> 'gcatgacgttattacgactctgtgtggcgtctgctggg'.count('a')
5
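
The indexing and `[0:4]` slicing shown above extend to a few other slice forms that are handy for sequence work. This is a hedged sketch rather than material from the lecture: it assumes an interpreter that supports a step in string slices (Python 2.3 or later), and the variable names are made up for illustration.

```python
seq = 'gcatgacgttattacgactctgtgtggcgtctgctggg'

first_codon = seq[0:3]      # characters 0, 1, 2
last_codon = seq[-3:]       # the final three characters
without_start = seq[3:]     # everything after the first codon
reversed_seq = seq[::-1]    # a step of -1 walks the string backwards

print(first_codon)
print(last_codon)
print(len(without_start))
print(reversed_seq)
```

Slices never raise an error when they run past the end of the string, so `seq[36:39]` on this 38-character sequence quietly returns the two characters that remain.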

Basic Data Types: The Rest
• Special literal values: True, False, and None
>>> print True, False, None
True False None
• Tuples – pairs, triples, etc.
>>> print ('A','T','G')
('A', 'T', 'G')
>>> print (2.25,4.125,'a')
(2.25, 4.125, 'a')

Variables
• Variables store values for later use
>>> seq = 'gcatgacgttattacgactctgtgtggcgtctgctggg'
>>> len(seq)
38
>>> seq = seq * 3
>>> len(seq)
114
>>> met = ('A','T','G')
>>> print met
('A', 'T', 'G')

Using Functions
• Execute a small (predefined) task
>>> abs(-10)
10
>>> min(1,2,3,4,5,6)
1
>>> max(1,2,3,4,5,6)
6
>>> int(2.6)
2
>>> float('2.5')
2.5
>>> int(float('2.5'))
2

Using Methods
• Execute a small task with a specific object
>>> seq = 'gcatgacgttattacgactctgtgtggcgtctgctggg'
>>> seq.count('a')
5
>>> seq.upper()
'GCATGACGTTATTACGACTCTGTGTGGCGTCTGCTGGG'
>>> seq.endswith('tggg')
True
>>> seq.find('tggg')
34
>>> seq.upper().find('TGGG')
34

Defining New Functions
• Describe how to execute a small task
>>> def bytwo(x):
        return x*2

>>> bytwo(2)
4
>>> bytwo(2.5)
5.0
>>> bytwo(2.75)
5.5
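
A new function can also build on the string methods from the earlier slides. The sketch below is an illustration and not part of the lecture: `gc_count` and `gc_fraction` are hypothetical names, and `gc_fraction` assumes a non-empty sequence.

```python
def gc_count(seq):
    # Count G and C nucleotides, ignoring upper/lower case
    seq = seq.upper()
    return seq.count('G') + seq.count('C')

def gc_fraction(seq):
    # float() forces a fractional result even under Python 2 integer division.
    # Assumption: seq is not empty (an empty string would divide by zero).
    return float(gc_count(seq)) / len(seq)

print(gc_count('ATTCG'))
print(gc_fraction('ATTCG'))
```

Having one function call another, as `gc_fraction` does here, keeps each task small in the sense the slide describes.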

If Statements
• Conditional execution
if seq.startswith('atg'):
    initMet = True
    seq = seq[3:]
else:
    initMet = False
• Note use of indentation to define a block!

For Statements
• Sequential execution
count = 0
for nuc in seq:
    if nuc == 'a':
        count = count + 1
print count
• Note use of indentation to define a block!
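
The same loop pattern extends to tallying every nucleotide in one pass by adding `elif` branches. This is a sketch under stated assumptions, not the lecture's own code: the counter names are illustrative, and `print` is called with a single value so that the block behaves the same under Python 2.5 and Python 3.

```python
seq = 'gcatgacgttattacgactctgtgtggcgtctgctggg'

a = 0
c = 0
g = 0
t = 0
other = 0
for nuc in seq.upper():
    if nuc == 'A':
        a = a + 1
    elif nuc == 'C':
        c = c + 1
    elif nuc == 'G':
        g = g + 1
    elif nuc == 'T':
        t = t + 1
    else:
        # Anything that is not A, C, G, or T (for example N)
        other = other + 1

print(a)
print(c)
print(g)
print(t)
print(other)
```

Each branch gives the same answer as the corresponding `seq.count(...)` call. The loop walks the sequence once, whereas four `count` calls each scan it separately.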

References
• Websites
  • www.python.org
    • >> Documentation >> Library Reference
  • “Module Docs” in Windows
    • >> Start Menu >> Python >> Module Docs
  • www.biopython.org
    • >> Documentation
• Books
  • Lutz and Ascher, “Learning Python”
  • Kinser, “Python for Bioinformatics”

DNA as a string
seq = 'gcatgacgttattacgactctgtgtggcgtctgctgggg'
seqlen = len(seq)
# set i to 0, 3, 6, 9, ..., 36
for i in range(0,seqlen,3):
    # As a tuple
    codon = (seq[i],seq[i+1],seq[i+2])
    # As a string
    codon = seq[i:i+3]
    print codon
print "Number of Met. amino-acids", seq.count('atg')

DNA as a string
• What about upper and lower case?
  • ATG vs atg?
• Differences between DNA and RNA sequence?
  • Substitute U for each T?
• How about ambiguous nucleotide symbols?
  • What should we do with 'N' and other ambiguity codes (R, Y, W, S, M, K, H, B, V, D)?
• Strings don’t know any biology!
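
One possible way to approach the questions above with plain string methods is sketched here. None of this comes from the lecture: `normalize`, `transcribe`, and `count_ambiguous` are hypothetical helper names, and treating every symbol outside A, C, G, T, U as ambiguous is a simplifying assumption.

```python
def normalize(seq):
    # Remove surrounding whitespace and settle on upper case (ATG vs atg)
    return seq.strip().upper()

def transcribe(dna):
    # DNA to RNA: substitute U for each T
    return normalize(dna).replace('T', 'U')

def count_ambiguous(seq):
    # Count symbols that are not plain A, C, G, T, or U.
    # Assumption: anything else (N, R, Y, ...) is an ambiguity code.
    count = 0
    for nuc in normalize(seq):
        if nuc not in 'ACGTU':
            count = count + 1
    return count

print(transcribe('atgttt'))
print(count_ambiguous('ACGTNNRY'))
```

The functions only encode conventions the programmer chooses. The string itself still knows no biology, which is the point of the slide.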

DNA as a string
seq = 'gcatgacgttattacgactctgtgtggcgtctgctgggg'
def inFrameMet(seq):
    seqlen = len(seq)
    count = 0
    for i in range(0,seqlen,3):
        codon = seq[i:i+3]
        if codon.upper() == 'ATG':
            count = count + 1
    return count
print "Number of Met. amino-acids", inFrameMet(seq)
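
`inFrameMet` reads codons starting at position 0 only, so an `atg` that begins at position 2 (as in this sequence) is not counted. The sketch below is an illustrative extension, not the lecture's code: it adds a hypothetical `offset` parameter so the same loop can be run in each of the three forward reading frames.

```python
def met_in_frame(seq, offset=0):
    # Count ATG codons reading in steps of 3 starting at `offset`.
    # Stopping at len(seq) - 2 skips any incomplete trailing codon.
    count = 0
    for i in range(offset, len(seq) - 2, 3):
        if seq[i:i+3].upper() == 'ATG':
            count = count + 1
    return count

seq = 'gcatgacgttattacgactctgtgtggcgtctgctgggg'
frame0 = met_in_frame(seq, 0)
frame1 = met_in_frame(seq, 1)
frame2 = met_in_frame(seq, 2)
print(frame0)
print(frame1)
print(frame2)
```

For this sequence only the frame starting at offset 2 contains an ATG codon, which is why `seq.count('atg')` and the in-frame count disagree.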

DNA as a string
seq = 'gcatgacgttattacgactctgtgtggcgtctgctgggg'
def reverseComplement(seq):
    newseq = ''
    # Only upper-case A, C, G, T are matched; any other character is dropped
    for nuc in seq:
        if nuc == 'A':
            newseq = 'T'+newseq
        elif nuc == 'C':
            newseq = 'G'+newseq
        elif nuc == 'G':
            newseq = 'C'+newseq
        elif nuc == 'T':
            newseq = 'A'+newseq
    return newseq
print "Reverse complement:", reverseComplement(seq)

DNA as a string
seq = 'gcatgacgttattacgactctgtgtggcgtctgctgggg'
def reverseComplement(seq):
    # Convert to upper case so lower-case input is handled
    seq = seq.upper()
    newseq = ''
    for nuc in seq:
        if nuc == 'A':
            newseq = 'T'+newseq
        elif nuc == 'C':
            newseq = 'G'+newseq
        elif nuc == 'G':
            newseq = 'C'+newseq
        elif nuc == 'T':
            newseq = 'A'+newseq
        else:
            # Keep unrecognized symbols (e.g. N) unchanged
            newseq = nuc+newseq
    return newseq
print "Reverse complement:", reverseComplement(seq)
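
For comparison, the same result can be reached with string methods alone. This is one alternative sketched for illustration, not the lecture's approach: `reverse_complement_replace` is a hypothetical name, and the technique marks each already-complemented base in lower case so that a later `replace` call does not swap it back.

```python
def reverse_complement_replace(seq):
    # Work in upper case, so lower case is free to mark finished bases
    comp = seq.upper()
    comp = comp.replace('A', 't')   # A -> t (marked as done)
    comp = comp.replace('T', 'a')   # only original T's are still upper case
    comp = comp.replace('C', 'g')
    comp = comp.replace('G', 'c')
    # Restore upper case, then reverse with a step of -1
    return comp.upper()[::-1]

print(reverse_complement_replace('ATTCG'))
print(reverse_complement_replace('acgtn'))
```

Symbols other than A, C, G, and T pass through unchanged, matching the `else` branch in the slide's final version. Applying the function twice returns the upper-cased original, which makes a convenient self-check.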

Creating and Running Python Scripts
• Creating new scripts:
  • File >> New Window
  • Write script as desired
  • Save in My Documents >> BCHB524
• In IDLE:
  • File >> Open (browse to script.py)
  • Run >> Run Module (or just hit F5)
  • Results are in command window
• From Windows Command-Line:
  • Start >> Run (type "cmd")
  • cd "My Documents\BCHB524"
  • script.py
• Double-click on script.py

Getting user input
• Most programs operate on user input supplied at run-time.
• raw_input function
>>> seq = raw_input('Enter the DNA sequence: ')
Enter the DNA sequence: ACTGACTGACTG
>>> print seq
ACTGACTGACTG
• Command-line arguments
import sys
seq = sys.argv[1]
print seq
C:\BCHB524>script.py ACTGACTGACTG
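
`sys.argv[1]` raises an `IndexError` when the script is started without an argument, for example by double-clicking it. A defensive variant might look like the sketch below. It is an illustration only: `get_sequence` and the usage message are made up for this example, and `print` is called with a single value so the block runs under both Python 2.5 and Python 3.

```python
import sys

def get_sequence(argv):
    # argv[0] is the script name; argv[1], if present, is the sequence
    if len(argv) > 1:
        return argv[1].strip().upper()
    return None

seq = get_sequence(sys.argv)
if seq is None:
    print('usage: script.py SEQUENCE')
else:
    print(seq)
```

Passing the argument list into the function, rather than reading `sys.argv` inside it, makes the function easy to try from the interactive prompt: `get_sequence(['script.py', 'actg'])`.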

Lab Exercises
• Install Python from www.python.org in “My Documents\Python25”
• Run IDLE, check out installed and on-line help.
• Try each of the examples shown in these slides

Lab Exercises
• Download or copy-and-paste the anthrax_sasp.nuc file from the course web-site. Write Python scripts to answer:
  • Does the sequence start with Met?
  • How many nucleotides in the SASP gene?
  • How many amino-acids in the SASP protein?
• Use UniSTS (“google UniSTS”) to look up PCR markers for your favorite gene
  • For each forward and reverse primer, compute the reverse complement sequence

Lab Exercises
• Write a program to determine whether or not a given DNA sequence consists of a number of (perfect) tandem repeats.
• Test it on sequences:
  • AAAAAAAAAAAAAAAA
  • CACACACACACACAC
  • ATTCGATTCGATTCG
  • GTAGTAGTAGTAGTA
  • TCAGTCACTCACTCAG

Lab Exercises
• Write a program to test whether a PCR primer is a reverse complement palindrome.
  • Such a primer might fold and self-hybridize!
• Test your program on the following primers:
  • TTGAGTAGACGCGTCTACTCAA
  • TTGAGTAGACGTCGTCTACTCAA
  • ATATATATATATATAT
  • ATCTATATATATGTAT

Lab Exercises
• Using just the concepts introduced in Lecture 1, find as many ways as possible to code DNA reverse complement.
  • You may use any built-in function or string method.
  • You may use only basic data-types.
• Compare and critique each technique for robustness, speed, and correctness.