
Programming in Python

Programming in Python. Michael Schroeder Andreas Henschel {ms, ah}@biotec.tu-dresden.de. Motivation. All these sequences are winged helix DNA binding domains. How can we group them into families?. mkdntvplkliallangefhsgeqlgetlgmsraainkhiqtlrdwgvdvftvpgkgyslpep


  1. Programming in Python Michael Schroeder Andreas Henschel {ms, ah}@biotec.tu-dresden.de

  2. Motivation All these sequences are winged helix DNA binding domains. How can we group them into families? mkdntvplkliallangefhsgeqlgetlgmsraainkhiqtlrdwgvdvftvpgkgyslpep mktvrqerlksivrilerskepvsgaqlaeelsvsrqvivqdiaylrslgynivatprgyvlagg kaltarqqevfdlirdhisqtgmpptraeiaqrlgfrspnaaeehlkalarkgvieivsgasrgirllqee mrssakqeelvkafkallkeekfssqgeivaalqeqgfdninqskvsrmltkfgavrtrnakmemvyclpaelgvptt gqrhikireiimsndietqdelvdrlreagfnvtqatvsrdikemqlvkvpmangrykyslpsdqrfnplqklkr kgqrhikireiitsneietqdelvdmlkqdgykvtqatvsrdikelhlvkvptnngsykyslpadqrfnplsklkr dvtgriaqtllnlakqpdamthpdgmqikitrqeigqivgcsretvgrilkmledqnlisahgktivvygt dikqriagffidhanttgrqtqggvivsvdftveeianligssrqttstalnslikegyisrqgrghytipnlvrlkaaa iderdkiileilekdartpfteiakklgisetavrkrvkaleekgiiegytikinpkklg elqaiapevaqslaeffavladpnrlrllsllarselcvgdlaqaigvsesavshqlrslrnlrlvsyrkqgrhvyyqlqdhhivalyqnaldhlqec mntlkkafeildfivknpgdvsvseiaekfnmsvsnaykymvvleekgfvlrkkdkryvpgyklieygsfvlrrf lfneiiplgrlihmvnqkkdrllneylsplditaaqfkvlcsircaacitpvelkkvlsvdlgaltrmldrlvckgwverlpnpndkrgvlvklttggaaiceqchqlvgqdlhqeltknltadevatleyllkkvlp nypvnpdlmpalmavfqhvrtriqseldcqrldltppdvhvlklideqrglnlqdlgrqmcrdkalitrkirelegrnlvrrernpsdqrsfqlfltdeglaihqhaeaimsrvhdelfapltpveqatlvhlldqclaaq tdilreigmiaraldsisniefkelsltrgqylylvrvcenpgiiqekiaelikvdrttaaraikrleeqgfiyrqedasnkkikriyatekgknvypiivrenqhsnqvalqglseveisqladylvrmrknvsedwefvkkg mskindindlvnatfqvkkffrdtkkkfnlnyeeiyilnhilrsesneisskeiakcsefkpyyltkalqklkdlkllskkrslqdertvivyvtdtqkaniqkliseleeyikn aitkindcfellsmvtyadklkslikkefsisfeefavltyisenkekeyylkdiinhlnykqpqvvkavkilsqedyfdkkrnehdertvlilvnaqqrkkiesllsrvnkrit miimeeakkliielfselakihglnksvgavyailylsdkpltisdimeelkiskgnvsmslkkleelgfvrkvwikgerknyyeavdgfssikdiakrkhdliaktyedlkkleekcneeekefikqkikgiermkkisekilealndld aqspagfaeeyiiesiwnnrfppgtilpaerelseligvtrttlrevlqrlardgwltiqhgkptkvnnfwets eekrsstgflvkqraflklymitmteqerlyglkllevlrsefkeigfkpnhtevyrslhellddgilkqikvkkegaklqevvlyqfkdyeaaklykkqlkveldrckkliekalsdnf 
hmqaeilltlklqqklfadprrisllkhialsgsisqgakdagisyksawdainemnqlsehilveratggkggggavltrygqrliqlydllaqiqqkafdvlsdddalplnsllaaisrfslqts skvtyiikasndvlnektatilitiakkdfitaaevrevhpdlgnavvnsnigvlikkglveksgdgliitgeaqdiisnaatlyaqenapellk sprivqsndlteaayslsrdqkrmlylfvdqirksdgtlqehdgiceihvakyaeifgltsaeaskdirqalksfagkevvfyrpeedagdekgyesfpwfikpahspsrglysvhinpylipffiglq nrftqfrlsetkeitnpyamrlyeslcqyrkpdgsgivslkidwiieryqlpqsyqrmpdfrrrflqvcvneinsrtpmrlsyiekkkgrqtthivfsfrdit lglekrdreilevlilrfgggpvglatlatalsedpgtleevhepylirqgllkrtprgrvatelarrhl lglekrdreilevlilrfgggpvglatlatalsedpgtleevhepylirqgllkrtprgrvatelayrhlgypppv egldefdrkilktiieiyrggpvglnalaaslgveadtlsevyepyllqagflartprgrivtekaykhlkyevp iseevliglplheklfllaivrslkishtpyitfgdaeesykivceeygerprvhsqlwsylndlrekgivetrqnkrgegvrgrttlisigtepldtleavitklikeelr kyeltlqrslpfiegmltnlgamklhkihsflkitvpkdwgynritlqqlegylntladegrlkyiangsyeiv pmkteqkqeqetthknieedrklliqaaivrimkmrkvlkhqqllgevltqlssrfkprvpvikkcidiliekeylervdgekdtysyla gspekilaqiiqehregldwqeaatraslsleetrkllqsmaaagqvtllrvendlyaist eryqawwqavtraleefhsryplrpglareelrsryfsrlparvyqalleewsregrlqlaantvalagftps fsetqkkllkdledkyrvsrwqppsfkevagsfnldpseleellhylvregvlvkindefywhr qalgeareviknlastgpfglaeardalgssrkyvlplleyldqvkftrrvgdkrvvvgn vpkrvywemlatnltdkeyvrtrralileilikagslkieqiqdnlkklgfdevietiendikglintgifieikgrfyqlkdhilqfvipnrgvtkqlv irtfgwvqnpgkfenlkrvvqvfdrnskvhnevknikiptlvkeskiqkelvaimnqhdliytykelvgtgtsirseapcdaiiqatiadqgnkkgyidnwssdgflrwahalgfieyinksdsfvitdvglaysksad gsaiekeilieaissyppairiltlledgqhltkfdlgknlgfsgesgftslpegilldtlanampkdkgeirnnwegssdkyarmiggwldklglvkqgkkefiiptlgkpdnkefishafkitgeglkvlrrakgstkftr

  3. Motivation: Let's rebuild SCOP families • Given a SCOP superfamily and its sequences, how can we divide it into families? • First, we need dynamic programming to determine the sequence similarity • Then we do the following: • For all pairs of sequences, call the sequence similarity algorithm and record the similarity into a distance matrix • Next, run hierarchical clustering to cluster the sequences.
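
The pipeline on this slide (pairwise comparison, distance matrix, hierarchical clustering) can be sketched as follows. This is only a minimal illustration under stated assumptions: it is written for Python 3, plain edit distance stands in for the sequence similarity algorithm (the slide does not say which one is used), single linkage stands in for the clustering method, and `toy_sequences` is made-up data.

```python
# Sketch only: edit distance and single linkage are stand-ins chosen for
# illustration; the lecture does not specify either algorithm.

def edit_distance(a, b):
    """Levenshtein distance between two strings, by dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def distance_matrix(sequences):
    """Symmetric matrix holding the distance for every pair of sequences."""
    n = len(sequences)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = edit_distance(sequences[i], sequences[j])
    return matrix


def single_linkage(matrix, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(matrix))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: cluster distance is the closest pair of members.
                d = min(matrix[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


toy_sequences = ["acgtacgt", "acgtacga", "ttttgggg", "ttttgggc"]  # made-up data
families = single_linkage(distance_matrix(toy_sequences), n_clusters=2)
print(families)  # lists of indices into toy_sequences
```

Stopping at a fixed number of clusters keeps the sketch short; a real analysis would more likely cut the cluster tree at a distance threshold.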

  4. Python for Bioinformatics. Lecture 1: Datatypes and Loops. Slides derived from Ian Holmes, Department of Statistics, University of Oxford

  5. Goals of this course • Concepts of computer programming • Rudimentary Python (widely-used language) • Introduction to Bioinformatics file formats • Practical data-handling algorithms • Exposure to Bioinformatics software

  6. Literature/Material • Textbook: Python in a Nutshell, Alex Martelli • Textbook: Python Cookbook, Alex Martelli, David Ascher (both published by O'Reilly) • Python Course in Bioinformatics, K. Schuerer/C. Letondal, Pasteur University (pdf) • a lot of online material (see course homepage http://www.biotec.tu-dresden.de/schroeder/group/teaching/bioinfo2/python.html)

  7. Style of this lecture • The color scheme distinguishes the main program, the program output, and text files (files are shown in yellow, with the filename given alongside) • Interaction with the Python shell: very handy for quick tests. Helps beginners to overcome the psychological barrier: go ahead, try things out! • >>> is the prompt (Python expects input here): type a Python expression, press Enter, and the result is shown immediately

  8. General principles of programming • Make incremental changes • Test everything you do • use the Python shell for testing expressions/functions interactively • the edit-run-revise cycle • Write so that others can read it • (when possible, write with others) • Think before you write • Use a good text editor (emacs)

  9. Python/Emacs IDE

  10. Python: Motivation • Well suited for scripting (better syntax than Perl) • However, capable of Object Orientation • Hence complex data types and large projects feasible, reuse of code (BioPython) • Universal language, Applications in and beyond bioinformatics: Amber, ProHit, PyRat, PyMOL, Gene2EST/Google, CGI, Zope • Compatible with most software technologies: GUI, MPI, OpenGL, Corba, RDB • Test complicated expressions in python shell

  11. Python basics • Basic syntax of a Python program: # Elementary Python program print "Hello World" Hello World • Lines beginning with "#" are comments, and are ignored by Python • The print statement tells Python to print what follows to the screen • Single or double quotes enclose a "string literal"

  12. x = 3 print x x = "ACGCGT" print x 3 ACGCGT Variables • We can tell Python to "remember" a particular value, using the assignment operator "=": • The x is referred to as a "scalar variable". Binding site for yeast transcription factor MCB Variable names can contain alphabetic characters, numbers (but not at the start of the name), and underscore symbols "_"

  13. Variables and Objects • Everything in Python is an object • An object models a real-world entity • Objects possess methods (also called functions) that are typically applied to the object, possibly parameterized • Objects can also possess variables that describe their state • e.g. x.upper() is a parameter-less method that works on the string object x • Notation: object.method or object.variable

  14. Arithmetic operations… • Basic operators are + - / * % x = 14 y = 3 print "Sum: ", x + y print "Product: ", x * y print "Remainder: ", x % y Sum: 17 Product: 42 Remainder: 2 x = 5 print "x started as", x x = x * 2 print "Then x was", x x = x + 1 print "Finally x was", x x started as 5 Then x was 10 Finally x was 11 • Could write x *= 2 instead of x = x * 2, and x += 1 instead of x = x + 1

  15. … Or interactively • This way, you can use Python as a calculator • Can also use += -= /= *= >>> x = 14 >>> y = 3 >>> x + y 17 >>> x * y 42 >>> x % y 2 >>> x = 5 >>> print "x started as", x x started as 5 >>> x *= 2 >>> print "Then x was", x Then x was 10 >>> x += 1 >>> print "Finally x was", x Finally x was 11 >>>

  16. String operations a = "pan" b = "cake" a = a + b print a a = "soap" b = "dish" a += b print a • Concatenation+ += • Can find the length of a string using the function len(x) pancake soapdish mcb = "ACGCGT" print "Length of %s is "%mcb, len(mcb) Length of ACGCGT is 6

  17. String formatting • Strings can be formatted with placeholders for inserted strings (%s) and numbers (%d for integers and %f for floats) • Use the % operator on strings: formatted string % insertion tuple >>> "aaaa%saaaa%saaa"%("gcgcgc","tttt") 'aaaagcgcgcaaaattttaaa' >>> "A range written like this: (%d - %d)" % (2,5) 'A range written like this: (2 - 5)' >>> "Or with preceding 0's: (%03d - %04d)" % (2,5) "Or with preceding 0's: (002 - 0005)" >>> import math >>> "Rounding floats %.3f" % math.pi 'Rounding floats 3.142' >>> "Space holders: _%-7s_ and _%7s_" %("left", "right") 'Space holders: _left   _ and _  right_'
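
For comparison, here is a small sketch of the same formatting operations in Python 3 (an assumption; the slides use Python 2), shown both with the % operator and with `str.format`. The variable names are illustrative only.

```python
import math

# The % operator behaves as on the slide.
padded = "(%03d - %04d)" % (2, 5)
rounded = "Rounding floats %.3f" % math.pi
aligned = "_%-7s_ and _%7s_" % ("left", "right")

# The same results written with str.format.
padded_new = "({:03d} - {:04d})".format(2, 5)
rounded_new = "Rounding floats {:.3f}".format(math.pi)
aligned_new = "_{:<7}_ and _{:>7}_".format("left", "right")

print(padded, rounded, aligned, sep="\n")
```

In both styles the width (7) pads with spaces, `-` or `<` aligns to the left, and `.3f` rounds to three decimal places.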

  18. More string operations Convert to upper case x = "A simple sentence" print x print x.upper() print x.lower() xl=list(x) xl.reverse() print "".join(xl) x = x.replace("i", "a") print x print len(x) Convert to lower case Convert the string to a list Reverse the list Join all list members Translate "i"'s into "a"'s A simple sentence A SIMPLE SENTENCE a simple sentence ecnetnes elpmis A A sample sentence 17 Calculate the length of the string

  19. Concatenating DNA fragments dna1 = "accacgt" dna2 = "taggtct" print dna1 + dna2 accacgttaggtct "Transcribing" DNA to RNA • The DNA string is a mixture of upper and lower case dna = "accACgttAGGTct" rna = dna.lower().replace("t", "u") print rna accacguuaggucu • lower() makes it all lower case • replace("t", "u") replaces "t" with "u"
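
The two operations on this slide can be wrapped in small reusable functions. This is a minimal sketch for Python 3 (an assumption; the slides use Python 2), and the function names `concatenate` and `transcribe` are hypothetical.

```python
def concatenate(*fragments):
    """Join any number of DNA fragments into one string."""
    return "".join(fragments)


def transcribe(dna):
    """Lower-case the DNA string and replace every t with u."""
    return dna.lower().replace("t", "u")


print(concatenate("accacgt", "taggtct"))
print(transcribe("accACgttAGGTct"))
```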

  20. Conditional blocks • The ability to execute an action contingent on some condition is what distinguishes a computer from a calculator. In Python, this looks like this: if condition: action else: alternative Consistent, level-wise indenting important x = 149 y = 100 if x > y: print x,"is greater than",y else: print x,"is less than", y These indentations tell Python which piece of code is contingent on the condition. 149 is greater than 100

  21. Conditional operators • Numeric: > >= < <= != == • != means "does not equal" • Note that the test for "x equals y" is x==y, not x=y • The same operators work on strings as alphabetic comparisons x = 5 * 4 y = 17 + 3 if x == y: print x, "equals", y 20 equals 20 • Shorthand syntax for assigning more than one variable at a time: (x, y) = ("Apple", "Banana") if y > x: print y, "after", x Banana after Apple

  22. Logical operators • Logical operators: and and or • The keyword not is used to negate what follows. Thus not x < y means the same as x >= y • The keyword False (or the value zero) is used to represent falsehood, while True (or any non-zero value, e.g. 1) represents truth. Thus: if True: print "True is true" if False: print "False is true" if -99: print "-99 is true" True is true -99 is true x = 222 if x % 2 == 0 and x % 3 == 0: print x, "is an even multiple of 3" 222 is an even multiple of 3

  23. Loops • Here's how to print out the numbers 0 to 9: x = 0 while x < 10: print x, x+=1 0 1 2 3 4 5 6 7 8 9 • This is a while loop. The indented code is executed repeatedly as long as the condition x < 10 remains true • x+=1 is equivalent to x = x + 1

  24. A common kind of loop • Let's dissect the code of the while loop again: x = 0 (initialisation) while x < 10: (test for completion) print x, x+=1 (continuation) • Alternatively, the for loop construct iterates through a list: for x in range(10): print x, • range(10) generates the list [0,1, …,9]; x is the iteration variable

  25. For loop features • Loops can be used with all iterable types, i.e. lists, strings, tuples, iterators, sets, file objects • Step sizes can be specified with the third argument of the slice syntax (negative values for iterating backwards) >>> for nucleotide in "actgc": ... print nucleotide, a c t g c >>> for number in range(50)[::7]: ... print number, 0 7 14 21 28 35 42 49 >>> for nucleotide in "actgc"[::-1]: ... print nucleotide, c g t c a

  26. Reading Data from Files • To read from a file, we can conveniently iterate through it line by line with a for loop and the open function. Internally a file handle is maintained during the loop. for line in open("sequence.txt"): print line, • This code snippet opens a file called "sequence.txt" in the current directory and iterates through it line by line • The trailing comma suppresses print's automatic newline (each line read from the file already ends with one) • Contents of sequence.txt, which is also the program output: >CG11604 TAGTTATAGCGTGAGTTAGT TGTAAAGGAACGTGAAAGAT AAATACATTTTCAATACC
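
A Python 3 sketch of the same loop (Python 3 is an assumption; the slides use Python 2). The `with` statement is an addition to the slide's approach: it closes the file as soon as the block ends. The temporary file is created here only to make the example self-contained; its content is copied from the slide.

```python
import os
import tempfile

# Create a small sequence file so the example can run anywhere.
content = (">CG11604\n"
           "TAGTTATAGCGTGAGTTAGT\n"
           "TGTAAAGGAACGTGAAAGAT\n"
           "AAATACATTTTCAATACC\n")
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(content)
    path = tmp.name

lines = []
with open(path) as handle:       # the file is closed when this block ends
    for line in handle:          # iterate line by line, as on the slide
        print(line, end="")      # end="" plays the role of the trailing comma
        lines.append(line)

os.remove(path)                  # clean up the temporary file
```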

  27. Python for Bioinformatics. Lecture 2: Sequences and Lists

  28. Summary: scalars and loops • Assignment operator • Arithmetic operations • String operations • Conditional tests • Logical operators • Loops • Reading a file x = 5 y = x * 3 s = "Concatenating " + "strings" if y > 10: print s if y > 10 and not s == "": print s for x in range(10): print x for line in open("sequence.txt"): print line,

  29. Pattern-matching • A very sophisticated kind of logical test is to ask whether a string contains a pattern • e.g. does a yeast promoter sequence contain the MCB binding site, ACGCGT? 20 bases upstream of the yeast gene YBR007C name = "YBR007C" dna="TAATAAAAAACGCGTTGTCG" if "ACGCGT" in dna: print name, "has MCB!" The pattern for the MCB binding site The membership operator in YBR007C has MCB!

  30. FASTA format • A format for storing multiple named sequences in a single file • This file contains 3' UTRs for Drosophila genes CG11604, CG11455 and CG11488 >CG11604 TAGTTATAGCGTGAGTTAGT TGTAAAGGAACGTGAAAGAT AAATACATTTTCAATACC >CG11455 TAGACGGAGACCCGTTTTTC TTGGTTAGTTTCACATTGTA AAACTGCAAATTGTGTAAAA ATAAAATGAGAAACAATTCT GGT >CG11488 TAGAAGTCAAAAAAGTCAAG TTTGTTATATAACAAGAAAT CAAAAATTATATAATTGTTT TTCACTCT • The name of each sequence is preceded by a > symbol • NB sequences can span multiple lines • Call this file fly3utr.txt

  31. Printing all sequence names in a FASTA database for line in open("fly3utr.txt"): if line.startswith(">"): print line, >CG11604 >CG11455 >CG11488

  32. Finding all sequence lengths length=0 name="" for line in open("/home/bioinf/ah/tmp/sequence.txt"): line=line.rstrip() if line.startswith(">"): if name and length: print name, length name=line[1:] length=0 else: length+=len(line) print name, length • The rstrip method trims the whitespace characters (including the newline) off the right end. Try it without this and see what happens, and whether you can work out why • Input file: >CG11604 TAGTTATAGCGTGAGTTAGT TGTAAAGGAACGTGAAAGAT AAATACATTTTCAATACC >CG11455 TAGACGGAGACCCGTTTTTC TTGGTTAGTTTCACATTGTA AAACTGCAAATTGTGTAAAA ATAAAATGAGAAACAATTCT GGT >CG11488 TAGAAGTCAAAAAAGTCAAG TTTGTTATATAACAAGAAAT CAAAAATTATATAATTGTTT TTCACTCT • Output: CG11604 58 CG11455 83 CG11488 69
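
The length-counting loop can also be written as a function that returns a dictionary instead of printing. This is a sketch for Python 3 (an assumption; the slides use Python 2); the function name `fasta_lengths` is hypothetical, and the example lines are the first two records from the slide.

```python
def fasta_lengths(lines):
    """Map each sequence name to its total length.

    `lines` can be any iterable of strings, for example an open file.
    """
    lengths = {}
    name = None
    for line in lines:
        line = line.rstrip()          # drop the newline before counting
        if line.startswith(">"):
            name = line[1:]           # strip the leading ">"
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


example = [
    ">CG11604\n",
    "TAGTTATAGCGTGAGTTAGT\n",
    "TGTAAAGGAACGTGAAAGAT\n",
    "AAATACATTTTCAATACC\n",
    ">CG11455\n",
    "TAGACGGAGACCCGTTTTTC\n",
    "TTGGTTAGTTTCACATTGTA\n",
    "AAACTGCAAATTGTGTAAAA\n",
    "ATAAAATGAGAAACAATTCT\n",
    "GGT\n",
]
print(fasta_lengths(example))
```

Collecting the lengths in a dictionary avoids the extra print after the loop that the slide needs for the last record.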

  33. Reverse complementing DNA • A common operation due to double-helix symmetry of DNA def revcomp(dna): replaced=list(dna.lower(). replace("a","x").replace("t","a"). replace("x", "t").replace("g","x"). replace("c","g").replace("x", "c")) replaced.reverse() return "".join(replaced) print revcomp("accACgttAGgtct") agacctaacgtggt • Start by making the string lower case again. This is generally good practice • Replace 'a' with 't', 'c' with 'g', 'g' with 'c' and 't' with 'a' (using 'x' as a temporary placeholder) • Reverse the list and join it back into a string
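
An alternative sketch of the same function for Python 3 (an assumption; the slides use Python 2). It swaps the chain of `replace` calls for a translation table built with `str.maketrans`, so no temporary placeholder character is needed. This is a different technique from the slide's, shown only for comparison.

```python
# Translation table: each nucleotide maps to its complement.
COMPLEMENT = str.maketrans("acgt", "tgca")


def revcomp(dna):
    """Return the reverse complement of a DNA string, in lower case."""
    return dna.lower().translate(COMPLEMENT)[::-1]  # [::-1] reverses the string


print(revcomp("accACgttAGgtct"))
```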

  34. Lists • A list is an ordered collection of values • We can think of this as a list with 4 entries nucleotides = ['a', 'c', 'g', 't'] print "Nucleotides: ", nucleotides Nucleotides: ['a', 'c', 'g', 't'] • The list is the set of all four elements: a (element 0), c (element 1), g (element 2), t (element 3) • Note that the element indices start at zero.

  35. List literals • There are several, equally valid ways to assign an entire list at once. This is the most common: a comma-separated list, delimited by square brackets a = [1,2,3,4,5] print "a = ",a b = ['a','c','g','t'] print "b = ",b c = range(1,6) print "c = ",c d = "a c g t".split() print "d = ", d a = [1,2,3,4,5] b = ['a','c','g','t'] c = [1,2,3,4,5] d = ['a','c','g','t']

  36. Accessing lists • To access list elements, use square brackets e.g. x[0] means "element zero of list x" • Remember, element indices start at zero! • Negative indices refer to elements counting from the end e.g. x[-1] means "last element of list x" x = ['a', 'c', 'g', 't'] i=2 print x[0], x[i], x[-1] a g t

  37. List operations • You can sort and reverse lists... x = ['a', 't', 'g', 'c'] print "x =",x x.sort() print "x =",x x.reverse() print "x =",x x = ['a', 't', 'g', 'c'] x = ['a', 'c', 'g', 't'] x = ['t', 'g', 'c', 'a'] • You can read the entire contents of a file into a list (each line of the file becomes an element of the list) seqfile = open("C:/sequence.txt") x = seqfile.readlines()

  38. Applying Methods to Objects • Instances of lists, strings, etc. are objects with built-in methods • Explore available methods using dir: >>> dir("hello") ['__add__', … ,'__str__', 'capitalize', 'center', 'count', 'decode', 'encode', 'endswith', 'expandtabs', 'find', 'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'replace', 'rfind', 'rindex', 'rjust', 'rsplit', 'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate', 'upper', 'zfill'] >>> help("hello".count) (…)Return the number of occurrences of substring sub in string S[start:end]. Optional arguments start and end are interpreted as in slice notation. >>> "hello".count("l") 2 List of applicable methods String object Method . (dot) applies method to object

  39. List operations Multiplying lists with * >>> x=[1,0]*5 >>> x [1, 0, 1, 0, 1, 0, 1, 0, 1, 0] >>> while 0 in x: print x.pop(), 0 1 0 1 0 1 0 1 0 >>> x [1] >>> x.append(2) >>> x [1, 2] >>> x+=x >>> x [1, 2, 1, 2] >>> x.remove(2) >>> x [1, 1, 2] >>> x.index(2) 2 pop removes the last element of a list append adds an element to the end of a list concatenating lists with + or += Removing the first occurrence of an element Position of an element

  40. for loop revisited • Finding the total of a list of numbers: • Equivalent to: val = [4, 19, 1, 100, 125, 10] total = 0 for x in val: total += x print total for statement loops through each entry in a list 259 val = [4, 19, 1, 100, 125, 10] total = 0 for i in range(len(val)): total += val[i] print total 259
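
Two further variants of the same total, sketched for Python 3 (an assumption; the slides use Python 2): the built-in `sum`, and `enumerate`, which yields the index and the value together when both are needed. The `running` list is an illustrative addition.

```python
val = [4, 19, 1, 100, 125, 10]

# The built-in sum replaces the explicit loop.
total = sum(val)
print(total)

# enumerate gives (index, value) pairs, avoiding range(len(val)).
running = []
subtotal = 0
for position, x in enumerate(val):
    subtotal += x
    running.append((position, subtotal))
print(running)
```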

  41. Modules • Additional functionality, that is not part of the core language, is in modules like • sys (system) • re (regular expressions) • math (mathematics) • Load modules with import • You can write your own modules and import them >>> import math >>> help(math) Help on built-in module math: …

  42. The sys.argv list • A special list is sys.argv • This contains the command-line arguments if the program is invoked at the command line • It's a way for the user to pass information into the program, if you don't have a graphical interface with which to do this import sys print sys.argv File args.py ah@studipool1> python args.py abc 123 ['args.py', 'abc', '123'] Output at command line
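
A slightly extended sketch for Python 3 (an assumption; the slides use Python 2) that separates the script name from the actual arguments. The function name `describe_arguments` is hypothetical.

```python
import sys


def describe_arguments(argv):
    """Split an argv-style list into the script name and its arguments."""
    script = argv[0] if argv else ""   # argv[0] is the script itself
    arguments = argv[1:]               # everything after it was typed by the user
    return script, arguments


script, arguments = describe_arguments(sys.argv)
print("script:", script)
print("arguments:", arguments)
```

Note that every argument arrives as a string, so `123` in the slide's example is `'123'` and must be converted with `int` before doing arithmetic.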

  43. Converting a sequence into a list • As in the underlying programming language C, a string is treated as a sequence of characters, so it can be converted into a list >>> dna="acgtcgaga" >>> list(dna) ['a', 'c', 'g', 't', 'c', 'g', 'a', 'g', 'a'] >>> """You can also make use of long strings and the split function""".split() ['You', 'can', 'also', 'make', 'use', 'of', 'long', 'strings', 'and', 'the', 'split', 'function'] • Data types can be converted. Here the list function converts a string into a list. • Triple quotes allow for strings that stretch over several lines

  44. Taking a slice of a list • The syntax x[i:j] returns a list containing elements i,i+1,…,j-1 of list x nucleotides = ['a', 'g', 'c', 't'] purines = nucleotides[0:2] # nucleotides[:2] also works pyrimidines = nucleotides[2:4]# nucleotides[2:] also works print "Nucleotides:", nucleotides print "Purines:", purines print "Pyrimidines:", pyrimidines Nucleotides: ['a', 'g', 'c', 't'] Purines: ['a', 'g'] Pyrimidines: ['c', 't']

  45. Applying a function to a list • The map function applies a function to every element in a list • Syntax: map(FUNCTION, LIST) applies FUNCTION to every element in LIST • FUNCTION can be an arbitrary function, defined elsewhere, or a lambda expression • Lambda expressions provide "anonymous" functions, constructed with the keyword lambda, a list of parameters, and an expression using them • Example: multiply every number by 3 >>> map(lambda x: x*3, [1,2,3]) [3, 6, 9]
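
A sketch of the same example for Python 3 (an assumption; the slides use Python 2). In Python 3, `map` returns an iterator rather than a list, so it is wrapped in `list` here. A list comprehension is shown as an equivalent alternative.

```python
numbers = [1, 2, 3]

# map with an anonymous (lambda) function; list() collects the results.
tripled = list(map(lambda x: x * 3, numbers))

# The same result written as a list comprehension.
tripled_comprehension = [x * 3 for x in numbers]

# map also accepts any named function, for example the built-in len.
lengths = list(map(len, ["acgt", "ac", ""]))

print(tripled, tripled_comprehension, lengths)
```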

  46. Python for Bioinformatics. Lecture 3: Patterns and Functions

  47. Review: pattern-matching • The following code prints the string "Found MCB binding site!" if the pattern "ACGCGT" is present in the string variable dna: if "ACGCGT" in dna: print "Found MCB binding site!" • We can replace the first occurrence of ACGCGT with the string _MCB_ using the following syntax, where the arguments are pattern, replacement and count: dna.replace("ACGCGT","_MCB_", 1) • We can replace all occurrences by omitting the optional count argument: dna.replace("ACGCGT","_MCB_") • replace returns a new string; dna itself is left unchanged
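
A short sketch of both calls for Python 3 (an assumption; the slides use Python 2). The sequence in `dna` is made up for illustration and contains the MCB site twice.

```python
dna = "TTACGCGTAAACGCGTGG"  # made-up sequence with two ACGCGT sites

if "ACGCGT" in dna:
    print("Found MCB binding site!")

# count=1 replaces only the first occurrence.
first_only = dna.replace("ACGCGT", "_MCB_", 1)

# Without the count argument every occurrence is replaced.
all_sites = dna.replace("ACGCGT", "_MCB_")

print(first_only)
print(all_sites)
print(dna)  # strings are immutable, so the original is unchanged
```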

  48. Regular expressions • Python provides a pattern-matching engine • Patterns are called regular expressions • They are extremely powerful • Often called "regexps" for short • import module re

  49. Motivation: N-glycosylation motif • Common post-translational modification • Attachment of a sugar group • Occurs at asparagine residues with the consensus sequence NX1X2, where • X1 can be anything (but proline inhibits) • X2 is serine or threonine • Can we detect potential N-glycosylation sites in a protein sequence?

  50. Building regexps • In general square brackets denote a set of alternative possibilities • Use - to match a range of characters: [A-Z] • . matches any character except a newline • \s matches any whitespace character (space, tab, newline) • \S matches any character that is not whitespace • [^X] matches any character except X
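
These building blocks are enough to sketch a detector for the N-glycosylation motif from the previous slide. The following is a minimal illustration for Python 3 using the `re` module, under stated assumptions: proline is excluded at position X1 (the slide only says it inhibits), residues are one-letter codes, and `toy_protein` is a made-up sequence. The function name `glycosylation_sites` is hypothetical.

```python
import re

# N, then any residue except P, then S or T.
# The lookahead (?=...) lets overlapping candidate sites all be reported.
MOTIF = re.compile(r"(?=N[^P][ST])")


def glycosylation_sites(protein):
    """Return the 0-based start positions of candidate N-X-[ST] motifs."""
    return [match.start() for match in MOTIF.finditer(protein.upper())]


toy_protein = "MKNGSAANPTLLNATQ"  # made-up sequence for illustration
print(glycosylation_sites(toy_protein))
```

Here `[^P]` uses the negated set from the slide and `[ST]` the set of alternatives; the positions returned are only candidates, since the motif alone does not prove that a site is glycosylated.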
