1 / 68

Parsing data records

Parsing data records. > sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY

egil
Download Presentation

Parsing data records

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Parsing data records

  2. >sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD AGEGEN

  3. A sequence record in FASTA format >sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD AGEGEN

  4. seq = ">sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens \ MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS\ WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY\ LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY\ YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD\ AGEGEN" for i in seq: print i

  5. seq = open("SingleSeq.fasta") for line in seq: print line

  6. seq = open("SingleSeq.fasta") seq_2 = open("SingleSeq-2.fasta") for line in seq: seq_2.write(line) seq_2.close()

  7. Writing different things depending on a condition Read a sequence in FASTA format and print only the header of the sequence >sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD AGEGEN

  8. seq = open("SingleSeq.fasta") for line in seq: if line[0] == '>': print line >sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD AGEGEN

  9. Making choices: The if/elif/else statements if <condition 1>: if expression in <condition1> is TRUE <statements 1> execute statements 1 [elif <condition 2>]: else if exp in <condition2> is TRUE <statements 2>] execute statements 2.... [elif <condition 3>]: etc... pass] … [else: <statements N>]

  10. Write different things depending on a condition >>> s="MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAE" >>> s_len = float(len(s)) >>> G_num = s.count('G’) >>> A_num= s.count('A’) >>> freq_G = G_num/s_len >>> freq_A = A_num/s_len >>> print freq_G 0.116666666667 >>> print freq_A 0.216666666667

  11. Write different things depending on a condition >>> s="MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAFAPAAAE" >>> s_len = float(len(s)) >>> G_num = s.count('G’) >>> A_num = s.count('A’) >>> freq_G = G_num/s_len >>> freq_A = A_num/s_len >>> print freq_G 0.116666666667 >>> print freq_A 0.216666666667 >>> if freq_G > freq_A: ... print "Gly is more frequent than Ala" ... eliffreq_G < freq_A: ... print "Ala is more frequent than Gly" ... else: ... print "The frequency of Gly and Ala is the same" ... Ala is more frequent than Glycines

  12. The if/elif/else construct produces different effects compared with the use of a series ofif conditions

  13. seq = open("SingleSeq.fasta") for line in seq: if line[0] == '>': print line seq = open("SingleSeq.fasta") for line in seq: if line[0] != '>': print line

  14. seq = open("SingleSeq.fasta") for line in seq: if line[0] != '>': print line == != => <= > <

  15. Exercises 1, 2, and 3 1) Read a file in FASTA format and write to a new file only the header of the record. 2) Read a file in FASTA format and write to a new file only the sequence (without the header). 3) Merge 1) and 2). In other words, read a file in FASTA format and write the header to a file and the sequence to a different one.

  16. fasta = open('SingleSeq.fasta') header = open('header.txt', 'w’) for line in fasta: if line[0] == '>': header.write(line) header.close()

  17. fasta = open('SingleSeq.fasta') seq= open('seq.txt','w') for line in fasta: if line[0] != '>': seq.write(line) seq.close()

  18. fasta = open('SingleSeq.fasta') header = open('header.txt', 'w') seq = open('seq.txt','w') for line in fasta: if line[0] == '>': header.write(line) else: seq.write(line) header.close() seq.close()

  19. Let’s increase the difficulty just a bit…

  20. seq_fasta = open("SingleSeq.fasta") seq = '' for line in seq_fasta: if line[0] == '>': header = line else: seq = seq + line.strip() num_cys = seq.count("C") print header, seq, num_cys

  21. Exercise 4 4) Read a file in FASTA format. Print or write the record to a file only if the sequence is from Homo sapiens.

  22. seq_fasta = open("SingleSeq.fasta") seq = '' header = '' for line in seq_fasta: if line[0] == '>': if "Homo sapiens" in line: header = line else: if header: seq = seq + line if header: print header + seq else: print "The record is not from H. sapiens"

  23. In general, you will need to analyse several sequences….

  24. SwissProt-Human.fasta >sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD AGEGEN >sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE EQNKEALQDVEDENQ >sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSW RVISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESK VFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFS VFYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDE EAGEGN ... ... ...

  25. Read the records from a file and write them to a new file fasta = open('SwissProt-Human.fasta') fasta_2 = open('SwissProt-Human_2.fasta', 'w') for line in fasta: fasta_2.write(line) this must be a string

  26. Strings can be concatenated >>> print "ACTGGTA" + "ATGTAACTT" ACTGGTAATGTAACTT Strings can be indexed and sliced >>> s = "ACTGGTA" >>> s[0] 'A' >>> s[1:3] 'CT' String elements cannot be re-assigned >>> s[2] = 'Z' Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: 'str' object does not support item assignment

  27. Read the sequences from a file and write them to a new file Number the lines starting from 1 fasta = open('SwissProt-Human.fasta') fasta_2 = open('SwissProt-Human_2.fasta', 'w') n = 0 for line in fasta: n = n + 1 l_n = str(n) fasta_2.write(l_n + "\t" + line) fasta_2.close()

  28. >sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD AGEGEN >sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE EQNKEALQDVEDENQ >sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSW RVISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESK VFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFS VFYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDE EAGEGN ... ... ...

  29. Exercise 5 5) Download a Uniprot multiple sequence FASTA file. Write the record headers to a new file.

  30. fasta = open('SwissProt-Human.fasta') headers = open('headers.txt', 'w') for line in fasta: if line[0] == '>': headers.write(line) headers.close()

  31. >sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD AGEGEN >sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE EQNKEALQDVEDENQ >sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSW RVISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESK VFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFS VFYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDE EAGEGN ... ... ...

  32. Exercise 6 6) Read a multiple sequence FASTA file and write the sequences to a new file separated by a blank line

  33. fasta = open('SwissProt-Human.fasta.fasta') seqs = open('seqs.txt', 'w') for line in fasta: if line[0] == '>’: seqs.write('\n') elifline[0] != '>': seqs.write(line) seqs.close() seqs.write(line.strip() + '\n’)

  34. Exercise 7 7) Read a multiple sequence FASTA file and write to a new file only the records from Homo sapiens.

  35. fasta = open('sprot_prot.fasta') output = open('homo_sapiens.fasta', 'w') seq = '' for line in fasta: if line[0] == '>' and seq == '': header = line elif line[0] != '>': seq = seq + line elif line[0] == '>' and seq != '': if "Homo sapiens" in header: output.write(header + seq) header = line seq = '' if "Homo sapiens" in header: output.write(header + seq) output.close()

  36. Exercise 8 8) Read FASTA records from a file and count the cysteine residues in each sequence.

  37. Read the records from a file and count the cysteine residues in each sequence fasta = open('sprot_prot.fasta') seq = '' for line in fasta: if line[0] == '>' and seq == '': header = line[4:10] elif line[0] != '>': seq = seq + line.strip() elif line[0] == '>' and seq != '': cys_num = seq.count('C') print header, ': ', cys_num header = line[4:10] seq = '' print header, ': ', cys_num

  38. Exercises 9, 10, and 11 9) Read a multiple sequence file in FASTA format and write to a new file only the records of the sequences starting with a methionine ('M'). 10) Read a multiple sequence file in FASTA format and write to a new file only the records of the sequences having at least two tryptophan residues ('W'). 11) Read a multiple sequence file in FASTA format and write to a new file only the records the sequences of which start with a methionine ('M') and have at least two tryptophans ('W').

  39. outfile = open('SwissProtHuman-Filtered.fasta','w') fasta = open('SwissProtHuman.fasta','r') seq = '' for line in fasta: if line[0:1] == '>' and seq == '': header = line elif line [0:1] != '>': seq= seq + line elif line[0:1] == '>' and seq != '': TRP_num= seq.count('W') if seq[0] == 'M' and TRP_num > 1: outfile.write(header + seq) seq= '' header = line TRP_num = seq.count('W') if seq[0] == 'M' and TRP_num > 1: outfile.write(header + seq) outfile.close()

  40. In many cases you will need to compare data from different files

  41. SwissProt-Human.fasta >sp|P31946|1433B_HUMAN 14-3-3 protein beta/alpha OS=Homo sapiens MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD AGEGEN >sp|P62258|1433E_HUMAN 14-3-3 protein epsilon OS=Homo sapiens MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASW RIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVF YYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIRLGLALNFSVF YYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGE EQNKEALQDVEDENQ >sp|Q04917|1433F_HUMAN 14-3-3 protein eta OS=Homo sapiens GN=YWHAH MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSW RVISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESK VFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHPIRLGLALNFS VFYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDE EAGEGN ... ... ... cancer-expressed.txt

  42. 1) Read 10 SwissProt ACs from a file 2) Store them into a data structure cancer_file = open('cancer-expressed.txt') cancer_list = [] for line in cancer_file: AC = line.strip() cancer_list.append(AC) print cancer_list

  43. List data structure A list is a mutableordered collection of objects The elements of a list can be any kind of object: numbers strings tuples lists dictionaries function calls etc. L = [1, [2,3], 4.52, ‘DNA’] L = [] The empty list

  44. >>> L = [1,”hello”,12.1,[1,2,”three”],”seq”,(1,2)] >>> L[0] # indexing 1 >>> L[3] # indexing [1, 2, ’three'] >>> L[3][2] # indexing ‘three’ >>> L[-1] # negative indexing (1, 2) >>> L[2:4] # slicing [12.1, [1, 2, ‘three’]] >>> L[2:] # slicing shorthand [12.1, [1, 2, ‘three’], ‘seq’, (1, 2)] >>>

  45. The elements of a list can be changed/replaced after the list has been defined These operations CHANGE the list >>> l = [2,3,5,7,8,['a','b'],'a','b','cde'] >>> l[0] = 1 >>> l [1, 3, 5, 7, 8, ['a', 'b'], 'a', 'b', 'cde'] >>> l[0:3] = 'DNA' >>> l ['D', 'N', 'A', 7, 8, ['a', 'b'], 'a', 'b', 'cde'] >>> del l[0:5] >>> l [['a', 'b'], 'a', 'b', 'cde'] >>> l.append('DNA') >>> l [['a', 'b'], 'a', 'b', 'cde', 'DNA'] >>> l.extend('dna') >>> l [['a', 'b'], 'a', 'b', 'cde', 'DNA', 'd', 'n', 'a'] >>>

  46. The elements of a list can be changed/replaced after the list has been defined >>> l = [1,3,5,7,8,['a','b'],'a','b','cde'] >>> l.count(‘a’) >>> l 1 >>> l.index(8) 4 >>> l.insert(4, 80) >>> l [1, 3, 5, 7, 80, 8, [‘a’, ‘b’], ‘a’, ‘b’, ‘cde’] >>> l.pop(4) 80 >>> l [1, 3, 5, 7, 8, [‘a’, ‘b’], ‘a’, ‘b’, ‘cde’] >>> l.pop() ‘cde’ >>> l [1, 3, 5, 7, 8, [‘a’, ‘b’], ‘a’, ‘b’] >>> l.remove(8) [1, 3, 5, 7, [‘a’, ‘b’], ‘a’, ‘b’]

  47. The elements of a list can be changed/replaced after the list has been defined >>> l = [4, 3, 2, 1, 5, 6, 7, 8] >>> l.reverse() >>> l [8, 7, 6, 5, 1, 2, 3, 4] >>> new = sorted(l) >>> new [1, 2, 3, 4, 5, 6, 7, 8] >>> l [8, 7, 6, 5, 1, 2, 3, 4] >>> l.sort() >>> l [1, 2, 3, 4, 5, 6, 7, 8]

  48. Putting together lists and loops range()andxrange() built-in functions >>> range(10) [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] >>> range(1, 11) [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] >>> range(0, 30, 5) [0, 5, 10, 15, 20, 25] >>> range(0, 10, 3) [0, 3, 6, 9] >>> range(0, -10, -1) [0, -1, -2, -3, -4, -5, -6, -7, -8, -9] >>> range(0) [] >>> range(1, 0) [] # the xrange()method ismore commonly used in for loops than range() >>>for i in xrange(5): … print i … 0,1,2,3,4 Thexrange()method generates the values upon call, i.e. it does not store them into a variable

More Related