Advanced File Parsing: Regular Expressions
BCHB524, 2008, Lecture 6
BCHB524 - 2008 - Edwards

Outline
• Review
  • Lecture 4 exercises
• Regular Expressions
  • Protein active sites / functional domains
  • Restriction / digestion enzymes
  • Specialized text parsing
• Exercises

Review
• Basic data-types: immutable
  • Integers, floats, strings, tuples, booleans, None
• Statements:
  • Assignment, if statements, for statements
• Compound data-structures: mutable
  • Lists, dictionaries, sets, arrays, files
• Lists ↔ Strings
  • Reading sequences from files, parsing NCBI tax names
• Advanced iteration:
  • Iterables, comprehensions, generators, sorting keys
• Modules:
  • BioPython, parsing Fasta, RefSeq, and UniProt files

Lecture 4 exercise discussion

Regular Expressions
• Good HOWTO
  • http://py-howto.sourceforge.net/regex/regex.html
• Andrew Dalke's lecture on this is superb
  • See link to courses on "Lecture 1 Links" post.
• Useful for many, many string tasks
  • Most string methods can be implemented using re
  • Parsing, picking apart text-based formats
  • DNA sequence motifs
  • Protein sequence motifs

Regular Expressions
• "Look" ugly!
  • Esoteric syntax
• Are used (overused?) for everything in Perl and Linux
• Can be very hard to get right
  • Constant source of frustration and bugs
• So powerful, you just can't afford not to know and use them.

Protein function signatures
• Protein sequence suggests structure / shape
  • ...which suggests function.
• Functional protein domains have similar sequences
  • some very, very similar
  • others quite dissimilar
• ProSite is a database of protein signatures
  • Signatures are represented as consensus patterns

p53 tumor antigen protein
• Many contain the string: MCNSSCMGGMNRR
• Others contain the string: MCNSSCVGGMNRR

    import Bio.SeqIO

    handle = open("sprot_chunk.dat")
    for seq_record in Bio.SeqIO.parse(handle, "swiss"):
        # Convert the Seq object to a plain string for substring tests
        seq = str(seq_record.seq)
        if 'MCNSSCMGGMNRR' in seq or 'MCNSSCVGGMNRR' in seq:
            print(seq_record.id, "is a p53 tumor antigen.")
    handle.close()

p53 tumor antigen protein
• A better way:
  • Match MCNSSC, then M or V, then GGMNRR
  • [...] is a list of residues, any one of which matches at that position
  • So, match MCNSSC[MV]GGMNRR instead.

    import Bio.SeqIO
    import re

    handle = open("sprot_chunk.dat")
    for seq_record in Bio.SeqIO.parse(handle, "swiss"):
        seq = str(seq_record.seq)
        # One pattern covers both the M and the V variant
        if re.search(r'MCNSSC[MV]GGMNRR', seq):
            print(seq_record.id, "is a p53 tumor antigen.")
    handle.close()

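The equivalence of the two approaches can be checked without a SwissProt file. The following is a minimal sketch, written for Python 3 and using made-up sequence fragments (the names `frag_M`, `frag_V`, and `frag_none` are purely illustrative, not real proteins), that compares the two substring tests with the single character-class pattern.

```python
import re

# Hypothetical sequence fragments, invented for illustration only.
fragments = {
    "frag_M": "AAPILTIITLEDSSGMCNSSCMGGMNRRPILTI",
    "frag_V": "KKGEPHHELPPGSTKRALPNNTMCNSSCVGGMNRRSS",
    "frag_none": "MCNSSCAGGMNRR",   # A at the variable position: no match
}

p53_pattern = re.compile(r'MCNSSC[MV]GGMNRR')

hits = {}
for name, seq in fragments.items():
    # Two explicit substring tests, as on the first p53 slide
    by_substring = 'MCNSSCMGGMNRR' in seq or 'MCNSSCVGGMNRR' in seq
    # One character-class pattern, as on the second p53 slide
    by_regex = p53_pattern.search(seq) is not None
    hits[name] = (by_substring, by_regex)
    print(name, by_substring, by_regex)
```

Both tests agree on every fragment; the regular expression simply scales better once more positions vary.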
Antennapedia signature
• 'Homeobox' antennapedia-type protein signature is more interesting:
  • [LIVMFE] - [FY] - P - W - M - [KRQTA]
• As a regular expression:
  • [LIVMFE][FY]PWM[KRQTA]
• Some matches in human proteins:
  • EYPWMK, IFPWMK, VYPWMK, IYPWMR, VYPWMQ, IYPWMR, EFPWMK, IFPWMK, VYPWMR, IFPWMR, VYPWMQ, IYPWMR, LFPWMR, VYPWMK, IYPWMT, IYPWMQ, MFPWMR, IFPWMK, VYPWMK, MFPWMR

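As a quick sanity check, the hexapeptides listed on the slide can be tested against the pattern directly. This is a small Python 3 sketch; the three "near miss" peptides are invented here to show one failure per variable position and are not taken from any database.

```python
import re

antp = re.compile(r'[LIVMFE][FY]PWM[KRQTA]')

# Distinct hexapeptides from the slide's list of human matches
listed = ["EYPWMK", "IFPWMK", "VYPWMK", "IYPWMR", "VYPWMQ", "EFPWMK",
          "VYPWMR", "IFPWMR", "LFPWMR", "IYPWMT", "IYPWMQ", "MFPWMR"]

# fullmatch requires the whole peptide to match, not just a part of it
all_match = all(antp.fullmatch(pep) is not None for pep in listed)

# Hypothetical near misses: one disallowed residue each
near_misses = ["AYPWMK",   # A is not in [LIVMFE]
               "EWPWMK",   # W is not in [FY]
               "EYPWMD"]   # D is not in [KRQTA]
none_match = not any(antp.fullmatch(pep) for pep in near_misses)

print(all_match, none_match)
```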
N-Glycosylation site
• Pattern is N, then any residue except P, then S or T, then any residue except P.
• Could use (for not P): [ACDEFGHIKLMNQRSTVWY]
• Better: [^P]
• Caveat: [^P] also matches B, J, O, Z, %, $, a, c, ...

    if re.search(r'N[^P][ST][^P]', seq):
        print("glycosylation site!")

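The caveat about [^P] is easy to demonstrate. The sketch below (Python 3, with invented test sequences) compares the negated class against the explicit list of the 19 standard residues other than proline; which one is appropriate depends on how clean the input sequences are.

```python
import re

# Negated class: anything except P, including non-residue characters
loose = re.compile(r'N[^P][ST][^P]')

# Explicit class: only the 19 standard residues other than P
not_p = 'ACDEFGHIKLMNQRSTVWY'
strict = re.compile('N[%s][ST][%s]' % (not_p, not_p))

# Hypothetical sequences, made up for illustration
tests = {
    "clean":   "GANGSAK",   # N-G-S-A: a site under both patterns
    "proline": "GANPSAK",   # N-P-S-A: P after N, so no site
    "odd":     "GAN*SXK",   # '*' and 'X' are not standard residues
}

results = {}
for name, seq in tests.items():
    results[name] = (loose.search(seq) is not None,
                     strict.search(seq) is not None)
    print(name, results[name])
```

On the "odd" sequence only the loose pattern reports a site, which is exactly the behavior the caveat warns about.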
Trypsin digest site
• Pattern is K or R, followed by any residue except P.

    if re.search(r'[KR][^P]', seq):
        print("tryptic digest site!")

Barwin domain signature
• Signature: C - G - [KR] - C - L - x - V - x - N
• '.' (period) matches any character/residue
• As a regular expression: CG[KR]CL.V.N
• Matches: [list from the original slide not preserved]

Repeated Residues
• For example, 3 hydrophobic residues: [FILAPVM][FILAPVM][FILAPVM]
• Regular expression:
  • [FILAPVM]{3} - exactly 3 hydrophobic residues
  • [FILAPVM]{3,5} - between 3 and 5
  • [FILAPVM]{,3} - at most 3
  • [FILAPVM]{3,} - at least 3
• .{10} matches exactly 10 characters (residues)
  • domain signatures often have spacers

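The repeat counts are greedy: each takes as many residues as its upper bound allows. A short Python 3 sketch on a made-up 10-residue sequence containing a hydrophobic run of length 6 shows what each form actually matches, including the slightly surprising behavior of {,3}, whose lower bound is zero.

```python
import re

seq = "KKAVILMPGG"   # hypothetical; the run AVILMP is 6 residues long
hydrophobic = '[FILAPVM]'

exactly_3  = re.search(hydrophobic + '{3}', seq).group()
three_to_5 = re.search(hydrophobic + '{3,5}', seq).group()
at_least_3 = re.search(hydrophobic + '{3,}', seq).group()

# {,3} means 0 to 3, so it succeeds at position 0 with an empty match,
# because the first K is not hydrophobic.
at_most_3 = re.search(hydrophobic + '{,3}', seq)

any_ten = re.search(r'.{10}', seq).group()

print(exactly_3, three_to_5, at_least_3)
print(repr(at_most_3.group()), at_most_3.start())
print(any_ten)
```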
Aspartic acid and asparagine hydroxylation site
• Consensus pattern: C - C - x(13) - C - x(2) - [GN] - x(12) - C - x - C - x(2,4) - C
• As a regular expression: CC.{13}C.{2}[GN].{12}C.C.{2,4}C
  • . is the same as .{1}, of course
• Special repeat ranges:
  • Optional: ? is the same as {0,1}
  • 0 or more: * is the same as {0,}
  • 1 or more: + is the same as {1,}

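The translation from consensus pattern to regular expression is mechanical, so it can be sketched as a small function. The helper below, `prosite_to_regex`, is a hypothetical name introduced here for illustration; it is not part of BioPython or the re module. It assumes only the subset of the notation used in these slides (single residues, x, [...] choices, {...} exclusions, and (n) or (n,m) repeat counts) and does not handle N- or C-terminal anchors.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple ProSite-style consensus pattern to a Python regex.

    Illustrative sketch only: supports residues, x, [..], {..}, and
    repeat counts such as x(13) or x(2,4).
    """
    parts = []
    # Drop spaces and a trailing period, then split on the '-' separators
    for element in pattern.replace(' ', '').rstrip('.').split('-'):
        # Separate the element from an optional (n) or (n,m) repeat count
        m = re.fullmatch(r'([^()]+)(?:\((\d+(?:,\d+)?)\))?', element)
        if m is None:
            raise ValueError("cannot parse element: %r" % element)
        core, repeat = m.group(1), m.group(2)
        if core == 'x':
            core = '.'                        # x means any residue
        elif core.startswith('{') and core.endswith('}'):
            core = '[^' + core[1:-1] + ']'    # {P} means anything but P
        parts.append(core + ('{%s}' % repeat if repeat else ''))
    return ''.join(parts)

hydroxylation = prosite_to_regex(
    "C - C - x(13) - C - x(2) - [GN] - x(12) - C - x - C - x(2,4) - C")
print(hydroxylation)
```

Applied to the other signatures in these slides, the same function reproduces the regular expressions given for the Antennapedia and Barwin patterns.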
N- and C-termini
• We can match at the start or end of the sequence only
  • ^ matches at the start of the sequence
  • $ matches at the end of the sequence
• Protein sequence starts with methionine:
  • re.search(r'^M',seq)
• Nucleotide sequence ends with a proline codon:
  • re.search(r'CC.$',seq)

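A brief Python 3 sketch of the two anchors, using invented sequences. It also shows one detail worth knowing when sequences are read line by line from a file: $ will match just before a trailing newline, whereas \Z matches only at the very end of the string.

```python
import re

protein = "MKTAYIAKQR"   # hypothetical protein sequence
dna = "ATGGCTCCA"        # hypothetical DNA sequence ending in CCA

starts_with_met = re.search(r'^M', protein) is not None
ends_with_pro_codon = re.search(r'CC.$', dna) is not None

# A line read from a file usually still carries its newline.
line = dna + "\n"
dollar_with_newline = re.search(r'CC.$', line) is not None    # $ tolerates it
strict_with_newline = re.search(r'CC.\Z', line) is not None   # \Z does not

print(starts_with_met, ends_with_pro_codon)
print(dollar_with_newline, strict_with_newline)
```

Note that ^ at the start of a pattern is an anchor, while ^ as the first character inside [...] negates the character class; the two uses are unrelated.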
Regular expressions in Python
• re module
• re.search(regex,string) to find a match
  • returns a "match" object, or None
• Match objects store information about a successful match.

    m = re.search(r'[KR][^P]', seq)
    if m is not None:
        # m.start() is 0-based; add 1 to report a residue position
        print("tryptic digest site at", m.start() + 1)

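re.search reports only the first match. To visit every match, the re module also provides re.finditer, which yields one match object per non-overlapping match. The sketch below is Python 3 and uses a made-up peptide.

```python
import re

seq = "AKRPGKTR"   # hypothetical peptide, invented for illustration

first = re.search(r'[KR][^P]', seq)
first_position = first.start() + 1      # 1-based position of the K or R

sites = []
for m in re.finditer(r'[KR][^P]', seq):
    # start() and end() are 0-based; end() is one past the last character
    sites.append((m.start(), m.end(), m.group()))
    print(m.start(), m.end(), m.group())

# The final R is not reported: [^P] needs a following character,
# and there is none at the C-terminus.
```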
Regular expressions in Python
• Groups store part of a match for later.
  • Indicate with (...)
• Particularly useful with variable length matches

    import re

    pattern = r'[ASD]{3,5}([LI])[^P]{2,5}'
    seq = "EASALWTRD"
    m = re.search(pattern, seq)
    if m is not None:
        print(m.start(), m.end())      # span of the whole match
        print(m.start(1), m.end(1))    # span of group 1, the [LI] residue
        print(m.group(1))              # text captured by group 1

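Because the repeat counts vary, the captured residue lands at a different offset in different sequences, which is why asking the group for its own position is useful. The Python 3 sketch below runs the slide's pattern on the slide's sequence and on a second, invented sequence.

```python
import re

pattern = re.compile(r'[ASD]{3,5}([LI])[^P]{2,5}')

# The first sequence is from the slide; the second is hypothetical.
sequences = ["EASALWTRD", "DDSSAIKKPQ"]

found = []
for seq in sequences:
    m = pattern.search(seq)
    if m is not None:
        # Whole-match span, group-1 span, and the captured residue
        found.append((m.start(), m.end(), m.start(1), m.end(1), m.group(1)))
        print(seq, found[-1])
```

In the first sequence the match is ASA + L + WTRD; in the second it is DDSSA + I + KK, where the match stops at the proline.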
Groups are great for parsing
• Check for a match, then pick out the piece you need if the match succeeds

    import re

    dbxrefs = ['EMBL:CR940353', 'RefSeq:XP_953099.1', 'GeneID:3863060',
               'KEGG:tan:TA08425', 'GO:GO:0005886', 'InterPro:IPR007480',
               'Pfam:PF04385']
    for r in dbxrefs:
        # Group 1 captures the accession without its version suffix
        m = re.search(r'^RefSeq:([NX]P_[0-9]+)\.[0-9]+$', r)
        if m is not None:
            print("RefSeq accession is", m.group(1))

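The same idea extends to more than one group. As a sketch (Python 3), the pattern below uses two groups to split every cross-reference into a database name and an identifier; the variable names `xref_pattern` and `parsed` are introduced here for illustration. Everything up to the first colon is taken as the database name, which is an assumption that happens to hold for the entries on the slide.

```python
import re

dbxrefs = ['EMBL:CR940353', 'RefSeq:XP_953099.1', 'GeneID:3863060',
           'KEGG:tan:TA08425', 'GO:GO:0005886', 'InterPro:IPR007480',
           'Pfam:PF04385']

# Group 1: text before the first colon; group 2: everything after it
xref_pattern = re.compile(r'^([^:]+):(.+)$')

parsed = {}
for r in dbxrefs:
    m = xref_pattern.search(r)
    if m is not None:
        parsed[m.group(1)] = m.group(2)

# Identifiers may themselves contain colons; [^:]+ stops at the first one.
print(parsed['KEGG'])
print(parsed['GO'])
print(parsed['RefSeq'])
```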
Lab exercises
• Try each of the examples shown in these slides.
• Read through the Python Regular Expression HOWTO and Andrew Dalke's lecture "Searching and Regular Expressions"
• Write a regular expression to match the codons that code for each amino acid.
  • Note: S, R and L are hard!

Lab exercises
• Construct regular expressions for the restriction enzyme motifs:
  • GANTC, where N represents A, C, T, or G
  • CCWGG, where W represents A or T
• Write a program to chop a protein sequence into tryptic peptides.
  • Print out each tryptic peptide, as well as its start and end position.

Lab exercises
• The GN "line" in a SwissProt entry lists various types of gene names for the protein
  • A BioPython seq_record object stores this in the dictionary seq_record.annotations, with key 'gene_name'.
• Use a regular expression to find which SwissProt entries have a gene name denoted "Name"
• For those with a "Name" gene name, extract the gene name and print out the protein's id and the gene name.
• Try the above without BioPython.
• Try the above without regular expressions!