Advanced File Parsing: Regular Expressions

BCHB524, 2008, Lecture 6

BCHB524 - 2008 - Edwards

Outline
• Review
• Lecture 4 exercises
• Regular Expressions
  • Protein active sites / functional domains
  • Restriction / digestion enzymes
  • Specialized text parsing
• Exercises

Review
• Basic data-types: immutable
  • Integers, floats, strings, tuples, booleans, None
• Statements:
  • Assignment, if statements, for statements
• Compound data-structures: mutable
  • Lists, dictionaries, sets, arrays, files
• Lists ↔ Strings
  • Reading sequences from files, parsing NCBI tax names
• Advanced iteration:
  • Iterables, comprehensions, generators, sorting keys
• Modules:
  • BioPython, parsing Fasta, RefSeq, and UniProt files

Lecture 4 exercise discussion

Regular Expressions
• Good HOWTO
  • http://py-howto.sourceforge.net/regex/regex.html
• Andrew Dalke's lecture on this is superb
  • See link to courses on "Lecture 1 Links" post.
• Useful for many, many string tasks
  • Most string methods can be implemented using re
  • Parsing, picking apart text-based formats
  • DNA sequence motifs
  • Protein sequence motifs

Regular Expressions
• "Look" ugly!
• Esoteric syntax
• Are used (overused?) for everything in Perl and Linux
• Can be very hard to get right
  • Constant source of frustration and bugs
• So powerful, you just can’t afford not to know and use them.

Protein function signatures
• Protein sequence suggests structure / shape
  • …which suggests function.
• Functional protein domains have similar sequences
  • some very, very similar
  • others quite dissimilar
• ProSite is a database of protein signatures
  • Signatures represented as consensus pattern

p53 tumor antigen protein
• Many contain the string: MCNSSCMGGMNRR
• Others contain the string: MCNSSCVGGMNRR

    import Bio.SeqIO

    handle = open("sprot_chunk.dat")
    for seq_record in Bio.SeqIO.parse(handle, "swiss"):
        seq = str(seq_record.seq)    # sequence as a plain string
        if 'MCNSSCMGGMNRR' in seq or 'MCNSSCVGGMNRR' in seq:
            print(seq_record.id, "is a p53 tumor antigen.")
    handle.close()

p53 tumor antigen protein
• A better way:
  • Match MCNSSC, then M or V, then GGMNRR
  • [...] is a list of matching residues
  • So, match MCNSSC[MV]GGMNRR instead.

    import Bio.SeqIO
    import re

    handle = open("sprot_chunk.dat")
    for seq_record in Bio.SeqIO.parse(handle, "swiss"):
        seq = str(seq_record.seq)    # sequence as a plain string
        if re.search(r'MCNSSC[MV]GGMNRR', seq):
            print(seq_record.id, "is a p53 tumor antigen.")
    handle.close()
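The character class can be tried without BioPython or sprot_chunk.dat. The sketch below runs the same pattern over three short made-up sequences; the names toy1 to toy3 and the flanking residues are invented for illustration, and Python 3 syntax is assumed.

```python
import re

# Made-up sequences; only the motif region matters.
sequences = {
    "toy1": "AAAMCNSSCMGGMNRRPIL",   # M at the variable position
    "toy2": "GGGMCNSSCVGGMNRRTTT",   # V at the variable position
    "toy3": "GGGMCNSSCLGGMNRRTTT",   # L at the variable position: no match
}

hits = []
for name in sorted(sequences):
    if re.search(r'MCNSSC[MV]GGMNRR', sequences[name]):
        hits.append(name)

print(hits)  # ['toy1', 'toy2']
```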

Antennapedia signature
• 'Homeobox' antennapedia-type protein signature is more interesting:
  • [LIVMFE] - [FY] - P - W - M - [KRQTA]
• As a regular expression:
  • [LIVMFE][FY]PWM[KRQTA]
• Some matches in human proteins:
  • EYPWMK, IFPWMK, VYPWMK, IYPWMR, VYPWMQ, IYPWMR, EFPWMK, IFPWMK, VYPWMR, IFPWMR, VYPWMQ, IYPWMR, LFPWMR, VYPWMK, IYPWMT, IYPWMQ, MFPWMR, IFPWMK, VYPWMK, MFPWMR
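A small check of the antennapedia pattern, assuming Python 3: it confirms that the distinct hexapeptides listed above all fit the regular expression, then pulls matches out of a made-up sequence (toy_seq is invented for illustration) with re.findall.

```python
import re

antennapedia = r'[LIVMFE][FY]PWM[KRQTA]'

# Distinct hexapeptides taken from the list of human matches above.
peptides = ["EYPWMK", "IFPWMK", "VYPWMK", "IYPWMR", "VYPWMQ", "EFPWMK",
            "VYPWMR", "IFPWMR", "LFPWMR", "IYPWMT", "IYPWMQ", "MFPWMR"]
all_match = all(re.fullmatch(antennapedia, p) for p in peptides)

# Made-up sequence: two hits, plus IYPWMP whose final P is not in
# [KRQTA] and so must not match.
toy_seq = "AAEYPWMKGGLFPWMRTTIYPWMP"
found = re.findall(antennapedia, toy_seq)

print(all_match)  # True
print(found)      # ['EYPWMK', 'LFPWMR']
```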

N-Glycosylation site
• Pattern is N, then any residue except P, then S or T, then any residue except P.
• Could use (for not P): [ACDEFGHIKLMNQRSTVWY]
• Better: [^P]
• Caveat: [^P] includes B, J, O, Z, %, $, a, c, …

    if re.search(r'N[^P][ST][^P]', seq):
        print("glycosylation site!")
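The caveat about [^P] can be seen directly. This sketch (Python 3 assumed; toy_seq and junk are made-up strings) compares the loose pattern with a stricter one built from the explicit residue list, and uses re.finditer to report the 1-based position of each non-overlapping match.

```python
import re

loose = r'N[^P][ST][^P]'
not_p = '[ACDEFGHIKLMNQRSTVWY]'        # the explicit "not P" residue list
strict = 'N' + not_p + '[ST]' + not_p

# NGSA is a site; NPTA and NKTP are excluded by the "not P" positions.
toy_seq = "AANGSAANPTAANKTP"
positions = [m.start() + 1 for m in re.finditer(loose, toy_seq)]
print(positions)  # [3]

# A string that is not a protein sequence at all.
junk = "N$S%"
loose_hit = re.search(loose, junk) is not None
strict_hit = re.search(strict, junk) is not None
print(loose_hit, strict_hit)  # True False
```

If the input is known to contain only valid residue letters, the shorter [^P] form is usually good enough.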

Trypsin digest site
• Pattern is K or R, followed by any residue except P.

    if re.search(r'[KR][^P]', seq):
        print("tryptic digest site!")

Barwin domain signature
• Signature: C - G - [KR] - C - L - x - V - x - N
• '.' (period) matches any character/residue
• As a regular expression: CG[KR]CL.V.N
• Matches

Repeated Residues
• For example, 3 hydrophobic residues
  • [FILAPVM][FILAPVM][FILAPVM]
• Regular expression:
  • [FILAPVM]{3} - exactly 3 hydrophobic residues
  • [FILAPVM]{3,5} - between 3 and 5
  • [FILAPVM]{,3} - at most 3
  • [FILAPVM]{3,} - at least 3
• .{10} matches exactly 10 characters, residues
  • domain signatures often have spacers
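To see what each repeat count actually grabs, the sketch below (Python 3 assumed; seq and the helper first are made up for illustration) applies the four forms to a run of five hydrophobic residues. Repeats are greedy, so ranges take as many residues as they can, and "at most 3" is satisfied by zero residues, which is why it matches the empty string at the very first position.

```python
import re

seq = "GGFILAVGG"   # made-up: 5 hydrophobic residues flanked by G

def first(pattern):
    """Text of the first match of pattern in seq."""
    return re.search(pattern, seq).group(0)

exactly_3 = first(r'[FILAPVM]{3}')      # 'FIL'
three_to_5 = first(r'[FILAPVM]{3,5}')   # 'FILAV'
at_least_3 = first(r'[FILAPVM]{3,}')    # 'FILAV'
at_most_3 = first(r'[FILAPVM]{,3}')     # '' (zero repeats match at position 0)

print(repr(exactly_3), repr(three_to_5), repr(at_least_3), repr(at_most_3))
```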

Aspartic acid and asparagine hydroxylation site
• Consensus pattern: C - C - x(13) - C - x(2) - [GN] - x(12) - C - x - C - x(2,4) - C
• As regular expression: CC.{13}C.{2}[GN].{12}C.C.{2,4}C
  • . is same as .{1}, of course
• Special repeat ranges:
  • Optional: ? is same as {0,1}
  • 0 or more: * is same as {0,}
  • 1 or more: + is same as {1,}
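The translation from consensus pattern to regular expression is mechanical enough to sketch in code. The function prosite_to_regex below is a hypothetical helper written for these notes (Python 3 assumed), not part of BioPython or ProSite. It handles only the notation used in these slides, plus the {P} "anything but" form, which is an assumption about ProSite notation that the slides do not show.

```python
import re

def prosite_to_regex(pattern):
    """Translate a ProSite-style consensus pattern into a regex string."""
    parts = []
    for element in pattern.split("-"):
        element = element.strip()
        # Split off an optional repeat count such as (13) or (2,4).
        m = re.search(r'^([^(]+)(?:\(([0-9,]+)\))?$', element)
        if m is None:
            raise ValueError("unrecognized element: %r" % element)
        residue, count = m.group(1), m.group(2)
        if residue == "x":
            residue = "."                          # x: any residue
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"   # {P}: anything but P
        parts.append(residue)
        if count is not None:
            parts.append("{" + count + "}")        # x(2,4) becomes .{2,4}
    return "".join(parts)

hydroxylation = prosite_to_regex(
    "C - C - x(13) - C - x(2) - [GN] - x(12) - C - x - C - x(2,4) - C")
print(hydroxylation)  # CC.{13}C.{2}[GN].{12}C.C.{2,4}C
```

The same helper turns [LIVMFE] - [FY] - P - W - M - [KRQTA] into [LIVMFE][FY]PWM[KRQTA].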

N- and C- terminals
• We can match at start or end of sequence only
  • ^ matches at start of sequence
  • $ matches at end of sequence
• Starts with methionine (protein sequence):
  • re.search(r'^M',seq)
• Ends with proline codon (DNA sequence):
  • re.search(r'CC.$',seq)
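A short illustration of the two anchors on made-up sequences (Python 3 assumed). The last line shows one detail worth knowing when sequences are read from files: in Python, $ also matches just before a single trailing newline.

```python
import re

# ^ : only a match at the very start counts.
starts_with_met = [bool(re.search(r'^M', s)) for s in ("MKVL", "AMKVL")]
print(starts_with_met)       # [True, False]

# $ : only the last three letters are examined.
ends_with_pro_codon = [bool(re.search(r'CC.$', s))
                       for s in ("ATGGCTCCA", "ATGCCAGCT")]
print(ends_with_pro_codon)   # [True, False]

# $ tolerates one trailing newline, as left by reading a line from a file.
with_newline = bool(re.search(r'CC.$', "ATGGCTCCA\n"))
print(with_newline)          # True
```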

Regular expressions in Python
• re module
• re.search(regex,string) to find a match
  • returns a "match" object, or None
• Match objects store information about a successful match.

    m = re.search(r'[KR][^P]', seq)
    if m is not None:
        print("tryptic digest site at", m.start() + 1)
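The "match object or None" convention can be wrapped in a small function. first_site is a hypothetical helper for illustration (Python 3 assumed); it returns the 1-based position and the matched text, using m.start() and m.group(0), the whole match.

```python
import re

def first_site(pattern, seq):
    """Return (1-based position, matched text) of the first match, or None."""
    m = re.search(pattern, seq)
    if m is None:
        return None
    return m.start() + 1, m.group(0)

# K at position 3 is followed by P, so the first site is R at position 6.
print(first_site(r'[KR][^P]', "AGKPLRSA"))  # (6, 'RS')
print(first_site(r'[KR][^P]', "AGKP"))      # None
```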

Regular expressions in Python
• Groups store part of a match for later.
• Indicate with (…)
• Particularly useful with variable length matches

    import re

    pattern = r'[ASD]{3,5}([LI])[^P]{2,5}'
    seq = "EASALWTRD"
    m = re.search(pattern, seq)
    if m is not None:
        print(m.start(), m.end())      # span of the whole match
        print(m.start(1), m.end(1))    # span of group 1
        print(m.group(1))              # text captured by group 1
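Working through the group example by hand (Python 3 assumed): [ASD]{3,5} can only take ASA, because the next residue L is not in the class; the group captures that L; and the greedy [^P]{2,5} takes the remaining four residues WTRD. Positions are 0-based, with the end position one past the last matched character.

```python
import re

pattern = r'[ASD]{3,5}([LI])[^P]{2,5}'
seq = "EASALWTRD"
m = re.search(pattern, seq)

whole = (m.start(), m.end(), m.group(0))      # (1, 9, 'ASALWTRD')
group1 = (m.start(1), m.end(1), m.group(1))   # (4, 5, 'L')

print(whole)
print(group1)
```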

Groups are great for parsing
• Check for a match, and then pick out the piece you need if match succeeds

    import re

    dbxrefs = ['EMBL:CR940353', 'RefSeq:XP_953099.1', 'GeneID:3863060',
               'KEGG:tan:TA08425', 'GO:GO:0005886', 'InterPro:IPR007480',
               'Pfam:PF04385']
    for r in dbxrefs:
        # Group 1 is the accession without its version suffix.
        m = re.search(r'^RefSeq:([NX]P_[0-9]+)\.[0-9]+$', r)
        if m is not None:
            print("RefSeq accession is", m.group(1))
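The same idea extends to more than one group. As a sketch (Python 3 assumed; the xref dictionary is introduced here for illustration), the pattern below captures the database name before the first colon as group 1 and everything after it as group 2, so identifiers that themselves contain a colon stay intact.

```python
import re

dbxrefs = ['EMBL:CR940353', 'RefSeq:XP_953099.1', 'GeneID:3863060',
           'KEGG:tan:TA08425', 'GO:GO:0005886', 'InterPro:IPR007480',
           'Pfam:PF04385']

xref = {}
for r in dbxrefs:
    # [^:]+ stops at the first colon; .+ takes the rest of the string.
    m = re.search(r'^([^:]+):(.+)$', r)
    if m is not None:
        xref[m.group(1)] = m.group(2)

print(xref['RefSeq'])  # XP_953099.1
print(xref['KEGG'])    # tan:TA08425
print(xref['GO'])      # GO:0005886
```

A plain dictionary keeps only one identifier per database; an entry with several cross-references to the same database would need a list per key instead.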

Lab exercises
• Try each of the examples shown in these slides.
• Read through the Python Regular Expression HOWTO and Andrew Dalke's lecture "Searching and Regular Expressions"
• Write a regular expression to match the codons that code for each amino-acid.
  • Note: S, R and L are hard!

Lab exercises
• Construct regular expressions for the restriction enzyme motifs:
  • GANTC, where N represents A, C, T, or G
  • CCWGG, where W represents A or T
• Write a program to chop a protein sequence into tryptic peptides.
  • Print out each tryptic peptide, as well as its start and end position.

Lab exercises
• The GN "line" in a SwissProt entry lists various types of gene names for the protein
  • A BioPython seq_record object stores this in the dictionary seq_record.annotations, with key 'gene_name'.
• Find which SwissProt entries have a gene name denoted "Name", using a regular expression
• For those with a "Name" gene name, extract the gene name and print out the protein's id and the gene name.
• Try the above without BioPython.
• Try the above without regular expressions!
