Michael Smith Regular Expressions
Regular expressions help us find the information we want • They are incredibly powerful • They are vital to the field of bioinformatics
Computer scientists use them nearly every day as a filter • You use regular expressions every day too • Humans are great at spotting small matches, but computers excel at matching across large text files, especially ones from databases
‘grep’ made regular expressions popular • It takes a body of text and prints the lines that match a regular expression
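A minimal sketch of the idea behind grep, written in Perl because that is the language used later in this tutorial. The helper name grep_lines and the sample lines are made up for illustration, and the real grep offers many more options than this.

```perl
use strict;
use warnings;

# Return the lines that match the given pattern, like a tiny 'grep'.
# grep_lines is a hypothetical helper name, not a Perl built-in.
sub grep_lines {
    my ($pattern, @lines) = @_;
    return grep { /$pattern/ } @lines;
}

# Made-up sample text.
my @text = (
    "ATGCGTACG sample one",
    "no sequence here",
    "TTATGCC sample two",
);

# Print only the lines that contain 'ATG'.
print "$_\n" for grep_lines('ATG', @text);
```

Only the first and third sample lines contain 'ATG', so only those two are printed.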
Fundamentals • Often shortened to ‘regex’ or ‘regexp’, or called a pattern • A string is made up of characters – numbers, letters, and symbols • A regex describes a set of strings • Regexes contain metacharacters, which have special meanings
Fundamentals cont… • A regex lets you search for many strings with a single pattern • Based on three ideas • Repetition: An asterisk (*) indicates 0 or more repetitions of the character or group before it • Alternation: A pattern like (a|b) matches the string ‘a’ or ‘b’ • Concatenation: The pattern (ab) means ‘a’ followed by ‘b’
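The three ideas can be tried out in a short Perl sketch. The helper name matches and the test strings are made up for illustration. The ^ and $ anchors are an addition so that each pattern has to match the whole string.

```perl
use strict;
use warnings;

# Each pattern is anchored with ^ and $ so it must match the whole string.
my %patterns = (
    repetition    => qr/^ab*$/,    # 'a' followed by zero or more 'b'
    alternation   => qr/^(a|b)$/,  # exactly 'a' or exactly 'b'
    concatenation => qr/^ab$/,     # 'a' followed by 'b'
);

# matches() is a hypothetical helper: 1 if $string fits the named pattern.
sub matches {
    my ($name, $string) = @_;
    return $string =~ $patterns{$name} ? 1 : 0;
}

# Try every made-up test string against every pattern.
for my $string ('a', 'b', 'ab', 'abbb') {
    for my $name (sort keys %patterns) {
        printf "%-13s %-4s %s\n", $name, $string,
            matches($name, $string) ? 'match' : 'no match';
    }
}
```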
Motifs • One of the most common tasks in bioinformatics is looking for motifs: short segments of DNA or protein that are of particular interest • Often the motifs we look for are not one specific sequence; they can have several variants
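As a rough illustration of why variants matter, one regular expression can cover several versions of a motif at once. The motif, the helper name count_motif, and the DNA string below are all invented for this sketch and are not taken from a real database.

```perl
use strict;
use warnings;

# Illustrative motif: 'TATA', then 'A' or 'T', then 'A'.
# One pattern covers both variants, TATAAA and TATATA.
my $motif = qr/TATA[AT]A/;

# count_motif is a hypothetical helper that counts non-overlapping matches.
sub count_motif {
    my ($dna) = @_;
    my $count = () = $dna =~ /$motif/g;
    return $count;
}

# Made-up DNA string containing one copy of each variant.
my $dna = 'GGCTATAAAGGCCTATATAGG';
print "Found ", count_motif($dna), " motif occurrences\n";
```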
Motifs cont… • Motif databases have commonly been used to: • Classify proteins • Provide functional alignment • Identify structural and evolutionary relationships
Perl • Perl (a programming language) has powerful text-processing features • It easily manipulates text files • For my tutorial I will be using Perl, so you need to understand some special syntax
Perl Syntax • ‘$’ marks a scalar variable. A scalar is a single value (a number, string, or reference) • ‘=~’ means “apply the operation on the right to the string in the variable on the left” and is known as the binding operator • Inside a pattern, a period (‘.’) stands for any character except a newline.
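A small sketch of these three pieces of syntax working together. The helper name has_match and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# has_match is a hypothetical helper. '$string' and '$pattern' are scalars;
# '=~' applies the match on its right to the string on its left.
sub has_match {
    my ($string, $pattern) = @_;
    return $string =~ /$pattern/ ? 1 : 0;
}

my $sequence = "ACGT\nTTGA";   # made-up string with a line break in it

# '.' matches any character except a newline, so 'T.T' cannot use the
# line break as its middle character, while 'T.G' fits inside 'TTGA'.
print "T.T: ", has_match($sequence, 'T.T'), "\n";
print "T.G: ", has_match($sequence, 'T.G'), "\n";
```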
Perl Regex Syntax • The match operator is m//. It returns true or false • The substitution operator is s///. It changes the string in place and returns the number of substitutions made (with the ‘r’ modifier it returns the new string instead) • Regular expressions can have ‘modifiers’, which change the meaning of the expression. They come after the final slash
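The two operators and a few common modifiers can be sketched as follows. The helper names contains_start and to_rna and the sequences are invented for illustration. The i, g, and r modifiers are standard Perl, and r needs Perl 5.14 or later.

```perl
use strict;
use warnings;

# contains_start is a hypothetical helper: m// gives true or false.
# The /i modifier makes the match ignore upper/lower case.
sub contains_start {
    my ($dna) = @_;
    return $dna =~ m/atg/i ? 1 : 0;
}

# to_rna is a hypothetical helper: s/// edits $dna in place and returns
# how many substitutions it made. /g replaces every match, not just the first.
sub to_rna {
    my ($dna) = @_;
    my $count = ($dna =~ s/T/U/g);
    return ($dna, $count);
}

my ($rna, $count) = to_rna('ATGTTA');   # made-up sequence
print "contains ATG: ", contains_start('ccATGcc'), "\n";
print "$rna ($count substitutions)\n";

# With the /r modifier (Perl 5.14+), s/// returns the new string
# and leaves the original unchanged.
my $original = 'ATGTTA';
my $copy     = $original =~ s/T/U/gr;
print "$original -> $copy\n";
```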
Regular expressions are used in many programming languages. Because Perl handles them so elegantly, other languages have modeled their own implementations on Perl’s.
What does this mean for you? • You can find patterns in large databases! • Just as Andrew’s presentation covered Biopython, there is a BioPerl project for Perl • Sequence manipulation • Accessing web databases • Parsing of the results • Open source
use Bio::Perl; $seq_object = get_sequence('swiss', "ROA1_HUMAN");
This program retrieves the ROA1_HUMAN sequence from the Swiss-Prot database
Available functions • get_sequence • read_sequence • read_all_sequences • new_sequence • write_sequence • translate • translate_as_string • blast_sequence • write_blast
But wait, there’s more! • You don’t have to program to find useful information
Database Patterns • http://expasy.org/tools/scanprosite/ • Sites like this use different regular expression ‘symbols’ from Perl, but the same concepts • One-letter codes for amino acids • Symbol ‘x’ is a wildcard • Alternation is provided by the ‘[]’ brackets • Negated alternation is provided by the ‘{}’ brackets
A ‘-’ is just a separator • x(3) = x-x-x • x(2,4) = x-x or x-x-x or x-x-x-x (a range is only allowed with x; a fixed count works for any element, e.g. A(3) = A-A-A) • Example: [AC]-x-V-x(4)-{ED} • This pattern is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp} • (I imagine this means a lot more to biologists than to me)
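To show that the concepts really are the same, here is a rough sketch that rewrites this notation as a Perl regex. The helper prosite_to_regex is hypothetical and not part of any library. It covers only one-letter codes, x, [], {}, the - separator, and repeat counts, and it ignores other parts of the ScanProsite syntax such as anchors.

```perl
use strict;
use warnings;

# prosite_to_regex is a hypothetical helper, not part of any library.
# It handles only: one-letter codes, 'x', [..], {..}, '-' separators,
# and (n) or (n,m) repeat counts.
sub prosite_to_regex {
    my ($pattern) = @_;
    my @parts;
    for my $element (split /-/, $pattern) {
        # Split off an optional repeat count such as (4) or (2,4).
        my ($core, $repeat) = $element =~ /^([^(]+)(?:\((\d+(?:,\d+)?)\))?$/
            or die "Cannot parse element '$element'\n";
        if    ($core =~ /^x$/i)        { $core = '.' }      # wildcard
        elsif ($core =~ /^\{(\w+)\}$/) { $core = "[^$1]" }  # negated set
        # [..] sets and single letters are already valid regex syntax.
        push @parts, defined $repeat ? $core . '{' . $repeat . '}' : $core;
    }
    return join '', @parts;
}

my $regex = prosite_to_regex('[AC]-x-V-x(4)-{ED}');
print "$regex\n";   # prints [AC].V.{4}[^ED]

my $protein = 'MKAGVLLLLKP';   # made-up sequence, not a real protein
print "match\n" if $protein =~ /$regex/;
```

Building the regex from a string keeps the sketch short. A real tool would also validate the amino acid letters before trusting the pattern.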
My Query • http://en.wikipedia.org/wiki/Amino_acid • Uses the standard one-letter amino acid abbreviations • [ACG]-XXAG-V-X(4)-{AEGD} • [Alanine or cysteine or glycine], any, any, alanine, glycine, valine, any, any, any, any, {not alanine, glutamic acid, glycine, or aspartic acid}
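The same query can be written by hand as a Perl regex and used to report where each hit starts. This translation is my own reading of the pattern above, not output from ScanProsite. The helper find_hits and the test sequence are made up for illustration.

```perl
use strict;
use warnings;

# Hand translation of the query into Perl syntax (an assumption about
# how the ScanProsite pattern maps onto a Perl regex).
my $query = qr/[ACG]..AGV.{4}[^AEGD]/;

# find_hits is a hypothetical helper that returns [start, matched text]
# pairs, with start positions counted from 1.
sub find_hits {
    my ($protein) = @_;
    my @hits;
    while ($protein =~ /($query)/g) {
        push @hits, [ $-[1] + 1, $1 ];   # $-[1] is the 0-based start of group 1
    }
    return @hits;
}

# Made-up sequence, not a real protein.
my $protein = 'MSCLLAGVKKKKLTT';
for my $hit (find_hits($protein)) {
    print "position $hit->[0]: $hit->[1]\n";
}
```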
Results • A LOT of hits
According to nature.com, the real power of databases is the ability to unearth patterns hidden across different types of data. • Databases are starting to include features geared specifically to the life sciences, such as: • Pattern recognition functions • Built-in BLAST search • Regular expressions for complex word-pattern matching
Uses • For as long as biocomputing has been of interest, regular expressions have been used for sequence alignment. • I found an article from as recently as 2007 that combines probabilities, gaps, and local optimization with regular expressions, with results comparable to CLUSTALW
In Conclusion • We discussed how we use regular expressions every day • We explored their practical uses in a field like bioinformatics • We learned how to write simple programs that quickly perform tasks that are very boring for humans • You don’t have to be a computer scientist to unlock their power!
Extra • http://www.ncbi.nlm.nih.gov/pubmed/19534754 - Article June 2009, Regular expression Blasting algorithm