MATLAB Bioinformatics Tools

Rob Henson
The MathWorks, Inc.

Who Am I?
• Development manager for Bioinformatics group at The MathWorks
• Natick, MA
• Software developer
• Background in algorithm design and software engineering

What do I do?
• Write software for bioinformatics
  • Sequence analysis
  • Microarray data analysis
• Some consulting
  • Bioinformatics algorithm design
  • Machine learning tools
    • E.g. neural networks, HMMs, etc.

My solution to dotplot
>> map = eye(128);
>> spy(map(seq1,seq2))
Why does this work?
How could we make this better?
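One way to see why the two-line solution works is to run it on very short sequences and inspect the matrix that spy receives. The sketch below uses made-up sequences (seq1 and seq2 here are illustrative, not from the slides) and assumes every character code lies between 1 and 128.

```matlab
% Two short, made-up nucleotide sequences (hypothetical example data).
seq1 = 'ACGTAC';
seq2 = 'CGTA';

% Character arrays used as indices are converted to their character codes,
% so seq1 selects rows and seq2 selects columns of the identity matrix.
map = eye(128);
dots = map(seq1, seq2);   % numel(seq1)-by-numel(seq2) matrix of 0s and 1s

% dots(i,j) is 1 exactly when seq1(i) and seq2(j) are the same character,
% because the identity matrix is 1 only where row and column index agree.
spy(dots)
```

Diagonal runs of dots in the plot correspond to stretches where the two sequences agree.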

Enhancements to dotplot
• Does map need to be 128?
• What is the right value?
• Can we use less memory?
• How do we deal with bad inputs?
• Can we extend this to look for longer patterns?
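The questions above can be explored with a small sketch. This is one possible approach under stated assumptions, not necessarily how the toolbox dotplot function is implemented: the sequences, window and stringency values are made up, the lookup table is replaced by a direct comparison, and windowed matching is done with a diagonal convolution kernel.

```matlab
% Hypothetical example data; any character vectors would do.
seq1 = 'ACGTACGTAC';
seq2 = 'acgtacgaac';
window = 3;       % assumed window length
stringency = 3;   % assumed minimum number of matches per window

% Guard against bad inputs before comparing.
if ~ischar(seq1) || ~ischar(seq2)
    error('Sequences must be character arrays.');
end

% Compare every character of seq1 with every character of seq2 directly.
% The result is logical (one byte per element) and needs no lookup table.
dots = bsxfun(@eq, upper(seq1(:)), upper(seq2(:)).');

% Count matches along each diagonal run of length WINDOW, then keep only
% the positions where the count reaches the stringency.
counts = conv2(double(dots), eye(window), 'same');
filtered = counts >= stringency;

spy(filtered)
```

Converting to upper case makes the comparison case-insensitive, and the logical matrix uses an eighth of the memory of the double matrix returned by indexing into eye(128).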

Some useful tools
• edit
• dbstop
• profiler
• Getting help
  • Documentation
  • Technical Support Knowledge Base
  • Newsgroup

A full implementation of dotplot
function matches = dotplot(seq1,seq2,window,stringency)
% DOTPLOT Visualize sequence matches.
%   DOTPLOT(S,T) plots the sequence matches of sequences S and T.
%
%   DOTPLOT(S,T,WINDOW,NUM) plots sequence matches when there
%   are at least NUM matches in a window of size WINDOW. For nucleotide
%   sequences a WINDOW of 11 and NUM of 7 is recommended in the
%   literature.
%
%   MATCHES = DOTPLOT(...) returns the number of dots in the dotplot
%   matrix.
%
%   Example:
%       moufflon = getgenbank('AB060288','sequence',true);
%       takin = getgenbank('AB060290','sequence',true);
%       dotplot(moufflon,takin,11,7)
%
%   This shows the similarities between prion protein (PrP) nucleotide
%   sequences of two ruminants, the moufflon and the golden takin.
%
%   See also NWALIGN, SWALIGN.

Sequence properties
• Amino acid composition
  • histc function
• Molecular weight
  • Indexing and sum function
• Hydrophobicity

Molecular weights
A:  89.000   R: 174.000   N: 132.000   D: 133.000   C: 121.000
Q: 146.000   E: 147.000   G:  75.000   H: 155.000   I: 131.000
L: 131.000   K: 146.000   M: 149.000   F: 165.000   P: 115.000
S: 105.000   T: 119.000   W: 204.000   Y: 181.000   V: 117.000
http://cn.expasy.org/tools/pscale/Molecularweight.html

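% Molecular weights indexed by letter, A through Y.
% Letters that are not standard amino acids (B, J, O, U, X) are set to 0.
mw = [89.0900 0 121.1500 133.1000 147.1300 165.1900 75.0700 ...
      155.1600 131.1700 0 146.1900 131.1700 149.2100 132.1200 0 ...
      115.1300 146.1500 174.2000 105.0900 119.1200 0 117.1500 ...
      204.2300 0 181.1900];
seq = 'MATLAPEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSP';
% seq-'A'+1 maps 'A' to 1, 'B' to 2, and so on, giving an index into mw.
seqmw = mw(seq-'A'+1);
plot(seqmw)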
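The composition and total-weight ideas from the Sequence properties slide can be sketched with histc, indexing and sum. The variable names letters, counts, composition and totalmw are illustrative. Note that summing free amino acid weights, as done here, overestimates the weight of the actual peptide, because one water molecule is lost for every peptide bond formed.

```matlab
% Molecular weights for letters A through Y (0 for non-standard letters).
mw = [89.0900 0 121.1500 133.1000 147.1300 165.1900 75.0700 ...
      155.1600 131.1700 0 146.1900 131.1700 149.2100 132.1200 0 ...
      115.1300 146.1500 174.2000 105.0900 119.1200 0 117.1500 ...
      204.2300 0 181.1900];
seq = 'MATLAPEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSP';

% Amino acid composition: count how often each letter A..Y occurs.
% With consecutive integer edges, each histc bin holds exactly one letter.
letters = 'A':'Y';
counts = histc(double(seq), double(letters));
composition = counts / numel(seq);   % fraction of the sequence per letter

% Sum of the individual residue weights, using the same indexing trick.
totalmw = sum(mw(seq - 'A' + 1));

bar(composition)
set(gca, 'XTick', 1:numel(letters), 'XTickLabel', letters(:))
```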

proteinplot

Assignments
1. Create a hydrophobicity plot
You can get the amino acid values from http://cn.expasy.org/cgi-bin/protscale.pl
Use Kyte & Doolittle’s values.
Create a function that has two inputs, the sequence and the window size. The function will create a hydrophobicity plot.
The help for the function is on the next slide…

function hydrophobic(sequence, window_length)
% HYDROPHOBIC plots the hydrophobicity of an amino acid sequence
%   HYDROPHOBIC(SEQUENCE,WINDOW_LENGTH) creates a hydrophobicity plot of
%   SEQUENCE using a smoothing window of length WINDOW_LENGTH.
%
%   SEQUENCE must be a valid amino acid sequence. If SEQUENCE contains any
%   symbols other than the standard 20 amino acid letters, the function
%   will give an error message. SEQUENCE can be either upper or lower case.
%
%   WINDOW_LENGTH must be an odd positive integer.
%
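The smoothing step described in the help text can be sketched separately from the rest of the assignment. The example below is a minimal illustration on made-up numbers, not a solution: values stands in for the per-residue hydrophobicity scores, and a moving average via conv is only one reasonable reading of "smoothing window".

```matlab
% Hypothetical per-residue values; in the assignment these would come from
% the Kyte & Doolittle scale.
values = [1 3 5 7 9 11 13];
window_length = 3;

% WINDOW_LENGTH must be an odd positive integer.
if window_length < 1 || mod(window_length, 2) ~= 1
    error('WINDOW_LENGTH must be an odd positive integer.');
end

% Moving average: each output point is the mean of WINDOW_LENGTH neighbours.
% 'valid' keeps only positions where the window fits entirely in the data.
smoothed = conv(values, ones(1, window_length) / window_length, 'valid');

% Centre positions of the windows, for use as the x axis.
half = (window_length - 1) / 2;
positions = (1 + half):(numel(values) - half);

plot(positions, smoothed)
```

An odd window length gives every window a single centre residue, which is why the plot can be aligned with sequence positions.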

Assignments
2. Modify the function to return the maximum and minimum hydrophobicity values in the plot.
Make appropriate changes to the help for the function.

Advanced example
• Alignment significance
• Alignment algorithms such as Smith-Waterman and Needleman-Wunsch always find some alignment. How do we know if what they find is significant or simply random?
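One common way to approach this question is a permutation test: shuffle one sequence many times, align each shuffled copy, and see how often a random score reaches the real one. The sketch below assumes the Bioinformatics Toolbox is installed so that swalign is available. The second sequence and the number of permutations are made up for illustration, and this is not necessarily the method the presentation goes on to use.

```matlab
% First sequence is the one used earlier; the second is hypothetical.
seq1 = 'MATLAPEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSP';
seq2 = 'MEEPQSDLSIELPLSQETFSDLWKLLPPNNVLS';

% Score of the real local alignment (amino acid alphabet is the default).
score = swalign(seq1, seq2);

% Scores for shuffled versions of seq2. Shuffling keeps the composition
% but destroys the order of the residues.
nperm = 100;                      % assumed number of permutations
randscores = zeros(1, nperm);
for k = 1:nperm
    shuffled = seq2(randperm(numel(seq2)));
    randscores(k) = swalign(seq1, shuffled);
end

% Empirical p-value: fraction of random scores at least as high as the
% real score, with +1 so the estimate is never exactly zero.
pvalue = (sum(randscores >= score) + 1) / (nperm + 1);

hist(randscores)
```

If the real score sits far in the right tail of the histogram of shuffled scores, the alignment is unlikely to be a chance result.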
