COMP 5115 Programming Tools in Bioinformatics
Week 7

Handling Multiple Strings

Any of the MATLAB regular expression functions can be used with cell arrays of strings as well as with single strings. Any or all of the input parameters (the string, expression, or replacement string) can be a cell array of strings.

The regexp function requires that the string and expression arrays have the same number of elements if both are vectorised (i.e., if they have dimensions greater than 1-by-1). The regexprep function requires that the expression and replacement arrays have the same number of elements if the replacement array is vectorised. (The cell arrays do not have to have the same shape.)

Whenever the first input parameter to a regular expression function is a cell array, all output values are cell arrays of the same size.

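The rules above can be sketched with a small example. The strings and patterns here are hypothetical, chosen only for illustration, and both cell arrays are given the same shape to keep the sketch simple.

```matlab
% Hypothetical strings and patterns, chosen only for illustration.
strs  = {'abc'; 'aab'; 'bcc'};   % 3-by-1 cell array of strings
exprs = {'a';   'b';   'c'};     % 3-by-1 cell array of expressions

% Both inputs are cell arrays: exprs{k} is matched against strs{k}.
idx = regexp(strs, exprs);       % idx{1} = 1, idx{2} = 3, idx{3} = [2 3]

% A single expression is matched against every string in the cell array.
idxB = regexp(strs, 'b');        % idxB{1} = 2, idxB{2} = 3, idxB{3} = 1

% In both calls the output is a cell array with the same size as strs.
sameSize = isequal(size(idx), size(strs)) && isequal(size(idxB), size(strs));
```

Because the first input is a cell array, each result is wrapped in a cell even when it holds a single index.
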
Finding a Single Pattern in Multiple Strings

The example shown here uses the regexp function on a cell array of strings, cstr. It searches each string of the cell array for consecutive matching letters (e.g., 'oo'). The function returns a cell array of the same size as the input array. Each row of the returned array contains the indices at which a match starts in the corresponding input string.

The input cell array:

    cstr = { ...
        'Whose woods these are I think I know.' ; ...
        'His house is in the village though;'   ; ...
        'He will not see me stopping here'      ; ...
        'To watch his woods fill up with snow.'};

In the input cell array (cstr), find consecutive matching letters by capturing a letter as a token, (.), and then repeating that letter as a token reference, \1:

    idx = regexp(cstr, '(.)\1');
    whos idx
      Name      Size      Bytes  Class
      idx       4x1         296  cell array

    idx{:}
    ans =                  % 'Whose woods these are I think I know.'
         8                 %         |8
    ans =                  % 'His house is in the village though;'
        23                 %                        |23
    ans =                  % 'He will not see me stopping here'
         6    14    23     %       |6      |14      |23
    ans =                  % 'To watch his woods fill up with snow.'
        15    22           %                |15    |22

To return substrings instead of indices, use the 'match' parameter:

    mat = regexp(cstr, '(.)\1', 'match');
    mat{3}
    ans =
        'll'    'ee'    'pp'

Finding Multiple Patterns in Multiple Strings

This example uses a cell array of strings for both the input string and the expression. The two cell arrays have different shapes: cstr is 4-by-1 while expr is 1-by-4. The command is valid as long as both have the same number of cells.

Find uppercase or lowercase 'i' followed by a white-space character in cstr{1}, the sequence 'hou' in cstr{2}, two consecutive matching letters in cstr{3}, and words beginning with 'w' followed by a vowel in cstr{4}.

    expr = {'i\s', 'hou', '(.)\1', '\<w[aeiou]'};
    idx = regexpi(cstr, expr);

    idx{:}
    ans =                  % 'Whose woods these are I think I know.'
        23    31           %                        |23     |31
    ans =                  % 'His house is in the village though;'
         5    30           %      |5                       |30
    ans =                  % 'He will not see me stopping here'
         6    14    23     %       |6      |14      |23
    ans =                  % 'To watch his woods fill up with snow.'
         4    14    28     %     |4        |14           |28

Replacing Multiple Strings

When replacing multiple strings with regexprep, use a single replacement string if the expression consists of a single string. This example uses a common replacement value ('--') for all matches found in the multiple string input cstr. The function returns a cell array of strings having the same dimensions as the input cell array:

    s = regexprep(cstr, '(.)\1', '--', 'ignorecase')
    s =
        'Whose w--ds these are I think I know.'
        'His house is in the vi--age though;'
        'He wi-- not s-- me sto--ing here'
        'To watch his w--ds fi-- up with snow.'

Multiple replacement strings can be used if the expression consists of multiple strings. In this example, the input string and replacement string are both 4-by-1 cell arrays, and the expression is a 1-by-4 cell array. As long as the expression and replacement arrays contain the same number of elements, the statement is valid.

Replacing Multiple Strings (cont.)

The dimensions of the return value match the dimensions of the input string:

    expr = {'i\s', 'hou', '(.)\1', '\<w[aeiou]'};
    repl = {'-1-'; '-2-'; '-3-'; '-4-'};
    s = regexprep(cstr, expr, repl, 'ignorecase')
    s =
        'Whose w-3-ds these are -1-think -1-know.'
        'His -2-se is in the vi-3-age t-2-gh;'
        'He -4--3- not s-3- me sto-3-ing here'
        'To -4-tch his w-3-ds fi-3- up -4-th snow.'

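Note that in the output above every expression was applied to every string: regexprep pairs each expression with its replacement, not with a particular input string, and applies the pairs one after another. A minimal sketch of that behaviour, using a hypothetical input string:

```matlab
% Hypothetical input, chosen only for illustration.
str  = 'hello world';
expr = {'l+', 'o'};     % expression k is paired with replacement k
repl = {'L',  '0'};

% Pairs are applied in order; later pairs see the result of earlier ones.
out = regexprep(str, expr, repl);            % 'heL0 w0rLd'

% The same result, written as two explicit passes.
twoStep = regexprep(regexprep(str, 'l+', 'L'), 'o', '0');
same = strcmp(out, twoStep);                 % logical 1 (true)
```

Because the pairs run in sequence, the order of the expressions can change the result when one replacement creates or destroys text that a later expression would match.
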
Characters and Strings

In MATLAB, the term string refers to an array of Unicode characters. MATLAB represents each character internally as its corresponding numeric value, so a string is actually a vector whose components are the numeric codes for the characters. Unless you want to access these values, however, you can simply work with the characters as they display on screen. The actual characters displayed depend on the character set encoding for a given font.

You can use char to hold an m-by-n array of strings as long as each string in the array has the same length. (This is because MATLAB arrays must be rectangular.) To hold an array of strings of unequal length, use a cell array.

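The relationship between characters and their numeric codes can be sketched as follows. The sequence 'ACGT' is a hypothetical example value; double and char are the standard conversion functions.

```matlab
s = 'ACGT';                    % hypothetical example string

codes = double(s);             % numeric codes: [65 67 71 84]
back  = char(codes);           % converts the codes back to 'ACGT'

% Arithmetic works on the codes; for ASCII letters, lowercase = uppercase + 32.
lowerS = char(codes + 32);     % 'acgt'
```
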
Creating Character Arrays

Specify character data by placing characters inside a pair of single quotes. For example, this line creates a 1-by-13 character array called name:

    name = 'Thomas R. Lee';

In the workspace, the output of whos shows

      Name      Size      Bytes  Class
      name      1x13         26  char array

You can see that each character uses two bytes of storage internally. The class and ischar functions show name's identity as a character array:

    class(name)
    ans =
    char

    ischar(name)
    ans =
         1

You can also join two or more character arrays together to create a new character array. Use either the string concatenation function, strcat, or the MATLAB concatenation operator, [ ], to do this. The concatenation operator [ ] preserves any trailing spaces found in the input arrays, whereas strcat removes trailing white space from character array inputs:

    name = 'Thomas R. Lee';
    title = ' Sr. Developer';
    strcat(name,',',title)
    ans =
    Thomas R. Lee, Sr. Developer

strvcat can be used to concatenate strings vertically. S = strvcat(t1, t2, t3, ...) forms the character array S containing the text strings (or string matrices) t1, t2, t3, ... as rows. Spaces are appended to each string as necessary to form a valid matrix. Empty arguments are ignored. The command strvcat('Hello','Yes') is the same as ['Hello';'Yes  '], except that strvcat performs the padding automatically.

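The difference in how the two concatenation methods treat trailing blanks can be sketched as follows. The strings are hypothetical; the cell-array call is included because strcat leaves white space untouched when its inputs are cell arrays.

```matlab
a = 'gene ';                   % hypothetical string with one trailing blank
b = 'ID';

viaStrcat  = strcat(a, b);     % 'geneID'   (trailing blank of a removed)
viaBracket = [a, b];           % 'gene ID'  (trailing blank preserved)

% With a cell array input, strcat returns a cell array and keeps the blank.
viaCell = strcat({a}, b);      % {'gene ID'}
```
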
Creating Two-Dimensional Character Arrays

When creating a two-dimensional character array, be sure that each row has the same length. For example, this line is legal because both input rows have exactly 13 characters:

    name = ['Thomas R. Lee' ; 'Sr. Developer']
    name =
    Thomas R. Lee
    Sr. Developer

When creating character arrays from strings of different lengths, you can pad the shorter strings with blanks to force rows of equal length. Here the first string carries three trailing blanks so that it matches the 16 characters of the second:

    name = ['Thomas R. Lee   '; 'Senior Developer'];

A simpler way to create string arrays is to use the char function.

S = char(X) converts the array X, which contains positive integers representing character codes, into a MATLAB character array (the first 127 codes are ASCII). The actual characters displayed depend on the character set encoding for a given font. The result for any elements of X outside the range from 0 to 65535 is not defined (and can vary from platform to platform). Use double to convert a character array into its numeric codes.

S = char(C), when C is a cell array of strings, places each element of C into the rows of the character array S. Use cellstr to convert back.

S = char(t1, t2, t3, ...) forms the character array S containing the text strings t1, t2, t3, ... as rows, automatically padding each string with blanks to form a valid matrix. Each text parameter, ti, can itself be a character array. This allows the creation of arbitrarily large character arrays. Empty strings are significant: each one produces a blank row, unlike strvcat, which ignores empty arguments.

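The three forms of char described above can be sketched as follows. All input values are hypothetical and chosen only for illustration; the last two lines contrast how char and strvcat treat an empty string.

```matlab
% Form 1: numeric character codes to characters.
S1 = char([72 105]);               % 'Hi'

% Form 2: cell array of strings to a padded character array, and back.
C  = {'ab'; 'abcd'};
S2 = char(C);                      % 2-by-4; first row is 'ab' plus two blanks
C2 = cellstr(S2);                  % {'ab'; 'abcd'}, padding stripped again

% Form 3: several strings as rows. An empty string yields a blank row.
S3 = char('a', '', 'bc');          % 3-by-2 character array
S4 = strvcat('a', '', 'bc');       % 2-by-2, the empty argument is ignored
```
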
Creating Two-Dimensional Character Arrays (cont.)

In this example, char pads the 13-character input string 'Thomas R. Lee' with three trailing blanks so that it will be as long as the second string:

    name = char('Thomas R. Lee','Senior Developer')
    name =
    Thomas R. Lee
    Senior Developer

When extracting strings from an array, use the deblank function to remove any trailing blanks:

    trimname = deblank(name(1,:))
    trimname =
    Thomas R. Lee

    size(trimname)
    ans =
         1    13

The deblank function is useful for cleaning up the rows of a character array. str = deblank(str) removes the trailing blanks from the end of a character string str. c = deblank(c), when c is a cell array of strings, applies deblank to each element of c.

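A short sketch of deblank on a single string and on a cell array of strings. The inputs are hypothetical. strtrim is not covered in the text above; it is shown only to contrast it with deblank, which leaves leading blanks in place.

```matlab
padded = '  ATG   ';                   % hypothetical string, blanks at both ends

trimmedEnd  = deblank(padded);         % '  ATG'  (only trailing blanks removed)
trimmedBoth = strtrim(padded);         % 'ATG'    (leading and trailing removed)

% On a cell array of strings, deblank is applied to each element.
c = deblank({'ATG   '; 'TAA '});       % {'ATG'; 'TAA'}
```
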
Cell Arrays of Strings

Creating strings in a regular MATLAB array requires that all strings in the array be of the same length. This often means that you have to pad the ends of strings with blanks to equalize their length. However, another type of MATLAB array, the cell array, can hold different sizes and types of data in an array without padding. Cell arrays provide a more flexible way to store strings of varying length.

The cellstr function converts a character array into a cell array of strings. Consider the following character array:

    data = ['Allison Jones';'Development  ';'Phoenix      '];

Each row of the matrix is padded so that all have equal length (in this case, 13 characters). Now use cellstr to create a column vector of cells, each cell containing one of the strings from the data array:

    celldata = cellstr(data)
    celldata =
        'Allison Jones'
        'Development'
        'Phoenix'

For details on cell arrays, see http://www.mathworks.com/access/helpdesk/help/techdoc/matlab_prog/ch_da37a.html#67323

Cell Arrays of Strings (cont.)

Note that the cellstr function strips off the blanks that pad the rows of the input string matrix:

    length(celldata{3})
    ans =
         7

The iscellstr function determines if the input argument is a cell array of strings. It returns a logical 1 (true) in the case of celldata:

    iscellstr(celldata)
    ans =
         1

Use char to convert back to a standard padded character array:

    strings = char(celldata)
    strings =
    Allison Jones
    Development
    Phoenix

    length(strings(3,:))
    ans =
        13

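The round trip between the padded character array and the cell array can be checked in a few lines. This is a sketch only: cellfun is not covered in the text above and is used here simply to collect the length of every cell in one call.

```matlab
data     = char('Allison Jones', 'Development', 'Phoenix');  % padded to 13 columns
celldata = cellstr(data);              % padding stripped from each row

lens  = cellfun(@numel, celldata);     % [13; 11; 7], true length of each string
width = size(data, 2);                 % 13, every row of data has this length

% Converting back with char restores exactly the original padded array.
roundTrip = isequal(char(celldata), data);   % logical 1 (true)
```
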
Functions for Cell Arrays of Strings

[Table: MATLAB functions for working with cell arrays]
[Table: set functions with cell arrays of strings]

See http://www.mathworks.com/access/helpdesk/help/techdoc/matlab_prog/ch2_ch15.html

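As an illustration of set functions applied to cell arrays of strings, here is a minimal sketch. The choice of functions (intersect, union, setdiff, ismember) is an assumption about what the set-function table listed, and the gene names are hypothetical example values.

```matlab
% Hypothetical gene lists, chosen only for illustration.
A = {'BRCA1', 'TP53', 'EGFR'};
B = {'TP53', 'MYC', 'EGFR'};

common = intersect(A, B);      % strings in both lists, sorted: EGFR, TP53
either = union(A, B);          % strings in either list, sorted, no repeats
onlyA  = setdiff(A, B);        % strings in A but not in B: BRCA1

tf = ismember('MYC', A);       % logical 0 (false): 'MYC' is not in A
```

These functions compare whole strings, so 'TP53' and 'tp53' would count as different elements.
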
String Comparisons

Comparing Strings for Equality

Any of the following four functions can be used to determine if two input strings are identical:

strcmp determines if two strings are identical.
strncmp determines if the first n characters of two strings are identical.
strcmpi and strncmpi are the same as strcmp and strncmp, except that they ignore case.

Example:

    str1 = 'hello';
    str2 = 'help';

Strings str1 and str2 are not identical, so invoking strcmp returns logical 0 (false):

    C = strcmp(str1,str2)
    C =
         0

The first three characters of str1 and str2 are identical, so invoking strncmp with any value up to 3 returns 1:

    C = strncmp(str1, str2, 2)
    C =
         1

Comparing Strings for Equality (cont.)

These functions work cell-by-cell on a cell array of strings. Consider the two cell arrays of strings

    A = {'pizza'; 'chips'; 'candy'};
    B = {'pizza'; 'chocolate'; 'pretzels'};

Now apply the string comparison functions:

    strcmp(A,B)
    ans =
         1
         0
         0

    strncmp(A,B,1)
    ans =
         1
         1
         0

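The case-insensitive variants, and the comparison of one string against a whole cell array, can be sketched as follows. The input strings are hypothetical and chosen only for illustration.

```matlab
sameIgnoringCase = strcmpi('Hello', 'HELLO');      % 1: identical apart from case
samePrefix       = strncmpi('HELP', 'hello', 3);   % 1: 'HEL' matches 'hel'
caseSensitive    = strcmp('Hello', 'HELLO');       % 0: case differs

% A single string is compared against every cell of a cell array.
A = {'pizza'; 'chips'; 'candy'};
hits = strcmp('chips', A);                         % [0; 1; 0]
```

The logical vector hits can be passed to find, or used directly as an index into A.
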
Comparing for Equality Using Operators

You can use MATLAB relational operators on character arrays, as long as the arrays you are comparing have equal dimensions, or one is a scalar. For example, you can use the equality operator (==) to determine which characters in two strings match:

    A = 'fate';
    B = 'cake';
    A == B
    ans =
         0     1     0     1

All of the relational operators (>, >=, <, <=, ==, ~=) compare the values of corresponding characters.

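One use of element-wise comparison is counting the positions at which two equal-length sequences differ. The sketch below assumes two hypothetical DNA strings of the same length; with unequal lengths the operators raise an error, in which case strcmp is the safer test for whole-string equality.

```matlab
% Hypothetical sequences of equal length.
s1 = 'GATTACA';
s2 = 'GACTATA';

mismatch  = (s1 ~= s2);        % logical vector, 1 where the characters differ
nMismatch = sum(mismatch);     % 2
positions = find(mismatch);    % [3 6]

% Comparison against a scalar character is applied to every element.
isA = (s1 == 'A');             % [0 1 0 0 1 0 1]
```
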
Categorizing Characters Within a String

There are three functions for categorizing characters inside a string:

1. isletter determines if a character is a letter.
2. isspace determines if a character is white space (blank, tab, or new line).
3. isstrprop checks characters in a string to see if they match a category you specify, such as alphabetic, alphanumeric, lowercase or uppercase, decimal digits, hexadecimal digits, control characters, graphic characters, punctuation characters, or white-space characters.

For example, create a string named mystring:

    mystring = 'Room 401';

isletter examines each character in the string, producing an output vector of the same length as mystring:

    A = isletter(mystring)
    A =
         1     1     1     1     0     0     0     0

The first four elements in A are logical 1 (true) because the first four characters of mystring are letters.

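The other two functions can be applied to the same string. This is a minimal sketch: 'digit' and 'upper' are two of the category names accepted by isstrprop, and str2double is used only to show one way of consuming the resulting mask.

```matlab
mystring = 'Room 401';

spaces = isspace(mystring);               % [0 0 0 0 1 0 0 0]
digits = isstrprop(mystring, 'digit');    % [0 0 0 0 0 1 1 1]
upper  = isstrprop(mystring, 'upper');    % [1 0 0 0 0 0 0 0]

% A logical mask can index the string to pull out the matching characters.
roomNumber = str2double(mystring(digits));   % 401
```
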