240 likes | 388 Views
Pairwise Alignment, Part II. Reconstructing the Path and Calculating the LCS from the Values and Directions Tables . SETUP: Part II involves using the Visual Basic Editor ( VBE ) to write VBA Function & Subroutines , and creating Buttons .
E N D
Pairwise Alignment, Part II Reconstructing the Path and Calculating the LCS from the Values and Directions Tables
SETUP: Part II involves using the Visual Basic Editor (VBE) to write VBA Function & Subroutines, and creating Buttons. 1. Add the Developer tab to the ribbon. Click the Office button , and then click the Excel Options button at the bottom. In the Popular category, select the Show Developer tab in the Ribbon check box from the options that appear and then click OK. 2. Click the Developer tab in the ribbon whenever you want to access the Visual Basic button in the Code group. Note: You can switch back and forth between workbook environment and the Visual Basic Editor (VBE) environment with the Alt+F11 combination or with the application tabs at the bottom of the Windows screen.
Excel 2007 comes with a new file format: Macro-Enabled Workbook that uses the file extension .xlsm. You can no longer run macros and VBA scripts in regular Excel files, i.e. those that have the extension .xlsx . Therefore you will need to click the Office button , select Save As Excel Macro-EnabledWorkbook to save your spreadsheet so that it will run the scripts that you will now be adding.
Open the Visual Basic Editor (VBE) and select View → Project Explorer (Ctrl+R). You will now see a listing of the 2 worksheets – Values and Directions – in your workbook. Double-click on the Values item to view the Code Section. Visual Basic code uses the worksheet in which it resides as its default location when determining cell references. You will be placing the VB code into the Values sheet.
in the VBA Editor, copy and paste the maxAddress function and 2 subroutines clearLCS and showLCS on this slide and the next into the Code Section of the Values Worksheet. (We will explain what the VB script code does after you get it working) Option Explicit FunctionmaxAddress(The_Range) DimmaxNumAsLong Dim Cell As range maxNum = Application.Max(The_Range) ' Returns the LAST cell in the range that = maxNum For Each Cell InThe_Range If Cell = maxNumThen maxAddress = Cell.Address EndIf Next Cell EndFunction Sub clearLCS() range("C10:M20").Interior.ColorIndex = xlColorIndexNone Worksheets("Directions").range("C10:M20").Interior.ColorIndex = xlColorIndexNone range("C4:C6").Value = "" End Sub
SubshowLCS() Dim direction, maxCellAddress, VALUES_TABLE_DATA_RANGE As String Dim r, c, UP_ARROW, LEFT_ARROW, BACKSLASH, CLR_GRAY, ROW_LTR, COL_LTR As Long DimbMATCH, bLEFT, bUPAs Boolean ' CONSTANT declarations UP_ARROW = ChrW(8593) 'Hex 2191 = Dec 8593 Up Arrow LEFT_ARROW = ChrW(8592) 'Hex 2190 = Dec 8592 Left Arrow BACKSLASH = ChrW(92) 'Hex 5C = Dec 92 CLR_GRAY = RGB(192, 192, 192) ROW_LTR = 9 COL_LTR = 2 VALUES_TABLE_DATA_RANGE = "D11:M20" maxCellAddress = maxAddress(range(VALUES_TABLE_DATA_RANGE)) r = range(maxCellAddress).Row c = range(maxCellAddress).Column Do Cells(r, c).Interior.Color = CLR_GRAY Worksheets("Directions").Cells(r, c).Interior.Color = CLR_GRAY direction = Worksheets("Directions").Cells(r, c).Value bMATCH = (direction = BACKSLASH) bLEFT = (direction = LEFT_ARROW) Or (direction = "0" And c > COL_LTR + 1) bUP = (direction = UP_ARROW) Or (direction = "0" And r > ROW_LTR + 1) IfbMATCHThen r = r - 1 c = c - 1 ElseIfbLEFTThen c = c - 1 ElseIfbUPThen r = r - 1 EndIf LoopWhile c > COL_LTR + 1 Or r > ROW_LTR + 1 EndSub
You will now insert 2 buttons in the Values worksheet that will floatin the space above the cells A4:A6. In the Developer tab, select Insert, then choose the Button (Form Control). Draw the outlines of your buttons with the mouse. When the Assign Macro Dialog appears, select Sheet1.showLCS. Name the button showLCS. Create a 2nd button, assign the macro Sheet1.clearLCS and name the buttonclearLCS.
When you click on the showLCS button, if Seq 1 is ATCTGAT and Seq 2 is TGCATA, then the data table cells should be highlighted in grey as in the picture below. The clearLCS button should remove the background color.
Reconstruction of the Path We reconstruct the LCS path by locating the cell containing the largest value in the Values table, then following the arrows back to a cell containing the value zero. If there are several cells with the largest value, all are possible LCS’s. However, in this exercise, we will pick the one that is most distant. If the beginning letter(s) of the 2 sequences did not match, then at this point, we have only reached the beginning letter of one of the sequences (the T in the vertical sequence). In order to reach the beginning letter of the 2nd sequence (the A in the horizontal sequence), we need to continue traveling either left or up.
What does the VBA code DO? What does it MEAN? Option Explicit FunctionmaxAddress(The_Range) DimmaxNumAsLong Dim Cell As range maxNum = Application.Max(The_Range) ' Returns the LAST cell in the range that = maxNum For Each Cell InThe_Range If Cell = maxNumThen maxAddress = Cell.Address EndIf Next Cell EndFunction Option Explicit: In most programming languages, variables must be DECLARED before they are used. Certain scripting languages, like VBA and Javascript, allow programmers to use variables without declaring them. This can lead to huge problems when trying to track down bugs, since a simple misspelling of a variable name creates a new variable, different from the one intended. The Option Explicitdeclaration is a safeguard measure that will show the programmer a compile error if VBA encounters a variable that has not previously been declared.
Functions and the function maxAddress Option Explicit FunctionmaxAddress(The_Range) DimmaxNumAsLong Dim Cell As range maxNum = Application.Max(The_Range) ' Returns the LAST cell in the range that = maxNum For Each Cell InThe_Range If Cell = maxNumThen maxAddress = Cell.Address EndIf Next Cell EndFunction Function: In VBA, a FUNCTION returns a value, while a SUBROUTINE does not. One implements the return value of a function by assigning the function NAMEa value somewhere in the function’s body. No explicit return statement is used. The maxAddress function 1st uses the common Excel worksheet function max() to find the maximum value in a range and assign it to maxNum. It then uses a for-each loop (analogous to a Java for-each loop) to (a) iterate through all of the cells in the range and (b) return the address (cell reference) of the LAST cell it finds whose value equals maxNum. NOTE: If one wanted to return the 1st cell that the for-each loop found equal to maxNum, one would place an EXIT FUNCTION or EXIT FOR statement at the end of the IF statement.
Subroutines and the subroutine clearLCS Sub clearLCS() range("C10:M20").Interior.ColorIndex = xlColorIndexNone Worksheets("Directions").range("C10:M20").Interior.ColorIndex = xlColorIndexNone range("C4:C6").Value = "" End Sub Subroutine: In VBA, a SUBROUTINE does not return a value, but technically speaking is just like a FUNCTION. Subroutines, however, are generally written to perform actions. In contrast, functions are meant to calculate and return values, but not perform actions. The clearLCS() subroutine clears the background color of all cells in the indicated range. It does so not only in the default worksheet (i.e. the VALUES worksheet, line 1 of the subroutine body), but in the DIRECTIONS worksheets as well (line 2). Line 3 of the subroutine body clears any text in the range C4:C6, although we haven’t yet placed anything in these cells. Later, we will place strings in these cells that will show the alignment and matching bases for the 2 sequences.
The subroutine showLCS SubshowLCS() Dim direction, maxCellAddress, VALUES_TABLE_DATA_RANGE As String Dim r, c, UP_ARROW, LEFT_ARROW, BACKSLASH, CLR_GRAY, ROW_LTR, COL_LTR As Long DimbMATCH, bLEFT, bUPAs Boolean ' CONSTANT declarations UP_ARROW = ChrW(8593) 'Hex 2191 = Dec 8593 Up Arrow LEFT_ARROW = ChrW(8592) 'Hex 2190 = Dec 8592 Left Arrow BACKSLASH = ChrW(92) 'Hex 5C = Dec 92 CLR_GRAY = RGB(192, 192, 192) ROW_LTR = 9 COL_LTR = 2 VALUES_TABLE_DATA_RANGE = "D11:M20" maxCellAddress = maxAddress(range(VALUES_TABLE_DATA_RANGE)) r = range(maxCellAddress).Row c = range(maxCellAddress).Column The code above shows the initial statements for the showLCS() subroutine, that is, the statements that precede the do while loop. The 1st section in showLCS() contains the Dim statements, which is where the local variables are declared. Notice that some of these variables are capitalized to show that they are CONSTANTS, i.e. the values they are first assigned will not change during the execution of the program. The next command uses the maxAddress function to fetch the cell reference of the cell in the Values table that contains the largest value and is closest to the end of the table. Finally, we extract the row and column numbers from the maximum value cell address.
The subroutine showLCS(continued) r = range(maxCellAddress).Row c = range(maxCellAddress).Column Do Cells(r, c).Interior.Color = CLR_GRAY Worksheets("Directions").Cells(r, c).Interior.Color = CLR_GRAY direction = Worksheets("Directions").Cells(r, c).Value bMATCH = (direction = BACKSLASH) bLEFT = (direction = LEFT_ARROW) Or (direction = "0" And c > COL_LTR + 1) bUP = (direction = UP_ARROW) Or (direction = "0" And r > ROW_LTR + 1) IfbMATCHThen r = r - 1 c = c - 1 ElseIfbLEFTThen c = c - 1 ElseIfbUPThen r = r - 1 EndIf LoopWhile c > COL_LTR + 1 Or r > ROW_LTR + 1 EndSub We’ll now examine the code within showLCS’sDo While Loop. The 1st section in the Do While Loopshades the background color of the last cell In the Values table whose value equals the table’s maximum (i.e. the cell we start with to retrace the path), as well as the corresponding cell in the Directions table. It also retrieves the direction from the latter cell. The direction variable is used to set the values for the 3 boolean variables: bMATCH, bLEFT and bUP, depending upon whether the direction indicates diagonal, left or up. The 2nd part of bLEFT’s and bUP’sbooleanOr expressions is not used until the loop has reached the beginning of one of the 2 sequences.
The subroutine showLCS(continued) r = range(maxCellAddress).Row c = range(maxCellAddress).Column Do Cells(r, c).Interior.Color = CLR_GRAY Worksheets("Directions").Cells(r, c).Interior.Color = CLR_GRAY direction = Worksheets("Directions").Cells(r, c).Value bMATCH = (direction = BACKSLASH) bLEFT = (direction = LEFT_ARROW) Or (direction = "0" And c > COL_LTR + 1) bUP = (direction = UP_ARROW) Or (direction = "0" And r > ROW_LTR + 1) IfbMATCHThen r = r - 1 c = c - 1 ElseIfbLEFTThen c = c - 1 ElseIfbUPThen r = r - 1 EndIf LoopWhile c > COL_LTR + 1 Or r > ROW_LTR + 1 EndSub If the direction value is diagonal, then bMATCH is true. In this case, we decrement both the row (r) and the column (c) variables so that in the next iteration, the cell we will examine will be the cell located diagonally to the current cell’s top left. If the direction value is left, then bLEFT is true. In this case, we decrement just the column (c) variable so that in the next iteration, we will examine the cell to the left of the current cell. Finally, if the direction value is up, then bUP is true. In this case, we decrement just the row (r) variable so that in the next iteration, we will examine the cell above the current cell.
The subroutine showLCS(continued) r = range(maxCellAddress).Row c = range(maxCellAddress).Column Do Cells(r, c).Interior.Color = CLR_GRAY Worksheets("Directions").Cells(r, c).Interior.Color = CLR_GRAY direction = Worksheets("Directions").Cells(r, c).Value bMATCH = (direction = BACKSLASH) bLEFT = (direction = LEFT_ARROW) Or (direction = "0" And c > COL_LTR + 1) bUP = (direction = UP_ARROW) Or (direction = "0" And r > ROW_LTR + 1) IfbMATCHThen r = r - 1 c = c - 1 ElseIfbLEFTThen c = c - 1 ElseIfbUPThen r = r - 1 EndIf LoopWhile c > COL_LTR + 1 Or r > ROW_LTR + 1 EndSub ROW_LTR = 9 COL_LTR = 2 ROW_LTR + 1 = 10 COL_LTR + 1 = 3 The subroutine keeps looping as long as c > COL_LTR + 1 Or r > ROW_LTR + 1 COL_LTRand ROW_LTRare the column and row containing the letters of the 2 sequences Hence COL_LTR + 1 and ROW_LTR + 1are the column and row that contain the zeroes. Therefore, the loop will keep iterating until r and c correspond to cell C10 (or cells above it or to its left).
The subroutine showLCS(continued) r = range(maxCellAddress).Row c = range(maxCellAddress).Column Do Cells(r, c).Interior.Color = CLR_GRAY Worksheets("Directions").Cells(r, c).Interior.Color = CLR_GRAY direction = Worksheets("Directions").Cells(r, c).Value bMATCH = (direction = BACKSLASH) bLEFT = (direction = LEFT_ARROW) Or (direction = "0" And c > COL_LTR + 1) bUP = (direction = UP_ARROW) Or (direction = "0" And r > ROW_LTR + 1) IfbMATCHThen r = r - 1 c = c - 1 ElseIfbLEFTThen c = c - 1 ElseIfbUPThen r = r - 1 EndIf LoopWhile c > COL_LTR + 1 Or r > ROW_LTR + 1 EndSub r = range(maxCellAddress).Row c = range(maxCellAddress).Column Do Cells(r, c).Interior.Color = CLR_GRAY Worksheets("Directions").Cells(r, c).Interior.Color = CLR_GRAY direction = Worksheets("Directions").Cells(r, c).Value bMATCH = (direction = BACKSLASH) bLEFT = (direction = LEFT_ARROW) Or (direction = "0" And c > COL_LTR + 1) bUP = (direction = UP_ARROW) Or (direction = "0" And r > ROW_LTR + 1) IfbMATCHThen r = r - 1 c = c - 1 ElseIfbLEFTThen c = c - 1 ElseIfbUPThen r = r - 1 EndIf LoopWhile c > COL_LTR + 1 Or r > ROW_LTR + 1 EndSub r = 3 ( ROW_LTR + 1 ) c = 5 ( > COL _LTR + 1) If the boolean expressions for bLEFT and bUP consisted only of the 1st condition, then r and c would stop being decremented when either the row or column (or both) reached the cells containing zeroes. When the loop reaches one of the cell’s containing a zero, the direction variable will contain a “0”, not one of the 3 direction values. At this point, we have reached the beginning of one of the sequences (in the table above, the blue sequence). Therefore to continue moving left (to get to the beginning of the red sequence), we need to add an expression that recognizes this case. If we want to move left, that means that c > COL_LTR + 1(since at this point r = ROW_LTR + 1). Similarly, had we wanted to move up, that would mean that r > ROW_LTR + 1.
Reconstructing the LCS Letter Sequence The bases that the 2 sequences have in common are in the cells containing backslashes. In order to compare the 2 sequences, we need to line up the strings next to each other and show both the bases that they have in common, and where they differ. To do this, we will declare 3 string variables: strSeq1, strSeq2 and strLCS. Open VBA. Add the Dim statement below to showLCS() to declare these 3 String variables. Place in the section with the other variable declarations. Dim strSeq1, strSeq2, strLCSAs String
Reconstructing the LCS Letter Sequence (continued) r = range(maxCellAddress).Row c = range(maxCellAddress).Column strSeq1 = "" : strSeq2 = "" : strLCS = "" ' Initialize to empty strings Do Cells(r, c).Interior.Color = CLR_GRAY Worksheets("Directions").Cells(r, c).Interior.Color = CLR_GRAY direction = Worksheets("Directions").Cells(r, c).Value bMATCH = (direction = BACKSLASH) bLEFT = (direction = LEFT_ARROW) Or (direction = "0" And c > COL_LTR + 1) bUP = (direction = UP_ARROW) Or (direction = "0" And r > ROW_LTR + 1) If bMATCH Then r = r - 1 c = c - 1 ElseIfbLEFT Then c = c - 1 ElseIfbUP Then r = r - 1 End If Loop While c > COL_LTR + 1 Or r > ROW_LTR + 1 range("C4").Value = strSeq1 range("C5").Value = strLCS range("C6").Value = strSeq2 Initialize the strings so they are empty (the colon character allows you to put multiple statements on a single line). Place this line just before the Do-While Loop. After the Do-While Loop, add the statements for pasting these 3 strings into the Values worksheet in the 3 cells C4:C6. (Recall that these were the cells on line 3 of the clearLCS() subroutine body.)
Reconstructing the LCS Letter Sequence (continued) If bMATCH Then strSeq1 = Cells(ROW_LTR, c).Value + strSeq1 strSeq2 = Cells(r, COL_LTR).Value + strSeq2 strLCS = Cells(r, COL_LTR).Value + strLCS r = r - 1 c = c - 1 ElseIfbLEFT Then c = c - 1 ElseIfbUP Then r = r - 1 End If Add the 3 lines above to the body of the If bMATCHstatement. These statements will PREPEND the letter from the sequence into the string. We need to PREPEND each letter encountered because we are building the LCS Letter Sequence in reverse, from tail to head. Note that to capture the letter from the 1st sequence – as we move backwards from column to column, we are always referencing the letter from the ROW_LTRrow (row 9). To capture the letter from the 2nd sequence – as we move backwards from row to row, we are always referencing the letter from the COL_LTRcolumn (column 2). For the string that will show the common letters shared by the 2 aligned sequences, it doesn’t matter whether we use cell(RO W_LTR, C)or cell(r, COL_LTR), because the letters in both of these cells match. Note the importance of placing these 3 new statements BEFORE the lines that decrement r and c; otherwise, we would be extracting letters from the wrong cells!
Reconstructing the LCS Letter Sequence (continued) Click on the showLCS button. You should see the LCS (LongestCommonSubsequence) appear in each of the cells C4:C6. Format C4red and C6blue to visually connect each to their original sequence. Although the text that all 3 contain is identical right now, we will be making changes that will make each distinct. Also format these 3 cells so that they use the Courier New font (size 14 recommended). Courier New is a FIXED-WIDTH font, meaning that all of its characters are the same width (unlike most other fonts). Using this font will prove extremely useful in viewing similarities and differences between the 2 sequences because corresponding letters in the 2 strings will automatically line up vertically.
Reconstructing the LCS Letter Sequence (continued) If bMATCH Then strSeq1 = Cells(ROW_LTR, c).Value + strSeq1 strSeq2 = Cells(r, COL_LTR).Value + strSeq2 strLCS = Cells(r, COL_LTR).Value + strLCS r = r - 1 c = c - 1 ElseIfbLEFT Then strSeq1 = Cells(ROW_LTR, c).Value + strSeq1 strSeq2 = "-" + strSeq2 strLCS = "-" + strLCS c = c - 1 ElseIfbUP Then strSeq1 = "+" + strSeq1 strSeq2 = Cells(r, COL_LTR).Value + strSeq2 strLCS = "+" + strLCS r = r - 1 End If Add each of the 3 line code sections above to the If bLEFT and If bUPstatements, exactly as you just did with the bMATCH statement. Recall that when there is a match, we captured the shared letters in all 3 strings. When we travel LEFT, changing columns, we want to copy the letters from Sequence 1 in the ROW_LTR (9th) row into strSeq1. At the same time, we need to place a character in both strSeq2 and strLCS to preserve the alignment so that corresponding characters in all 3 strings will coincide. Therefore we add the "–" character to indicate a deletion.
Reconstructing the LCS Letter Sequence (continued) If bMATCH Then strSeq1 = Cells(ROW_LTR, c).Value + strSeq1 strSeq2 = Cells(r, COL_LTR).Value + strSeq2 strLCS = Cells(r, COL_LTR).Value + strLCS r = r - 1 c = c - 1 ElseIfbLEFT Then strSeq1 = Cells(ROW_LTR, c).Value + strSeq1 strSeq2 = "-" + strSeq2 strLCS = "-" + strLCS c = c - 1 ElseIfbUP Then strSeq1 = "+" + strSeq1 strSeq2 = Cells(r, COL_LTR).Value + strSeq2 strLCS = "+" + strLCS r = r - 1 End If When we travel UP, changing rows, we want to copy the letters from Sequence 2 in the COL_LTR (2nd) column into strSeq2. At the same time, we need to place a character in both strSeq1 and strLCS to preserve the alignment so that corresponding characters in all 3 strings will coincide. Therefore we add the "+" character to indicate an insertion. The indel (insertion/deletion) convention that we’ve implemented corresponds to a method for CONVERTING Seq 1 to Seq 2. Had we wanted an implementation that would have converted Seq 2to Seq 1, we would have reversed our usage of the + and - characters.
Reconstructing the LCS Letter Sequence (continued) When completed, clicking on the showLCS button when Sequence 1 = ATCTGATand Sequence 2 = TGCATAwill show a pairwise alignment as in the image to the left. strSeq1 strLCS strSeq2 Notice the usage of the plus and minus characters. To CONVERTSeq 1to Seq 2, we need to delete the 1st three characters of Seq 1(ATC), hence 3 minus signs appear at the beginning of strLCS (and at the beginning of strSeq2 to align it with strSeq1). When we reach the 6th and 9th positions, we need to insert two characters (C and A), hence our use of the 2 plus signs.