380 likes | 483 Views
Using Microsoft Excel Automation to Speed and Streamline Data Formatting and Analysis. G. James Lemoine, Jr. Scheller College of Business Georgia Institute of Technology. Welcome to “Learning to Program in 2 Hours”. Macros are mini-programs that you can easily create in Excel
E N D
Using Microsoft Excel Automation to Speed and Streamline Data Formatting and Analysis G. James Lemoine, Jr. Scheller College of Business Georgia Institute of Technology
Welcome to “Learning to Program in 2 Hours” • Macros are mini-programs that you can easily create in Excel • They can make Excel do just about anything, from building spreadsheets from scratch to analyzing data to manipulating and deleting files from your coworkers' hard drives. • We won't be covering that last one. • Let's talk about some sample macros and take a look at how they work.
Here’s a Simple Example Say that every time a manager receives a new sales number, she has to copy it from one column, paste just the value (not the formatting) into another, format it as currency, and add some formulas. Also, if the day's sales are below a certain amount, the date has to go into a separate "watch list" column. She does this a lot, but she'd do it faster if she made a robot do it for her. Open Simple_Sales_Spreadsheet. Put a value in cell A9. Select cell A9. Hit Ctrl+W.
Now for a More Complicated Example Close the last spreadsheet. Say that you deal a lot with stock quotes, and every morning you need to see a some basic information. Open the "Stock_Puller" file Looks like somebody got the symbol for AT&T wrong – it's not ATT, it's just T. Go ahead and change it. You can add more symbols to the bottom, too. Note that you can use data validation to change what information you pull from Yahoo (in row 6). Close this for now.
And now for a Very Complicated example Close that file and open Survey_Automation This is a very complicated set of macros I built for Georgia Tech’s Institute for Leadership and Entrepreneurship to take raw survey data inputted by undergrads and convert it to an SPSS-ready format. Read the instructions to see how it works. This took about a week to build, but saved us about 3 months of work manually inputting data and building formulas. This macro literally does in seconds what it was taking us weeks to format and analyze.
But How Do You Do This? Close any open files. Open a new spreadsheet. Now it's your turn. But first: do you see the "Developer" tab? You may or may not. If you don't: Go to the File tab Click "Options" on the left (right above "Exit") Click "Customize Ribbon" In the rightmost list, make sure that there's a check next to "Developer". Hit Okay. The Developer tab gives you shortcuts to this stuff.
No, Really, How Do You Do This? On the Developer tab, click "Record Macro." Choose a name for your macro (I called mine “Kermit") and then assign it a keyboard shortcut you wouldn't normally use. (Ctrl+W?) Note that you can store the macro in your Personal Macro Workbook or in this workbook… just put it in this one. If you put it in your personal macro workbook, though, you can access it in ANY Excel file you use. Otherwise it's limited to this file… which is usually (but not always) the way you'll want it.
Time to Program! Hit OK. Now do something. Add text. Write formulas. Format your cells. Hide a column. Copy and paste something. Whatever. Have fun. But only for 15 seconds. Now push the "Stop Recording" button on the Developer tab. Erase everything you've done, unhide anything you hid, etc. Make it like it never happened. Run your macro, either with the Macros button or with your keyboard shortcut.
That's the Easiest Way And it's pretty neat. But you can also build macros from scratch, without recording. You can also see the actual programming code of the macros you've built (or recorded) and edit it. On the Developer tab, hit the Macros button, then choose to Edit the macro you built earlier. Look complicated? Some of it is, but most of it is fairly straightforward.
Ready to Program? Do exactly what I say. Close open sheets. Open a new one. Record a new macro. Give it a keyboard shortcut, name it whatever you want. Hit OK. In cell A1, type 24 and press enter. In cell A2, type 26 and press enter. In cell A3, type =SUM(A1:A2) and press enter. Select cell B3. Type "This is the sum" and press enter. Now press the stop recording button.
Let's Look at What We Coded Now press Macros and let's go edit your macro. Note that Excel seems to write code in the most complicated way possible when you record a macro – when we actually code ourselves, we'll use simpler syntax. Sub MacroName ( ) tells you the name of the macro. This is where it starts. End Sub is where it ends. Those green lines with an apostrophe are Comment lines - not code, just notes for you to leave behind. ActiveCell.FormulaR1C1="24" just means that your macro typed 24 in the active cell (which was A1). Range("A2").Select says that you moved the selection cursor to A2. What does ActiveCell.FormulaR1C1="=SUM(R[-2]C:R[-1]C)" do?
Given our limited time… Let’s run through a practical example of Excel coding. The “RWG Calculator” file contains a set of macros I built to handle two things that typically took up too much of my research time: Replace “Missing value” codes from SPSS with blank spaces so that Excel can work with the data Automatically calculate RWG(j) statistics for different distributions Let’s walk through how these macros for data formatting and analysis work. For more on RWG statistics and agreement in general, I highly recommend “Answers to 20 Questions about Interrater Reliability and Interrater Agreement” by LeBreton and Senter, ORM 2008.
Here’s How it Works We’d copy our data (without header rows) starting on Row 21. I’m going to use the sample data from the sample tab. The sample data has data for fifty cases on four 7-point items. Excel doesn’t understand missing data markers like SPSS and other statistical programs do, so we need to replace those with blank spaces. Most people would do this by hand, which can take a long time in big datasets. Pressing Control-W runs an automatic program that does this for you, according to whatever missing value code you put in cell B6.
Here’s How it Works With that done, Excel can compute the agreement statistic for this data for you when you hit Control-R. Note that it also provides you with the means and variances for each item in rows 12 through 14. A normal distribution may not be the most realistic assumption for survey data, though, so you can click in cell H3 and use the drop-down box to change the distribution assumption, likely to something a little more skewed. Then hit Control-R to recalculate the rwg statistic.
Here’s How it Works Finally, you may want to clear the sheet so that you can input a new set of data. Pressing Control-E will erase all of the data and results, but not the basic template. Key point: You cannot undo a macro! I coded all of this on a Monday morning. Like just about anything statistical we do, if you practice a bit, it doesn’t take long to make the numbers dance. Pretty much all of the basic tools are in this code to have Excel do whatever you want. Let’s explain the code.
The code On the Developer tab, hit the Macros button. Now let’s edit the SheetErase macro (we start with this one because it’s the simplest). You can also get here by pressing the Visual Basic button. Note on the left side of the screen that macros are kept in modules within workbooks. Workbooks without macros don’t have modules, but you can add them with the Insert menu at the top of this screen. You can keep all your macros on one module, or put them each on different ones – it doesn’t really matter. I put all three in the same module in this example.
Sub SheetErase – the For statement The first line, Sub SheetErase(), merely starts the program and gives it a name (SheetErase). End Sub, a few lines down, ends it. For ColumnCounter=3 To 1000 is the start of a For/Next loop, which many of our statistical programs also use. This basically tells Excel it’s going to count from 3 to 1000 in that ColumnCounter variable, repeating all of the lines of code between the For and the Next for each count. So Excel is going to repeat those four lines of code 998 times, automatically!
Sub SheetErase – Cells references Cells(12, ColumnCounter), in the next line, is a reference to a specific cell on the worksheet. This is backwards from what you might expect: the first number is the cell row, and the second number is the cell column, which Excel usually displays as a letter. So Cells(1,1), for instance, would be cell A1. Cells(2,1) would be A2, and Cells(3,1) would be A3. In the If statement on the third line, Cells(12, ColumnCounter) refers to a cell on row 12, in whatever column the counter is on (remember, ColumnCounter is counting from 3 to 1000.
Sub SheetErase – If statements If Cells(12, ColumnCounter) = “” Then Exit For, is a basic If statement like many of our programs have. Key Point: All If’s must have a Then! It’s not like SPSS! Excel looks in the selected column (starting with 3, up to 1000) at row 12 to see if it’s blank. If it’s blank, then it knows it’s done, it’s reached the end of the data, so it doesn’t need to keep counting to 1,000. So Exit For breaks it out of the For Next loop. If this doesn’t make sense yet, it should after the next slide.
Sub SheetErase – For/Next continued The next three lines only get run if Excel hasn’t “Exitted the For,which it only would have done if the column was blank. Since the column isn’t blank, and since this is an erasing program, this is what does the erasing! These three lines erase the entries in the current column in rows 12, 13, and 14. Then the Next adds one to ColumnCounter and sends the program back up to the For statement to start over on the next column. You can use For, Next, and If to automate just about any data formatting you can imagine.
Sub SheetErase – Cell selection You can probably figure out the next five lines of the code. Cells(16,4)=“” erases cell D16. Cells(16,7)=“” erases cell H16. Rows(“21:10021”).Select selects rows 21 through 10021, just as if you’d selected them yourself with the mouse. Selection.Delete deletes what you selected. Cells(21,1).Select selects cell A21 so that the spreadsheet is ready to paste in new data. That’s it!
Sub MissingData – If and MsgBox Scroll up to the top of the module to find the MissingData subroutine – it’s a bit more complicated. The opening If statement checks to see if cell B3 (the # of items in the data) is a non-blank number. If it’s not a number, or if the cell’s blank, the macro doesn’t have the information it needs to do anything. So if it finds this problem, it displays a helpful message box telling the user where the problem is. Then the Exit Sub ends the program without doing anything else. Note the End If below that. You can put a block of statements after an If, if you want Excel to do more than one thing. But then you have to put an End If in to tell Excel when it should start doing things regardless of the If statement again.
Sub MissingData – If and Exit Sub The same thing is going on in the next 4 lines, which begin with If Cells(6,2)=“” Then. Excel checks to see if something has been entered for a missing data marker in cell B6. This program replaces missing data markers with blanks – so if there is no missing data marker, there’s nothing for it to do! In that case, it displays a message and ends the macro. These are great to put at the beginning of the macro to check that any assumptions are met.
Sub MissingData - Variables In the next 3 lines, we’re establishing new variables. LastColumn=Cells(3,2) takes whatever’s in cell B3 (should be our number of items) and puts it in a variable called LastColumn (I named it that because however many items there are should also be how many columns there are in the dataset, since each column is an item). LastRow = 0 sets this variable equal to zero, since Excel doesn’t know what the last row of data is yet. MissingValue=Cells(6,2) takes whatever’s in cell B6 and puts it in the variable called MissingVariable.
Sub MissingData – Nested For’s A little more complicated – the next two lines nest two For statements. For RowCounter=21 to 10000 starts a new variable called RowCounter, and tells it to count from 21 to 10,000 (because I want it to start at row 21 and I figure there won’t be more than 10,000 rows of data). For ItemCounter=1 to LastColumn starts a new variable called ItemCounter (which I’ll use to count through columns), and tells it to count from 1 to whatever the last column with data is. So it counts through every column of every row in the data: this is how you automate Excel doing something to every cell in your data.
Sub MissingData – Finding the last row In other words, with these two For/Next loops, we can have Excel look at cell A21, then B21, then C21, until the end of the row, then it will move on to A22 and repeat. It will keep going until row 10,000 unless we tell it otherwise. The point of this block of code is actually to automatically find the end of the data – I assume here that the end of the data will be the first fully blank row it finds. The If statements within these two For/Next loops are basically looking for a fully blank row.
Sub MissingData – More If’s The For RowCounter loop will go through rows 21 to 10000, while the For ItemCounter loop will go through each column within each row. In order, If Cells(RowCounter,ItemCounter)<>”” Then checks to see if the current cell (in the current row and column) is blank. If it’s not blank, then there’s data there, so this couldn’t possibly be the last row. Thus, the Exit For skips out of the ItemCounter For/Next loop. The ItemCounter loop goes through every cell in a row, but since we know this isn’t the last row, we don’t need to spend more time here. This moves it on to the next row.
Sub MissingData – More If’s The next line is If ItemCounter=LastColumn Then. This asks, are we on the last column (or last item) of the row (or case). Remember, the If before this would have kicked us out of this row if it found any blank cells on the row. So if we made it this far, with all cells blank and we’re on the last column, then the row must be blank. If the row is blank, then the previous row, the row above it, should have been the last row of the data. So, LastRow = RowCounter-1 tells Excel to subtract one from our current Row number, and that’s the number of the last row with data in it.
Sub MissingData – Found the Last Row Note that a few lines up we set LastRow = 0. So until Excel finds the last row, that variable is still zero. The next If statement, If LastRow <> 0 Then, checks to see if the LastRow has been changed from that zero. If it has, then we have the last row in the variable and don’t need Excel to keep counting until row 10,000! So the Exit For bumps Excel out of the For/Next RowCounter loop and tells the program to proceed with the code after the Next RowCounter.
Sub MissingData – Missing data search The next two For/Next loops again tell Excel to first check every row, then check every column within that row. Note the For RowCounter=21 to LastRow now tells Excel not to count from 21 to 10,000 like before, but from 21 to whatever row number it figured out was the last row. The IF statement can be translated like this: “If the string value of the current cell (current row and column) is equal to whatever’s in the MissingValue variable, then set that cell to blank. That’s the whole code!
Sub RWG – More of what you’ve seen The Sub that creates the actual RWG’s starts out with a lot of what we’ve already seen. The first three If blocks check to make sure acceptable values are in the input blocks, and make sure a valid distribution has been selected from the drop-down box in H3. After that is a copy from the macro above of the code to determine where the last line of data is (this may at first seem inefficient, but it only takes a fraction of a second to run).
Sub RWG – Another For/Next Below that is the code that calculates the averages and variances for each column. Each column is an item. Since we’re doing the same thing with each column, we use another For/Next loop: For ItemCounter=1 To LastColumn counts from the first column to the last one. The next If statement block through the End If is a bit complicated – it takes the current column number (ItemCounter) and turns it into the corresponding letter (1=A, 2=B, etc.). This is necessary because we’ll need that letter for our formulas.
Sub RWG – Putting values in cells The next three lines put a header and two formulas in cells. The first line starts with Cells(12, ItemCounter+2) = will put whatever’s after the equal sign into the cell designated. The 12 says it’s row 12, the ItemCounter+2 says to take the number of the item and add 2 to get the column. For instance, when ItemCounter is 1, it will put the text in the third column (check the actual worksheet to see why) and the 12th row – that’s C12. = “Item #” & ItemCounter concatenates the text “Item #” with the value in the ItemCounter variable.
Sub RWG – Getting the average of variances Now that we have the variances for each item, we need the average of those variances to calculate an rwg statistic. There is one column for each item, starting with the third column, so the last column will be the number of items plus two (since it starts on the third column instead of the first). Thus, we need the letter reference for the total number of items plus two. That’s what the next block does. This is followed by the formula that averages the variances: “=AVERAGE(C14:” & ColumnLetter & “14)”
Sub RWG – Getting the expected variance The last thing you need to calculate the rwg statistic is an expected variance – what would the variance be expected to be in a random distribution? There are set recommended numbers for this based on how many alternatives are in each item, and what distribution is expected in your data. Both of these are variables that were input, stored in Alternatives and Distribution. Thus, the following If statements use the number of alternatives and expected distribution to put a number in the ExpectedVar variable (these numbers from LeBreton & Senter, 2008).
Sub RWG – The RWG formula itself Now all that’s left is to put the actual RWG formula in cell G16 (or Cells(16,7)). This formula has a lot of concatenation, using the variables that have been computed along the way to build a final formula. That’s it! The RWG is calculated!
A few more notes Macros are saved along with the file you’re working on. It’s best to save as a macro-enabled workbook (.xlsm) so that people know your macro is intended and legitimate (not a virus). If you don’t know the code for something you want to do, Record a Macro, do it, and then go look at the code. You can press the Visual Basic button on the Developer tab to go directly to your macro modules. All it takes is one misspelled word to get an error. If you get one and go to Debug mode, you’ll need to push the blue square (stop) at the top of the screen to be able to run macros again.
Thanks for listening! If you have more questions, Feel free to e-mail me at Jim.lemoine@scheller.gatech.edu