Lab 4 MBAC 611 Portions of this lab are based on data & notes from Hadley Wickham (http://stat405.had.co.nz/)
Lab Preparation Create a lab4 folder in your private network folder. Download the data.zip file from Moodle and save it to your lab4 folder. From within Windows navigate to your lab4 folder. Right-click on the data.zip file and select “Extract All”. This should create a data folder that contains the datasets for this lab.
Data Cleaning Frequently the data sets you receive will not be in an ideal format for analysis. You may need to reformat the data and fix errors. In this lab we will look at some methods to find and fix problems in your dataset.
Setting Your Working Directory Start Mathematica. Use the SetDirectory[] function to set the working directory to the data folder in your lab4 folder. To make sure you are in the right folder, use the FileNames[] function. The output should list the files in the data folder, including test1.csv.
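A minimal sketch of the two calls this step describes. The folder path is a hypothetical placeholder, not a path given in the lab; substitute the location of your own lab4 data folder.

```mathematica
(* Hypothetical path: replace it with the location of your own lab4\data folder. *)
(* Backslashes inside a string must be doubled. *)
SetDirectory["H:\\lab4\\data"]

(* List the names of the files in the current working directory *)
FileNames[]
```

If the list returned by FileNames[] does not contain the lab's data files, the working directory is probably not the extracted data folder.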
Examining the Files Execute the following function: FilePrint["test1.csv"] This function displays the content of the specified file (test1.csv). You should see the following:
This file is a standard CSV file. Note that the first line contains the header. According to the header there are five fields. This means we expect each record (row) to contain five elements - each separated by a comma. However, we should verify that the data indeed conforms to this expected structure.
Import Options Mathematica allows you to customize how files are imported. One of these options allows you to automatically remove the header when importing data. Execute the following function: Import["test1.csv", "HeaderLines" ->1]
You should see the following output: The "HeaderLines" ->1 option tells Mathematica to ignore the first line, as it contains header information rather than data. The number following the -> indicates the number of rows that should be skipped when reading in the file.
Assignment #1 Assign the result of the previous Import function to a variable named test1.
Each record is stored in its own list. All imported records are members of one big list – a list we just assigned to a variable named test1. Therefore the number of elements in test1 indicates the number of records. As you will recall, we can determine the number of elements in a list using the Length function. Execute the following function: Length[test1]. The result should be 10.
Checking Records We should check that every record has the correct number of attributes (field values). In this case the correct number is five. Let’s check the first record. Enter the following expression: Length[test1[[1]]] The result should be 5.
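The two uses of Length can be tried on a small nested list first. The sample list below is made-up illustration data, not one of the lab files.

```mathematica
(* Made-up sample data: three records with two fields each *)
sample = {{"a", 1}, {"b", 2}, {"c", 3}};

Length[sample]        (* number of records: 3 *)
Length[sample[[1]]]   (* number of fields in the first record: 2 *)
```

The outer Length counts records, while Length applied to a single record (selected with [[ ]]) counts that record's fields.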
In the expression test1[[1]], test1 is the list name and [[1]] selects the first element of the list. We would like to do this check for each record. One way to do this is using the Do function we used in our previous lab. As you may recall, the Do function has the following syntax: Do[body, {i, min_value, max_value}]
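As a quick reminder of how this syntax behaves, here is a minimal sketch with an arbitrary body and range chosen only for illustration.

```mathematica
(* Evaluate the body once for each value of i from 1 to 3 *)
Do[Print["i is ", i], {i, 1, 3}]
```

This prints three lines, one per value of i. Do itself returns nothing, so any visible output has to come from the body, for example through Print.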
Defining A Function We will create our own function that will display an error message if a given row doesn’t have a specified number of attributes. Enter the following expression: rowCheck[x_]:=If[Length[test1[[x]]]!=5,Print["Record ",x]]
Define a new function named rowCheck. It takes one argument named x (written as x_ on the left-hand side of the definition). Use the If function to test the Length of the specified record. If the record length is not equal to (!=) 5, then display (Print) the number of the record (x). We can try our new function on the first row of test1. Enter the following expression: rowCheck[1] You won’t see any output, as the first record’s length is five. Likewise, nothing will be displayed if we try it on any of the other records in this dataset.
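The role of the underscore and of a one-branch If can be seen in isolation with a small sketch. The function name square is a made-up example and is not part of the lab.

```mathematica
(* x_ is a pattern that matches any single argument and names it x *)
square[x_] := x^2

square[3]   (* 9 *)

(* If with only a "true" branch does nothing when the test is False *)
If[Length[{1, 2, 3}] != 3, Print["unexpected length"]]
```

The last line produces no output because the test is False and no "else" branch was supplied, which is the same reason rowCheck stays silent for well-formed records.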
Assignment #2 Scroll back to where you defined the function and change the number 5 to 4 and re-evaluate the function definition (press shift-enter after making the change). You should see the following: Now execute the following expression: rowCheck[1] You should see the following output: Record 1 This perfectly matches the arguments of our Print function, so it looks like our function is working.
Do Loop We would like to execute the rowCheck function against every record in test1. We can accomplish this with a Do Loop. Assignment #3 Undo the change we made to the rowCheck function (change the 4 back to 5). Remember to re-evaluate the function.
Enter the following expression:
• Do[rowCheck[i],{i,1,Length[test1]}]
• You won’t see any output from this function execution because all rows have five attributes.
Assignment #4
• Import the CSV file named test2.csv.
• Assign the records list to the variable named test2.
• Make sure there are 10 records in the file.
• Modify the rowCheck function to check elements of test2.
• Using the Do function check that all records contain five attributes – records 3 and 6 do not.
• Display the third and sixth records (hint: test2[[3]] to view 3rd record).
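As an aside, the same kind of check can be written without a loop. The sketch below is an alternative technique, not the Do-based method this lab asks for, and it uses made-up sample data rather than the lab files.

```mathematica
(* Made-up sample data: the second record has only two fields *)
sample = {{1, 2, 3}, {4, 5}, {6, 7, 8}};

(* Keep the record numbers whose record does not have exactly 3 fields *)
Select[Range[Length[sample]], Length[sample[[#]]] != 3 &]
(* {2} *)
```

Unlike the Print-based approach, this returns the offending record numbers as a list, which can be assigned to a variable and reused.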
DAT Files Sometimes you have a file that has an unusual field or record delimiter. In this case we may simply refer to the file as a “dat” file, a generic file with tabular data. Mathematica allows you to customize how this type of file is read. More info on this type of file and its options can be found at the following URL: http://reference.wolfram.com/mathematica/ref/format/Table.html
We will take a look at a “.dat” file that uses the “|” (vertical bar) as a field separator. Enter the following expression: FilePrint["test3.dat"] You should see the following output: The first record contains the header. Note the | separator between fields.
Importing Dat Files Enter the following expression: Import["test3.dat","FieldSeparators"->"|"] The output should be a list of records, with the header line as the first record.
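To experiment with the separator option without touching a file, one possible sketch uses ImportString with the "Table" format on a made-up string. The field names and values below are invented for illustration and are not the contents of test3.dat.

```mathematica
(* Made-up two-line table using | as the field separator; \n ends the first line *)
text = "name|age\nAnn|31";

(* Read the string as a "Table", splitting fields on the vertical bar *)
ImportString[text, "Table", "FieldSeparators" -> "|"]
```

The result should be a list containing two records, one per line of the string, each split at the vertical bar.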
Assignment #5
• Modify the previous Import statement such that the header line is ignored (not imported into the data list).
• Assign the resulting list to the variable test3.
Assignment #6
• Import the file test4.csv such that the header line is ignored (not imported into the data list).
• Assign the resulting list to the variable test4.
• Rewrite the rowCheck function such that the record number will be displayed if the 3rd field of the record specified by x is greater than 20.
• Using the Do function and the above rowCheck function, check that the third field of every record is no greater than 20. The record number of any record violating this rule should be displayed.
• The output should be
Submit Lab 4 Save your lab as a notebook file. Remember to save the file to your lab4 folder in your private network folder. Submit the notebook to the lab4 submission link in Moodle.