Importing Data. Excel. Text Data Parsing Scrubbing Data. June 21, 2012. Using “String” Functions to Scrub Data.
Importing Data Excel Text Data Parsing Scrubbing Data June 21, 2012
Using “String” Functions to Scrub Data
When importing data from an external source, keep in mind that the data may never have been scrubbed or cleaned to ensure it was keyed in a proper format.
GIGO: Garbage In, Garbage Out
For example:
Planning, 53
SuPPort, 95
JoHn SMITH
Mary Johnson
carol ennen
Smith, Larry
Importing Data Files
• Excel has gotten “smarter” in its ability to open data files that are “Delimited” or where fields are defined in fixed positions.
• It “knows” that leading spaces in front of text and numbers are probably not correct and strips them out.
• If you OPEN a TEXT file in Excel, it will start the Text Import Wizard.
• You can override the field parsing by telling the Wizard the data is Fixed Width and then defining a single field that spans the entire length of the data.
• You are prompted for where you want to import the data to and can also change some settings and attributes.
EXERCISE: Problem Statement: IT_SKILLS_STEPS
You are to create an Excel Chart from the data on the following slide. You could CUT and Paste this data into Cell A1 of a Spreadsheet or use Excel to open the file using the [DATA] tab and {Get External Data}. For the first example, I don’t want to use the Text Import Wizard. The data is also saved as a .TXT file called: IT_Skills_Data.txt
There are 4 “problems” with this data. The “correct” format has ONE space after the comma (,). The exercise assumes there is a comma after the description and before the number (but this is also a possible data error that might need to be checked in future problems)!
You will use String and Text Functions: FIND, MID, VALUE, LEN
Files: IT_Skills_Steps, IT_Skills_Data.TXT
IT_Skills_Data.TXT
Project / Program Management, 60%
Business Process Management, 55%
Business Analysis,53%
Application Development,52%
Database Management, 49%
Security, 42%
Enterprise Architect, 41%
Strategist/Internal Consultant, 40%
Systems Analyst, 39%
Web Services,33%
Help desk / User Support, 32%
Networking, 30%
Website Development, 30%
QA/Testing, 28%
IT Finance, 28%
Vendor Management / Procurement, 27%
IT - HR,21%
Other, 3%
STRING and TEXT FUNCTIONS
LEN(String): Return the number of characters (length) in a string.
FIND(Target, InString [,StartPos]): Look for Target in a string of characters, starting at the optional StartPos, and return the position where it was found.
MID(String, StartAt, #_of_Characters): Take a String of characters, begin at the StartAt position, and extract #_of_Characters. (The #_of_Characters may be larger than what is actually there.)
VALUE(String): Take a string of digits and convert it into a number.
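For readers who think in code, the four functions above can be sketched in Python. This is only a rough illustration, not how Excel implements them: the `excel_*` names are hypothetical helpers, positions are 1-based to match the worksheet functions, and `excel_value` handles only plain digits with an optional percent sign.

```python
def excel_len(text: str) -> int:
    """LEN: number of characters in the string."""
    return len(text)


def excel_find(target: str, text: str, start_pos: int = 1) -> int:
    """FIND: 1-based position of target in text, searching from start_pos."""
    index = text.find(target, start_pos - 1)
    if index == -1:
        # Excel would show #VALUE! here; this sketch raises instead.
        raise ValueError(f"{target!r} not found in {text!r}")
    return index + 1


def excel_mid(text: str, start_at: int, num_chars: int) -> str:
    """MID: num_chars characters beginning at the 1-based start_at position."""
    begin = start_at - 1
    # Slicing past the end is harmless, just as MID accepts an oversized count.
    return text[begin:begin + num_chars]


def excel_value(text: str) -> float:
    """VALUE (simplified): digits, optionally followed by %, to a number."""
    cleaned = text.strip()
    if cleaned.endswith("%"):
        return float(cleaned[:-1]) / 100
    return float(cleaned)


line = "Security, 42%"
comma = excel_find(",", line)
number_text = excel_mid(line, comma + 2, excel_len(line))
print(comma, repr(number_text), excel_value(number_text))
```

Passing `excel_len(line)` as the character count mirrors the trick used in the exercise: asking MID for more characters than remain simply returns everything to the end of the string.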
STEPS TO PARSE DATA
1) Cut the text from: IT_SKILLS_Data.txt
2) Open Excel and Paste the data into cell: A2
3) In A1, Type the column heading: IT SKILL
4) In B1, Type: COMMA (This is a temporary value to be used in Step #8)
5) In B2, Type the formula: =FIND(",",A2)
6) Copy the formula in B2 down thru B19
7) In C1, Type: VALUE
8) In C2, Type: =VALUE(MID(A2,B2+2,LEN(A2)))
9) Format column C as Percent (%)
10) Copy the formula in C2 down thru C19
11) Warning: THERE ARE 4 PROBLEMS WITH THE DATA THAT WILL NEED TO BE FIXED! The data should be in descending order. Can you find and correct the data in column A so the values in column C are correct?
12) Hide column B
13) Highlight data in the range: A2:C19
14) [Insert] a <BAR> Chart
15) Adjust the size of the graph to show all descriptions
16) Add Titles and format the Chart to be pretty!
You might want to redo the assignment and, when you get to step 11, rather than "FIX" the data by hand, get fancy with nested functions. What if there were 1000’s of data lines? It is not efficient to manually change and update data. WORK SMART / NOT HARD
The nested set of functions that fixes the errors and does everything in one step:
In D2: =VALUE(TRIM(MID(A2,FIND(",",A2)+1,LEN(A2))))
Format the cell as Percent and copy it down to D19. Compare columns C and D.
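The difference between the step 8 formula and the nested TRIM formula can be seen in a small Python sketch. The function names are hypothetical, and the sample lines are taken from IT_Skills_Data.TXT; the first function assumes exactly one space after the comma, while the second starts right after the comma and trims whatever spaces it finds.

```python
SAMPLE_LINES = [
    "Project / Program Management, 60%",
    "Business Analysis,53%",   # no space after the comma
    "Database Management, 49%",
    "IT - HR,21%",             # no space after the comma
]


def percent_text_to_number(text: str) -> float:
    """Convert text such as '53%' to 0.53."""
    return float(text.rstrip("%")) / 100


def comma_index(line: str) -> int:
    """0-based index of the first comma; a missing comma is a data error."""
    index = line.find(",")
    if index == -1:
        raise ValueError(f"no comma in {line!r}")
    return index


def parse_fixed_offset(line: str) -> float:
    """Like =VALUE(MID(A2,B2+2,LEN(A2))): skip the comma plus one character."""
    return percent_text_to_number(line[comma_index(line) + 2:])


def parse_with_trim(line: str) -> float:
    """Like =VALUE(TRIM(MID(A2,FIND(",",A2)+1,LEN(A2)))): trim after the comma."""
    return percent_text_to_number(line[comma_index(line) + 1:].strip())


for line in SAMPLE_LINES:
    print(f"{line!r}: fixed offset {parse_fixed_offset(line):.0%}, "
          f"with trim {parse_with_trim(line):.0%}")
```

On the lines with no space after the comma, the fixed-offset version drops the first digit of the number, which is why the values in column C fall out of descending order.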
Exercise: Data Parsing and Scrubbing
The completed workbook exercise is called: IT_Skills
There are 3 sheets, {Raw Data}, {Parse} and {Final}, that show the solution at various stages.
THERE IS A 2nd data file called IT_Skills_Data2 with different data that can be brought into the workbook to test your process.
Exercise: Use the [Data] {Get External Data}
This is the same example, but rather than CUT / PASTE the text, it uses the FROM TEXT option to get the data.
{Get External Data} FROM TEXT
An OPEN file dialog box will be presented that shows only .TXT files for selection. After you select the file, sample records are displayed so you can apply the parse pattern. You can also specify what ROW you want to start importing from.
Since the data fields are delimited by a comma, select the COMMA option and notice the PARSE line. Then press <Next>.
CSV: Comma Separated Values
CDF: Comma Delimited Files
You have the Option to modify the DATA TYPE for different fields: General, Text, Date, or even to Omit a column from import.
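Outside Excel, the same wizard choices (comma delimiter, starting row, a data type per column) can be sketched with Python's standard `csv` module. This is an illustration under assumptions, not what the wizard does internally: `import_comma_delimited` and `start_row` are hypothetical names, and the sample text is three lines from IT_Skills_Data.TXT held in memory rather than read from the file.

```python
import csv
import io

SAMPLE_TEXT = """Security, 42%
Web Services,33%
Help desk / User Support, 32%
"""


def import_comma_delimited(text: str, start_row: int = 1):
    """Split each line on commas and convert the second field to a number.

    start_row is the 1-based row to begin importing from.
    """
    # skipinitialspace drops spaces that immediately follow the delimiter.
    reader = csv.reader(io.StringIO(text), delimiter=",", skipinitialspace=True)
    rows = []
    for row_number, fields in enumerate(reader, start=1):
        if row_number < start_row or not fields:
            continue  # skip rows before the start row and blank lines
        skill = fields[0].strip()                       # kept as text
        share = float(fields[1].strip().rstrip("%")) / 100  # stored as a number
        rows.append((skill, share))
    return rows


for skill, share in import_comma_delimited(SAMPLE_TEXT):
    print(f"{skill}: {share:.0%}")
```

Because the delimiter handling discards the space after the comma, the lines with and without that space import the same way, which is one reason the wizard route avoids the four data problems in the manual exercise.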
The final option is to specify WHERE to import the data in the workbook. Since we “know” we want to add headers to the data, place your cursor in A2. There is also an advanced {PROPERTIES} tab where you can specify other attributes of the import.
Excel will even “remember” the attributes and dialog steps you just completed so the next time you select a file, it will apply the same steps to parse the data. Use the {DATA Refresh} option to specify a new file to load. Another TEXT file to load is called: IT_Skills_Data2.txt
Get Data FROM EXISTING Connections
You may have a workbook that pulls in data from another source to update a Chart, or you may want to do something else with it: Sort, Filter, Report, Summarize, etc.
The example Get_Student_Grades is a workbook that LINKS and LOADS the StudentGrades workbook.
ANY CHANGES MADE WILL NOT BE MADE IN THE ORIGINAL FILE
Get Data From a Website
It is also possible to link your spreadsheet to data that is saved on a website.
The FileName IS CASE SENSITIVE
There is an Excel Spreadsheet saved as a Web Page (.htm) called: http://www.tomboulian.info/Names_As_Web.htm
You specify the selection of data you want to bring in by clicking on the Yellow Arrow Tab.
YOU CAN TRY LINKING TO OTHER SITES TO IMPORT DATA FROM
Exercise: Data Parsing and Scrubbing
The completed exercise is called: Parse_Names
You will use String and Text Functions: &, PROPER, TRIM, LEFT, FIND, MID
Data file: Bad_Data_Names_2
STRING and TEXT FUNCTIONS
&: Used to concatenate (JOIN) strings of text together.
PROPER(String): Convert the 1st letter of each word to a capital letter and all the remaining letters in the word to lower case.
TRIM(String): Remove leading and trailing spaces and reduce any run of spaces inside the string to a single space.
LEFT(String, #_of_Characters): Take the leftmost #_of_Characters from a String. It is like the MID function but always starts at the first position.
MID(String, StartAt, #_of_Char): Take a String of characters, begin at the StartAt position, and extract #_of_Char. (The #_of_Char may be larger than what is actually there.)
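How these functions combine to scrub a name can be sketched in Python. The helper names are hypothetical, and the target format of "First Last" is an assumption made for illustration; the completed Parse_Names workbook may arrange the names differently. The sample names come from the bad-data example earlier in the deck, with extra spaces added to the first one.

```python
def excel_trim(text: str) -> str:
    """TRIM: drop leading/trailing spaces and collapse runs of spaces."""
    return " ".join(part for part in text.split(" ") if part)


def scrub_name(raw: str) -> str:
    """Return a name as 'First Last' in proper case (assumed target format)."""
    name = excel_trim(raw)
    comma = name.find(",")              # FIND: locate the comma, if any
    if comma != -1:
        last = name[:comma]             # LEFT: everything before the comma
        first = name[comma + 1:]        # MID: everything after the comma
        name = excel_trim(first + " " + last)   # &: join the pieces back
    # str.title() is a close stand-in for PROPER on simple names.
    return name.title()


for raw in ["  JoHn   SMITH ", "Mary Johnson", "carol ennen", "Smith, Larry"]:
    print(repr(raw), "->", repr(scrub_name(raw)))
```

Trimming first and again after re-joining means the result does not depend on how many spaces were keyed around the comma.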
Importing Data Excel Text Data Parsing Scrubbing Data
End of Section