470 likes | 567 Views
Addressing Student Difficulties in Using Numeric Data Downloaded from the Web Marilyn K. Pelosi & David L. Russell Western New England College Springfield, Massachusetts. Overview. Three Circumstances under which a student’s ability to download data can be complicated
E N D
Addressing Student Difficulties in Using Numeric Data Downloaded from the WebMarilyn K. Pelosi & David L. RussellWestern New England CollegeSpringfield, Massachusetts
Overview • Three Circumstances under which a student’s ability to download data can be complicated • Requires a level of skill beyond that of “casual student users” • Assumption is that the data is to be subsequently analyzed in a spreadsheet tool.
Steps a student must take • Find appropriate data using a search engine • Download the data in such a manner that they can be converted for analysis • Analyze the data correctly
Challenges • Some data are in a form that is not conducive to direct and effective use in a data analysis tool • Even data made available in a standard format cannot be used without editing • Some data are presented in non-tabular layout that causes their transformation to a standard file format to be very challenging • Data must frequently be downloaded in a text format which students then import the text format into the analysis tool.
Objectives • To describe situations in which some downloaded data is in a form that is not conducive to effective use • To address situations in which data to be downloaded is not available in any standard file format • To address situations in which data must frequently be downloaded in text format and the consequences of this action
Objectives (cont’d) • To address converting the text format into a usable file format • To address the data structures and editing actions required to make the converted data file usable
Three Scenarios • Scenario 1: Web sites which make data available in file formats that can be used directly by data analysis tools, the large majority of which are sites making files available in Excel’s native file format (*.xls) • Scenario 2: Web sites that make data available in text format • Scenario 3: Web sites whose data is not in tabular structure
Scenario 1: Data Available in Standard File Formats • Least challenging scenario • Often However … • Data are not in true tabular form or • Data contain control breaks in the form of sub-totals or total lines or • Data requires editing chores
Example of Scenario 1 • Education Professor has assigned students a project in which they must analyze per-pupil expenditures in MA • Investigate the relationship of per-pupil expenditures and population in the 351 cities and towns in MA.
Challenges to the Student- • Tab sheet is titled Per-Pupil Expenditures FY01 • Data is given for LEA’s (Local Education Agency) rather than by city/town • LEA’s are listed in rough alphabetic order but there are a number of exceptions • The LEA field holds code numbers with many gaps • Data contain several layers of summarization
Information Student Seeks • Per-pupil expenditure is on another worksheet • Labeled “incostlist” and titled FY 01 Integrated Cost Per Pupil • Integrated means the cost of multi-town regional school districts are spread or integrated across the constituent towns. • No indication of which LEA’s are composed of an individual city or town and which are part of a regional district
Integrating Population data • Import from another information source • Student finds this data in which 1999 estimated population data for the 351 cities and towns is listed
Further Challenges • Data are not fully tabular • Contain frequent summarization lines and frequent blank lines for readability • Data are sorted by county and sub-sorted by city or town (unlike the Dept of Education data) • Contain a plethora of data files which are not relevant to the investigation of per-pupil expenditures
An experienced Excel user would • Delete unneeded rows and columns • Make the reference to county a categorical variable • Resort the data by city or town • Copy the data to the “intcostlist” worksheet • Is it reasonable to expect a casual student user to be able to do this?
Here’s is what would need to be done: • 1. Using the Edit – Move or Copy Sheet function, create a temporary copy of the ‘State,County & MCD Totals’ worksheets; • 2. In this worksheet, manually eliminate all spacing lines, county headers, unneeded columns and summary data rows; • 3. Sort the data by city or town, so that they are in the same order as the ‘intcostlist’ worksheet. • 4. Copy these data onto a copy of the ‘intcostlist’ worksheet.
5. In the resulting merged worksheet, the city or town names from ‘intcostlist’ are in column B while the city or town names imported from MISER are in column G. 6. However, the values from the Department of Education and those from MISER often do not match. Thus, the student must create a logical test column (here, column J) to determine where there are any mismatches in the data 7. This logical test reveals an immediate problem: since the IF function’s logical test is case-specific, no values in the original data in column B (which is all in upper-case format, e.g., ‘ABINGTON’) match the values in column G (which is in title format, e.g., ‘Abington’). Therefore the student must modify the logical test in column J to force column G into upper case:
8. Unfortunately, the student soon discovers that even this will not work, because column B’s values contain literal blanks at the end of the entered values. For example the first value is actually 'ABINGTON' followed by 21 blanks. 9. This in turn requires a re-write of the logical test in column J to incorporate the length of the city or town name. For example, row 10 would be: 10. This results in 36 records that are “out of sequence”. These are dominated by sorting errors, that is, an error of incorrect placement of values under the collating sequence, indicating that one or both of the original tables was sorted incorrectly. An example of sorting errors can be seen in values 85-87 (spreadsheet lines 94-96):
Not a criticism of sites • Often not created with analysis in mind • Often originated as an internal document and thus there is no motivation to avoid internal jargon • Web site’s owner’s often have a strong desire to present data in a highly readable format - hence the blank lines, summarization's and other factors • Thus, even a site that offers data in Excel format cannot be used directly by a student
Scenario 2: Data Available in Text Format • Scenario 2a: Text Data available in Tabular Format • Scenario 2b: Data Available in Text Format But with a Layout that Inhibits Downloading
Sites that provide data in text format • 1. Capture the data by selecting and copy-pasting the data to an intermediate holding area such an Microsoft Office’s Notepad™; • 2. Advisedly, saving the data in Notepad’s native format, *.txt; • 3. Take advantage of the analysis tool’s text import functionality, such as Microsoft Excel’s Text Import Wizard; • 4. Edit the imported data as necessary; and • 5. Save the data in Excel’s native format, *.xls.
Example: Suppose a student taking a course on forecasting must analyze time series data. • Home page for the BLS is http://stats.bls.gov • Student selects “get detailed statistics” • For price data - CPI- Average Price Data • Suppose the student selects the price of a pound of bread in the Northeast
Scenario 2b: Data Available in Text Format but with a Layout that Inhibits Downloading • All of issues of Scenario 2a apply here • In addition, the numeric information posted to the Web is solely for the user to read, with no obvious intention of allowing or facilitating the user to download the data • Select carefully the portion of the data to be used on the screen • Copy it and then paste it to the worksheet.
Example - A professor has assigned a student to prepare a graphical representation of population flows in MA by county. Student finds data at the following site: http://eire.census.gov/popest/data/counties/tables/CO-EST2002-04-25.php
To Download these data: • Student must first highlight only the data portion of the table, avoiding header information (column headers, however, are acceptable) • Longer column headers are “wrapped” over several lines, so the first line should be edited to incorporate the full column name • This assumes that the student can: • 1. go into the subsequent lines; • 2. initiate an edit (F2); • 3. cut the data in the line; • 4. move to the first line; initiate an edit in that cell; and • 5. paste the data on the clipboard to the end of the data in the first line.
Handling of a calculated field • In this example: “Natural Increase (births- deaths) and “Net migration” (International + Internal) • These columns get downloaded as entered values • If the data they are based on gets updated, the calculated columns will not get updated and it is unlikely that a student would notice this. • Better if the two columns were eliminated and re-created by the student so they are hot linked to the data • Also note that the first data line is a sum of the detail lines below but is not labeled as such. Likely to through off analysis if not noticed.
Scenario 3: Data not in text format or tabular layout • Websites that have data available • Presentation of data is intended only for reading • Data are not in tabular form
Example: Professor has assigned a paper on the “Deaths Caused by Alcohol” • Student has found the Department of Justice site: http://www.ojp.usdoj.gov/bjs/pubalp2.htm • Selects a report on “Alcohol and Crime” • Selects .txt format
How will the student extract the needed data and move it to Excel for subsequent analysis?
Consider: Occurrence of Alcohol-related deaths • Data of interest in left most numeric column • Copy data to a text editor such as Notepad • Edit out extraneous data, if any • Saved as its own data file • Opened in Excel • Use the Text Import Wizard
Another example: Traffic fatalities and data related to alcohol involvement for the 50 states and DC • Presented in two columns
Challenge • Data file with four fields that is double horizontally • Must be downloaded as a data file of 8 fields • First task - save left hand data as a text file and import it to Excel • Space delimiter works best except for when there is a space in the state name such as New Hampshire
Next • Student must repeat the process for the 26 states to the right • Cut and paste these data to the bottom of the 25 states to the left of the table. • Integer field “Number of fatalities” is fine, but remaining two columns are %’s • Students must store %’s as proportions and then format them in percentage style - otherwise could make serious calculation errors
Implications for Faculty and Conclusions • Faculty, particularly outside IS, often assign students to “get data from the web” • Demonstrated that accomplishing this seemingly straightforward task can be quite challenging • Some faculty may not realize that this task represents more of a challenge to their students than they had in mind • Time spent importing and editing data prior to analysis reduces time spend on the assignment itself
Recommendation #1 • Professors make a habit to test how difficult it is to “get data from the web” • We suspect that many profs have limited experience in doing this • Performing the necessary downloads prior to giving the assignment will allow the prof to assess how realistic it is for students to download the data themselves
Recommendation #2 • Limit the number of assignments that require downloads to a few • After that, we recommend the prof do the downloading and make the data available to students
Implications for colleges and universities • Faculty must educate administrators of the difficulties students face when downloading data from the web • Administrators must make available to students support in using the web to find and download data
Implications for IS community • Data warehouse issues have thus far been taught conceptually • We predict that technical instruction will soon be available • When that occurs, IS faculty will face much the same difficulties