580 likes | 580 Views
Learn the best practices for creating and managing data and spreadsheet files, including data entry options, data manipulation techniques, and analysis workflows. Discover how to recognize inconsistencies in datasets, identify data entry tools, and implement validation measures for accurate data integration. Enhance your data organization skills and gain insights into creating valid and structured datasets.
E N D
Data Entry & Manipulation GEO 802, Data Information Literacy Winter 2019 – Lecture 3 Gary Seitz, MA
Lesson 3 Outline • Best practices for creating data & spreadsheet files Data entry options Data manipulation options Analysis and Workflows Luis Prado from The Noun Project
Learning Objectives • Recognize inconsistencies that can make a dataset difficult to understand and/or manipulate • Identify data entry tools • Identify validation measures that can be performed as data is entered • Review best practices for data integration • Describe the basic components of a relational database
Goals of data entry • Create data sets that are: • Valid • Organized to support ease of use CC image by Travis S on Flickr
Collecting data yourself This means gathering data and entering it into a database or a spreadsheet – whether you work alone or collaboratively
Structured data If you want your computer to process and analyse your data, a computer has to be able to read and process the data. This means it needs to be structured and in a machine-readable form.
Machine-readable data • is data (or metadata) which is in a format that can be understood by a computer. • Human-readable data that is marked up so that it can also be read by machines Examples: microformats, RDFa • Data file formats intended principally for machines (RDF, XML, JSON).
Unstructured data Unstructured has no fixed underlying structure. E.g. PDFs and scanned images may contain information which is pleasing to the human-eye as it is laid-out nicely, but they are not machine-readable.
Example: poor data entry From a small mammal trapping study: • Inconsistency between data collection events • Location of Date information • Inconsistent Date format • Column names • Order of columns
Example: poor data entry • Inconsistency between data collection events • Different site spellings, capitalization, spaces in site names—hard to filter • Codes used for site names for some data, but spelled out for others • Mean1 value is in Weight column • Text and numbers in same column – what is the mean of 12, “escaped < 15”, and 91?
European Spreadsheet Risks Interest Group (ESRIG) Read oneofthesestoriesandtellus, whatexactlywentwrong
Best practices • Columns of data are consistent: • only numbers, dates, or text • Consistent Names, Codes, Formats (date) used in each column • Data are all in one table, which is much easier for a statistical program to work with than multiple small tables which each require human intervention
Best practices • Create descriptive column names without spaces or special characters • Soil T30 Soil_Temp_30cm • Species-Code Species_Code • avoid using -,+,*,^ in column names. Some software may interpret these symbols as an operator • [REMINDER] -> Use a descriptive file name • For instance, a file named • ‘SEV_SmallMammalData_v.5.25.2010.csv’ • indicates the project the data is associated with (SEV), the theme of the data (SmallMammalData) and also when this version of the data was created (v.5.25.2010). This name is much more helpful than a file named mydata.xls.
Best practices • Missing data • Preferably leave field empty* (NULL = no value) • In numeric fields, use a distinct value such as 9999 to indicate a missing value • In text fields, use NA (“Not Applicable” or “Not Available”) • Use Data flags in a separate column to qualify missing value M1 = missing; no sample collected E1 = estimated from grab sample * This is totally up for debate, and is largely discipline- and software-specific.
Best practices • Enter complete lines of data Sorting an Excel file with empty cells is not a good idea!
5 problems: Headers should be in a single row. Do not embed charts, graphs, or images Do not leave empty rows or columns Do not leave empty cells Avoid the use of special characters
3 problems: Do not merge cells Do not use commas Do not use colored text or cell shading
2 problems: Do not mix data types in a single column Do not embed comments
Introduction to cleaning data Section1: Nuts and chewing gumlooksatthethewaydataispresented in spreadsheetsandhowitmightcauseerrors. Section2: The Invisible Man is in your spreadsheetisconcernedwiththeproblemsofwhitespacesand non-printablecharactersandhowtheyaffectourabilitytousethedata. Section3: Your data is a witch’s brewdealswithconsistency in dataentry, andhowtochoosetherightunitandformatfordata. Section4: Did you bring the wrong suitcase (again)?isaboutwheretoputdata, andhowtostructure it. Accompanyingthesesectionsis a step-by-step recipeforcleaning a dataset. This is an extensive, handbook-style resourcewhichwereferto in eachsection. Ittakes a setof ‘dirty’ dataandmovesitthroughthe different stepstomakeit ‘clean’.
References • Best Practices for Preparing Environmental Data Sets to Share and Archive. September 2010. Les A. Hook, Suresh K. SanthanaVannan, Tammy W. Beaty, Robert B. Cook, and Bruce E. Wilson. http://daac.ornl.gov/PI/BestPractices-2010.pdf
Types of “bad data” • Incorrect data • Inaccurate data • Business rule violations • Inconsistent data • Incomplete data • Nonintegrated data
Incorrect data • For data to be correct (valid), its values must adhere to its domain (valid values). • For example, a month must be in the range of 1–12, or a person’s age must be less than 130. Taken From: ADELMAN, S., ABAI, M., & MOSS, L. T. (2005). Data strategy [...] [...]. Upper Saddle River, NJ [u.a.], Addison-Wesley.
Inaccurate data • A data value can be correct without being accurate. For example, the city of London and web country code for France “.fr” are both accurate but when used together (such as London, France) the country is wrong because the city of London is not in France, and the accurate country code is “co.uk”
Nonintegrated data • Data that has been created separately & not with the intention of future integration. • E.g. customer data can exist on 2 or more outsourced systems under different customer numbers with different spellings of the customer name & even different phone numbers or addresses. Integrating data from such systems is a challenge.
Inconsistent data • Uncontrolled data redundancy results in inconsistencies. Every organization is plagued with redundant and inconsistent data. • For example names or places: “Smith, David” might also sit alongside “David Smith”. “London, UK” and “London, England”.
Incomplete data Data that might include elements such as Names, postal code, gender, age, AHV number might also only capture haphazardly elements such as ailment, GP nameor even incomplete date of birth.
Data entry tools • Google Forms • Spreadsheets • Surveys
What is a relational database? Sample sites samples Samples Species *siteID site_name latitude longitude description *sampleID siteID sample_date speciesID height flowering flag comments *sampleID siteID sample_date speciesID height flowering flag comments *speciesID species_name common_name family order A set of tables Relationships A command language
Database features:explicit control over data types • Advantages • quality control • performance
Relationships are defined between tables join Mix & match data on the fly
Structured Query Language (SQL) This table is called SoilTemp SQL examples: Select Date, Plot, Treatment, SensorDepth, Soil_Temperature from SoilTemp where Date = ‘2010-02-01’ Select * from SoilTemp where Treatment=‘N’ and SensorDepth=‘0’
If you want to try a database … • … consider trying one of these: FileMaker GUIs:
To learn more about designing a relational database • Database Design for Mere Mortals: A Hands-On Guide to Relational Database Design (3rd Edition) by Michael J. Hernandez. Addison-Wesley. 2013.
Conclusion • Be aware of best practices when designing data file structures • Choose a data entry method that allows some validation of data as it is entered • Consider investing time in learning how to use a database if datasets are large or complex CC image by fo.olon Flickr
Reproducibility • Reproducibility at core of scientific method • Complex process = more difficult to reproduce • Good documentation required for reproducibility • Metadata: data about data • Process metadata: data about process used to create, manipulate, and analyze data CC image by Richard Carter on Flickr
Ensuring reproducibility:documenting the process • Process metadata: Information about process (analysis, data organization, graphing) used to get to data outputs • Related concept: data provenance • Origins of data • Good provenance = able to follow data throughout entire life cycle • Allows for • Replication & reproducibility • Analysis for potential defects, errors in logic, statistical errors • Evaluation of hypotheses
Workflows: the basics • Precise description of scientific procedure • Conceptualized series of data ingestion, transformation, and analytical steps • Three components • Inputs: information or material required • Outputs: information or material produced & potentially used as input in other steps • Transformation rules/algorithms (e.g. analyses) • Two types: • Informal • Formal/Executable
Formal/executable workflows Benefits: • Single access point for multiple analyses across software packages • Keeps track of analysis and provenance: enables reproducibility • Each step & its parameters/requirements formally recorded • Workflow can be stored • Allows sharing and reuse of individual steps or overall workflow • Automate repetitive tasks • Use across different disciplines and groups • Can run analyses more quickly since not starting from scratch
Formal/executable workflows Example: Kepler Software • Open-source, free, cross-platform • Drag-and-drop interface for workflow construction • Steps (analyses, manipulations etc) in workflow represented by “actor” • Actors connect from a workflow • Possible applications • Theoretical models or observational analyses • Hierarchical modeling • Can have nested workflows • Can access data from web-based sources (e.g. databases) • Downloads and more information at kepler-project.org
Formal/executable workflows Example: Kepler Software Actors in workflow Drag & drop components from this list