
FIELD DATA COLLECTION and ANALYSIS Dr. Richard Gilbert RESEARCH TRAINING WORKSHOP


Presentation Transcript


  1. FIELD DATA COLLECTION and ANALYSIS Dr. Richard Gilbert RESEARCH TRAINING WORKSHOP

  2. Things to Consider in Planning Data Analysis
     1. Your Research Proposal
        a) What were your original research objectives?
        b) Are these objectives still appropriate, or do you need to modify them?
     2. Your Target Audience
        a) Who is the audience for your analysis?
           • Academic faculty?
           • Policy makers?
           • Another client?
           • Multiple audiences?
        b) What are the expectations of your target audience regarding the type of analysis?

  3. 3. Your Data
     What type of analysis is possible with the data you have collected?
        a) Sample size
           • Few vs. many cases?
        b) Measurement level
           • Nominal (categorical)?
           • Ordinal (Likert scales)?
           • Scale (continuous numeric data)?
        c) Data level
           • Household vs. variety level?
     4. Your Statistical Expertise
        a) Novice?
        b) Expert?

  4. B. Statistics, Data & Analysis
     1. Role of Statistics
        • Summarize data (descriptive statistics)
        • Reveal relationships (measures of association)
     2. Classes of Statistics
        • Univariate: one variable (e.g., mean, median, mode)
        • Bivariate: two variables (e.g., Chi-square, correlation analysis)
        • Multivariate: several variables, potential “power”? (e.g., regression, logit analysis)

  5. 3. Types of Data (measurement level, SPSS)
     • Nominal data: data values represent categories with no intrinsic order (e.g., gender, types of income sources)
     • Ordinal data: data values represent categories with some intrinsic order (e.g., Likert, rank-order scales)
     • Scale data: data values are continuous numeric values on an interval or ratio scale (e.g., age, income, yield)
     Note: You can transform scale data to categorical data, but categorical data can’t be transformed to continuous data. What are the implications for data collection?

  6. 4. Types of Analysis (by data type)
     a) Descriptive Analysis
        1) Nominal/categorical data
           • Frequency tables: describe data distributions with numbers or percents
           • SPSS reports: data categories, numbers of observations, & percents (total, adjusted, cumulative)
           • May also report data as: histograms, horizontal bar charts, pie charts
           • Limit number of data categories to < 10

  7. Descriptive Analysis (cont)
     2) Scale/continuous numeric data
        Measures of central tendency
        • “Mean” is the average case (arithmetic average)
           - Not valid for nominal/categorical data
           - Not usually used for ordinal data (i.e., can’t assume equal distance between items)
           - Very sensitive to the distribution of scale data (e.g., skewed data, outliers)
        • “Median” is the middle case
           - Use if scale data are asymmetric
           - Use for ordinal data
        • “Mode” is the most common data value
           - Is the only valid indicator of central tendency for nominal/categorical data
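
To illustrate how the three measures of central tendency behave, here is a minimal Python sketch using the standard `statistics` module. The yield figures and income-source labels are hypothetical values made up for the illustration; they are not workshop data.

```python
import statistics

# Hypothetical yields (kg/ha) for nine households; the last value is an outlier.
yields = [400, 450, 500, 500, 520, 550, 600, 650, 2500]

mean_yield = statistics.mean(yields)      # arithmetic average, pulled up by the outlier
median_yield = statistics.median(yields)  # middle case, unaffected by the outlier
mode_yield = statistics.mode(yields)      # most common value

# Hypothetical nominal variable: only the mode is meaningful here.
income_sources = ["farm", "trade", "farm", "wage", "farm", "trade"]
mode_source = statistics.mode(income_sources)

print(f"mean={mean_yield:.1f} median={median_yield} mode={mode_yield}")
print(f"most common income source: {mode_source}")
```

In this made-up sample the single extreme case lifts the mean well above the median, which is why the median is usually the safer summary for asymmetric data.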

  8. Descriptive Analysis (cont)
     2) Scale/continuous numeric data (cont)
        Measures of Dispersion/Spread
        • Minimum is the lowest value
        • Maximum is the highest value
        • Range is the interval between the highest & lowest values
        • Standard deviation (SD) indicates the percent of cases in certain ranges (if data are normally distributed, about 68% of cases fall within 1 SD of the mean and about 95% within 2 SDs)
        Shape of the Distribution (for scale data)
        • “Skewness” shows degree & direction of asymmetry
           - Symmetrical, coefficient = 0
           - Skewed left (long tail to the left), coefficient = negative
           - Skewed right (long tail to the right), coefficient = positive

  9. Shape of the Distribution (for scale data) (cont)
     • “Kurtosis” measures peakedness of the distribution
        - Same as normal distribution, coefficient = 0
        - Very peaked, coefficient = positive
        - Very flat, coefficient = negative
     Note: If the skewness or kurtosis value is not close to 0:
        - Mean isn’t an appropriate measure of central tendency
        - Standard deviation isn’t an accurate measure of dispersion
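
The sign conventions for the shape coefficients can be checked with a short Python sketch. It uses the simple moment-based formulas, which is an assumption made for brevity: SPSS applies small-sample adjustments, so its reported values will differ somewhat, although the signs agree. The data lists are hypothetical.

```python
def central_moment(values, order):
    """Average of (value - mean) raised to the given power."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)

def skewness(values):
    """Moment-based skewness: 0 if symmetric, > 0 if the long tail is on the right."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

# Hypothetical samples.
symmetric = [1, 2, 3, 4, 5]
right_tail = [1, 2, 2, 3, 10]   # a few very high values
left_tail = [1, 8, 9, 9, 10]    # a few very low values

for name, data in [("symmetric", symmetric), ("right tail", right_tail), ("left tail", left_tail)]:
    print(f"{name}: skewness={skewness(data):.2f}, excess kurtosis={excess_kurtosis(data):.2f}")
```

The evenly spread `symmetric` list also shows a negative kurtosis coefficient, which matches the "very flat" case.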

  10. 4. Types of Analysis (by data type) (cont)
      b) Analysis of the Relationship/Association Between Variables
         Question: Do pairs of variables move together or are they independent?
         • Bivariate analysis does not require you to assume/identify a dependent/independent variable
         • Multivariate analysis assesses the relationship between a dependent variable & independent variables
            - “Dependent variable”: variable being affected
            - “Independent variable”: variable(s) affecting the dependent variable
         • Correlation, not causation: statistics that measure association do not indicate causation; only theory implies causation

  11. Choice of appropriate statistic to assess relationships depends on:
      • Type of variables: nominal (categorical) or scale (continuous)
      • Which variable is independent and which is dependent
      Considerations in Choosing a Statistical Method
      A Guide for Selecting Statistical Techniques for Social Science Analysis: Andrews,

  12. Strategies for Analyzing Survey Data
      1. Review your research objectives, hypotheses, and questionnaire
      2. Develop a tentative report outline (analytical plan)
      3. Use descriptive statistics to explore your data
         • Frequencies, mean, median, mode, standard deviation, skewness
         • What sub-group comparisons are possible?
         • What associations can you assess with the data?
      4. Revise your analytical plan, based on your new knowledge of the characteristics of the data
      5. Finally, use bivariate/multivariate statistics to assess relationships/associations

  13. Strategies & Considerations in Using Statistics
      Begin with simple descriptive analysis, then look for associations to “explain” relationships
      1. Describe the Variables
         a) Nominal/categorical variables
            1) Strategies to consider:
               • First run frequencies/percents (Figure)
               • Recode (combine) categories if percentages are too low (too many categories)
               • May want to regroup categories with few cases into “other”
               • But keep original variables with original codes in an archive file, or rename the variable before recoding

  14. Recode (combine) categories (cont)
               • Recode large categories of “other” to specific categories
               • Recode continuous data into a few groups (e.g., recode continuous variable “education” to: 0-11, 12, 13-15, 16 or more; or attitude scale (1-5) to 1-2, 3, 4-5)
               • Review the frequency distribution to decide how to regroup data (e.g., first ½ = low, second ½ = high; first 1/3 = low, second 1/3 = medium, third 1/3 = high)
               • After recoding data, update variable values information for new/recoded variables
            2) Statistics:
               • Mode is the appropriate statistic for assessing central tendency
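
The recoding strategy can be sketched outside SPSS as well. The Python example below is illustrative only: the household records, the variable names `educ` and `r_educ`, and the decision to place exactly 16 years in the top group are all assumptions made for the sketch. It writes the grouped values to a new variable so the original codes are preserved.

```python
def recode_education(years):
    """Map whole years of schooling to four groups (cut points taken from the example in the text)."""
    if years <= 11:
        return "0-11"
    if years == 12:
        return "12"
    if years <= 15:
        return "13-15"
    return "16+"  # assumption: exactly 16 years belongs in the top group

# Hypothetical household records.
households = [
    {"id": 1, "educ": 6},
    {"id": 2, "educ": 12},
    {"id": 3, "educ": 14},
    {"id": 4, "educ": 16},
    {"id": 5, "educ": 9},
    {"id": 6, "educ": 12},
]

# Add a new recoded variable; the original "educ" values stay untouched.
for record in households:
    record["r_educ"] = recode_education(record["educ"])

# Frequency table of the recoded variable.
frequencies = {}
for record in households:
    frequencies[record["r_educ"]] = frequencies.get(record["r_educ"], 0) + 1
print(frequencies)
```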

  15. b) Scale (interval/ratio, continuous data) variables
         1) Strategies to consider:
            • Run mean, mode, median, range, skewness, kurtosis, and standard deviation
            • Then look for outliers; assess the normal distribution assumption
         2) Statistics:
            • If data ARE approximately normally distributed: present mean, mode, median
            • If data are NOT approximately normally distributed: recode to categorical data and present the spread in a frequency table

  16. 2. Looking for Relationships: Statistical Inference
      Making inferences about population parameters from sample statistics (requires random sampling)
      a) Some Concepts
         1) Standard Error of Estimate
            Background
            • We sample from a population to generate sample statistics to estimate unknown population parameters.
            • Different samples will give different estimates.
            • The theoretical distribution of all possible values of a statistic obtained from a population is the sampling distribution of the statistic.
            • The mean of the sampling distribution is the expected value of the statistic.
            • The standard deviation of the sampling distribution is the standard error (SE).
            • When we estimate the SE of the mean from a single sample: $SE_{\bar{x}} = SD / \sqrt{N}$

  17. SE of mean (SPSS descriptive statistics option) indicates how close/far the sample mean is likely to be from the population mean
      • For means of interval/ratio data & percentages, report the SE and the margin of error (ME), which is a multiple of the SE
         - At 99% CI, ME = 2.58 SE
         - At 95% CI, ME = 1.96 SE (about 2 SE)
         - At 90% CI, ME = 1.65 SE
      Sample Size and Data Distribution: A Caution
      • If the sample is large, the sampling distribution of the sample mean is approximately normal, even if the population is not normally distributed.
      • If the sample is small and the population is not normal, the sampling distribution of the mean won’t be normal, limiting statistical inference (e.g., use non-parametric statistics)
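
The standard error and margin of error can be computed in a few lines. The Python sketch below assumes the large-sample normal approximation for the multipliers (for small samples a t-distribution value would be more appropriate), and the function names and sample values are hypothetical.

```python
import math
import statistics
from statistics import NormalDist

def standard_error(values):
    """SE of the mean: sample standard deviation divided by the square root of N."""
    return statistics.stdev(values) / math.sqrt(len(values))

def z_multiplier(confidence):
    """Two-sided normal critical value for a confidence level such as 0.95."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

def margin_of_error(values, confidence=0.95):
    """ME is a multiple of the SE; the multiple depends on the confidence level."""
    return z_multiplier(confidence) * standard_error(values)

def confidence_interval(values, confidence=0.95):
    """Range around the sample mean: mean minus ME to mean plus ME."""
    mean = statistics.mean(values)
    me = margin_of_error(values, confidence)
    return mean - me, mean + me

# Hypothetical sample of five observations.
sample = [10, 12, 14, 16, 18]
low, high = confidence_interval(sample, 0.95)
print(f"SE={standard_error(sample):.3f}, 95% CI=({low:.2f}, {high:.2f})")
for level in (0.99, 0.95, 0.90):
    print(f"{level:.0%} multiplier: {z_multiplier(level):.3f}")
```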

  18. 2) Confidence Interval (CI)
      • A range around the sample mean, based on the SE (i.e., 95% CI is a range of about +/- 2 SEs)
      • SE and CI indicate reliability of a statistic
      b) Statistical Significance
      • These statistics all show the degree of association & statistical significance (non-significance)
      • Significance indicates the probability that a relationship this strong appears in the sample if it doesn’t exist in the population (e.g., 1% probability that you reject a true H0)
      • Alpha/critical level of probability for acceptance is determined by the researcher/sponsor

  19. Traditional significance levels of 99%/95% (alpha = 0.01/0.05) are conventions, not absolutes. Must consider the consequence of accepting a false result as true
      Example
      • A traditional variety yields 500 kg/ha & a modern variety (MV) yields 800 kg/ha, but the difference is only significant at the 80% level. Each variety costs the same. Would you plant the MV?
      • It’s often more informative to report the level at which your results are significant, rather than simply saying they are non-significant (e.g., The means are significantly different at the 88% level)
      Lack of statistical significance may be due to the fact that:
      • No relationship exists
      • Non-sampling error was large, so data are not accurate
      • The sample size is small, so the SE is large

  20. Statistical significance does not mean importance!
      • The importance of a result is a function of the size of the coefficient & the meaning that the variables/relationships imply.
      • Statistical results are either significant or non-significant, not insignificant.
      • A result may be “statistically significant”, but still insignificant (i.e., very small, and thus not important)
      • Even if the differences in the numerical values are large (e.g., mean yields of 500 kg/ha vs. 1,000 kg/ha), if the relationship is non-significant, the data do not show that the values really differ. So, don’t emphasize the magnitude of a non-significant difference when reporting your results.

  21. c) Measures of Association Used to Analyze Survey Data
      1) Cross-tabulation (Chi-square analysis, χ²)
         Objective
         • To test if the distribution of one variable differs significantly across values of the other variable
         Data Requirements
         • Both variables must be categorical (i.e., nominal, ordinal)
         • But you can convert scale data (i.e., interval, ratio) variables to categorical variables
         • Don’t need to assume the data are normally distributed
         • Don’t need to identify a dependent/independent variable
         Most common measure of association for survey variables

  22. Caution
      • The χ² statistic is invalid if the expected frequency in a cell is < 5, but SPSS will still report a χ² value
      • The cell in a cross-tab table with the smallest expected frequency (not the actual frequency) is the one in the row with the smallest row total & the column with the smallest column total
      • To estimate the expected cell frequency, divide the smallest row total in the cross-tab table by N & multiply this number by the smallest column total.
      Suggestions
      • Which variable you make the row/column variable is not critical
      • It’s confusing to interpret the results if you request both column & row percents, so request only column percents
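
The expected-frequency check described above can be worked through in code. The following Python sketch computes the expected cell frequencies, the smallest expected frequency, and the chi-square statistic for a cross-tab; the 2 x 2 table of counts is hypothetical, and the probability level is left to SPSS or a chi-square table because the sketch sticks to the standard library.

```python
def totals(table):
    """Return (row totals, column totals, N) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return row_totals, col_totals, sum(row_totals)

def expected_frequencies(table):
    """Expected count per cell if the two variables were independent."""
    row_totals, col_totals, n = totals(table)
    return [[r * c / n for c in col_totals] for r in row_totals]

def smallest_expected(table):
    """Smallest row total divided by N, multiplied by the smallest column total."""
    row_totals, col_totals, n = totals(table)
    return min(row_totals) / n * min(col_totals)

def chi_square(table):
    """Sum over cells of (observed - expected)^2 / expected."""
    expected = expected_frequencies(table)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )

# Hypothetical counts: rows = adopter / non-adopter, columns = low / high education.
table = [[10, 30],
         [20, 40]]

print("expected:", expected_frequencies(table))
print("smallest expected frequency:", smallest_expected(table))
print(f"chi-square: {chi_square(table):.3f}")
```

If `smallest_expected(table)` came out below 5, that would be the signal to recode into fewer or more evenly sized categories before trusting the statistic.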

  23. Suggestions (cont)
      • If N is small (< 200?), construct cross-tab tables with 3 or fewer categories/variable
      • If N is very small (< 100?), use the results in the cross-tab table to estimate the “expected frequency”. If the expected value is < 5, recode the data into fewer/more equal-size groups to increase the expected value
      Statistics
      • χ² statistic (larger is better) & the probability level (smaller is better)
      • In the text, report the direction of the observed relationship & probability level (in parentheses) [e.g., χ² analysis indicates a significant (95% level) negative relationship between age & education]
      • In the table, report cross-tab results, χ² statistic & the probability level

  24. 2) Analysis of Variance (one-way)
      Objective
      • Determine if the mean values of the dependent variable are significantly different for each category of the independent variable (T-test is a special case)
      Data Requirements
      • Must identify independent & dependent variables
      • Independent variable: categorical data with 2 or more categories (e.g., 2 or more varieties)
      • Dependent variable: interval/ratio scale (continuous) data (e.g., yield of several varieties)
      • Each case of the dependent variable must be independent of the others
      Caution
      • Spread of data points (i.e., variance) in the dependent variable must be similar for each data category & the data must be normally distributed

  25. Suggestions
      • Test for homogeneity of variances
      • Don’t use ANOVA if variances are very different or sample sizes of groups differ greatly
      Statistics:
      • F-test evaluates significance (i.e., H0 that all means are equal)
      • Multiple comparisons test (Scheffe) indicates if individual means are different (pairwise comparisons)
      • In text, report direction of the relationship, significantly different means & F-test statistic [e.g., ANOVA indicates the mean yields of variety A (845 kg/ha) & B (933 kg/ha) are significantly (95% level) higher than the yield of variety C (534 kg/ha), with an F-value of 6.75]
      • In tables, report group means, F-test (probability level for the ANOVA) & the multiple comparison test (Scheffe) results
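
For readers who want to see where the F-value comes from, here is a minimal Python sketch of the one-way ANOVA calculation: between-group variance divided by within-group variance. The three groups of plot yields are hypothetical, and the sketch stops at the F-value and degrees of freedom; the probability level and the Scheffe comparisons would still come from SPSS or statistical tables.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    all_values = [v for group in groups for v in group]
    n = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of individual cases around their own group mean.
    ss_within = sum(
        (v - mean) ** 2
        for group, mean in zip(groups, group_means)
        for v in group
    )

    df_between = k - 1
    df_within = n - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical yields (kg/ha) on three plots for each of three varieties.
variety_a = [800, 850, 900]
variety_b = [900, 950, 1000]
variety_c = [500, 550, 600]

f_value, df_between, df_within = one_way_anova_f([variety_a, variety_b, variety_c])
print(f"F({df_between}, {df_within}) = {f_value:.2f}")
```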

  26. 3) Correlation Analysis
      Objective
      • Measures the degree to which 2 continuous variables move together from one case to another
      Data Requirements
      • Both variables must be interval/ratio scale (continuous) or ordinal data
      • Don’t need to identify a dependent/independent variable
      Suggestions
      • Run correlations to explore potential relationships
      Statistics
      • Different types of data require different statistics
      • For interval/ratio scale data, use Pearson’s product moment correlation
      • For ordinal data, use Spearman rank correlation

  27. Correlation coefficient (r) indicates strength of relationship & ranges from -1 to +1 (0 = no linear relationship)
      • Sign indicates direction of relationship
         - Sign positive (+), direct
         - Sign negative (-), inverse
      • Coefficient of determination (r²) indicates the percent of shared variance
      • In text, report direction of the relationship (positive/negative), correlation coefficients (r) & r² [e.g., Correlation analysis indicated that yield & N-fertilizer rates are positively correlated (r = 0.79), with an r² of 0.62]
      • In table, report correlation coefficients (r), signs, probability level, r²
      • May present several variables/correlations in matrix format, often included as an appendix
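
Both correlation statistics can be illustrated with a short Python sketch. Spearman's coefficient is computed here as Pearson's coefficient applied to ranks, with tied values sharing an average rank. The fertilizer and yield figures are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson product moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values receive the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical N-fertilizer rates (kg/ha) and yields (kg/ha) for five plots.
fertilizer = [0, 20, 40, 60, 80]
plot_yield = [500, 620, 700, 690, 900]

r = pearson_r(fertilizer, plot_yield)
rho = spearman_rho(fertilizer, plot_yield)
print(f"Pearson r = {r:.2f}, r squared = {r * r:.2f}, Spearman rho = {rho:.2f}")
```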

  28. 4) Regression Analysis
      Objective
      • Measures the relationship between continuous independent & dependent variables
      Data Requirements
      • Must identify 1 dependent variable, 1 or more independent variables
      • Independent & dependent variables are usually interval or ratio scale data
      • But can use dummy independent variables (0,1) in multiple regression
      • Linear models are most common, but can use other functional forms (e.g., log, quadratic)

  29. Suggestions
      • Scatter plots indicate the data distribution, which must be well distributed over the range of data values
      • Print out scatter plots of dependent/independent variables (e.g., yield, fertilizer) & assess the scatter plots to find outliers
      • Check for outliers before running a regression & consider dropping cases with extreme/impossible values (e.g., small plots lead to measurement error)
      • Use theory (and possibly scatter plots) to specify the model & functional form, but avoid stepwise procedures (data mining) (e.g., theory indicates yields increase with higher N application & then decline, but farm-level data seldom include extremely high N rates)

  30. Suggestions (cont)
      • Review the correlation matrix (Figure) to identify highly correlated (r > 0.90) variables (multi-collinearity) in the model, and then drop one or more such variables
      • Missing data for any variable will eliminate that case from the model, especially a problem in multiple regression
      • “Good fit” (R²) is a function of the type of relationship; social analysis often gives low R²
      • Avoid including dominant independent variables (e.g., Production = harvested area, fertilizer, labor, etc.)
      • Can use a standardized coefficient model to get the percent contribution of independent variables

  31. Statistics
      • Constant shows the value of the dependent variable when the independent variable(s) is (are) zero
      • Regression coefficient indicates the change in the dependent variable with a 1 unit change in the independent variable
      • Significance of a coefficient is estimated by dividing the coefficient by its SE, then comparing the result (t-value) to the t-distribution value
      • R² indicates the share of variation in the dependent variable explained by the independent variables; ranges from 0-1 (i.e., none/complete)
      • F-value tests the hypothesis that all betas are equal to zero
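
The regression statistics listed above can be traced by hand for the one-variable case. The Python sketch below fits a simple linear regression and reports the constant, coefficient, SE of the coefficient, t-value, and R². It is a sketch for a single independent variable only (multiple regression needs matrix methods or a package such as SPSS), and the data are hypothetical.

```python
import math

def simple_regression(x, y):
    """Ordinary least squares with one independent variable."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

    coefficient = sxy / sxx                    # change in y per 1 unit change in x
    constant = mean_y - coefficient * mean_x   # value of y when x is zero

    residuals = [b - (constant + coefficient * a) for a, b in zip(x, y)]
    ss_residual = sum(e ** 2 for e in residuals)
    ss_total = sum((b - mean_y) ** 2 for b in y)

    r_squared = 1 - ss_residual / ss_total
    se_coefficient = math.sqrt(ss_residual / (n - 2) / sxx)
    t_value = coefficient / se_coefficient     # compare to t-distribution with n - 2 df
    return {
        "constant": constant,
        "coefficient": coefficient,
        "se": se_coefficient,
        "t": t_value,
        "r_squared": r_squared,
    }

# Hypothetical N-fertilizer rates (kg/ha) and yields (kg/ha) for five plots.
fertilizer = [0, 20, 40, 60, 80]
plot_yield = [500, 620, 700, 690, 900]

result = simple_regression(fertilizer, plot_yield)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```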

  32. Reporting
      • In text, report direction of relationship, coefficient, significance, R² & F-value (e.g., Regression analysis indicated that N (0.44) & weeding days (0.22) were significantly associated (95% level) with yield. The model had an R² value of 0.65 & a significant (99%) F-value.)
      • Also, list & discuss non-significant coefficients: why are they non-significant?
      • In tables, report all variables, coefficients, SE (in parentheses below coefficient), significance levels (* = .01, ** = .05, *** = .10), F-value & R²
      Note: Many relationships that are significant in bivariate analysis will be non-significant in a multivariate model

  33. 5) Logit & Probit Analysis
      Objective
      • Measures the degree & direction of the relationship between continuous independent variables & a category of a dependent variable
      Data Requirements
      • Dependent variable is categorical (e.g., adopter/non-adopter)
      • Independent variable is continuous (interval or ratio scale)
      Statistics
      • Number of cases correctly classified, contribution of each independent variable to prediction (coefficients), significance of each independent variable
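
To show what "cases correctly classified" means for a logit model, here is a small Python sketch. It does not estimate a model: the constant and coefficient are assumed values chosen purely for illustration, as are the cases and the 0.5 cutoff. Estimation itself would be done in SPSS or another package.

```python
import math

def logit_probability(constant, coefficients, values):
    """Predicted probability of the category coded 1 (e.g., adopter)."""
    z = constant + sum(b * x for b, x in zip(coefficients, values))
    return 1 / (1 + math.exp(-z))

def percent_correctly_classified(constant, coefficients, cases, cutoff=0.5):
    """Share of cases whose predicted category matches the observed one."""
    correct = 0
    for values, observed in cases:
        predicted = logit_probability(constant, coefficients, values) >= cutoff
        if predicted == observed:
            correct += 1
    return 100 * correct / len(cases)

# Assumed (not estimated) model: log-odds of adoption = -4 + 0.5 * years of education.
constant = -4.0
coefficients = [0.5]

# Hypothetical cases: ([years of education], adopted?).
cases = [
    ([4], False),
    ([12], True),
    ([10], False),  # the assumed model predicts adoption here, so this case is misclassified
    ([6], False),
]

print(f"{percent_correctly_classified(constant, coefficients, cases):.0f}% correctly classified")
```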

  34. Responsibility for Analysis
      Primary responsibility lies with the researcher who:
      • Designed the project,
      • Identified the research issues,
      • Developed the questionnaires,
      • Supervised data collection & therefore,
      • Knows the analytical needs & limitations of the data

  35. Documenting the Project
      Purpose:
      • Provide a permanent record of the project
      • Provide a reference for your analysis
      • Provide a reference for other users
      1. Archive Project Materials & Leave at the Research Location
         • Assemble questionnaires (for future reference), post-coding sheets, etc.
         • Make a copy of the data on CDs
         • Make a copy of the Project Documentation
         • Categorize, label & store all material in a safe place that is protected from heat (sun) & magnetic interference

  36. Project Documentation (bound volume)
      • Project Documentation (summary): project title, sponsors, geographical coverage, dates, project overview, publications
      • Description of Survey Methodology: overview of research issues, survey locations, sampling method/limitations, enumerator selection/training, module design process, survey instruments, data entry
      • Survey Documentation (for each module): purpose, topics covered, sample size, data level, unit of observation, number of rounds, survey areas & dates, time reference for data (season, months), base file name, copies of modules (all languages), names of enumerators & respondents by survey location

  37. SPSS Systems/Data File Summaries (all SPSS files) name of base file, source of data (module name), description of data, data limitations, “file information” printouts, history of base file modifications/transformations including names of new files created

  38. Suggestions for Documenting Modified Systems Files
      Failure to update files/variable descriptions is a major problem
      a) Recoded/Computed Variables
         • Don’t recode the original variable; first create a new variable with the same data
         • Give the recoded/computed variable a name that begins with R/C to indicate it was recoded/computed
         • Immediately create “value labels” for the new variable
         • Describe variable transformations in the variable label [e.g., Yield (yield=prod/area)]

  39. b) Keep a Permanent Record (file) of Data Transformations
         • Paste SPSS commands into the “Syntax Editor”, then run them from the editor. Save this file!
         • At the end of the first SPSS session, copy the syntax that you want to save/archive into a word processing file. At the end of each subsequent SPSS session, add the new syntax commands to this file
      c) Periodically Print out “File Information”
         • After making transformations, print out the new “file information”
      d) Cleaning Up Your Current Work File
         • After transforming a variable, drop the old variable from the current version of the file
         • Be sure to save the original variable in an earlier version of the file