
SURVEYS VERSUS INSTRUMENTS


Presentation Transcript


  1. SURVEYS VERSUS INSTRUMENTS Damon Burton University of Idaho

  2. WHAT IS A SURVEY? • Surveys are questionnaires that can be used inductively and deductively to answer a particular research question. • They are typically used only once to answer a practical research question. • Surveys may be conducted to (a) get community preferences on capital improvement projects, (b) identify satisfaction with recreation programming, or (c) solicit input on possible curricular changes.

  3. WHAT IS AN INSTRUMENT? • Instruments are standardized questionnaires that are developed as deductive research tools to measure a specific construct (e.g., motivation, confidence, perfectionism, leadership), helping address multiple types of research questions so they can be used in many studies. • Instruments may be used to (a) assess students’ confidence in math, (b) examine how perfectionism impacts problem-solving, or (c) examine how different teaching styles influence learning.

  4. DEVELOPMENT OF SURVEYS • Surveys are typically developed to answer specific questions. • Questions are worded based on content experts’ opinions of face validity. • Respondents’ answers to questions are typically not necessary to finalize the item pool. • Pilot testing may be used to identify administration or wording problems.

  5. INSTRUMENT DEVELOPMENT • Instruments are typically developed to be research tools. • Questions are worded based on content experts’ opinions of face validity. • Respondents’ answers are typically used to select and refine items and finalize the item pool. • 40-item instruments often start out with an item pool of 100+ questions.

  6. INSTRUMENT DEVELOPMENT • Instruments typically require 2-4 data collections to refine items to final form. • In each round, the item pool is factor analyzed, grouping together items that are responded to similarly, to test whether they match predictions for subscales. • Items that don’t factor, have low reliability, or have poor item-to-subscale correlations are rewritten or eliminated. • Eventually the final item pool must confirm the instrument’s conceptual model for which items group together into subscales.
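
A minimal sketch of the item-to-subscale screening described above, in Python. The DataFrame, the column names, and the .30 cutoff are hypothetical illustrations, not part of the original procedure:

```python
# One screening pass: corrected item-to-subscale correlations for a hypothesized subscale.
# Items correlating weakly with the rest of their subscale are candidates for rewriting or removal.
import pandas as pd

def flag_weak_items(subscale: pd.DataFrame, cutoff: float = 0.30) -> list:
    """Return items whose corrected item-to-subscale correlation falls below the cutoff."""
    weak = []
    for item in subscale.columns:
        rest = subscale.drop(columns=item).sum(axis=1)   # subscale total without the item itself
        if subscale[item].corr(rest) < cutoff:
            weak.append(item)
    return weak

# Hypothetical usage with pilot data for a 5-item subscale:
# responses = pd.read_csv("pilot_data.csv")
# print(flag_weak_items(responses[["c1", "c2", "c3", "c4", "c5"]]))
```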

  7. CONSTRUCTION STRATEGIES • Surveys: quicker & easier; few hypotheses; simpler development; diverse topics; single-stage process; content validity the major concern. • Instruments: longer & more conceptually focused; more hypotheses; complex development; topics more focused; 2-4 stage process to develop and validate; reliability plus concurrent, construct & predictive validity needed to confirm usability.

  8. SCALE DEVELOPMENT GUIDELINES • STEP 1 – Determine what you want to measure. • STEP 2 – Generate an item pool. • STEP 3 – Determine the format for measurement. • STEP 4 – Have experts review the item pool. • STEP 5 – Inclusion of a social desirability scale. • STEP 6 – Administer the item pool to a developmental sample. • STEP 7 – Evaluate items. • STEP 8 – Optimize scale length.

  9. STEP 1 - DETERMINE WHAT YOU WANT TO MEASURE • Do you base the instrument on theory or create your own conceptual framework? • A good theory can be helpful in developing items. • If theory is not available, develop a conceptual framework for your scale. • Specificity helps clarity, so decide whether you want to be more general or more specific. • Rotter’s (1966) internal-external scale is general, comparing internal versus external sources of control, while Levenson’s (1973) multidimensional scale measures internal control, powerful others, and chance as sources of control.

  10. STEP 1 - DETERMINE WHAT YOU WANT TO MEASURE • Wallston, Wallston & DeVellis (1978) developed the Multidimensional Health Locus of Control Scale based on 3 locus of control dimensions, which can be made specific to a variety of medical conditions (e.g., diabetes). • Specificity can focus on outcomes (e.g., better health, business efficiency), content (e.g., anxiety), setting (e.g., school vs. work), and populations (e.g., children vs. adults). • Make sure your instrument measures only the specific construct of interest and doesn’t inadvertently measure other constructs as well.

  11. SPECIFICITY CASE STUDY COMPETITIVE ANXIETY • Martens, Burton, Vealey, Bump & Smith (1990) developed and validated the Competitive State Anxiety Inventory-2 (CSAI-2), which for over 20 years has been the major tool to assess state anxiety in sport. • The major problem with the CSAI-2 is that some of the items, which are symptoms of physical and mental state anxiety, can also reflect positive emotions (e.g., excitement or confidence). • Several researchers have added a valence scale to the CSAI-2 so athletes could rate how much their symptoms are facilitative or debilitative to performance.

  12. SPECIFICITY CASE STUDY COMPETITIVE ANXIETY • My colleagues and I disagree with the valence scale approach because by definition anxiety is a negative emotion that is debilitating to performance. • The only way to measure state anxiety independently from other positive emotions with similar symptoms is to develop a new instrument. • The development of the CSAI-3 is a complex 3-stage process. • Stage 1 is to develop new items that measure six dimensions of state anxiety, including: worry, motivation, focus, arousal, bodily tension and affect.

  13. SPECIFICITY CASE STUDY COMPETITIVE ANXIETY • We collect data from a large sample of athletes and analyze the results to determine which items meet selection criteria. • Stage 2 is to identify how many items in the initial item pool would be considered debilitative to performance by 70% or more of athletes. • Again we collect data from a large sample and analyze results to determine how to revise the item pool so that it includes only true anxiety items.

  14. SPECIFICITY CASE STUDY COMPETITIVE ANXIETY • Stage 3 revises the item pool a second time, and the instrument is subjected to multiple validation studies to assess overall construct validity. • Concurrent validity compares the CSAI-3 with constructs that are both similar to and different from the construct your instrument is measuring. • Predictive validity makes conceptual predictions about relationships between the CSAI-3 and related constructs and then tests those predictions. • Intervention studies are often used to test causality, so we examine whether a program to reduce anxiety actually lowers CSAI-3-measured anxiety levels.

  16. STEP 2 – GENERATE AN ITEM POOL • Generate a large pool of items that may be selected for inclusion in the instrument. • Instruments typically lose 40-75% of their items during the development process, so the Goal-Setting Inventory for Sport (GSIS) that we are currently developing had an initial item pool of 129 items for an instrument targeted at about 50 items. • Choose items that reflect the instrument’s purpose. • Each item should tap into a component of the latent variable of interest. • Latent constructs may be unidimensional (i.e., one factor or scale) or multidimensional (i.e., multiple factors or subscales).

  17. GSIS CASE STUDY • The Goal-Setting Inventory for Sport (GSIS) was hypothesized to tap goal-setting through 3 dimensions, with 4-7 subscales per dimension. • Subscales need a minimum of 4-5 items in order to have strong alpha reliability (i.e., internal consistency) scores of .70 or higher. If subscales are not reliable, they can’t be valid. • The initial measurement model predicted that the 1st dimension was “commitment,” in which athletes are strongly motivated to reach specific goals. • Dimension 2 was “frequency” of using goals, so committed goal-setters would set goals more frequently. • Dimension 3 was “effectiveness,” because setting goals frequently helps athletes learn the skills to make goals work effectively.
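
Since the .70 alpha guideline comes up repeatedly, here is a small, generic sketch for computing coefficient (Cronbach's) alpha from raw item responses; the data matrix is hypothetical:

```python
# Cronbach's alpha for one subscale:
# alpha = k/(k-1) * (1 - sum of item variances / variance of the total score)
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items matrix for a single subscale."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the summed subscale score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-item subscale answered by 200 respondents on a 1-6 scale:
# data = np.random.randint(1, 7, size=(200, 5))
# print(cronbach_alpha(data))
```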

  18. GSIS CASE STUDY • Based on a review of the goal-setting literature, potential subscales were identified for each dimension. • For commitment, 4 subscales were hypothesized, including: (a) general commitment, (b) commitment to using a systematic process, (c) commitment to overcoming failure, and (d) using social support to enhance commitment. • For frequency and effectiveness, 7 subscales each were hypothesized, including: (a) goal focus, (b) goal difficulty, (c) use of practice goals, (d) short- and long-term goals, (e) individual goals, (f) goal barriers and plans, and (g) goal logs and evaluation. • Although 18 subscales were hypothesized, we expect to lose 3 to 6 subscales during development. • Even though all subscales are strong conceptually, some can’t be effectively measured.

  19. GSIS ITEM CONTENT & NUMBER • A subscale such as “goal focus” may be attempting to tap a number of characteristics, so a good blend of characteristics is desired in the initial item pool. • How questions are written makes a big difference. Generally, shorter, more direct items with readability scores below the 5th-grade level are desired. However, how an item is worded impacts its effectiveness, and it’s hard to know which wording will work best without trying it out on a sample of athletes. • In many cases, experts rate the “face validity” of items similarly, yet you write the same item 2-3 ways initially and let the results tell you which item to keep and which to discard. Thus, larger initial item pools allow the data to dictate item effectiveness. • We wanted to get the GSIS to around 100 items for the initial item pool, but didn’t feel we could cut below 129 items and still have at least 6 items per subscale. • 129 items is a bit long, so we’ll have to accept a lower response rate, and initial data collection will take longer to get 600 responses (i.e., ~5 per item).
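
One way to check the below-5th-grade readability target is to run candidate items through a readability index. This sketch uses the third-party textstat package; the example items are hypothetical:

```python
# Screen candidate items for readability (target: below a 5th-grade reading level).
# Requires: pip install textstat
import textstat

candidate_items = [
    "I set goals for every practice.",
    "I utilize systematic goal-evaluation procedures following competitive engagements.",
]

for item in candidate_items:
    grade = textstat.flesch_kincaid_grade(item)   # estimated U.S. grade level
    verdict = "OK" if grade < 5 else "rewrite"
    print(f"{grade:5.1f}  {verdict}  {item}")
```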

  20. GSIS ITEM WRITING • Developing items works best in a small group where individuals bring different, but related, perspectives to the process. What sounds good to you may not sound good to the group. • Items can be written both positively and negatively for variety. If you want a high score to be positive, you’ll need to reverse score negative items. Some negatively worded items are often beneficial in subscales to prevent respondents from always answering in the affirmative, although they are not required. • Keep items short (i.e., 6-10 words) and succinct, and keep readability low (i.e., 5th grade or below). Time to complete a survey is a major factor in return rate, so making instruments quick to take is a good way to increase returns. • Include only one idea per item and avoid compound items (e.g., “I use goals in practices and competitions.”). • Try to use a “panel of experts” not involved in item development to evaluate items for “face validity.”
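
The reverse-scoring step mentioned above is mechanical: on a 1-6 response format, a negatively worded item is rescored as 7 minus the raw response before subscale totals are computed. A short sketch with hypothetical item names:

```python
# Reverse score negatively worded items on a 1-6 Likert format, then sum a subscale.
# Item names (gs01-gs04) and the negative-item list are hypothetical.
import pandas as pd

SCALE_MIN, SCALE_MAX = 1, 6
NEGATIVE_ITEMS = ["gs03"]

def reverse_score(df: pd.DataFrame, items: list) -> pd.DataFrame:
    out = df.copy()
    out[items] = (SCALE_MIN + SCALE_MAX) - out[items]   # e.g., 6 -> 1, 5 -> 2, ...
    return out

# responses = pd.read_csv("gsis_pilot.csv")
# scored = reverse_score(responses, NEGATIVE_ITEMS)
# scored["commitment_total"] = scored[["gs01", "gs02", "gs03", "gs04"]].sum(axis=1)
```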

  21. STEP 3 – DETERMINE THE MEASUREMENT FORMAT • Thurstone scaling – like a tuning fork that vibrates at a specific frequency, this type of scaling structures each item so that respondents can identify the response category that describes them accurately. • Typically, judges place a large pool of items into categories corresponding to equally spaced intervals of construct magnitude or strength. • Nunnally (1978) indicates that it is difficult to find items that consistently resonate at specific levels of the phenomenon. • Because of the difficulty of developing them, Thurstone scales are seldom used in surveys.

  22. GUTTMAN SCALING • A Guttman Scale is a series of items tapping progressively higher levels of an attribute. • A respondent should endorse a series of adjacent items until the amount of the attribute that the items tap exceeds that possessed by the individual, so none of the remaining items should be endorsed. • A respondent’s level on the attribute is indicated by the highest item yielding an affirmative response. • Guttman Scales work best for objective information but measurement is less effective for more subjective concepts (e.g., thoughts, feelings, attitudes, values).
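
The Guttman scoring rule described above (a respondent's level is the highest item endorsed in the ordered series) takes only a few lines; the example responses are hypothetical:

```python
# Guttman-style scoring: items are ordered from weakest to strongest expression of the attribute,
# and the score is the position of the highest item endorsed (0 if none are endorsed).
def guttman_level(ordered_responses: list) -> int:
    level = 0
    for position, endorsed in enumerate(ordered_responses, start=1):
        if endorsed:
            level = position
    return level

# A respondent who endorses the first three of five progressively stronger items scores 3:
print(guttman_level([True, True, True, False, False]))
```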

  23. RESPONSE FORMAT OPTIONS • Optimal number of response categories, • Specific types of response formats, • Likert scales, • Semantic differential, • Visual analog, • Numerical response formats, • Binary options.

  24. LIKERT SCALES • When using a Likert scale, the item is presented as a declarative sentence, followed by response options that indicate varying degrees of agreement with or endorsement of the statement. • An odd or even number of responses may be used, but 5 or more choices are needed to assume interval data. • Response choices should be worded so they have roughly equal intervals between options. • For a 6-choice format, options would include: (1) strongly disagree, (2) moderately disagree, (3) mildly disagree, (4) mildly agree, (5) moderately agree, (6) strongly agree. • Likert scales are widely used to measure opinions, beliefs and attitudes.

  25. LIKERT SCALES • Don’t write items that are too strong or too weak. • Use response categories to differentiate among respondents. • EXAMPLE – Exercise is an essential component of a healthy lifestyle (i.e., 6 categories from 1 (strongly disagree) to 6 (strongly agree)). • EXAMPLE – Combating drug abuse should be a top priority (i.e., 5-point Likert scale from 1 (completely untrue) to 5 (completely true)).

  26. SEMANTIC DIFFERENTIAL • The semantic differential was developed to measure attitudes. • Identification of the target stimulus is followed by a list of adjective pairs, with each pair representing the opposite ends of a continuum (e.g., honest and dishonest). • EXAMPLE – The stimulus might be a group such as “automobile salespeople.” A list of adjectives follows for the respondent to rate, covering traits such as honesty, communication style, reputation, etc. • Often 7 to 9 lines are placed between the opposite extremes and respondents place a check on the line that reflects their attitude.

  27. VISUAL ANALOG • For a visual analog scale, the respondent is presented with a continuous line between a pair of descriptors representing opposite ends of a continuum. • Respondents are instructed to place a mark at the point on the line that represents their opinion. • Visual analog scales are very sensitive, although what a given score means may vary across respondents. • Participants are more likely to give fresh responses each time the scale is administered because it’s difficult to accurately remember previous responses.

  28. RESPONSE FORMATS AND NEURAL PROCESSES • Numerical scales tend to suggest to respondents that a “quantity” evaluation is being made. • Binary options limit variability unless several items are summed to create composite scores. • Many phenomena require measuring states and traits separately. • “States” represent momentary conditions (e.g., state anxiety is your anxiety while taking this test). • “Trait” measures represent typical or average levels of the construct of interest (e.g., typical anxiety levels when taking tests).

  29. STEP 4 – EXPERTS REVIEW ITEM POOL • Enhance content validity by asking a group of people knowledgeable in the content area to review the item pool. • First, have experts review your definition of the phenomenon. • Second, they should rate each item on how well it measures what it is intended to measure. This step helps confirm the validity of your item development. • Third, experts can also evaluate items for clarity and conciseness. • Fourth, reviewers can point out ways to tap the phenomenon that you have failed to use. • Finally, use expert advice only if it is consistent with the model used to develop the instrument.

  30. STEP 5 – INCLUSION OF VALIDATION ITEMS? • Social desirability (SD) is the tendency to respond to items in a way that the respondent feels is most socially desirable. • Social desirability is sometimes termed a “lie scale,” but it more accurately reflects the tendency to consciously or unconsciously present yourself in the way that seems most socially acceptable for the situation. • Experts disagree on how to use social desirability data. • Some experts believe that subjects with an SD score above 6 on a 10-point scale should be excluded from data analysis. • Others feel that SD should be used to assess relationships, and if the relationship is significant, items should be reworded to lower SD (e.g., worry versus concern when measuring anxiety). • SD can also be measured and used as a covariate to remove its impact on the phenomenon of interest.
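
Using SD as a covariate can be sketched as a simple residualization: regress the scale score on the SD score and keep the residuals, which are the scores with SD-related variance removed. The variable names below are hypothetical, and a full analysis would normally use ANCOVA or multiple regression rather than this minimal version:

```python
# Remove social-desirability-related variance from a scale score via simple linear regression.
import numpy as np

def residualize(score: np.ndarray, sd: np.ndarray) -> np.ndarray:
    slope, intercept = np.polyfit(sd, score, deg=1)   # fit score = intercept + slope * sd
    return score - (intercept + slope * sd)           # residual = observed minus predicted

# anxiety = np.array([...]); social_desirability = np.array([...])
# anxiety_adjusted = residualize(anxiety, social_desirability)
```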

  31. STEP 6 – ADMINISTER ITEMS TO A DEVELOPMENTAL SAMPLE • Additional refinement of your item pool should be based on analyzing data from a large sample of respondents (i.e., at least 5 per item). • If the sample size is too small, patterns of covariance among items may not be stable and may shift with new samples, reducing both reliability and validity. Error variance decreases as sample size increases. • The developmental sample may not represent the larger population for which the instrument is intended, and the composition of the sample is as important as its size. • Qualitative differences between samples may also be problematic. • For example, Study 3 in the development of the CMSQ solicited a sample from “Yahoo sport groups.” The mean age of this group was 36 years, even though other development work was done with collegiate athletes. Sample differences prompted us not to use data from this group to finalize our item pool.

  32. STEP 7 – EVALUATE THE ITEMS • Initially, run a frequency analysis to examine response distributions for each item, particularly skewness and kurtosis. • Next, correlate all items with each other and examine the patterns of relationships among items and subscales. Consider reverse scoring items with negative correlations. • Conduct an exploratory factor analysis (EFA) on the item pool to identify whether items group together consistent with predictions. • EFA is a mathematical procedure for grouping together items that are responded to in a similar way. If you have hypothesized multiple subscales, the EFA should yield factors that are consistent with the predicted subscales. • A series of EFAs is conducted, with the first used to eliminate the items with the lowest factor loadings. However, each time items are eliminated, the correlations among items change, requiring a new EFA to look at these new relationships. • EFA is both an “art” and a “science,” so both empirical and conceptual decisions are made on whether to keep or delete an item.
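
A hedged sketch of a single EFA pass using scikit-learn's FactorAnalysis. The three-factor model, the varimax rotation, and the .40 loading cutoff are illustrative choices standing in for the empirical and conceptual decisions described above:

```python
# One exploratory-factor-analysis pass: fit the model, find each item's largest absolute loading,
# and flag low-loading items as deletion candidates. A new EFA is then run on the reduced pool.
import pandas as pd
from sklearn.decomposition import FactorAnalysis

def weak_loading_items(items: pd.DataFrame, n_factors: int = 3, cutoff: float = 0.40) -> list:
    z = (items - items.mean()) / items.std(ddof=0)                    # standardize the items
    fa = FactorAnalysis(n_components=n_factors, rotation="varimax").fit(z)
    loadings = pd.DataFrame(fa.components_.T, index=items.columns)    # items x factors
    strongest = loadings.abs().max(axis=1)                            # each item's largest loading
    return strongest[strongest < cutoff].index.tolist()

# responses = pd.read_csv("item_pool.csv")
# drop_candidates = weak_loading_items(responses)
```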

  33. STEP 8 – OPTIMIZE SCALE LENGTH • Scale length is based on several factors. • First, keep the scale as short as possible to (a) minimize response time to keep respondents motivated and focused and (b) make it more appealing to complete because the time commitment is minimal. • Second, keep the instrument long enough to get consistent alpha reliability scores for the overall scale and individual subscales. • Typically, 4-5 item subscales work best. Alpha reliability is sensitive to the number of items in the subscale, with 5-item subscales yielding alpha reliability values of .70 or larger and 4-item subscales sometimes yielding solid internal consistency. • For example, the Coaching Success Questionnaire-2 (CSQ-2) has ten 4-item subscales with the lowest alpha reliability values, whereas the CMSQ has four 5-item subscales, with one subscale still at .67.
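
The sensitivity of alpha to subscale length follows the logic of the Spearman-Brown prophecy formula (not named in the slides; added here only as a reference point). If a scale is lengthened by a factor k with comparable items, its expected reliability rises from rho to:

```latex
\[
  \rho_{\text{new}} \;=\; \frac{k\,\rho}{1 + (k - 1)\,\rho}
\]
% Example: adding a 5th comparable item to a 4-item subscale (k = 1.25)
% raises a reliability of .65 to roughly (1.25)(.65) / (1 + .25(.65)) = .70.
```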

  34. FUNDAMENTAL RESEARCH STRATEGIES • Precision of Measurement • Generalizability of Results • Reality of Measurement

  35. The End
