Survey Design and Analysis Torben Schubert, December 12th, 2012, CIRCLE, Lund NORSI course on ‘Survey of Quantitative Research’
Outline
• Survey Design
• Cluster analysis
• Latent factors
• Hypothesis testing using Community Innovation Survey data
• Limited dependent variables
• Application using STATA
Introduction
• Yesterday, you had an introduction to linear regression analysis.
• OLS is one of the most powerful tools for testing hypotheses.
• But hypothesis testing is not the only task in quantitative empirical research.
• Sometimes we might not even have a clear idea about the structures in the data set. We may find it difficult to develop sensible hypotheses.
Introduction
• Sometimes we encounter measurement problems that make it difficult to discern what the theoretical meaning of a variable or a set of variables actually is.
• What can we do then?
The ideal way to good results
• Good empirical research should follow these steps:
  • Build a theory about a certain phenomenon (e.g. by literature review, or by squeezing your brain)
  • Delineate expectations about empirical relationships (often called hypotheses)
  • Collect the data that is necessary to measure your relationships
  • Use a sensible technique to determine whether your hypotheses hold.
Problems
• This ideal process is often obstructed:
  • We might get access to a rich dataset that we have not compiled ourselves and which we therefore do not fully understand.
  • We might have a complex measurement construct in mind, but we are not sure whether our variables really measure it.
Some suggestions
• If you are unsure about the information contained in your dataset, do not underestimate the power of descriptive statistics.
• Means by groups or correlations can greatly improve your understanding of the data.
• Take your time to investigate an unknown dataset.
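The suggestions above can be sketched in Stata as follows. This is only an illustration on Stata's built-in auto data; the choice of variables and of `foreign` as the grouping variable is an assumption made for the example, not something taken from the slides.

```stata
* Exploratory look at an unfamiliar dataset (illustrative variable choice)
sysuse auto, clear

* Overview of variables, storage types and labels
describe

* Means, standard deviations and counts by group
tabstat price mpg weight, by(foreign) statistics(mean sd n)

* Pairwise correlations with significance levels
pwcorr price mpg weight length, sig
```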
Cluster analysis
• What is a cluster?
• Loosely defined: data can be considered clustered if
  • observations belonging to the same cluster are alike.
  • observations belonging to other clusters differ.
Cluster analysis
• Cluster analysis assumes that observations (e.g. firms) belong to a given number of different clusters that are inherently different from each other.
• Technically, you search for multivariate similarity between observations, given a set of characteristics.
• E.g. you could think that firms differ by age, size, and innovativeness.
Cluster analysis
• A clustering method then sorts the firms that are most similar to each other into a given number of clusters.
• A multitude of techniques exist, but most of the common ones are rather descriptive and leave many arbitrary choices to the researcher:
  • Which variables to include?
  • How many clusters to go for?
  • Which method to use?
Cluster analysis
• …and not all data are clustered.
Cluster analysis
• An example in STATA based on the auto data set
• The command structure is
    cluster subcommand varlist, options
• Type the following:
    sysuse auto
    cluster wardslinkage rep78 length price if !missing(rep78) & !missing(length) & !missing(price), measure(correlation)
    cluster dendrogram
Cluster analysis
• The dendrogram shows at which tolerance we start to cluster observations and subgroups together.
• The number of clusters is arbitrary, but 3 is maybe not a bad choice.
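Because the number of clusters is a judgment call, a stopping rule can serve as a rough cross-check on what the dendrogram suggests. The sketch below is an assumption-laden illustration: it uses Ward's linkage with Stata's default dissimilarity measure instead of `measure(correlation)`, and the cluster name `wardcl` is an arbitrary label introduced here.

```stata
sysuse auto, clear

* Keep complete cases on the clustering variables
keep if !missing(rep78, length, price)

* Ward's linkage with the default measure; "wardcl" is a hypothetical name
cluster wardslinkage rep78 length price, name(wardcl)

* Calinski-Harabasz pseudo-F for 2 to 6 groups;
* larger values point to more distinct clusterings
cluster stop wardcl, rule(calinski) groups(2/6)
```

Such an index does not remove the arbitrariness; it only adds one more piece of descriptive evidence.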
Cluster analysis
• Then type
    cluster generate cutvar = groups(3)
  in order to generate a grouping variable.
• To generate summary statistics by groups, type
    bysort cutvar: sum rep78 length price if !missing(cutvar)
Cluster analysis
• Cluster analysis is a nice data-mining tool that is useful when you have no idea of what is going on.
• Arguably, I would not recommend using it in a scientific paper because of its exploratory character.
• It might assist you in earlier stages of research.
• Note that there are statistically more advanced methods in other packages such as R (keyword: model-based clustering).
Latent factors
• Theory is often formulated in terms of concepts that cannot be measured directly.
• This happens often in management research, sociology, and psychology.
• Suppose you hypothesize that teacher quality increases student performance.
• How to measure teacher quality?
• We might consider asking a battery of questions about a set of quality dimensions (Is he well prepared? Does he respond to students' questions? ...)
Factor analysis
• The first question to ask is whether there really is a unidimensional thing called teacher quality.
• You can use factor analysis for this.
• For any given set of variables, factor analysis determines the underlying (latent) constructs.
• Type in the following:
    use http://www.ats.ucla.edu/stat/stata/output/m255, clear
    factor item13-item24, ipf factor(3)
Factor analysis
• General rule: use as many factors as there are eigenvalues greater than one.
• In this case 1: good news!
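The eigenvalue rule can also be inspected graphically. The following is a minimal sketch on the built-in auto data; the variable list and the use of the principal-component factor method (`pcf`) are assumptions made for this example and differ from the `ipf` call on the item data above.

```stata
sysuse auto, clear

* Principal-component factoring; with pcf, Stata retains by default
* the factors whose eigenvalues are at least 1
factor price mpg weight length displacement, pcf

* Scree plot of the eigenvalues with a reference line at 1
screeplot, yline(1)
```

Counting the eigenvalues above the reference line gives the number of factors suggested by the rule.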
Cronbach's Alpha
• Another commonly used measure is Cronbach's alpha, which increases with the average correlation between a given set of variables and with the number of variables.
• This should be large (at least 0.65).
• Type in
    alpha item13-item24
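For reference, the standardized version of Cronbach's alpha can be written in terms of the number of items and their average pairwise correlation. The symbols $k$ and $\bar r$ are notation introduced here, not taken from the slides. By default, Stata's `alpha` works with covariances; the `std` option requests the standardized version.

```latex
% k = number of items, \bar{r} = average pairwise correlation between the items
\alpha_{\text{std}} = \frac{k\,\bar{r}}{1 + (k-1)\,\bar{r}}
% Illustration: k = 12 and \bar{r} = 0.3 give 3.6 / 4.3 \approx 0.84
```

The formula shows why alpha is not the average correlation itself: with many items, a modest average correlation can already yield a high alpha.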
Introduction
• Community Innovation Survey: harmonized survey of innovation behavior in the European Union + Norway
• Moving cross-section data with a lot of information about innovation inputs, outputs, firm characteristics, markets, …
• We can analyse this data with the tools we were equipped with yesterday:
  • t-tests about differences in means
  • OLS to test more complicated hypotheses
• But many variables do not easily lend themselves to OLS because of their nature…
Overview
• Limited dependent variables (LDV)
• Types of LDV
• Implications for OLS
• Estimation Methods
• Maximum Likelihood Estimation
• The need for marginal effects
• Probit and Logit Models
• Multinomial Models
• Count data
• Tobit Models
Introductory Reminder
• What do we estimate by regression?
• Suppose we have a linear regression equation.
• We are typically interested in the coefficients/parameters.
• But what is their meaning?
• A commonly heard suggestion:
  • A coefficient measures how the explained variable changes when the explaining variable changes by one unit…
Introductory Reminder
• This is imprecise. But why?
• Look at the regression equation again: it contains an error term.
• The error obstructs the direct relationship between the explained variable on the one hand and the coefficients and explaining variables on the other.
Introductory Reminder
• We solve that by focusing on expectations.
• The coefficient now has the following meaning:
  • A coefficient measures how the expected value of the explained variable changes when the explaining variable changes by one unit.
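The argument can be written out as follows. The notation ($y$, $x_k$, $\beta_k$, $\varepsilon$) is assumed here for illustration, and the step from the first to the second line relies on the usual assumption that the error has conditional mean zero.

```latex
\begin{align}
y &= \beta_0 + \beta_1 x_1 + \dots + \beta_K x_K + \varepsilon \\
% assuming E[\varepsilon \mid x_1, \dots, x_K] = 0
E[y \mid x_1, \dots, x_K] &= \beta_0 + \beta_1 x_1 + \dots + \beta_K x_K \\
\frac{\partial E[y \mid x_1, \dots, x_K]}{\partial x_k} &= \beta_k
\end{align}
```

Only the last line justifies the "one unit" interpretation: it holds for the expected value of $y$, not for each individual observation.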
LDV - Types
• Basic definition: An LDV is any dependent (also: explained, left-hand-side) variable in a regression that cannot take arbitrary values on the real axis.
• Examples
  • Indicator variables: e.g. employed (y/n)
  • Count variables: # patents
  • Strictly positive variables: amount of consumed alcohol per week
  • Multinomial response variables: preferred leisure time activities (bowling, reading, meeting friends)
LDV – Implications for OLS
• Suppose we intended to explain the employment status of persons.
• A convenient way of coding is 1: employed and 0: unemployed.
• Technically, we could run a linear regression of employment status on the explanatory variables, yielding coefficient estimates.
LDV – Implications for OLS
• But consider the estimated expectation of the employment indicator.
• Since the estimated coefficients are fixed and there are no restrictions on the explanatory variables, the predicted values may well lie outside the theoretical boundaries of 0 and 1.
• This is an implication of the linearity of OLS.
LDV – Implications for OLS
• We impose a linear model with no restrictions on an expected value that should be bounded between 0 and 1.
• We need to find a non-linear model for the expectation value.
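A quick way to see the problem is to fit OLS to a 0/1 outcome and inspect the fitted values. The sketch below uses the built-in auto data with `foreign` as the indicator; this stand-in for employment status and the choice of regressors are assumptions made for the example.

```stata
sysuse auto, clear

* Linear probability model: OLS on a 0/1 outcome
regress foreign mpg weight

* Fitted values are the estimated expectation, i.e. the predicted probability
predict phat_ols, xb

* Inspect the range and count predictions outside the unit interval
summarize phat_ols
count if (phat_ols < 0 | phat_ols > 1) & !missing(phat_ols)
```

Any nonzero count is a prediction that cannot be interpreted as a probability.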
LDV – Implications for OLS
• Suppose you want to explain income, and the data is censored at an upper threshold (e.g. 100,000€ per month and above).
• What happens if you use OLS after dropping the highest category (truncation) or after replacing the censored values with 100,000 (censoring)?
LDV – Implications for OLS
• Obviously, the estimates are biased downward in this case.
• OLS yields inconsistent results.
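The two OLS shortcuts can be compared with a model that accounts for censoring, such as the Tobit model. The following is a sketch under stated assumptions: the auto data replaces the income example, and the censoring threshold of 30 on `mpg` is imposed artificially for illustration.

```stata
sysuse auto, clear

* Artificially censor mpg from above at 30 (hypothetical threshold)
generate mpg_cens = min(mpg, 30)

* OLS with censored values replaced by the threshold
regress mpg_cens weight

* OLS after dropping the censored observations
regress mpg weight if mpg < 30

* Tobit model that accounts for right-censoring at 30
tobit mpg_cens weight, ul(30)
```

Comparing the slope on `weight` across the three outputs shows how the treatment of censored observations affects the estimate.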
Estimation methods: ML
• OLS doesn't work in these situations.
• Common practice therefore:
  • Confirm that the explained variable is not an LDV (e.g. profits), or at least roughly not an LDV (e.g. height of a person).
  • If the variable is an LDV in some sense, use other methods that implement appropriate non-linear models for the expectation value.
• What are these methods?
Estimation methods: ML
• Fortunately, the Maximum Likelihood approach offers a flexible solution to a large class of such problems (developed by Fisher at the beginning of the 20th century).
• It follows several steps:
  • Choose an appropriate statistical model for your data.
  • Based on this model, express the likelihood of observing your sample as a function of the parameters.
  • Maximize this likelihood over the parameters. The solutions to this problem are the ML estimates.
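As an illustration of these three steps, consider a binary outcome with a probit model. The notation below ($y_i$, $x_i$, $\beta$, and $\Phi$ for the standard normal distribution function) is assumed for this sketch, and observations are taken to be independent.

```latex
\begin{align}
% Step 1: statistical model
P(y_i = 1 \mid x_i) &= \Phi(x_i'\beta) \\
% Step 2: likelihood of the sample as a function of the parameters
L(\beta) &= \prod_{i=1}^{n} \Phi(x_i'\beta)^{y_i}\,\bigl[1 - \Phi(x_i'\beta)\bigr]^{1 - y_i} \\
% Step 3: maximize (equivalently, maximize the log-likelihood)
\hat{\beta}_{ML} &= \arg\max_{\beta} \sum_{i=1}^{n} \Bigl\{ y_i \ln \Phi(x_i'\beta) + (1 - y_i) \ln\bigl[1 - \Phi(x_i'\beta)\bigr] \Bigr\}
\end{align}
```

Other limited dependent variables lead to different models in step 1, but steps 2 and 3 follow the same pattern.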
Marginal effects and meaning of coefficients
• What about the size of the effects?
• We are always interested in how the dependent variable changes when one of the independent variables changes.
• Unfortunately, because the expectation value is now non-linear, the coefficients are no longer identical to the marginal effects.
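Marginal effects and meaning of coefficients
• In the probit model, for example, we can show that the marginal effect of an explanatory variable is its coefficient scaled by the standard normal density evaluated at the linear index.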
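In standard notation, which is assumed here ($\Phi$ and $\phi$ denote the standard normal distribution and density functions, and $x_k$ is a continuous regressor), the probit marginal effect reads:

```latex
\begin{align}
E[y \mid x] = P(y = 1 \mid x) &= \Phi(x'\beta) \\
\frac{\partial P(y = 1 \mid x)}{\partial x_k} &= \phi(x'\beta)\,\beta_k
\end{align}
```

Because $\phi(\cdot)$ is strictly positive, the marginal effect has the same sign as $\beta_k$, but its size depends on the values of all explanatory variables.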
Marginal effects and meaning of coefficients
• Implications:
  • In the probit model, the coefficient does not coincide with the marginal effect.
  • Nonetheless, it gives the correct direction. This holds for many ML methods, but not for all.
  • Always, and I seriously mean always, report marginal effects instead of raw coefficients when using ML. (STATA can do that easily.)
The Probit and the Logit Model
• Whenever we encounter an indicator variable (0/1) as the dependent variable, we should think of an appropriate probability model.
• Examples:
  • Unemployed vs. employed
  • Non-patenting company vs. patenting company
  • …
• Several models are usable, but the most common are:
  • Logit model and probit model
• In practice, there is no large difference between the two when we focus on marginal effects.
The Probit and the Logit Model
• Easy to invoke them in STATA using the probit or logit command:
    probit depvar indepvars, options
    logit depvar indepvars, options
• For example, if you have a patent indicator pat, the innovation expenditures innoexp and the size of the company empl, the command looks like this:
    probit pat innoexp empl
The Probit and the Logit Model
• The marginal effects are computed using the following command directly after a probit/logit regression:
    mfx, predict(p)
• Observe that this command always refers to the last regression.
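In Stata 11 and later, `margins` is the documented successor to `mfx`. The sketch below shows one way to obtain marginal effects with it; the auto data and the variables `foreign`, `mpg` and `weight` are stand-ins chosen for illustration, not the patent data described above.

```stata
sysuse auto, clear

* Probit for a 0/1 outcome
probit foreign mpg weight

* Average marginal effects on Pr(foreign = 1)
margins, dydx(*)

* Marginal effects evaluated at the sample means of the regressors
margins, dydx(*) atmeans
```

The `atmeans` variant is the closer counterpart to `mfx`, which evaluates effects at the means by default. Like `mfx`, `margins` refers to the last estimated model.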
Multinomial models
• Suppose there are many buying alternatives for a product (e.g. Android smartphone, iPhone, Windows smartphone) and you would like to know how customers' characteristics impact their buying decision.
• In this case, there are 4 categories:
  • no SP
  • Android SP
  • iPhone
  • Windows SP
Multinomial models
• Differs from probit/logit because there are more than two categories.
• Two widely used models:
  • Multinomial logit
  • Multinomial probit
• Here there is a difference: multinomial probit is more flexible, but the computation is usually not feasible with more than four or five categories.
Multinomial models
• The STATA commands are mprobit and mlogit:
    mprobit depvar indepvars, options
    mlogit depvar indepvars, options
• For example, if you have a variable sp giving consumer-level data on SP choice, inc being the income, and age the age, the command would be
    mprobit sp inc age
Multinomial models
• Note: coefficients and marginal effects do not even necessarily have the same direction.
• You must calculate marginal effects using (we have four categories, each has its own marginal effects)
    mfx, predict(p outcome(1))
    mfx, predict(p outcome(2))
    mfx, predict(p outcome(3))
    mfx, predict(p outcome(4))
• Note: If the data is ordered (e.g. Likert scale), you can use ordered probit (oprobit, with the same syntax).
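With `margins`, the outcome-specific marginal effects can be sketched as follows. The example is hypothetical: it recodes the repair record `rep78` from the auto data into three categories (`repcat`, a name introduced here) to stand in for the smartphone choice, so it has three outcomes instead of four.

```stata
sysuse auto, clear

* Illustrative three-category outcome built from the repair record
recode rep78 (1/2 = 1) (3 = 2) (4/5 = 3), generate(repcat)

* Multinomial logit with category 2 as the base outcome
mlogit repcat mpg weight, baseoutcome(2)

* Average marginal effects on the probability of each outcome
margins, dydx(*) predict(outcome(1))
margins, dydx(*) predict(outcome(2))
margins, dydx(*) predict(outcome(3))

* Because the categories are ordered, an ordered probit is an alternative
oprobit repcat mpg weight
```

For each regressor, the marginal effects across the outcomes sum to zero, since the outcome probabilities sum to one.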