General Principles of Data Analysis
• Choice of an appropriate statistical technique
  • a complex issue
  • somewhat arbitrary
• Real-life data often contain mixtures of different types of data
• Two statisticians may select different methods, depending upon
  • what assumptions they are willing to make
  • extraneous factors
    • availability of software and its limitations
    • availability of time and financial resources
General Principles of Data Analysis
• Warnings
  • Figures allow us to calculate them
  • Applying different techniques and obtaining different results does not mean that something is wrong
  • Looking for an answer to the same question by using several methods may lead to a better understanding
  • Obtaining a negative result may be as informative as obtaining a positive one
  • Obtaining no answer with one technique does not mean that there is no answer at all
  • Etc.
General Principles of Data Analysis
• The choice of a statistical technique depends essentially upon
  • Characteristics of the analysis question
  • Characteristics of the data
  • Characteristics of the sampling design
• Characteristics of the Analysis Question
  • Whether or not there is a distinction between independent and dependent variables
  • Whether the nature of the research problem requires
    • Description, exploration, estimation, or
    • Testing of a hypothesis or model
  • Whether the focus of research is on 'variables' or 'objects'
General Principles of Data Analysis
• Characteristics of the Data
  • Types of data sets
    • Individuals - variables data sets
    • Proximities data sets
      • Variable - Variable Proximities
      • Individual - Individual Proximities
  • Types of Variables
    • Continuous or Quantitative Variables
    • Discrete or Qualitative Variables
  • Variable types by measurement level
    • Interval-scale variables
    • Ratio-scale variables
    • Nominal-scale variables
    • Ordinal-scale variables
General Principles of Data Analysis Techniques for problems without distinction between independent and dependent variables
General Principles of Data Analysis Techniques for problems with distinction between independent and dependent variables
General Principles of Data Analysis
• Usual way of statistical problem solving
  • Formulate the question in the terms and logic of the specific field of the problem (science management, pedagogy, economics, etc.)
  • Reformulate the question in statistical terms and logic
  • Find appropriate statistical model(s) and technique(s)
  • Use the selected model(s) and technique(s)
  • Give a statistical interpretation of the results obtained
  • Reformulate the interpretation in the terms of the original field of application
Scientific products by country
• Question in research management
  • Research groups have multiple outputs comprising publications, patents, experimental materials, etc. What are the differences, if any, in the performance of the research groups of selected countries?
• Statistical question
  • Can we construct a reasonable productivity index using the following measures of scientific output?
    • Articles in country
    • Patents
    • Articles abroad
    • Algorithms and designs
    • Original research reports
    • Experimental material
  • Can we find a significant difference between countries in the productivity index?
Scientific products by country
• Statistical model and technique
  • Partial order scoring for constructing the index of research output
  • Analysis of variance for testing the hypothesis concerning the significance of the difference
• Use of the selected model and technique

    $RUN POSCOR
    $FILES
    PRINT = POSCOR.LST
    DICTIN = R2R3RU.DIC
    DATAIN = R2RU.DAT
    DICTOUT = POSCOR.DIC
    DATAOUT = POSCOR.DAT
    $SETUP
    POSCOR SCORES OF RU OUTPUTS
    BADDATA=MD1 -
    IDVAR=V2 -
    TRANSVARS=(V1)
    POSCOR ORDER=DESR -
    ANAME='RU OUTPUT' -
    VARS=(V116,V118,V122,V126,V128,V130)
    $RUN ONEWAY
    $FILES
    PRINT = ONEWAY1.LST
    DICTIN = POSCOR.DIC
    DATAIN = POSCOR.DAT
    $SETUP
    ANALYSIS OF VARIANCE OF RU OUTPUT
    BADDATA=MD1 -
    PRINT=CDICT
    DEPVARS=(V8) CONVARS=(R1)
    $RECODE
    R1=RECODE V15 (40)=1, (360)=2, (410)=3, (638)=4, (844)=5, (868)=6
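The idea behind partial order scoring can be illustrated with a minimal Python sketch. It assumes a simplified dominance-count score: a case scores high when it dominates many of the cases it can be compared with. This is not claimed to be the exact formula used by the IDAMS POSCOR program, and the function names and the four example units are hypothetical.

```python
def dominates(a, b):
    """True if a is at least as good as b on every measure and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def partial_order_scores(cases):
    """Score each case (0..100) by the share of comparable cases that it dominates.

    Assumption: a case that is comparable with no other case gets the neutral score 50.
    """
    scores = []
    for i, a in enumerate(cases):
        worse = sum(dominates(a, b) for j, b in enumerate(cases) if j != i)
        better = sum(dominates(b, a) for j, b in enumerate(cases) if j != i)
        comparable = worse + better
        scores.append(100.0 * worse / comparable if comparable else 50.0)
    return scores


# Hypothetical research units described by three output counts each.
units = [(5, 2, 1), (3, 1, 0), (6, 2, 3), (1, 4, 0)]
scores = partial_order_scores(units)
print(scores)
```

The last unit is strong on one measure and weak on the others, so it is not comparable with any other unit; a partial order leaves such cases unranked rather than forcing a weighting of the measures.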
Scientific products by country Use of the selected model and technique (results)
Scientific products by country
• Statistical interpretation
  • The value F(5, 1454) = 56.018 shows that there is a highly significant difference between countries in the constructed performance index.
  • The differentiation between the countries is of medium strength: Eta(adj) = 0.398.
  • The mean values show the level of each country.
• Interpretation for research management
  • There are two countries with a low, two with a medium and two with a high productivity index.
• Source: P.S. Nagpaul: Guide to Advanced Data Analysis using IDAMS Software
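For readers who want to see where an F value and an eta-type measure come from, here is a minimal one-way analysis of variance in plain Python. It is a sketch on made-up numbers, not the IDAMS ONEWAY program: it returns the unadjusted eta squared, whereas the slide reports an adjusted eta, and the three small groups are hypothetical.

```python
def oneway_anova(groups):
    """Return (F, df_between, df_within, eta_squared) for a list of samples."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_value, df_between, df_within, eta_squared


# Hypothetical index values for three countries.
example_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value, df_between, df_within, eta_squared = oneway_anova(example_groups)
print(f_value, df_between, df_within, eta_squared)
```

With six countries and 1460 units the same computation would give the degrees of freedom 5 and 1454 quoted above.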
Performance, motivation and creativity of school children
• Question in psychology - pedagogy
  • Intellectual performance, motivation and creativity of school children can be measured by using several indicators. Some of them are produced by the children themselves (e.g. IQ tests), others are based on the evaluation given by their teachers (e.g. average grade). What are the perceivable dimensions, if any, behind these indicators?
• Statistical question
  • In the set of the listed indicators, are there any groups within which statistical inter-correlation and between which statistical independence can be detected?
• Indicators (T = based on the teachers' evaluation, C = produced by the children)
  • T Average grade
  • T Creative behaviour
  • C IQ
  • C Achievement motivation
  • C Creativity test
  • T Motivated behaviour
  • C Creative attitude
  • T Motivation index
Performance, motivation and creativity of school children
• Statistical model and technique
  • Pearson correlation between the measured indicators
  • Multidimensional scaling, cluster analysis
• Use of the selected model and technique
  • Executing PEARSON, MDSCAL, CLUSFIND in IDAMS
  • MDSCAL result (figure with two groups of indicators, labelled Children and Teachers)
Performance, motivation and creativity of school children
• Use of the selected model and technique
  • CLUSFIND result (figure)
    • Variables in the order shown: T Creative behaviour, T Motivated behaviour, C IQ, T Average grade, T Motivation index, C Achievem. motivation, C Creativity test, C Creative attitude
    • Values printed on the figure: 0.75, 0.40, 0.27, 0.71, 0.45, 0.02, 0.13
Performance, motivation and creativity of school children
• Statistical interpretation
  • Multidimensional scaling shows a clear separation of the indicators produced by children from those produced by teachers
  • Cluster analysis supports the finding that the variables coming from teachers and from children are separated
• Pedagogical/psychological interpretation
  • Just one aspect: the ratings teachers give to a child are nearly the same, regardless of which ability, attitude or behaviour dimension is evaluated
• Source: M. Hunya: Multidimensional statistical techniques in pedagogical studies
• Data: A. Deak, B. Kozeki: Study into the effect of motivation and creativity factors on the performance of school children
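The correlation and clustering steps of this example can be sketched in a few lines of Python. The sketch assumes 1 - r as the dissimilarity between two variables and average linkage for merging, which are illustrative choices and not necessarily the settings of the IDAMS PEARSON and CLUSFIND programs. It does not cover multidimensional scaling, and the scores for six children on four indicators are invented.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cluster_variables(data, n_clusters):
    """Agglomerate variables (average linkage) using 1 - r as dissimilarity."""
    names = list(data)
    dist = {(a, b): 1.0 - pearson(data[a], data[b])
            for a in names for b in names if a != b}
    clusters = [[name] for name in names]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(dist[p] for p in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge the two closest clusters
        del clusters[j]
    return clusters


# Hypothetical scores of six children on two teacher-based and two child-based indicators.
indicators = {
    "T Average grade": [1, 2, 3, 4, 5, 6],
    "T Motivation index": [2, 1, 4, 3, 6, 5],
    "C IQ": [1, 6, 2, 5, 3, 4],
    "C Creativity test": [2, 5, 1, 6, 4, 3],
}
clusters = cluster_variables(indicators, n_clusters=2)
print(clusters)
```

In this toy data the two teacher-based indicators end up in one cluster and the two child-based indicators in the other, which mirrors the kind of separation described in the interpretation above.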
Prediction of river flow values
• Question in hydrology
  • We have water level data on four rivers in North Africa (more than 40 years). Can the water flow level be predicted on the basis of data from the past? If so, with what precision?
  • What if the average flow level is considered instead of the individual ones?
• Statistical question
  • Can the river flow values be predicted by using a set of values from the preceding period?
  • How does the prediction change if the 6-month average flow is used?
Prediction of river flow values
• Statistical model and technique
  • Autoregression model (with a lag of 12 to 36) applied to the river flow time series
  • Transformation of the original data into a time series of moving averages (interval length = 6)
• Use of the selected model and technique
  • Time Series Analysis option from the IDAMS interactive facilities

    Lag         Original series   Moving average series
    12 months   R**2 = 0.32       R**2 = 0.92
    24 months   R**2 = 0.35       R**2 = 0.93
    36 months   R**2 = 0.36
Prediction of river flow values
• Use of the selected model and technique (figure panels: Original series, Moving average series)
Prediction of river flow values
• Statistical interpretation
  • Autoregression shows that individual values can be predicted with moderate precision (Unbiased R**2 = 0.32 - 0.36 for 12 to 36 months); high peak values are very poorly reproduced.
  • In the case of a 6-month moving average, the prediction is nearly perfect (Unbiased R**2 = 0.92 for 12 months).
• Hydrological interpretation
  • Although the pattern of changes can be reproduced fairly well, even three years of past data are not nearly enough to predict the height of peak flows.
  • But if we consider 6-month averages, they can be predicted with almost full precision.
• Data: UNESCO, Water Science Division
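The two transformations used in this example, a moving average and a least-squares autoregression, can be sketched as follows. The sketch runs on a synthetic monthly series (a seasonal sine wave plus random noise, generated here purely for illustration), not on the UNESCO river data, and it reports the plain in-sample R**2 rather than the unbiased value printed by IDAMS. All function names are hypothetical.

```python
import math
import random


def moving_average(series, window=6):
    """Means of consecutive windows; the result has len(series) - window + 1 values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def ar_r2(series, lags):
    """In-sample R**2 of a least-squares autoregression on the previous `lags` values."""
    mean = sum(series) / len(series)
    z = [v - mean for v in series]          # centring does not change R**2
    rows = [[1.0] + [z[t - k] for k in range(1, lags + 1)]
            for t in range(lags, len(z))]
    y = z[lags:]
    p = lags + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * target for r, target in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)                  # normal equations of the regression
    y_mean = sum(y) / len(y)
    ss_res = sum((target - sum(b * v for b, v in zip(beta, r))) ** 2
                 for r, target in zip(rows, y))
    ss_tot = sum((target - y_mean) ** 2 for target in y)
    return 1.0 - ss_res / ss_tot


# Synthetic 40-year monthly series: seasonal cycle plus noise (an assumption for illustration).
random.seed(0)
flow = [10 + 3 * math.sin(2 * math.pi * t / 12) + random.gauss(0, 3) for t in range(480)]
r2_raw = ar_r2(flow, 12)
r2_smooth = ar_r2(moving_average(flow, 6), 12)
print(r2_raw, r2_smooth)
```

Consecutive 6-month windows share five of their six values, so the averaged series is far smoother than the original; this is one reason why its R**2 comes out much higher.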
Business
• Question concerning company management
  • What are the factors that influence the economic performance of a company? Economic performance is measured by the return on capital employed.
• Statistical question
  • Can the return on capital be predicted by using a set of economic and production indicators from those characterizing the company?
  • How does the prediction change if we are looking for a subset of best predictors?
• Statistical model and technique
  • Multiple linear regression
  • Stepwise regression
Business
• Use of the selected model and technique
  • Running REGRESSN
• Results
  • The full regression model explains 70% of the adjusted variance of the dependent variable. Its standard error is about one half of the mean; the value of the determinant of the correlation matrix is .79478E-05. There are 8 variables (out of 12) with high covariance ratio values.
  • The stepwise regression model selects 3 variables, which explain 80% of the variance. No multicollinearity (0.77647). Standard error of the estimate of the dependent variable = 0.06135, which is quite low: high reliability of estimation.
Business
• Statistical interpretation
  • Full regression model: the reliability of prediction is poor. Strong multicollinearity is shown. The variables that contribute to multicollinearity can be identified.
  • Stepwise regression model: 3 variables explain 80% of the variance. No multicollinearity. High reliability of estimation.
• Interpretation for management
  • Although the full indicator set can give a good prediction, it cannot be recommended for real use because of the poor reliability of the prediction.
  • But if we consider 3 carefully selected indicators, we can get a fair prediction.
• Source: P.S. Nagpaul, India
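The determinant of the predictors' correlation matrix, quoted in the results above, is a common multicollinearity check: it is close to 1 when the predictors are nearly uncorrelated and close to 0 when some of them are almost linear combinations of others. The following Python sketch shows the computation on three invented indicator columns; it is not the REGRESSN program and the data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """Matrix of pairwise Pearson correlations between the given columns."""
    return [[pearson(a, b) for b in columns] for a in columns]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det                      # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


# Hypothetical indicators: x2 is almost a copy of x1, x3 is only weakly related to x1.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8, 7.1, 8.0]
x3 = [1, -1, 1, -1, 1, -1, 1, -1]
det_collinear = determinant(correlation_matrix([x1, x2]))
det_independent = determinant(correlation_matrix([x1, x3]))
print(det_collinear, det_independent)
```

A value as small as the .79478E-05 reported for the full model therefore points to strong linear dependence among the 12 predictors.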
Education
• Question concerning measurement of knowledge level
  • Tests are used very often in education for checking the level of knowledge in one subject or another. Long tests with many questions can meet the reliability requirement relatively easily. The question is whether we can make a short interactive, adaptive test from a long test while preserving, at least nearly, the original reliability.
• Statistical question
  • Can we give a good estimate of the original test value by using a tree-structure-based prediction?
• Statistical model and technique
  • Regression tree
Education
• Use of the selected model and technique
  • Running SEARCH
• Results
  • Starting from a standardized test (for checking a specific verbal aptitude) containing 20 questions, a regression tree with 3-4 questions was obtained. The regression tree contains 10 final subgroups (leaves) with estimates for the original test value ranging from 6.4 to 59.2. The explained variance is 90.4%.
Education
• Statistical interpretation
  • A very good estimate can be given for the original test value by using the obtained regression tree.
• Interpretation for test designers
  • Using the tree structure, a computer-assisted test can be constructed which is much shorter without losing the power of the original test.
• Source: M. Hunya: Finding optimal interactive test structures (1982)
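A single step of growing such a regression tree can be sketched in Python: among the test questions, pick the one whose right/wrong split leaves the least variance of the full-test score inside the two subgroups. This is a simplified illustration of variance-reduction splitting, not the IDAMS SEARCH algorithm itself, and the six pupils, three questions and scores are invented.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_question(answers, totals):
    """Return (item index, explained variance share) of the best right/wrong split.

    answers: one tuple of 0/1 item results per pupil
    totals:  the pupils' scores on the full test
    """
    best_item, best_within = None, float("inf")
    for item in range(len(answers[0])):
        right = [t for a, t in zip(answers, totals) if a[item] == 1]
        wrong = [t for a, t in zip(answers, totals) if a[item] == 0]
        if not right or not wrong:
            continue                        # this question does not split the group
        within = (len(right) * variance(right)
                  + len(wrong) * variance(wrong)) / len(totals)
        if within < best_within:
            best_item, best_within = item, within
    explained = 1.0 - best_within / variance(totals)
    return best_item, explained


# Hypothetical data: results on three questions and full-test scores of six pupils.
answers = [(1, 1, 0), (1, 0, 1), (1, 1, 1), (0, 1, 0), (0, 0, 1), (0, 0, 0)]
totals = [50, 55, 60, 10, 15, 5]
item, explained = best_question(answers, totals)
print(item, explained)
```

Repeating this step inside each subgroup yields a tree in which every pupil answers only the few questions on his or her own path, which is the adaptive test described above.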