Guide to Reliable Data Collection in Statistical Analysis | Importance and Methods

We saw in the previous lectures that the quality of research results depends, amongst other things, on the quality of the statistical analysis we perform to obtain them. Statistical analyses, however, are performed on data; without collected data, no statistical analysis is possible. Data collection is therefore a central aspect of research. The data we collect are the basis for all the inferences we will eventually make. As such, these data must be:
• Reliable
• Valid

We should remember that: Reliability is the property of independent reproducibility of the data. Validity is the quality of a measure being an adequate and acceptable representative of what it is supposed to represent. We cannot have validity without reliability, but a measure can be reliable without being valid. For example, counting the new-line characters in a program as a measure of its quality is reliable but certainly not valid.

Therefore we must ensure that the data we collect are (amongst other characteristics) both reliable and valid. To do so, we need to pay special attention to all the steps leading up to data collection, and particularly to:
• The formulation of the operational definition, and
• The design of the procedure so as to remove threats to validity.
In other words, threats to reliability and validity are best handled procedurally, and BEFORE data collection begins.

Ensuring Reliability: We mentioned three different types of reliability considerations:
• Test-retest reliability
• Inter-rater reliability
• Internal-consistency reliability
We must now ensure that such reliability is evidenced in our measurements by designing our procedures in a way that provides this confidence.

The best course of action is to provide as precise an operational definition as possible for each measure, and a precise operational procedure for all the measurements. Doing so significantly increases the probability of scoring higher on test-retest and inter-rater reliability evaluations. Example: The ISO/IEC 15504 (SPICE) standard for software process improvement and capability determination has a very precisely devised and followed procedure for data collection. This is solely to increase the inter-rater and test-retest reliability of the measure.

Another way of increasing test-retest and inter-rater reliability is to reduce the scale and the level of measurement. Example: Given the same operational definition in all three cases, would a number of raters agree better when putting a program into one of two categories, clean versus buggy (nominal), when rating it on an ordinal scale of “bugginess” from 1 to 10, or when predicting its exact defect density?
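The question above can be explored with a small sketch. The ratings below are invented purely for illustration, and the helper names `percent_agreement` and `to_nominal`, as well as the cut-off of 5, are assumptions rather than part of any standard procedure. The sketch compares how often two raters agree exactly on a 1-10 ordinal scale with how often they agree once the same scores are collapsed to clean versus buggy.

```python
# Hypothetical "bugginess" scores (1-10) given by two raters to the same ten programs.
rater_a = [2, 3, 7, 8, 5, 1, 9, 4, 6, 10]
rater_b = [3, 3, 6, 8, 6, 2, 9, 5, 7, 9]


def percent_agreement(xs, ys):
    """Fraction of items on which the two raters gave exactly the same rating."""
    if len(xs) != len(ys):
        raise ValueError("both raters must rate the same items")
    matches = sum(1 for x, y in zip(xs, ys) if x == y)
    return matches / len(xs)


def to_nominal(score, threshold=5):
    """Collapse a 1-10 score to 'clean' or 'buggy'; the threshold is an assumed cut-off."""
    return "buggy" if score > threshold else "clean"


# Agreement on the fine-grained ordinal scale.
ordinal_agreement = percent_agreement(rater_a, rater_b)

# Agreement after reducing the scale to two nominal categories.
nominal_agreement = percent_agreement(
    [to_nominal(s) for s in rater_a],
    [to_nominal(s) for s in rater_b],
)

print(f"ordinal (1-10) agreement: {ordinal_agreement:.1f}")
print(f"nominal (clean/buggy) agreement: {nominal_agreement:.1f}")
```

With these made-up ratings the raters agree exactly on 3 of 10 programs on the ordinal scale and on 9 of 10 on the nominal scale. Raw agreement ignores the agreement expected by chance, so a chance-corrected statistic such as Cohen's kappa would be the more careful choice in a real study.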

To increase internal consistency, several compatible operational definitions are needed. However, internal consistency cannot be assured unless a series of pilot measures is made. Example: Suppose that counting the lines of code in a program by hand gives 120, an automated line counter gives 127, and the compiler’s line numbering mechanism gives 112. These measures are not consistent. To make them consistent, all three approaches must be made to conform in the way they measure lines of code. This means having compatible operational definitions.
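To see how such disagreement can arise, here is a minimal sketch in which three counters apply three different operational definitions of "a line of code" to the same text. The sample program and the function names are hypothetical, and the three definitions are only assumed examples of what a hand count, a line counter and a compiler might each be doing.

```python
# A small hypothetical program used as the object being measured.
SAMPLE = """\
# compute a total
total = 0

for n in [1, 2, 3]:
    total += n  # accumulate

print(total)
"""


def count_physical_lines(source):
    """Definition 1: every physical line counts, including blanks and comments."""
    return len(source.splitlines())


def count_nonblank_lines(source):
    """Definition 2: only lines that contain something other than whitespace."""
    return sum(1 for line in source.splitlines() if line.strip())


def count_code_lines(source):
    """Definition 3: non-blank lines that are not whole-line '#' comments."""
    return sum(
        1
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )


physical = count_physical_lines(SAMPLE)
nonblank = count_nonblank_lines(SAMPLE)
code_only = count_code_lines(SAMPLE)

print(physical, nonblank, code_only)  # three different answers for one program
```

The three counts differ (7, 5 and 4 for this sample) even though each counter is perfectly repeatable on its own. Agreeing on one definition before measuring, and checking it in a pilot, is what brings them into line.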

Ensuring Validity: Validity, we said, comes in a number of “flavors”. These are:
• Statistical Validity
• Construct Validity
• External Validity, and
• Internal Validity

Statistical Validity: We say that results are statistically valid when we can demonstrate that they did not arise by chance. One threat to statistical validity is unreliable data. We have already discussed how to increase reliability. Another threat is the researcher’s violation of the assumptions that underlie statistical tests, for example when the researcher uses a test appropriate for independent groups on data that are internally correlated. Use of appropriate statistical procedures on reliable data is the best way of improving statistical validity.
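The effect of using an independent-groups test on correlated data can be sketched as follows. The defect counts are invented, the scenario (the same six modules measured before and after a change) is an assumption chosen to produce correlated pairs, and the function names are hypothetical. The sketch computes only the t statistics, using Welch's form for the independent case and the standard paired form for the other.

```python
import math
import statistics

# Hypothetical defect counts for the SAME six modules, before and after a change.
# Because each pair comes from one module, the two lists are correlated.
before = [12, 15, 9, 20, 14, 11]
after = [10, 14, 8, 17, 12, 10]


def independent_t(xs, ys):
    """Welch's t statistic, which assumes the two samples are independent."""
    var_x, var_y = statistics.variance(xs), statistics.variance(ys)
    std_error = math.sqrt(var_x / len(xs) + var_y / len(ys))
    return (statistics.mean(xs) - statistics.mean(ys)) / std_error


def paired_t(xs, ys):
    """Paired t statistic, computed on the per-subject differences."""
    diffs = [x - y for x, y in zip(xs, ys)]
    std_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / std_error


t_independent = independent_t(before, after)
t_paired = paired_t(before, after)

print(f"independent-groups t: {t_independent:.2f}")
print(f"paired t:             {t_paired:.2f}")
```

On these numbers the independent-groups statistic is below 1 while the paired statistic is 5, so the two tests would lead to opposite conclusions about the same data. Which one is appropriate is decided by the design of the study, not by which gives the preferred answer.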

Construct Validity: Construct validity refers to how well the study’s results (or data) support the underlying principles relevant to the work. Construct validity is in question when the evidence can be explained in more than one way, that is, according to more than one hypothesis or theory. Example: A research project “showed” that female programmers score consistently lower in their annual appraisals compared to their male counterparts; thus females must be poorer programmers. This is a prime example of dubious construct validity.

Because:
• It may be that the hypothesis is true and females are indeed poorer programmers than males, or that
• Females are discriminated against and are not rated fairly in a largely male-dominated and sexist work environment, or that
• In the current society, those females with talent for programming would be first attracted to other professions, or that
• Females do not care as much about performance evaluations as males do and therefore do not argue for a higher evaluation score during annual evaluation.

For the researcher to have his (could not be hers, could it?) hypothesis accepted (hypothesis 1), he has to design, perform and publish research that refutes each and every one of the remaining hypotheses, whether listed here or emerging later.

External Validity: External validity refers to the degree to which we are able to generalize the results of a study to other subjects, conditions, environments, times and places. To make a generalization from the sample to any population, the sample must be an adequate and acceptable sample of THAT population. Example: In a research project, the defect detection rates of particular testing schemes were calculated and contrasted. It turned out that Scheme A was better than Scheme B at discovering defects of functionality and usability.

To do so, the researchers used 30 programs written by master's students in the C programming language. Each program was between 20 and 1000 lines long. How well would the result that Scheme A is better than Scheme B for discovering functionality and usability defects transfer to programs written in the object-oriented paradigm, in Eiffel, and by professionals? The answer is that WE DON’T KNOW. To answer the question we must first find out whether the sample is representative of the population. Despite the differences in language, programmer background and paradigm, the sample may STILL BE representative!

The problem of generalization from sample to population is often best controlled by random and adequate selection of subjects from the population. A researcher who wishes to extend the findings to a particular group must ensure that the group is at least represented in the sample.
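One common way to make sure every group of interest is represented is stratified random sampling: split the population into groups and draw randomly within each. The sketch below is only an illustration under stated assumptions. The population of programs is invented, `stratified_sample` and `stratum` are hypothetical names, and equal-sized draws per group are a simplification (a real design might sample in proportion to group size).

```python
import random
from collections import Counter, defaultdict


def stratified_sample(population, key, per_group, seed=0):
    """Randomly draw `per_group` subjects from each group defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the selection can be reproduced
    groups = defaultdict(list)
    for subject in population:
        groups[key(subject)].append(subject)

    sample = []
    for name in sorted(groups):
        members = groups[name]
        if len(members) < per_group:
            raise ValueError(f"group {name!r} has only {len(members)} members")
        sample.extend(rng.sample(members, per_group))
    return sample


# Hypothetical population: 20 programs for each language/author combination.
population = [
    {"id": f"{language}-{author}-{i}", "language": language, "author": author}
    for language in ("C", "Eiffel")
    for author in ("student", "professional")
    for i in range(20)
]


def stratum(program):
    """The grouping we want represented: language and author background."""
    return (program["language"], program["author"])


sample = stratified_sample(population, key=stratum, per_group=5)
counts = Counter(stratum(program) for program in sample)

for group, n in sorted(counts.items()):
    print(group, n)
```

Every language and author combination ends up with five programs in the sample, so a finding can at least be checked against each group before it is generalized to that group.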

Internal Validity: Internal validity concerns whether causality was at play. In other words: “Was the independent variable, and not some extraneous variable, responsible for the observed changes in the dependent variable?” There are many factors that can interfere with internal validity. These are collectively called confounding factors. We must minimize the effect of these factors in order to increase the internal validity of our work.

Threats to Internal Validity:
• Attrition: Loss of subjects during the study. Differential loss is particularly problematic, as those who drop out are usually the interesting ones.
• Diffusion: Information “leaks” from one subject or group to another and thus modifies behavior.
• Experimenter effects: Inadvertent or intentional actions of the experimenter that might compromise the study.
• History: Changes in the dependent variable that are due to historical or time-based events but are not related to the study.
• Instrumentation: Any change in the instruments or in their calibration.
• Learning: Changes in the dependent variable that occur due to learning done as a result of participation in the study.
• Maturation: Changes in the dependent variable that occur during the course of the study due to the normal passage of time and the maturation/development of the subject.
• Placebo effect: Subjects may compromise the results by behaving in a certain way through knowledge of the result being sought, e.g. when subjects feign drunkenness even when given unlaced tonic.
• Regression to the mean: The tendency for subjects that had extreme scores in earlier phases to be less extreme in follow-up scoring.
• Sequencing effect: The impact of the experience a subject had in one situation on the next situation.
• Testing: The impact of the subject having been tested before.
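Regression to the mean, one of the threats listed above, is easy to underestimate, so a small simulation may help. It rests on an assumed model that is not taken from the lecture: each observed score is a stable "true" score plus independent measurement noise, and there is no treatment at all between the two measurements. All numbers and variable names are hypothetical.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Assumed model: observed score = stable true score + independent noise.
true_scores = [rng.gauss(50, 10) for _ in range(10_000)]
first = [t + rng.gauss(0, 10) for t in true_scores]
second = [t + rng.gauss(0, 10) for t in true_scores]

# Select the subjects whose FIRST score fell in the top 10%.
cutoff = sorted(first)[int(0.9 * len(first))]
extreme = [i for i, score in enumerate(first) if score >= cutoff]

overall_mean = statistics.mean(first)
mean_first = statistics.mean([first[i] for i in extreme])
mean_second = statistics.mean([second[i] for i in extreme])

print(f"overall mean of first scores:    {overall_mean:.1f}")
print(f"extreme group, first measure:    {mean_first:.1f}")
print(f"extreme group, second measure:   {mean_second:.1f}")
```

The extreme group's second-measurement mean falls back towards the overall mean although nothing was done to the subjects. A study that selects subjects for their extreme scores and then measures them again can mistake this drift for an effect of the independent variable.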

Some controls for threats to validity include:
• Use of calibrated and properly prepared equipment
• Replication
• Single and double blind procedures
• Automation
• Multiple observers
• Use of deception (within the bounds of ethics)
• Random subject selection
• Control of subject-to-subject communication

Getting ready for data collection:
• Have a clear, literature-supported initial idea.
• Have a clear and identifiable statement of the problem.
• Ensure that all variables are identified and operationally defined.
• Develop a clear research hypothesis.
• Select your statistical analysis procedures.
• Clearly identify the theoretical bases of your intended study.
• Identify whether the hypothesis and procedures address the issue.
• Ensure that the independent variable manipulation has been carefully planned to ensure reliability and validity.
• Pre-test (pilot) the manipulations. Make any changes necessary.
• Ensure that all dependent variables are adequately defined through operational definitions.
• Pre-test and pilot the dependent variables.
• Put all controls of reliability and validity in place.
• Ensure the sample is representative.
• Ensure the sample is sufficiently large.
• Ensure correct assignment of subjects in accordance with the conditions in the research design.
• Ensure subject availability and produce a data collection schedule.
• Ensure all ethical issues have been addressed and that all ethics-preserving procedures are in place.
• Ensure the logistics of the study are in place, e.g. space, equipment, personnel, instruction lists, labels, etc.
• Go for it.
