Evaluation Methodology Fatemeh Vahedian CSC-426, Week 6
Outline • Evaluation Methodology • Work Load • Experimental design • Rigorous analysis • Measurement • Levels of measurement • Reliability • True score theory of measurement • Measurement Error • Theory of Reliability • Reliability Types • Construct validity • Measurement validity types • Idea of Construct Validity • Convergent Validity • Discriminant Validity • Threats to Construct Validity • Approaches to Assess Validity
Evaluation description • Professional evaluation is defined as the systematic determination of quality or value of something (Scriven 1991) • Evaluation methodology underpins all innovation in experimental computer science • Evaluation is a systematic determination of a subject's merit using criteria governed by a set of standards • Evaluation is the structured interpretation and giving of meaning to predicted or actual impacts of proposals or results
Evaluation Methodology • Evaluation methodology requirements: • relevant workloads • appropriate experimental design • rigorous analysis
Workload • Relevant and diverse • No workload is definitive: but does it meet the test objective? • Each candidate should be evaluated quantitatively and qualitatively (e.g. specific platform, domain, application) • Widely used (e.g. open source applications) • Nontrivial (i.e., not a toy system) • Suitable for research • Tractable: easy to use and manage • Repeatable • Standardized • Workload selection: it should reflect a range of behaviors, not just the behavior we are looking for
Experimental design • Meaningful baseline • Comparing against the state of the art • Widely used workloads • Not always practical (e.g. implementation not publicly available) • Comparisons that control key parameters • Understanding what to control and how to control it in an experimental system is clearly important • E.g. comparing two garbage collection schemes while controlling the heap size • Bound and free parameters • Degrees of freedom
Experimental Design • Control changing environment and unintended variance • E.g. host platform, OS, arrival rate • Scheduling schemes, workloads • Network latency & traffic • Differences in environment • Controlling nondeterminism • Understand key variance points • Because of nondeterminism, results usually do not reach the same steady state they would on a deterministic workload • Take multiple measurements and generate sufficient data points • Statistically analyze results to account for remaining nondeterminism
Rigorous analysis • Researchers use data analysis to identify and articulate the significance of experimental results • Challenging in complex systems because of the sheer volume of results • Aggregating data across repeated experiments is a standard technique for increasing confidence in a noisy environment • Since noise cannot be eliminated altogether, multiple trials are inevitably necessary • Reducing nondeterminism • Researchers have only finite resources; reducing sources of nondeterminism with sound experimental design improves tractability • Statistical confidence intervals and significance • Show best and worst cases
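To make the statistical part concrete, here is a minimal sketch in Python (with hypothetical timing data and made-up variable names) of aggregating repeated runs, reporting a 95% confidence interval for each system, and using Welch's t-test to check whether the observed difference can be explained by the remaining nondeterminism.

```python
# Minimal sketch (hypothetical data): aggregating repeated runs and reporting
# a 95% confidence interval plus a Welch's t-test, one way to analyze noisy
# results from nondeterministic experiments.
import numpy as np
from scipy import stats

baseline_runs = np.array([102.1, 98.7, 101.4, 99.9, 100.5, 103.2])   # e.g. seconds
candidate_runs = np.array([95.3, 97.1, 94.8, 96.5, 98.0, 95.9])

def confidence_interval(samples, level=0.95):
    """Mean and two-sided confidence interval using the t-distribution."""
    mean = samples.mean()
    sem = stats.sem(samples)                       # standard error of the mean
    half_width = sem * stats.t.ppf((1 + level) / 2, df=len(samples) - 1)
    return mean, mean - half_width, mean + half_width

for name, runs in [("baseline", baseline_runs), ("candidate", candidate_runs)]:
    mean, lo, hi = confidence_interval(runs)
    print(f"{name}: mean={mean:.2f}, 95% CI=[{lo:.2f}, {hi:.2f}]")

# Welch's t-test: is the difference larger than the remaining noise explains?
t_stat, p_value = stats.ttest_ind(baseline_runs, candidate_runs, equal_var=False)
print(f"Welch's t-test: t={t_stat:.2f}, p={p_value:.4f}")
```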
Measurement and Levels of Measurement • Measurement is the process of observing and recording the observations that are collected as part of a research effort • Level of Measurement: • The level of measurement refers to the relationship among the values that are assigned to the attributes for a variable
Measurement and Levels of Measurement • There are typically four levels of measurement that are defined • Nominal • Ordinal • Interval • Ratio • Why is the level of measurement important? • Knowing the level of measurement helps you decide how to interpret the data from that variable • Helps you decide what statistical analysis is appropriate on the values that were assigned
Reliability • Reliability is the consistency or repeatability of your measures • True score theory of measurement: the foundation of reliability theory • Different types of measurement error • Theory of reliability • Different types of reliability • The relationship between reliability and validity in measurement
Reliability / True score theory of measurement • Consists of two components: • the true ability (or true level) of the respondent on that measure • random error • Why is it important? • It reminds us that most measurement has an error component • True score theory is the foundation of reliability theory • A measure that has no random error (i.e., is all true score) is perfectly reliable; a measure that has no true score (i.e., is all random error) has zero reliability • True score theory can be used in computer simulations as the basis for generating "observed" scores with certain known properties • Observed score = True ability + Random error
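The simulation idea mentioned above can be sketched as follows; the variances are illustrative assumptions, chosen only so that the known reliability var(T)/var(X) can be checked against the generated "observed" scores.

```python
# Sketch of the simulation idea from the slide: generate observed scores as
# true score + random error, then check that the known reliability
# var(T)/var(X) is recovered. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_scores = rng.normal(loc=50, scale=10, size=n)     # var(T) = 100
random_error = rng.normal(loc=0, scale=5, size=n)      # var(e) = 25
observed = true_scores + random_error                  # X = T + e

expected_reliability = 100 / (100 + 25)                # var(T) / var(X) = 0.8
empirical_reliability = true_scores.var() / observed.var()
print(expected_reliability, round(empirical_reliability, 3))
```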
Measurement Error • Random error • Caused by any factors that randomly affect measurement of the variable • Sums to zero and does not affect the average • Systematic error • Caused by any factors that systematically affect measurement • Consistently either positive or negative • Reducing measurement error • Pilot study • Training • Double-check the data thoroughly • Use statistical procedures • Use multiple measures
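A small illustrative sketch (invented numbers) of the difference between the two error types: random error averages out over many observations, while systematic error shifts every observation, and therefore the average, in the same direction.

```python
# Illustrative sketch: random error cancels out across many observations,
# while systematic error biases every observation the same way.
import numpy as np

rng = np.random.default_rng(1)
true_value = 100.0
random_error = rng.normal(0, 3, size=1_000)   # mean ~0, cancels out on average
systematic_error = 2.5                        # constant bias, e.g. a miscalibrated probe

noisy_only = true_value + random_error
biased = true_value + random_error + systematic_error

print(noisy_only.mean())   # close to 100: random error does not move the average
print(biased.mean())       # close to 102.5: systematic error does
```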
Theory of Reliability • In research, reliability means repeatability or consistency • A measure is considered reliable if it would give us the same result over and over again • Reliability is a ratio or fraction: • the variance of the true level over the variance of the entire measure: var(T) / var(X) • estimated by cov(X1, X2) / (sd(X1) * sd(X2)), the correlation between two parallel measurements • We cannot calculate reliability exactly because we cannot measure the true score, but we can estimate it (between 0 and 1)
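Since the true score is unobservable, the ratio var(T)/var(X) is estimated in practice from two parallel measurements of the same construct. The sketch below simulates such data (all parameters are assumptions) and computes cov(X1, X2) / (sd(X1) * sd(X2)), comparing it to the theoretical reliability.

```python
# Sketch of the reliability estimate: correlate two parallel measurements
# X1 and X2 that share the same true score but have independent error.
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
true_scores = rng.normal(0, 1, size=n)
x1 = true_scores + rng.normal(0, 0.5, size=n)   # two measurements sharing the
x2 = true_scores + rng.normal(0, 0.5, size=n)   # same true score, independent error

# cov(X1, X2) / (sd(X1) * sd(X2)) -- the Pearson correlation
estimate = np.cov(x1, x2)[0, 1] / (x1.std(ddof=1) * x2.std(ddof=1))
theoretical = 1.0 / (1.0 + 0.5 ** 2)            # var(T) / var(X) = 0.8
print(round(estimate, 3), theoretical)
```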
Reliability Types • Inter-rater or inter-observer reliability • Raters assign categories • Calculate the correlation • Test-retest reliability • The shorter the time gap, the higher the correlation • Parallel-forms reliability • Create a large set of questions that address the same construct, divide it into two sets, and administer both to the same sample • Internal consistency reliability • Average inter-item correlation • Average item-total correlation • Split-half reliability • Cronbach's alpha (α)
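For the internal consistency estimates, here is a hypothetical sketch that computes Cronbach's alpha and split-half reliability (with the Spearman-Brown correction) on a simulated item matrix; the item model and sample sizes are assumptions for illustration only.

```python
# Hypothetical sketch: two internal-consistency estimates computed on a small
# simulated item matrix (rows = respondents, columns = items of one construct).
import numpy as np

rng = np.random.default_rng(3)
n_people, n_items = 500, 6
ability = rng.normal(0, 1, size=(n_people, 1))
items = ability + rng.normal(0, 1, size=(n_people, n_items))   # item scores

def cronbach_alpha(scores):
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)

def split_half(scores):
    half1 = scores[:, ::2].sum(axis=1)          # items 1, 3, 5
    half2 = scores[:, 1::2].sum(axis=1)         # items 2, 4, 6
    r = np.corrcoef(half1, half2)[0, 1]
    return 2 * r / (1 + r)                      # Spearman-Brown correction

print(round(cronbach_alpha(items), 3), round(split_half(items), 3))
```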
Reliability & Validity • In the target metaphor, the center of the target is the concept that you are trying to measure
Construct validity / Definition • Construct validity has traditionally been defined as the experimental demonstration that a test is measuring the construct it claims to be measuring • What is a construct? A construct, or psychological construct as it is also called, is an attribute, proficiency, ability, or skill that happens in the human brain and is defined by established theories • A construct is a concept. A clearly specified research question should lead to a definition of study aim and objectives that set out the construct and how it will be measured. Increasing the number of different measures in a study will increase construct validity provided that the measures are measuring the same construct • (Figure labels: Idea, Program, Construct, Operationalization)
Measurement validity types • Translation validity • Face validity • Content validity • Criterion-related validity • Predictive validity • Concurrent validity • Convergent validity • Discriminant validity
Idea of Construct Validity • Construct validity is an assessment of how well you translated your ideas or theories into actual programs or measures • Why is construct validity important? • Truth in labeling
Convergent & Discriminant Validity • Convergent and discriminant validity are both considered subcategories or subtypes of construct validity • Convergent Validity • To establish convergent validity, we need to show that measures that should be related are in reality related
Discriminant Validity • To establish discriminant validity, you need to show that measures that should not be related are in reality not related
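Both ideas can be illustrated with simulated data: measures of the same construct should correlate highly (convergent validity), while measures of unrelated constructs should not (discriminant validity). Everything in the sketch below is invented for illustration.

```python
# Illustrative sketch (made-up data): convergent validity expects high
# correlations between measures of the same construct; discriminant validity
# expects low correlations with measures of an unrelated construct.
import numpy as np

rng = np.random.default_rng(4)
n = 1_000
construct_a = rng.normal(0, 1, size=n)
construct_b = rng.normal(0, 1, size=n)          # unrelated construct

a_measure_1 = construct_a + rng.normal(0, 0.4, size=n)
a_measure_2 = construct_a + rng.normal(0, 0.4, size=n)
b_measure_1 = construct_b + rng.normal(0, 0.4, size=n)

convergent = np.corrcoef(a_measure_1, a_measure_2)[0, 1]    # should be high
discriminant = np.corrcoef(a_measure_1, b_measure_1)[0, 1]  # should be near zero
print(round(convergent, 2), round(discriminant, 2))
```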
Threats to Construct Validity • Inadequate preoperational explication of constructs • Mono-operation bias • Mono-method bias • Interaction of different treatments • Interaction of testing and treatment • Restricted generalizability across constructs • Confounding constructs and levels of constructs • "Social" threats • Hypothesis guessing • Evaluation apprehension • Experimenter expectancy
Approaches to Assess Validity • Nomological network • a theoretical basis • Multitrait-multimethod matrix • to demonstrate both convergent and discriminant validity • Pattern matching
Nomological network • It includes a theoretical framework for what you are trying to measure, an empirical framework for how you are going to measure it, and specification of linkages among and between these two frameworks • It does not provide a practical and usable methodology for actually assessing construct validity
Pattern matching • It is an attempt to link two patterns: a theoretical pattern and an observed pattern • It requires that • you specify your theory of the constructs precisely • you structure the theoretical and observed patterns the same way so that you can directly correlate them • Common example • ANOVA table
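A minimal sketch of the pattern-matching idea, with an invented theoretical pattern and invented observed group means; the correlation between the two patterns is the evidence for (or against) construct validity.

```python
# Minimal sketch: correlate a theoretical pattern of expected outcomes per
# condition with the observed pattern of group means. Data are invented.
import numpy as np

# Theoretical pattern: predicted relative standing of four conditions
theoretical = np.array([1.0, 2.0, 3.0, 4.0])

# Observed pattern: mean outcome actually measured for each condition
observed = np.array([10.2, 12.9, 12.5, 16.1])

# A high correlation between the two patterns supports construct validity
match = np.corrcoef(theoretical, observed)[0, 1]
print(round(match, 2))
```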
Multitrait-multimethod matrix • MTMM is a matrix of correlations arranged to facilitate the assessment of construct validity • It is based on convergent and discriminant validity • It assumes that you have several concepts and several measurement methods, and you measure each concept by each method • To determine the strength of the construct validity: • Reliability coefficients should be the highest in the matrix • Coefficients in the validity diagonal should be significantly different from zero and high enough • The same pattern of trait interrelationships should be seen in all triangles
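The sketch below builds a toy MTMM layout, assuming three traits each measured by two methods (all names and noise levels are hypothetical), and reads off the monotrait-heteromethod correlations that sit on the validity diagonal.

```python
# Hypothetical sketch of an MTMM layout: three traits, each measured by two
# methods; the correlation matrix is inspected for high validity-diagonal
# entries (same trait, different method) versus lower heterotrait correlations.
import numpy as np

rng = np.random.default_rng(5)
n = 800
traits = rng.normal(0, 1, size=(n, 3))                    # three latent traits

measures = {}
for m in range(2):                                        # two methods
    for t in range(3):                                    # three traits
        measures[f"T{t+1}M{m+1}"] = traits[:, t] + rng.normal(0, 0.5, size=n)

names = list(measures)
data = np.column_stack([measures[name] for name in names])
mtmm = np.corrcoef(data, rowvar=False)                    # 6 x 6 correlation matrix

# Validity diagonal: same trait measured by method 1 and method 2
for t in range(3):
    i, j = names.index(f"T{t+1}M1"), names.index(f"T{t+1}M2")
    print(f"T{t+1}: monotrait-heteromethod r = {mtmm[i, j]:.2f}")
```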