Evaluation • Personal evaluation • Software validation • Software evaluation
Personal evaluation • What have I achieved? • Have I achieved what I set out to achieve? • Where have I fallen short? • Why? • What could I have done better? • Assumes an a priori statement of what you hope/expect/intend to achieve
Self evaluation in your dissertation • Dissertation plan: Introduction; Background; Success criteria; Design; Realisation; Evaluation/Testing; Conclusions & Further Work • Ch 3 lays out the success criteria by which the success of the project is to be judged • Ch 6 will review the work done in Ch 5 with respect to these criteria, including reflection on the overall validity of the approach • But this is not “software evaluation”
Program validation • Systematically check all functions in your program/application • Systematically check all sequences of inputs etc. • Does your program/application do what you think it is supposed to do? • This is important, but ... • This is not “software evaluation”
Software evaluation Note: we are using the term “software” in a broad sense: it could include a program, a web application, or any sort of implementation that does something • Evaluate the appropriateness of the software with respect to its intended use • A large range of aspects of the software can be evaluated
Evaluation evaluation • In your dissertation you are asked to evaluate what you have achieved • Your research could (should?) include an evaluation element • So you will need to evaluate your evaluation • Your evaluation might have negative results, but still be an informative experiment which you can evaluate positively • Your research could even be to compare evaluation schemes!
A case study • Last year a student of mine did a project which was a comparative evaluation of a number of speech synthesis devices • His dissertation discussed • Factors in setting up a comparative evaluation • A description of the actual evaluation • A discussion of the results • His personal evaluation then considered how well the experiment (i.e. the evaluation) had been conducted
Software evaluation • Functionality – does it do what it is supposed to do? • Reliability – does it do the same thing under the same conditions? • Usability – is it user-friendly? • Efficiency – cost, speed, etc. • Maintainability – can you modify it? Is it robust? • Portability – can it be transferred from one environment/platform to another?
Software evaluation • Evaluating commercial software is different from evaluating something you have constructed • Even if you have constructed it from commercially available components • Again, note the difference between validation and evaluation • Especially concerning “functionality” • Also, evaluation is not the same as a software review, as found e.g. in a magazine
Stakeholders • Developers • Researchers • Commercial developers • End-users • Actual end-users (is this a single type?) • Their managers (buyers) • Vendors • Investors
Evaluation types • Feasibility / Suitability • For any of the above stakeholders • Internal evaluation • For development • Iterative testing, to evaluate progress • Adequacy evaluation • Diagnostic evaluation (debugging) • Black box vs. glass box evaluation
Evaluation types • Declarative evaluation • How well does it perform? • Comparison with a “gold standard” ideal performance • Comparison with a baseline “wooden block” • Usability evaluation • How long does each step take? • Is it “natural”, intuitive? • Is it easy to learn to use? • Is it well documented?
Evaluation types • Operational evaluation • ROI (return on investment) • Compatibility with other software • Consistency of interfaces • Internal • With respect to “standards” (e.g. Microsoft) • Failsofts • Role of humans • Preparation, throughput, correction, output • Backup • Documentation • Support • Corporate situation of provider
Framework for evaluation • Definition of the relevant quality characteristics – what is it you want to evaluate? Be specific • Definition of attributes pertinent to this quality • Definition of a measure able to provide values for these attributes • Definition of a method whereby the measurement can be made
Framework for evaluation Important to be sure that • The quality to be evaluated is genuinely a quality that is claimed of the software • The attribute to be measured does reflect the quality in question • The measure does genuinely measure that attribute (and not some other one) • The method is sufficient to deliver a meaningful measure
Example: spell checker • Function: • (a) identify wrongly-spelled words • (b) suggest an appropriate correction • (among other features) • Quality: ability to do (a) • Attribute: success rate in performance of that task • Measure: “Precision”: percentage of wrongly-spelled words correctly identified in a document • Method: give it a text with some wrongly-spelled words and count how many it spots
Example: spell checker • Good evaluation, but not A* • Success means • Identifying misspelled words (true positives) • Ignoring correctly spelled words (true negatives) • So is the measure really appropriate? We are only counting true positives and false negatives: we are not giving credit for the true negatives, nor penalising false positives • The method is underspecified: • How much text? • What sort of text? • Should we take into account what we know about spell checking (a certain class of error is very hard to detect)? • Should we classify misspellings and measure different classes separately?
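A minimal sketch of what a fuller scoring method might look like (not part of the original lecture; the function, data and numbers below are invented for illustration). It counts all four outcomes, so true negatives are credited and false positives penalised, and it reports precision, recall and overall accuracy:

# Minimal illustrative sketch: scoring a spell checker against a hand-labelled text.
def score(gold_flags, checker_flags):
    """gold_flags / checker_flags: parallel lists of booleans, True = word flagged as misspelled."""
    pairs = list(zip(gold_flags, checker_flags))
    tp = sum(g and c for g, c in pairs)              # misspellings correctly flagged
    fp = sum((not g) and c for g, c in pairs)        # correct words wrongly flagged
    fn = sum(g and (not c) for g, c in pairs)        # misspellings missed
    tn = sum((not g) and (not c) for g, c in pairs)  # correct words correctly ignored
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs)                # credits true negatives as well
    return precision, recall, accuracy

# Toy data: 6 words, 2 genuinely misspelled
gold    = [True, True,  False, False, False, False]
checker = [True, False, True,  False, False, False]
print(score(gold, checker))   # (0.5, 0.5, 0.666...)

On the toy data, precision and recall are both 0.5: the checker found half of the real misspellings, and half of its alarms were false.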
Attributes • Different types of attribute imply different measures/methods • Example: dish-washers [feature comparison table omitted] (a = pre-wash rinse cycle; b = independent rinse cycle)
Methods and measures • Objective measures • Measuring, counting, timing • Doing a specific task • For usability issues, you need to evaluate with a number of subjects (not just do it yourself) • Comparison against a gold standard • Precision • Recall • Other measures that also take false positives and negatives into account
Methods and measures • Subjective measures • Interview after use • Feedback questionnaire • Rating scales (usually 5 or 7 points, + DK, N/A) • Open-ended questions? • Questions should relate to some specific point • Repeat (some) questions in a disguised way • Performance analysis • Video the session, analyse afterwards
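As a concrete illustration of handling rating-scale data (a hypothetical sketch, not from the lecture): responses marked DK or N/A are treated as missing data rather than as zeros before computing summary statistics.

from statistics import mean, stdev

# Hypothetical 5-point ratings for one questionnaire item
responses = [5, 4, "DK", 3, 4, 5, "NA", 2, 4]
ratings = [r for r in responses if isinstance(r, int)]   # drop DK / N/A answers

print(f"n = {len(ratings)}, mean = {mean(ratings):.2f}, sd = {stdev(ratings):.2f}")
# n = 7, mean = 3.86, sd = 1.07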
Methods and measures • Don’t try to measure too many different things with the same instrument • Though this can be possible to some extent • But extraneous factors need to be controlled carefully • Problem of statistical significance: • Do you have enough subjects to know that the differences (and similarities) are not just random fluctuations?
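One common way to address the significance question is a simple statistical test, sketched below (an illustration, not part of the lecture; the timing data are invented and SciPy is assumed to be available). With very small groups a non-parametric test, or simply reporting the result cautiously, may be more appropriate.

from scipy import stats

# Hypothetical task-completion times (seconds) for the same task under two conditions
times_a = [41.2, 38.5, 45.0, 39.8, 42.1, 44.3]
times_b = [47.9, 44.1, 50.2, 46.5, 43.8, 49.0]

# Two-sample t-test: is the difference in means larger than random
# fluctuation between subjects would explain?
t_stat, p_value = stats.ttest_ind(times_a, times_b)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")   # a small p (e.g. < 0.05) suggests a real difference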
Example • Simulated doctor-patient interviews with limited-English patients, using a computer-based communication device with symbols and digitised speech • two devices (laptop+mousepad, tablet+stylus) • doctors and nurses • literate and illiterate patients
Example • General question: could they get to the end of the consultation? (How did we “measure” this?) • Objective measures • How long did it take? • How many questions did they ask? • How many answers were (apparently) correctly understood? • Subjective measures • Feedback questionnaire with satisfaction ratings • Open-ended questions about specific issues
Subjects • Many types of evaluation require volunteers • How many do you need? • Where will you get them from? • Are they suitable? • Exclusion factors: eg prior familiarity with your topic • Need to control for irrelevant differences in their profile • How will you guarantee their cooperation? • Ethical issues • Officially, you need ethics clearance for any experiments involving living beings! • In any case, important that volunteers know what they are letting themselves in for • Also important that you don’t waste people’s time, eg evaluating a useless task (for example as a baseline)
Summary • What are you trying to evaluate? • Be specific, not general (e.g. not just “What do you think of this interface?”) • What is the best way to measure what you are interested in? • How feasible is it to do what you want? • [After Easter]: How to write it all up!
Next session • No class next week • First week after Easter (19 Apr) • No class on Thursday • Instead, practical sessions on Library Resources with Barry White • choose one of three sessions • each at 2pm-4pm • Wed 18, Thur 19 or Fri 20 April • in the Joule Library • Do we need a sign-up sheet?