Evaluation

  1. Evaluation

  2. Evaluation • Personal evaluation • Software validation • Software evaluation

  3. Personal evaluation • What have I achieved? • Have I achieved what I set out to achieve? • Where have I fallen short? • Why? • What could I have done better? • Assumes an a priori statement of what you hope/expect/intend to achieve

  4. Dissertation plan • Introduction, Background, Success criteria, Design, Realisation, Evaluation/Testing, Conclusions & Further Work • Ch 3 lays out the success criteria by which the success of the project is to be judged • Ch 6 will review the work done in Ch 5 with respect to these criteria, including reflection on the overall validity of the approach • But this is not “software evaluation”: it is self evaluation in your dissertation

  5. Program validation • Systematically check all functions in your program/application • Systematically check all sequences of inputs etc. • Does your program/application do what you think it is supposed to do? • This is important, but ... • This is not “software evaluation”

  6. Software evaluation • Note: we are using the term “software” in a very broad sense: it could be a program, a web application, or any sort of implementation that does something • Evaluate the appropriateness of the software with respect to its intended use • A large range of aspects of software can be evaluated

  7. Evaluation evaluation • In your dissertation you are asked to evaluate what you have achieved • Your research could (should?) include an evaluation element • So you will need to evaluate your evaluation • Your evaluation might have negative results, but still be an informative experiment which you can evaluate positively • Your research could even be to compare evaluation schemes!

  8. A case study • Last year a student of mine did a project which was a comparative evaluation of a number of speech synthesis devices • His dissertation discussed • Factors in setting up a comparative evaluation • A description of the actual evaluation • A discussion of the results • His personal evaluation then considered how well the experiment (i.e. the evaluation) had been conducted

  9. Software evaluation • Functionality – does it do what it is supposed to do? • Reliability – does it do the same thing under the same conditions? • Usability – is it user-friendly? • Efficiency – cost, speed, etc. • Maintainability – can you modify it? Is it robust? • Portability – can it be transferred from one environment/platform to another?

  10. Software evaluation • Evaluating commercial software is different from evaluating something you have constructed • Even if you have constructed it from commercially available components • Again, note the difference between validation and evaluation • Especially concerning “functionality” • Also, evaluation is not the same as a software review, as found eg in a magazine

  11. Stakeholders • Developers • Researchers • Commercial developers • End-users • Actual end-users (is this a single type?) • Their managers (buyers) • Vendors • Investors

  12. Evaluation types • Feasibility / Suitability • For any of the above stakeholders • Internal evaluation • For development • Iterative testing, to evaluate progress • Adequacy evaluation • Diagnostic evaluation (debugging) • Black box vs. glass box evaluation

  13. Evaluation types • Declarative evaluation • How well does it perform? • Comparison with a “gold standard” ideal performance • Comparison with a baseline “wooden block” • Usability evaluation • How long does each step take? • Is it “natural”, intuitive? • Is it easy to learn to use? • Is it well documented?

  14. Evaluation types • Operational evaluation • ROI • Compatibility with other software • Consistency of interfaces • Internal • With respect to “standards” (eg Microsoft) • Failsofts • Role of humans • Preparation, throughput, correction, output • Backup • Documentation • Support • Corporate situation of provider

  15. Framework for evaluation • Definition of the relevant quality characteristics – what is it you want to evaluate? Be specific • Definition of attributes pertinent to this quality • Definition of a measure able to provide values for these attributes • Definition of a method whereby the measurement can be made
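
  To make the four definitions concrete, here is a minimal sketch in Python of how they might be recorded for a piece of software under evaluation. The class and function names are invented for illustration; the method is assumed to be a callable that, when run, produces a value of the measure.

      from dataclasses import dataclass
      from typing import Callable

      @dataclass
      class EvaluationCriterion:
          """One quality to evaluate, following the quality/attribute/measure/method framework."""
          quality: str                 # the quality characteristic you want to evaluate (be specific)
          attribute: str               # an attribute pertinent to this quality
          measure: str                 # how values of the attribute are expressed (eg a percentage)
          method: Callable[[], float]  # the procedure that actually produces the measure

      def run_evaluation(criteria: list[EvaluationCriterion]) -> dict[str, float]:
          """Apply each criterion's method and collect the resulting measures."""
          return {c.quality: c.method() for c in criteria}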

  16. Framework for evaluation Important to be sure that • The quality to be evaluated is genuinely a quality that is claimed of the software • The attribute to be measured does reflect the quality in question • The measure does genuinely measure that attribute (and not some other one) • The method is sufficient to deliver a meaningful measure

  17. Example: spell checker • Function: • (a) identify wrongly-spelled words • (b) suggest an appropriate correction • (among other features) • Quality: ability to do (a) • Attribute: success rate in performance of that task • Measure: the percentage of wrongly-spelled words correctly identified in a document (strictly this is a recall measure rather than precision) • Method: give it a text with some wrongly-spelled words and count how many it spots
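
  As an illustration of this measure and method, here is a hedged sketch in Python. It assumes you already have the set of words your spell checker flagged in a test text and the set of misspellings you deliberately planted; both sets below are invented examples.

      def detection_rate(flagged: set[str], planted_misspellings: set[str]) -> float:
          """Percentage of the planted wrongly-spelled words that the checker spotted."""
          if not planted_misspellings:
              return 0.0
          spotted = flagged & planted_misspellings
          return 100.0 * len(spotted) / len(planted_misspellings)

      flagged_by_checker = {"recieve", "seperate", "necessary"}   # words the checker flagged
      planted_errors = {"recieve", "seperate", "definately"}      # misspellings we planted
      print(detection_rate(flagged_by_checker, planted_errors))   # spots 2 of 3, ie about 66.7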

  18. Example: spell checker • Good evaluation, but not A* • Success means • Identifying misspelled words (true positives) • Ignoring correctly spelled words (true negatives) • So is the measure really appropriate? We are only counting true positives and false negatives: we are not giving credit for the true negatives, nor penalising false positives • The method is underspecified: • How much text? • What sort of text? • Should we take into account what we know about spell checking (a certain class of error is very hard to detect)? • Should we classify misspellings and measure different classes separately?
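
  The richer measure this critique asks for can be sketched by counting all four outcomes and deriving precision (which penalises false positives), recall (which credits spotted misspellings) and their combination F1. The function names here are illustrative, not part of any particular tool.

      def confusion_counts(flagged: set[str], misspelled: set[str], all_words: set[str]):
          """Count the four outcomes for a spell checker run over a test text."""
          tp = len(flagged & misspelled)              # misspellings correctly flagged
          fp = len(flagged - misspelled)              # correct words wrongly flagged
          fn = len(misspelled - flagged)              # misspellings missed
          tn = len(all_words - flagged - misspelled)  # correct words left alone
          return tp, fp, fn, tn

      def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
          precision = tp / (tp + fp) if tp + fp else 0.0
          recall = tp / (tp + fn) if tp + fn else 0.0
          f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
          return precision, recall, f1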

  19. Attributes • Different types of attribute imply different measures/methods • Example: a comparison table of dish-washer features (table footnotes: a = pre-wash rinse cycle; b = independent rinse cycle)

  20. Methods and measures • Objective measures • Measuring, counting, timing • Doing a specific task • In case of usability issues, need to evaluate with a number of subjects (not just do it yourself) • Comparison against a gold standard • Precision • Recall • Other measures also considering false positives and negatives

  21. Methods and measures • Subjective measures • Interview after use • Feedback questionnaire • Rating scales (usually 5 or 7 points, + DK, N/A) • Open-ended questions? • Questions should relate to some specific point • Repeat (some) questions in a disguised way • Performance analysis • Video the session, analyse afterwards
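
  A minimal sketch of how rating-scale responses might be summarised, assuming each answer is an integer on a 1–5 scale or None for DK / N/A; the question labels and numbers are invented.

      from statistics import mean

      def summarise_ratings(responses: dict[str, list[int | None]]) -> dict[str, float]:
          """Mean rating per question, ignoring DK / N/A answers (recorded as None)."""
          return {
              question: mean(r for r in ratings if r is not None)
              for question, ratings in responses.items()
          }

      feedback = {
          "easy_to_learn": [4, 5, 3, None, 4],
          "well_documented": [2, 3, None, 2, 3],
      }
      print(summarise_ratings(feedback))  # eg {'easy_to_learn': 4, 'well_documented': 2.5}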

  22. Methods and measures • Don’t try to measure too many different things with the same instrument • Though this is possible to some extent • But extraneous factors need to be controlled carefully • Problem of statistical significance: • Do you have enough subjects to know that the differences (and similarities) are not just random fluctuations?
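
  To get a feel for whether a difference between two small groups of subjects could be a random fluctuation, a simple permutation test can be run with nothing but the standard library. This is a sketch only, and the task-completion times below are invented.

      import random
      from statistics import mean

      def permutation_p_value(group_a: list[float], group_b: list[float],
                              n_shuffles: int = 10000, seed: int = 0) -> float:
          """Estimate how often randomly relabelling the subjects produces a
          difference in group means at least as large as the observed one."""
          rng = random.Random(seed)
          observed = abs(mean(group_a) - mean(group_b))
          pooled = list(group_a) + list(group_b)
          hits = 0
          for _ in range(n_shuffles):
              rng.shuffle(pooled)
              a, b = pooled[:len(group_a)], pooled[len(group_a):]
              if abs(mean(a) - mean(b)) >= observed:
                  hits += 1
          return hits / n_shuffles  # small values suggest the difference is not just noise

      # Invented task-completion times (seconds) for two interface variants:
      print(permutation_p_value([41, 38, 45, 50, 39], [52, 48, 55, 47, 60]))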

  23. Example • Simulated doctor-patient interviews with patients with limited English, using a computer-based communication device with symbols and digitised speech • two devices (laptop+mousepad, tablet+stylus) • doctors and nurses • literate and illiterate patients

  24. Example • General question: could they get to the end of the consultation? (How did we “measure” this?) • Objective measures • How long did it take? • How many questions did they ask? • How many answers were (apparently) correctly understood? • Subjective measures • Feedback questionnaire with satisfaction ratings • Open-ended questions about specific issues

  25. Subjects • Many types of evaluation require volunteers • How many do you need? • Where will you get them from? • Are they suitable? • Exclusion factors: eg prior familiarity with your topic • Need to control for irrelevant differences in their profile • How will you guarantee their cooperation? • Ethical issues • Officially, you need ethics clearance for any experiments involving living beings! • In any case, important that volunteers know what they are letting themselves in for • Also important that you don’t waste people’s time, eg evaluating a useless task (for example as a baseline)

  26. Summary • What are you trying to evaluate? • Be specific, not general (eg not just “What do you think of this interface?”) • What is the best way to measure what you are interested in? • How feasible is it to do what you want? • [After Easter]: How to write it all up!

  27. Next session • No class next week • First week after Easter (19 Apr) • No class on Thursday • Instead, practical sessions on Library Resources with Barry White • choose one of three sessions • each at 2pm-4pm • Wed 18, Thur 19 or Fri 20 April • in the Joule Library • Do we need a sign-up sheet?
