CSA4080: Adaptive Hypertext Systems II. Topic 8: Evaluation Methods. Dr. Christopher Staff, Department of Computer Science & AI, University of Malta. 1 of 49 cstaff@cs.um.edu.mt
Aims and Objectives • Background to evaluation methods in user-adaptive systems • Brief overviews of the evaluation of IR, QA, User Modelling, Recommender Systems, Intelligent Tutoring Systems, Adaptive Hypertext Systems
Background to Evaluation Methods • Systems need to be evaluated to demonstrate (prove) that the hypothesis on which they are based is correct • In IR, we need to know that the system is retrieving all and only relevant documents for the given query
Background to Evaluation Methods • In QA, we need to know the correct answer to questions, and measure performance • In User Modelling, we need to determine that the model is an accurate reflection of information needed to adapt to the user • In Recommender Systems, we need to associate user preferences either with other similar users, or with product features
Background to Evaluation Methods • In Intelligent Tutoring Systems we need to know that learning through an ITS is beneficial or at least not (too) harmful • In Adaptive Hypertext Systems, we need to measure the system’s ability to automatically represent user interests, to direct the user to relevant information, and to present the information in the best way
Measuring Performance • Information Retrieval: • Recall and Precision (overall, and also at top-n) • Question Answering: • Mean Reciprocal Rank
Measuring Performance • User Modelling • Precision and Recall: if user is given all and only relevant info, or if system behaves exactly as user needs, then model is probably correct • Accuracy and predicted probability: to predict a user’s actions, location, or goals • Utility: the benefit derived from using system
Measuring Performance • Recommender Systems: • Content-based may be evaluated using precision and recall • Collaborative is harder to evaluate, because it depends on other users the system knows about • Quality of individual item prediction • Precision and Recall at top-n
Measuring Performance • Intelligent Tutoring Systems: • Ideally, being able to show that student can learn more efficiently using ITS than without • Usually, show that no harm is done • Then, “releasing the tutor” and enabling self-paced learning becomes a huge advantage • Difficult to evaluate • Cannot compare same student with and without ITS • Students who volunteer are usually very motivated
Measuring Performance • Adaptive Hypertext Systems: • Can mix UM, IR, RS (content-based) methods of evaluation • Use empirical approach • Different sets of users solve same task, one group with adaptivity, the other without • How to choose participants?
Evaluation Methods: IR • IR systems’ performance is normally measured using precision and recall • Precision: percentage of retrieved documents that are relevant • Recall: percentage of relevant documents that are retrieved • Who decides which documents are relevant?
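The two definitions above can be made concrete with a minimal sketch. The function name `precision_recall` and the document identifiers are hypothetical, chosen only for illustration; the values are returned as fractions rather than percentages.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, as fractions in [0, 1]."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # documents that are both retrieved and relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 3 documents judged relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved documents are relevant, and two of the three relevant documents were retrieved.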
Evaluation Methods: IR • Query Relevance Judgements • For each test query, the document collection is divided into two sets: relevant and non-relevant • Systems are compared using precision and recall • In early collections, humans would classify documents (p3-cleverdon.pdf) • Cranfield collection: 1400 documents/221 queries • CACM: 3204 documents/50 queries
Evaluation Methods: IR • Do humans always agree on relevance judgements? • No: can vary considerably (mizzaro96relevance.pdf) • So only use documents on which there is full agreement
Evaluation Methods: IR • Text REtrieval Conference (TREC) (http://trec.nist.gov) • Runs competitions every year • QRels and document collection made available in a number of tracks (e.g., ad hoc, routing, question answering, cross-language, interactive, Web, terabyte, ...)
Evaluation Methods: IR • What happens when collection grows? • E.g., Web track has 1GB of data! Terabyte track in the pipeline • Pooling • Give different systems same document collection to index and queries • Take the top-n retrieved documents from each • Documents that are present in all retrieved sets are relevant, others not OR • Assessors judge the relevance of unique documents in the pool
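The pooling steps above can be sketched as follows. This is only an illustration under stated assumptions: the system names, document identifiers, and the helper names `build_pool` and `agreed_by_all` are hypothetical, and real pooling involves many more systems and a much deeper cut-off.

```python
def build_pool(runs, n):
    """Union of the top-n documents from each system's ranked list.
    These unique documents are what the assessors would judge."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:n])
    return pool


def agreed_by_all(runs, n):
    """Documents that appear in every system's top-n
    (the automatic variant, with no human assessors)."""
    top_sets = [set(ranking[:n]) for ranking in runs.values()]
    return set.intersection(*top_sets) if top_sets else set()


# Hypothetical ranked output of three systems for the same query.
runs = {
    "system_a": ["d3", "d1", "d7", "d2"],
    "system_b": ["d1", "d3", "d9", "d4"],
    "system_c": ["d1", "d5", "d3", "d8"],
}
pool = build_pool(runs, n=3)
consensus = agreed_by_all(runs, n=3)
print(sorted(pool), sorted(consensus))
```

Any relevant document that no system ranks in its top-n never enters the pool, so it is never judged.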
Evaluation Methods: IR • Advantages: • Possible to compare system performance • Relatively cheap • QRels and document collection can be purchased for moderate price rather than organising expensive user trials • Can use standard IR systems (e.g., SMART) and build another layer on top, or build new IR model • Automatic and Repeatable
Evaluation Methods: IR • Common criticisms: • Judgements are subjective • Same assessor may change judgement at different times! • Doesn’t affect ranking • Judgements are binary • Some relevant documents are missed by pooling (QRels are incomplete) • Doesn’t affect system performance
Evaluation Methods: IR • Common criticisms (contd.): • Queries are too long • Queries under test conditions can have several hundred terms • Average Web query length 2.35 terms (p5-jansen.pdf)
Evaluation Methods: IR • In massive document collections there may be hundreds, thousands, or even millions of relevant documents • Must all of them be retrieved? • Measure precision at top-5, 10, 20, 50, 100, 500 and take weighted average over results (Mean Average Precision)
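A minimal sketch of precision at a cut-off and of Mean Average Precision follows. It assumes the commonly used definition of average precision (precision taken at each rank that holds a relevant document, divided by the number of relevant documents), which may not be exactly the weighting the slide has in mind. All function names and data are hypothetical.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked documents that are relevant (k > 0)."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def average_precision(ranking, relevant):
    """Sum of precision at each rank holding a relevant document,
    divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(queries):
    """Mean of average precision over (ranking, relevant) pairs, one per query."""
    return sum(average_precision(rk, rel) for rk, rel in queries) / len(queries)


# Hypothetical query: relevant documents at ranks 1 and 3, one never retrieved.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}
p_at_5 = precision_at_k(ranking, relevant, 5)
ap = average_precision(ranking, relevant)
print(p_at_5, round(ap, 4))
```

Under this definition a relevant document that is never retrieved contributes zero, so incomplete recall lowers the score even when the top of the ranking is good.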
The E-Measure • Combines Precision and Recall into one number (http://www.dcs.gla.ac.uk/Keith/Chapter.7/Ch.7.html) • P = precision, R = recall, b = measure of the relative importance of P or R • E.g., b = 0.5 means the user is twice as interested in precision as recall
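The formula itself did not survive in this transcript. The form usually given for van Rijsbergen's E-measure is E = 1 - (1 + b^2) P R / (b^2 P + R), and the sketch below assumes that is what the slide showed. The function name `e_measure` is hypothetical. Lower E is better, and 1 - E is the F-measure.

```python
def e_measure(precision, recall, b=1.0):
    """E = 1 - (1 + b^2) * P * R / (b^2 * P + R); lower values are better.
    Assumed form of the formula; b < 1 weights precision more heavily."""
    if precision == 0.0 or recall == 0.0:
        return 1.0  # worst case, and avoids dividing by zero when both are 0
    b2 = b * b
    return 1.0 - ((1.0 + b2) * precision * recall) / (b2 * precision + recall)


# With b = 0.5 a precision-heavy system scores better (lower E) than a
# recall-heavy one with the two values swapped.
precision_heavy = e_measure(0.8, 0.4, b=0.5)
recall_heavy = e_measure(0.4, 0.8, b=0.5)
print(round(precision_heavy, 4), round(recall_heavy, 4))
```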
Evaluation Methods: QA • The aim in Question Answering is not to ensure that the overwhelming majority of relevant documents are retrieved, but to return an accurate answer • Precision and recall are not accurate enough • Usual measure is Mean Reciprocal Rank
Evaluation Methods: QA • MRR is the mean, over all queries, of the reciprocal rank of the first correct answer (1/rank, or 0 if no correct answer is in the top-5) • Ideally, the first correct answer is put into rank 1 qa_report.pdf
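As a minimal sketch of the measure described above, with the top-5 cut-off from the slide as the default. The function names and the toy answers are hypothetical.

```python
def reciprocal_rank(answers, correct, cutoff=5):
    """1/rank of the first correct answer within the cutoff, else 0."""
    for rank, answer in enumerate(answers[:cutoff], start=1):
        if answer in correct:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results, cutoff=5):
    """Mean reciprocal rank over (ranked_answers, correct_answers) pairs."""
    return sum(reciprocal_rank(a, c, cutoff) for a, c in results) / len(results)


# Hypothetical results for three questions: correct answer at rank 1,
# at rank 2, and not returned at all.
results = [
    (["Valletta", "Rome"], {"Valletta"}),
    (["1970", "1964"], {"1964"}),
    (["red", "green", "blue"], {"yellow"}),
]
mrr = mean_reciprocal_rank(results)
print(mrr)
```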
Evaluation Methods: UM • Information Retrieval evaluation has matured to the extent that it is very unusual to find an academic publication without a standard approach to evaluation • On the other hand, up to 2001, only one-third of user models presented in UMUAI had been evaluated: and most of those were ITS related (see later) p181-chin.pdf
Evaluation Methods: UM • Unlike IR systems, it is difficult to evaluate UMs automatically • Unless they are stereotypes/coarse-grained classification systems • So they tend to need to be evaluated empirically • User studies • Want to measure how well participants do with and without a UM supporting their task
Evaluation Methods: UM • Difficulties/problems include: • Ensuring a large enough number of participants to make results statistically meaningful • Catering for participants improving during rounds • Failure to use a control group • Ensuring that nothing happens to modify participants’ behaviour (e.g., thinking aloud)
Evaluation Methods: UM • Difficulties/problems (contd.): • Biasing the results • Not using blind-/double-blind testing when needed • ...
Evaluation Methods: UM • Proposed reporting standards • No., source, and relevant background of participants • independent, dependent and covariant variables • analysis method • post-hoc probabilities • raw data (in the paper, or on-line via WWW) • effect size and power (at least 0.8) p181-chin.pdf
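The slide does not say which effect-size statistic is meant. One widely used choice is Cohen's d, the difference between two group means divided by their pooled standard deviation; the sketch below assumes that statistic and does not attempt a power calculation. The function name and the scores are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d: difference in means over the pooled standard deviation.
    Assumes each group has at least two observations."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical task scores: participants with a user model vs. a control group.
with_um = [6, 7, 8, 9, 10]
control = [4, 5, 6, 7, 8]
d = cohens_d(with_um, control)
print(round(d, 3))
```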
Evaluation Methods: RS • Recommender Systems • Two types of recommender system • Content-based • Collaborative • Both (tend to) use VSM to plot users/product features into n-dimensional space
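A minimal sketch of the vector space idea for the content-based case: the user profile and each item are vectors over the same features, and items are ranked by cosine similarity to the profile. Cosine similarity is a common choice in the VSM but is an assumption here, as are the feature layout, item names, and function names.

```python
from math import sqrt


def cosine(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(profile, items, n):
    """Names of the n items whose vectors are most similar to the profile."""
    ranked = sorted(items, key=lambda name: cosine(profile, items[name]), reverse=True)
    return ranked[:n]


# Hypothetical 3-feature space, e.g. (history, cookery, travel).
profile = [1.0, 0.0, 1.0]
items = {
    "book_a": [1.0, 0.0, 1.0],
    "book_b": [0.0, 1.0, 0.0],
    "book_c": [1.0, 1.0, 0.0],
}
top2 = recommend(profile, items, 2)
print(top2)
```

In a collaborative recommender the same similarity would typically be computed between users' rating vectors rather than between a profile and item features.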
Evaluation Methods: RS • If we know the “correct” recommendations to make to a user with a specific profile, then we can use Precision, Recall, E-measure, F-measure, Mean Average Precision, MRR, etc.
Evaluation Methods: ITS • Intelligent Tutoring Systems • Evaluation to demonstrate that learning through ITS is at least as effective as traditional learning • Cost benefit of freeing up tutor, and permitting self-paced learning • Show at a minimum that student is not harmed at all or is minimally harmed
Evaluation Methods: ITS • Difficult to “prove” that individual student learns better/same/worse with ITS than without • Cannot make student unlearn material in between experiments! • Attempt to use statistically significant number of students, to show probable overall effect
Evaluation Methods: ITS • Usually suffers from same problems as evaluating UMs, and ubiquitous multimedia systems • Students volunteer to evaluate ITSs • So are more likely to be motivated and so perform better • Novelty of system is also a motivator • Too many variables that are difficult to cater for
Evaluation Methods: ITS • However, usually empirical evaluation is performed • Volunteers work with system • Pass rates, retention rates, etc., may be compared to conventional learning environment (quantitative analysis) • Volunteers asked for feedback about, e.g., usability (qualitative analysis)
Evaluation Methods: ITS • Frequently, students are split into groups (control and test) and performance measured against each other • Control is usually ITS without the I - students must find their own way through learning material • However, this is difficult to assess, because performance of control group may be worse than traditional learning!
Evaluation Methods: ITS • “Learner achievement” metric (Muntean, 2004) • How much has student learnt from ITS? • Compare pre-learning knowledge to post-learning knowledge • Can compare different systems (as long as they use same learning material), but with different users: so same problem as before
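The slide does not give the formula behind the metric. The sketch below assumes the simplest reading of the pre/post comparison, a per-student gain (post-test score minus pre-test score) averaged over a group, and it should not be taken as Muntean's definition. The function name and the scores are hypothetical.

```python
def mean_gain(scores):
    """Average of (post - pre) over a list of (pre, post) test scores.
    Assumed reading of the pre/post comparison, not a published formula."""
    return sum(post - pre for pre, post in scores) / len(scores)


# Hypothetical percentage scores for two groups using the same learning material.
its_group = [(40, 70), (55, 65), (30, 60)]
control_group = [(45, 60), (50, 55), (35, 50)]
its_gain = mean_gain(its_group)
control_gain = mean_gain(control_group)
print(round(its_gain, 2), round(control_gain, 2))
```

Because the two groups contain different students, a difference in mean gain may reflect the students rather than the systems.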
Evaluation Methods: AHS • Adaptive Hypertext Systems • There are currently no standard metrics for evaluating AHSs • Best practices are taken from fields like ITS, IR, and UM and applied to AHS • Typical evaluation is “experiences” of using system with and without adaptive features
Evaluation Methods: AHS • If a test collection existed for AHS (like TREC) what might it look like? • Descriptions of user models + relevance judgements for relevant links, relevant documents, relevant presentation styles • Would we need a standard “open” user model description? Are all user models capturing the same information about the user?
Evaluation Methods: AHS • What about following paths through hyperspace to pre-specified points and then having the sets of judgements? • Currently, adaptive hypertext systems appear to be performing very different tasks, but even if we take just one of the two things that can be adapted (e.g., links), it appears to be beyond our current ability to agree on how adapting links should be evaluated, mainly due to UM!
Evaluation Methods: AHS • HyperContext (HCT) (HCTCh8.pdf) • HCT builds a short-term user model as a user navigates through hyperspace • We evaluated HCT’s ability to make “See Also” recommendations • Ideally, we would have had hyperspace with independent relevance judgements at particular points in the path of traversal
Evaluation Methods: AHS • Instead, we used two mechanisms for deriving UM (one using interpretation, the other using whole document) • After 5 link traversals we automatically generated a query from each user model, submitted it to search engine and found a relevant interpretation/document respectively
Evaluation Methods: AHS • Users asked to read all documents in the path and then give relevance judgement for each “See Also” recommendation • Recommendations shown in random order • Users didn’t know which was HCT recommended and which was not • Assumed that if user considered doc to be relevant, then UM is accurate
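The blind, randomised presentation described above can be sketched as follows. This is not the code used in the HCT study; the function names, the labels "interpretation" and "whole_document", and the document identifiers are hypothetical, and the two recommended documents are assumed to be distinct.

```python
import random


def blind_presentation(interpretation_doc, whole_document_doc, rng):
    """Return (shown, key): the two recommendations in random order, plus a
    key, hidden from the participant, mapping each document to its source."""
    pair = [("interpretation", interpretation_doc), ("whole_document", whole_document_doc)]
    rng.shuffle(pair)
    shown = [doc for _, doc in pair]
    key = {doc: source for source, doc in pair}
    return shown, key


def tally(judgements, key):
    """Count documents judged relevant, per user-model derivation method."""
    counts = {}
    for doc, is_relevant in judgements.items():
        if is_relevant:
            counts[key[doc]] = counts.get(key[doc], 0) + 1
    return counts


rng = random.Random(0)  # seeded only so the sketch is repeatable
shown, key = blind_presentation("doc_17", "doc_42", rng)
# Hypothetical judgements from one participant.
counts = tally({"doc_17": True, "doc_42": False}, key)
print(shown, counts)
```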
Evaluation Methods: AHS • Not really enough participants to make strong claims about HCT approach to AH • Not really significant differences in RJs between different ways of deriving UM (although both performed reasonably well!) • However, significant findings if reading time is indication of skim-/deep-reading!
Evaluation Methods: AHS • Should users have been shown both documents? • Could reading two documents, instead of just one, have affected judgement of doc read second? • Were users disaffected because it wasn’t a task that they needed to perform?
Evaluation Methods: AHS • Ideally, systems are tested in “real world” conditions in which evaluators are performing tasks • Normally, experimental set-ups require users to perform artificial tasks, and it is difficult to measure performance because relevance is subjective!
Evaluation Methods: AHS • This is one of the criticisms of the TREC collections, but it does allow systems to be compared - even if the story is completely different once the system is in real use • Building a robust enough system for use in the real world is expensive • But then, so is conducting lab-based experiments
Modular Evaluation of AUIs • Adaptive User Interfaces, or User-Adaptive Systems • Difficult to evaluate “monolithic” systems • So break up UASs into “modules” that can be evaluated separately
Modular Evaluation of AUIs • Paramythis et al. recommend • identifying the “evaluation objects” - that can be evaluated separately and in combination • presenting the “evaluation purpose” - the rationale for the modules and criteria for their evaluation • identifying the “evaluation process” - methods and techniques for evaluating modules during the AUI life cycle paramythis.pdf