‘Class Exercise’ III: Application Project Evaluation

‘Class Exercise’ III: Application Project Evaluation Deborah McGuinness and Joanne Luciano with Peter Fox and Li Ding CSCI/ITEC-6962-01 Week 11, November 15, 2010 1

Contents • Review of reading, questions, comments • Evaluation • Summary • Next week 2

Semantic Web Methodology and Technology Development Process • Establish and improve a well-defined methodology vision for Semantic Technology based application development • Leverage controlled vocabularies, etc. [Figure: development process cycle; stage labels: Use Case; Small Team, mixed skills; Analysis; Develop model/ontology; Use Tools; Science/Expert Review & Iteration; Adopt Technology Approach; Leverage Technology Infrastructure; Rapid Prototype; Evaluation; Open World: Evolve, Iterate, Redesign, Redeploy] 3

References • Twidale, Randall, and Bentley (1994) and references therein • Scriven (1991, 1996) • Weston, McAlpine, and Bordonaro (1995) • Worthen, Sanders, and Fitzpatrick (1997) 4

Inventory • What categories can you measure? • Users • Files • Databases • Catalogs • Existing UI capabilities (or lack thereof) • Services • Ontologies • The use case development stage is a very good time to capture these elements; do not guess, get them from quantitative sources or from the users/actors 5

Metrics • Things you can measure (numerical) • Things that are categorical • Could not do before • Faster, more complete, fewer mistakes, etc. • Wider range of users • Measure or estimate the baseline before you start 6

Result / Outcome • Refer to the use case document • Outcome (and value of it) is a combination of data gathering processes, including surveys, interviews, focus groups, document analysis and observations that will yield both qualitative and quantitative results. • Did you meet the goal? • Just listen… do not defend … if you start to then: QTIP – quit taking it personally 7

Example: what we wanted to know about VSTO • Evaluation questions are used to determine the degree to which the VSTO enhanced search, access, and use of data for scientific and educational needs, and effectively utilized and implemented a template for user-centric utilization of the semantic web methodology. • VO – appears to be local and integrated, and in the end-user's language (this is one of the metrics) 8

Evaluation (Twidale et al.) • An assessment of the overall effectiveness of a piece of software, ideally yielding a numeric measure by which informed cost-benefit analysis of purchasing decisions can be made. • An assessment of the degree to which the software fulfils its specification in terms of functionality, speed, size or whatever measures were pre-specified. 9

Evaluation • An assessment of whether the software fulfils the purpose for which it was intended. • An assessment of whether the ideas embodied in the software have been proved to be superior to an alternative, where that alternative is frequently the traditional solution to the problem addressed. • An assessment of whether the money allocated to a research project has been productively used, yielding useful generalizeable results. 10

Evaluation • An assessment of whether the software proves acceptable to the intended end-users. • An assessment of whether end-users continue to use it in their normal work. • An assessment of where the software fails to perform as desired or as is now seen to be desirable. • An assessment of the relative importance of the inadequacies of the software. 11

(Orthogonal) Dimensions of evaluations http://janus.ucc.nau.edu/edtech/etc667/proposal/evaluation/summative_vs._formative.htm 12

Formative and Summative • Evaluation is carried out for two reasons: • grading translations = summative evaluation • giving feedback = formative evaluation • “When the cook tastes the soup, that’s formative; when the guests taste the soup, that’s summative.” (Stake) 13

Formative and Summative 14

What if questions (qualitative) • What if you: • could not only use your data and tools but also a remote colleague's data and tools? • understood their assumptions, constraints, etc. and could evaluate applicability? • knew whose research currently (or in the future) would benefit from your results? • knew whose results were consistent (or inconsistent) with yours? 15

Evaluation questions and associated data collection methods 16

Implementing an evaluation • Based on our experience with use case development and refinement, community engagement, and ontology vetting, a workshop format (6 up to 25 participants, depending on desired outcomes and scope) is a very effective mechanism to make rapid progress. • The workshops can be part of a larger meeting, stand-alone or partly virtual (via remote telecommunication). • We have found (for example, in our data integration work) that domain experts in particular are extremely willing to participate in these workshops. 19

Implementing • Let’s take an example • VSTO • Representative but does not exercise all semantic web capabilities 20

VSTO qualitative results • Decreased input requirements: The previous system required the user to provide 8 pieces of input data to generate a query and our system requires 3. Additionally, the three choices are constrained by value restrictions propagated by the reasoning engine. Thus, we have made the workflow more efficient and reduced errors (note the supportive user comments two paragraphs above) 21
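A minimal sketch of how the input-count comparison above could be expressed against a baseline. The function name `relative_reduction` is a hypothetical helper introduced for illustration; only the figures 8 and 3 come from the slide.

```python
def relative_reduction(baseline: float, current: float) -> float:
    """Fractional reduction of `current` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) / baseline


# Figures from the slide: 8 required inputs in the previous system, 3 now.
inputs_reduction = relative_reduction(8, 3)
print(f"Required inputs reduced by {inputs_reduction:.1%}")  # prints 62.5%
```

The same helper works for any metric where a baseline was measured or estimated before the new system was deployed.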

VSTO qualitative results • Syntactic query support: The interface generates only syntactically correct queries. The previous interface allowed users to edit the query directly, thus providing multiple opportunities for syntactic errors in the query formation stage. As one user put it: “I used to do one query, get the data and then alter the URL in a way I thought would get me similar data but I rarely succeeded, now I can quickly re-generate the query for new data and always get what I intended”. 22

VSTO qualitative results • Semantic query support: By using background ontologies and a reasoner, our application can expose only those query options that will not generate incoherent queries. Additionally, the interface only exposes options (for example, date ranges) for which data actually exist. This semantic support did not exist in the previous system. In fact, we limited functionality in the old interface to minimize the chances of misleading or semantically incorrect query construction. 23
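The idea of exposing only options for which data exist can be illustrated with a much simpler stand-in than an ontology and reasoner. The sketch below is not the VSTO implementation: it swaps the reasoner for a plain dictionary lookup, and the catalog contents, instrument names, and the function `instruments_with_data` are all hypothetical.

```python
from datetime import date

# Hypothetical catalog: instrument name -> list of (start, end) ranges with data.
CATALOG = {
    "instrument_a": [(date(2005, 1, 1), date(2005, 6, 30))],
    "instrument_b": [(date(2006, 3, 1), date(2006, 3, 31))],
}


def instruments_with_data(catalog, start, end):
    """Return instruments with at least one data range overlapping [start, end]."""
    return sorted(
        name
        for name, ranges in catalog.items()
        if any(r_start <= end and start <= r_end for r_start, r_end in ranges)
    )


# Only instruments that actually have data in the requested window are offered.
options = instruments_with_data(CATALOG, date(2005, 6, 1), date(2005, 12, 31))
print(options)  # ['instrument_a']
```

A user presented with `options` cannot build a query that is guaranteed to return nothing, which is the effect the slide describes.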

VSTO qualitative results • Semantic query support: For example, a user has increased functionality: they can now initiate a query by selecting a class of parameter(s). As the query progresses, the sub-classes and/or specific instances of that parameter class become available as the datasets are identified later in the query process. 24

VSTO qualitative results • Semantic query support: We removed the parameter initiated search in the previous system because only the parameter instances could be chosen (8 different instances to represent neutral temperature, 18 representations of time, etc.) and it was too easy for the wrong one to be chosen, quickly leading to a dead-end query and frustrated user. One user with more than 5 years of CEDAR system experience noted: “Ah, at last, I’ve always wanted to be able to search this way and the way you’ve done it makes so much sense”. 25

VSTO qualitative results • Semantic integration: Users now depend on the ontologies rather than themselves to know the nuances of the terminologies used in varying data collections. Perhaps more importantly, they also can access information about how data was collected including the operating modes of the instruments used. “The fact that plots come along with the data query is really nice, and that when I selected the data it comes with the correct time parameter” (New graduate student, ~ 1 year of use). 26

VSTO qualitative results • Semantic integration: The nature of the encoding of time for different instruments means that not only are there 18 different parameter representations but those parameters are sometimes recorded in the prologue entries of the data records, sometimes in the header of the data entry (i.e. as metadata) and sometimes as entries in the data tables themselves. Users had to remember (and maintain codes) to account for numerous combinations. The semantic mediation now provides the level of sensible data integration required. 27

VSTO qualitative results • Broader range of potential users: VSTO is usable by people who do not have PhD level expertise in all of the domain science areas, thus supporting efforts including interdisciplinary research. The user population consists of: Student (under-graduate, graduate) and non-student (instrument PI, scientists, data managers, professional research associates). 28

VSTO quantitative results • Broader range of potential users: For CEDAR, students: 168, non-students: 337, for MLSO, students: 50, non-students: 250. In addition 36% and 25% of the users are non-US based (CEDAR – a 57% increase over the last year - and MLSO respectively). The relative percentage of students has increased by ~10% for both groups. 29
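As a small worked example, the student share of each user community can be derived from the counts above. The dictionary layout and the helper `student_share` are illustrative assumptions; the counts are those reported on the slide.

```python
# User counts reported on the slide.
users = {
    "CEDAR": {"students": 168, "non_students": 337},
    "MLSO": {"students": 50, "non_students": 250},
}


def student_share(counts):
    """Fraction of a community's users who are students."""
    total = counts["students"] + counts["non_students"]
    return counts["students"] / total


for community, counts in users.items():
    total = counts["students"] + counts["non_students"]
    print(f"{community}: {total} users, {student_share(counts):.1%} students")
# CEDAR: 505 users, 33.3% students
# MLSO: 300 users, 16.7% students
```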

Adoption (circa 2007) • Currently there are on average between 80-90 distinct users authenticated via the portal and issuing 400-450 data requests per day, resulting in data access volumes of 100KB to 210MB per request. In the last year, 100 new users have registered, more than four times the number from the previous year. The users registered last year when the new portal was released, and after the primary community workshop at which the new VSTO system was presented. At that meeting, community agreement was given to transfer operations to the new system and move away from the existing one. 30

Facilitating new projects • At the community workshop a priority area was identified which involved the accuracy and consistency of temperature measurements determined from instruments like the Fabry-Perot Interferometer. As a result, we have seen a 44% increase in data requests in that area. We increased the granularity in the related portion of the ontology to facilitate this study. 31

Facilitating new projects • We focused on improving a user's ability to find related or supportive data with which to evaluate the neutral temperatures under investigation. We are seeing an increase (10%) in other neutral temperature data accesses, which we believe is a result of this related need. 32

Informal evaluation • We conducted an informal user study asking three questions: What do you like about the new searching interface? Are you finding the data you need? What is the single biggest difference? Users were already changing the way they search for and access data. Anecdotal evidence indicated that users are starting to think at the science level of queries, rather than at the former syntactic level. 33

Informal evaluation • For example, instead of telling a student to enter a particular instrument and date/time range and see what they get, they are able to explore physical quantities of interest at relevant epochs where these quantities go to extreme values, such as auroral brightness at a time of high solar activity (which leads to spectacular auroral phenomena). This suggested to us some new use cases to support even greater semantic mediation 34

Further measuring • One measure that we hoped to achieve is usage by all levels of domain scientist, from the PI to the early-level graduate student. Anecdotal evidence shows this is happening, and self-classification also confirms the distribution. A scientist doing model/observational comparisons noted: “took me two passes now, I get it right away”, “nice to have quarter of the options”, and “I am getting closer to 1 query to 1 data retrieval, that’s nice”. 35

Focus group • A one-hour workshop was held at the annual community meeting on the day after the main plenary presentation for VSTO. The workshop was very well attended, with 35 diverse participants (25 were expected) ranging from a number of senior researchers to junior researchers, post-doctoral fellows, and students, including 3 who had just started in the field. • After some self-introductions, eight questions were posed and responses recorded, some by count (yes/no) and some by comment. Overall responses ranged from 5 to 35 per question. 36

VSTO quantitative results • How do you like to search for data? Browse, type a query, visual? Responses: 10; Browse=7, Type=0, Visual=3. • What other concepts are you interested in using for search, e.g. time of high solar activity, campaign, feature, phenomenon, others? Responses: 5; all of these, no others were suggested. • Does the interface and its services deliver the functionality, speed, flexibility you require? Responses: 30; Yes=30, No=0. 37

VSTO quantitative results • Are you finding the data you need? Responses: 35; Yes=34, No=1. • How often do you use the interface in your normal work? Responses: 19; Daily=13, Monthly=4, Longer=2. • Are there places where the interface/ services fail to perform as desired? Responses: 5; Yes=1, No=4. 38
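One way to summarize the focus-group tallies is to convert each question's counts into shares while keeping the number of responses visible. This is a sketch only: the short question labels and the helper `shares` are hypothetical, and the counts are those reported on the two quantitative-results slides.

```python
# Response tallies as reported on the focus-group slides.
tallies = {
    "Preferred search style": {"Browse": 7, "Type": 0, "Visual": 3},
    "Delivers required functionality": {"Yes": 30, "No": 0},
    "Finding the data you need": {"Yes": 34, "No": 1},
    "Frequency of use": {"Daily": 13, "Monthly": 4, "Longer": 2},
    "Places where it fails": {"Yes": 1, "No": 4},
}


def shares(counts):
    """Convert raw counts to fractions of the responses to that question."""
    total = sum(counts.values())
    return {answer: n / total for answer, n in counts.items()}


for question, counts in tallies.items():
    n = sum(counts.values())
    summary = ", ".join(f"{a} {p:.0%}" for a, p in shares(counts).items())
    print(f"{question} (n={n}): {summary}")
```

Reporting n next to each share matters here because the response counts range from 5 to 35, so a single answer shifts some percentages far more than others.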

Qualitative questions • What do you like about the new searching interface? Responses: 9. • What is the single biggest difference? Responses: 8. • The general answers were as follows: • Less clicks to data (lots) • Auto identification and retrieval of independent variables (lots) • Faster (lots) • Seems to converge faster (few) 39

Unsolicited/ unstructured comments • It makes sense now! • [I] Like the plotting. • Finding instruments I never knew about. • Descriptions are very handy. • What else can you add? • How about a python interface [to the services]? 40

Surprise! New use cases • The need for a programming/ script level interface, i.e. building on the services interfaces; in Python, Perl, C, Ruby, Tcl, and 3 others. • Addition of models alongside observational data, i.e. find data from observations/ models that are comparable and/or compatible. • More services (particularly plotting options - e.g. coordinate transformation - that are hard to add without detailed knowledge of the data). 41

Other examples • CALO – Trust studies • Alyssa Glass, Deborah L. McGuinness, Paulo Pinheiro da Silva, and Michael Wolverton. Trustable Task Processing Systems. In Roth-Berghofer, T., and Richter, M.M., editors, KI Journal, Special Issue on Explanation, Künstliche Intelligenz, 2008. • NIMD – Intelligence Analyst Study • Andrew J. Cowell, Deborah L. McGuinness, Carrie F. Varley, and David A. Thurman. Knowledge-Worker Requirements for Next Generation Query Answering and Explanation Systems. In the Proceedings of the Workshop on Intelligent User Interfaces for Intelligence Analysis, International Conference on Intelligent User Interfaces (IUI 2006), Sydney, Australia. 42

Keep in mind • You need an evaluation plan that can lead to improvements in what you have built • You need an evaluation to value what you have built • You need an evaluation as part of your publication (and thesis) 43

Iterating • Evolve, iterate, re-design, re-deploy • Small fixes • Full team must be briefed on the evaluation results and implications • Decide what to do about the new use cases, or if the goal is not met • Determine what knowledge engineering is required and who will do it (often participants in the evaluation may become domain experts in your methodology) • Determine what new knowledge representation is needed • Assess need for an architectural re-design 44

Summary • Project evaluation has many attributes • Structured and less-structured • Really need to be open to all forms • A good way to start is to get members of your team to do peer evaluation • This is a professional exercise, treat it that way at all times • Other possible techniques for moving forward on evolving the design, what to focus upon, priorities, etc.: SWOT, Porter’s 5 forces 45

Next week • This week's assignments: • Reading: no reading • Next class (week 12 – November 22): • Team Use Case Implementation • Office hours this week – • Questions? 46
