Lecture 01: Introduction

Lecture 01:Introduction September 5, 2012 COMP 250-2Visual Analytics and Provenance

Motivation:What is in a User’s Interactions? Keyboard, Mouse, etc • Types of Human-Visualization Interactions • Text editing (input heavy, little output) • Browsing, watching a movie (output heavy, little input) • Visual Analysis (closer to 50-50) Input Visualization Human Output Images (monitor)

Provenance: Definitions • Provenance (according to Webster): • origin, source • the history of ownership of a valued object or work of art or literature • Example: Has anyone traced the provenances of these paintings?

Provenance: Computer Science • Examples in CS relating to provenance: • Undo / Redo • Diff / Revision history • History logging (e.g., Wikipedia) • Interaction logging • Data annotation • Database queries and results (Oracle Flashback) • Screen capturing? • Knowledge management • Etc.

Problems and Challenges • There are lots of people working on problems pertaining to provenance. For example: • Open Provenance Model (OPM): common framework for describing information provenance (e.g., Wikipedia) • Database: (1) how data is derived from large computation, (2) how data is copied / synthesized from one part of the database to another • Knowledge management / Ontology: the study of relationships between data, concepts, processes, information, etc. (e.g., Scientific Workflow)

What is “Analytic Provenance”? • Analytic Provenance (AP): similar to workflow, is the provenance of the analysis (the steps of the analysis). This includes: • The record of the data and data manipulation • The record of the user’s interactions • The record of what the user sees • The record of the computational products (both the visualization and the data) • The record of the user’s analysis results • How is provenance research in visual analytics similar and different from the others?

Similarities • At a high level, both are attempts at keeping track of the changes of data and information over time. • Along the same line, one common goal is to verify and validate the current information or the results of a query / analysis. • Others?

Differences • Exploration with high interactivity. • Many provenance systems utilize state diagrams • Similar to depicting “scientific workflow” • What is a “state” in a highly exploratory system? • Often, the “states” are data dependent • Computation can be costly • Storing only the procedures do not always lead to the same result due to computational powers • Storing of “insight” • How to determine which interactions are useful? (the labeling problem…)

Why Study Analytic Provenance? • Damn good question… • A problem that has plagued me for years • Validation (of the results) –kind of lame • Verification (of the process) – kind of lame • Training – Less lame, but still not very sexy • These can all be (somewhat) solved with video-capturing the screen (with some annotation and/or microphone)

Why Study Analytic Provenance? • More interestingly, as a post hoc analysis: • Redo of a certain task • For example, I just did some analysis on the Census of Massachusetts. Can I reapply the same analysis to Rhode Island Census? • What analysis steps are useful? • For example, in solving a difficult problem, is there a key step (or steps) that lead to success (and to failure)? • Building a knowledge-base • If we believe that interactions encodes reasoning and analysis, can I aggregate all interactions/analyses into a knowledge-base?

Why Study Analytic Provenance? • As an ad hoc analysis (that is, in real time), determine a user’s “analytical needs”: • Related to “adaptive” visualization or interfaces, but goes beyond adapting the UI. For example: • Analysis of search space: what data space has the user explored (and not explored)? • Analysis of bias: is the user favoring some hypothesis and ignoring evidence? • Essentially, anything that can give us some indication of who the user is (in terms of “ICD3”)

Challenges? • What are some challenges that need to be solved in order to accomplish the previously stated goals? • How would you categorize them? • CHI 2010 Analytic Provenance workshop

Questions?

Provenance and Scientific Workflows:Challenges and Opportunities • Susan B.Davidson (Upenn) • Juliana Freire (Utah/NYU) • SIGMOD 2008

Prospective Provenance: Steps taken in the scientific workflow • Retrospective Provenance: • The environment in which the steps were taken • Annotated Provenance: • Some notes entered by the user • Note the small squares in the upper left corner of each box (representing each step)

Ways to Store Provenance • RDF (resource description file) • SPARQL (query language for RDF)

Ways to Store Provenance • OWL (Web Ontology Language)

Yuck… • Knowledge vs. information vs. data • Example about grass, rabbit, and wolf

Opportunities • Provenance and Scientific Publications: • Reproducibility for scientific experiments • Provenance and Data Exploration: • Simplify exploratory processes (graph reduction) • Social Network Analysis: • Not sure how this is different from 1, but applied to SNA (and crowd sourcing) • Provenance in Education: • “Teaching is one of the killer applications of provenance-enabled workflow systems”

User starts with an analogy template (left) • The user then reapplies the workflow to a different dataset (right). The system automatically figures out inconsistencies or problems with the reapplication and either (a) fixes it or (b) warns the user

Open Problems • Information Management Infrastructure: • Usability issue relating to how to use information management systems • Provenance Analytics and Visualization: • Visualize and data-mine provenance. • Interoperability: • Steps of workflow generated from different software and computers • Connecting Database and Workflow: • Does provenance make sense without access to the (a) original or the (b) derived data

Questions?

Date and Time • When and where should we meet??

Structure of the Course • Open-ended • The topic of provenance is both old and new… • Read existing papers (that I think are important), and • You suggest new ones • Identify research opportunities • Daily schedule (tentative): • 1:30 – 2:30 3-4 presentations of 15 minutes each • 2:30 – 2:45 break • 2:45 – 4:00 group discussion

Time Line • See course website • http://www.cs.tufts.edu/comp/250VA/ • Sign up for: • 2 scribes positions (on different weeks) • 2 papers to present (on different weeks) • And on those weeks, you will lead the discussions • 1 week where your group (of 3) will identify papers and lead discussions

Questions?

Lecture 01: Introduction