Quality views: capturing and exploiting the user perspective on data quality

Paolo Missier, Suzanne Embury, Mark Greenwood (School of Computer Science, University of Manchester, UK)
Alun Preece, Binling Jin (Department of Computing Science, University of Aberdeen, UK)

http://www.qurator.org
Integration of public data (in biology)

• Large volumes of data sit in many public repositories (UniProt, GenBank, EnsEMBL, Entrez, dbSNP)
• Increasingly creative uses are being found for this data
• Its quality is largely unknown
Quality of e-science data

Defining quality can be challenging:
• In-silico experiments express cutting-edge research
• Experimental data is liable to change rapidly
• Definitions of quality are themselves experimental
• Scientists' quality requirements are often just a hunch
• Quality tests are missing, or based on experimental heuristics
• Quality criteria are often implicit and embedded in the experiment, and therefore not reusable

A data consumer's view on quality: criteria for data acceptability within a specific data processing context
Example: protein identification

A "wet lab" experiment feeds a protein identification algorithm, which consults reference databases to produce a protein hitlist; the hitlist is then used for protein function prediction. Support evidence for the data output is carried as provenance metadata. A quality filtering step between identification and prediction removes likely false positives and improves prediction accuracy.

Goal: to explicitly define and automatically add this filtering step in a principled way
Our goals

Offer e-scientists a principled way to:
• Discover quality definitions for specific data domains
• Make them explicit using a formal model
• Implement them in their data processing environment
• Test them on their data
… in an incremental refinement cycle

Benefits:
• Automated processing
• Reusability
• "Plug-in" quality components
Approach

Research hypothesis: adding quality to data can be made cost-effective by separating out generic quality processing from domain-specific definitions.

Qurator, an architectural framework, provides:
• a runtime environment
• data-specific quality services

The approach, in three steps:
1. Define abstract quality views on the data
2. Map each quality view to an executable process
3. Execute quality views
Abstract quality view model

The model relates several layers:
• Data, carrying data annotations
• Quality metadata: evidence annotations on the data (e.g. Coverage, PeptidesCount)
• Quality assertions: classifications of data items into class spaces (Classification1 over class space 1, Classification2 over class space 2, …)
• Conditions: specifications of regions of the data
• Actions applied to those regions
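The layers above can be pictured as a small data model. The following Python sketch uses hypothetical class and field names (the real Qurator model is an XML/ontology-based specification, not Python); the example instance mirrors the protein-identification view used throughout the slides:

```python
from dataclasses import dataclass, field

@dataclass
class EvidenceVariable:
    name: str           # e.g. "Coverage"
    evidence_type: str  # semantic type from the quality ontology, e.g. "q:Coverage"

@dataclass
class QualityAssertion:
    name: str           # tag name for the computed class, e.g. "ScoreClass"
    service_type: str   # classifier service that produces the class labels
    class_space: list   # the class space the classifier maps items into

@dataclass
class Action:
    kind: str           # e.g. "filter"
    condition: str      # boolean expression over assertions and evidence (a region)

@dataclass
class QualityView:
    variables: list = field(default_factory=list)
    assertions: list = field(default_factory=list)
    actions: list = field(default_factory=list)

# Example instance, following the slides' protein-identification scenario
qv = QualityView(
    variables=[EvidenceVariable("Coverage", "q:Coverage"),
               EvidenceVariable("PeptidesCount", "q:PeptidesCount")],
    assertions=[QualityAssertion("ScoreClass", "q:PIScoreClassifier",
                                 ["q:high", "q:mid", "q:low"])],
    actions=[Action("filter", 'ScoreClass in {"q:high", "q:mid"} and Coverage > 12')],
)
```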
Semantic model for quality concepts

• Quality "upper ontology" (OWL), defining quality evidence types
• Evidence metadata model (RDF): evidence annotations are instances of the ontology classes
Quality hypothesis discovery and testing

An incremental cycle: an abstract quality view undergoes targeted compilation into a target-specific quality component; the component is deployed into a quality-enhanced user environment, executed on test data, and assessed for performance; the assessment feeds back into refining the view.

Multiple target environments:
• workflow
• query processor
Generic quality process pattern

1. Collect evidence (fetch persistent annotations; compute on-the-fly annotations):

<variables>
  <var variableName="Coverage" evidence="q:Coverage"/>
  <var variableName="PeptidesCount" evidence="q:PeptidesCount"/>
</variables>

2. Compute assertions (one classifier per assertion):

<QualityAssertion serviceName="PIScoreClassifier"
                  serviceType="q:PIScoreClassifier"
                  tagSemType="q:PIScoreClassification"
                  tagName="ScoreClass"/>

3. Evaluate conditions and 4. execute actions:

<action>
  <filter>
    <condition>
      ScoreClass in {"q:high", "q:mid"} and Coverage > 12
    </condition>
  </filter>
</action>
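The four-step pattern (collect evidence, compute assertions, evaluate conditions, execute actions) can be executed roughly as follows. The data items, scores, and classifier thresholds here are invented for illustration; the real assertion step calls a remote classifier service:

```python
# Hypothetical data items, each already annotated with its evidence
items = [
    {"id": "P1", "Coverage": 20, "PeptidesCount": 9, "score": 0.91},
    {"id": "P2", "Coverage": 10, "PeptidesCount": 3, "score": 0.40},
    {"id": "P3", "Coverage": 15, "PeptidesCount": 6, "score": 0.72},
]

def pi_score_classifier(item):
    """Stand-in assertion service: map a raw score to a class label."""
    if item["score"] >= 0.8:
        return "q:high"
    if item["score"] >= 0.6:
        return "q:mid"
    return "q:low"

# Steps 1-2: evidence is already attached; compute the quality assertion
for item in items:
    item["ScoreClass"] = pi_score_classifier(item)

# Steps 3-4: evaluate the condition and execute the filter action
accepted = [item for item in items
            if item["ScoreClass"] in {"q:high", "q:mid"} and item["Coverage"] > 12]

print([item["id"] for item in accepted])  # -> ['P1', 'P3']
```

P2 is dropped because its score classifies as "q:low"; P1 and P3 satisfy both the class condition and the coverage threshold.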
Bindings: assertion service

• All assertion services implement the same WSDL interface, with uniform input/output messages
• This makes concrete assertion functions homogeneous and facilitates compilation
• Interface shape: D = {(di, evidence(di))} → {class(di)}; e.g. PIScoreClassifierSvc classifies on {score(di)}

The service registry maps a service class to a web service endpoint, e.g.:
PIScoreClassifier → http://localhost/axis/services/PIScoreClassifierSvc
(other registered services include PI_Top_k_svc)
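The uniform-interface idea, that every assertion service takes a batch of (item, evidence) pairs and returns one class label per item, can be sketched as an abstract base class. Names and thresholds are hypothetical; the real services are WSDL web services, not Python classes:

```python
from abc import ABC, abstractmethod

class AssertionService(ABC):
    """Common interface: D = {(d_i, evidence(d_i))} -> {class(d_i)}."""

    @abstractmethod
    def classify(self, batch):
        """batch: list of (item_id, evidence_dict); returns {item_id: class_label}."""

class PIScoreClassifierSvc(AssertionService):
    """Illustrative concrete service, classifying items by a score threshold."""

    def classify(self, batch):
        out = {}
        for item_id, evidence in batch:
            score = evidence["score"]
            if score >= 0.8:
                out[item_id] = "q:high"
            elif score >= 0.6:
                out[item_id] = "q:mid"
            else:
                out[item_id] = "q:low"
        return out

svc = PIScoreClassifierSvc()
result = svc.classify([("P1", {"score": 0.91}), ("P2", {"score": 0.40})])
# result == {"P1": "q:high", "P2": "q:low"}
```

Because every service exposes the same `classify` shape, a compiler can bind any registered classifier into a quality view without service-specific glue code, which is the homogeneity the slide describes.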
Execution model for quality views

Within the Qurator quality framework, the QV compiler performs binding compilation against the services registry and the service implementations, turning an abstract quality view into an executable component:
• a sub-flow of an existing host workflow (D → embedded quality workflow → D')
• a query processing interceptor

The host workflow then continues on the quality view's output D'.
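Conceptually, the compiler resolves each abstract assertion against the registry and emits a component mapping a dataset D to a filtered D' that the host workflow can invoke. A toy sketch, with the registry, classifier, and accepted classes hard-coded for illustration (in Qurator the registry resolves to web-service endpoints):

```python
# Toy registry mapping abstract service classes to callables
def pi_score_classifier(evidence):
    """Stand-in classifier: threshold chosen for illustration only."""
    return "q:high" if evidence["score"] >= 0.8 else "q:low"

REGISTRY = {"q:PIScoreClassifier": pi_score_classifier}

def compile_quality_view(service_class, accepted_classes):
    """Binding compilation: return an executable component D -> D'."""
    classifier = REGISTRY[service_class]  # resolve the abstract service binding
    def component(dataset):
        return [d for d in dataset if classifier(d) in accepted_classes]
    return component

# The compiled component is embedded as a sub-flow of the host workflow
quality_step = compile_quality_view("q:PIScoreClassifier", {"q:high"})
D = [{"id": "P1", "score": 0.95}, {"id": "P2", "score": 0.30}]
D_prime = quality_step(D)  # the host workflow continues on D'
```

The key design point mirrored here is that the generic machinery (`compile_quality_view`) is independent of the domain-specific classifier, which is supplied only through the registry binding.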
Example: original proteomics workflow

The quality flow is embedded at a designated point of the original workflow, which is written in Taverna (*), a workflow language and enactment engine for e-science applications.

(*) Part of the myGrid project, University of Manchester: taverna.sourceforge.net
Quality views for queries

Actions: filtering, dump to DB / file
Summary

For complex data types there is often no single "correct", agreed-upon definition of data quality.

• Qurator provides an environment for fast prototyping of quality hypotheses
  • based on the notion of "evidence" supporting a quality hypothesis
  • with support for an incremental learning cycle
• Quality views offer an abstract model for making data processing environments quality-aware
  • they are compiled into executable components and embedded
  • Qurator provides an invocation framework for quality views

More info and papers: http://www.qurator.org
Live demos (informal) available