Quality views: capturing and exploiting the user perspective on data quality


Presentation Transcript


  1. Quality views: capturing and exploiting the user perspective on data quality
Paolo Missier, Suzanne Embury, Mark Greenwood (School of Computer Science, University of Manchester, UK)
Alun Preece, Binling Jin (Department of Computing Science, University of Aberdeen, UK)
http://www.qurator.org

  2. Integration of public data (in biology)
Public repositories: UniProt, GenBank, EnsEMBL, Entrez, dbSNP
• Large volumes of data in many public repositories
• Increasingly creative uses for this data
• Its quality is largely unknown

  3. Quality of e-science data
Defining quality can be challenging:
• In-silico experiments express cutting-edge research
• Experimental data is liable to change rapidly
• Definitions of quality are themselves experimental
• Scientists' quality requirements are often just a hunch
• Quality tests are missing, or based on experimental heuristics
• Often implicit and embedded in the experiment → not reusable
A data consumer's view on quality: criteria for data acceptability within a specific data processing context

  4. Example: protein identification
[Pipeline diagram: a "wet lab" experiment produces data that feeds a protein identification algorithm, which uses reference databases to produce a protein hitlist; the hitlist is input to protein function prediction, with provenance metadata as supporting evidence. A quality filtering step on the hitlist removes likely false positives → improves prediction accuracy.]
Goal: to explicitly define and automatically add the additional filtering step in a principled way

  5. Our goals
Offer e-scientists a principled way to:
• Discover quality definitions for specific data domains
• Make them explicit using a formal model
• Implement them in their data processing environment
• Test them on their data
… in an incremental refinement cycle
Benefits:
• Automated processing
• Reusability
• "Plug-in" quality components

  6. Approach
Research hypothesis: adding quality to data can be made cost-effective by separating out generic quality processing from domain-specific definitions.
Qurator: an architectural framework comprising a runtime environment and data-specific quality services.
The cycle: define abstract quality views on the data → map each quality view to an executable process → execute the quality views.

  7. Abstract quality view model
[Diagram: data items are annotated with quality evidence (e.g., e1: PeptidesCount, e2: Coverage, e3, …), forming the quality metadata; evidence assertions map items into class spaces, e.g., Classification1 over class space 1 (C11, C12, …) and Classification2 over class space 2 (C21, C22, …); conditions specify regions over the classifications; actions are applied to those regions of the data.]

  8. Semantic model for quality concepts
• Quality "upper ontology" (OWL), defining the quality evidence types
• Evidence meta-data model (RDF)
• Evidence annotations are class instances
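To make this concrete: below is a minimal RDF/XML sketch of an evidence annotation typed as an instance of a quality evidence class. Only q:Coverage appears in the talk; the namespace URI, the resource URIs, and the q:forData / q:hasValue properties are assumptions for illustration.

<!-- Hypothetical sketch: an evidence annotation as an instance of the
     quality evidence class q:Coverage. All URIs and property names
     other than q:Coverage are assumptions, not taken from the talk. -->
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:q="http://www.qurator.org/ontology#">
  <q:Coverage rdf:about="http://example.org/annotations/ann-01">
    <q:forData rdf:resource="http://example.org/proteins/P12345"/>
    <q:hasValue rdf:datatype="http://www.w3.org/2001/XMLSchema#int">17</q:hasValue>
  </q:Coverage>
</rdf:RDF>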

  9. Quality hypotheses discovery and testing
[Cycle diagram: an abstract quality view undergoes targeted compilation into a target-specific quality component; the component is deployed into a quality-enhanced user environment; execution on test data feeds a performance assessment, which drives the next refinement of the quality view.]
• Multiple target environments: workflow, query processor

  10. Generic quality process pattern
Step 1: collect evidence (fetch persistent annotations; compute on-the-fly annotations)
<variables>
  <var variableName="Coverage" evidence="q:Coverage"/>
  <var variableName="PeptidesCount" evidence="q:PeptidesCount"/>
</variables>
Step 2: compute assertions (classifier services)
<QualityAssertion serviceName="PIScoreClassifier"
                  serviceType="q:PIScoreClassifier"
                  tagSemType="q:PIScoreClassification"
                  tagName="ScoreClass"/>
Steps 3 and 4: evaluate conditions, then execute actions
<action>
  <filter>
    <condition>ScoreClass in {"q:high", "q:mid"} and Coverage > 12</condition>
  </filter>
</action>
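Assembling the fragments above, a complete quality view might read as follows. This is a sketch: the <QualityView> wrapper element and its name attribute are assumptions, while the inner elements are the ones shown on this slide.

<!-- Sketch of an assembled quality view; the <QualityView> wrapper and
     its name attribute are assumed, the inner elements follow the
     fragments shown on slide 10. -->
<QualityView name="ProteinHitlistFilter">
  <variables>
    <var variableName="Coverage" evidence="q:Coverage"/>
    <var variableName="PeptidesCount" evidence="q:PeptidesCount"/>
  </variables>
  <QualityAssertion serviceName="PIScoreClassifier"
                    serviceType="q:PIScoreClassifier"
                    tagSemType="q:PIScoreClassification"
                    tagName="ScoreClass"/>
  <action>
    <filter>
      <condition>ScoreClass in {"q:high", "q:mid"} and Coverage > 12</condition>
    </filter>
  </action>
</QualityView>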

  11. Bindings: assertion → service
• All services implement the same WSDL interface
  • Makes concrete assertion functions homogeneous
  • Facilitates compilation
• Uniform input / output messages: D = {(di, evidence(di))} in; {class(di)} or {score(di)} out
[Diagram: concrete services such as PIScoreClassifierSvc and PI_Top_k_svc sit behind the common WSDL interface; a service registry maps each service class to a Web service endpoint, e.g. PIScoreClassifier → http://localhost/axis/services/PIScoreClassifierSvc]
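To illustrate the uniform signature, here is a hypothetical WSDL 1.1 sketch of the common interface: every assertion service consumes a set of (data item, evidence) pairs and returns a class label or score per item. All names below (messages, operation, namespaces) are assumptions; only the input/output shape comes from the slide.

<!-- Hypothetical WSDL sketch of the common interface. Every name here
     is an assumption; only the in/out shape is from the slide. -->
<definitions name="QualityAssertionService"
             targetNamespace="http://www.qurator.org/services"
             xmlns="http://schemas.xmlsoap.org/wsdl/"
             xmlns:tns="http://www.qurator.org/services"
             xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <message name="AssertRequest">
    <!-- D = {(di, evidence(di))}: data items with their evidence annotations -->
    <part name="annotatedData" type="xsd:anyType"/>
  </message>
  <message name="AssertResponse">
    <!-- {class(di)} or {score(di)}: one label or score per data item -->
    <part name="assertedData" type="xsd:anyType"/>
  </message>
  <portType name="QualityAssertionPortType">
    <operation name="assert">
      <input message="tns:AssertRequest"/>
      <output message="tns:AssertResponse"/>
    </operation>
  </portType>
</definitions>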

  12. Execution model for Quality views
Binding → compilation → executable component, either:
• a sub-flow of an existing workflow, or
• a query processing interceptor
[Diagram: the QV compiler takes an abstract quality view and, using the services registry and service implementations of the Qurator quality framework, embeds a quality workflow into a host workflow D → D’; the resulting quality view applies to D’.]
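The binding input to the compiler can be pictured as a registry entry mapping a service class to an endpoint. The entry below is a hypothetical sketch: the <registry> and <binding> element names are assumptions, while the class/endpoint pair is the one shown on the previous slide.

<!-- Hypothetical registry entry consumed by the QV compiler; element
     names are assumptions, the class/endpoint pair is from slide 11. -->
<registry>
  <binding serviceClass="q:PIScoreClassifier"
           endpoint="http://localhost/axis/services/PIScoreClassifierSvc"/>
</registry>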

  13. Example: original proteomics workflow
[Taverna workflow screenshot; the quality flow embedding point is marked.]
Taverna (*): workflow language and enactment engine for e-science applications
(*) part of the myGrid project, University of Manchester; taverna.sourceforge.net

  14. Example: embedded quality workflow

  15. Interactive conditions / actions

  16. Quality views for queries
Actions: filtering, dump to DB / file (see the sketch below)
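As a sketch of what such actions could look like in the query setting, the block below extends the <action> syntax of slide 10 with a hypothetical <dump> element: <filter> and <condition> are from the slides, while <dump> and its attributes are assumptions.

<!-- Hypothetical action block for the query target: keep acceptable
     rows, dump the rejected ones to a file. The <dump> element and its
     attributes are assumptions. -->
<action>
  <filter>
    <condition>ScoreClass in {"q:high", "q:mid"}</condition>
  </filter>
  <dump target="file" location="rejected_hits.csv"/>
</action>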

  17. Qurator architecture

  18. Summary
For complex data types there is often no single "correct", agreed-upon definition of data quality.
• Qurator provides an environment for fast prototyping of quality hypotheses
  • Based on the notion of "evidence" supporting a quality hypothesis
  • With support for an incremental learning cycle
• Quality views offer an abstract model for making data processing environments quality-aware
  • Compiled into executable components and embedded in the target environment
  • Qurator provides an invocation framework for Quality views
More info and papers: http://www.qurator.org
Live demos (informal) available
