Quality views: capturing and exploiting the user perspective on information quality

This research paper discusses the importance of information quality in e-science and presents a conceptual model and architectural framework for capturing and utilizing user preferences on data quality. The goal is to support users in understanding and evaluating information quality in specific data domains. The paper also highlights the challenges faced by scientists in assessing the quality of public data and proposes a solution to make quality criteria explicit.

Presentation Transcript


  1. Quality views: capturing and exploiting the user perspective on information quality (Describing the Quality of Curated e-Science Information Resources). Paolo Missier, Suzanne Embury, Mark Greenwood, School of Computer Science, University of Manchester; Alun Preece, Binling Jin, Department of Computing Science, University of Aberdeen. www.qurator.org

  2. Outline • Information and information quality (IQ) in e-science • Quality views: a quality lens on data • Semantic model for IQ • Architectural framework for quality views • State of the project and current research

  3. Information and quality in e-science [Diagram: lab experiments feed public BioDBs, which in turn feed in silico (e.g. workflow-based) e-science experiments; the scientist asks: "Can I trust this data? What evidence do I have that it is suitable for my experiment?"] • Scientists are increasingly required to place more of their data in the public domain • Scientists use other scientists' experimental results as part of their own work • The quality of the shared data varies • Scientists have no control over the quality of public data • Lack of awareness of quality: it is difficult to measure and assess • No standards!

  4. A concrete scenario. Qualitative proteomics: identification of proteins in a cell sample [Diagram: a wet-lab pipeline (step 1 … step n) produces candidate data for matching (peptide peak lists); an information service ("dry lab") runs a match algorithm against reference DBs (MSDB, NCBI, SwissProt/Uniprot) and returns a hit list {ID, score, p-value, …}] • False negatives: incompleteness of the reference DBs, pessimistic matching • False positives: optimistic matching

  5. The complete in silico workflow: (1) identify proteins; (2) analyze their functions [Workflow figure, annotated with the questions: What is the quality of this processor's output? Is the processor introducing noise into the flow?] GO = Gene Ontology, the reference controlled vocabulary for describing protein function (and more). How can a user rapidly test this and other hypotheses about quality?

  6. The users' perception of quality A "one-size-fits-all" approach to quality does not work • Scientists tend to apply personal acceptability criteria to data • Driven mostly by prior personal and peers' experience • Based on the expected use of the data • What levels of false positives / negatives are acceptable? Scientists often have only a blurry notion of their quality requirements for the data, and it is difficult for them to implement quality criteria and test them on the data

  7. Quality views: making quality explicit Our goals: • To support groups of users within a (scientific) community in understanding information quality in specific data domains • To foster reuse of quality definitions within the community Approach: • Provide a conceptual model and architectural framework to capture user preferences on data quality • Let users populate the framework with custom definitions for indicators and personal decision criteria • The framework allows users to rapidly test quality preferences and observe their effect on the data • Semi-automated integration into the data processing environment Quality view: a specification of quality preferences and of how they apply to the data

  8. Basic elements of information quality • 1 - Quality dimensions: a basic set of generic definitions for well-known non-functional properties of the data • Ex. Accuracy: describes "how close the observed value is to the actual value" • 2 - Quality evidence: any measurable quantities that can be used to express formal quality criteria • Evidence is not by itself a measure of quality • Ex. "hit ratio in protein identification" • 3 - Quality assertions: decision procedures for data acceptability, based on available evidence
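
A minimal sketch of how these three elements might be modelled in code; the class and field names are illustrative, not taken from the Qurator implementation:

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class QualityEvidence:
    """A measurable quantity attached to a data item; not, by itself, a quality judgement."""
    name: str    # e.g. "HitRatio" (hypothetical indicator name)
    value: float

@dataclass
class QualityAssertion:
    """A decision procedure for data acceptability, driven by available evidence."""
    name: str
    # Maps {evidence name -> value} to an acceptability class label, e.g. "high"
    decide: Callable[[Dict[str, float]], str]
```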

  9. The nature of quality evidence Direct evidence: indicators that represent some quality property • Algorithms may exist to determine the biological plausibility of an experiment's outcome • These may be costly, not always available, and possibly inconclusive Indirect evidence: inexpensive indicators that correlate with other, more expensive indicators • E.g. some function of "hit ratio" and "sequence coverage" • Requires experimental evidence of the correlation Goals: design suitable functions to collect / compute evidence, and associate evidence to data (data quality annotation)
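
For instance, an indirect indicator might be a simple weighted combination of hit ratio and sequence coverage. The linear form and the weights below are placeholders, not the talk's actual function:

```python
def indirect_plausibility(hit_ratio: float, sequence_coverage: float,
                          w_hit: float = 0.6, w_cov: float = 0.4) -> float:
    """Inexpensive proxy for a costly direct biological-plausibility check.

    The linear form and the default weights are illustrative: in practice they
    would have to be calibrated against experimental evidence of the correlation
    with the direct indicator.
    """
    return w_hit * hit_ratio + w_cov * sequence_coverage
```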

  10. Generic (e-science) evidence • Recency: how recently the experiment was performed, or its results published • Evidence: submission and publication dates • Submitter reputation: is the lab well known for its accuracy in carrying out this type of experiment? • Metric: lab ranking (subjective) • Publication prestige: are the experiment's results presented in high-profile journal publications? • Metric: Impact Factor and more (official) Collecting data provenance is key to providing most of these types of evidence
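
As an example, a recency indicator can be derived mechanically from provenance metadata; the provenance schema is not specified in the talk, so the field name here is hypothetical:

```python
from datetime import date
from typing import Optional

def recency_in_days(provenance: dict, today: Optional[date] = None) -> int:
    """Compute a recency indicator (age in days) from a provenance record.

    Assumes an ISO-formatted 'submission_date' field (hypothetical name);
    smaller values indicate more recent, and presumably fresher, results.
    """
    today = today or date.today()
    submitted = date.fromisoformat(provenance["submission_date"])
    return (today - submitted).days
```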

  11. Semantic model for information quality The key IQ concepts are captured using an ontology: • Provides shareable, formal definitions for QualityProperties ("dimensions"), QualityEvidence, QualityAssertions, and DataAnalysisTools (which describe how indicators are computed) • The ontology is implemented in OWL DL • Expressive operators for defining concepts and their relationships • Support for subsumption reasoning

  12. Top-level taxonomy of quality dimensions [Figure: taxonomy spanning generic dimensions, domain-specific, user-oriented, and concrete qualities] After Wang and Strong, "Beyond Accuracy: What Data Quality Means to Data Consumers", Journal of Management Information Systems, 1996

  13. Main taxonomies and properties Class restrictions: MassCoverage ⊑ ∃ is-evidence-for . ImprintHitEntry; PIScoreClassifier ⊑ ∃ assertion-based-on-evidence . HitScore; PIScoreClassifier ⊑ ∃ assertion-based-on-evidence . MassCoverage Property signatures: assertion-based-on-evidence : QualityAssertion → QualityEvidence; is-evidence-for : QualityEvidence → DataEntity

  14. Associating evidence with data • Annotation functions compute quality evidence values for datasets and associate them with the data • Defined in the DataAnalysisTool taxonomy as part of the ontology
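
A sketch of what an annotation function might look like: it computes an evidence value for each entry in a dataset and emits (entity, evidence type, value) triples that can be stored as quality annotations. The input field names and the coverage formula are stand-ins, not Qurator's actual computation:

```python
from typing import Iterable, List, Tuple

def annotate_mass_coverage(hits: Iterable[dict]) -> List[Tuple[str, str, float]]:
    """Annotation function: attach a MassCoverage evidence value to each protein hit.

    Each hit is assumed to carry 'id', 'matched_mass' and 'total_mass' fields
    (hypothetical names); the output triples associate evidence with data entities.
    """
    annotations = []
    for hit in hits:
        coverage = hit["matched_mass"] / hit["total_mass"]
        annotations.append((hit["id"], "MassCoverage", coverage))
    return annotations
```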

  15. Quality assertions Defined as ranking or classification functions f(D, I) Input: • dataset D • vector I = [I1, I2, …, In] of indicator values Possible outputs: • A classification {(d, ci)}: a class ci for each d ∈ D • A ranking {(d, ri)}: a rank ri for each d ∈ D The classification scheme C = {c1, …, ck} and the ranking interval [r, R] are themselves defined in the ontology Assertions formalize the user's bias on evidence as computable decision models over that evidence Example: PIScoreClassifier partitions the input dataset into three classes {low, avg, high} based on a function of [HitScore, MassCoverage]
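
A sketch of such a classification assertion, assuming the two indicators are normalised to [0, 1]; the equal weighting and the 0.4 / 0.7 thresholds are illustrative, not Qurator's actual decision model:

```python
from typing import Dict, List, Tuple

def pi_score_classifier(dataset: List[str],
                        indicators: Dict[str, Dict[str, float]]) -> List[Tuple[str, str]]:
    """Classification assertion f(D, I): partition protein hits into {low, avg, high}.

    'indicators' maps each item d to its evidence values, assumed here to be
    HitScore and MassCoverage in [0, 1]; the combining function and thresholds
    are placeholders that a user would replace with their own decision criteria.
    """
    result = []
    for d in dataset:
        ev = indicators[d]
        score = 0.5 * ev["HitScore"] + 0.5 * ev["MassCoverage"]
        label = "low" if score < 0.4 else "avg" if score < 0.7 else "high"
        result.append((d, label))
    return result
```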

  16. Quality views in practice Quality views are declarative specifications for: • the desired data classification models and evidence • I = [I1, I2, …, In] • class(d), rank(d) for all d ∈ D • condition-action pairs, e.g.: • If <condition on class(d), rank(d), I> then <action> • where <action> depends on the data processing environment: • filter out d • highlight d in a viewer • send d to a designated process or repository • … • Quality views are based on a small set of formal operators • They are expressed using an XML syntax
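
Since QVs are expressed in XML, a quality view for the proteomics scenario might look roughly as follows; the talk does not show the concrete schema, so the element names are invented for illustration:

```xml
<!-- Hypothetical syntax: the element names are illustrative,
     not the actual Qurator QV schema. -->
<qualityView name="ProteinHitFilter">
  <evidence ref="HitScore"/>
  <evidence ref="MassCoverage"/>
  <assertion ref="PIScoreClassifier"/>
  <rule>
    <condition>class(d) = 'low'</condition>
    <action type="filter-out"/>
  </rule>
</qualityView>
```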

  17. Execution model for quality views [Diagram: a declarative (XML) QV is passed through the QV compiler, producing an executable QV embedded in a host environment; the Qurator quality framework supplies the quality assertion services; a dataset D flows through the embedded QV, yielding D' together with a quality view on D'] • QVs can be embedded within specific data management host environments for runtime execution • For static data: a query processor • For dynamic data: a workflow engine

  18. User model Rapid testing of quality hypotheses is implemented as a cycle: compose a quality view (XML, with the help of the IQ ontology) → compile and deploy (bindings to quality assertion services) → execute on test data → assess (view results) → re-deploy (updating the assertion models)

  19. The Qurator quality framework

  20. Compiled quality workflow

  21. Embedded quality workflow

  22. Example effect of QV: noise reduction

  23. Summary • A conceptual model and architecture for capturing the user's perception of information quality • The formal, semantic model makes concepts • Shareable • Reusable • Machine-processable • Quality views are user-defined and compiled into (possibly multiple) data processing environments • The Qurator framework supports a runtime model for QVs • Current work: • Formal semantics for QVs • Exploiting semantic technology to support the QV specification task • Addressing more real use cases Main paradigm: let scientists experiment with quality concepts in an easy and intuitive way, by observing the effect of their personal bias on the data
