Data quality issues in population-based studies; Impacts on harmonisation
09/04/2006
Data Quality/Usability Issues in e-Health: Practice, Process and Policy
Isabel Fortier, Ph.D.
P3G Observatory
Génome Québec
Data quality
Numbers are never valid without context
Objectives
• Give a brief overview of the factors or contexts influencing data quality in studies as well as in aggregated analyses
• Discuss their potential impacts on results
• Initiate discussion on potential ways to increase data quality
Principal focus
• Selection and follow-up of participants
• Information collection and treatment
• Data management
Quality of the sample
• Potential bias if the disease or the “exposure” status influences the selection of the subjects or their participation in the study
• The study sample is “never” totally representative of the reference population
Example
Participation in a study can give access to better clinical follow-up or to a comprehensive health evaluation that is not easily accessible through the medical system in place.
Impact: potentially higher participation of subjects who have health problems or are at risk of developing them. This can lead to an over-representation of unhealthy persons in the database.
Depending on the context, selection bias can increase or decrease the strength of the association.
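The last point can be illustrated with a small worked example. This is a sketch only: the population counts and participation probabilities below are invented for illustration and do not come from the presentation. It shows that the observed odds ratio (OR) is unchanged when participation depends on health status alone, and moves up or down when participation depends on health status and exposure together.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio of a 2x2 table.

    a = exposed cases, b = exposed non-cases,
    c = unexposed cases, d = unexposed non-cases.
    """
    return (a * d) / (b * c)


def select(table, participation):
    """Expected cell counts after applying a participation probability to each cell."""
    return tuple(n * p for n, p in zip(table, participation))


# Hypothetical reference population (a, b, c, d); true OR = 2.25.
population = (200, 800, 100, 900)

# Hypothetical participation probabilities per cell, in the same (a, b, c, d) order.
scenarios = {
    # Unhealthy people participate more, equally in exposed and unexposed.
    "depends on disease only": (0.6, 0.3, 0.6, 0.3),
    # Exposed unhealthy people participate even more.
    "exposed cases over-selected": (0.8, 0.3, 0.6, 0.3),
    # Exposed unhealthy people participate less.
    "exposed cases under-selected": (0.4, 0.3, 0.6, 0.3),
}

true_or = odds_ratio(*population)
observed = {name: odds_ratio(*select(population, p)) for name, p in scenarios.items()}

for name, value in observed.items():
    print(f"{name}: observed OR = {value:.2f} (true OR = {true_or:.2f})")
```

With these assumed numbers the three scenarios give observed ORs of 2.25, 3.00 and 1.50 against a true OR of 2.25.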
Possible ways to limit the impact
• Develop good selection and follow-up designs, taking into account scientific validity but also feasibility (costs, infrastructure, organisation of medical care in the country, etc.)
• Identify and keep in mind the limits of the study
• If possible, compare the characteristics of participants and non-participants to identify discrepancies
• Limit the interpretation of results to the appropriate reference population
Aggregation: Selection criteria frequently used in population-based biobanks
• Age group (~100%)
• Country of residence (~100%)
• Residence in specific communities/geographic areas (>75%)
• Sex (excluding pregnancy) (~10%)
• Pregnancy (~5-10%)
• Employment status (~5-10%)
• etc.
Extracted from P3G Observatory
Aggregation: a simple problem?
Age groups and sampling design; 8 population-based studies
* Different selection biases among studies
Extracted from P3G Observatory
Quality of the information collected
It is RARELY possible to perfectly classify participants: health conditions, exposures, socioeconomic status, genetic characteristics, etc.
Participant classification
(J. Lacroix, MSO6027)
Differential and non-differential information bias
• Differential bias: classification error varies between the groups under study
  Impact: increases OR decreases the strength of the association
• Non-differential bias: classification error is similar in the groups under study
  Impact: decreases the strength of the association
Information bias: Definition of health status
• Heterogeneity in the expression of the health problem under study
  • Increasing focus on complex diseases, often without efficient tools to define expression (subtypes, co-morbidity, etc.)
• Evaluators
  • Potential subjectivity of the interviewers, study staff, etc.
• Participants
  • Subjectivity in the responses obtained from participants
• Diagnostic tools and information sources
  • Validity (ability to distinguish between those who have a condition and those who do not)
Information sources: 10 population-based studies Extracted from P3G Observatory
Governmental databases: An example
• Medical services (administrative DB), for 60% of the consultations:
  • Date, professional class and specialty, diagnostic classification (principal reason for the consultation, recorded for payment), etc.
• Hospitalizations (MED-ECHO), for all hospitalizations:
  • Date of admission, type of admission, hospitalization duration, principal diagnosis (ICD-9/10), secondary diagnoses, treatments, etc.
• Pharmaceutical services, for 50% of the prescriptions:
  • Date, medication class, code, form, dosage and quantity, duration of treatment, physician specialty.
• Deaths:
  • Date of death, location, cause of death, etc.
Potential definitions of asthma cases
• Asthma as principal reason for consultation (RAMQ)
  • At least once or several times?
  • Information on consultation available only for 60% of the population
  • Influence of age distribution on follow-up (Cartagène: 24-75 years)
  • Only limited information on secondary diagnoses
• Medication for asthma prescribed
  • Medications not all specific to asthma
  • Medications prescribed but not necessarily taken
  • Information available for only 50% of the population
• Asthma as cause of death
  • Asthma will rarely be coded as the cause of death
• Hospitalization for asthma (MED-ECHO)
• Or merge information between databases to define cases
How can this information be compared between studies? What is the validity of these potential outcomes (sensitivity/specificity)?
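To make the "merge information between databases" option concrete, here is a minimal sketch of a rule-based case definition. The record layout, field names and thresholds are hypothetical and are not taken from RAMQ or MED-ECHO; the point is only that the rule, and the handling of participants not covered by a source, must be written down explicitly before studies can be compared.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParticipantRecord:
    # All field names are hypothetical. None means "source does not cover this person".
    asthma_consultations: Optional[int]   # consultations with asthma as principal reason
    asthma_hospitalisation: bool          # at least one hospitalisation for asthma
    asthma_prescriptions: Optional[int]   # prescriptions of asthma-related medication


def classify(rec, min_consultations=2, min_prescriptions=2):
    """Return 'case', 'non-case' or 'unknown' under one possible definition."""
    if rec.asthma_hospitalisation:
        return "case"
    if rec.asthma_consultations is not None and rec.asthma_consultations >= min_consultations:
        return "case"
    if rec.asthma_prescriptions is not None and rec.asthma_prescriptions >= min_prescriptions:
        return "case"
    # No positive evidence: only call it a non-case if every source covered the person.
    if rec.asthma_consultations is None or rec.asthma_prescriptions is None:
        return "unknown"
    return "non-case"


examples = [
    ParticipantRecord(3, False, 0),        # repeated consultations
    ParticipantRecord(0, False, 0),        # covered everywhere, no evidence
    ParticipantRecord(None, False, 1),     # consultations not covered
]
labels = [classify(r) for r in examples]
print(labels)
```

Changing `min_consultations` from 2 to 1 answers the slide's "at least once or several times?" differently, and each choice yields a definition with its own sensitivity and specificity.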
Potential impact on results
• “Gold standard”: OR = 2
• Study measures, with non-differential bias:
  • Sensitivity (E = non-E) = 0.8
  • Specificity (E = non-E) = 0.9
  • OR = 1.6
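The attenuation can be reproduced with a short calculation. The slide does not state the underlying frequencies, so the sketch below assumes that the outcome is the misclassified variable and that half of the unexposed group truly has it; both are assumptions made here for illustration. Under those assumptions the observed OR comes out at about 1.6.

```python
def odds(p):
    """Convert a proportion to odds."""
    return p / (1 - p)


def misclassified(p, sensitivity, specificity):
    """Proportion classified as cases when the true proportion of cases is p."""
    return sensitivity * p + (1 - specificity) * (1 - p)


def observed_odds_ratio(true_or, p_unexposed, sensitivity, specificity):
    """Observed OR when the same sensitivity/specificity apply to both exposure groups."""
    exposed_odds = true_or * odds(p_unexposed)
    p_exposed = exposed_odds / (1 + exposed_odds)
    obs_exposed = misclassified(p_exposed, sensitivity, specificity)
    obs_unexposed = misclassified(p_unexposed, sensitivity, specificity)
    return odds(obs_exposed) / odds(obs_unexposed)


# p_unexposed=0.5 is an assumed value, not given in the presentation.
result = observed_odds_ratio(true_or=2.0, p_unexposed=0.5, sensitivity=0.8, specificity=0.9)
print(round(result, 2))
```

The size of the attenuation depends on the assumed frequency: with `p_unexposed=0.1` the same sensitivity and specificity give an observed OR nearer 1.4.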
Aggregation of information from questionnaires: An example
Targeted outcome: Cancers
• Ever had cancer
• Type of cancer
• Onset of symptoms or date of diagnosis
Ever had cancer
• Study 1: Have you ever had cancer? Yes, No, I don't know
• Study 2: Have you ever been told by a doctor or other health professional that you had cancer or a malignancy of any kind? Yes, No, Refused, Don't know
• Study 3: Has a physician ever told you that you had any of the following cancers? (list of cancers)
Extracted from P3G Observatory
Type of cancer
• Study 1: What kind of cancer? ____________________________
• Study 2: Has a physician ever told you that you had any of the following cancers? Prostate cancer, Lung or bronchial cancer, Colon or rectal cancer, Bladder cancer, Lymphoma, Other cancer (define)
• Study 3: What kind of cancer was it?
Extracted from P3G Observatory
Onset of symptoms or date of diagnosis
• Study 1: In which year was this ascertained? Year |_|_|_|_| or age at that time |_|_|
• Study 2: How old were you when the cancer was first diagnosed? |___|___|___| age in years, Refused, Don't know
• Study 3: Prostate cancer: O Never O Before October 2001 O Oct. 2001 - July 2003 O After July 2003
Extracted from P3G Observatory
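One way to put these three answer formats on a common footing is to express each as a range of possible calendar years. The sketch below is an illustration under stated assumptions, not a method described in the presentation: it assumes the participant's birth year is available, and the function names and the year-range representation are hypothetical.

```python
from typing import Optional, Tuple

# (earliest possible year, latest possible year); None means unbounded on that side.
YearRange = Tuple[Optional[int], Optional[int]]


def from_year(year: int) -> YearRange:
    """Study 1 style: calendar year of diagnosis."""
    return (year, year)


def from_age(birth_year: int, age: int) -> YearRange:
    """Study 1 or 2 style: age at diagnosis, which spans two calendar years."""
    return (birth_year + age, birth_year + age + 1)


# Study 3 style: fixed periods, coded here at calendar-year resolution.
STUDY3_PERIODS = {
    "Before October 2001": (None, 2001),
    "Oct. 2001 - July 2003": (2001, 2003),
    "After July 2003": (2003, None),
}


def from_period(label: str) -> Optional[YearRange]:
    """Return the year range for a Study 3 answer, or None for 'Never'."""
    if label == "Never":
        return None
    return STUDY3_PERIODS[label]


def may_coincide(a: YearRange, b: YearRange) -> bool:
    """True if the two ranges share at least one possible calendar year."""
    a_lo, a_hi = a
    b_lo, b_hi = b
    a_starts_in_time = a_lo is None or b_hi is None or a_lo <= b_hi
    b_starts_in_time = b_lo is None or a_hi is None or b_lo <= a_hi
    return a_starts_in_time and b_starts_in_time


print(from_age(1950, 52))
print(may_coincide(from_year(2002), from_period("Oct. 2001 - July 2003")))
```

The harmonised variable can be no more precise than the coarsest source: once Study 3 is included, only the three periods can be compared across all studies.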
Information bias: Definition of risk factors
• Tools and procedures used for estimation of genetic and environmental factors (questionnaires, laboratory standard operating procedures, technology used, etc.)
  • Validity…
• Evaluators/technicians
  • Potential subjectivity of interviewers, bias introduced by study staff, etc.
• Participants (questionnaires)
  • Subjectivity in the responses obtained from participants
• Definition of the environmental exposure in time
  • Large variations in exposure level during follow-up
Aggregation: An example
Marital status
• Please indicate your current marital status by ticking the appropriate box. SINGLE, MARRIED OR LIVING AS MARRIED, WIDOWED, SEPARATED, DIVORCED
• Are you now married, widowed, divorced, separated, never married or living with a partner? MARRIED, WIDOWED, DIVORCED, SEPARATED, NEVER MARRIED, LIVING WITH PARTNER, REFUSED, DON'T KNOW
• What is your marital status? MARRIED, SINGLE, DIVORCED, WIDOWED
Extracted from P3G Observatory
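A harmonisation rule for these three questions could be sketched as a set of explicit recoding tables. The study labels (`study1` to `study3`, in the order listed above), the target categories and the individual mapping decisions are assumptions made for this illustration; for instance, treating "NEVER MARRIED" and "SINGLE" as equivalent is a judgment call that a real harmonisation project would need to document.

```python
MISSING = "missing"

# Hypothetical recoding tables: original answer -> harmonised category.
MAPPINGS = {
    "study1": {
        "SINGLE": "single",
        "MARRIED OR LIVING AS MARRIED": "partnered",
        "WIDOWED": "widowed",
        "SEPARATED": "separated_or_divorced",
        "DIVORCED": "separated_or_divorced",
    },
    "study2": {
        "MARRIED": "partnered",
        "LIVING WITH PARTNER": "partnered",
        "WIDOWED": "widowed",
        "DIVORCED": "separated_or_divorced",
        "SEPARATED": "separated_or_divorced",
        "NEVER MARRIED": "single",
        "REFUSED": MISSING,
        "DON'T KNOW": MISSING,
    },
    "study3": {
        "MARRIED": "partnered",
        "SINGLE": "single",
        "DIVORCED": "separated_or_divorced",
        "WIDOWED": "widowed",
    },
}


def harmonise(study, response):
    """Recode one answer; fail loudly on answers the tables do not cover."""
    key = response.strip().upper()
    try:
        return MAPPINGS[study][key]
    except KeyError:
        raise ValueError(f"no harmonisation rule for {study!r} response {response!r}") from None


print(harmonise("study1", "Married or living as married"))
print(harmonise("study2", "Never married"))
```

Because the third question offers no "separated" or "living with partner" option, those participants were presumably forced into another category at collection time, which no recoding table can undo.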
Data management: A complex task
Biobank data
• Collected and produced in different centers
  • Recruitment centers or clinics, genotyping laboratories, etc.
• Heterogeneous
  • Biochemical and physiological measures, genealogies, genotypes, etc.
• Various formats
  • Databases, paper, electronic files, XML, etc.
  • Codification rules that vary between centers
• Extensive
  • Large number of participants
  • Longitudinal data
  • New high-throughput genotyping technologies
• Confidential information
Data management: Impacts on quality
• Data entry
  • Carried out by staff with various backgrounds, not necessarily aware of the consequences of inexact or incomplete data for the overall quality of the study
• Identification and handling of samples and questionnaires (DNA extraction, biochemical analysis, etc.)
  • High potential for errors in the processes; numerous manipulations
• Key generation and management (identification codes)
  • A complex and crucial procedure to protect identity and avoid errors in the correspondence between identification numbers and individuals
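As an illustration of the key-generation point, one common technique is to derive a study code from the participant identifier with a keyed hash, so that the same person always receives the same code while the code cannot be reversed without the secret key. This is a sketch of that general technique using the Python standard library; it is not the procedure used by any study or tool named in this presentation, and the function name, identifier format and key are hypothetical.

```python
import hashlib
import hmac


def pseudonym(participant_id: str, secret_key: bytes, length: int = 16) -> str:
    """Derive a stable study code from an identifier using HMAC-SHA256.

    The hex digest is truncated to `length` characters for readability;
    shorter codes raise the (small) chance of two people sharing a code.
    """
    digest = hmac.new(secret_key, participant_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]


# Hypothetical key: in practice it would be generated randomly and stored securely.
key = b"replace-with-a-randomly-generated-secret"

code_a = pseudonym("HEALTH-NUMBER-0001", key)
code_b = pseudonym("HEALTH-NUMBER-0002", key)
print(code_a, code_b)
```

Whoever holds the key can regenerate the correspondence between individuals and codes, so key custody is part of the same procedure.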
Impacts on quality…
• Size of databases
  • Millions of genotypes correlated with millions of epidemiological outcomes
  • Increasing complexity of data transfer, storage, query and analysis
• Validation
  • Essential to ensure continuous quality control, including cross-validations, statistical validations, etc.
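A very small sketch of what automated validation might look like at data-entry or staging time is given below. The record fields, the allowed codes and the plausibility limits are all hypothetical; a real study would derive them from its own codification rules.

```python
ALLOWED_SEX_CODES = {"F", "M"}  # hypothetical codification rule


def validate(record, seen_ids):
    """Return a list of problems found in one record (empty list = record passes)."""
    problems = []

    # Identification: every record needs a unique participant code.
    pid = record.get("participant_id")
    if not pid:
        problems.append("missing participant_id")
    elif pid in seen_ids:
        problems.append(f"duplicate participant_id {pid}")
    else:
        seen_ids.add(pid)

    # Codification: values must belong to the agreed code list.
    if record.get("sex") not in ALLOWED_SEX_CODES:
        problems.append("sex code outside the agreed codification")

    # Cross-field check: age at recruitment must be plausible.
    birth, recruited = record.get("birth_year"), record.get("recruitment_year")
    if birth is None or recruited is None:
        problems.append("missing birth_year or recruitment_year")
    elif not 0 <= recruited - birth <= 120:
        problems.append("implausible age at recruitment")

    return problems


records = [
    {"participant_id": "P001", "sex": "F", "birth_year": 1960, "recruitment_year": 2005},
    {"participant_id": "P001", "sex": "X", "birth_year": 2010, "recruitment_year": 2005},
]
seen = set()
report = [validate(r, seen) for r in records]
print(report)
```

Here the first record passes and the second is flagged three times: duplicate identifier, unknown sex code, and a birth year later than the recruitment year.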
Conclusion
Carrying out a population-based study is a complex task, generally achieved with limited resources.
To obtain sufficient statistical power to investigate the impact of genes and environment on complex diseases, aggregation of data between studies will often be needed.
How to facilitate aggregation of information?
• Facilitate exchange of expertise and merging of efforts (networks of collaboration)
• Allow easy access to relevant protocols, procedures, questionnaires, etc.
• If relevant, facilitate development of common standard operating procedures or tools
• Work prospectively, for aggregation of future studies
Different initiatives…
P3G Observatory
A tool among others
• TOOLS PERTAINING TO STUDIES
  Descriptive information on worldwide population-based biobanks and tools for comparison and harmonisation between specific biobanks.
  • Description: for targeted studies, description of methods, data, ethics and governance rules, operating procedures, etc.
  • Comparison: tools for comparison, between targeted studies, of the information collected or produced and of the procedures used.
  • Harmonisation: specific models and procedures for harmonisation of the information collected or produced between subgroups of studies.
• GENERAL TOOLS
  Repository of reference tools and documentation to support the activities of population-based biobanks. For example:
  • Repository of standard operating procedures or good practice guides.
  • Methodological documentation in epidemiology or genomics.
  • Reference tools in statistics
  • Information technology applications (OpenBiobank)
  • Repository of useful websites
  • Etc.
[Diagram labels: Design of studies; Ethics and governance; Information collection/treatment; Information technology; Data analysis]
OpenBiobank Application Suite
1 Participants Recruitment and Evaluation Management Application
2 Epidemiological Data Application
3 Samples Management Application
4 Biochemical Analysis Application
5 Genotyping Application
6 Integration Application
7 Genetic Statistical Package
[Other diagram labels: Projects & Contacts Management; Data Staging & Validation; Identification protection / Anonymization; DATA WAREHOUSE; Data Querying and Reporting; Data from different sources: Governmental DB, Project Results, etc.]