Databases for Knowledge Discovery • Jan H. van Bemmel, Erasmus University Rotterdam
Databases for Knowledge Discovery • Natural sciences • physics, chemistry, engineering • models, experiments, theories ► 'hard' data • Humanities • arts, social sciences, economics • behavioural studies, text analysis ► 'soft' data • Biomedical and health sciences • biomedicine, health sciences • models, experiments, studies ► hard & soft data
Databases for Knowledge Discovery • Biomedicine & health sciences • Biomedical research related to the 'hard' scientific approach as in physics and engineering • Clinical research using rather 'hard' data, and sometimes ‘soft’ subjective observations • Population-based research • data collected from populations of healthy and ill persons • This research can be subdivided into • retrospective research • prospective research
Databases for Knowledge Discovery • Biomedicine and health sciences • Basic research: experiments • Clinical research: patients • Health research: populations • Discovery of new scientific knowledge from large databases of measurements, observations and interpretations [Figure: three regional databases and a research database]
Databases for Knowledge Discovery • Biomedicine & health sciences • Until recently, basic research in biomedicine was done on organs and organisms. • Nowadays the fundamental challenges lie at a much smaller scale: the level of molecules and cells. • Research on organs and organisms is still of interest: breakthroughs from biomolecular research have to be translated to higher levels.
Databases for Knowledge Discovery • Biomedical research • Knowledge contained in multiple databases • of refereed articles and • databases on genes and proteins • MedLine: 11 million abstracts; 500,000/year • searching for articles in sphere of interest • how to find new knowledge? • how to cope with serendipity?
Databases for Knowledge Discovery • Biomedical research • Different methods to retrieve knowledge: • simple Boolean expressions • too specific: few references • too broad: avalanche of references • use of a more complex 'fingerprint' • combination of different databases • complex retrieval using an ontology database [Figure labels: forward, inverse]
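The contrast between Boolean retrieval and fingerprint matching can be sketched as follows. This is a minimal illustration, not the method used in the talk: the mini-corpus, the concept weights and the use of cosine similarity are all assumptions made for the example.

```python
from math import sqrt

# Hypothetical mini-corpus: each abstract reduced to a weighted concept
# "fingerprint" (concept -> weight). Names and weights are invented.
abstracts = {
    "a1": {"thyroid": 3, "TSH": 2, "guideline": 1},
    "a2": {"thyroid": 1, "hypertension": 2},
    "a3": {"TSH": 1, "free T4": 2, "thyroid": 2},
    "a4": {"ECG": 3, "hypertension": 1},
}


def boolean_and(terms):
    """Abstracts containing ALL terms (tends to be too specific)."""
    return sorted(d for d, fp in abstracts.items() if all(t in fp for t in terms))


def boolean_or(terms):
    """Abstracts containing ANY term (tends to be too broad)."""
    return sorted(d for d, fp in abstracts.items() if any(t in fp for t in terms))


def cosine(u, v):
    """Cosine similarity between two concept fingerprints."""
    dot = sum(w * v.get(c, 0) for c, w in u.items())
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def rank_by_fingerprint(query):
    """Abstracts ordered by similarity to the query fingerprint, best first."""
    scored = [(cosine(query, fp), d) for d, fp in abstracts.items()]
    return [d for score, d in sorted(scored, reverse=True) if score > 0]


query = {"thyroid": 1, "TSH": 1, "free T4": 1}
print(boolean_and(list(query)))    # only the abstract with every term
print(boolean_or(list(query)))     # every abstract with at least one term
print(rank_by_fingerprint(query))  # graded ranking instead of a yes/no cut
```

The ranked list keeps partially matching abstracts but orders them, which is one way to avoid both the "few references" and the "avalanche" outcomes.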
Databases for Knowledge Discovery • Biomedical research [Figure: fingerprints. Labels: emails, Word, RFPs, jobs, CVs, skills, articles, books; content fingerprints, averaged into people fingerprints and organisation fingerprints]
Databases for Knowledge Discovery • Biomedical research • Data mining • Matching methods applied to a genetics database and a literature database • Find new associations: A – B, B – C ► A – C
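The A – B, B – C ► A – C pattern can be sketched as a search for pairs that share an intermediate concept but are not yet directly linked. The pairs below are placeholders, and the symmetric, unweighted treatment of associations is an assumption of this sketch rather than something stated on the slide.

```python
from collections import defaultdict

# Hypothetical associations; in the slide's setting these would come from
# matching a literature database against a genetics database.
literature_pairs = [("A", "B")]
genetics_pairs = [("B", "C"), ("C", "D")]


def build_graph(pairs):
    """Undirected association graph: concept -> set of associated concepts."""
    graph = defaultdict(set)
    for x, y in pairs:
        graph[x].add(y)
        graph[y].add(x)
    return graph


def new_associations(pairs):
    """Return (a, b, c) triples where a-b and b-c are known but a-c is not."""
    graph = build_graph(pairs)
    known = {frozenset(p) for p in pairs}
    found = set()
    for b, neighbours in graph.items():
        for a in neighbours:
            for c in neighbours:
                # a < c avoids reporting the same candidate pair twice
                if a < c and frozenset((a, c)) not in known:
                    found.add((a, b, c))
    return sorted(found)


candidates = new_associations(literature_pairs + genetics_pairs)
for a, b, c in candidates:
    print(f"{a} - {c} (via {b})")
```

Each candidate is only a hypothesis to be checked; the sketch makes no attempt to score or validate it.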
Databases for Knowledge Discovery • Biomedical research • Composition of a thesaurus from separate databases • GDB: AAA; BBB • LocusLink: AAA; CCC • Hugo NC: AAA • OMIM: BBB; CCC • SwissProt: BBB ► concept: AAA; synonyms: BBB; CCC
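One way to arrive at the merged entry above is to treat terms listed together in any source as synonyms and group them transitively. The sketch below does this with a small union-find structure. The slide does not say how the preferred name is chosen; taking the term listed by Hugo NC, with an alphabetical fallback, is an assumption of this example.

```python
# Source entries exactly as listed on the slide.
sources = {
    "GDB": ["AAA", "BBB"],
    "LocusLink": ["AAA", "CCC"],
    "Hugo NC": ["AAA"],
    "OMIM": ["BBB", "CCC"],
    "SwissProt": ["BBB"],
}


def compose_thesaurus(sources, authority="Hugo NC"):
    """Merge co-listed terms into concepts with one preferred name each."""
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path halving
            term = parent[term]
        return term

    def union(a, b):
        parent[find(a)] = find(b)

    # Terms that appear in the same source entry name the same concept.
    for terms in sources.values():
        for term in terms:
            union(terms[0], term)

    groups = {}
    for term in list(parent):
        groups.setdefault(find(term), set()).add(term)

    # ASSUMPTION: the authority source supplies the preferred name.
    official_names = set(sources.get(authority, []))
    thesaurus = []
    for terms in groups.values():
        official = sorted(terms & official_names)
        concept = official[0] if official else min(terms)
        thesaurus.append({"concept": concept, "synonyms": sorted(terms - {concept})})
    return sorted(thesaurus, key=lambda entry: entry["concept"])


thesaurus = compose_thesaurus(sources)
print(thesaurus)
```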
Databases for Knowledge Discovery • Biomedical research • ACS: Associative Concept Space [Figure: ontology database, Collexion, ACS constructor, ACS model, ACS viewer, ACS validation]
Databases for Knowledge Discovery • Clinical research [Figure: Growth of information systems in primary care. Percentage of primary care practices with computer-based patient records, UK and NL, by year, 1978 to 1998]
Databases for Knowledge Discovery • Clinical research BloodLink The impact of guidelines-based decision support on lab test ordering in primary care.
Databases for Knowledge Discovery • Clinical research • BloodLink: controlled clinical trial
                       Control Group    Guideline Group
  No. of practices     21               23
  No. of physicians    29               31
  No. of patients      97,177           98,432
  Sick funds           52%              52%
  No. of order forms   12,786           12,700
Databases for Knowledge Discovery • Clinical research • BloodLink: number of tests ordered
  Test                 BloodLink-Guideline   Difference   BloodLink-Control
  ESR                  5612                  -29%         7932
  Hemoglobin           6061                  -17%         7332
  WBC count            3719                  -26%         5039
  Hematocrit           3611                  -25%         4830
  Creatinine           3314                  -34%         5024
  Erythrocytes         3360                  -28%         4690
  MCV                  3159                  -32%         4642
  Differential count   3060                  -26%         4151
  Cholesterol          3413                  -1%          4354
  TSH                  3213                  +9%          2954
  Gamma-GT             2004                  -42%         3466
  Glucose in serum     2964                  +19%         2501
  ALAT (SGPT)          1892                  -34%         2850
  Potassium            1096                  -53%         2320
  ASAT (SGOT)          959                   -58%         2269
  Glucose fasting      1286                  -20%         1611
  Triglycerides        1398                  +1%          1380
  HDL cholesterol      1350                  -2%          1382
  Sodium               745                   -30%         1070
  Free T4              618                   -47%         1163
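For most rows the Difference column is consistent with the relative change of the guideline count against the control count, rounded to a whole percentage. The sketch below checks this reading on four rows from the table; the formula is inferred from the numbers and is not stated on the slide.

```python
# (guideline count, control count) for a few tests, copied from the table.
orders = {
    "ESR": (5612, 7932),
    "TSH": (3213, 2954),
    "Potassium": (1096, 2320),
    "Free T4": (618, 1163),
}


def percent_difference(guideline, control):
    """Relative change of guideline vs control, as a rounded percentage."""
    return round(100 * (guideline - control) / control)


differences = {test: percent_difference(g, c) for test, (g, c) in orders.items()}
for test, diff in differences.items():
    print(f"{test}: {diff:+d}%")
```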
Databases for Knowledge Discovery • Clinical research • BloodLink: in cases of thyroid disease, physicians were used to ordering the free T4 test (free thyroxine); the protocol prescribed the TSH test (thyroid-stimulating hormone) instead.
  Test       BloodLink-Guideline   Difference   BloodLink-Control
  TSH        3213                  +9%          2954
  Free T4    618                   -47%         1163
Databases for Knowledge Discovery • Clinical research • BloodLink: tests such as SGOT (serum glutamic oxalacetic transaminase), Gamma-GT and SGPT had been ordered almost automatically; the protocols, however, did not support such tests. The same applies to potassium (K+).
  Test           BloodLink-Guideline   Difference   BloodLink-Control
  Gamma-GT       2004                  -42%         3466
  ALAT (SGPT)    1892                  -34%         2850
  Potassium      1096                  -53%         2320
  ASAT (SGOT)    959                   -58%         2269
Databases for Knowledge Discovery • Clinical research • BloodLink: controlled clinical trial
                                        Control Group    Guideline Group
  No. of practices                      21               23
  No. of GPs                            29               31
  No. of patients                       97,177           98,432
  Sick funds                            52%              52%
  No. of order forms                    12,786           12,700
  % of forms generated by BloodLink     89%              73%
  No. of requested tests                87,634           70,479
  Average No. of tests per order (1)    6.9              5.5
  (1) Student's t-test, N=44, p<0.001
Databases for Knowledge Discovery • Clinical research Cardiology
Databases for Knowledge Discovery • Clinical research • Critiquing system for hypertension [Figure: sensitivity (%) plotted against specificity (%), both axes 0 to 100]
  #     sens    spec
  1     0.94    0.36
  2     0.86    0.70
  3     0.72    0.82
  4     0.65    0.75
  5     0.73    0.69
  6     0.70    0.78
  7     0.88    0.52
  8     0.74    0.77
  CS    0.74    0.88
Databases for Knowledge Discovery • Clinical research • Classification results (%) per reference class (rows: reference)
  Class    N      NL     LVH    RVH    BVH    AMI    IMI    MIX    OTH    VH+MI
  NL       382    95.5   0.9    0.4    0.0    1.4    1.6    0.0    0.1
  LVH      183    19.0   69.0   0.5    0.0    4.3    6.9    0.2    0.0
  RVH      55     40.6   6.7    45.8   2.7    1.2    2.1    0.0    0.9
  BVH      53     22.0   54.7   14.5   1.6    5.3    1.9    0.0    0.0
  AMI      170    14.3   2.6    0.6    0.0    80.0   1.8    0.7    0.0
  IMI      273    19.8   2.6    0.2    0.0    0.7    76.7   0.1    0.0
  MIX      73     2.5    4.1    1.6    0.0    51.6   37.4   2.7    0.0
  VH+MI    31     22.6   0.0    0.0    0.0    0.0    0.0    0.0    16.1   61.3
Databases for Knowledge Discovery • Clinical research • Computer-assisted ECG interpretation • Assessment of different interpretation programs [Figure: % agreement with referees against % agreement with clinical data, both axes 60 to 90, for cardiologists and systems]
Databases for Knowledge Discovery • Population-based research: retrospective • Post-marketing surveillance of drugs • Combinations of drugs: interactions • Longitudinal databases of about 500,000 patients • Patient privacy and data security [Figure: computer-based patient records (CPR) in health care practices and a central database, shown over the chart "Growth of information systems in primary care" (percentage of primary care practices, UK and NL, 1978 to 1998)]
Databases for Knowledge Discovery • Population-based research: retrospective [Figure: research database; research data; population-based research]
Databases for Knowledge Discovery • Population-based research: retrospective • Pedigree tree • coupling of clinical data to a genealogical database • municipal records of > 20,000 individuals • each disorder could be coupled to a common ancestor: genes involved in diabetes, Alzheimer's disease, etc. [Figure: pedigree tree (recessive); research database; research data; population-based research]
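Coupling a disorder to a common ancestor can be sketched as intersecting the ancestor sets of the affected individuals in the genealogical database. The pedigree below is entirely hypothetical, and the child-to-parents dictionary is an assumed representation, not the structure of the actual municipal records.

```python
# Hypothetical pedigree: person -> list of known parents.
parents = {
    "patient1": ["p1_father", "p1_mother"],
    "patient2": ["p2_father", "p2_mother"],
    "p1_father": ["founder"],
    "p2_mother": ["founder"],
}


def ancestors(person):
    """All recorded ancestors of a person (parents, grandparents, ...)."""
    seen = set()
    stack = [person]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def common_ancestors(patients):
    """Ancestors shared by every patient with the same disorder."""
    sets = [ancestors(p) for p in patients]
    return set.intersection(*sets) if sets else set()


shared = common_ancestors(["patient1", "patient2"])
print(shared)
```

A shared ancestor only marks a candidate lineage. Linking it to genes involved in a disorder requires the clinical and genetic data mentioned on the slide.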
Databases for Knowledge Discovery • Population-based research: prospective • Rotterdam Study [Figure: research database; research data; population-based research]
Databases for Knowledge Discovery • Population-based research: prospective • Rotterdam Study • Prospective longitudinal database • 10,000 persons > 55 years of age • relationships between risks and diseases • cardiovascular and vessel-wall diseases, glaucoma, neurologic diseases (Alzheimer), osteoporosis [Figure: research database; research data; population-based research]
Databases for Knowledge Discovery • Population-based research: prospective • Generation R [Figure: research database; research data; population-based research]
Databases for Knowledge Discovery • Population-based research: prospective • Generation R • Prospective longitudinal database • 10,000 children from pregnancy onwards • relationships between risks and genetic/environmental data • perinatal circumstances, diseases at young age, cultural backgrounds, impact of education, etc. [Figure: research database; research data; population-based research]
Databases for Knowledge Discovery • A formal ('forward') method of analysing large research databases may hamper the flexible attitude of a researcher who does not know in advance what to expect (serendipity). • 'Hard' and 'soft' examples from biomedicine and the health sciences show that computers can be very helpful in finding new and unforeseen ('inverse') associations between the data stored in research databases. • Well-documented databases are an enormous treasure for the advancement of scientific research.