Creation and Maintenance of GeneKeyDB • Research being conducted by Kevin Kastner • Under the direction of Dr. Erich Baker
The Problem • There exist thousands of biomedical data sources. • In 2006, there were ~557 relevant public resources in molecular biology. • The number is growing rapidly. • 203 sources in 1999 • 226 sources in 2000 • 277 sources in 2001.
The Problem • Traditional database approaches are too rigidly structured. • Scientific objects change identifiers over time. • Gene names change over time. • The Human Genome Nomenclature Database (HUGO) contains 13,594 active symbols, 9,635 literature aliases, and 2,739 withdrawn symbols. • SIR2L1 (withdrawn) is a synonym for SIRT1 and sir2-like 1.
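The symbol drift described on this slide can be illustrated with a small alias table. The following is a minimal sketch, assuming a simple in-memory map; the class `GeneSymbolResolver` and its methods are hypothetical and are not part of GeneKeyDB. Only the SIR2L1 / SIRT1 / sir2-like 1 example comes from the slide.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper (not a GeneKeyDB class): maps aliases and withdrawn
// symbols to the currently approved gene symbol.
final class GeneSymbolResolver {
    private final Map<String, String> nameToApproved = new HashMap<>();

    // Registers an approved symbol; it resolves to itself.
    void addApproved(String symbol) {
        nameToApproved.put(normalize(symbol), symbol);
    }

    // Registers an alias or withdrawn symbol for an approved symbol.
    void addAlias(String alias, String approvedSymbol) {
        nameToApproved.put(normalize(alias), approvedSymbol);
    }

    // Returns the approved symbol, or empty if the name is unknown.
    Optional<String> resolve(String name) {
        return Optional.ofNullable(nameToApproved.get(normalize(name)));
    }

    // Matching ignores case and surrounding whitespace.
    private static String normalize(String name) {
        return name.trim().toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        GeneSymbolResolver resolver = new GeneSymbolResolver();
        resolver.addApproved("SIRT1");
        resolver.addAlias("SIR2L1", "SIRT1");       // withdrawn symbol from the slide
        resolver.addAlias("sir2-like 1", "SIRT1");  // descriptive name from the slide
        System.out.println(resolver.resolve("SIR2L1").orElse("unknown"));
    }
}
```

A lookup structure of this kind lets queries accept whichever name a user or paper happens to use, while the stored data stays keyed on one identifier.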
The Solution • GeneKeyDB • A gene-centered relational database developed to enhance data mining in biological data sets. • GeneKeyDB relies primarily on existing database identifiers derived from community databases (NCBI, GO, Ensembl, and others) as well as the known relationships among those identifiers. • Version 1 is already out! • http://www.biomedcentral.com/1471-2105/6/72
Weaknesses of Version 1 • Can no longer be updated • Complex queries must be made to the database in order to obtain desired information
Complex Queries
    SELECT ll_xp_cdd.cdd_name, ll_np_cdd.cdd_name, organism
    FROM ll_xp_cdd, ll_np_cdd, ll_locus
    WHERE ll_xp_cdd.cdd_score = ll_np_cdd.cdd_score
      AND ll_id IN (
        SELECT ll_id FROM ll_refseq_xm
        WHERE ll_refseq_xm_id IN (
          SELECT ll_refseq_xm_id FROM ll_xp_cdd, ll_np_cdd
          WHERE ll_xp_cdd.cdd_score = ll_np_cdd.cdd_score))
      AND ll_id IN (
        SELECT ll_id FROM ll_refseq_nm
        WHERE ll_refseq_nm_id IN (
          SELECT ll_refseq_nm_id FROM ll_xp_cdd, ll_np_cdd
          WHERE ll_xp_cdd.cdd_score = ll_np_cdd.cdd_score));
Current Research • Creation of APIs that validate the data in the database and make querying much easier for the user. • One-step updating of the database and the information it contains.
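As one illustration of what a validation API might check, the sketch below tests whether an identifier has the general shape of a RefSeq accession. It assumes the NM/XM/NP/XP prefixes suggested by the table names in the Version 1 query (`ll_refseq_nm`, `ll_refseq_xm`, `ll_np_cdd`, `ll_xp_cdd`). The class name and the exact rule are hypothetical; the slides do not specify what the GeneKeyDB APIs actually validate.

```java
import java.util.regex.Pattern;

// Hypothetical validator (not a GeneKeyDB class): checks the general shape
// of a RefSeq-style accession such as "NM_000546" or "NP_000537.3".
final class RefSeqAccessionCheck {
    // Assumed rule: one of four prefixes, an underscore, digits,
    // and an optional ".version" suffix.
    private static final Pattern ACCESSION =
        Pattern.compile("(NM|XM|NP|XP)_\\d+(\\.\\d+)?");

    static boolean isValid(String accession) {
        return accession != null && ACCESSION.matcher(accession).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("NM_000546"));  // well-formed
        System.out.println(isValid("NM-000546"));  // wrong separator
    }
}
```

Checks like this would run before data is loaded, so that malformed identifiers are rejected rather than silently stored.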
API Alternative
    // fxn(search_params, desired_info), returns ll_id
    curated.cdd(score[ ], null)
        curated_score[ ]  score[ ]
        locus_id1[ ]
    gaa.cdd((name[ ], score[ ]), score[ ])
        gaa_name[ ]  name[ ]
        gaa_score[ ]  score[ ]
        locus_id2[ ]
    curated.cdd(name[ ], score[ ])
        curated_name[ ]  name[ ]
        locus_id[ ]
    intersect(locus_id1[ ], locus_id2[ ])
    locus(organism[ ], locus_id[ ])
    print(gaa_name[ ], curated_name[ ], organism[ ])
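The stepwise idea on this slide (collect locus ids from each source, then intersect them) could look roughly like the following in Java. This is a sketch under stated assumptions, not the GeneKeyDB API: the `CddHit` shape, the integer scores, and every method name are hypothetical, the data is held in memory rather than fetched from the database, and the code is not a translation of the Version 1 SQL.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the "collect, then intersect" query style.
final class LocusQuerySketch {
    // One conserved-domain hit attached to a locus (assumed record shape).
    static final class CddHit {
        final int locusId;
        final String cddName;
        final int cddScore;

        CddHit(int locusId, String cddName, int cddScore) {
            this.locusId = locusId;
            this.cddName = cddName;
            this.cddScore = cddScore;
        }
    }

    // All scores that occur in a list of hits.
    static Set<Integer> scoresOf(List<CddHit> hits) {
        Set<Integer> scores = new LinkedHashSet<>();
        for (CddHit hit : hits) {
            scores.add(hit.cddScore);
        }
        return scores;
    }

    // Locus ids of the hits whose score is in the wanted set.
    static Set<Integer> lociWithScore(List<CddHit> hits, Set<Integer> wanted) {
        Set<Integer> loci = new LinkedHashSet<>();
        for (CddHit hit : hits) {
            if (wanted.contains(hit.cddScore)) {
                loci.add(hit.locusId);
            }
        }
        return loci;
    }

    // Elements present in both sets, in the iteration order of the first.
    static Set<Integer> intersect(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new LinkedHashSet<>(a);
        result.retainAll(b);
        return result;
    }

    // Loci that have a curated hit and a predicted hit, where each hit's
    // score occurs in both hit lists.
    static Set<Integer> sharedLoci(List<CddHit> curated, List<CddHit> predicted) {
        Set<Integer> common = intersect(scoresOf(curated), scoresOf(predicted));
        return intersect(lociWithScore(curated, common),
                         lociWithScore(predicted, common));
    }
}
```

Each step returns a plain collection, so a user can inspect intermediate results instead of debugging one large nested query.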
External Implementations • Some databases have APIs as well. • Ensembl • APIs are done in Perl. • APIs for GeneKeyDB will be done in Java. • More structured language. • Easier to read.
The Future of GeneKeyDB • GeneKeyDB will join even more external and widely used databases together. • Code for updating GeneKeyDB will tie into database information that will change in expected ways. • Lowers the required number of code rewrites. • GeneKeyDB will be dynamically updated.
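One way to picture an update that follows expected changes is to compare two snapshots of a source's identifiers and act only on the difference. The sketch below assumes identifiers are plain strings and that a snapshot is simply a set of them; `SnapshotDiff` is a hypothetical name, and the slides do not describe how GeneKeyDB's update code actually works.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: the identifiers that appeared or disappeared
// between two snapshots of an external source.
final class SnapshotDiff {
    final Set<String> added;      // in the current snapshot only
    final Set<String> withdrawn;  // in the previous snapshot only

    private SnapshotDiff(Set<String> added, Set<String> withdrawn) {
        this.added = added;
        this.withdrawn = withdrawn;
    }

    static SnapshotDiff between(Set<String> previous, Set<String> current) {
        Set<String> added = new TreeSet<>(current);
        added.removeAll(previous);
        Set<String> withdrawn = new TreeSet<>(previous);
        withdrawn.removeAll(current);
        return new SnapshotDiff(added, withdrawn);
    }
}
```

With a difference like this in hand, an updater only needs to insert the added identifiers and flag the withdrawn ones, which is the kind of predictable change that would not require rewriting code.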
The Future of GeneKeyDB • APIs will also be written in Perl. • Many biologists use Perl often, some almost exclusively. • The Perl APIs can tie into the Java APIs, rather than being created from scratch.
Comments? Questions? • http://genereg.ornl.gov/gkdb/