Evidence-Based Information Retrieval in Bioinformatics
Timothy B. Patrick, PhD
Healthcare Administration and Informatics, University of Wisconsin-Milwaukee
Goal of the Project
• The overall, long-term goal of this research project is to contribute to evidence-based information retrieval in post-genomic medicine
• Evidence here means proof of the effectiveness of the way particular information resources are used and combined in order to retrieve information
Aims
• Specific Aim 1: Determine existing pitfalls in accessing literature on gene function
• Specific Aim 2: Based on user warrant, determine the current state of evidence-based functional genomic retrieval
• Specific Aim 3: Based on literary warrant, determine the current state of evidence-based functional genomic retrieval
“Determine existing pitfalls in accessing literature on gene function”
• That is the topic of my talk later today: “Asymmetries in Retrieval of Gene Function Information”
The Study
• We investigated an example of different paths to the literature that might look equivalent to a user but are not, because of various features of the resources involved.
• Knowing that the paths are not equivalent requires knowledge of metadata about the resources.
Three Paths
• Path 1: Affymetrix → Genbank Accession number → Pubmed → Pubmed ID
• Path 2: Affymetrix → Genbank Accession number → Nucleotide → Pubmed links → Pubmed → Pubmed ID
• Path 3: Affymetrix → Genbank Accession number → Gene → Pubmed links → Pubmed → Pubmed ID
http://www.affymetrix.com/corporate/media/genechip_essentials/gene_expression/Features_and_probes.affx
Methods
• We first collected representative DNA Accession numbers associated with genes expressed in a microarray experiment designed to identify changes in gene expression associated with skeletal muscle recovery from immobilization-induced sarcopenia.
Methods
• Next, we retrieved the Unique Identifiers (UIs) of Entrez Pubmed citations that were associated with the Accession numbers by each of the three Entrez resources:
• Directly, in the case of Entrez Pubmed
• Indirectly, via Pubmed links, in the case of Entrez Nucleotide and Entrez Gene
• Finally, we compared the number of Pubmed IDs retrieved by the three resources for each of the Accession numbers.
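The comparison step can be sketched in a few lines. This is a minimal illustration, not the study's code: the accession numbers and Pubmed IDs below are placeholders, and the names `retrieved`, `count_row`, and `paths_agree` are hypothetical.

```python
PATHS = ("Pubmed", "Nucleotide", "Gene")

# Hypothetical retrieval results: accession number -> path -> set of Pubmed IDs.
# These identifiers are placeholders, not data from the study.
retrieved = {
    "ACC_A": {"Pubmed": {"p1", "p2"}, "Nucleotide": {"p1"}, "Gene": {"p1", "p2", "p3"}},
    "ACC_B": {"Pubmed": {"p4"}, "Nucleotide": {"p4"}, "Gene": {"p4"}},
    "ACC_C": {"Pubmed": set(), "Nucleotide": {"p5"}, "Gene": {"p5", "p6"}},
}


def count_row(by_path):
    """Number of Pubmed IDs retrieved by each path, in PATHS order."""
    return tuple(len(by_path.get(path, set())) for path in PATHS)


def paths_agree(by_path):
    """True only if all three paths return exactly the same set of Pubmed IDs."""
    sets = [by_path.get(path, set()) for path in PATHS]
    return all(s == sets[0] for s in sets)


counts = {acc: count_row(by_path) for acc, by_path in retrieved.items()}
disagreeing = [acc for acc, by_path in retrieved.items() if not paths_agree(by_path)]

for acc, row in counts.items():
    print(acc, dict(zip(PATHS, row)))
print("Accession numbers where the paths differ:", disagreeing)
```

Comparing sets as well as counts is a design choice of this sketch: two paths can return the same number of Pubmed IDs while returning different citations.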
Summary of Pubmed IDs by Accession Number (table; columns: Pubmed, Nucleotide, Gene)
Methods
• Compared the number of Pubmed IDs produced for each Accession number by each path.
• Applied a non-parametric test: Kendall’s W
• Pubmed versus Nucleotide versus Gene
• p < .05
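For readers unfamiliar with Kendall’s W, the following is a minimal sketch of how it could be computed for this design. It assumes that each Accession number acts as a "judge" that ranks the three paths by Pubmed ID count, which is the usual related-samples setup but is not spelled out in the slides. The counts are hypothetical, ties receive average ranks with the standard tie correction, and the p-value uses the large-sample chi-square approximation, which for three paths (2 degrees of freedom) reduces to a simple exponential.

```python
from math import exp


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def tie_term(values):
    """Sum of t^3 - t over groups of tied values within one row."""
    group_sizes = {}
    for v in values:
        group_sizes[v] = group_sizes.get(v, 0) + 1
    return sum(t**3 - t for t in group_sizes.values())


def kendalls_w(rows):
    """Kendall's W for m rows (accession numbers) over n columns (paths)."""
    m, n = len(rows), len(rows[0])
    rank_rows = [average_ranks(row) for row in rows]
    rank_sums = [sum(r[j] for r in rank_rows) for j in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((total - mean_sum) ** 2 for total in rank_sums)
    denom = m**2 * (n**3 - n) - m * sum(tie_term(row) for row in rows)
    if denom == 0:
        raise ValueError("W is undefined when every row is completely tied")
    return 12 * s / denom


def p_value_three_paths(w, m):
    """Chi-square approximation: chi2 = m * (n - 1) * W with n = 3, so df = 2.

    For df = 2 the upper-tail probability is exp(-chi2 / 2).
    """
    chi2 = m * 2 * w
    return exp(-chi2 / 2)


# Hypothetical counts per accession number: (Pubmed, Nucleotide, Gene).
example_counts = [[5, 2, 1], [7, 3, 2], [4, 1, 0], [6, 2, 1]]
w = kendalls_w(example_counts)
p = p_value_three_paths(w, len(example_counts))
print(f"W = {w:.3f}, p = {p:.4f}")
```

With only a handful of Accession numbers the chi-square approximation is rough, so an exact or permutation-based p-value would be preferable for small samples.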
The Three Paths Are Not Equivalent
• Path via Pubmed ≠ path via Nucleotide ≠ path via Gene (each path runs from Affymetrix and the Genbank Accession number to Pubmed IDs)
The SI field identifies secondary source databanks and accession numbers of outside resources discussed in MEDLINE articles. The field is composed of the source followed by a slash followed by an accession number and can be searched with one or both components, e.g., genbank [si], AF001892 [si], genbank/AF001892 [si]. The SI field and the Entrez sequence database links are not linked. The PubMed links to these databases are created from the reference field of the GenBank or GenPept flat file. These references include citations that discuss the specific sequence presented in these flat files.
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=helppubmed.box.pubmedhelp.Box_1_Search_Field_D#pubmedhelp.Secondary_Source_ID_
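The three SI search forms quoted above can be generated mechanically. The helper below is a small sketch; the function name `si_query` is hypothetical, and it only builds the query string, it does not contact PubMed.

```python
def si_query(source=None, accession=None):
    """Build a PubMed Secondary Source ID query from one or both components.

    Follows the pattern in the help text: source, a slash, then the
    accession number, tagged with [si].
    """
    if source and accession:
        term = f"{source}/{accession}"
    elif source or accession:
        term = source or accession
    else:
        raise ValueError("Provide a source, an accession number, or both")
    return f"{term} [si]"


# The three example forms given in the help text.
print(si_query(source="genbank"))
print(si_query(accession="AF001892"))
print(si_query(source="genbank", accession="AF001892"))
```

Because the SI field and the Entrez sequence database links are maintained separately, a query built this way may return a different set of citations than the Pubmed links followed from Entrez Nucleotide or Entrez Gene.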
“Based on user warrant, determine the current state of evidence-based functional genomic retrieval”
• Interviews with biologists who use microarrays to study gene expression levels
• Questions concern which IR methods they use, why they consider those methods effective, what their criteria of success and failure are, and how they see the role of biomedical librarians in the process
Interviews in Progress
• Five interviews currently scheduled at the University of Missouri-Columbia
• Interviews being scheduled at the University of Wisconsin-Milwaukee
• In March we interviewed two subjects at NIG in Japan
“Based on literary warrant, determine the current state of evidence-based functional genomic retrieval”
• We wanted to investigate how and to what extent biological science researchers reported their information retrieval methods, including details of why they used the methods they did.
Methods
• We searched OVID Medline on October 1, 2004 for the period 1966 to September Week 4 2004 with the query “Oligonucleotide Array Sequence Analysis/”, producing 10746 results.
• We then limited the results to English (10374), excluded “review articles” (9049), and limited to the years 2003–2004 (4798).
• We next ranked journals in the results by number of articles and took as our population all of the articles from the top 13 journals (n=1373).
• We randomly sampled 150 articles from that population.
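The selection funnel and the sampling step can be reproduced as a short sketch. The result counts come from the slide; the article identifiers, the seed, and the variable names are assumptions made for illustration, since the slides do not say how the random sample was drawn.

```python
import random

# Result counts after each step, as reported on the slide.
funnel = [
    ("Initial query", 10746),
    ("Limited to English", 10374),
    ("Review articles excluded", 9049),
    ("Limited to 2003-2004", 4798),
    ("Articles in the top 13 journals", 1373),
]

# How many records each successive step removed.
removed = [before - after for (_, before), (_, after) in zip(funnel, funnel[1:])]
for (step, remaining), dropped in zip(funnel[1:], removed):
    print(f"{step}: {remaining} remaining ({dropped} removed)")

# Hypothetical stand-ins for the 1373 article records.
population = [f"article_{i:04d}" for i in range(1, 1374)]

# A fixed seed (arbitrary here) makes the sample reproducible and reportable.
rng = random.Random(2004)
sample = rng.sample(population, 150)  # sampling without replacement
print(len(sample), "articles sampled")
```

Recording the seed alongside the query and limits is one way to make a "dry bench" sampling step as repeatable as a wet bench protocol.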
Methods
• If the authors of a paper did report gene function, we wanted to know which information sources and retrieval methods they used, as well as the reasons they had for using them.
• Functional Attribution Reported
• Sources of Information Reported
• Retrieval Strategy Reported
• Grounds for Choice of Sources Reported
• Grounds for Retrieval Strategy Reported
Methods
• How were details of sources and retrieval methods reported?
• Methods or Procedures
• Results
• Discussion
Results
• Typical evidence for attribution of gene function consists of literature citations.
• When a literature search (e.g. a Pubmed search) or a search of other knowledge sources (e.g. NCBI databases) is cited as the source of evidence to support attribution of function, details of the search are rarely reported.
• Reasons for using particular sources and retrieval methods are not reported.
Results
• When information retrieval methods are described in the paper, they are typically mentioned only in the “Results” or “Discussion” sections of the paper, and not in the “Methods” section.
• Wet bench methods are reported in more detail than dry bench methods.
Implications for Information Practice
• There is a need to embrace a workflow concept
• There is a need to develop standards for documentation in e-science
• There is a need to use multidisciplinary teams to develop workflows
“There is a need to embrace a workflow concept”
• Call a scenario in which multiple information resources, databases, and analysis tools are used in combination a workflow
• Workflows are increasingly important for information retrieval and processing in the Life Sciences
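One way to make the workflow idea concrete is to record each retrieval step as structured data. The sketch below is only an illustration of that idea, not a proposed standard: the class names `RetrievalStep` and `Workflow` and their fields are hypothetical, and the example step reuses the OVID Medline search described earlier in the talk.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RetrievalStep:
    """One documented use of an information resource (hypothetical schema)."""
    resource: str
    query: str
    date: str
    result_count: int
    rationale: str = ""  # why this resource and strategy were chosen


@dataclass
class Workflow:
    """An ordered combination of resources and tools, with documentation."""
    name: str
    steps: list = field(default_factory=list)

    def add(self, step):
        self.steps.append(step)
        return self

    def to_json(self):
        return json.dumps(asdict(self), indent=2)


workflow = Workflow(name="Literary warrant sample").add(
    RetrievalStep(
        resource="OVID Medline",
        query="Oligonucleotide Array Sequence Analysis/",
        date="2004-10-01",
        result_count=10746,
    )
)
print(workflow.to_json())
```

A record like this captures the items the literature review looked for in published papers: the source, the retrieval strategy, and the grounds for choosing them.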
“There is a need to develop standards for documentation in e-science”
• The Digitization of Science, or E-science (diagram labels: Traditional Science; Computer-based information retrieval and processing)
Life Science Information Retrieval and Processing Workflows (diagram labels: documentation; technology to facilitate documentation; editorial policy drivers)
“There is a need to use multidisciplinary teams to develop workflows”
• KNOWLEDGE-ENABLED WORKFLOWS (diagram labels: METADATA; TOOLS; INFORMATION ITEMS)
• Team members shown in the diagram: domain metadata expert (information specialist); domain expert (scientist)