Enhancing Access to Research Data: T2D Project Overview

T2D + data identification, curation & duration
Maxine Tedesco
ACCOLEDS: December 2-4, 2009

Table to Data (T2D) Project
Approved March/08 at the COPPUL director’s meeting as a collaborative project seeking to implement a system of linking articles & data in open access journals published at COPPUL institutions.

T2D activities to date
• May/08: Brainstorming at IASSIST conference
• July/08: Drupal Wiki established & “Outline of Activities” disseminated to project members
• Fall/08: Maxine undertook a Literature Search (building on work done by Jim Jacobs, Feb/08)
• December/08: Maxine reported at ACCOLEDS and renewed effort to involve project members
• Spring/09: Maxine investigated related project topics in connection with Study Leave research
Additionally, Chuck liaised/advocated for the project throughout the timeline & consultation with OA publishers was undertaken by some project members.

T2D project stages
• Investigating
  • Literature Searches re: background, tools, etc.
• Recruiting
  • Open access publishers amenable to a pilot project
  • Researchers willing to deposit data
• Marking
  • Develop a set of descriptive tags for table content
  • Identify which parts of a data file “should” be linked and/or archived
• Tooling (i.e., tools for markup, searching & display)
• Evaluating/Reporting (i.e., HOW the project results contribute to research, teaching & learning)

So … What Is In It For Us? This seemed like a reasonable question to investigate further in the research in terms of “background information”.

Taking into account researchers’ disciplinary differences, tables/figures are increasingly:
• used as a more effective summary of the article’s content than subject headings or other descriptors
• used as a quick means of identifying types of data, methodologies &/or results
• used to assess article relevance before reading the entire article
• less effective if completely extracted from the surrounding explanatory text and/or complementary tables/figures

DISAGGREGATION
Disaggregation of article components such as tables/figures facilitates searching at a greater level of granularity in order to:
• Improve search precision (share of retrieved items that are relevant) & recall (share of relevant tables/figures actually retrieved, including those not otherwise retrieved in a traditional search)
• Facilitate the REAGGREGATION of a journal article’s components into new forms/formats
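To make the two measures concrete, here is a minimal sketch using the standard information-retrieval definitions. The item identifiers and both result lists are hypothetical, invented purely for illustration; they are not T2D project data or results.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single search.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four tables/figures are relevant to the query.
relevant = {"table-1", "table-2", "figure-3", "figure-4"}

# Hypothetical results of an article-level search vs. a table-level search.
article_level = ["table-1", "article-9", "article-12", "article-15"]
table_level = ["table-1", "table-2", "figure-3", "article-9"]

print(precision_recall(article_level, relevant))  # (0.25, 0.25)
print(precision_recall(table_level, relevant))    # (0.75, 0.75)
```

In this made-up case, searching at the table/figure level retrieves more of the relevant items (higher recall) with less noise (higher precision). Whether real searches behave this way is exactly what a pilot would need to test.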

REAGGREGATION?
Researchers wish to easily incorporate tabular information:
• into new documents (to support original research)
• into multimedia documents (to support presentations - classrooms or conferences)
• into other contexts (utilize data in pre-existing tables rather than generate new time-consuming and/or expensive datasets)
• into a comparison of similar information (to check one’s own work against other work)

So … What Can Make It Easier To Retrieve Relevant Tables/Figures? The research was decidedly sparse in this area or not quite as “on-topic” as one would have hoped.

Overview of Literature Review
The research mostly dealt with such topics as:
• Making T&F (tables/figures) more accessible to the visually impaired.
• Improved graphical presentation of T&F.
• Poor quality of T&F replication in electronic versions of documents.
• Improved dissemination of statistical information.
• Full-text does not necessarily mean the inclusion of T&F.

Format-Specific Databases
• TableBase (Gale; 1997+)
  • table title, table text, and descriptor fields are searchable
  • text that accompanies the table is not searchable or retrievable from the product
  • tables are directly downloadable to Excel
• Statistical Universe (Lexis-Nexis PowerTables; 2000+)
  • users search by “criteria”
  • links to full-text documents in the CIS/LEXIS-NEXIS digital archive & on WWW sites
  • download a PDF file or an Excel spreadsheet

SEARCH RESULTS from TableBase

TYPICAL RECORD in TableBase

Databases with “Deep Indexing” features
• Illustrata (ProQuest/CSA; 2006+)
  • assigns 7-8 index terms per image (these are searchable but not the table text itself)
  • thumbnail images for quick preview
  • links to full-text and other components within the product
• Selected ProQuest Databases (Oct. 1, 2009+)
  • deep indexing of images added along with traditional abstracting & indexing of text (at no additional cost)

Illustrata Results Page

Illustrata Article Record

Illustrata Object Record

GeoRef Database’s link to “Deep Indexing”

Abstract retrieved from GeoRef for “Aeronomy” and “maps”

Products That Index Table CONTENT
• TableSeer (search engine; 2006+)
  • automatically identifies tables in digital documents and extracts the contents of the table cells, along with table metadata
  • the extracted contents are stored in a queryable database table, and a novel ranking function is used to find tables relevant to user queries
• BioText Search Engine (freely available web-based application; 2007+)
  • searches over 300 open access journals
  • ability to search for words within a table
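As a rough illustration of what “indexing table content” involves, the sketch below builds a small word index over table captions and cell text and ranks tables by how often the query words occur. It is a toy under stated assumptions: the three tables are hypothetical, and the term-count ranking is a stand-in chosen for simplicity, not TableSeer's ranking function or BioText's implementation.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(tables):
    """Map each word to a Counter of {table_id: occurrences in caption + cells}."""
    index = defaultdict(Counter)
    for table_id, table in tables.items():
        texts = [table["caption"]] + [cell for row in table["rows"] for cell in row]
        for text in texts:
            for word in tokenize(text):
                index[word][table_id] += 1
    return index


def search(index, query):
    """Return table ids ordered by total occurrences of the query words."""
    scores = Counter()
    for word in tokenize(query):
        for table_id, count in index.get(word, {}).items():
            scores[table_id] += count
    return [table_id for table_id, _ in scores.most_common()]


# Hypothetical tables, as they might look after extraction from articles.
tables = {
    "t1": {"caption": "Cholesterol levels by education",
           "rows": [["Education", "Mean cholesterol"],
                    ["High school", "5.2"],
                    ["University", "4.9"]]},
    "t2": {"caption": "Sample sizes by province",
           "rows": [["Province", "N"], ["Alberta", "120"]]},
    "t3": {"caption": "Education and income",
           "rows": [["Education", "Income"], ["University", "52000"]]},
}

index = build_index(tables)
print(search(index, "cholesterol education"))  # ['t1', 't3']
```

The point of the sketch is that the words inside the cells are searchable, which is what separates these products from the “deep indexing” databases that only search assigned index terms.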

TableSeer is part of ChemxSeer http://chemxseer.ist.psu.edu/

BioText Search in Articles For: “hypercholesterolemia” & “Education”

Same BioText search in “Figure Captions” – Grid view

Same BioText search in “Tables”

So … What Does This All Mean for the T2D Project? Not exactly sure, but perhaps, given this trend in the Abstracting & Indexing industry, we might investigate developing a “SocioText” type of product to index open access journals such as the Canadian Journal of Sociology??

So … What Else Needs To Be “Put On The Table”? What if the table information is insufficient and I want to look at the entire dataset? Where is the entire dataset? Who owns the entire dataset? When will it become available for me to use? How can I get my hands on it?

Identific/cur/dur-ATION!
• Personal Websites
• Institutional Repositories
• Subject-specific Repositories such as:
  • Dryad - http://datadryad.org/repo
  • ExLab - http://exlab.bus.ucf.edu
AND THEN PERHAPS, there’s still:
• Desk Drawers (aka: LOST)

So . . . What Do We Do Now? Hopefully I’ve been able to provide some context and/or “food for thought” and, well . . . stay tuned for updates!
