Developing systems for full-text search in biomedicine. Anna Divoli School of Information University of California, Berkeley 07 Aug 2007 University of Manchester
outline • motivation • how to go about it • first study • biotext search engine • future work • current studies (brief)
motivation • All prominent literature search systems are based on information from abstracts alone, even though: • Researchers in the area of text mining have started to investigate approaches for full-text analysis. • PubMed Central has made available a large collection of full-text articles (Open Access), overcoming licensing restrictions from the publishers. • Figure captions and figures can be especially useful for locating experimental results (e.g., 2002 KDD).
how to go about it • Search engine interface that meets the unique needs of bioscientists • Provide flexible, useful, appealing search for bioscientists • Provide abstract? Full text? Figures? Expansions? Links? • How? User-centered search interface design. • Simple design • Usability studies • Iteration • [Diagram labels: Design, Evaluate, Prototype]
hci principles • Design for the user, not for the designers or the system • Needs assessment: who the users are, what their goals are, what tasks they need to perform • Task analysis: characterize what steps users need to take, create scenarios of actual use, decide which users and tasks to support • Iterate between designing & evaluating
hci principles - cont. • Make use of cognitive principles where available • Important guidelines: reduce memory load; speak the user’s language; provide helpful feedback; respect perceptual principles • Prototypes: get feedback on the design faster; experiment with alternative designs; fix problems before code is written; keep the design centered on the user
first study - goals • Marti A. Hearst, Anna Divoli, Jerry Ye and Michael A. Wooldridge (2007) “Exploring the efficacy of caption search for bioscience journal search interfaces” ACL 2007 Workshop on BioNLP, Prague, Czech Republic • Primary Goal: Determine whether biological researchers would find caption search and figure display useful (there is evidence that figures and their captions are very informative). • Secondary Goal: If caption search and figure display prove useful, determine how best to support these features in the interface.
first study - description • When introducing a new search interface idea, great care must be taken to get the details right. • Practiced user-centered design: first prototype, then test the results with potential users, then refine the design based on their responses, and repeat… • Tested a few designs. • ~1-hour sessions with participants. • Standardized questions & open discussions (design & content).
first study - designs • [Screenshots of the tested designs, labeled: full text with figure; caption & figure; caption, figure & small thumbnails; grid; full text]
first study - outcomes • 7 out of 8 participants were in favor of caption search & figure display. • Different views serve different roles (for general concepts, abstract search would be more suitable, but for a specific method, caption view would be better). • Best to show all the thumbnails as the result of a full-text or abstract-text search. • Info as few clicks away as possible, with as few ‘distracting’ links & options as possible. • More metadata for the grid view. • All participants favored the ability to browse all figures from a paper once they find the abstract or one of the figures relevant to their query.
first study - outcomes cont. • [Chart: participant ratings of the Caption-Figure design; x-axis: participant #; y-axis: rating, from 1 = strongly disagree to 7 = strongly agree]
biotext search engine Marti A. Hearst, Anna Divoli, Harendra Guturu, Alex Ksikes, Preslav Nakov, Michael A. Wooldridge and Jerry Ye (2007) “BioText Search Engine: beyond abstract search” Bioinformatics; doi:10.1093/bioinformatics/btm301
biotext search engine - the views • The interface is carefully designed according to usability principles and techniques. • The BioText Search Engine allows users to search in: • Abstracts - list view • Captions - list view • Captions - grid view • Clicking on any figure opens a new window with a large version of the figure accompanied by its caption.
abstracts - list view • Searches in: TITLES, ABSTRACTS, AUTHOR NAMES • Returns in a list: TITLE, CITATION, ABSTRACT, thumbnails of FIGURES from the paper, HTML & PDF links to the paper, link to the “Endgame view”
captions - list view • Searches in: CAPTIONS • Returns in a list: TITLE, CITATION, CAPTION, corresponding FIGURE, HTML & PDF links to the paper, link to the “Endgame view”
captions - grid view • Searches in: CAPTIONS • Returns: corresponding FIGURES in a grid, short CAPTION excerpt, CITATION in tooltip, link to the “Endgame view”
endgame view • All views lead to the Endgame view. • The “Endgame view” displays a summary of the paper from which an abstract or caption originates. This summary comprises: • TITLE • CITATION • ABSTRACT • all FIGURES and corresponding CAPTIONS of the paper in a list • HTML & PDF links to the paper
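The fields returned by the three search views and the Endgame view can be sketched as plain records. This is a minimal illustration, not the BioText implementation: the class names, function names, field keys, and the `excerpt_chars` default are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Figure:
    image_url: str  # hypothetical location of the stored figure image
    caption: str


@dataclass
class Paper:
    title: str
    citation: str
    abstract: str
    html_url: str
    pdf_url: str
    figures: List[Figure] = field(default_factory=list)


def _links(paper: Paper) -> Dict[str, str]:
    """HTML & PDF links shown alongside list-view hits."""
    return {"html": paper.html_url, "pdf": paper.pdf_url}


def abstract_list_item(paper: Paper) -> dict:
    """One hit in the abstracts list view."""
    return {
        "title": paper.title,
        "citation": paper.citation,
        "abstract": paper.abstract,
        "thumbnails": [f.image_url for f in paper.figures],
        "links": _links(paper),
    }


def caption_list_item(paper: Paper, figure: Figure) -> dict:
    """One hit in the captions list view: a caption with its figure."""
    return {
        "title": paper.title,
        "citation": paper.citation,
        "caption": figure.caption,
        "figure": figure.image_url,
        "links": _links(paper),
    }


def caption_grid_item(paper: Paper, figure: Figure, excerpt_chars: int = 60) -> dict:
    """One cell in the captions grid view: figure, short excerpt, tooltip."""
    return {
        "figure": figure.image_url,
        "caption_excerpt": figure.caption[:excerpt_chars],
        "citation_tooltip": paper.citation,
    }


def endgame_view(paper: Paper) -> dict:
    """Summary of a whole paper: abstract plus every figure and caption."""
    return {
        "title": paper.title,
        "citation": paper.citation,
        "abstract": paper.abstract,
        "figures": [(f.image_url, f.caption) for f in paper.figures],
        "links": _links(paper),
    }
```

Keeping each view as a function over the same `Paper` record reflects the point that all views lead to the same Endgame summary; only the subset of fields shown differs.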
technical details • Indexes all Open Access articles available at PMC - collection consists of more than 150 journals, 20,000 articles, and 80,000 figures (new articles are downloaded and indexed daily). • The figures are stored locally (to present thumbnails quickly). • The Lucene open source search engine is used to index, retrieve, and rank the text (using default statistical ranking). • Publication date is stored as a separate field and can also be used to sort the results. • The interface is web-based and is implemented in Python and PHP; logs and other information are stored using MySQL.
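The idea of ranking captions statistically while keeping publication date as a separate sortable field can be illustrated with a toy inverted index. This sketch does not use Lucene and does not reproduce Lucene's default scoring; it substitutes a simple TF-IDF sum, and the class name, method names, and sample captions are all invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict
from datetime import date


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class CaptionIndex:
    """Toy in-memory index: term postings plus a separate date field."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.pub_dates = {}                # doc_id -> publication date

    def add(self, doc_id, caption, pub_date):
        self.pub_dates[doc_id] = pub_date
        for term, tf in Counter(tokenize(caption)).items():
            self.postings[term][doc_id] = tf

    def search(self, query, sort_by_date=False):
        """Return matching doc ids, ranked by TF-IDF or newest first."""
        n_docs = len(self.pub_dates)
        scores = defaultdict(float)
        for term in tokenize(query):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            idf = math.log(1 + n_docs / len(docs))
            for doc_id, tf in docs.items():
                scores[doc_id] += tf * idf
        if sort_by_date:
            return sorted(scores, key=lambda d: self.pub_dates[d], reverse=True)
        return sorted(scores, key=lambda d: scores[d], reverse=True)


# Usage with made-up captions and dates.
index = CaptionIndex()
index.add("fig1", "Western blot of p53 expression", date(2007, 3, 2))
index.add("fig2", "Phylogenetic tree of p53 homologs; p53 clade highlighted", date(2006, 1, 10))
index.add("fig3", "Sequence alignment of kinase domains", date(2005, 5, 5))
by_relevance = index.search("p53")
by_date = index.search("p53", sort_by_date=True)
```

Storing the date outside the postings means the same set of matches can be reordered by date without rescoring.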
future work • The search engine is a work in progress. More functionality will be added over time. We plan to: • Provide full-text search. • (Since the usability of different ranking functions for biology articles is still not well understood, we plan to do usability testing, research how different sections should be weighted for different query types, and investigate how best to show excerpts or summaries from full text before supporting this feature.) • Augment the caption search by indexing the parts of the full text that refer to the caption. • Provide search over table captions.
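One way to picture weighting sections differently for different query types is a per-query-type weight table applied to per-section match scores. The sketch below is purely hypothetical: the query types, section names, and weight values are invented, and the slides state that suitable weightings are still to be researched.

```python
from typing import Dict

# Hypothetical weights; real values would have to come from usability testing.
SECTION_WEIGHTS: Dict[str, Dict[str, float]] = {
    "method": {"abstract": 1.0, "methods": 3.0, "results": 1.5, "captions": 2.0},
    "concept": {"abstract": 3.0, "methods": 0.5, "results": 1.0, "captions": 1.0},
}


def weighted_score(section_scores: Dict[str, float], query_type: str) -> float:
    """Combine per-section match scores using the weights for a query type.

    Sections without an explicit weight default to 1.0.
    """
    weights = SECTION_WEIGHTS[query_type]
    return sum(
        weights.get(section, 1.0) * score
        for section, score in section_scores.items()
    )
```

For example, a paper whose methods section matches strongly would score higher for a "method" query than for a "concept" query under these assumed weights.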
future work cont. • Incorporate topical features such as genes/proteins and organisms. • For the grid view, we plan to provide grouping according to categories that are of interest to biologists, such as “sequence alignments” and “phylogenetic trees”. • (We are building a classifier for figures and their captions. We have developed an image annotation interface and are soliciting help with hand-labeling for the automated caption classifier.) • Additional future developments on the BioText Search Engine will depend on feedback and requests we receive from users, and the results of extensive usability testing.
current study in brief • (online surveys) • First part: Biological Information Preferences • Second part: Gene/Protein Name Expansion Preferences • Whether or not bioscience literature searchers wish to see related term suggestions, in particular, gene and protein names • We plan to assess presentation of other results of text analysis, such as the entities corresponding to diseases, pathways, gene interactions, localization information, function information, and so on. • Assess the usability of one feature at a time, see how participants respond, and then test out other features
results of current study in brief (based on 38 responses, numerous specializations)
Rating scale: 1 = Do NOT want this, 3 = Neutral, 5 = REALLY want this
Related Information Type                      Avg rating   # selecting 1 or 2
Gene’s Synonyms                               4.4          2
Gene’s Synonyms refined by organism           4.0          2
Gene’s Homologs                               3.7          5
Genes from same family: parents               3.4          7
Genes from same family: children              3.6          4
Genes from same family: siblings              3.2          9
Genes this gene interacts with                3.7          4
Diseases this gene is associated with         3.4          6
Chemicals/drugs this gene is associated with  3.2          8
Localization information for this gene        3.7          3
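The two summary columns reported for each information type (average rating, and the number of respondents selecting 1 or 2) could be derived from raw 1 to 5 responses roughly as follows. The function name and the sample responses are hypothetical; the survey's raw data are not part of these slides.

```python
from statistics import mean
from typing import List, Tuple


def summarize_ratings(ratings: List[int]) -> Tuple[float, int]:
    """Return (average rating rounded to one decimal, count of 1s and 2s)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    low_count = sum(1 for r in ratings if r <= 2)
    return round(mean(ratings), 1), low_count


# Made-up responses for a single information type.
average, low = summarize_ratings([5, 5, 4, 1, 2, 5])
```

Reporting the count of low ratings next to the average helps distinguish items that are mildly liked by everyone from items that a few respondents actively do not want.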
results of current study in brief - cont. • Strong desire for the search system to suggest information closely related to gene/protein names. • Some interest in less closely related information. • Most participants want to see organism names in conjunction with gene names. • A majority of participants prefer to see term suggestions grouped by type. • Split in preference between single-click hyperlink interaction and checkbox-style interaction. • Need to experiment with hybrid designs, e.g., checkboxes for the individual terms and a link that immediately adds all terms in the group and executes the query. • Adding more information will require a delicate balancing act between usefulness and clutter.
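The hybrid interaction mentioned above (checkboxes for individual terms plus a link that adds a whole group at once) can be sketched as a single query-building function. Everything here is an assumption made for illustration: the function name, the grouping of suggestions by type, the example gene names, and the generic quoted-term `OR` syntax.

```python
from typing import Dict, Iterable, List, Optional


def expand_query(
    query: str,
    suggestions: Dict[str, List[str]],
    checked: Iterable[str] = (),
    add_all_group: Optional[str] = None,
) -> str:
    """Build an expanded query string from term suggestions.

    `checked` models individually ticked checkboxes; `add_all_group`
    models the single-click link that adds every term in one group.
    """
    if add_all_group is not None:
        terms = list(suggestions[add_all_group])
    else:
        terms = list(checked)
    if not terms:
        return query  # nothing selected: run the original query unchanged
    all_terms = [query] + [t for t in terms if t != query]
    return " OR ".join(f'"{t}"' for t in all_terms)


# Illustrative suggestions grouped by type.
suggestions = {"synonyms": ["TP53", "LFS1"], "homologs": ["Trp53"]}
one_checked = expand_query("p53", suggestions, checked=["TP53"])
whole_group = expand_query("p53", suggestions, add_all_group="synonyms")
```

Routing both interaction styles through one function keeps the resulting query identical regardless of how the terms were chosen, which would make the two designs easier to compare in a usability study.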
acknowledgements • Marti Hearst • Mike Wooldridge • Jerry Ye • Preslav Nakov • Harendra Guturu • Supported by NSF DBI-0317510 • Available at: http://biosearch.berkeley.edu