Capturing Human Information Interaction to Manage Scientific Knowledge Shannon Bradshaw Department of Management Sciences Tippie College of Business The University of Iowa
Motivation • In Biology, 40,000 new articles are published each month (Fontanelo 2004) • Biologists spend as much as 20% of their time gathering information (Hayes 2004) • For evolutionary biologists, the problem is especially severe • A gene or protein of interest may lead them to organisms (areas of the literature) with which they have no familiarity
Real-world example • In recent wet lab work a graduate student discovered proteins called ankyrins in the organism he was studying • Learning the function of ankyrins and what is known about ankyrin repeats was critical to his research • He performed a time-consuming web/literature search to uncover in what organisms these proteins occur and what is known about them
Beedance • Personal information gathering and management • Text mining • Community-oriented knowledge sharing
Personal Information Gathering and Management • Problem: It is difficult for individual biologists to assemble needed information and save it for later review • Approach: Client application that enables users to: • Assemble and organize: • Electronic documents • Annotations • Structured information • Save and search • Share
Text Mining • Problem: • Too many relevant articles to sift through • It is frequently not an entire article that an investigator needs but a sentence or paragraph within that article • Approach: Developing text mining tools that allow the user of the client application to: • Find relevant information faster • With greater coverage • Organize into an easily consumable format
Knowledge sharing/management • Problem: The same information is gathered and reviewed by hundreds and perhaps thousands of people, often by people in the same research laboratory • Approach: Developing a community-managed knowledge sharing service: • Gives users access to information assembled by others with similar needs • Explicit information sharing • Explicit and implicit searches • Framework for studying human information gathering, organization, and sharing behavior
Client Application with Text Mining • [Diagram: BeedanceClient]
Knowledge Sharing Server • [Diagram: BeedanceServer, with labels “Ankyrins” and “KA”]
The Beedance Project • Co-directors: • Shannon Bradshaw and Marc Light • Contributing faculty: • David Eichmann (SLIS) • Debashish Bhattacharya (Biology) • Students: • Brian Almquist (MS) • Robert Arens (CS) • Nash Lincoln (CS) • Dylan Scott (MS undergrad) • Matthew Smalley (MS) • Hudong Wang (CS)
HCI / HII • A significant drawback to current PDF readers is that the reading task must be interrupted when one wants to annotate or navigate. • Exploring ways to blend navigation and annotation with the reading process so that fewer interruptions are necessary. • Brief demo
Research Question • Do the annotations of an individual early in a document help identify what he or she will be interested in seeing later in the document?
Automatic Highlighting Experiment • Passage retrieval - sentences • Two types of queries: • the keywords provided • the first highlighted passage • Expanded the query based on 1 definition per query word • Expansion based on multiple definitions per query word
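A minimal sketch of the setup described on this slide, not the project's actual code. It ranks sentences with an Okapi-style BM25 score and optionally expands the query with words from definitions. The tokenizer, the parameters `k1` and `b`, the IDF variant, the `definitions` lookup, and the sample sentences are all assumptions made for illustration. A highlighted passage can serve as the query by passing `tokenize(passage)` as the query terms.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query_terms, sentences, k1=1.2, b=0.75):
    """Return sentence indices ordered from best to worst BM25 score."""
    docs = [tokenize(s) for s in sentences]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: number of sentences containing each term.
    df = Counter(t for d in docs for t in set(d))
    scored = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        score = 0.0
        for t in set(query_terms):
            if t not in tf:
                continue
            # Non-negative IDF variant (an assumption; several variants exist).
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scored.append((score, i))
    # Higher score first; ties broken by original sentence order.
    return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))]


def expand_query(query_terms, definitions, max_defs=1):
    """Append the words of up to max_defs definitions for each query word."""
    expanded = list(query_terms)
    for t in query_terms:
        for definition in definitions.get(t, [])[:max_defs]:
            expanded.extend(tokenize(definition))
    return expanded


# Hypothetical data, for illustration only.
sentences = [
    "Each retroelement family is abundant in eukaryotic genomes.",
    "Sequences that copy themselves through an RNA intermediate spread quickly.",
    "The samples were stored at room temperature.",
]
definitions = {
    "retroelement": ["a genetic element that copies itself through an RNA intermediate"],
}

keywords = ["retroelement"]
print(bm25_rank(keywords, sentences))
print(bm25_rank(expand_query(keywords, definitions), sentences))
```

Raising `max_defs` corresponds to the "multiple definitions per query word" condition.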
Questions We Asked • How well does standard retrieval (Okapi) work? • What is a better query? • keywords • example passage • Does query expansion based on definitions help? • Do multiple definitions help more?
Data • 13 articles that had already been highlighted • a mix of topics, a mix of computer- and paper-based highlighting, 2 biologists • Asked what the information need had been • evolution, coevolution, “RNA world”, retroelement, “retroelement ancestor hypothesis”, mobility, mobile
Results • [Chart: MAP by condition, not reproduced]
Questions “Answered” • Standard retrieval: 0.23 MAP (not good) • Example highlighted regions are better than keywords specifying information need • Definition-based query expansion helps • Multiple definitions help more
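For readers unfamiliar with the metric, here is a small sketch of how MAP is conventionally computed, assuming it stands for mean average precision as is standard in retrieval evaluation. The function names and the toy rankings are illustrative and are not taken from the experiment.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@k taken at each rank k where a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    # Relevant items that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs):
    """runs is a list of (ranked_ids, relevant_ids) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy example: two queries over sentence ids.
runs = [
    ([1, 2, 3, 4], {1, 3}),  # relevant sentences found at ranks 1 and 3
    ([2, 1], {1}),           # relevant sentence found at rank 2
]
print(round(mean_average_precision(runs), 4))
```

In the highlighting setting, the relevant set for each article would be the sentences the reader actually highlighted.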
More Data To Come • Graduate seminar on evolutionary biology • 8 students all marked up the same 16 articles • All hardcopy markup • We have scanned the hardcopy markup • We have ASCII of the articles and are laboriously creating corresponding XML markup (using Callisto (thanks MITRE))
Information extraction Cell 2003 Jan 24;112(2):169-80 Twist Regulates Cytokine Gene Expression through a Negative Feedback Loop that Represses NF-kappaB Activity. Sosic D, Richardson JA, Yu K, Ornitz DM, Olson EN. During Drosophila embryogenesis, the dorsal transcription factor activates the expression of twist, a transcription factor required for mesoderm formation. We show here that the mammalian twist proteins, twist-1 and -2, are induced by a cytokine signaling pathway that requires the dorsal-related protein RelA, a member of the NF-kappaB family of transcription factors. Twist-1 and -2 repress cytokine gene expression through interaction with RelA. ... PMID: 12553906 [PubMed - in process] • Why: scientists are often interested in only a certain type of fact Info Extract
First pass • Sentence-level parsing (low-brow) • Poor performance • Hoping we do not need a full syntactic parse (high-brow) • Going for a “medium-brow” approach
Our Hypothesis • Relation extraction performance can be improved through the use of syntactic information which clusters words by the verbal element on which they depend • Dominion structure is the right amount of syntactic information: not too much, not too little
Dominion Structure • [D We further [P observed] [D that [D a [P heterodimer] of TRbeta and RXRalpha,] either in solution or [D/N [P bound] to a DR+4 TRE,] [P recruited] SRC-1 in a [D/N T3-[P dependent]] manner.] ] • Sentence-level parsing extracts all 10 pairwise combinations of TRbeta, RXRalpha, DR+4 TRE, SRC-1, and T3 • Dominion parsing would extract only 1 or 2
Semantic Description of Dominions • The rough idea is that verbal dominions correspond to propositions. • When a dominion does not contain any nested propositions, it is a first-order proposition. • A dominion that contains dominions corresponds to a second-order proposition: a proposition about propositions.
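The count of 10 follows from choosing 2 of the 5 entities in the sentence. The sketch below checks that arithmetic and contrasts it with pairing only entities that share a dominion. The grouping in `dominions` is one illustrative reading of the bracketing above (each entity assigned to its innermost dominion, named by its predicate), and the "same dominion" rule is an assumption, not the project's stated heuristic.

```python
from itertools import combinations

entities = ["TRbeta", "RXRalpha", "DR+4 TRE", "SRC-1", "T3"]

# Sentence-level extraction: every unordered pair in the sentence is a candidate.
sentence_pairs = list(combinations(entities, 2))


def pairs_within_groups(groups):
    """Candidate pairs restricted to entities that share a group."""
    pairs = []
    for members in groups.values():
        pairs.extend(combinations(members, 2))
    return pairs


# Hypothetical assignment of each entity to its innermost dominion.
dominions = {
    "heterodimer": ["TRbeta", "RXRalpha"],
    "bound": ["DR+4 TRE"],
    "recruited": ["SRC-1"],
    "dependent": ["T3"],
}

print(len(sentence_pairs))
print(pairs_within_groups(dominions))
```

Under this reading only the TRbeta/RXRalpha pair survives. A heuristic that also links a dominion to the entities of its nested dominions would yield a few more pairs.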
More on Predicates • ABC would have interacted with Z ... • the ABC-binding GD protein ... • Binding the STAT6, ABC causes ... • Ankyrin bound to DNA can repair it • binding the ankyrin repeat region causes ... • the interaction between AB and CD • AB seems to interact with CD
Our Experiment • Manually annotate dominion structure of sentences containing 85 protein pairs • Use the manually tagged structure plus a heuristic to count how often protein pairs (both interacting and not) are in dominions • Compare to sentence-based approach
Results: Finding the 40 Positive Pairs • Recall is CP / (CP + FN) • What percentage of the positives did you find? • Sentence-based system recall is 1 for the sample • Dominion-based system recall is 0.65
Error Analysis • Missed 14 • 8 had nesting problems • X interacted with Y family including Z • 3 were implicit statements • interaction ... corresponding to the second exon of Ad5E1A • 2 hopeless
Results: Finding the 45 Negative Pairs • Sentence-based system gets them all wrong • Dominion-based system gets 41 of 45 (!!!) • eliminates false positives • Precision is CP / (CP + FP) • What percentage of your guesses are correct? • FP is scaled: should have been more negatives • Sentence-based system precision: 12% • Dominion-based system precision: 50%
F-measure • F-measure: 2(prec*recall)/(prec+recall) • rewards balance • Sentence-based system F of 21% • Dominion-based system F of 57%
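As a check on the arithmetic, the reported figures can be reproduced from the formulas on these slides. This is a sketch: the function names are illustrative, and the count of 26 correct positives is inferred from 40 positive pairs minus the 14 missed. Because the slides say FP was scaled, the precision values are taken as reported rather than recomputed from raw counts.

```python
def recall(cp, fn):
    """Recall = CP / (CP + FN)."""
    return cp / (cp + fn)


def precision(cp, fp):
    """Precision = CP / (CP + FP)."""
    return cp / (cp + fp)


def f_measure(prec, rec):
    """F = 2 * prec * rec / (prec + rec); defined as 0 when both are 0."""
    if prec + rec == 0:
        return 0.0
    return 2 * prec * rec / (prec + rec)


# Dominion-based recall: 40 positive pairs, 14 missed, so 26 found.
dominion_recall = recall(cp=26, fn=14)

# Precision values as reported on the slides (FP was scaled).
sentence_f = f_measure(0.12, 1.0)
dominion_f = f_measure(0.50, dominion_recall)

print(round(dominion_recall, 2), round(sentence_f, 2), round(dominion_f, 2))
```

The rounded values agree with the slides: recall 0.65, and F of 21% and 57%.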
Knowledge Sharing • [Diagram: BeedanceServer, with labels “Ankyrins” and “KA”]
Experiment • Seminar class on Molecular Phylogenetics • 8 graduate students in molecular biology and genomics • 8 weeks of lecture • 4 weeks of paper discussions
Data Collection • Class met twice weekly • In each class, one student presented two papers on a topic usually germane to his or her research • A few days before the presentation, the presenter would send out the two papers along with a description of the talk • All students (including the presenter) annotated the papers based on the talk topic
Data • Stacks of hardcopy • Annotations included • Highlighting • Underlining • Circling • Emphasis marks • Margin notes • Back o’ the page notes
Data processing • Scanned hardcopies to PDF (had to return papers to students) • Used MITRE’s Callisto tool to mark up text documents using XML that reflects the annotations users made
Research Question • What is the degree of overlap between researchers reading the same paper for related reasons?
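The slides do not say how overlap is measured. One possible sketch, purely as an assumption, treats each reader's annotations as a set of sentence ids and averages the Jaccard overlap over all pairs of readers. The reader names and sets below are made up for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # nothing annotated by either reader; treated as no overlap
    return len(a & b) / len(a | b)


def mean_pairwise_overlap(annotations):
    """Average Jaccard overlap over all pairs of readers of one paper."""
    pairs = list(combinations(annotations.values(), 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical data: sentence ids each reader marked in the same paper.
annotations = {
    "reader_a": {1, 2, 3},
    "reader_b": {2, 3, 4},
    "reader_c": {3},
}
print(round(mean_pairwise_overlap(annotations), 4))
```

Other choices, such as character-span overlap or weighting by annotation type, would fit the same framing.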