Capturing Human Information Interaction to Manage Scientific Knowledge


Presentation Transcript


  1. Capturing Human Information Interaction to Manage Scientific Knowledge Shannon Bradshaw Department of Management Sciences Tippie College of Business The University of Iowa

  2. Motivation • In Biology, 40,000 new articles are published each month (Fontanelo 2004) • Biologists spend as much as 20% of their time gathering information (Hayes 2004) • For evolutionary biologists, the problem is especially severe • A gene or protein of interest may lead them to organisms (areas of the literature) with which they have no familiarity

  3. Real-world example • In recent wet lab work a graduate student discovered proteins called ankyrins in the organism he was studying • Learning the function of ankyrins and what is known about ankyrin repeats was critical to his research • He performed a time-consuming web/literature search to uncover in what organisms these proteins occur and what is known about them

  4. Beedance • Personal information gathering and management • Text mining • Community-oriented knowledge sharing

  5. Why Beedance?

  6. Personal Information Gathering and Management • Problem: It is difficult for individual biologists to assemble needed information and save it for later review • Approach: Client application that enables users to: • Assemble and organize: • Electronic documents • Annotations • Structured information • Save and search • Share

  7. Text Mining • Problem: • Too many relevant articles to sift through • It is frequently not an entire article that an investigator needs but a sentence or paragraph within that article • Approach: Developing text mining tools that allow the user of the client application to: • Find relevant information faster • With greater coverage • Organize into an easily consumable format

  8. Knowledge sharing/management • Problem: The same information is gathered and reviewed by hundreds and perhaps thousands of people, often by people in the same research laboratory • Approach: Developing a community-managed knowledge sharing service: • Gives users access to information assembled by others with similar needs • Explicit information sharing • Explicit and implicit searches • Framework for studying human information gathering, organization, and sharing behavior

  9. Client Application with Text Mining BeedanceClient

  10. Knowledge Sharing Server [Diagram labels: “Ankyrins” (repeated), KA, BeedanceServer]

  11. The Beedance Project • Co-directors: • Shannon Bradshaw and Marc Light • Contributing faculty: • David Eichmann (SLIS) • Debashish Bhattacharya (Biology) • Students: • Brian Almquist (MS) • Robert Arens (CS) • Nash Lincoln (CS) • Dylan Scott (MS undergrad) • Matthew Smalley (MS) • Hudong Wang (CS)

  12. Client Application

  13. HCI / HII • A significant drawback to current PDF readers is that the reading task must be interrupted when one wants to annotate or navigate. • Exploring ways to blend navigation and annotation with the reading process so that fewer interruptions are necessary. • Brief demo

  14. Client Application

  15. Research Question • Do the annotations of an individual early in a document help identify what he or she will be interested in seeing later in the document?

  16. Automatic Highlighting Experiment • Passage retrieval - sentences • Two types of queries: • the keywords provided • the first highlighted passage • Expanded the query based on 1 definition per query word • Expansion based on multiple definitions per query word

  17. Questions We Asked • How well does a standard retrieval work (okapi)? • What is a better query? • keywords • example passage • Does query expansion based on definitions help? • Do multiple definitions help more?
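
The retrieval setup on slides 16 and 17 can be illustrated with a minimal sketch. It ranks the sentences of a document against a query using one common Okapi BM25 variant and optionally expands the query with definition text. The tokenizer, the parameter values `k1` and `b`, the function names, and the `definitions` dictionary are all assumptions made for illustration; the slides do not state the exact configuration used in the experiment.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics (no stemming; an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def expand_query(query, definitions, max_defs=1):
    """Append up to max_defs definition strings per query word.

    `definitions` is a hypothetical mapping: word -> list of definition strings.
    """
    extra = [d for w in tokenize(query) for d in definitions.get(w, [])[:max_defs]]
    return " ".join([query] + extra)


def bm25_scores(query, sentences, k1=1.2, b=0.75):
    """Score each sentence against the query with a BM25 variant.

    Assumes `sentences` is non-empty.
    """
    docs = [tokenize(s) for s in sentences]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: number of sentences containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


sentences = [
    "Retroelements are mobile genetic elements.",
    "The weather was fine.",
    "Mobile elements copy themselves via RNA.",
]
query = "retroelements mobile"
scores = bm25_scores(query, sentences)
ranking = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)

# Hypothetical definition set for one query word.
definitions = {"retroelements": ["genetic elements that copy themselves via RNA"]}
expanded_scores = bm25_scores(expand_query(query, definitions), sentences)
```

Passing a larger `max_defs` corresponds to the "multiple definitions per query word" condition; using the first highlighted passage as `query` corresponds to the example-passage condition.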

  18. Web Definition Sets

  19. Data • 13 articles that had already been highlighted • a mix of topics, a mix of computer- and paper-based highlighting, 2 biologists • Asked what the information need had been • evolution, coevolution “RNA world”, retroelement, “retroelement ancestor hypothesis”, mobility, mobile

  20. More on the Data

  21. Results MAP

  22. Significance test

  23. Questions “Answered” • Standard retrieval: 0.23 MAP (not good) • Example highlighted regions are better than keywords specifying information need • Definition-based query expansion helps • Multiple definitions help more
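
For readers unfamiliar with the MAP figure quoted above, here is a minimal sketch of mean average precision as it is commonly defined for ranked retrieval. The function names and the toy rankings are illustrative assumptions, not the experiment's data.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision values at each rank where a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    # Relevant items never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranking, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy example: two queries over made-up sentence ids.
runs = [
    ([1, 2, 3, 4], {1, 3}),  # AP = (1/1 + 2/3) / 2
    ([2, 1], {1}),           # AP = 1/2
]
map_score = mean_average_precision(runs)
```

In the highlighting experiment, each ranking would be a list of sentences ordered by retrieval score, and the relevant set would be the sentences the biologist had actually highlighted.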

  24. More Data To Come • Graduate seminar on evolutionary biology • 8 students all marked up the same 16 articles • All hardcopy markup • We have scanned the hardcopy markup • We have ASCII of the articles and are laboriously creating corresponding XML markup (using Callisto (thanks MITRE))

  25. Information extraction • Cell 2003 Jan 24;112(2):169-80 Twist Regulates Cytokine Gene Expression through a Negative Feedback Loop that Represses NF-kappaB Activity. Sosic D, Richardson JA, Yu K, Ornitz DM, Olson EN. During Drosophila embryogenesis, the dorsal transcription factor activates the expression of twist, a transcription factor required for mesoderm formation. We show here that the mammalian twist proteins, twist-1 and -2, are induced by a cytokine signaling pathway that requires the dorsal-related protein RelA, a member of the NF-kappaB family of transcription factors. Twist-1 and -2 repress cytokine gene expression through interaction with RelA. ... PMID: 12553906 [PubMed - in process] • Why: scientists are often interested in only a certain type of fact

  26. Some Relations that would be Useful to Extract

  27. First pass • Sentence-level parsing (low-brow) • Poor performance • Hoping we do not need a full syntactic parse (high-brow) • Going for a “medium-brow” approach

  28. Our Hypothesis • Relation extraction performance can be improved through the use of syntactic information which clusters words by the verbal element on which they depend • Dominion structure is the right amount of syntactic information: not too much, not too little

  29. Dominion Structure • [D We further [P observed] [D that [D a [P heterodimer] of TRbeta and RXRalpha,] either in solution or [D/N [P bound] to a DR+4 TRE,] [P recruited] SRC-1 in a [D/N T3-[P dependent]] manner.] ] • Sentence-level parsing extracts all 10 combinations of TRbeta, RXRalpha, DR+4 TRE, SRC-1, T3 • Dominion parsing would only extract 1 or 2
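
The count of 10 on this slide is the number of unordered pairs among the five entities. The sketch below enumerates them and contrasts that with a dominion-style filter. The assignment of each entity to an innermost predicate is read by hand from the bracketing above, and the "same innermost dominion" rule is only an assumed stand-in for the project's actual heuristic, which the slides do not spell out.

```python
from itertools import combinations

entities = ["TRbeta", "RXRalpha", "DR+4 TRE", "SRC-1", "T3"]

# Sentence-level approach: every pair co-occurring in the sentence is a candidate.
sentence_pairs = list(combinations(entities, 2))

# ASSUMPTION: innermost predicate governing each entity, assigned by hand
# from the bracketed example; not produced by a parser.
innermost_dominion = {
    "TRbeta": "heterodimer",
    "RXRalpha": "heterodimer",
    "DR+4 TRE": "bound",
    "SRC-1": "recruited",
    "T3": "dependent",
}

# ASSUMPTION: keep a pair only if both entities share an innermost dominion.
dominion_pairs = [
    (a, b)
    for a, b in sentence_pairs
    if innermost_dominion[a] == innermost_dominion[b]
]

print(len(sentence_pairs), len(dominion_pairs))  # prints "10 1"
```

Under this simplified rule only the TRbeta and RXRalpha pair survives, which is in line with the slide's estimate of 1 or 2 extracted pairs.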

  30. Semantic Description of Dominions • The rough idea is that verbal dominions correspond to propositions. • When a dominion does not contain any nested propositions, it is a first order proposition. • A dominion that contains dominions corresponds to a second order proposition: a proposition about propositions.

  31. More on Predicates • ABC would have interacted with Z ... • the ABC-binding GD protein ... • Binding the STAT6, ABC causes ... • Ankyrin bound to DNA can repair it • binding the ankyrin repeat region causes ... • the interaction between AB and CD • AB seems to interact with CD

  32. Our Experiment • Manually annotate dominion structure of sentences containing 85 protein pairs • Use the manually tagged structure plus a heuristic to count how often protein pairs (both interacting and not) are in dominions • Compare to sentence-based approach

  33. Results: Finding the 40 Positive Pairs • Recall is CP / (CP + FN) • What percentage of the positives did you find? • Sentence-based system recall is 1 for the sample • Dominion-based system recall is 0.65

  34. Error Analysis • Missed 14 • 8 had nesting problems • X interacted with Y family including Z • 3 were implicit statements • interaction ... corresponding to the second exon of Ad5E1A • 2 hopeless

  35. Results: Finding the 45 Negative Pairs • Sentence-based system gets them all wrong • Dominion-based system gets 41 of 45 (!!!) • eliminates false positives • Precision is CP / (CP + FP) • What percentage of your guesses are correct? • FP is scaled: should have been more negatives • Sentence-based system precision: 12% • Dominion-based system precision: 50%

  36. F-measure • F-measure: 2(prec*recall)/(prec+recall) • rewards balance • Sentence-based system F of 21% • Dominion-based system F of 57%
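
The F-measure figures above can be reproduced from the precision and recall values reported on slides 33 and 35. The short sketch below does the arithmetic; the function name is an illustrative choice.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported on the slides.
sentence_f = f_measure(0.12, 1.0)   # sentence-based: precision 12%, recall 1
dominion_f = f_measure(0.50, 0.65)  # dominion-based: precision 50%, recall 0.65

# Dominion-based recall: 14 of the 40 positive pairs were missed.
dominion_recall = (40 - 14) / 40

print(round(sentence_f, 2), round(dominion_f, 2), dominion_recall)  # prints "0.21 0.57 0.65"
```

Because the harmonic mean is pulled toward the smaller of the two values, the sentence-based system's perfect recall does little to offset its low precision.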

  37. Knowledge Artifact

  38. Knowledge Sharing [Diagram labels: “Ankyrins” (repeated), KA, BeedanceServer]

  39. Experiment • Seminar class on Molecular Phylogenetics • 8 graduate students in molecular biology and genomics • 8 weeks of lecture • 4 weeks of paper discussions

  40. Data Collection • Class met twice weekly • In each class, one student presented two papers on a topic usually germane to his or her research • A few days prior to the presentation, the presenter would send out the two papers along with a description of his or her talk • All students (including the presenter) annotated the papers based on the talk topic

  41. Data • Stacks of hardcopy • Annotations included • Highlighting • Underlining • Circling • Emphasis marks • Margin notes • Back o’ the page notes

  42. Data processing • Scanned hardcopies to PDF (had to return papers to students) • Used MITRE’s Callisto tool to mark up text documents using XML that reflects the annotations users made

  43. Research Question • What is the degree of overlap between researchers reading the same paper for related reasons?
