
The BioText Project

The BioText Project aims to provide fast, flexible, intelligent access to biosciences information by combining text analysis, annotations stored in a database, and an improved search interface. Its work covers recognizing entities and abbreviation definitions, identifying semantic relations among entities, and organizing bioscience journals for search, drawing on computational linguistics and database research.



Presentation Transcript


  1. The BioText Project • Myers Seminar, Sept 22, 2003 • Marti Hearst, Associate Professor, SIMS, UC Berkeley • Project sponsored by NSF DBI-0317510, ARDA AQUAINT, and a gift from Genentech

  2. BioText Project Goals • Provide fast, flexible, intelligent access to information for use in biosciences applications. • Focus on • Textual Information • Tightly integrated with other resources • Ontologies • Record-based databases

  3. People • Project Leaders: • PI: Marti Hearst • Co-PI: Adam Arkin • Computational Linguistics • Barbara Rosario • Preslav Nakov • Database Research • Ariel Schwartz • Gaurav Bhalotia (graduated) • User Interface / Information Retrieval • Kevin Li • Emilia Stoica • Bioscience • Dr. TingTing Zhang

  4. Outline • Main Goals • System Architecture • Apoptosis problem statement • Recent results in • Abbreviation definition recognition • Semantic relation recognition (from text) • Search User Interfaces • Hierarchical grouping of journals

  5. BioText: Main Goals • Sophisticated Text Analysis • Annotations in Database • Improved Search Interface

  6. Recent Result (Schwartz & Hearst 03) • Fast, simple algorithm for recognizing abbreviation definitions. • Simpler and faster than the rest • Higher precision and recall • Idea: Work backwards from the end • Examples: • In eukaryotes, the key to transcriptional regulation of the Heat Shock Response is the Heat Shock Transcription Factor (HSF). • Gcn5-related N-acetyltransferase (GNAT) • Idea: use redundancy across abstracts to figure out abbreviation meaning even when definition is not present.
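The "work backwards from the end" idea on slide 6 can be illustrated with a minimal Python sketch. It scans the abbreviation and the preceding text right to left, matching each abbreviation character against the text, and requires the first abbreviation character to start a word. The function names `find_long_form` and `extract_definitions`, and the candidate window of min(n + 5, 2n) words before the parenthesis, are assumptions of this sketch and are not stated on the slide; it is not presented as the authors' exact implementation.

```python
import re


def find_long_form(short_form, candidate):
    """Match short_form against candidate right to left.

    Returns the matched long form, or None if no match is found.
    """
    s = len(short_form) - 1
    l = len(candidate) - 1
    while s >= 0:
        ch = short_form[s].lower()
        if not ch.isalnum():          # skip punctuation inside the abbreviation
            s -= 1
            continue
        # Move left until the character matches; the first character of the
        # abbreviation must additionally sit at the start of a word.
        while (l >= 0 and candidate[l].lower() != ch) or (
            s == 0 and l > 0 and candidate[l - 1].isalnum()
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    # Extend the match to the beginning of the word it starts in.
    start = candidate.rfind(" ", 0, l + 1) + 1
    return candidate[start:]


def extract_definitions(sentence):
    """Return (short form, long form) pairs for 'long form (SF)' patterns."""
    pairs = []
    for m in re.finditer(r"\(([^()]+)\)", sentence):
        short_form = m.group(1).strip()
        n = sum(c.isalnum() for c in short_form)
        if n < 2:
            continue
        words = sentence[: m.start()].split()
        window = min(n + 5, 2 * n)    # assumed limit on candidate length
        long_form = find_long_form(short_form, " ".join(words[-window:]))
        if long_form:
            pairs.append((short_form, long_form))
    return pairs


sentence = (
    "In eukaryotes, the key to transcriptional regulation of the Heat Shock "
    "Response is the Heat Shock Transcription Factor (HSF)."
)
print(extract_definitions(sentence))
```

Applied to the two examples on the slide, the sketch pairs HSF with "Heat Shock Transcription Factor" and GNAT with "Gcn5-related N-acetyltransferase". The second idea on the slide, using redundancy across abstracts when no definition is present, is not covered here.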

  7. BioText: A Two-Sided Approach • Empirical Computational Linguistics Algorithms • Sophisticated Database Design & Algorithms • Resources shown in the diagram: BLAST, Medline, MeSH, SwissProt, WordNet, GO, Journal Full Text

  8. Apoptosis Network (diagram; node labels listed in extraction order): Death Receptors Signaling; Ca++ Signaling; Effector Caspases (3, 6, 7); Survival Factors Signaling; Genotoxic Stress; Loss of Attachment; Cell Cycle stress, etc.; ER Stress; Initiator Caspases (8, 10); P53 pathway; BH3 only; Bcl-2 like; NFkB; Bax, Bak; Mitochondria; Cytochrome c; Smac; Caspase 12; IAPs; Apaf 1; AIF; Caspase 9; Apoptosis • Slide courtesy TingTing Zhang

  9. The issues (courtesy TingTing Zhang): • The network nodes are deduced from reading and processing of experimental knowledge by experts. Every month >1000 apoptosis papers are published. • The supporting experimental data are gathered in different organs, tissues, cells using various techniques. • There are various levels of uncertainty associated with different techniques used to answer certain questions. • Depending on the expression patterns for the players in the network, the observation may or may not be extended to other contexts. • We need to keep track of ALL the information in order to understand the system better.

  10. Simple cases: • Mouse Bim proteins (isoforms EL, L, S) bind to human Bcl-2 (bacteriophage screening using cDNA expression library from T-lymphoma cell line KO52DA20). • Human BimEL protein is 89% identical to mouse BimEL, and human BimL is 85% identical to mouse BimL (hybridization of mouse bim cDNA to human fetal spleen and peripheral blood cDNA library). • Bim mRNA is detected in B and T lymphoid cells (Northern blot analysis of mouse KO52DA20, WEHI 703, WEHI 707, WEHI7.1, CH1, WEHI231, WEHI415, B6.23.16BW2 cell extracts). • BimL protein interacts with Bcl-2 OR Bcl-XL OR Bcl-w proteins (immuno-precipitation (anti-Bcl-2 OR Bcl-XL OR Bcl-w) followed by Western blot (anti-EEtag) using extracts of human 293T cells co-transfected with EE-tagged BimL AND (bcl-2 OR bcl-XL OR bcl-w) plasmids). • BimL deleted of the BH3 domain does not bind to Bcl-2 OR Bcl-XL OR Bcl-w proteins (under the experimental conditions mentioned above).

  11. Computational Language Goals • Recognizing and annotating entities within textual documents • Identifying semantic relations among entities • To (eventually) be used in tandem with semi-automated reasoning systems.

  12. Main Ideas for NLP Approach • Assign Semantics using • Statistics • Hierarchical Lexical Ontologies to generalize • Redundancy in the data • Build up Layers of Representation • Syntactic and Semantic • Use these in a feedback loop

  13. Computational Linguistics Goals • Mark up text with semantic relations

  14. Recent Result: Descent of Hierarchy • Idea: • Use the top levels of a lexical hierarchy to identify semantic relations • Hypothesis: • A particular semantic relation holds between all 2-word Noun Compounds that can be categorized by a MeSH pair.

  15. Definition • NC: Any sequence of nouns that itself functions as a noun • asthma hospitalizations • health care personnel hand wash • Technical text is rich with NCs: "Open-labeled long-term study of the subcutaneous sumatriptan efficacy and tolerability in acute migraine treatment."

  16. NCs: Three tasks • Identification • Syntactic analysis (attachments) • [Baseline [headache frequency]] • [[Tension headache] patient] • Our Goal: Semantic analysis • Headache treatment → treatment for headache • Corticosteroid treatment → treatment that uses corticosteroid

  17. Main Idea: • Top-level MeSH categories can be used to indicate which relations hold between the nouns of a noun compound • headache recurrence • C23.888.592.612.441 C23.550.291.937 • headache pain • C23.888.592.612.441 G11.561.796.444 • breast cancer cells • A01.236 C04 A11
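To make the category-pair idea concrete, here is a minimal Python sketch that maps each noun of a two-word compound to its MeSH tree number and keeps only the top components. The tree numbers are the ones shown on the slide; the `MESH_TREE` table and the function names `truncate` and `category_pair` are hypothetical stand-ins for a real MeSH lookup, which the sketch does not perform.

```python
# Tree numbers copied from the slide. The dictionary is a hypothetical
# stand-in for a real MeSH lookup.
MESH_TREE = {
    "headache": "C23.888.592.612.441",
    "recurrence": "C23.550.291.937",
    "pain": "G11.561.796.444",
}


def truncate(tree_number, depth):
    """Keep the first `depth` dot-separated components of a tree number."""
    return ".".join(tree_number.split(".")[:depth])


def category_pair(noun_compound, depth=1):
    """Return the (modifier, head) category pair of a two-word compound."""
    modifier, head = noun_compound.split()
    return truncate(MESH_TREE[modifier], depth), truncate(MESH_TREE[head], depth)


print(category_pair("headache recurrence"))           # top-level pair
print(category_pair("headache pain"))                 # top-level pair
print(category_pair("headache recurrence", depth=2))  # one level deeper
```

At depth 1 the two compounds fall under the pairs (C23, C23) and (C23, G11), which is the level at which the hypothesis on slide 14 assigns a single semantic relation to every compound in the pair.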

  18. Linguistic Motivation Can cast NC into head-modifier relation, and assume head noun has an argument and qualia structure. • (used-in): kitchen knife • (made-of): steel knife • (instrument-for): carving knife • (used-on): putty knife • (used-by): butcher’s knife

  19. Distribution of Frequent Category Pairs

  20. How Far to Descend? • Anatomy: 250 CPs • 187 (75%) remain first level • 56 (22%) descend one level • 7 (3%) descend two levels • Natural Science (H01): 21 CPs • 1 (4%) remain first level • 8 (39%) descend one level • 12 (57%) descend two levels • Neoplasm (C04) 3 CPs: • 3 (100%) descend one level
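One way to picture the descent is the following Python sketch: compounds are grouped by category pair at the first level, and a group is split one level further only when its members do not share a single relation. This is a simplified reading of the slide, not the authors' procedure. It descends both sides of the pair at once, and the tree numbers and relation labels in `examples` are hypothetical illustration data, not results from the project.

```python
from collections import Counter, defaultdict


def truncate(tree_number, depth):
    """Keep the first `depth` dot-separated components of a tree number."""
    return ".".join(tree_number.split(".")[:depth])


def descend(examples, depth=1, max_depth=3):
    """Build {category pair: relation} rules from labeled compounds.

    examples: list of (modifier_tree, head_tree, relation) tuples.
    depth=1 is the first level; max_depth=3 allows descending two levels.
    """
    groups = defaultdict(list)
    for mod, head, rel in examples:
        pair = (truncate(mod, depth), truncate(head, depth))
        groups[pair].append((mod, head, rel))

    rules = {}
    for pair, members in groups.items():
        relations = Counter(rel for _, _, rel in members)
        if len(relations) == 1 or depth == max_depth:
            # Homogeneous group (or no further descent allowed): emit a rule
            # using the most frequent relation.
            rules[pair] = relations.most_common(1)[0][0]
        else:
            # Mixed relations: descend one level for this group only.
            rules.update(descend(members, depth + 1, max_depth))
    return rules


# Hypothetical labeled compounds (tree numbers and labels are made up).
examples = [
    ("A01.236", "C04.588", "rel-1"),
    ("A01.456", "C04.557", "rel-1"),
    ("H01.158", "H01.770", "rel-2"),
    ("H01.158", "H01.671", "rel-3"),
]
rules = descend(examples)
print(rules)
```

In this toy input the (A01, C04) pair remains at the first level, while the (H01, H01) pair has to descend one level before each group carries a single relation, mirroring the pattern the slide reports for Anatomy versus Natural Science.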

  21. Evaluation • Apply the rules to a test set • Accuracy: • Anatomy: 91% accurate • Natural Science: 79% • Diseases: 100% • Total: • 89.6% via intra-category averaging • 90.8% via extra-category averaging

  22. Summary of NC Work • Lexical hierarchy useful for inferring semantic relations • Works because semantics are constrained and word sense ambiguity is not too much of a problem • Can it be extended to other types of relations? • Preliminary results on one set of relations are promising.

  23. Database Research Issues • Efficiently and effectively combining • Relational databases & Text • Hierarchical Ontologies • Layers of Annotations

  24. Interface Issues • Create intuitive, appealing interfaces that are better than what’s currently out there. • Start with existing assigned metadata • As text analysis improves, incorporate the results into the interface.

  25. Some Recent Work • Organizing BioScience Journal Names • Currently there are > 3500

  26. Some Recent Work • Organizing BioScience Journal Names • Currently there are > 3500 • Idea: • Group them into faceted hierarchies semi-automatically • Using clustering of title terms, synonym similarity via WordNet, and other techniques
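As a rough illustration of grouping journal names by their title terms, the Python sketch below clusters titles greedily using Jaccard overlap of their content words. Everything in it is an assumption made for illustration: the stopword list, the 0.3 threshold, the function names, and the sample titles. It produces a flat grouping only and leaves out the WordNet synonym similarity and the faceted hierarchy mentioned on the slide.

```python
# Assumed stopword list for journal titles (illustrative only).
STOPWORDS = {"journal", "of", "the", "and", "in", "international", "annals"}


def title_terms(title):
    """Lower-cased content words of a journal title."""
    words = title.lower().replace("&", " ").split()
    return {w for w in words if w not in STOPWORDS}


def jaccard(a, b):
    """Jaccard similarity of two term sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_titles(titles, threshold=0.3):
    """Greedy single-pass grouping: join the most similar existing group
    if it is similar enough, otherwise start a new group."""
    groups = []  # each entry: (set of terms seen in the group, list of titles)
    for title in titles:
        terms = title_terms(title)
        best = max(groups, key=lambda g: jaccard(terms, g[0]), default=None)
        if best is not None and jaccard(terms, best[0]) >= threshold:
            best[0].update(terms)
            best[1].append(title)
        else:
            groups.append((set(terms), [title]))
    return [members for _, members in groups]


titles = [
    "Journal of Cell Biology",
    "Cell Biology International",
    "Journal of Virology",
    "Virology Journal",
]
for group in group_titles(titles):
    print(group)
```

A semi-automatic workflow along the lines of the slide would presumably hand groups like these to a person for review and naming; that step is outside this sketch.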

  27. Summary • BioText aims to improve access to bioscience information via • Sophisticated language analysis • Integration of results into • Annotated database • Flexible user interface • Eventual goal • Semi-automated mining and discovery

  28. There’s lots to do! • For more information: biotext.berkeley.edu
