
BeeSpace Informatics Research

BeeSpace Informatics Research. ChengXiang (“Cheng”) Zhai, Department of Computer Science, Institute for Genomic Biology, Statistics, Graduate School of Library & Information Science, University of Illinois at Urbana-Champaign. BeeSpace Workshop, May 22, 2009.


Presentation Transcript


  1. BeeSpace Informatics Research. ChengXiang (“Cheng”) Zhai, Department of Computer Science, Institute for Genomic Biology, Statistics, Graduate School of Library & Information Science, University of Illinois at Urbana-Champaign. BeeSpace Workshop, May 22, 2009

  2. Goal of Informatics Research • Develop general and scalable computational methods to enable • Semantic integration of data and information • Effective information access and exploration • Knowledge discovery • Hypothesis formulation and testing • Reinforcement of research in biology and computer science • CS research to automate manual tasks of biologists • Biology research to raise new challenges for CS

  3. Overview of BeeSpace Technology Users Knowledge Discovery & Hypothesis Testing Question Answering Gene Summarizer Function Annotator Information Access & Exploration Space/Region Manager, Navigation Support Text Miner Search Engine Relational Database Words/Phrases Entities Relations Content Analysis Natural Language Understanding Meta Data Literature Text

  4. Informatics Research Accomplishments Users Knowledge Discovery & Hypothesis Test Question Answering Automatic Function Annotation [He et al. 09/10] Entity/Gene Summarization [Ling et al. 06], [Ling et al. 07], [Ling et al. 08] Gene Summarizer Function Annotator Topic discovery and interpretation [Mei et al. 06a], [Mei et al. 07a], [Mei et al. 07b], [Chee & Schatz 08] Information Access & Exploration Space/Region Manager, Navigation Support Biomedical information retrieval [Jiang & Zhai 07], [Lu et al. 08] Text Miner Search Engine Relational Database Words/Phrases Entities Relations Content Analysis Entity/Relation extraction [Jiang & Zhai 06], [Jiang & Zhai 07a], [Jiang & Zhai 07b] Natural Language Understanding Meta Data Literature Text

  5. Overview of BeeSpace Technology Users Knowledge Discovery & Hypothesis Testing Part 3. Entity Summarization Part 4. Function Analysis Question Answering Gene Summarizer Function Annotator Information Access & Exploration Space/Region Manager, Navigation Support Part 2. Navigation Support Text Miner Search Engine Relational Database Words/Phrases Entities Relations Content Analysis Part 1. Information Extraction Natural Language Understanding Meta Data Literature Text

  6. Part 1. Information Extraction

  7. Natural Language Understanding [Figure: the sentence below annotated with NP and VP phrase labels and two Gene tags] …We have cloned and sequenced a cDNA encoding Apis mellifera ultraspiracle (AMUSP) and examined its responses to …

  8. Entity & Relation Extraction Lopes FJ et al., 2005 J. Theor. Biol. Genetic Interaction Expression Location …

  9. General Approach: Machine Learning • Computers learn from labeled examples to compute a function to predict labels of new examples • Examples of predictions • Given a phrase, predict whether it is a gene name • Given a sentence with two gene names mentioned, predict whether there is a genetic interaction relation • Many learning methods are available, but training data isn’t always available

  10. Extraction Example 1: Gene Name Recognition Gene? … expression of terminal gap genes is mediated by the local activation of the Torso receptor tyrosine kinase (Tor). At the anterior, terminal gap genes are also activated by the Tor pathway but Bcd contributes to their activation. Gene? Gene?

  11. Features for Recognizing Genes • Syntactic clues: • Capitalization (especially acronyms) • Numbers (gene families) • Punctuation: -, /, :, etc. • Contextual clues: • Local: surrounding words such as “gene”, “encoding”, “regulation”, “expressed”, etc. • Global: same noun phrase occurs several times in the same article
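The clues listed on this slide can be turned into binary features for a candidate token. The following is a minimal sketch, not the BeeSpace implementation: the function name `gene_features`, the window size, and the exact feature names are assumptions made for illustration, and the "global" clue is approximated by counting repeats within the token list passed in.

```python
import re

# Cue words taken from the slide's "local contextual clues" bullet.
CONTEXT_CUES = {"gene", "encoding", "regulation", "expressed"}

def gene_features(tokens, i, window=2):
    """Return binary clue features for the candidate token at position i.

    `tokens` is assumed to hold the tokens of one article (or sentence).
    """
    tok = tokens[i]
    lowered = [t.lower() for t in tokens]
    neighbors = lowered[max(0, i - window):i] + lowered[i + 1:i + 1 + window]
    return {
        # Syntactic clues
        "starts_capital": tok[:1].isupper(),
        "is_acronym": len(tok) > 1 and tok.isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "has_punct": bool(re.search(r"[-/:]", tok)),
        # Contextual clues
        "cue_word_nearby": any(n in CONTEXT_CUES for n in neighbors),
        "repeated_in_doc": lowered.count(tok.lower()) > 1,
    }
```

For example, `gene_features("that CD38 is expressed by both neurons".split(), 1)` flags the capitalization, digit, and nearby-cue-word features for the token CD38.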

  12. Maximum Entropy Model for Gene Tagging • Given an observation (a token or a noun phrase), together with its context, denoted as x • Predict y ∈ {gene, non-gene} • Maximum entropy model: $P(y|x) = K \exp\left(\sum_i \lambda_i f_i(x, y)\right)$ • Typical f: • y = gene & candidate phrase starts with a capital letter • y = gene & candidate phrase contains digits • Estimate $\lambda_i$ with training data
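To make the model concrete, here is a small sketch that computes P(y|x) as a normalized exponential of a weighted sum of feature functions, using the two "typical f" features from the slide. The weights are hypothetical placeholders (in practice they would be estimated from training data), and the names `f_capital`, `f_digit`, and `p_label` are illustrative only.

```python
import math

LABELS = ("gene", "non-gene")

def f_capital(x, y):
    # Fires when y = gene and the candidate phrase starts with a capital letter.
    return 1.0 if y == "gene" and x["phrase"][:1].isupper() else 0.0

def f_digit(x, y):
    # Fires when y = gene and the candidate phrase contains digits.
    return 1.0 if y == "gene" and any(c.isdigit() for c in x["phrase"]) else 0.0

FEATURES = [f_capital, f_digit]
WEIGHTS = [1.2, 0.8]  # hypothetical values, not learned from data

def p_label(x):
    """Return P(y|x) for each label: exp(weighted feature sum), normalized over labels."""
    scores = {
        y: math.exp(sum(w * f(x, y) for w, f in zip(WEIGHTS, FEATURES)))
        for y in LABELS
    }
    z = sum(scores.values())  # 1/z plays the role of the constant K
    return {y: s / z for y, s in scores.items()}
```

With these placeholder weights, a phrase such as "CD38" fires both features and receives a higher gene probability than a lowercase word with no digits, for which both labels are equally likely.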

  13. Special Challenges • Gene name disambiguation • Domain adaptation

  14. Gene Name Disambiguation • Gene names can be common English words: for (foraging), in (inturned), similar (sima), yellow (y), black (b)… • Solution: • Disambiguate by looking at the context of the candidate word • Train a classifier
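One simple way to "look at the context of the candidate word" is a linear score over neighboring words, where positive scores suggest a gene mention. The sketch below assumes such a scorer; the word weights and bias are invented for illustration and are not the discriminative neighbor words or scores from the talk.

```python
# Hypothetical neighbor-word weights; a trained classifier would supply these.
HYPOTHETICAL_WEIGHTS = {
    "gene": 2.0, "mutants": 1.5, "encodes": 1.5, "flies": -1.0,
}

def context_score(tokens, i, window=2, weights=HYPOTHETICAL_WEIGHTS, bias=-0.5):
    """Sum the weights of words within `window` positions of tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return bias + sum(weights.get(t.lower(), 0.0) for t in left + right)

def is_gene_mention(tokens, i, **kwargs):
    """Classify the candidate at position i as a gene mention if its score is positive."""
    return context_score(tokens, i, **kwargs) > 0
```

In the phrase "a function for the for gene in sensory", the second "for" sits next to "gene" and scores positive, while the first "for" has no supporting neighbors within the window and scores negative.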

  15. Discriminative Neighbor Words

  16. Sample Disambiguation Results the cuticular melanization phenotype of black flies is rescued by beta-alanine but -2.780 beta-alanine production by aspartate decarboxylation was reported to be normal in assays of black mutants and although … +9.759 “black” ... affect complex behaviors such as locomotion and foraging. The foraging -1.468 +3.359 (for) gene encodes a pkg in drosophila melanogaster here we demonstrate a +5.497 function for the for gene in sensory responsiveness and … -0.582 +5.980 “foraging”, “for”

  17. Problem of Domain Overfitting ideal setting gene name recognizer 54.1% wingless daughterless eyeless apexless … fly realistic setting gene name recognizer 28.1%

  18. Solution: Learn Generalizable Features …decapentaplegic and wingless are expressed in analogous patterns in each primordium of… …that CD38 is expressed by both neurons and glial cells…that PABPC5 is expressed in fetal brain and in a range of adult tissues. • Generalizable Feature: “w+2 = expressed”
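A feature like "w+2 = expressed" is a positional context feature: the word two positions to the right of the candidate. A minimal sketch of how such features could be generated follows; the function name and the default offsets are assumptions.

```python
def positional_features(tokens, i, offsets=(-2, -1, 1, 2)):
    """Return features of the form 'w+2=expressed' for words around tokens[i]."""
    feats = []
    for k in offsets:
        j = i + k
        if 0 <= j < len(tokens):
            # The '+d' format spec always prints the sign, giving w-1, w+2, etc.
            feats.append(f"w{k:+d}={tokens[j].lower()}")
    return feats
```

Applied to the candidates CD38 and PABPC5 in the slide's example sentences, both yield the same feature `w+2=expressed`, even though the gene names themselves share no surface pattern.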

  19. Generalizability-Based Feature Ranking [Figure: features such as “-less” and “expressed” appear at different ranks (1 to 8) in several ranked lists built from the training data; the lists are merged into a single ranking, with scores 0.125 and 0.167 shown]
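The two scores on this slide, 0.125 and 0.167, equal 1/8 and 1/6, which is consistent with scoring a feature by the reciprocal of its worst rank across training domains. The sketch below assumes that scoring rule; the rule is an inference from the slide, and the per-domain ranks used here are hypothetical.

```python
def generalizability(ranks_by_domain):
    """Score each feature by 1 / (its worst, i.e. largest, rank across domains).

    `ranks_by_domain` is a list of dicts mapping feature -> rank (1 = best),
    one dict per training domain. Only features ranked in every domain are scored.
    """
    features = set.intersection(*(set(r) for r in ranks_by_domain))
    return {f: 1.0 / max(r[f] for r in ranks_by_domain) for f in features}

def rank_features(ranks_by_domain):
    """Return features ordered from most to least generalizable."""
    scores = generalizability(ranks_by_domain)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical per-domain ranks for two features.
ranks = [
    {"-less": 3, "expressed": 6},
    {"-less": 8, "expressed": 4},
    {"-less": 7, "expressed": 5},
]
```

Under this rule a domain-specific suffix such as "-less" is penalized for ranking poorly in even one domain, while a feature that is moderately useful everywhere, such as "expressed", comes out ahead.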

  20. Effectiveness of Domain Adaptation standard learning gene name recognizer Yeast Fly + Mouse 63.3% domain adaptive learning gene name recognizer Yeast Fly + Mouse 75.9%

  21. More Results on Domain Adaptation • Text data from BioCreAtIvE (Medline) • 3 organisms (Fly, Mouse, Yeast)

  22. Extraction Example 2: Genetic Interaction Relation Gene Is there a genetic interaction relation here? Bcd regulates the expression of the maternal and zygotic gene hunchback (hb) that shows a step-like-function expression pattern, in the anterior half of the egg. Gene

  23. Challenges No/little training data What features to use?

  24. Solution: Pseudo Training Data Gene: Bcd + These results uncovered an antagonism between hunchback and bicoid at the anterior pole, whereas the two genes are known to act in concert for most anterior segmented development.
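Pseudo training data can be generated by labeling sentences automatically instead of by hand. The sketch below assumes one possible scheme: a sentence that mentions two genes is labeled positive when that pair appears in a list of already-known interactions. The talk does not spell out its labeling procedure, so the function, its arguments, and the idea of a known-pair list are all assumptions.

```python
from itertools import combinations

def pseudo_label(sentences, gene_names, known_pairs):
    """Build (sentence, gene_a, gene_b, label) examples without manual labeling.

    label = 1 if the gene pair is in `known_pairs` (assumed to come from an
    existing resource), otherwise 0.
    """
    known = {frozenset(p) for p in known_pairs}
    examples = []
    for s in sentences:
        words = set(s.lower().replace(",", " ").replace(".", " ").split())
        mentioned = sorted(g for g in gene_names if g in words)
        for a, b in combinations(mentioned, 2):
            label = 1 if frozenset((a, b)) in known else 0
            examples.append((s, a, b, label))
    return examples
```

The resulting examples are noisy, since co-mention of a known pair does not guarantee that the sentence itself describes the interaction, but they can be fed to any standard classifier.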

  25. Pseudo Training Data Works Reasonably Well Precision Using all features works the best Recall

  26. Large-Scale Entity/Relation Extraction Entity annotation Relation extraction

  27. Part 2: Semantic Navigation

  28. Intersection, Union,… Fly Rover Bird Singing EXTRACT EXTRACT MAP MAP SWITCHING Intersection, Union,… Space-Region Navigation … Topic Regions My Regions/Topics Bee Forager … Bee Bird Fly My Spaces Behavior Literature Spaces

  29. General Approach: Language Models • Topic = word distribution • Modeling text in a space with mixture models of multinomial distributions • Text Mining = Parameter Estimation + Inferences • Matching = Compute similarity between word distributions • Users can “control” a model by specifying topic preferences

  30. A Sample Topic & Corresponding Space Word Distribution (language model) Meaningful labels labels actin filaments flight muscle flight muscles filaments 0.0410238 muscle 0.0327107 actin 0.0287701 z 0.0221623 filament 0.0169888 myosin 0.0153909 thick 0.00968766 thin 0.00926895 sections 0.00924286 er 0.00890264 band 0.00802833 muscles 0.00789018 antibodies 0.00736094 myofibrils 0.00688588 flight 0.00670859 images 0.00649626 Example documents • actin filaments in honeybee-flight muscle move collectively • arrangement of filaments and cross-links in the bee flight muscle z disk by image analysis of oblique sections • identification of a connecting filament protein in insect fibrillar flight muscle • the invertebrate myosin filament subfilament arrangement of the solid filaments of insect flight muscles • structure of thick filaments from insect flight muscle

  31. MAP: Topic/RegionSpace • MAP: Use the topic/region description as a query to search a given space • Retrieval algorithm: • Query word distribution: p(w|Q) • Document word distribution: p(w|D) • Score a document based on similarity of Q and D • Leverage existing retrieval toolkits: Lemur/Indri
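Scoring a document by the similarity of p(w|Q) and p(w|D) can be sketched with a cross-entropy score, which ranks documents the same way as negative KL divergence from the query model. This is a toy illustration in plain Python, not the Lemur/Indri implementation; the linear interpolation smoothing and the parameter `lam` are assumptions.

```python
import math
from collections import Counter

def word_dist(text):
    """Maximum likelihood word distribution of a text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def score(query_dist, doc_text, background, lam=0.5):
    """Sum over query words of p(w|Q) * log p(w|D), with p(w|D) smoothed.

    Ranking by this score is equivalent to ranking by -KL(Q || D).
    """
    doc = word_dist(doc_text)
    s = 0.0
    for w, pq in query_dist.items():
        pd = (1 - lam) * doc.get(w, 0.0) + lam * background.get(w, 1e-9)
        s += pq * math.log(pd)
    return s

def rank(query_dist, docs):
    """Order documents in a space by similarity to the topic/region distribution."""
    background = word_dist(" ".join(docs))
    return sorted(docs, key=lambda d: score(query_dist, d, background), reverse=True)
```

Because a topic is itself a word distribution, it can be passed directly as `query_dist`, which is what lets a topic or region act as a query against a different space.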

  32. EXTRACT: Space Topic/Region • Assume k topics, each being represented by a word distribution • Use a k-component mixture model to fit the documents in a given space (EM algorithm) • The estimated k component word distributions are taken as k topic regions Likelihood: Maximum likelihood estimator: Bayesian estimator:
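Fitting a k-component mixture with EM can be sketched as follows. This assumes a PLSA-style model with per-document mixing weights, which is one common choice and may differ from the exact model in the talk. The optional `priors` and `mu` arguments add pseudo-counts to the M-step, a standard way to obtain a Bayesian (MAP) estimate from user-specified topic preferences; all names here are illustrative, and documents are assumed to be non-empty.

```python
import random
from collections import Counter

def fit_topics(docs, k, iters=50, priors=None, mu=0.0, seed=0):
    """Fit k topic word distributions to docs with EM.

    priors: optional list of k dicts (word -> prior probability);
    mu: strength of the prior in pseudo-counts (mu = 0 gives maximum likelihood).
    Returns (topics, mix): p(w | topic j) and per-document topic weights.
    """
    rng = random.Random(seed)
    counts = [Counter(d.lower().split()) for d in docs]
    vocab = sorted({w for c in counts for w in c})
    topics = []
    for _ in range(k):  # random initialization of each topic
        raw = {w: rng.random() + 0.1 for w in vocab}
        z = sum(raw.values())
        topics.append({w: v / z for w, v in raw.items()})
    mix = [[1.0 / k] * k for _ in docs]
    priors = priors or [{} for _ in range(k)]
    for _ in range(iters):
        # Accumulators start from the prior pseudo-counts (all zero when mu == 0).
        acc_t = [{w: mu * priors[j].get(w, 0.0) for w in vocab} for j in range(k)]
        acc_m = [[0.0] * k for _ in docs]
        for d, c in enumerate(counts):
            for w, n in c.items():
                # E-step: posterior over topics for word w in document d.
                post = [mix[d][j] * topics[j][w] for j in range(k)]
                z = sum(post)
                for j in range(k):
                    r = n * post[j] / z
                    acc_t[j][w] += r
                    acc_m[d][j] += r
        # M-step: renormalize the expected counts.
        topics = []
        for t in acc_t:
            total = sum(t.values()) or 1.0
            topics.append({w: v / total for w, v in t.items()})
        mix = [[v / sum(row) for v in row] for row in acc_m]
    return topics, mix
```

Passing something like `priors=[{"labor": 0.5, "division": 0.5}, {}]` with a positive `mu` pulls the first topic toward those words, which mirrors the user-controlled exploration idea of steering a topic with a prior.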

  33. User-Controlled Exploration: Sample Topic 1 age 0.0672687 division 0.0551497 labor 0.052136 colony 0.038305 foraging 0.0357817 foragers 0.0236658 workers 0.0191248 task 0.0190672 behavioral 0.0189017 behavior 0.0168805 older 0.0143466 tasks 0.013823 old 0.011839 individual 0.0114329 ages 0.0102134 young 0.00985875 genotypic 0.00963096 social 0.00883439 Prior: labor 0.2 division 0.2

  34. User-Controlled Exploration: Sample Topic 2 behavioral 0.110674 age 0.0789419 maturation 0.057956 task 0.0318285 division 0.0312101 labor 0.0293371 workers 0.0222682 colony 0.0199028 social 0.0188699 behavior 0.0171008 performance 0.0117176 foragers 0.0110682 genotypic 0.0106029 differences 0.0103761 polyethism 0.00904816 older 0.00808171 plasticity 0.00804363 changes 0.00794045 Prior: behavioral 0.2 maturation 0.2

  35. Exploit Prior for Concept Switching foraging 0.142473 foragers 0.0582921 forage 0.0557498 food 0.0393453 nectar 0.03217 colony 0.019416 source 0.0153349 hive 0.0151726 dance 0.013336 forager 0.0127668 information 0.0117961 feeder 0.010944 rate 0.0104752 recruitment 0.00870751 individual 0.0086414 reward 0.00810706 flower 0.00800705 dancing 0.00794827 behavior 0.00789228 foraging 0.290076 nectar 0.114508 food 0.106655 forage 0.0734919 colony 0.0660329 pollen 0.0427706 flower 0.0400582 sucrose 0.0334728 source 0.0319787 behavior 0.0283774 individual 0.028029 rate 0.0242806 recruitment 0.0200597 time 0.0197362 reward 0.0196271 task 0.0182461 sitter 0.00604067 rover 0.00582791 rovers 0.00306051

  36. Part 3: Entity Summarization

  37. Multi-Aspect Gene Summary Gene product Expression Sequence Interactions Mutations General Functions Automated Gene Summarization?

  38. A Two-Stage Approach

  39. Text Summary of Gene Abl

  40. General Entity Summarizer • Task: Given any entity and k aspects to summarize, generate a semi-structured summary • Assumption: Training sentences available for each aspect • Method: • Train a recognizer for each aspect • Given an entity, retrieve sentences relevant to the entity • Classify each sentence into one of the k aspects • Choose the best sentences in each category
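The four method steps can be sketched as a small pipeline. In this illustration the trained aspect recognizers are replaced by hypothetical keyword sets, and sentence retrieval is a plain substring match; both are stand-ins chosen to keep the example self-contained, not the method used in the talk.

```python
# Hypothetical stand-in for one trained recognizer per aspect.
ASPECT_KEYWORDS = {
    "Expression": {"expressed", "expression"},
    "Interactions": {"interacts", "binds", "regulates"},
    "Mutations": {"mutant", "mutants", "allele"},
}

def aspect_scores(sentence):
    """Score a sentence against each aspect by counting matching keywords."""
    words = set(sentence.lower().replace(".", " ").replace(",", " ").split())
    return {a: len(words & kws) for a, kws in ASPECT_KEYWORDS.items()}

def summarize(entity, sentences, per_aspect=1):
    """Return a semi-structured summary: aspect -> best sentences for the entity."""
    # Step 2: retrieve sentences relevant to the entity (naive substring match).
    relevant = [s for s in sentences if entity.lower() in s.lower()]
    buckets = {a: [] for a in ASPECT_KEYWORDS}
    for s in relevant:
        # Step 3: assign each sentence to its best-scoring aspect.
        scores = aspect_scores(s)
        best = max(scores, key=scores.get)
        if scores[best] > 0:
            buckets[best].append((scores[best], s))
    # Step 4: keep the top sentences in each category.
    return {
        a: [s for _, s in sorted(items, key=lambda t: -t[0])[:per_aspect]]
        for a, items in buckets.items()
    }
```

Swapping `aspect_scores` for real classifiers trained on per-aspect sentences would leave the rest of the pipeline unchanged, which is what makes the approach general across entity types.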

  41. Further Generalizations • Task: Given any entity and k pre-specified aspects to summarize, generate a semi-structured summary • Assumption: Training sentences available for each aspect • Method: • Train a recognizer for each aspect • Given an entity, retrieve sentences relevant to the entity • Classify each sentence into one of the k aspects • Choose the best sentences in each category New method based on mixture model and regularized optimization

  42. Part 4. Function Analysis

  43. Annotating Gene Lists: GO Terms vs. Literature Mining • Limitations of GO annotations: • Labor-intensive • Limited Coverage • Literature Mining: • Automatic • Flexible exploration in the entire literature space

  44. Bcd Cad Tll For any term: test its significance Segmentation 56.0 Pattern 34.2 Cell_cycle 25.6 Development 22.1 Regulation 20.4 … For any gene: retrieve its relevant documents Bcd Cad … Tll Interactive analysis … Gene group Enriched concepts Entrez Gene Document sets Overview of Gene List Annotator

  45. Intuition for Literature-based Annotation

  46. Likelihood Ratio Test with 2-Poisson Mixture Model Reference distribution: Poisson(λ0;d) Dataset distribution: Poisson(λ;d)
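The idea of comparing a dataset distribution against a reference distribution can be illustrated with a simplified likelihood ratio test. The sketch below assumes a single Poisson rate per term, scaled by text length, rather than the 2-Poisson mixture named on the slide; the function names, the smoothing constant, and the threshold (3.84, the 95% critical value of a chi-square with one degree of freedom) are assumptions.

```python
import math

def poisson_lr_statistic(term_count, total_length, ref_rate):
    """2 * log likelihood ratio of H1 (rate = term_count / total_length)
    against H0 (rate = ref_rate) for a Poisson count with exposure total_length."""
    rate = term_count / total_length
    log_term = term_count * math.log(rate / ref_rate) if term_count > 0 else 0.0
    return 2.0 * (log_term - (rate - ref_rate) * total_length)

def enriched_terms(dataset_counts, dataset_length, ref_counts, ref_length,
                   threshold=3.84):
    """Return (term, statistic) pairs for terms over-represented in the dataset."""
    results = []
    for term, c in dataset_counts.items():
        # Light smoothing keeps the reference rate above zero for unseen terms.
        ref_rate = (ref_counts.get(term, 0) + 0.5) / ref_length
        stat = poisson_lr_statistic(c, dataset_length, ref_rate)
        if c / dataset_length > ref_rate and stat > threshold:
            results.append((term, stat))
    return sorted(results, key=lambda t: -t[1])
```

Here `dataset_counts` would hold term counts from the documents retrieved for a gene list and `ref_counts` the counts from the whole literature collection; terms are then listed in decreasing order of the test statistic.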

  47. Agreement with GO-based Method • Gene List: 93 genes up-regulated by the manganese treatment

  48. Discovering Novel Themes • Gene List: 69 genes up-regulated by the methoprene treatment

  49. Summary Users Knowledge Discovery & Hypothesis Testing Part 3. Entity Summarization Part 4. Function Analysis Machine Learning + Language Models + Minimum Human Effort General and scalable, but there’s room for deeper semantics Question Answering Gene Summarizer Function Annotator Information Access & Exploration Space/Region Manager, Navigation Support Part 2. Navigation Support Text Miner Search Engine Relational Database Words/Phrases Entities Relations Content Analysis Part 1. Information Extraction Natural Language Understanding Meta Data Literature Text

  50. Looking Ahead… • Knowledge integration, inferences • Support for hypothesis formulation and testing
