Automated Gene Summary: Let the Computer Summarize the Knowledge

Xu Ling, Department of Computer Science, University of Illinois at Urbana-Champaign

The Reality of Scientific Literature • Hard to keep up with manual curation!

Automated Gene Summarization • Gene summary organized by aspect: Gene product, Expression, Sequence, Interactions, Mutations, General Functions

Goal • To retrieve and summarize all the knowledge about a particular gene from the literature • Compressing knowledge: enables biologists to quickly understand the target gene. • Automated curation: explicitly covers multiple aspects of a gene, such as the sequence information, mutant phenotypes etc.

Our Solution • Semi-structured summary on multiple aspects: gene products, expression pattern, sequence information, phenotypic information, genetic/physical interactions, … • 2-stage summarization: (1) retrieve relevant articles by gene name search; (2) extract the most informative and relevant sentences for each aspect.

Text Summary of Gene Abl

System Overview: 2-stage • Gene name recognition • Sentence categorization

Gene Name Recognition • v1: Dictionary-based string match • High recall, low precision • v2: Machine learning methods for gene name recognition • High precision, low recall • v3: v2 + dictionary-based synonym expansion • Improved both recall and precision
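The v1 dictionary-based string match can be sketched roughly as follows. This is only an illustration, not the system's actual code: the `SYNONYMS` table, its entries, and the function names are assumptions made for the example.

```python
import re

# Hypothetical synonym dictionary (illustrative entries only):
# canonical gene symbol -> surface forms to look for in text.
SYNONYMS = {
    "Abl": ["Abl", "Abelson", "Abl tyrosine kinase"],
    "for": ["for", "foraging"],
}

def build_pattern(names):
    """Compile one whole-word regex matching any of the given names."""
    # Longest names first so multi-word forms are tried before their prefixes.
    alternatives = sorted(names, key=len, reverse=True)
    joined = "|".join(re.escape(name) for name in alternatives)
    return re.compile(r"\b(?:" + joined + r")\b")

def find_gene_mentions(sentence, synonyms=SYNONYMS):
    """Return the set of canonical symbols with a name found in the sentence."""
    hits = set()
    for gene, names in synonyms.items():
        if build_pattern(names).search(sentence):
            hits.add(gene)
    return hits

# The ordinary English word "for" is matched as if it were the gene symbol,
# which is the kind of false positive behind "high recall, low precision".
mentions = find_gene_mentions("The Abelson kinase is required for axon guidance.")
```

A plain string match like this finds every listed surface form but cannot tell a gene symbol from an ordinary word, which is one plausible reason to combine a dictionary with a learned recognizer as in v3.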

Categorization of Retrieved Sentences • Collect “example sentences” from FlyBase • v1: apply a vector space model to construct an aspect “profile” • v2: apply probabilistic models to factor out context-specific language • v3: v2 + biologist-labeled training examples. Real sentences! Many thanks to Susan Brown’s “Beetle group” for their help!

Example 1

Example 2

Gene Summary in BeeSpace v4 • To add

General Entity Summarization • General and applicable to summarize other entities: pathways, protein family, … • General settings: • Space: A set of documents to be summarized. • Aspects: A set of aspects to define the structure of the summary. • Examples: Training sentences for each aspect.

Further Generalization … • Limitations of the categorization problem with training examples • Predefined aspects may not fit the needs of a particular user • Only works for a predefined domain and topics • Training examples for each aspect are often unavailable • More Realistic New Setup • Allow a user to flexibly describe each facet with 1-2 keywords: let users determine what they want • Generate the summary in a semi-supervised way: no need for training examples
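The slides do not say how the semi-supervised generation works. As a rough picture of the new setup's input and output only, here is a naive keyword-overlap baseline; the function name, the facet names, and the assignment rule are all assumptions for illustration, not the proposed method.

```python
def assign_facets(sentences, facet_keywords):
    """Group sentences under the facet whose keywords they overlap most.

    Naive baseline (an assumption, not the slides' method): a sentence
    that shares no keyword with any facet is left out of the summary.
    """
    summary = {facet: [] for facet in facet_keywords}
    for sentence in sentences:
        tokens = set(sentence.lower().split())
        best_facet, best_overlap = None, 0
        for facet, keywords in facet_keywords.items():
            overlap = len(tokens & {k.lower() for k in keywords})
            if overlap > best_overlap:
                best_facet, best_overlap = facet, overlap
        if best_facet is not None:
            summary[best_facet].append(sentence)
    return summary

# Hypothetical facets described by one or two user-chosen keywords each.
facets = {"price": ["price", "cost"], "comfort": ["seats", "comfort"]}
reviews = ["The price is fair", "Seats offer great comfort", "It is red"]
grouped = assign_facets(reviews, facets)
```

Exact keyword matching misses paraphrases, so a real system would need to expand each facet beyond the user's few keywords; this sketch only shows that no labeled training sentences are required as input.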

Example (1): Consumer vs. Editor • Honda Accord 2006

Example (2): Different Aspects • What if the users want an overview with different facets?

Conclusion • The generated summaries are • directly useful to biologists, • and also serve as entry points that let them quickly navigate the relevant literature, • via the BeeSpace analysis environment available at www.beespace.uiuc.edu

Start from Here … • The reverse of automated entity summarization: automated entity retrieval • Profiling of entities using entity summaries, e.g., what genes are associated with … ? • Build a powerful knowledge base … • Enriched entities under a certain context, e.g., what are the significantly enriched genes in … ? • Entities involved in certain biomedical relations, e.g., what genes interact with gene X? BeeSpace v5!

Acknowledgement • Bruce Schatz, Gene Robinson, Chengxiang Zhai, Xin He, Jing Jiang, Qiaozhu Mei, Moushumi Sarma

Vector Space Model (VSM) • Construct a corresponding term vector $V_c$ using the training sentences for the aspect • The weight of term $t_i$ in the aspect term vector for aspect $j$: $w_{ij} = TF_{ij} \cdot IDF_i$, where $TF_{ij}$ is the term frequency and $IDF_i = 1 + \log(N/n_i)$ is the inverse document frequency ($N$ = total number of documents, $n_i$ = number of documents containing term $t_i$) • Construct a sentence term vector $V_s$ for each sentence, with the same IDF and TF = number of times a term occurs in the sentence • Aspect relevance score $S = \cos(V_c, V_s)$.
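The scoring described above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions: whitespace tokenization, each training sentence counted as one "document" for the IDF, and terms never seen in training dropped from the sentence vector. The function names and the toy sentences are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming or stop words.
    return text.lower().split()

def idf_table(documents):
    """IDF = 1 + log(N / n) for every term seen in the documents."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {term: 1.0 + math.log(n_docs / n) for term, n in doc_freq.items()}

def tfidf_vector(text, idf):
    """Sparse TF * IDF vector; terms without an IDF entry are dropped."""
    tf = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in tf.items() if term in idf}

def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def aspect_scores(sentence, aspect_examples, idf):
    """Cosine between the sentence vector and each aspect's profile vector."""
    sentence_vec = tfidf_vector(sentence, idf)
    return {
        aspect: cosine(tfidf_vector(" ".join(examples), idf), sentence_vec)
        for aspect, examples in aspect_examples.items()
    }

# Toy training sentences (made up for illustration, not taken from FlyBase).
aspect_examples = {
    "expression": ["gene is expressed in the embryo", "expressed in larval brain"],
    "interactions": ["protein interacts with kinase", "binds and interacts with receptor"],
}
all_sentences = [s for examples in aspect_examples.values() for s in examples]
idf = idf_table(all_sentences)
scores = aspect_scores("the gene is expressed in brain", aspect_examples, idf)
```

A sentence would then be assigned to, or ranked under, the aspect with the highest score; how ties and low-scoring sentences are handled is left open here.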
