Automated Gene Summary: Let the Computer Summarize the Knowledge • Xu Ling • Department of Computer Science, University of Illinois at Urbana-Champaign
The Reality of Scientific Literature • Manual curation is hard to keep up!
Automated Gene Summarization • Gene summary aspects: Gene product, Expression, Sequence, Interactions, Mutations, General Functions
Goal • To retrieve and summarize all the knowledge about a particular gene from the literature • Compressing knowledge: enables biologists to quickly understand the target gene. • Automated curation: explicitly covers multiple aspects of a gene, such as the sequence information, mutant phenotypes etc.
Our Solution • Semi-structured summary on multiple aspects: gene products, expression pattern, sequence information, phenotypic information, genetic/physical interactions, … • 2-stage summarization • Retrieve relevant articles by gene name search • Extract the most informative and relevant sentences for each aspect
System Overview: 2-stage • Gene name recognition • Sentence categorization
Gene Name Recognition • v1: Dictionary-based string match • High recall, low precision • v2: Machine learning methods for gene name recognition • High precision, low recall • v3: v2 + dictionary-based synonym expansion • Improved both recall and precision
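The v1 dictionary-based string match can be illustrated with a minimal Python sketch. The synonym table, the function names `build_patterns` and `find_genes`, and the word-boundary matching rule are all assumptions made for illustration, not the system described in these slides.

```python
import re

# Toy synonym dictionary (illustrative only): gene symbol -> known synonyms.
GENE_SYNONYMS = {
    "for": ["foraging"],
    "per": ["period"],
}


def build_patterns(synonyms):
    """Compile one case-insensitive, whole-word pattern per gene symbol."""
    patterns = {}
    for symbol, names in synonyms.items():
        # Longest alternatives first so longer names are tried before short symbols.
        alternatives = sorted({symbol, *names}, key=len, reverse=True)
        joined = "|".join(re.escape(name) for name in alternatives)
        patterns[symbol] = re.compile(r"\b(?:" + joined + r")\b", re.IGNORECASE)
    return patterns


def find_genes(sentence, patterns):
    """Return the sorted gene symbols whose symbol or synonym occurs in the sentence."""
    return sorted(s for s, p in patterns.items() if p.search(sentence))


patterns = build_patterns(GENE_SYNONYMS)
print(find_genes("We searched for the period gene.", patterns))
```

In this toy run the ordinary English word "for" is reported as the gene symbol `for` alongside the intended match on "period", which is the kind of false positive behind the "high recall, low precision" note for v1.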
Categorization of Retrieved Sentences • Collect “example sentences” from FlyBase • v1: applying a vector space model to construct an aspect “profile” • v2: applying probabilistic models to factor out context-specific language • v3: v2 + biologist-labeled training examples • Real sentences! Many thanks to Susan Brown’s “Beetle group” for their help!
Gene Summary in BeeSpace v4 • To add
General Entity Summarization • The approach is general and can be applied to summarize other entities: pathways, protein families, … • General settings: • Space: A set of documents to be summarized. • Aspects: A set of aspects to define the structure of the summary. • Examples: Training sentences for each aspect.
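The three general settings (Space, Aspects, Examples) could be captured in a small container like the one below. This is a sketch under assumptions: the class name `SummarizationTask`, its field names, and the `missing_examples` helper are hypothetical and do not come from BeeSpace.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SummarizationTask:
    """Hypothetical container for the three general settings."""

    space: list[str]        # Space: documents to be summarized
    aspects: list[str]      # Aspects: define the structure of the summary
    examples: dict[str, list[str]] = field(default_factory=dict)  # Examples per aspect

    def missing_examples(self) -> list[str]:
        """Aspects that have no training sentences yet."""
        return [a for a in self.aspects if not self.examples.get(a)]


task = SummarizationTask(
    space=["full text of document 1", "full text of document 2"],
    aspects=["Expression", "Sequence"],
    examples={"Expression": ["placeholder training sentence"]},
)
print(task.missing_examples())
```

Swapping in a different entity type (a pathway or protein family, say) would then only mean supplying a different document space, aspect list, and example set.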
Further Generalization … • Limitations of the categorization problem with training examples • Predefined aspects may not fit the needs of a particular user • Only works for a predefined domain and topics • Training examples for each aspect are often unavailable • More Realistic New Setup • Allow a user to flexibly describe each facet with 1-2 keywords: let the user determine what they want • Generate the summary in a semi-supervised way: no training examples needed
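As a rough illustration of the keyword-driven setup, the sketch below assigns each sentence to the facet whose user-supplied keywords it overlaps most. It covers only a naive seeding step by exact keyword match, not the semi-supervised method the slide refers to; the function names, facet names, and sentences are made up for the example.

```python
import re


def tokenize(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def assign_facets(sentences, facet_keywords):
    """Group sentences under the facet with the largest keyword overlap.

    Sentences that match no keyword are left out; ties go to the facet
    listed first in facet_keywords.
    """
    assignments = {facet: [] for facet in facet_keywords}
    for sentence in sentences:
        tokens = set(tokenize(sentence))
        best_facet, best_overlap = None, 0
        for facet, keywords in facet_keywords.items():
            overlap = len(tokens & {k.lower() for k in keywords})
            if overlap > best_overlap:
                best_facet, best_overlap = facet, overlap
        if best_facet is not None:
            assignments[best_facet].append(sentence)
    return assignments


# Hypothetical facets, each described by 1-2 keywords.
facets = {"price": ["price", "cost"], "safety": ["safety", "crash"]}
sentences = ["The price is fair.", "Crash test results were good.", "Nice color."]
print(assign_facets(sentences, facets))
```

A semi-supervised approach would presumably start from seeds like these and generalize to sentences that share no exact keyword, which this sketch does not attempt.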
Example (1): Consumer vs. Editor • Honda Accord 2006
Example (2): Different Aspects • What if users want an overview with different facets?
Conclusion • The generated summaries are • directly useful to biologists, • and also serve as entry points that let them quickly navigate the relevant literature, • via the BeeSpace analysis environment available at www.beespace.uiuc.edu
Start from Here … • The reverse of automated entity summarization: automated entity retrieval • Profiling of entities using entity summaries, e.g., what genes are associated with … ? • Build a powerful knowledge base … • Enriched entities under certain contexts, e.g., what are the significantly enriched genes in … ? • Entities involved in certain biomedical relations, e.g., what genes interact with gene X? • BeeSpace v5!
Acknowledgement • Bruce Schatz • Gene Robinson • Chengxiang Zhai • Xin He • Jing Jiang • Qiaozhu Mei • Moushumi Sarma
Vector Space Model (VSM) • Construct a corresponding term vector $V_c$ using the training sentences for the aspect • The weight of a term $t_i$ in the aspect term vector for aspect $j$: $w_{ij} = TF_{ij} \cdot IDF_i$, where $TF_{ij}$ is the term frequency and $IDF_i = 1 + \log(N/n_i)$ is the inverse document frequency ($N$ = total number of documents, $n_i$ = number of documents containing term $t_i$) • Construct a sentence term vector $V_s$ for each sentence • with the same IDF and TF = number of times a term occurs in the sentence • Aspect relevance score $S = \cos(V_c, V_s)$
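The scoring scheme on this slide can be sketched in plain Python as follows. The tokenizer, the function names, and the choice to ignore terms absent from the IDF table are assumptions for illustration; the toy documents are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens (an assumed, very simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_table(documents):
    """IDF_i = 1 + log(N / n_i) for every term seen in the document set."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    return {term: 1.0 + math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse TF * IDF vector; terms without an IDF entry are ignored (assumption)."""
    tf = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def aspect_score(aspect_sentences, sentence, idf):
    """S = cos(V_c, V_s), with V_c built from all training sentences of one aspect."""
    v_c = tfidf_vector(" ".join(aspect_sentences), idf)
    v_s = tfidf_vector(sentence, idf)
    return cosine(v_c, v_s)


# Toy data, invented for the example.
documents = [
    "the gene is expressed in the brain",
    "the protein binds dna",
    "expression in the wing disc",
]
idf = idf_table(documents)
expression_examples = [documents[0], documents[2]]
print(aspect_score(expression_examples, "expressed in brain", idf))
print(aspect_score(expression_examples, "binds dna", idf))
```

A sentence would then be assigned to whichever aspect gives the highest score, or left out of the summary if all scores are low; that selection rule is an assumption and is not stated on the slide.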