Beespace Component: Filtering and Normalization for Biology Literature

Beespace Component: Filtering and Normalization for Biology Literature. Qiaozhu Mei, 03.16.2005

Concept Processing Component for Beespace: A Big Picture [Diagram. Labels: Pre-processed text collection; Query terms; Retrieval; Relevant documents; Filtering module; A list of representative terms or phrases; Entities & phrases of interest; Normalization and clustering module; Similarity groups of terms and phrases (concepts)]

Concept Processing Component for Beespace: Input and Output • Input: texts (indices) with entities and phrases tagged. • Filtering: a group of relevant documents for a query • Normalization: a list of terms, entities or phrases of interest to be normalized • Output: • Filtering: list of highly representative terms & phrases • Normalization: • hierarchical structure of concepts (compacted, loose) • Concept dictionary • texts tagged with concepts

Filtering

Term Filtering: Heuristics • We want a list of representative terms & phrases short enough to enable interactive selection and navigation. • We want terms with high frequency in the given documents (high Term Frequency), however… • Terms that are too frequent in the whole collection are considered harmful: the, is, cell, bee, … (so we also want low Document Frequency)

Term Filtering: TF*IDF • Adding IDF to the frequency count: • Weight = tf * log((N - 1)/df) • TF-IDF formula in the Okapi method: • Weight = (IDF part) × (TF part) [formula figure not preserved]
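The first weighting formula above can be sketched in a few lines of Python. This is only an illustration under assumptions: documents are taken to be pre-tokenized lists of terms, tf is counted over the relevant documents, df over the whole collection, and the function name `tfidf_rank` and the `top_k` cutoff are hypothetical. The Okapi variant is not covered.

```python
import math
from collections import Counter

def tfidf_rank(relevant_docs, collection, top_k=10):
    """Rank terms of relevant_docs by tf * log((N - 1) / df)."""
    n_docs = len(collection)
    # df: number of collection documents that contain each term
    df = Counter()
    for doc in collection:
        df.update(set(doc))
    # tf: total occurrences of each term in the relevant documents
    tf = Counter()
    for doc in relevant_docs:
        tf.update(doc)
    scores = {}
    for term, freq in tf.items():
        d = df.get(term, 0)
        # Assumption: skip terms unseen in the collection and terms that
        # occur in every document (their weight would be negative).
        if d == 0 or d >= n_docs:
            continue
        scores[term] = freq * math.log((n_docs - 1) / d)
    # Highest weight first; ties broken alphabetically
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

collection = [
    ["the", "bee", "forage", "pollen"],
    ["the", "bee", "dance"],
    ["the", "queen", "bee"],
    ["the", "worker", "forage", "nectar"],
]
for term, weight in tfidf_rank([collection[0], collection[3]], collection):
    print(term, round(weight, 3))
```

In this toy collection "the" occurs in every document and is dropped, while "bee" occurs in three of four documents and gets weight 0, which matches the heuristic that collection-wide frequent terms should not be selected.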

Term Filtering (cont.) • Results 1: • Collection: honeybee.biosis 1980 • Query: “pollen-foraging” • Select top 2 documents • Results 2: • Collection: GENIA (on “human & blood cell & transcription factor”), with noun phrases of entities tagged • Query: “il-2”

Normalization

From Term to Concept: Normalization and Theme Clustering • Normalization: Tight concepts • Group similar terms/entities/phrases so that one can represent the others • Forage: forager, forage-bee, foraging, foragers, pollen-foraging… • Theme clustering: Looser concepts • Group terms/entities/phrases representing the same subtopic (semantically related) • forage, pollen, food, detect, feeding, dance, … • In a hierarchical manner.

Normalization • Morphological approach? (stemming) • Normalize morphological variations of English words, e.g. • forag: forage/foraging/forager/foragers • Concerns: • Too aggressive? one -> on; day -> dai; apis -> api; useful -> us • Handling biological entities? (some stemmers do nothing when they detect "-") • Not sufficient to normalize phrases
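A minimal sketch of the morphological idea, and of one possible way to treat hyphenated terms. This is not the Porter or Krovetz algorithm: the suffix list, the minimum stem length, and the function names `crude_stem` and `stem_term` are all assumptions made for illustration.

```python
SUFFIXES = ("ers", "er", "ing", "es", "e", "s")  # illustrative, not Porter's rules
MIN_STEM = 4  # never cut a word down to fewer than 4 characters

def crude_stem(word):
    """Strip the first matching suffix unless the stem would be too short."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word

def stem_term(term):
    """Stem only the last component of a hyphenated term; keep the modifiers."""
    *modifiers, head = term.lower().split("-")
    return "-".join(modifiers + [crude_stem(head)])

for w in ["forage", "foraging", "forager", "foragers"]:
    print(w, "->", crude_stem(w))      # all four map to "forag"
print(stem_term("pollen-foraging"))    # pollen-forag
print(stem_term("il-2"))               # il-2 (left unchanged)
```

The minimum stem length is a blunt guard against cases like one -> on; short words such as "one", "day" and "apis" are simply left alone here.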

Normalization: Stemmers • Porter Stemmer: • does not stem words beginning with an uppercase letter • Krovetz' Stemmer: • Less aggressive than Porter • Sample results: • Honeybee: • Genia:

Normalization (cont.) • Semantic and Contextual Approach: • Group the terms which are considered “Replaceable” with each other in a context. E.g. • …the pollen-foraging activity of a mellifera… • …the nectar-foraging activity of a cerana… • Generally handled with clustering approaches based on statistical information in a large corpus • Usually in the form of hierarchical clusters
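The "replaceable in a context" idea can be illustrated with a small sketch that groups tokens sharing the same left and right neighbours. The one-token window, whitespace tokenization, and the function name `context_groups` are assumptions; a real system would rely on statistics from a large corpus rather than exact context matches.

```python
from collections import defaultdict

def context_groups(sentences, window=1):
    """Group tokens that occur between identical left and right neighbours."""
    by_context = defaultdict(set)
    for tokens in sentences:
        # Tokens at the edges lack a full context and are skipped
        for i in range(window, len(tokens) - window):
            left = tuple(tokens[i - window:i])
            right = tuple(tokens[i + 1:i + 1 + window])
            by_context[(left, right)].add(tokens[i])
    # Keep only contexts shared by at least two distinct tokens
    return {ctx: terms for ctx, terms in by_context.items() if len(terms) > 1}

sentences = [
    "the pollen-foraging activity of a mellifera".split(),
    "the nectar-foraging activity of a cerana".split(),
]
for (left, right), terms in context_groups(sentences).items():
    print(left, "_", right, "->", sorted(terms))
```

On these two fragments the only shared context is ("the", _, "activity"), which groups pollen-foraging with nectar-foraging.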

Normalization: A clustering approach • An N-gram clustering method: • Ideally, if we consider each term in its N-gram context, the replaceable relation would be global and reliable. • Concerns: efficiency • Computational complexity is high! • For 2-grams, NV^2 even after optimization! (initially V^5) • Space complexity is high!! • V^3 • Compromise: use 2-grams (equivalent to computing the average mutual information of 2-grams and merging the two terms whose merge brings the smallest loss to this avg. MI)
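The 2-gram compromise can be sketched as a greedy merge that, at each step, joins the two classes whose merge keeps the average mutual information (AMI) of adjacent classes highest, i.e. loses the least. This is a deliberately naive illustration that recomputes the AMI for every candidate pair, with none of the optimizations mentioned above; the function names and the toy sentence are assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def bigram_counts(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))

def average_mi(bigrams, cls):
    """AMI between the classes of adjacent tokens.
    bigrams: Counter (w1, w2) -> count; cls: dict word -> class id."""
    total = sum(bigrams.values())
    joint, left, right = Counter(), Counter(), Counter()
    for (w1, w2), n in bigrams.items():
        c1, c2 = cls[w1], cls[w2]
        joint[(c1, c2)] += n
        left[c1] += n
        right[c2] += n
    # sum of p(c1,c2) * log(p(c1,c2) / (p(c1) * p(c2)))
    return sum(
        (n / total) * math.log(n * total / (left[c1] * right[c2]))
        for (c1, c2), n in joint.items()
    )

def greedy_merge(tokens, n_classes):
    """Merge classes pairwise until n_classes remain; return classes and merge order."""
    bigrams = bigram_counts(tokens)
    cls = {w: w for w in set(tokens)}  # every word starts in its own class
    merges = []
    while len(set(cls.values())) > n_classes:
        best = None
        for a, b in combinations(sorted(set(cls.values())), 2):
            trial = {w: (a if c == b else c) for w, c in cls.items()}
            score = average_mi(bigrams, trial)
            if best is None or score > best[0]:  # highest AMI = smallest loss
                best = (score, a, b, trial)
        _, a, b, cls = best
        merges.append((a, b))
    return cls, merges

tokens = "the pollen forager and the nectar forager feed the queen and the worker".split()
classes, merges = greedy_merge(tokens, 3)
print(merges)
```

The recorded merge order is what yields a hierarchical structure: reading the merges from first to last goes from tight groups to looser ones.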

Normalization: A clustering approach (cont.) • Toy Example on honeybee: • Vocabulary size: 9100 words; • Collection size: 5505 abstracts; (honeybee.biosis1980) • Terms to be Clustered: 18 • Genia collection, 2000 abstracts • 200 noun phrases (entities) to be clustered

Terms in the honeybee toy example (18): nursing, nurseries, nursery, nectar-foraging, pollen-foraging, foraging-related, preforaging, non-foraging, forager, forage, foraging, foragers, queen, worker, queens, workers, bee, honeybee

Sample clusters on Genia: human_and_mouse_gene mouse_il-2r_alpha_gene i_kappa_b_alpha nf_kappa_b transcription_factors transcription_factor saos_2_cells saos-2 human_osteosarcoma_ b_cells jurkat_t_cells hela_cells thp-1 hl60_cells k562_cells thp-1_cells epstein-barr_virus_ interleukin-2 interleukin-2_ epstein-barr_virus phorbol_myristate_acetate phorbol_12-myristate_13-acetate 2_gene_expression 2_gene u937_cells monocytic_cells jurkat_cells human_t_cells ipr_cd4-8-_t_cells j_delta_k_cells lymphoid_cells activated_t_cells hematopoietic_cells

Normalization: Clustering Methods • Other Possible Clustering Approaches • Cluster terms based on features such as: • Co-occurring terms • Tends to ignore position information • Correlation of Nouns and Verbs • Dependency-based Word Similarity • Proximity-based Word Similarity • Depends on highly accurate parsing results, which may not be easy to obtain for biology literature.
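As an illustration of the co-occurrence feature, the sketch below represents each term by the terms it shares a document with and compares terms by cosine similarity. Document-level co-occurrence and cosine are assumptions chosen for simplicity; the function names are hypothetical. Note how position inside the document plays no role.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(docs):
    """Map each term to a Counter of the terms it shares a document with."""
    vectors = defaultdict(Counter)
    for doc in docs:
        terms = set(doc)
        for t in terms:
            for u in terms:
                if u != t:
                    vectors[t][u] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = [
    ["forage", "pollen", "bee"],
    ["forage", "nectar", "bee"],
    ["queen", "worker", "hive"],
]
vec = cooccurrence_vectors(docs)
print(round(cosine(vec["pollen"], vec["nectar"]), 3))  # 1.0
print(cosine(vec["pollen"], vec["queen"]))             # 0.0
```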

Theme Clustering • Looser Clusters • Usually in the form of partitioning clusters • K-Means, Latent Semantic Indexing, Probabilistic LSI • Compute loose clusters of terms, or clusters represented by term distributions • Example: number of clusters = 10 • Sometimes helpful to find normalizations (e.g., when the number of clusters is large; when no stemming was done) • Comparative Text Mining for concept switching
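Of the partitioning methods listed, K-Means is the simplest to sketch. The version below assumes each term is already represented as a numeric vector (for example, counts per document), seeds the centroids with the first k vectors, and runs a fixed number of iterations; all of these choices, and the toy vectors, are assumptions for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20):
    """Partition points into k clusters; returns (assignment, centroids)."""
    centroids = [list(p) for p in points[:k]]  # assumption: first k points as seeds
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical term vectors (e.g. counts in two groups of documents)
terms = ["forage", "queen", "pollen", "worker", "nectar", "hive"]
vectors = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
assignment, _ = kmeans(vectors, k=2)
for term, cluster in zip(terms, assignment):
    print(term, cluster)
```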

Future Plan: • Customize the stemmers • Try more morphological approaches. • e.g. pollen-foraging, nectar-foraging • Examine more clustering methods: • How to use theme clustering to help normalization • Find a way to divide the hierarchical clustering structure into concepts

Thanks!
