An Efficient Concept-Based Mining Model for Enhancing Text Clustering
Shady Shehata, Fakhri Karray, and Mohamed S. Kamel. TKDE, 2010.
Presented by Wen-Chung Liao, 2010/11/03.
Outline
• Motivation
• Objectives
• THEMATIC ROLES BACKGROUND
• CONCEPT-BASED MINING MODEL
• Experiments
• Conclusions
• Comments
Motivation
• Vector Space Model (VSM)
  • Represents each document as a feature vector of the terms (words or phrases) in the document.
  • Each feature vector contains term weights (usually term frequencies) of the terms in the document.
  • Term frequency captures only the importance of a term within a document.
• However, two terms can have the same frequency in their documents while one contributes more to the meaning of its sentences than the other.
• Thus, the underlying text mining model should indicate the terms that capture the semantics of the text.
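To make the VSM limitation concrete, here is a minimal sketch of a raw term-frequency representation. The two toy documents and the function name `term_frequency_vector` are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def term_frequency_vector(document, vocabulary):
    """Return raw term counts for `document`, ordered by `vocabulary`."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]


# Two toy documents (hypothetical, for illustration only).
doc_a = "the researchers created nanotubes sheets"
doc_b = "the the the sheets"

# Shared vocabulary, sorted so that vector positions are stable.
vocabulary = sorted(set(doc_a.split()) | set(doc_b.split()))

vec_a = term_frequency_vector(doc_a, vocabulary)
vec_b = term_frequency_vector(doc_b, vocabulary)
```

In `vec_b` the word "the" gets the largest weight even though it carries no meaning, which is the kind of frequency-only weighting the slide argues against.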
Objectives
• A new concept-based mining model is introduced.
  • It captures the semantic structure of each term within a sentence and document, rather than only the frequency of the term within a document.
  • It effectively discriminates between nonimportant terms and terms that hold the concepts representing the sentence meaning.
  • Three measures for analyzing concepts are computed, at the sentence, document, and corpus levels.
• A new concept-based similarity measure is proposed.
  • It is based on a combination of sentence-based, document-based, and corpus-based concept analysis.
  • It has a more significant effect on clustering quality because the similarity is insensitive to noisy terms.
THEMATIC ROLES BACKGROUND
• Verb argument structure (e.g., “John hits the ball”):
  • “hits” is the verb.
  • “John” and “the ball” are the arguments of the verb “hits.”
• Label: assigned to an argument.
  • E.g., “John” has the subject (or agent) label; “the ball” has the object (or theme) label.
• Term: either an argument or a verb.
  • A term is either a word or a phrase.
• Concept: a labeled term.
• Generally, the semantic structure of a sentence can be characterized by a form of verb argument structure.
CONCEPT-BASED MINING MODEL
• Sentence-Based Concept Analysis
  • Calculating ctf of concept c in sentence s
    • The conceptual term frequency, ctf, is the number of occurrences of concept c in the verb argument structures of sentence s.
    • A concept that appears frequently in different verb argument structures of s has the principal role of contributing to the meaning of s.
    • ctf is a local measure on the sentence level.
  • Calculating ctf of concept c in document d
    • Measures the overall importance of concept c to the meaning of its sentences in document d.
CONCEPT-BASED MINING MODEL
• Document-Based Concept Analysis
  • The concept-based term frequency, tf
    • The number of occurrences of a concept (word or phrase) c in the original document.
    • A local measure on the document level.
• Corpus-Based Concept Analysis
  • The concept-based document frequency, df
    • The number of documents containing concept c.
    • Used to reward concepts that appear in only a small number of documents.
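The tf and df measures can be sketched as simple counts. This is a minimal illustration that assumes whitespace tokenization and exact word-sequence matching; the function names and the toy corpus are hypothetical and not from the paper.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a simplifying assumption)."""
    return text.lower().split()


def concept_tf(concept, document):
    """Count occurrences of a concept (word or phrase) in one document."""
    words, target = tokenize(document), tokenize(concept)
    n = len(target)
    if n == 0:
        return 0
    # Slide a window of the concept's length over the document tokens.
    return sum(words[i:i + n] == target for i in range(len(words) - n + 1))


def concept_df(concept, corpus):
    """Count the documents that contain the concept at least once."""
    return sum(concept_tf(concept, doc) > 0 for doc in corpus)


# Hypothetical three-document corpus.
corpus = [
    "artificial muscles need nanotubes and more nanotubes",
    "researchers study artificial muscles",
    "markets fell sharply",
]

tf_nanotubes = concept_tf("nanotubes", corpus[0])
df_nanotubes = concept_df("nanotubes", corpus)
df_muscles = concept_df("artificial muscles", corpus)
```

Here "nanotubes" has a high tf in the first document and a low df across the corpus, which is the pattern the df measure is meant to reward.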
Example of Calculating ctf Measure
Sentence: Texas and Australia researchers have created industry-ready sheets of materials made from nanotubes that could lead to the development of artificial muscles.
• Three verbs (created, made, lead; colored red on the original slide) represent the semantic structure of the meaning of the sentence.
• Each has its own arguments:
  • [ARG0 Texas and Australia researchers] have [TARGET created] [ARG1 industry-ready sheets of materials made from nanotubes that could lead to the development of artificial muscles].
  • Texas and Australia researchers have created industry-ready sheets of [ARG1 materials] [TARGET made] [ARG2 from nanotubes that could lead to the development of artificial muscles].
  • Texas and Australia researchers have created industry-ready sheets of materials made from [ARG1 nanotubes] [R-ARG1 that] [ARGM-MOD could] [TARGET lead] [ARG2 to the development of artificial muscles].
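The three labeled structures above can be used to sketch a ctf computation. This sketch assumes that ctf counts the number of verb argument structures in which a concept's word sequence appears inside any labeled term; that reading, the helper names, and the list-of-tuples representation are assumptions, and no stop-word removal or stemming is applied here.

```python
def tokens(text):
    """Lowercase, drop periods, and split on whitespace."""
    return text.lower().replace(".", "").split()


def contains(haystack, needle):
    """True if the token list `needle` occurs contiguously in `haystack`."""
    n = len(needle)
    return n > 0 and any(
        haystack[i:i + n] == needle for i in range(len(haystack) - n + 1)
    )


def ctf(concept, structures):
    """Count the verb argument structures in which the concept occurs."""
    needle = tokens(concept)
    return sum(
        any(contains(tokens(term), needle) for _label, term in structure)
        for structure in structures
    )


TAIL = "nanotubes that could lead to the development of artificial muscles"

# The three verb argument structures from the example sentence.
structures = [
    [("ARG0", "Texas and Australia researchers"),
     ("TARGET", "created"),
     ("ARG1", "industry-ready sheets of materials made from " + TAIL)],
    [("ARG1", "materials"),
     ("TARGET", "made"),
     ("ARG2", "from " + TAIL)],
    [("ARG1", "nanotubes"),
     ("R-ARG1", "that"),
     ("ARGM-MOD", "could"),
     ("TARGET", "lead"),
     ("ARG2", "to the development of artificial muscles")],
]

ctf_values = {c: ctf(c, structures)
              for c in ("created", "materials", "nanotubes")}
```

Under this counting rule, a concept such as "nanotubes" that takes part in all three structures receives a higher ctf than "created", which appears in only one.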
A cleaning step
• Remove stop words.
• Stem the words.
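A minimal sketch of such a cleaning step follows. The stop-word list is a small hypothetical sample, and `naive_stem` is a deliberately crude suffix stripper standing in for a real stemmer; the slides do not say which stop list or stemming algorithm the authors used.

```python
# Hypothetical stop-word sample; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "from",
              "that", "could", "have", "has"}

# Suffixes tried in order; a stand-in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def clean(text):
    """Remove stop words, then stem the remaining words."""
    words = text.lower().split()
    return [naive_stem(w) for w in words if w not in STOP_WORDS]


cleaned = clean("The researchers created sheets")
```

Stop words are removed before stemming so that the stop list can be matched against unmodified word forms.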
A Concept-Based Similarity Measure
• The concept-based similarity between two documents, d1 and d2, is calculated over the m matching concepts of d1 and d2.
• The single-term similarity measure is the cosine similarity, using the TF-IDF weighting scheme.
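For reference, the single-term baseline can be sketched as a cosine similarity over TF-IDF vectors. This sketch assumes the common weighting tf × log(N / df) and whitespace tokenization; the exact TF-IDF variant used in the paper is not given on the slide, and the toy documents are hypothetical. The concept-based formula itself is not reproduced here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse {term: weight} vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for words in tokenized for term in set(words))
    return [
        {term: count * math.log(n_docs / df[term])
         for term, count in Counter(words).items()}
        for words in tokenized
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "nanotubes artificial muscles",
    "artificial muscles research",
    "stock markets fell",
]
v0, v1, v2 = tfidf_vectors(docs)
sim_related = cosine(v0, v1)
sim_unrelated = cosine(v0, v2)
```

The first two documents share terms and get a positive similarity, while the third shares none and scores zero.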
Mathematical Framework
• Assume that the content of document d2 is changed by Δ.
• Sensitivity analysis:
  • Assume that each concept consists of one word.
  • In this case, each concept is a word and A = 1. (?)
  • By approximation, the d1c value is bigger than d1w, and the Δd2c value is bigger than the Δd2w value.
  • Hence, the sensitivity of the concept-based similarity is higher than that of the cosine similarity.
• This means that the concept-based model analyzes the similarity between two documents more deeply than the traditional approaches.
Concept-Based Analysis Algorithm

      d1   d2   d3   d4
d1
d2    L
d3    L    L
d4    L    L    L
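One way to read the lower-triangular layout is that each new document is compared with every previously processed document, and each L holds the concepts matched for that pair. The slide does not state this, so the sketch below is an assumption; the function name and the concept sets are hypothetical.

```python
def matching_concept_lists(doc_concepts):
    """Map (later_doc, earlier_doc) to the sorted concepts the pair shares."""
    names = list(doc_concepts)
    lists = {}
    for i, later in enumerate(names):
        # Compare only with documents processed earlier (lower triangle).
        for earlier in names[:i]:
            shared = doc_concepts[later] & doc_concepts[earlier]
            lists[(later, earlier)] = sorted(shared)
    return lists


# Hypothetical concept sets for four documents.
doc_concepts = {
    "d1": {"nanotubes", "artificial muscles"},
    "d2": {"nanotubes", "researchers"},
    "d3": {"artificial muscles", "researchers"},
    "d4": {"markets"},
}

pair_lists = matching_concept_lists(doc_concepts)
```

Four documents yield six pairwise lists, matching the six L entries in the table.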
EXPERIMENTAL RESULTS
Evaluation methods
• Four data sets
  • 23,115 ACM abstract articles collected from the ACM digital library
    • five main categories
  • 12,902 documents from the Reuters-21578 data set
    • five category sets
  • 361 samples from the Brown corpus
    • main categories: press: reportage; press: reviews; religion; skills and hobbies; popular lore; belles-lettres; learned; fiction: science; fiction: romance; and humor
  • 20,000 messages collected from 20 Usenet newsgroups
• Three standard document clustering techniques:
  • Hierarchical Agglomerative Clustering (HAC)
  • Single-Pass Clustering
  • k-Nearest Neighbor (k-NN)
Conclusions
• Bridges the gap between the natural language processing and text mining disciplines. (?)
• By exploiting the semantic structure of the sentences in documents, a better text clustering result is achieved.
• There are a number of possibilities for extending this work:
  • Link this work to Web document clustering.
  • Apply the same model to text classification.
Comments
• Advantages
  • A better similarity measure that considers the semantic structure of sentences in documents.
• Shortcomings
  • Ambiguous algorithm
• Applications
  • Text clustering
  • Text classification