Efficient Concept Clustering for Ontology Learning using an Event Life Cycle on the Web By Sangsoo Sung, Seokkyung Chung, Dennis McLeod Presented by Amir Tahmasebi
Overview • Motivation • Concept clustering • Creating rough clusters • Algorithm for creating rough clusters • Similarity computation with rough clusters • Complexity analysis • Experiment • Conclusion
Motivation • Why do we need ontology learning? • Handcrafted ontologies vs. automatically or semi-automatically created ones • Pros and cons • Manual extraction of semantic meaning cannot scale with the growth of the Web.
Clustering • Definition? • Given a set of terms, distribute them into clusters; after clustering, semantic relations among the terms within each cluster can be determined. • Problems? • Computationally very expensive • Complexity? • O(N²), due to pairwise similarity computations over all term pairs
Clustering • Solution? • Break up the term space into multiple subsets using a cheap division algorithm. • Create rough clusters • This significantly reduces the number of pairwise comparisons.
Creating Rough Clusters • How can we break up the term space? • Using the Event Life Cycle phenomenon • Certain events generate postings to the Web. The volume of these postings starts small, grows, and gradually diminishes. • But how can we use this phenomenon for clustering purposes? • Terms that have the same posting peaks are more likely to be related.
Gallistel change point finding algorithm • How can a putative change point (pcp) be verified? • By performing an unequal-variance (Welch) t-test on the posting counts before and after the candidate point: t = (μ₂ − μ₁) / √(s₁²/n₁ + s₂²/n₂) • Where μᵢ, sᵢ², and nᵢ are the mean, variance, and size of the samples on each side of the candidate point. • A change point is identified when t is significantly far from zero, rejecting the null hypothesis that the posting rate is unchanged.
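The t-test step above can be sketched in a few lines. The daily posting counts and the candidate change-point day below are made-up illustrations, not data from the paper:

```python
import math

def welch_t(before, after):
    """Welch's unequal-variance t statistic for two samples."""
    n1, n2 = len(before), len(after)
    m1 = sum(before) / n1
    m2 = sum(after) / n2
    v1 = sum((x - m1) ** 2 for x in before) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in after) / (n2 - 1)
    return (m2 - m1) / math.sqrt(v1 / n1 + v2 / n2)

# Daily posting counts for a term: flat, then a surge after day 5.
counts = [3, 4, 2, 3, 4, 20, 25, 22, 24, 21]
t = welch_t(counts[:5], counts[5:])
# |t| far from zero rejects "no change in posting rate" at day 5.
```

A flat series would give a t near zero, so only genuine surges in posting volume are marked as change points.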
Gallistel change point finding algorithm • A set of terms ωt whose elements share the same change point t is defined as ωt = { x | the change point of term x is t } • The collection of all such sets is defined as Ω = { ωt0, ωt1, …, ωtn } • Where Ω covers the entire time span.
Gallistel change point finding algo. • Then Ω is clustered into overlapping sub-sets with respect to α and β which are distance thresholds. β α t0 β α t0
Cluster refinement using expensive similarity metric • String similarity vs. context-based similarity (pros & cons) • This research focuses on context-based similarity
Cluster refinement using expensive similarity metric • Uses the tf-idf vector weighting scheme • Λp: set of all documents within cluster Ωp • Λp is incorporated to generate a tf-idf vector for each candidate term x in Ωp. • The vector includes the terms that co-occurred with term x, and the weight of a co-occurring term in document di follows the standard tf-idf form: wx,di = tf × log(|Λp| / df), i.e., the term's frequency in di times its inverse document frequency over Λp.
Cluster refinement using expensive similarity metric • All elements of vi are eliminated except the m terms with the highest wx,di. • Let ϒ(x) be the centroid vector of all vi's • The cosine metric is used to determine the similarity of two terms x and y: sim(x, y) = ϒ(x) · ϒ(y) / (‖ϒ(x)‖ ‖ϒ(y)‖) • Where ϒ(x) · ϒ(y) is the inner product of the centroid vectors.
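A sketch of this refinement step, using standard tf-idf weights (with the common +1 smoothing) and cosine similarity over sparse context vectors. The toy documents and the exact weighting details are assumptions, and the per-document vectors are folded directly into one aggregate vector per term rather than averaged into an explicit centroid:

```python
import math
from collections import Counter

def context_vectors(docs, candidates, m=5):
    """tf-idf context vector for each candidate term x: components are
    the terms co-occurring with x, keeping only the m heaviest."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for d in docs:
        df.update(set(d))
    vectors = {}
    for x in candidates:
        weights = Counter()
        for d in docs:
            if x not in d:
                continue
            for term, tf in Counter(d).items():
                if term != x:
                    # tf times smoothed inverse document frequency
                    weights[term] += tf * (math.log((1 + n) / (1 + df[term])) + 1)
        vectors[x] = dict(weights.most_common(m))
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["quake", "damage", "city"],
        ["quake", "damage", "rescue"],
        ["storm", "damage", "coast"]]
vecs = context_vectors(docs, ["quake", "storm"])
sim = cosine(vecs["quake"], vecs["storm"])  # > 0: both co-occur with "damage"
```

Within a rough cluster, only these cosine scores decide which terms are finally grouped together.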
Complexity Analysis • Comparing the method with rough clustering (Oa) vs. the method without rough clustering (Ob): • As a result: • C(Oa) = O(N + L²), where L is the number of terms in a rough cluster: a linear change-point scan over all N terms, plus pairwise comparisons only within each rough cluster • C(Ob) = O(N²) for pairwise comparisons over all terms • Since L ≪ N, the rough-clustering method is substantially cheaper.
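The gap is easy to see numerically. The term count and the even split into clusters below are hypothetical, chosen only to illustrate the O(N + L²) vs. O(N²) difference:

```python
def pairwise(n):
    """Number of pairwise similarity computations for n terms."""
    return n * (n - 1) // 2

# Without rough clustering: one big pool of N terms.
N = 10_000
flat = pairwise(N)                       # 49,995,000 comparisons

# With rough clustering: N change-point detections (linear) plus
# pairwise comparisons only inside each rough cluster.
# Hypothetical split: 100 clusters of 100 terms each.
clusters = [100] * 100
rough = N + sum(pairwise(L) for L in clusters)   # 505,000 operations
```

Here rough clustering cuts the work by a factor of about 99, and the advantage grows with N.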
Conclusion • For corpora containing billions of terms, computing all pairwise term similarities is prohibitively expensive. This paper presents a new method based on the Event Life Cycle phenomenon that divides the term space into rough clusters before pairwise similarity computation.