Active Learning of Instance-level Constraints for Semi-supervised Document Clustering By Nishitha Guntakandla
Keywords: • Semi-supervised Clustering • Document Clustering • Instance-level Constraints • Active Learning
Introduction: • This paper provides a framework that actively selects informative document pairs for semi-supervised document clustering. • Semi-supervised clustering uses a small amount of labeled data to aid and bias the clustering process. • Additional information, such as instance-level constraints, is often available. • Most semi-supervised document clustering approaches make use of this additional information in a passive manner. • The clustering approach presented in this paper plays an active role: the user's feedback on actively selected queries carries more information to help the clustering task.
Instance-level constraints: • There are two types of instance-level constraints: • Must-link: the two instances must be placed in the same cluster. • Cannot-link: the two instances must be placed in different clusters.
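As a toy illustration, constraints can be written as pairs of document indices (a hypothetical sketch; the variable names are illustrative, not from the paper):

# Hypothetical toy constraint sets over a document collection, expressed
# as pairs of document indices. Names are illustrative only.
must_link = {(0, 1), (1, 2)}   # docs 0, 1, 2 must end up in the same cluster
cannot_link = {(0, 5)}         # docs 0 and 5 must end up in different clusters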
Semi-supervised Document Clustering with Instance-level Constraints • The semi-supervised document clustering algorithm used in this paper builds upon DBSCAN. • DBSCAN is a density-based clustering algorithm that effectively partitions the data set based on density. • The algorithm requires two parameters, the radius Eps (ε) and MinPts. • The neighborhood within a radius ε of a given object is called the "ε-neighborhood".
Common terms in DBSCAN • An object with at least MinPts objects within its ε-neighborhood is called a "core object". • Otherwise, the object is called a "border object". • An object is "directly density-reachable" from a core object if it lies within that core object's ε-neighborhood. • An object p is "density-reachable" from q if there is a chain of objects from q to p, each directly density-reachable from the previous one. • Two objects are "density-connected" if both are density-reachable from some common object. • Each object that is not included in any cluster is "noise".
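As a concrete illustration, here is a minimal sketch of the ε-neighborhood and core-object test, assuming documents are already vectorized into a NumPy matrix X with Euclidean distances (the helper names are assumptions, not from the paper):

import numpy as np

def eps_neighborhood(X, i, eps):
    # Indices of all points within radius eps of point i (its ε-neighborhood).
    return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)

def is_core_object(X, i, eps, min_pts):
    # A point is a core object if its ε-neighborhood holds at least MinPts objects.
    return len(eps_neighborhood(X, i, eps)) >= min_pts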
Cons-DBSCAN • The semi-supervised document clustering algorithm is Constrained DBSCAN (Cons-DBSCAN), which incorporates instance-level constraints to guide the clustering process in DBSCAN. • Inputs: • Document set D • A set of must-link constraints ML • A set of cannot-link constraints CL • The radius Eps (ε) • MinPts
Cont.. • Outputs: • Several clusters • A set of noise objects • First, the constraints are preprocessed. Must-link constraints represent an equivalence relation (they are reflexive, symmetric, and transitive). • Therefore, a collection of transitive closures can be computed from ML, denoted TCS = {c1, c2, …, cs}.
Cont.. • Each pair of instances in the same transitive closure must end up in the same cluster in the clustering result. • For every pair of transitive closures ci and cj that have at least one cannot-link between them, cannot-link constraints are added between every pair of points in ci and cj, and CL is augmented with these entailed constraints (see the sketch below). • CL and TCS are then used to guide the process of expanding clusters in DBSCAN.
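One way to implement this preprocessing step is with a union-find structure; the following is a sketch under that assumption, not the paper's exact code:

# Sketch of constraint preprocessing: must-link pairs are merged into
# transitive closures with union-find, and cannot-links between closures
# are expanded to all cross pairs. Helper names are illustrative.
from collections import defaultdict

def preprocess_constraints(ml, cl, points):
    parent = {p: p for p in points}

    def find(p):                      # path-compressing find
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in ml:                   # union every must-link pair
        parent[find(a)] = find(b)

    closures = defaultdict(set)       # root -> transitive closure
    for p in points:
        closures[find(p)].add(p)
    tcs = [c for c in closures.values() if len(c) > 1]

    augmented_cl = set()              # entail cannot-links across closures
    for a, b in cl:
        ca, cb = closures[find(a)], closures[find(b)]
        augmented_cl.update((p, q) for p in ca for q in cb)
    return tcs, augmented_cl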
Algorithm 1: Cons-DBSCAN
• Input: Set of documents D, set of must-link constraints ML, set of cannot-link constraints CL, the radius Eps, the minimum number MinPts
• Output: Several clusters and a set of noises
• 1) Initialize all objects in D as UNCLASSIFIED;
• 2) Preprocess the constraints: compute the collection of transitive closures TCS and update the set of cannot-link constraints CL;
• 3) ClusterId := 0;
• 4) For each object Point in D do
•   If Point's label is UNCLASSIFIED then
•     If Cons-ExpandCluster(D, Point, Eps, MinPts, CL, TCS, ClusterId) then
•       ClusterId := ClusterId + 1;
•     End If
•   End If
• 5) End For
• 6) End
Algorithm 2: Cons-ExpandCluster
• Input: Set of documents D, starting object to expand a cluster Point, the radius Eps, the minimum number MinPts, set of cannot-link constraints CL, collection of transitive closures TCS, current cluster No. ClusterId
• Output: A Boolean status
• 1) Initialize seeds as an empty queue;
• 2) Compute Point's Eps-neighborhood, Neighborhood;
• 3) If the number of objects in Neighborhood < MinPts then
•   Label Point as NOISE temporarily;
•   Return false;
• 4) End If
• 5) If Point belongs to a transitive closure ci in TCS then
•   For each object o in ci do
•     Label o as ClusterId;
•     Add o into seeds;
•   End For
• 6) Else
•   Label Point as ClusterId;
•   Add Point into seeds;
• 7) End If
• 8) While seeds is not empty do
•   a) Get the first object seed in seeds;
•   b) If seed belongs to a transitive closure cj in TCS then
•     For each object o in cj do
•       If the label of o is NOISE or UNCLASSIFIED then
•         Label o as ClusterId;
•         Add o into seeds;
•       End If
•     End For
•   c) End If
•   d) Compute seed's Eps-neighborhood, Neighborhood;
•   e) If the number of objects in Neighborhood ≥ MinPts then
•     For each object p in Neighborhood do
•       If adding p into seeds does not violate cannot-link constraints and p's label is NOISE or UNCLASSIFIED then
•         Label p as ClusterId;
•         Add p into seeds;
•       End If
•     End For
•   f) End If
•   g) Delete seed from seeds;
• 9) End While
• 10) Return true;
• 11) End
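A compact, runnable Python rendering of Algorithms 1 and 2 might look like the following. This is a sketch, not the authors' code: X is assumed to be a NumPy matrix of document vectors with Euclidean distances, closure_of maps a point to its must-link transitive closure, and cl maps a point to the set of points it holds a cannot-link with.

from collections import deque
import numpy as np

UNCLASSIFIED, NOISE = -2, -1

def region_query(X, i, eps):
    return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)

def violates_cannot_link(p, cid, labels, cl):
    return any(labels[q] == cid for q in cl.get(p, ()))

def cons_expand_cluster(X, point, labels, closure_of, cl, eps, min_pts, cid):
    if len(region_query(X, point, eps)) < min_pts:
        labels[point] = NOISE                      # temporary; may be re-labeled
        return False
    seeds = deque()
    for o in closure_of.get(point, [point]):       # seed the whole closure at once
        labels[o] = cid
        seeds.append(o)
    while seeds:
        seed = seeds[0]
        for o in closure_of.get(seed, ()):         # pull in must-linked objects
            if labels[o] in (NOISE, UNCLASSIFIED):
                labels[o] = cid
                seeds.append(o)
        neighborhood = region_query(X, seed, eps)
        if len(neighborhood) >= min_pts:           # seed is a core object
            for p in neighborhood:
                if labels[p] in (NOISE, UNCLASSIFIED) and \
                        not violates_cannot_link(p, cid, labels, cl):
                    labels[p] = cid
                    seeds.append(p)
        seeds.popleft()
    return True

def cons_dbscan(X, closure_of, cl, eps, min_pts):
    labels = np.full(len(X), UNCLASSIFIED)
    cid = 0
    for point in range(len(X)):
        if labels[point] == UNCLASSIFIED and \
                cons_expand_cluster(X, point, labels, closure_of, cl, eps, min_pts, cid):
            cid += 1
    return labels                                  # NOISE entries form the noise set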
Active Learning Algorithm: • The goal of active learning is to select the instance-level constraints that are most informative about the underlying clustering of the document set. • An oracle labels a given pair (xi, xj) as must-link or cannot-link. • Constraint sets vary significantly in how useful they are for constrained clustering. • Some constraint sets can actually decrease algorithm performance; this is a result of the interaction between a given set of constraints and the algorithm being used.
Cont.. • Two quantitative measures, "informativeness" and "coherence", can be used to identify useful constraint sets in partitional clustering algorithms (see the sketch after this list). • Informativeness refers to the amount of information in the constraint set that the algorithm cannot determine on its own. It is determined by the clustering algorithm's objective function (bias) and search preference. • Coherence measures the amount of agreement within the constraints themselves, with respect to a given distance metric. • A constraint set with large informativeness and coherence should satisfy the following two properties: • 1) at least one point in each underlying cluster is involved; • 2) there exist constraints that control the boundary of each cluster. • Based on these two observations, the active learning scheme is proposed.
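One concrete reading of informativeness, following the fraction-of-violated-constraints formulation, is sketched below (an assumption for illustration, not the paper's code): run the algorithm without constraints and count how many of the given constraints its output violates.

# Sketch: the fraction of constraints that a clustering produced WITHOUT
# constraints fails to satisfy, i.e., information the algorithm cannot
# infer on its own.
def informativeness(labels, must_link, cannot_link):
    violated = sum(labels[i] != labels[j] for i, j in must_link)
    violated += sum(labels[i] == labels[j] for i, j in cannot_link)
    return violated / max(1, len(must_link) + len(cannot_link))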
The algorithm needs two parameters, Eps and MinPts, which have the same meaning as Eps and MinPts in DBSCAN. • First, all the core objects and border objects with respect to Eps and MinPts are identified. • The core objects set (CS) and border objects set (BS) are formed. • Then, constraints are selected in an iterative manner. • In the first iteration, one core object x is randomly selected from CS and added into the selected core object set SCS, which is initialized as empty. • Two points b1 and b2 are selected from BS, corresponding to the nearest and farthest border objects from x, respectively. • The oracle is then asked to answer the two instance-level queries (x, b1) and (x, b2).
In each subsequent iteration, one core object x is selected from CS, namely the point farthest from SCS. • Queries are posed by pairing x with each point in SCS. • Then the same procedure as in the first iteration is performed to get the two instance-level constraints (x, b1) and (x, b2). • The iterations continue until the allowed queries are exhausted, as sketched below.
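Under this description, the selection loop could be sketched as follows. This is an illustration assuming Euclidean distance, a callable oracle(i, j) that returns "must-link" or "cannot-link", and precomputed CS/BS lists; none of these names come from the paper's code.

import numpy as np

def select_constraints(X, CS, BS, oracle, n_iters, seed=0):
    rng = np.random.default_rng(seed)
    ml, cl = [], []
    scs = [int(rng.choice(CS))]                 # first core object: random
    for _ in range(n_iters):                    # n_iters caps the selection rounds
        x = scs[-1]
        for y in scs[:-1]:                      # pair x with explored core objects
            (ml if oracle(x, y) == "must-link" else cl).append((x, y))
        dists = np.linalg.norm(X[BS] - X[x], axis=1)
        b1, b2 = BS[int(dists.argmin())], BS[int(dists.argmax())]
        for b in (b1, b2):                      # nearest and farthest border objects
            (ml if oracle(x, b) == "must-link" else cl).append((x, b))
        rest = [c for c in CS if c not in scs]
        if not rest:
            break
        # next core object: the one farthest from the already-selected set SCS
        d_to_scs = [min(np.linalg.norm(X[c] - X[s]) for s in scs) for c in rest]
        scs.append(rest[int(np.argmax(d_to_scs))])
    return ml, cl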
Experiments: • Two real-world document corpora are used. • The first corpus is the 20-Newsgroups collection. • It contains messages collected from 20 different Usenet newsgroups, 1000 messages from each newsgroup, i.e., 20 × 1000 = 20,000 messages. • News-all20 has 2000 points in 16089 dimensions. • News-sim3 has 300 points in 3225 dimensions. • The second document set comes from the Topic Detection and Tracking (TDT) program, which provides five corpora to support TDT research: the TDT Pilot corpus and the TDT2, TDT3, TDT4, and TDT5 corpora. • The TDT5 corpus is used to test the effectiveness of the approach; it is composed of 3905 points in 19325 dimensions.
Evaluation: • Four external criteria of clustering quality: • Purity is a simple and transparent evaluation measure. • Normalized mutual information can be information-theoretically interpreted. • The Rand index penalizes both false-positive and false-negative decisions during clustering. • The F measure in addition supports differential weighting of these two types of errors. • Normalized Mutual Information (NMI) and the pairwise F-measure are used as the clustering validation criteria, as sketched below.
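The two chosen criteria can be computed as below (a sketch: NMI via scikit-learn, and the pairwise F-measure by counting document pairs, treating "same cluster" as the positive class).

# Sketch of the two validation criteria: NMI from scikit-learn, and a
# pairwise F-measure obtained by pair counting over document pairs.
from itertools import combinations
from sklearn.metrics import normalized_mutual_info_score

def pairwise_f_measure(truth, pred):
    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_t, same_p = truth[i] == truth[j], pred[i] == pred[j]
        tp += same_t and same_p           # pair correctly placed together
        fp += (not same_t) and same_p     # pair wrongly placed together
        fn += same_t and (not same_p)     # pair wrongly separated
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

truth, pred = [0, 0, 1, 1], [0, 0, 1, 0]   # toy labels
print(normalized_mutual_info_score(truth, pred), pairwise_f_measure(truth, pred))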
Advantages of Cons-DBSCAN: • Resistant to noise • Can handle clusters of different shapes and sizes
Areas where Cons-DBSCAN is not applicable: • Data with varying densities • High-dimensional data • There are four approaches for clustering high-dimensional data: • Subspace clustering • Projected clustering • Hybrid approaches • Correlation clustering