Explore the method of clustering and seriating words based on their co-occurrences to improve search results and find synonyms. Learn about Grade Correspondence Analysis and HAL method for text mining. See how deviations from regularity in word patterns can reveal valuable insights.
Grade clustering and seriation of words based on their co-occurrences Emilia Jarochowska & Krzysztof Ciesielski Institute of Computer Science, Poland
Summary Using data on term co-occurrence extracted from a newsgroup sample, we seek the most regular arrangement of the terms and show how the resulting pattern allows convenient visualization and clustering.
Clustering of documents and terms: what for? • Improving and grouping search results • Finding synonyms: construction of thesauri, query expansion based on the synonyms of the entered terms • Finding collocations
The common approach to term clustering • Association matrices which quantify term correlations • This global approach does not necessarily adapt well to the local context
The local approach Co-occurrence is identified within a sliding window instead of the whole document and arranged into a contingency table (a symmetric matrix).
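A minimal sketch of the sliding-window counting described above. The function name `cooccurrence_matrix`, the `window` parameter and the toy sentence are illustrative assumptions, not part of the original slides; every pair of vocabulary terms at most `window` positions apart is counted once, in both directions, so the table is symmetric.

```python
import numpy as np

def cooccurrence_matrix(sentences, vocabulary, window=5):
    """Count co-occurrences of vocabulary terms within a sliding window.

    `sentences` is a list of token lists. The window size is an assumed
    parameter; the slides do not state the value that was used.
    """
    index = {term: k for k, term in enumerate(vocabulary)}
    h = np.zeros((len(vocabulary), len(vocabulary)))
    for tokens in sentences:
        for pos, left in enumerate(tokens):
            if left not in index:
                continue
            # Look ahead at most `window` tokens; earlier tokens were
            # already paired with this one when they were `left`.
            for right in tokens[pos + 1 : pos + 1 + window]:
                if right in index:
                    i, j = index[left], index[right]
                    h[i, j] += 1
                    if i != j:
                        h[j, i] += 1  # keep the table symmetric
    return h

vocab = ["ftp", "server", "unix"]
table = cooccurrence_matrix([["ftp", "server", "unix"]], vocab, window=1)
```

With `window=1` only adjacent terms are paired, so "ftp" and "unix" do not co-occur in the toy sentence.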
Material A collection of posts from 20newsgroups, widely used as a benchmark for text-mining methods http://people.csail.mit.edu/jrennie/20Newsgroups/ • comp.windows.x • rec.antiques.radio+phono • rec.sport.hockey • sci.med • talk.religion.misc Entropy of within-group frequencies (condition) 363 automatically selected keywords representing these groups
Methods • Stemming – to reduce inflected forms to one representative • HAL (Hyperspace Analogue to Language) • Grade correspondence analysis implemented in the GradeStat program
HAL HAL generates matrix H in which the cell h_ij corresponds to the similarity measure of the terms i and j. If s = (t_1, ..., t_k) is a sentence (an ordered list of terms), then h_ij is the sum (over all sentences in a collection of documents) of co-occurrences of terms i and j. Several forms of normalization are possible.
Grade Correspondence Analysis GCA transforms a data matrix into a probability table and iteratively permutes rows and columns to make it more strongly and regularly positively dependent by maximizing Kendall’s tau.
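The idea of permuting rows and columns to increase Kendall's tau can be illustrated with a small hill-climbing sketch. This is not the algorithm implemented in GradeStat: the adjacent-swap search, the function names `kendall_tau` and `seriate`, and the toy table are assumptions made for illustration, and such a greedy search can stop in a local optimum.

```python
import numpy as np

def kendall_tau(p):
    """Concordant minus discordant probability mass of a probability table."""
    p = np.asarray(p, dtype=float)
    tau = 0.0
    for i in range(p.shape[0]):
        for j in range(p.shape[1]):
            concordant = p[i + 1:, j + 1:].sum()  # cells below and to the right
            discordant = p[i + 1:, :j].sum()      # cells below and to the left
            tau += 2.0 * p[i, j] * (concordant - discordant)
    return tau

def seriate(table, max_sweeps=100):
    """Greedy adjacent swaps of rows and columns that increase tau."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()  # data matrix -> probability table
    orders = [list(range(p.shape[0])), list(range(p.shape[1]))]
    best = kendall_tau(p)
    for _ in range(max_sweeps):
        improved = False
        for axis in (0, 1):
            order = orders[axis]
            for a in range(len(order) - 1):
                order[a], order[a + 1] = order[a + 1], order[a]
                tau = kendall_tau(p[np.ix_(orders[0], orders[1])])
                if tau > best + 1e-12:
                    best, improved = tau, True
                else:
                    # The swap did not help, so undo it.
                    order[a], order[a + 1] = order[a + 1], order[a]
        if not improved:
            break
    return orders[0], orders[1], best

rows, cols, tau = seriate([[0, 5], [5, 0]])
```

For the toy table, swapping the two rows moves all the mass onto the main diagonal, which is the most positively dependent arrangement available.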
Regularity and deviation from it In the most regular arrangement possible, the deviation from regularity for each pair of observations or variables can be measured as: ar_max - |ar| where ar is the concentration index of the two distributions describing that particular pair of observations/variables, and ar_max is the respective maximum concentration index.
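The measure ar_max - |ar| can be sketched for two discrete distributions p and q over the same ordered categories, for instance two normalized rows of the table. The sketch assumes one common convention: ar equals 1 minus twice the area under the concentration curve of q relative to p, and ar_max is the same index after sorting the categories by the ratio q_i / p_i. The function names are hypothetical, and the slides do not confirm that this exact convention was used.

```python
import numpy as np

def concentration_index(p, q):
    """ar = 1 - 2 * area under the piecewise-linear concentration curve."""
    P = np.concatenate(([0.0], np.cumsum(p)))
    Q = np.concatenate(([0.0], np.cumsum(q)))
    # Trapezoid rule over the curve through the points (P_k, Q_k).
    area = np.sum(np.diff(P) * (Q[:-1] + Q[1:]) / 2.0)
    return 1.0 - 2.0 * area

def deviation_from_regularity(p, q):
    """ar_max - |ar| for the categories in their current order."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    ar = concentration_index(p, q)
    # Sorting by the ratio q_i / p_i gives a convex curve and hence the
    # largest attainable index; categories with p_i = 0 go last.
    ratio = np.divide(q, p, out=np.full_like(q, np.inf), where=p > 0)
    order = np.argsort(ratio, kind="stable")
    ar_max = concentration_index(p[order], q[order])
    return ar_max - abs(ar)

regular = deviation_from_regularity([0.5, 0.5], [0.25, 0.75])
irregular = deviation_from_regularity([1, 1, 1], [0.2, 0.6, 0.2])
```

Under this convention the deviation is zero when the ratio q_i / p_i is already monotone in the given order and positive otherwise.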
Overrepresentation maps Contingency matrices are here visualized by means of overrepresentation maps. Overrepresentation is defined as follows:
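The formula itself did not survive in this transcript. A sketch under the definition usually given in grade data analysis, assumed here rather than taken from the slide: the overrepresentation of cell (i, j) is the ratio of its probability p_ij to the product of its row and column marginals, so values above 1 mark pairs that co-occur more often than independence would predict. The function name `overrepresentation` is illustrative.

```python
import numpy as np

def overrepresentation(table):
    """Ratio of each cell probability to its value under independence."""
    p = np.asarray(table, dtype=float)
    p = p / p.sum()
    expected = np.outer(p.sum(axis=1), p.sum(axis=0))
    # Cells whose row or column is empty are left at 0 to avoid division by 0.
    return np.divide(p, expected, out=np.zeros_like(p), where=expected > 0)

independent = overrepresentation([[1, 1], [1, 1]])
polarized = overrepresentation([[2, 0], [0, 2]])
```

An overrepresentation map then shades each cell according to this ratio, with rows and columns drawn in the order found by the seriation.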
Polarization between groups of terms • Computer-related terms: ftp, server, unix, MIT, Columbia, mac, graphic, video, display, internet • Political and religious terms: murder, belief, kill, faith, Jewish, moral, hell, death, children, shot, war, fire, arm, defense, absolut, burn, Bible
Deviations from regularity • Are themselves more regular than the original data • Thus are better descriptors of the position of a term in the dataset
Examples of seriation [Figure: seriated terms (company, example, baseball, produce, house, april, war, city, ftp) with clusters labelled commerce, general, computers, sport, war, religion]
Conclusions • We identified two disjoint groups composed of very specific terms and a group of terms with various affinities to these extremes → a scale obtained through unsupervised learning • Deviation from regularity in the dataset characterizes terms better than raw co-occurrence data
Plans for the future Deviation from regularity, used as a criterion in outlier detection, might indicate words used inappropriately for their context, neologisms, etc.
Thank you for your attention http://gradestat.ipipan.waw.pl/english/