Grade clustering and seriation of words based on their co-occurrences

Emilia Jarochowska & Krzysztof Ciesielski, Institute of Computer Science, Poland

Summary Using data on term co-occurrence extracted from a newsgroup sample, we seek the most regular arrangement of the terms and show how the resulting pattern allows convenient visualization and clustering.

Clustering of documents and terms: what for? • Improving and grouping search results • Finding synonyms: construction of thesauri, query expansion based on the synonyms of the entered terms • Finding collocations

The common approach to term clustering • Association matrices which quantify term correlations • This global approach does not necessarily adapt well to the local context

The local approach Co-occurrence is identified within a sliding window instead of the whole document, and the counts are arranged into a contingency table (a symmetric matrix).
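A minimal sketch of sliding-window counting, to make the local approach concrete. The function name `cooccurrence_counts`, the window size, and the choice to ignore a term co-occurring with itself are illustrative assumptions, not details taken from the slides.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=5):
    """Count symmetric co-occurrences of terms lying within one sliding window.

    `window` is the number of consecutive tokens a window spans (assumed value).
    """
    counts = Counter()
    for i, left in enumerate(tokens):
        # Tokens that share a window of the given width with tokens[i].
        for right in tokens[i + 1 : i + window]:
            if left != right:
                counts[(left, right)] += 1
                counts[(right, left)] += 1  # keep the table symmetric
    return counts

counts = cooccurrence_counts(["ftp", "server", "unix", "ftp"], window=2)
```

With `window=2` only adjacent tokens are paired, so each of the three adjacent pairs in the toy list is counted once in each direction.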

Material A collection of posts from 20newsgroups, widely used as a benchmark for text-mining methods: http://people.csail.mit.edu/jrennie/20Newsgroups/ Groups: comp.windows.x, rec.antiques.radio+phono, rec.sport.hockey, sci.med, talk.religion.misc. 363 keywords representing these groups were selected automatically (condition: entropy of within-group frequencies).
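The entropy condition can be illustrated as follows. This is only a sketch: the slides do not state the threshold or whether low-entropy terms were the ones kept, so the function name `within_group_entropy` and the interpretation in the comments are assumptions.

```python
import math

def within_group_entropy(counts):
    """Shannon entropy (in bits) of a term's frequency distribution over groups."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# A term that appears in only one of the five groups is maximally specific.
specific = within_group_entropy([10, 0, 0, 0, 0])
# A term spread evenly over the five groups carries no group information.
uniform = within_group_entropy([2, 2, 2, 2, 2])
```

Under this reading, terms with low entropy would be candidate keywords for a group, while terms with entropy near log2(5) would be discarded as uninformative.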

Methods • Stemming – to reduce inflected forms to one representative • HAL (Hyperspace Analogue to Language) • Grade correspondence analysis implemented in the GradeStat program

HAL HAL generates a matrix H in which the cell h_ij corresponds to the similarity measure of the terms i and j. If s = (t_1, ..., t_k) is a sentence (an ordered list of terms), then h_ij is the sum (over all sentences in a collection of documents) of co-occurrences of terms i and j. Several forms of normalization are possible.
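A hedged sketch of building such a matrix. It assumes the distance weighting commonly associated with HAL (closer terms weigh more), symmetrizes the result to match the symmetric contingency table described earlier, and normalizes by the grand total; the names `hal_matrix` and `to_probability_table` and the window size are illustrative, not taken from the slides.

```python
import numpy as np

def hal_matrix(sentences, vocab, window=5):
    """Accumulate distance-weighted co-occurrences over all sentences."""
    index = {term: k for k, term in enumerate(vocab)}
    H = np.zeros((len(vocab), len(vocab)))
    for sentence in sentences:
        for a, term_a in enumerate(sentence):
            for d in range(1, window + 1):
                b = a + d
                if b >= len(sentence):
                    break
                term_b = sentence[b]
                if term_a in index and term_b in index:
                    weight = window - d + 1  # assumed weighting: nearer is heavier
                    H[index[term_a], index[term_b]] += weight
                    H[index[term_b], index[term_a]] += weight
    return H

def to_probability_table(H):
    """One possible normalization: divide by the grand total."""
    return H / H.sum()

vocab = ["ftp", "server", "unix"]
H = hal_matrix([["ftp", "server", "unix"]], vocab, window=2)
P = to_probability_table(H)
```

In the toy sentence, adjacent terms receive weight 2 and terms two positions apart receive weight 1.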

Grade Correspondence Analysis GCA transforms a data matrix into a probability table and iteratively permutes its rows and columns to make it more strongly and regularly positively dependent, by maximizing Kendall’s tau.
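The idea of permuting rows and columns to increase Kendall's tau can be sketched with a naive hill climb. This is not the GradeStat algorithm: the pairwise-swap search, the function names, and the use of tau without tie correction (twice the difference between concordance and discordance probabilities) are all assumptions made for illustration, and the search is only practical for small tables.

```python
import itertools
import numpy as np

def kendall_tau(table):
    """Kendall's tau of a two-way table, without tie correction."""
    P = np.asarray(table, dtype=float)
    P = P / P.sum()
    n, m = P.shape
    cum = np.zeros((n + 1, m + 1))
    cum[1:, 1:] = P.cumsum(axis=0).cumsum(axis=1)  # cum[i, j] = sum of P[:i, :j]
    conc = disc = 0.0
    for i in range(n):
        for j in range(m):
            # Mass strictly below and to the right (concordant with cell i, j).
            below_right = cum[n, m] - cum[i + 1, m] - cum[n, j + 1] + cum[i + 1, j + 1]
            # Mass strictly below and to the left (discordant with cell i, j).
            below_left = cum[n, j] - cum[i + 1, j]
            conc += P[i, j] * below_right
            disc += P[i, j] * below_left
    return 2.0 * (conc - disc)

def greedy_seriation(table):
    """Swap pairs of rows or columns while any swap increases tau."""
    P = np.asarray(table, dtype=float)
    rows, cols = list(range(P.shape[0])), list(range(P.shape[1]))
    best = kendall_tau(P)
    improved = True
    while improved:
        improved = False
        for order in (rows, cols):
            for a, b in itertools.combinations(range(len(order)), 2):
                order[a], order[b] = order[b], order[a]
                tau = kendall_tau(P[np.ix_(rows, cols)])
                if tau > best + 1e-12:
                    best, improved = tau, True
                else:
                    order[a], order[b] = order[b], order[a]  # undo the swap
    return rows, cols, best

scrambled = np.eye(3)[[2, 0, 1]]
rows, cols, tau = greedy_seriation(scrambled)
```

The returned `rows` and `cols` give the seriation order; `scrambled[np.ix_(rows, cols)]` is the rearranged table.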

Regularity and deviation from it In the most regular arrangement possible, the deviation from regularity for each pair of observations or variables can be measured as: ar_max − |ar|, where ar is the concentration index of the two distributions describing that particular pair of observations/variables, and ar_max is the respective maximum concentration index.
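A sketch of this measure, assuming the definition of the concentration index usual in grade data analysis: ar equals one minus twice the area under the piecewise-linear concentration curve through the cumulative sums of the two distributions, and ar_max is the value obtained after sorting categories by the ratio q/p. The function names are illustrative.

```python
import numpy as np

def concentration_index(p, q):
    """ar = 1 - 2 * area under the concentration curve of q relative to p."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    x = np.concatenate([[0.0], np.cumsum(p)])
    y = np.concatenate([[0.0], np.cumsum(q)])
    area = np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0)  # trapezoid rule
    return 1.0 - 2.0 * area

def max_concentration_index(p, q):
    """ar after reordering categories by increasing q/p (convex curve)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    ratio = np.divide(q, p, out=np.full_like(q, np.inf), where=p > 0)
    order = np.argsort(ratio, kind="stable")
    return concentration_index(p[order], q[order])

def deviation_from_regularity(p, q):
    """ar_max - |ar| for one pair of distributions in the current ordering."""
    return max_concentration_index(p, q) - abs(concentration_index(p, q))

p = [1 / 3, 1 / 3, 1 / 3]
q = [0.2, 0.6, 0.2]
dev = deviation_from_regularity(p, q)
```

A pair whose categories are already ordered by q/p (or by its reverse) has deviation zero; the toy pair above is not monotone in that ratio, so its deviation is positive.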

Overrepresentation maps Contingency matrices are here visualized by means of overrepresentation maps. Overrepresentation is defined as follows:
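Assuming the ratio definition commonly used in grade data analysis (the observed cell probability divided by the product of its row and column marginals, so that 1 means independence), a minimal sketch of the values such a map would display. The function name is illustrative.

```python
import numpy as np

def overrepresentation(table):
    """Ratio of observed cell probability to its expectation under independence."""
    P = np.asarray(table, dtype=float)
    P = P / P.sum()
    expected = np.outer(P.sum(axis=1), P.sum(axis=0))
    # Cells whose expected probability is zero are reported as 0.
    return np.divide(P, expected, out=np.zeros_like(P), where=expected > 0)

independent = overrepresentation([[1, 1], [1, 1]])
polarized = overrepresentation([[2, 0], [0, 2]])
```

Values above 1 mark overrepresented cells and values below 1 mark underrepresented ones; a map would shade each cell accordingly.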

Results

Polarization between groups of terms Computer-related terms: ftp, server, unix, MIT, Columbia, mac, graphic, video, display, internet. Political and religious terms: murder, belief, kill, faith, Jewish, moral, hell, death, children, shot, war, fire, arm, defense, absolut, burn, Bible.

Deviations from regularity • Are themselves more regular than the original data • Thus are better descriptors of the position of a term in the dataset

Examples of seriation [Figure: seriated terms. Example terms: company, example, baseball, produce, house, april, war, city, ftp. Clusters: commerce, general, computers, sport, war, religion.]

Conclusions • We identified two disjoint groups composed of very specific terms and a group of terms with varying affinities to these extremes → a scale obtained by unsupervised learning • Deviation from regularity in the dataset characterizes terms better than the raw co-occurrence data

Plans for the future Deviation from regularity, used as a criterion in outlier detection, might indicate words used inappropriately for their context, neologisms, etc.

Thank you for your attention http://gradestat.ipipan.waw.pl/english/
