Grade clustering and seriation of words based on their co-occurrences

Explore the method of clustering and seriating words based on their co-occurrences to improve search results and find synonyms. Learn about Grade Correspondence Analysis and the HAL method for text mining. See how deviations from regularity in word patterns can reveal valuable insights.

Presentation Transcript


  1. Grade clustering and seriation of words based on their co-occurrences Emilia Jarochowska & Krzysztof Ciesielski Institute of Computer Science, Poland

  2. Summary Using data on term co-occurrence extracted from a newsgroup sample, we seek the most regular arrangement of the terms and show how the resulting pattern allows convenient visualization and clustering.

  3. Clustering of documents and terms: what for? • Improving and grouping search results • Finding synonyms: construction of thesauri, query expansion based on the synonyms of the entered terms • Finding collocations

  4. The common approach to term clustering • Association matrices which quantify term correlations • This global approach does not necessarily adapt well to the local context

  5. The local approach Co-occurrence is identified within a sliding window instead of the whole document, and the counts are arranged into a contingency table (a symmetric matrix).
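The windowed counting described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `window_cooccurrence`, the window size, and the choice to skip a term paired with itself are all assumptions.

```python
import numpy as np


def window_cooccurrence(tokens, vocab, window=5):
    """Count co-occurrences of vocabulary terms inside a sliding window.

    `window` is the number of tokens covered by the window, including the
    current one (an assumed convention; the slides give no window size).
    """
    index = {term: k for k, term in enumerate(vocab)}
    table = np.zeros((len(vocab), len(vocab)), dtype=int)
    for pos, term in enumerate(tokens):
        if term not in index:
            continue
        # Look only ahead, so that every pair is counted once per window.
        for other in tokens[pos + 1:pos + window]:
            if other in index and other != term:
                i, j = index[term], index[other]
                table[i, j] += 1
                table[j, i] += 1  # keep the table symmetric
    return table


tokens = ["ftp", "server", "unix", "ftp"]
vocab = ["ftp", "server", "unix"]
table = window_cooccurrence(tokens, vocab, window=2)
```

Because each pair is added to both `table[i, j]` and `table[j, i]`, the result is symmetric by construction, matching the contingency table mentioned on the slide.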

  6. Material A collection of posts from 20newsgroups, widely used as a benchmark for text-mining methods (http://people.csail.mit.edu/jrennie/20Newsgroups/) • Groups: comp.windows.x, rec.antiques.radio+phono, rec.sport.hockey, sci.med, talk.religion.misc • Condition: entropy of within-group frequencies • 363 automatically selected keywords representing these groups
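The slide names the entropy of within-group frequencies as the selection condition but does not say how it was applied. As a hedged illustration only, the Shannon entropy of one term's frequencies across the groups can be computed as below; the function name and the reading that low entropy marks a group-specific term are assumptions, not details from the presentation.

```python
import math


def within_group_entropy(counts):
    """Shannon entropy (in bits) of one term's frequencies across groups.

    `counts` holds the number of occurrences of the term in each newsgroup.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts:
        if count > 0:
            share = count / total
            entropy -= share * math.log2(share)
    return entropy


# A term confined to a single group has zero entropy.
specific = within_group_entropy([10, 0, 0, 0, 0])
# A term spread evenly over four groups has entropy log2(4) = 2 bits.
spread = within_group_entropy([2, 2, 2, 2])
```

Under this reading, terms with low entropy are concentrated in few groups and would be natural candidates for keywords that represent those groups.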

  7. Methods • Stemming – to reduce inflected forms to one representative • HAL (Hyperspace Analogue to Language) • Grade correspondence analysis implemented in the GradeStat program

  8. HAL HAL generates matrix $H$ in which the cell $h_{ij}$ corresponds to the similarity measure of the terms $i$ and $j$. If $s = (t_1, \dots, t_k)$ is a sentence (an ordered list of terms), then $h_{ij}$ is the sum (over all sentences in a collection of documents) of co-occurrences of terms $i$ and $j$. Several forms of normalization are possible.
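A minimal sketch of building such a matrix is given below. It assumes the distance weighting of the original HAL model, in which a pair at distance d inside a window of size w contributes w - d + 1; the slides do not state which weighting or window size was used, and the function name `hal_matrix` and the convention that row i precedes column j are assumptions.

```python
import numpy as np


def hal_matrix(sentences, vocab, window=3):
    """Accumulate distance-weighted co-occurrences over all sentences.

    h[i, j] grows when term i precedes term j by at most `window` positions;
    closer pairs receive larger weights (window - distance + 1).
    """
    index = {term: k for k, term in enumerate(vocab)}
    h = np.zeros((len(vocab), len(vocab)))
    for sentence in sentences:
        for pos, term in enumerate(sentence):
            if term not in index:
                continue
            for distance in range(1, window + 1):
                if pos + distance >= len(sentence):
                    break
                other = sentence[pos + distance]
                if other in index:
                    h[index[term], index[other]] += window - distance + 1
    return h


vocab = ["ftp", "server", "unix"]
h = hal_matrix([["ftp", "server", "unix"]], vocab, window=2)
h_symmetric = h + h.T  # one possible symmetrization, ignoring word order
```

The directional matrix `h` keeps word order, while `h_symmetric` discards it, which is one way to obtain a symmetric table of the kind analysed later in the presentation.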

  9. Grade Correspondence Analysis GCA transforms a data matrix into a probability table and iteratively permutes its rows and columns to make the dependence between them more strongly and regularly positive, by maximizing Kendall’s tau.
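The idea can be illustrated with a small sketch: normalize the table, compute Kendall's tau for it, and keep any row or column swap that increases tau. The pairwise-swap search is an assumption made for brevity; the slides do not describe the search procedure used in GradeStat, and the names `kendall_tau` and `gca_sketch` are hypothetical.

```python
import itertools

import numpy as np


def kendall_tau(p):
    """Kendall's tau of a probability table: P(concordant) - P(discordant)."""
    rows, cols = p.shape
    tau = 0.0
    for i in range(rows):
        for j in range(cols):
            concordant = p[i + 1:, j + 1:].sum()  # cells below and to the right
            discordant = p[i + 1:, :j].sum()      # cells below and to the left
            tau += 2.0 * p[i, j] * (concordant - discordant)
    return tau


def gca_sketch(counts, max_sweeps=100):
    """Greedy search for row and column orders that increase tau."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()  # data matrix -> probability table
    rows = list(range(p.shape[0]))
    cols = list(range(p.shape[1]))
    best = kendall_tau(p)
    for _ in range(max_sweeps):
        improved = False
        for order in (rows, cols):
            for a, b in itertools.combinations(range(len(order)), 2):
                order[a], order[b] = order[b], order[a]
                candidate = kendall_tau(p[np.ix_(rows, cols)])
                if candidate > best + 1e-12:
                    best, improved = candidate, True
                else:
                    order[a], order[b] = order[b], order[a]  # undo the swap
        if not improved:
            break
    return rows, cols, best


shuffled = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])
row_order, col_order, tau = gca_sketch(shuffled)
```

A greedy search like this can stop at a local maximum on real data, so it should be read as an illustration of the objective rather than a substitute for the GradeStat implementation.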

  10. Regularity and deviation from it In the most regular arrangement possible, the deviation from regularity for each pair of observations or variables can be measured as $ar_{\max} - |ar|$, where $ar$ is the concentration index of the two distributions describing that particular pair of observations/variables, and $ar_{\max}$ is the respective maximum concentration index.
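A possible way to compute this measure is sketched below. It assumes the usual grade-analysis definitions, which the slide does not spell out: the concentration index is one minus twice the area under the concentration curve of q relative to p, and the maximum index is obtained by ordering the categories by increasing ratio q/p. The function names are hypothetical and p is assumed to be strictly positive.

```python
import numpy as np


def concentration_index(p, q):
    """ar = 1 - 2 * area under the concentration curve of q relative to p."""
    x = np.concatenate(([0.0], np.cumsum(p)))
    y = np.concatenate(([0.0], np.cumsum(q)))
    # Trapezoid rule over the piecewise-linear curve through (x, y).
    area = np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0)
    return 1.0 - 2.0 * area


def deviation_from_regularity(p, q):
    """ar_max - |ar| for two distributions over the same ordered categories."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    ar = concentration_index(p, q)
    # Ordering by q/p makes the curve convex, which maximizes the index.
    order = np.argsort(q / p, kind="stable")
    ar_max = concentration_index(p[order], q[order])
    return ar_max - abs(ar)


p = [1 / 3, 1 / 3, 1 / 3]
q = [0.2, 0.6, 0.2]
deviation = deviation_from_regularity(p, q)
```

Here p and q would be two row (or column) profiles of the probability table taken in the order found by GCA; a deviation of zero means the pair is already as regularly ordered as it can be.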

  11. Overrepresentation maps Contingency matrices are here visualized by means of overrepresentation maps. Overrepresentation is defined as follows:
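The formula itself is missing from the transcript. A definition commonly used in grade data analysis, assumed here rather than taken from the slide, is the ratio of each observed cell probability to the probability expected under independence of rows and columns. A minimal sketch under that assumption, with a hypothetical function name:

```python
import numpy as np


def overrepresentation(counts):
    """Ratio of observed cell probability to the product of its marginals.

    Values above 1 mark overrepresented cells, values below 1 mark
    underrepresented ones (assumed definition, see the text above).
    """
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    expected = np.outer(p.sum(axis=1), p.sum(axis=0))
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(expected > 0, p / expected, 0.0)
    return ratio


ratios = overrepresentation([[2, 0], [0, 2]])
```

In a map, each cell would then be shaded according to its ratio, so blocks of overrepresented term pairs stand out visually.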

  12. Results

  13. Polarization between groups of terms • Computer-related terms: ftp, server, unix, MIT, Columbia, mac, graphic, video, display, internet • Political and religious terms: murder, belief, kill, faith, Jewish, moral, hell, death, children, shot, war, fire, arm, defense, absolut, burn, Bible

  14. Deviations from regularity • Are themselves more regular than the original data • Thus are better descriptors of the position of a term in the dataset

  15. Examples of seriation [Figure: seriation of terms with clusters. Labels recoverable from the slide: company, example, baseball, produce, house, april, war, city, ftp; commerce, general, computers, sport, war, religion.]

  16. Conclusions • We identified two disjoint groups composed of very specific terms and a group of terms with various affinities to these extremes → a scale obtained in a process of unsupervised learning • Deviation from regularity in the dataset characterizes terms better than co-occurrence data alone

  17. Plans for the future Deviation from regularity, used as a criterion in outlier detection, might indicate words used inappropriately for the context, neologisms, etc.

  18. Thank you for your attention http://gradestat.ipipan.waw.pl/english/
