Catpac & WordStat

Catpac & WordStat • Dongwoo Kim & Fran Stewart • COM633, Fall 2010

Catpac: The Basics • Originally created by Joseph Woelfel to examine consumer behavior and marketing. • Presented as part of the Galileo package of software analysis tools. • Billed as a “self-organizing artificial neural network” optimized for examining text.

Catpac: What It Does • Recognizes frequency of words used in text. • Focuses on co-occurrence of words – words that appear near each other in context. • Uses cluster analysis to display word co-occurrence. • Incorporates ThoughtView’s perceptual mapping and Oresme’s interactive clustering.

Catpac: How Does It Work? • Catpac moves a window of n words through the text. For example, if the window size selected is 7 words, then Catpac will systematically scan words 1-7, then words 2-8, 3-9 and so on until it completes the document. • Words appearing in the window then activate the neurons representing them. Connections among activated neurons allow Catpac to associate words that appear close together within the text.
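The sliding-window idea can be sketched in a few lines of Python. This is only a plain co-occurrence counter under stated assumptions, not Catpac's neural network: the function name `window_cooccurrence` and the sample sentence are made up for illustration, and each pair of distinct words is counted once per window in which both appear.

```python
from collections import Counter
from itertools import combinations


def window_cooccurrence(words, window_size=7):
    """Count how often each pair of distinct words shares a sliding window."""
    pair_counts = Counter()
    # Window positions: words 1-7, then 2-8, 3-9, ... to the end of the text.
    last_start = max(len(words) - window_size, 0)
    for start in range(last_start + 1):
        # Sort the unique words so each pair is always stored in the same order.
        window = sorted(set(words[start:start + window_size]))
        pair_counts.update(combinations(window, 2))
    return pair_counts


# Hypothetical sample text, not taken from the articles analyzed in the slides.
text = "reliability of data depends on coders and data depends on reliability"
counts = window_cooccurrence(text.split(), window_size=3)
print(counts[("data", "depends")])
print(counts.most_common(3))
```

Pairs that share many windows are the ones a clustering step would later place close together; a wider window links words that sit farther apart in the text.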

Getting Started • Catpac can only be used on ASCII text files, so Word documents will need to be converted to .txt files. • The simplest analysis is a “dendogram,” according to the Galileo manual. • A dendrogram is “a branching diagram representing a hierarchy of categories based on degree of similarity or number of shared characteristics especially in biological taxonomy.” – Merriam-Webster • Dendron is Greek for tree.

Step 1: Convert to .txt file
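Text exported from Word often still carries curly quotes, dashes and accented letters that are not ASCII. A minimal sketch of one way to clean such text before saving it as .txt follows; the helper name `to_ascii` and the replacement table are assumptions for illustration, and the slides do not say how Catpac itself treats non-ASCII characters.

```python
import unicodedata


def to_ascii(text):
    """Return an ASCII-only approximation of text."""
    # Replace common typographic punctuation that has no ASCII decomposition.
    replacements = {
        "\u2018": "'", "\u2019": "'",   # curly single quotes
        "\u201c": '"', "\u201d": '"',   # curly double quotes
        "\u2013": "-", "\u2014": "-",   # en and em dashes
        "\u2026": "...",                # ellipsis
    }
    for fancy, plain in replacements.items():
        text = text.replace(fancy, plain)
    # Decompose accented letters, then drop anything still outside ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


sample = "Krippendorff\u2019s \u201calpha\u201d \u2013 na\u00efve caf\u00e9"
print(to_ascii(sample))
```

Characters with no ASCII equivalent are silently dropped, so it is worth skimming the cleaned file before analysis.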

Step 2: Input .txt file in Catpac.

Step 3: Select file to be analyzed.

Step 4: Make a Dendrogram. (Note the spelling error.)

This is what you will see ... 50 most frequent words 25 most frequent words

… after you exclude common words. (This seems a bit clunky given what the program purports to do.)

Results • Of the 738 total words in Klaus Krippendorff’s article on “Testing the Reliability of Content Analysis Data: What Is Involved and Why?”, the most frequently used word in the text was data. It appears 84 times, accounting for 11.4 percent of all words used. • Reliability was the next most often-used word, accounting for more than 8 percent of the total words used.
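The percentage is just the word's count divided by the total: 84 / 738 is about 11.4 percent. A small sketch of the same tally in Python follows, assuming a tiny illustrative stop list and a made-up sample sentence. Catpac's own exclusion list is not shown in the slides, and whether the 738-word total includes excluded words is not stated.

```python
import re
from collections import Counter

# A tiny illustrative stop list (an assumption, not Catpac's exclusion list).
STOP_WORDS = {"the", "of", "and", "a", "is", "in", "to", "on"}


def word_shares(text, top_n=3):
    """Return (word, count, percent of counted words) for the top_n words."""
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOP_WORDS]
    total = len(words)
    return [(w, n, round(100 * n / total, 1))
            for w, n in Counter(words).most_common(top_n)]


# The share reported on the slide for "data": 84 of 738 words.
print(round(100 * 84 / 738, 1))

# Hypothetical sample sentence, not from the Krippendorff article.
sample = "The reliability of data depends on the data and the coders"
print(word_shares(sample, top_n=2))
```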

Compare that to … His 1,368-word discussion of “Computing Krippendorff’s Alpha-Reliability,” where data was the most frequently used word (excluding common articles and prepositions). The word appears 65 times, followed closely by reliability and observers.

Dendrograms • Ward’s Method • Centroid
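The two dendrograms differ because the two methods use different rules for choosing which clusters to merge next. The centroid method merges the pair whose centroids are closest; Ward's method merges the pair whose union adds the least within-cluster variance, which penalizes joining a point to an already large cluster. A rough pure-Python sketch follows, run on made-up one-dimensional points rather than Catpac's word data; all function names are hypothetical.

```python
from itertools import combinations


def centroid(cluster):
    """Mean point of a cluster given as a list of equal-length tuples."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def merge_cost(a, b, method):
    gap = sq_dist(centroid(a), centroid(b))
    if method == "centroid":
        return gap  # (squared) distance between cluster centroids
    if method == "ward":
        # Increase in within-cluster sum of squares caused by merging a and b.
        return gap * len(a) * len(b) / (len(a) + len(b))
    raise ValueError(f"unknown method: {method}")


def agglomerate(points, method):
    """Return the merge order as a list of (cluster_a, cluster_b) pairs."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters that is cheapest to merge.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: merge_cost(clusters[ij[0]], clusters[ij[1]], method),
        )
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Made-up points chosen so the two methods disagree on the second merge.
points = [(0.0,), (1.0,), (2.6,), (4.8,)]
for method in ("centroid", "ward"):
    print(method, agglomerate(points, method))
```

Both methods join 0.0 and 1.0 first. The centroid method then attaches 2.6 to that pair, while Ward's method pairs 2.6 with 4.8 instead, so the resulting trees branch differently.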

Examining Word Clusters • The Oresme interactive clustering function allows for examining concepts that are associated with each other. • “Cycle Input” tells which concepts are activated by a selected concept. • “Cycle Output” “cycles the network output window back into itself.” • Huh? • “Instead of ‘thinking’ about the concepts you originally gave it, it is thinking about the concepts generated by the concepts you originally gave it.”

This is what it looks like … • Cycle Input • Cycle Output • The manual makes note of what some analysts call the “Buddhist monk syndrome,” where “after sufficient contemplation, it appears that all things are one.”

To map these cluster concepts … • First save as a crud file (.crd). • “Select Open from the ThoughtView File menu.” • Wait, where the heck is ThoughtView? (CRD files extract coordinate information from the dendrograms.)

2D mapping of concept clusters Note the tight grouping of words like reliability, data and coders on the right.

3D mapping of concept clusters

3D mapping allows for rotation …

Now for a demonstration …

WordStat

WordStat is… • Content analysis module of SimStat. • Designed to analyze textual information (open-ended responses, interview transcripts, journal articles, news stories, websites, etc.) • Used both for automatic categorization of text using a dictionary and for manual coding.

WordStat has… • Integrated text-mining analysis and visualization tools. • Hierarchical categorization dictionary or user-generated dictionary. • Keyword-in-context (KWIC) and keyword retrieval tools. • Capability of statistical analyses (factor analysis, word frequencies, etc.).

Getting Started • First open SimStat because WordStat must be run as part of the SimStat program. • Build your own dictionary because WordStat’s standard dictionaries are lacking. • Run spell-check on the text to be analyzed because misspelled words may be left uncoded. • Select text-type file (Text, MS Word, HTML, Excel, SPSS files)

Example Study • Sense of humor study data (N=288, including 52 cases with missing data) • Open-ended responses (Q: instances of sense of superiority in humor) • Demographic information (gender, ethnic background and political philosophy) and sense of humor

How to get WordStat • Free trial version on web site: http://www.provalisresearch.com/wordstat/WordStatDownload.html • Dictionary: http://www.provalisresearch.com/wordstat/RID.html

How to use WordStat • Create or import an existing dataset

How to create dictionary • Add categories and words

Dictionary for example study
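Dictionary-based coding boils down to looking each word up in a category list. The sketch below assumes a hypothetical two-category dictionary; the category names echo the example study, but the keywords are invented and this is not WordStat's own engine. It also shows why spell-checking matters: a misspelled keyword simply goes uncoded.

```python
import re
from collections import Counter

# Hypothetical categories and keywords, for illustration only.
DICTIONARY = {
    "FAMILY": {"mother", "father", "brother", "sister"},
    "POLITICS": {"election", "senator", "president"},
}
# Invert the dictionary so each keyword points to its category.
WORD_TO_CATEGORY = {
    word: category for category, words in DICTIONARY.items() for word in words
}


def code_response(text):
    """Count dictionary categories in one response; also return uncoded words."""
    counts, uncoded = Counter(), []
    for word in re.findall(r"[a-z']+", text.lower()):
        category = WORD_TO_CATEGORY.get(word)
        if category:
            counts[category] += 1
        else:
            uncoded.append(word)
    return counts, uncoded


# "presidnet" is deliberately misspelled, so POLITICS is never counted.
counts, uncoded = code_response("My brother laughed when the presidnet tripped")
print(dict(counts))
print(uncoded)
```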

Results • Frequencies

Results • Frequencies - chart

Results • Frequencies – dendrogram, concept map

Results • Crosstab word count - gender

Results • Crosstab word count – political tendency

Results • Crosstab word count – ethnicity

Results • Crosstab word count – combination
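A crosstab becomes interpretable once a significance test is attached to it. The slides do not say which test WordStat applied, so the sketch below assumes one common choice, a Pearson chi-square on a 2x2 table (group by "used the category" versus "did not"). The counts are invented for illustration and are not the study's data.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) and p-value for a 2x2 table.

    Rows are groups, columns are "used the category" / "did not use it":
        group 1: a, b
        group 2: c, d
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up counts for illustration; these are NOT the study's data.
chi2, p = chi_square_2x2(30, 70, 15, 85)
print(f"chi-square = {chi2:.2f}, p = {p:.3f}")
```

With small expected cell counts, an exact test would be a safer choice than this approximation.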

Results • KWIC (Keyword-in-Context)
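A keyword-in-context listing shows every occurrence of a word with a few words on either side, which helps to spot ambiguous uses. A minimal sketch follows, assuming a hypothetical `kwic` helper and a made-up sentence; WordStat's own KWIC display offers more options than this.

```python
import re


def kwic(text, keyword, width=3):
    """Return each occurrence of keyword with `width` words of context per side."""
    words = re.findall(r"[\w']+", text)
    lines = []
    for i, word in enumerate(words):
        if word.lower() == keyword.lower():
            left = " ".join(words[max(i - width, 0):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append(f"{left} [{word}] {right}".strip())
    return lines


# Hypothetical sample sentence, for illustration only.
sample = "Reliability of data matters because unreliable data mislead coders and readers"
for line in kwic(sample, "data", width=2):
    print(line)
```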

Reports • Overall: Humor > Race > Family > Politics > Religion • Gender (M: 105, F: 131) • Women used more Family (p<.05), less Politics (n.s.)

Reports • Ethnic background (W: 159, NW: 67) • White people used more Humor (p<.01), less Religion (n.s.) • Political philosophy (S Consv: 13, Consv: 30, Mid: 64, Libr: 63, S Libr: 38, No Comment: 28)

Limitations • Incomplete dictionary • Overestimation: ambiguous words, overlapping • Underestimation: misspellings, odd expressions • Categorization: obscurations, incongruities

More? Q & A
