Learn how Vocabulary Spectral Analysis is used as an exploratory tool for Scientific Web Intelligence, including its application in academic web mining and subject-based clustering using the Vector Space Model. Discover how to identify patterns, visualize relationships, and incorporate user feedback into the analysis.
Mike Thelwall Professor of Information Science University of Wolverhampton Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence
Contents • Introduction to Scientific Web Intelligence • Introduction to the Vector Space Model • Vocabulary Spectral Analysis • Low frequency words
Part 1 Scientific Web Intelligence
Scientific Web Intelligence • Applying web mining and web intelligence techniques to collections of academic/scientific web sites • Uses links and text • Objective: to identify patterns and visualize relationships between web sites and subsites • Objective: to report to users causal information about relationships and patterns
Academic Web Mining • Step 1: Cluster domains by subject content, using text and links • Step 2: Identify patterns and create visualizations for relationships • Step 3: Incorporate user feedback and reason reporting into visualization This presentation deals with Step 1, deriving subject-based clusters of academic webs from text analysis
Part 2 Introduction to the Vector Space Model
Overview • The Vector Space Model (VSM) is a way of representing documents through the words that they contain • It is a standard technique in Information Retrieval • The VSM allows decisions to be made about which documents are similar to each other and to keyword queries
How it works: Overview • Each document is broken down into a word frequency table • The tables are called vectors and can be stored as arrays • A vocabulary is built from all the words in all documents in the system • Each document is represented as a vector indexed against the vocabulary, with one entry per vocabulary word
Example • Document A • “A dog and a cat.” • Document B • “A frog.”
Example, continued • The vocabulary contains all words used • a, dog, and, cat, frog • The vocabulary needs to be sorted • a, and, cat, dog, frog
Example, continued • Document A: “A dog and a cat.” • Vector: (2,1,1,1,0) • Document B: “A frog.” • Vector: (1,0,0,0,1)
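The example above can be reproduced with a short sketch. The helper names (`tokenize`, `build_vocabulary`, `to_vector`) are illustrative only, and the tokenizer is an assumption: it lowercases the text and keeps runs of letters, which is enough for these two sentences but not a general-purpose tokenizer.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(documents):
    """Return the sorted list of distinct words across all documents."""
    return sorted({word for text in documents.values() for word in tokenize(text)})

def to_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    words = tokenize(text)
    return [words.count(term) for term in vocabulary]

documents = {"A": "A dog and a cat.", "B": "A frog."}
vocabulary = build_vocabulary(documents)
vectors = {name: to_vector(text, vocabulary) for name, text in documents.items()}

print(vocabulary)    # ['a', 'and', 'cat', 'dog', 'frog']
print(vectors["A"])  # [2, 1, 1, 1, 0]
print(vectors["B"])  # [1, 0, 0, 0, 1]
```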
Measuring inter-document similarity • For two vectors d and d’ the cosine similarity between d and d’ is given by: $\cos(d, d') = \frac{d \cdot d'}{\lVert d \rVert \, \lVert d' \rVert}$ • Here d · d’ is the dot (inner) product of d and d’, calculated by multiplying corresponding frequencies together and summing the results, and ‖d‖ is the length of d • The cosine measure gives the cosine of the angle between the vectors in a high-dimensional virtual space
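A minimal sketch of the cosine measure, using the standard definition (dot product divided by the product of the two vector lengths) and the two example vectors from the earlier slides. The function name and the choice to return 0.0 for an all-zero vector are assumptions made for illustration.

```python
import math

def cosine_similarity(d, d_prime):
    """Cosine of the angle between two equal-length frequency vectors."""
    dot = sum(x * y for x, y in zip(d, d_prime))
    length_d = math.sqrt(sum(x * x for x in d))
    length_d_prime = math.sqrt(sum(y * y for y in d_prime))
    if length_d == 0 or length_d_prime == 0:
        return 0.0  # assumption: an empty document is similar to nothing
    return dot / (length_d * length_d_prime)

doc_a = [2, 1, 1, 1, 0]  # "A dog and a cat."
doc_b = [1, 0, 0, 0, 1]  # "A frog."
similarity = cosine_similarity(doc_a, doc_b)
print(round(similarity, 4))  # 0.5345
```

The two documents share only the word "a", so all of their similarity comes from a word that carries no subject information, which is the motivation for stopword lists and idf weighting.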
Stopword lists • Commonly occurring words are unlikely to give useful information and may be removed from the vocabulary to speed processing • E.g. “in”, “a”, “the”
Normalised term frequency (tf) • A normalised measure of the importance of a word to a document is its frequency, divided by the maximum frequency of any term in the document • This is known as the tf factor. • Document A: raw frequency vector: (2,1,1,1,0), tf vector: (1, 0.5, 0.5, 0.5, 0)
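The tf calculation for Document A can be sketched as below. The function name is illustrative, and returning zeros for an empty document is an assumption.

```python
def normalised_tf(raw_frequencies):
    """Divide each raw frequency by the largest frequency in the document."""
    max_frequency = max(raw_frequencies, default=0)
    if max_frequency == 0:
        return [0.0] * len(raw_frequencies)  # assumption: empty document
    return [f / max_frequency for f in raw_frequencies]

tf_a = normalised_tf([2, 1, 1, 1, 0])
print(tf_a)  # [1.0, 0.5, 0.5, 0.5, 0.0]
```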
Inverse document frequency (idf) • A calculation designed to make rare words more important than common words • The idf of word i is given by $\mathrm{idf}_i = \log(N / n_i)$ • where N is the total number of documents and n_i is the number that contain word i
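As a rough illustration, the sketch below computes idf for the two example documents, assuming the common form idf_i = log(N / n_i) with a natural logarithm. Other variants (different log bases, smoothing) exist, so the exact numbers depend on that assumption.

```python
import math

def inverse_document_frequency(vectors):
    """Return log(N / n_i) for each vocabulary position i."""
    n_docs = len(vectors)
    idf = []
    for i in range(len(vectors[0])):
        n_i = sum(1 for v in vectors if v[i] > 0)  # documents containing word i
        idf.append(math.log(n_docs / n_i) if n_i else 0.0)
    return idf

# Vocabulary order: a, and, cat, dog, frog
vectors = [[2, 1, 1, 1, 0], [1, 0, 0, 0, 1]]
idf = inverse_document_frequency(vectors)
print([round(x, 3) for x in idf])  # [0.0, 0.693, 0.693, 0.693, 0.693]
```

The word "a" occurs in both documents, so its idf is zero, while every word found in only one document gets the same positive weight.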
tf-idf • The tf-idf weighting scheme is to multiply the tf factor and idf factors for each word • Words are important for a document if they are frequent relative to other words in the document and rare in other documents
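Putting the two factors together for the example documents might look like the sketch below. It assumes tf is the frequency divided by the document's maximum frequency and idf is the natural log of N / n_i; the function name is illustrative.

```python
import math

def tf_idf_vectors(raw_vectors):
    """Weight each raw frequency vector by normalised tf times idf."""
    n_docs = len(raw_vectors)
    n_terms = len(raw_vectors[0])
    # idf per vocabulary position (assumed form: log(N / n_i))
    idf = []
    for i in range(n_terms):
        n_i = sum(1 for v in raw_vectors if v[i] > 0)
        idf.append(math.log(n_docs / n_i) if n_i else 0.0)
    weighted = []
    for v in raw_vectors:
        max_f = max(v) or 1  # avoid dividing by zero for an empty document
        weighted.append([(f / max_f) * idf[i] for i, f in enumerate(v)])
    return weighted

# Vocabulary order: a, and, cat, dog, frog
weights = tf_idf_vectors([[2, 1, 1, 1, 0], [1, 0, 0, 0, 1]])
print([round(w, 3) for w in weights[0]])  # [0.0, 0.347, 0.347, 0.347, 0.0]
print([round(w, 3) for w in weights[1]])  # [0.0, 0.0, 0.0, 0.0, 0.693]
```

After weighting, the shared word "a" contributes nothing, so the two documents no longer appear similar.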
Part 3 Vocabulary Spectral Analysis
Subject-clustering academic webs through text similarity 1 • Create a collection of virtual documents consisting of all web pages sharing a common domain name in a university. • Doc. 1 = cs.auckland.ac.nz 14,521 pgs • Doc. 2 = www.auckland.ac.nz 3,463 pgs • … • Doc. 760 = www.vuw.ac.nz 4,125 pgs
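One way to build such virtual documents is to group crawled pages by hostname and concatenate their text. The sketch below assumes the crawl is available as (URL, text) pairs; the page paths and texts shown are hypothetical, and the presentation does not say how the grouping was actually implemented.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def build_virtual_documents(pages):
    """Concatenate the text of all pages that share a hostname."""
    grouped = defaultdict(list)
    for url, text in pages:
        host = urlsplit(url).hostname
        if host:  # skip URLs without a recognisable hostname
            grouped[host].append(text)
    return {host: " ".join(texts) for host, texts in grouped.items()}

# Hypothetical crawl data for illustration only
pages = [
    ("http://www.vuw.ac.nz/home.html", "welcome to the university"),
    ("http://www.vuw.ac.nz/research/", "research groups"),
    ("http://www.auckland.ac.nz/", "study at auckland"),
]
virtual_documents = build_virtual_documents(pages)
print(sorted(virtual_documents))  # ['www.auckland.ac.nz', 'www.vuw.ac.nz']
```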
Subject-clustering academic webs through text similarity 2 • Convert each virtual document into a tf-idf word vector • Identify clusters using k-means and VSM cosine measures • Rank words for importance in each ‘natural’ cluster using the Cluster Membership Indicator (CMI) • Manually filter out high-ranking words in undesired clusters • This destroys the natural clustering of the data to uncover weaker subject clustering
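The clustering step can be sketched with a small k-means variant that assigns each document to the centroid with the highest cosine similarity. This is only an illustration: seeding with the first k documents, renormalising centroids to unit length, and the toy weight vectors are all assumptions, not details given in the presentation.

```python
import math

def normalise(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v] if length else list(v)

def cosine_kmeans(vectors, k, iterations=20):
    """Cluster vectors by cosine similarity; returns one label per vector."""
    unit = [normalise(v) for v in vectors]
    centroids = [list(u) for u in unit[:k]]  # assumption: seed with first k docs
    labels = [None] * len(unit)
    for _ in range(iterations):
        # For unit vectors the dot product equals the cosine similarity
        new_labels = [
            max(range(k), key=lambda c: sum(x * y for x, y in zip(u, centroids[c])))
            for u in unit
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [u for u, label in zip(unit, labels) if label == c]
            if members:
                mean = [sum(column) / len(members) for column in zip(*members)]
                centroids[c] = normalise(mean)
    return labels

# Toy tf-idf-like vectors: documents 0 and 2 share terms, as do 1 and 3
toy_vectors = [[1, 0.9, 0, 0], [0, 0, 1, 0.8], [0.9, 1, 0, 0.1], [0, 0.1, 0.9, 1]]
labels = cosine_kmeans(toy_vectors, k=2)
print(labels)  # [0, 1, 0, 1]
```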
Cluster Membership Indicator • Defined for a cluster C of documents and tf-idf weights wij (formula not preserved in this transcript) • The next slide shows the top CMI weights for an undesired non-subject cluster
Eliminating low frequency words • Can test whether removing low frequency words increases or decreases subject clustering tendency • E.g. are they spelling mistakes? • This test needs partially correct subject clusters • Compare similarity of documents within a cluster to similarity with documents outside the cluster
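A rough sketch of this test: measure the gap between mean within-cluster and mean between-cluster cosine similarity, then repeat after dropping words that occur in too few documents. The function names, the document-frequency threshold, and the toy vectors are assumptions made for illustration; the numbers are not results from the presentation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    lengths = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / lengths if lengths else 0.0

def drop_low_frequency_terms(vectors, min_docs):
    """Keep only vocabulary positions used by at least min_docs documents."""
    keep = [i for i in range(len(vectors[0]))
            if sum(1 for v in vectors if v[i] > 0) >= min_docs]
    return [[v[i] for i in keep] for v in vectors]

def clustering_tendency(vectors, labels):
    """Mean within-cluster similarity minus mean between-cluster similarity."""
    within, between = [], []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            sim = cosine(vectors[i], vectors[j])
            (within if labels[i] == labels[j] else between).append(sim)
    mean = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return mean(within) - mean(between)

# Toy data: four shared subject terms, then one rare term per document
toy_vectors = [
    [2, 1, 0, 0, 1, 0, 0, 0],
    [1, 2, 0, 0, 0, 1, 0, 0],
    [0, 0, 2, 1, 0, 0, 1, 0],
    [0, 0, 1, 2, 0, 0, 0, 1],
]
subject_labels = [0, 0, 1, 1]  # partially correct subject clusters are assumed

before = clustering_tendency(toy_vectors, subject_labels)
after = clustering_tendency(drop_low_frequency_terms(toy_vectors, 2), subject_labels)
print(round(before, 3), round(after, 3))  # 0.667 0.8
```

In this toy case the rare terms only dilute the within-cluster similarity, so removing them raises the clustering tendency; on real data the comparison could go either way, which is why it is worth testing.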
Summary • For text based academic subject web site clustering: • need to select vocabularies to break natural clustering and allow subject clustering • consider ignoring low frequency words because they do not have high clustering power • Need to automate the manual element as far as possible • The results can then form the basis of a visualization that can give feedback to the user on inter-subject connections