Applications of Data Classification and Data Clustering in Content Management and Personalized Information Services
Motivations • The quantity of digital information has been increasing at an astonishing rate. • As a result, typically only a small fraction of the many available documents is relevant to a given individual.
Emerging Information Retrieval Services • Intelligent search engines that conduct information retrieval based on users’ recent behavior, their profiles, and statistics of the document collection. • Document filters that automatically extract relevant documents from the collection.
An Application of Document Clustering • A user who submits the query “Eva Air” may be interested in the financial performance of the company or in its flight schedule. • If the user accessed the financial data of some other companies a short time ago, then the intelligent search engine may rank the web documents related to the financial performance of Eva Air at the top of the list.
On the other hand, if the user accessed web documents related to traveling a short time ago, then the search engine may rank web pages that provide the flight schedule of Eva Air at the top of the list. • The search engine can provide this intelligent service by giving a high ranking to those web documents that are highly similar to the documents that the user has just accessed.
An Application of Term Clustering • When a user submits “FED”, the search engine should expand the query with “Federal Reserve Board” and “Greenspan”. • The search engine can provide such an intelligent service with term clustering.
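The query expansion described above can be sketched as a simple cluster lookup. This is only an illustration: the cluster contents are taken from the “FED” example, and the names `TERM_CLUSTERS` and `expand_query` are hypothetical. In practice the clusters would be produced by a term clustering step.

```python
# Hypothetical term clusters; a real system would derive them from term clustering.
TERM_CLUSTERS = [
    {"fed", "federal reserve board", "greenspan"},
]

def expand_query(query, clusters=TERM_CLUSTERS):
    """Return the query term plus every term that shares a cluster with it."""
    term = query.strip().lower()
    expanded = {term}
    for cluster in clusters:
        if term in cluster:
            expanded |= cluster
    return sorted(expanded)

expanded_fed = expand_query("FED")
expanded_other = expand_query("airline")  # not in any cluster: returned unchanged
```

A term that belongs to no cluster is returned as-is, so the search engine can always fall back to the original query.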
An Application of Document Classification • Many content providers offer customized information services based on automatic document classification. • For example, the executives of a company may be particularly interested in documents that are related to a particular industry sector.
One way the content provider can offer such customized services is to employ a document classification utility.
Identifying Relevant Terms Based on Term Clustering • Given a term occurrence record (referred to as the D-T table), we can identify term clusters to be used in term suggestion. • There are two common forms of D-T tables. The first simply records term frequencies and the second employs the market-basket data model.
Issues in Creating a D-T Table • Removal of stop words and terms, e.g. “the”, “of”,…, “the same as”. • Identifying word stems. For example, “normal”, “normally”, “normalization”, “normalized”, and “normalizing” all have the same word stem.
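The two issues above can be illustrated with a minimal sketch that builds a term-frequency D-T table. It assumes whitespace tokenization, a tiny hand-written stop-word list, and a hypothetical stem lookup table (`STEMS`) standing in for a real stemming algorithm. The sample documents and all names are illustrative and do not come from the slides.

```python
from collections import Counter

# Illustrative assumptions: a tiny stop-word list and a hand-written stem map.
STOP_WORDS = {"the", "of", "a", "is", "and"}
STEMS = {
    "normally": "normal",
    "normalization": "normal",
    "normalized": "normal",
    "normalizing": "normal",
}

def tokenize(text):
    """Lowercase, split on whitespace, drop stop words, map words to stems."""
    words = (w.strip(".,;:!?\"'") for w in text.lower().split())
    return [STEMS.get(w, w) for w in words if w and w not in STOP_WORDS]

def build_dt_table(docs):
    """Return (terms, table); table[i][j] is the count of terms[j] in docs[i]."""
    counts = [Counter(tokenize(d)) for d in docs]
    terms = sorted(set().union(*counts))
    table = [[c[t] for t in terms] for c in counts]
    return terms, table

docs = [
    "The normalization of the table is normal.",
    "Normalized counts and normalizing weights.",
]
terms, table = build_dt_table(docs)
```

Because all variants of “normal” map to one stem, they share a single column of the table instead of five sparse ones.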
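Term Clustering Based on the D-T Table • Each term is represented by a feature vector in an m-dimensional vector space. • The cosine similarity is then commonly employed and data clustering is conducted accordingly. The cosine similarity between two term feature vectors $u$ and $v$ is defined as follows: $$\mathrm{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} = \frac{\sum_{k=1}^{m} u_k v_k}{\sqrt{\sum_{k=1}^{m} u_k^2}\,\sqrt{\sum_{k=1}^{m} v_k^2}}$$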
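The cosine similarity is employed instead of a distance-based similarity because two term feature vectors may have different lengths (magnitudes) even when they point in the same direction. • The similarity measurement ranges from -1 to 1; for vectors with non-negative entries, such as term counts, it lies between 0 and 1.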
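A short sketch of the cosine similarity may make the point about vector lengths concrete. The three vectors are made-up term feature vectors, and returning 0 for an all-zero vector is a convention chosen here, not something stated in the slides.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

t_a = [1, 2, 0, 3]
t_b = [2, 4, 0, 6]  # same direction as t_a, twice the magnitude
t_c = [0, 0, 5, 0]  # shares no non-zero position with t_a

sim_ab = cosine_similarity(t_a, t_b)
sim_ac = cosine_similarity(t_a, t_c)
```

Here `t_a` and `t_b` are far apart in Euclidean distance yet have a cosine similarity of 1 (up to floating-point rounding), while `t_a` and `t_c` have a similarity of 0.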
Normalization of the D-T Table • Since documents differ in length and terms differ in popularity, we may want to normalize the values in the D-T table.
One normalization formula maps each count cij to a normalized value vij, where cij is the count at entry (i, j) in the original D-T table and vij is the normalized value.
Creation of a D-T Table with the Market-Basket Data Model • A threshold can be imposed on a normalized D-T table to create a D-T table with the market-basket data model, as shown on the following page. • Once the market-basket D-T table is created, a categorical data clustering algorithm can be invoked to generate term clusters.
Market-Basket D-T Table • Threshold: 0.3
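The thresholding step can be sketched as follows. The normalized values are hypothetical, since the original table is not available, and treating a value equal to the threshold as present (`>=`) is an assumption.

```python
def to_market_basket(normalized, threshold=0.3):
    """Map each normalized value to 1 (term present) or 0 (term absent)."""
    return [[1 if v >= threshold else 0 for v in row] for row in normalized]

# Hypothetical normalized D-T table: rows are documents, columns are terms.
normalized = [
    [0.50, 0.10, 0.30],
    [0.00, 0.45, 0.20],
]
basket = to_market_basket(normalized, threshold=0.3)
```

Each row of `basket` can then be read as the set of terms that the document “contains”, which is the input a categorical clustering algorithm expects.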
Summary of Term Clustering • Term clusters can be used in query expansion. • Term clusters can also be used in document clustering and classification.
Basics of Document Classification • In document classification, each term or term cluster is regarded as a feature (attribute) of the documents. In this field, “feature” is a more commonly used term than “attribute”. • It is common that a D-T table with the market-basket model is created.
Normally, a feature selection process is conducted before classification is performed.
Importance of Feature Selection • Including features that are not correlated with the classification decision may make the problem more complicated. • For example, in the data set shown on the following page, including the feature corresponding to the y-axis causes an incorrect prediction for the marked test instance if a 3NN classifier is employed.
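• It is apparent that the “o”s and “x”s are separated by the line x = 10. If only the attribute corresponding to the x-axis were selected, then the 3NN classifier would predict the class of the marked test instance correctly. [Figure: scatter plot of the “o” and “x” instances in the x-y plane, with the boundary x = 10.]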
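The effect can be reproduced with a small sketch. The data points are hypothetical (the original figure is not available) and were chosen so that the classes are separated by x = 10 while the y values are uninformative. The helper `knn_predict` is an illustrative name.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3, features=(0, 1)):
    """Majority vote among the k nearest training points, using Euclidean
    distance computed on the selected feature indices only."""
    def dist(point):
        return math.sqrt(sum((point[f] - query[f]) ** 2 for f in features))
    nearest = sorted(train, key=lambda item: dist(item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical data: "o" instances have x < 10, "x" instances have x > 10.
train = [
    ((8, 0), "o"), ((7, 5), "o"), ((9, 100), "o"),
    ((12, 50), "x"), ((12.5, 52), "x"), ((13, 48), "x"),
]
query = (9, 50)  # lies on the "o" side of x = 10

with_both = knn_predict(train, query, features=(0, 1))  # uses x and y
x_only = knn_predict(train, query, features=(0,))       # uses x only
```

With both features, the three nearest neighbors are all “x” instances because the irrelevant y values dominate the distance. With the x feature alone, the three nearest neighbors are all “o” instances.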
Feature Selection in Document Classification • Stop words and terms are first removed. • Then, a feature selection mechanism is invoked to further remove terms that are not correlated with the document classes. • One way to carry out feature selection is to apply the chi-square test of independence to the training dataset.
Example of Feature Selection • For a D-T table with the market-basket model whose documents are divided into Class 1 and Class 2:
For term t1: $\chi^2 = 4.8$ • Therefore, t1 is selected.
For term t2: $\chi^2 = 8$ • Therefore, t2 is selected. • For term t3: • Therefore, t3 is not selected.
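The chi-square test of independence for one term can be sketched as below. The original table is not available, so every count here is an assumption. The counts for t1 and t2 were chosen to be consistent with the statistics 4.8 and 8 that the slides quote for these terms, and the counts for t3 are purely hypothetical. The cutoff 3.84 is the standard critical value for one degree of freedom at the 0.05 level, and the slides do not state which cutoff was used.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic of a 2x2 contingency table:
    a = class 1 documents containing the term, b = class 1 documents without it,
    c = class 2 documents containing the term, d = class 2 documents without it."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

CRITICAL_VALUE = 3.84  # assumed cutoff: 1 degree of freedom, 0.05 level

# Assumed counts (a, b, c, d) for 8 documents, 4 per class.
counts = {
    "t1": (4, 0, 1, 3),
    "t2": (0, 4, 4, 0),
    "t3": (2, 2, 2, 2),  # hypothetical: evenly spread over both classes
}
chi2 = {term: chi_square_2x2(*abcd) for term, abcd in counts.items()}
selected = [term for term, value in chi2.items() if value > CRITICAL_VALUE]
```

A term that occurs equally often in both classes, like the hypothetical t3, has a statistic of 0 and carries no information about the class.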
Feature Weighting • In addition to feature selection, feature weighting may be conducted. • One way to weight a feature is based on the chi-square statistic that is computed in the feature selection process. • For example, the weight for term t1 in the previous example can be either 4.8 or $\sqrt{4.8} \approx 2.19$. Meanwhile, the weight for term t2 can be either 8 or $\sqrt{8} \approx 2.83$.
Another popular approach is based on the “inverse document frequencies”, which will be addressed later.
Document Classification Based on the Cosine Similarity • The D-T table with appropriate feature weighting gives us a vector representation of each document. • One way to conduct document classification is to apply the KNN algorithm with the cosine similarity.
An Example • The weighted D-T table is as follows:
An Example • Given a new document Dt = <1 × 2.19, 1 × 2.83>: sim<D1,Dt> = sim<D2,Dt> = sim<D3,Dt> = sim<D4,Dt> = 0.61; sim<D5,Dt> = sim<D6,Dt> = sim<D7,Dt> = 0.79; sim<D8,Dt> = 1. Therefore, a 3NN classifier will predict that Dt is a class 2 document.
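The 3NN prediction can be checked with a short sketch. The weighted D-T table is not available, so the training vectors and class labels below are assumptions inferred from the reported similarities and weights: D1 to D4 contain only t1, D5 to D7 contain only t2, and D8 contains both.

```python
import math
from collections import Counter

W1, W2 = 2.19, 2.83  # weights of t1 and t2 (square roots of 4.8 and 8)

# Assumed weighted table: name -> (feature vector <t1, t2>, class label).
train = {
    "D1": ([W1, 0.0], 1), "D2": ([W1, 0.0], 1),
    "D3": ([W1, 0.0], 1), "D4": ([W1, 0.0], 1),
    "D5": ([0.0, W2], 2), "D6": ([0.0, W2], 2), "D7": ([0.0, W2], 2),
    "D8": ([W1, W2], 2),
}

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def knn_cosine(train, query, k=3):
    """Majority class among the k training documents most similar to the query."""
    ranked = sorted(train.values(), key=lambda item: cosine(item[0], query), reverse=True)
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]

d_t = [1 * W1, 1 * W2]
similarities = {name: round(cosine(vec, d_t), 2) for name, (vec, _) in train.items()}
predicted = knn_cosine(train, d_t)
```

The three most similar documents are D8 and two of D5 to D7, all assumed to be in class 2, which agrees with the prediction stated above.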
Document Clustering • In document clustering, we do not have a training set in which each document is labeled with a class. Therefore, the feature weighting mechanism that we have just discussed is not applicable. • A popular approach for weighting each term in document clustering is based on the “inverse document frequencies”.
For example, in the following D-T table • idf(t1)=0.48 • idf(t2)=idf(t3)=idf(t5)=0.30 • idf(t4)=0.18
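The idf values above can be reproduced with the sketch below. Two things are assumptions: the binary D-T table, which is inferred from the non-zero positions of the feature vectors D1 to D6 given in the slides, and the formula idf(t) = log10(N / df(t)), which is inferred because it yields the reported values when rounded to two decimals.

```python
import math

# Assumed binary D-T table (rows D1..D6, columns t1..t5).
dt_table = [
    [1, 0, 0, 1, 0],  # D1
    [0, 0, 1, 0, 1],  # D2
    [0, 1, 0, 1, 1],  # D3
    [1, 1, 0, 1, 0],  # D4
    [0, 1, 1, 0, 1],  # D5
    [0, 0, 1, 1, 0],  # D6
]

def inverse_document_frequencies(table):
    """idf(t) = log10(N / df(t)); df(t) is the number of documents containing t."""
    n_docs = len(table)
    n_terms = len(table[0])
    df = [sum(1 for row in table if row[j] > 0) for j in range(n_terms)]
    return [math.log10(n_docs / d) if d else 0.0 for d in df]

idf = inverse_document_frequencies(dt_table)
idf_rounded = [round(value, 2) for value in idf]
# Weight each entry by the idf of its term to obtain the document feature vectors.
vectors = [[round(x * w, 2) for x, w in zip(row, idf)] for row in dt_table]
```

Rare terms such as t1 (2 of 6 documents) receive a larger weight than common terms such as t4 (4 of 6 documents).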
Therefore, we have the following feature vectors: D1 = <0.48, 0, 0, 0.18, 0>; D2 = <0, 0, 0.30, 0, 0.30>; D3 = <0, 0.30, 0, 0.18, 0.30>; D4 = <0.48, 0.30, 0, 0.18, 0>; D5 = <0, 0.30, 0.30, 0, 0.30>; D6 = <0, 0, 0.30, 0.18, 0>; and the following similarity measurements.
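• If we employ the complete-link algorithm, then we obtain a dendrogram with leaf order D1, D4, D2, D5, D6, D3: D1 and D4 merge at similarity 0.862, D2 and D5 merge at 0.816, D6 joins {D2, D5} at 0.494, D3 joins {D2, D5, D6} at 0.201, and the two remaining clusters merge at 0. • If those similarity measurements that are less than 0.30 are excluded, then we have 3 document clusters: {D1, D4}, {D2, D5, D6}, {D3}.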
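The clustering result can be reproduced with a minimal complete-link sketch over the six feature vectors given in the slides. The function names are illustrative, and this naive implementation recomputes cluster similarities at every step, which is fine for six documents but not for large collections. Because the vectors are rounded to two decimals, the computed merge levels can differ from the reported ones in the third decimal.

```python
import math
from itertools import combinations

docs = {
    "D1": [0.48, 0, 0, 0.18, 0],
    "D2": [0, 0, 0.30, 0, 0.30],
    "D3": [0, 0.30, 0, 0.18, 0.30],
    "D4": [0.48, 0.30, 0, 0.18, 0],
    "D5": [0, 0.30, 0.30, 0, 0.30],
    "D6": [0, 0, 0.30, 0.18, 0],
}

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def complete_link(docs, threshold):
    """Agglomerative clustering in which the similarity of two clusters is the
    minimum pairwise similarity of their members (complete link). Merging stops
    once the best available merge falls below the threshold."""
    clusters = [[name] for name in docs]
    merge_levels = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            level = min(cosine(docs[x], docs[y]) for x in clusters[a] for y in clusters[b])
            if best is None or level > best[0]:
                best = (level, a, b)
        level, a, b = best
        if level < threshold:
            break
        merge_levels.append(level)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters, merge_levels

clusters, merge_levels = complete_link(docs, threshold=0.30)
```

With the threshold set to 0.30, merging stops after three merges and leaves the clusters {D1, D4}, {D2, D5, D6}, and {D3}.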