
Ch 4: Information Retrieval and Text Mining



Presentation Transcript


  1. Ch 4: Information Retrieval and Text Mining Hakam Alomari

  2. 4.1: Is Information Retrieval a Form of Text Mining? • What is the principal computer specialty for processing documents and text? • Information Retrieval (IR) • The task of IR is to retrieve relevant documents in response to a query • The fundamental technique of IR is measuring similarity • A query is examined and transformed into a vector of values to be compared with stored documents

  3. Cont. 4.1 • In the prediction problem, similar documents are retrieved and their properties measured, e.g. counting the class labels to see which label should be assigned to a new document • The objectives of prediction can be posed in the form of an IR model where relevant documents are retrieved in response to a query; here the query is the new document

  4. Cont. 4.1 • Figure 4.1. Key Steps in Information Retrieval: Specify Query → Search Document Collection → Return Subset of Relevant Documents • Figure 4.2. Key Steps in Predictive Text Mining: Examine Document Collection → Learn Classification Criteria → Apply Criteria to New Documents

  5. Cont. 4.1 • Figure 4.2. Key Steps in IR: Specify Query Vector → Match Document Collection → Get Subset of Relevant Documents • Figure 4.3. Predicting from Retrieved Documents: additionally Examine Document Properties, using simple criteria such as the documents' labels

  6. 4.2 Key Word Search • The technical goal of prediction is to classify new, unseen documents • Prediction and IR are unified by the computation of document similarity • IR is based on traditional keyword search through a search engine • So we should recognize that using a search engine is a special instance of the prediction concept

  7. We enter keywords into a search engine and expect relevant documents to be returned • These keywords are words in a dictionary created from the document collection and can be viewed as a small document • So, we want to measure how similar the new document (the query) is to the documents in the collection

  8. So, the notion of similarity is reduced to finding documents with the same keywords as those posed to the search engine • But the objective of the search engine is to rank the documents, not to assign a label • So we need additional techniques to break the expected ties (all retrieved documents match the search criteria)

  9. 4.3 Nearest-Neighbor Methods • A method that compares vectors and measures similarity • In prediction: the nearest-neighbor method collects the k most similar documents and then looks at their labels • In IR: it determines whether a satisfactory response to the search query has been found
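The prediction side of this method can be sketched as a simple majority vote over the k most similar documents. This is a minimal illustration, not the chapter's code; the function name and the use of a precomputed similarity list are assumptions.

```python
from collections import Counter

def knn_label(similarities, labels, k=3):
    """Predict a label for a new document: take the k stored documents
    most similar to it and return the majority label among them.
    similarities[i] is the similarity of the new document to stored
    document i; labels[i] is that document's class label."""
    # Rank stored documents by similarity, highest first
    ranked = sorted(range(len(similarities)),
                    key=lambda i: similarities[i], reverse=True)
    top_k = [labels[i] for i in ranked[:k]]
    # Majority vote over the k nearest neighbors
    return Counter(top_k).most_common(1)[0][0]
```

Any of the similarity measures from Section 4.4 could supply the `similarities` list.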

  10. 4.4 Measuring Similarity • These measures are used to examine how similar documents are; the output is a numerical measure of similarity • Three increasingly complex measures: • Shared Word Count • Word Count and Bonus • Cosine Similarity

  11. 4.4.1 Shared Word Count • Counts the words shared between documents • The words: • In IR we have a global dictionary where all potential words are included, with the exception of stopwords • In prediction it is better to preselect the dictionary relative to the label

  12. Computing similarity by shared words • Look at all words in the new document • For each document in the collection, count how many of these words appear • No weighting is used, just a simple count • The dictionary contains true keywords (weakly predictive words removed) • The results of this measure are clearly intuitive • No one will question why a document was retrieved

  13. Computing similarity by shared words • Each document is represented as a vector of keywords (zeros and ones) • The similarity of two documents is the product of the two vectors • If two documents share a keyword, this word is counted (1*1) • The performance of this measure depends mainly on the dictionary used
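The binary-vector product described above can be sketched as follows; the function and variable names are illustrative assumptions.

```python
def shared_word_count(doc1_words, doc2_words, dictionary):
    """Similarity = dot product of the two binary keyword vectors,
    i.e. the number of dictionary words the documents share."""
    v1 = [1 if w in doc1_words else 0 for w in dictionary]
    v2 = [1 if w in doc2_words else 0 for w in dictionary]
    return sum(a * b for a, b in zip(v1, v2))
```

Only words in the dictionary contribute, so the choice of dictionary directly controls the measure, as the slide notes.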

  14. Computing similarity by shared words • Shared words is an exact search • either retrieving or not retrieving a document • No weighting can be applied to terms • in a query with terms A and B, you can't specify that A is more important than B • Every retrieved document is treated equally

  15. 4.4.2 Word Count and Bonus 1/4 • TF – term frequency • the number of times a term occurs in a document • DF – document frequency • the number of documents that contain the term • IDF – inverse document frequency • idf = log(N/df), where N is the total number of documents • A vector is a numerical representation of a point in a multi-dimensional space • (x1, x2, ..., xn) • The dimensions of the space need to be defined • A measure on the space needs to be defined

  16. 4.4.2 Word Count and Bonus 2/4 • Each indexing term is a dimension • Each document is a vector • Di = (ti1, ti2, ti3, ti4, ..., tik) • Document similarity is defined as Sim(D1, D2) = Σ w(j) for j = 1..K, where K is the number of words, w(j) = 1 + 1/df(j) if word j occurs in both documents, and w(j) = 0 otherwise

  17. 4.4.2 Word Count and Bonus 3/4 • The bonus 1/df(j) is a variant of idf; thus, if the word occurs in many documents, the bonus is small • This measure is better than the shared word count, because it discriminates between weakly and strongly predictive words
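A minimal sketch of this measure, assuming w(j) = 1 + 1/df(j) for each shared word as defined above; names and the `df` mapping are illustrative.

```python
def bonus_similarity(new_doc_words, stored_doc_words, df):
    """Sum of w(j) over all words j shared by the two documents,
    where w(j) = 1 + 1/df(j) and df maps each word to its
    document frequency in the collection."""
    shared = set(new_doc_words) & set(stored_doc_words)
    return sum(1 + 1 / df[w] for w in shared)
```

A word appearing in only a few documents gets a bonus close to 1, while a word appearing in nearly every document adds barely more than the plain shared-word count would.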

  18. 4.4.2 Word Count and Bonus 4/4 • A document space is defined by five terms: hardware, software, user, information, index • The query (new document vector) is "hardware, user, information" • Figure 4.4. Computing Similarity Scores with Bonus (labeled spreadsheet of similarity scores measured with the bonus)

  19. 4.4.3 Cosine Similarity: The Vector Space • A document is represented as a vector: • (w1, w2, ..., wn) • Binary: • wi = 1 if the corresponding term is in the document • wi = 0 if the term is not in the document • TF (term frequency): • wi = tfi, where tfi is the number of times the term occurs in the document • TF*IDF (with inverse document frequency): • wi = tfi * idfi = tfi * (1 + log(N/dfi)), where dfi is the number of documents containing term i, and N is the total number of documents in the collection

  20. 4.4.3 Cosine Similarity: The Vector Space • vec(D) = (w1, w2, ..., wt) • Sim(d1, d2) = cos(θ) = [vec(d1) · vec(d2)] / (|d1| * |d2|) = [Σ wd1(j) * wd2(j)] / (|d1| * |d2|) • w(j) > 0 whenever j ∈ di • So, 0 <= Sim(d1, d2) <= 1 • A document is retrieved even if it matches the query terms only partially
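The cosine formula above can be sketched directly; this is an illustrative implementation assuming the documents arrive as lists of term weights.

```python
import math

def cosine_similarity(d1, d2):
    """cos(theta) between two term-weight vectors:
    dot(d1, d2) / (|d1| * |d2|).  Since all weights are >= 0,
    the result lies in [0, 1]."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty vector shares nothing with any document
    return dot / (norm1 * norm2)
```

Because the measure is normalized by vector length, a long document does not automatically outscore a short one, unlike the raw shared-word count.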

  21. 4.4.3 Cosine Similarity • How do we compute the weight wj? • A good weight must take into account two effects: • quantification of intra-document content (similarity) • the tf factor, the term frequency within a document • quantification of inter-document separation (dissimilarity) • the idf factor, the inverse document frequency • wj = tf(j) * idf(j)

  22. 4.4.3 Cosine Similarity • TF shows how important the term is in the given document (makes words frequent in the document more important) • IDF makes words that are rare across all documents more important • A high weight in a tf-idf ranking scheme is therefore reached by a high term frequency in the given document and a low term frequency in all other documents • Term weights in a document affect the position of the document vector • di = (wi,1, wi,2, ..., wi,t)

  23. 4.4.3 Cosine Similarity • TF-IDF definitions: • fik: number of occurrences of term ti in document Dk • tfik: fik / max(fik), the normalized term frequency • dfi: number of documents which contain term ti • idfi: log(N / dfi), where N is the total number of documents • wik: tfik * idfi, the term weight • Intuition: rare words get more weight, common words less weight

  24. Example TF-IDF • Given a document containing terms with frequencies Kent = 3, Ohio = 2, University = 1, and assuming a collection of 10,000 documents with document frequencies Kent = 50, Ohio = 1300, University = 250, then: • Kent: tf = 3/3; idf = log(10000/50) = 5.3; tf-idf = 5.3 • Ohio: tf = 2/3; idf = log(10000/1300) = 2.0; tf-idf = 1.3 • University: tf = 1/3; idf = log(10000/250) = 3.7; tf-idf = 1.2
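The arithmetic above can be reproduced with a short sketch. Note the example's idf values come from the natural log, and the slide rounds idf to one decimal before multiplying; this illustrative function does the same so the results match.

```python
import math

def tf_idf(freq, max_freq, df, n_docs):
    """Normalized term frequency times inverse document frequency.
    idf uses the natural log and is rounded to one decimal, matching
    the example's intermediate values (5.3, 2.0, 3.7)."""
    tf = freq / max_freq
    idf = round(math.log(n_docs / df), 1)
    return tf * idf

# N = 10,000 documents, as in the example
print(round(tf_idf(3, 3, 50, 10000), 1))    # Kent: 5.3
print(round(tf_idf(2, 3, 1300, 10000), 1))  # Ohio: 1.3
print(round(tf_idf(1, 3, 250, 10000), 1))   # University: 1.2
```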

  25. 4.4.3 Cosine Similarity • Cosine weights: • w(j) = tf(j) * idf(j) • idf(j) = log(N / df(j))
