
Automatic Indexing (Term Selection)



Presentation Transcript


  1. Automatic Indexing (Term Selection) Automatic Text Processing by G. Salton, Chap 9, Addison-Wesley, 1989.

  2. Automatic Indexing • Indexing: assign identifiers (index terms) to text documents. • Identifiers: • single-term vs. term phrase • controlled vs. uncontrolled vocabularies (instruction manuals, terminological schedules, …) • objective vs. nonobjective text identifiers (cataloging rules define, e.g., author names, publisher names, dates of publication, …)

  3. Two Issues • Issue 1: indexing exhaustivity • exhaustive: assign a large number of terms • nonexhaustive: assign only a few terms • Issue 2: term specificity • broad terms (generic): cannot distinguish relevant from nonrelevant documents • narrow terms (specific): retrieve relatively fewer documents, but most of them are relevant

  4. Term-Frequency Consideration • Function words • for example, "and", "or", "of", "but", … • the frequencies of these words are high in all texts • Content words • words that actually relate to document content • varying frequencies in the different texts of a collection • frequency indicates term importance for content

  5. A Frequency-Based Indexing Method • Eliminate common function words from the document texts by consulting a special dictionary, or stop list, containing a list of high-frequency function words. • Compute the term frequency $tf_{ij}$ for all remaining terms $T_j$ in each document $D_i$, specifying the number of occurrences of $T_j$ in $D_i$. • Choose a threshold frequency $T$, and assign to each document $D_i$ all terms $T_j$ for which $tf_{ij} > T$.
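The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the method as implemented in the book: the stop list `STOP_LIST`, the whitespace tokenizer, and the function name `index_terms` are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical stop list; a real system would consult a much larger dictionary.
STOP_LIST = {"and", "or", "of", "but", "the", "a", "in", "to", "is"}

def index_terms(text, threshold):
    """Return {term: tf} for the terms whose frequency in `text` exceeds `threshold`."""
    # Step 1: drop function words (naive lowercase + whitespace tokenization).
    tokens = [w for w in text.lower().split() if w not in STOP_LIST]
    # Step 2: term frequency tf_ij for every remaining term.
    tf = Counter(tokens)
    # Step 3: keep only terms with tf_ij > T.
    return {term: n for term, n in tf.items() if n > threshold}

doc = "the index of terms and the index of documents in the index"
print(index_terms(doc, threshold=1))  # {'index': 3}
```

Raising the threshold makes the indexing less exhaustive; lowering it to 0 keeps every content word.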

  6. How to compute $w_{ij}$? • Inverse document frequency, $idf_j$: $tf_{ij} \times idf_j$ (TF×IDF) • Term discrimination value, $dv_j$: $tf_{ij} \times dv_j$ • Probabilistic term weighting, $tr_j$: $tf_{ij} \times tr_j$ • All three factors are global properties of terms in a document collection

  7. Inverse Document Frequency • Inverse document frequency (IDF) for term $T_j$: $idf_j = \log(N / df_j)$, where $N$ is the number of documents in the collection and $df_j$ (the document frequency of term $T_j$) is the number of documents in which $T_j$ occurs. • Good index terms fulfil both recall and precision requirements: they occur frequently in individual documents but rarely in the remainder of the collection.

  8. TFxIDF • Weight $w_{ij}$ of a term $T_j$ in a document $D_i$: $w_{ij} = tf_{ij} \times \log(N / df_j)$ • Procedure: • Eliminate common function words • Compute the value of $w_{ij}$ for each term $T_j$ in each document $D_i$ • Assign to the documents of a collection all terms with sufficiently high (tf × idf) factors
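A small sketch of the TF×IDF weighting, assuming $idf_j = \log(N / df_j)$ with the natural logarithm and documents that have already been tokenized and stripped of stop words. The function name `tfidf_weights` and the toy collection are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists (stop words already removed).
    Returns one dict per document mapping term -> tf_ij * log(N / df_j)."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term at most once per document
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [["salton", "index", "index"], ["index", "term"], ["index", "weight"]]
w = tfidf_weights(docs)
print(w[0])
```

In this toy collection "index" occurs in every document, so its weight is 0 even where its term frequency is high, while "salton", which occurs in a single document, receives the weight log(3).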

  9. Term-discrimination Value • Useful index terms • distinguish the documents of a collection from each other • Document space • When two documents are assigned very similar term sets, the corresponding points in the document configuration appear close together • When a high-frequency term without discriminating power is assigned, it increases the document space density

  10. A Virtual Document Space • [Figure, three panels: original state; after assignment of a good discriminator; after assignment of a poor discriminator]

  11. Good Term Assignment • When a term is assigned to the documents of a collection, the few objects to which the term is assigned will be distinguished from the rest of the collection. • This should increase the average distance between the objects in the collection and hence produce a document space less dense than before.

  12. Poor Term Assignment • A high-frequency term is assigned that does not discriminate between the objects of a collection. Its assignment will render the documents more similar. • This is reflected in an increase in document space density.

  13. Term Discrimination Value • Definition: $dv_j = Q - Q_j$, where $Q$ and $Q_j$ are the space densities before and after the assignment of term $T_j$. • $dv_j > 0$: $T_j$ is a good term; $dv_j < 0$: $T_j$ is a poor term.

  14. Variations of Term-Discrimination Value with Document Frequency • [Figure: document-frequency axis running up to $N$] • Low frequency: $dv_j = 0$ • Medium frequency: $dv_j > 0$ • High frequency: $dv_j < 0$

  15. $tf_{ij} \times dv_j$ • $w_{ij} = tf_{ij} \times dv_j$ • compared with $w_{ij} = tf_{ij} \times idf_j$: • $idf_j$: decreases steadily with increasing document frequency • $dv_j$: increases from zero to positive as the document frequency of the term increases, then decreases sharply as the document frequency becomes still larger.

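  16. Document Centroid • Issue: efficiency problem; computing the space density directly requires $N(N-1)$ pairwise similarities • Document centroid: $C = (c_1, c_2, c_3, \ldots, c_t)$, where $c_j$ aggregates over all $N$ documents the weights $w_{ij}$ of the $j$-th term in document $i$ (typically the average, $c_j = \frac{1}{N}\sum_{i=1}^{N} w_{ij}$) • Space density: measured against the centroid instead, typically as the average similarity $Q = \frac{1}{N}\sum_{i=1}^{N} sim(C, D_i)$, which needs only $N$ similarity computations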
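The centroid shortcut and the discrimination value $dv_j = Q - Q_j$ can be illustrated together. The sketch below makes two assumptions that the slides do not fix: similarity is the cosine measure, and the density is the average similarity of each document to the average-weight centroid. The names `cosine`, `space_density`, and `discrimination_value` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def space_density(docs):
    """Average similarity between each document vector and the centroid."""
    n = len(docs)
    centroid = [sum(col) / n for col in zip(*docs)]  # c_j = average of w_ij
    return sum(cosine(centroid, d) for d in docs) / n

def discrimination_value(docs, j):
    """dv_j = Q - Q_j: density before (without) minus after (with) term j."""
    without_j = [[w for k, w in enumerate(d) if k != j] for d in docs]
    return space_density(without_j) - space_density(docs)

# Rows are documents, columns are term weights w_ij.
# Term 0 occurs in every document; term 1 occurs only in the first one.
docs = [[1, 1, 0, 0],
        [1, 0, 1, 0],
        [1, 0, 0, 1]]
print(discrimination_value(docs, 0))  # negative: poor discriminator
print(discrimination_value(docs, 1))  # positive: good discriminator
```

Each call computes N similarities against the centroid rather than N(N-1) pairwise similarities, which is the efficiency argument made on the slide.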

  17. Probabilistic Term Weighting • Goal: explicit distinctions between occurrences of terms in relevant and nonrelevant documents of a collection • Definition: given a user query $q$ and the ideal answer set of the relevant documents • From decision theory, the best ranking algorithm for a document $D$: $g(D) = \log\frac{\Pr(D \mid rel)}{\Pr(D \mid nonrel)} + \log\frac{\Pr(rel)}{\Pr(nonrel)}$

  18. Probabilistic Term Weighting • $\Pr(rel)$, $\Pr(nonrel)$: the document’s a priori probabilities of relevance and nonrelevance • $\Pr(D \mid rel)$, $\Pr(D \mid nonrel)$: occurrence probabilities of document $D$ in the relevant and nonrelevant document sets

  19. Assumptions • Terms occur independently in documents

  20. Derivation Process

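  21. Given a document $D = (d_1, d_2, \ldots, d_t)$ • Assume each $d_i$ is either 0 (term absent) or 1 (term present). • Writing $x_i$ for the value of $d_i$ in a specific document $D$: • $\Pr(x_i = 1 \mid rel) = p_i$ • $\Pr(x_i = 0 \mid rel) = 1 - p_i$ • $\Pr(x_i = 1 \mid nonrel) = q_i$ • $\Pr(x_i = 0 \mid nonrel) = 1 - q_i$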

  22. Term Relevance Weight
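The steps leading to the term relevance weight can be sketched as follows. This is the standard binary independence argument, written with the symbols used on the neighboring slides ($p_i$, $q_i$, $x_i$, $tr_j$); the intermediate layout is an assumption, not a transcription of the original slides.

```latex
\begin{align*}
g(D) &= \log\frac{\Pr(D \mid rel)}{\Pr(D \mid nonrel)}
        + \log\frac{\Pr(rel)}{\Pr(nonrel)} \\[4pt]
% Independence assumption: each term contributes one factor.
\Pr(D \mid rel) &= \prod_{i=1}^{t} p_i^{x_i}(1-p_i)^{1-x_i}, \qquad
\Pr(D \mid nonrel) = \prod_{i=1}^{t} q_i^{x_i}(1-q_i)^{1-x_i} \\[4pt]
% Substitute and separate the part that depends on x_i.
g(D) &= \sum_{i=1}^{t} x_i \log\frac{p_i(1-q_i)}{q_i(1-p_i)}
        + \sum_{i=1}^{t} \log\frac{1-p_i}{1-q_i}
        + \log\frac{\Pr(rel)}{\Pr(nonrel)} \\[4pt]
% Only the first sum varies from document to document.
tr_j &= \log\frac{p_j(1-q_j)}{q_j(1-p_j)}
\end{align*}
```

The last two terms of $g(D)$ are the same for every document, so ranking depends only on the sum of $tr_j$ over the terms present in $D$.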

  23. Issue • How to compute $p_j$ and $q_j$? • $p_j = r_j / R$ • $q_j = (df_j - r_j) / (N - R)$ • $r_j$: the number of relevant documents in which $T_j$ occurs • $R$: the total number of relevant documents • $N$: the total number of documents

  24. Estimation of Term-Relevance • The occurrence probability $q_j$ of a term in the nonrelevant documents is approximated by the occurrence probability of the term in the entire document collection: $q_j = df_j / N$ • The occurrence probabilities of the terms in the small number of relevant documents are assumed to be equal, using a constant value $p_j = 0.5$ for all $j$.

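  25. Comparison • With $p_j = 0.5$ and $q_j = df_j / N$: $tr_j = \log\frac{p_j(1-q_j)}{q_j(1-p_j)} = \log\frac{N - df_j}{df_j} \approx \log\frac{N}{df_j} = idf_j$ • The approximation holds when $N$ is sufficiently large, so that $N - df_j \approx N$.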
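A short numeric check of this comparison, using the two ways of estimating $p_j$ and $q_j$ from the preceding slides. The function names and the sample counts are hypothetical, and natural logarithms are assumed; the estimates are undefined when a probability is exactly 0 or 1, which this sketch does not guard against.

```python
import math

def term_relevance(p, q):
    """tr_j = log[p(1 - q) / (q(1 - p))]; requires 0 < p < 1 and 0 < q < 1."""
    return math.log(p * (1 - q) / (q * (1 - p)))

def term_relevance_from_counts(r_j, R, df_j, N):
    """With relevance information: p_j = r_j / R, q_j = (df_j - r_j) / (N - R)."""
    return term_relevance(r_j / R, (df_j - r_j) / (N - R))

def term_relevance_no_feedback(df_j, N):
    """Without relevance information: p_j = 0.5, q_j = df_j / N."""
    return term_relevance(0.5, df_j / N)

N, df_j = 1_000_000, 50  # made-up collection size and document frequency
tr = term_relevance_no_feedback(df_j, N)
idf = math.log(N / df_j)
print(tr, idf)  # the two values agree to about four decimal places
```

The gap between the two values is $\log\frac{N}{N - df_j}$, which shrinks as the term becomes rarer relative to the collection size.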
