
Special Topics in Computer Science Advanced Topics in Information Retrieval Chapter 2: Modeling


Presentation Transcript


  1. Special Topics in Computer Science. Advanced Topics in Information Retrieval. Chapter 2: Modeling. Alexander Gelbukh, www.Gelbukh.com

  2. Previous chapter • User Information Need • Vague • Semantic, not formal • Document Relevance • Order, not retrieve • Huge amount of information • Efficiency concerns • Tradeoffs • Art more than science

  3. Modeling • Still science: computation is formal • No good methods to work with (vague) semantics • Thus, simplify to get a (formal) model • Develop (precise) math over this (simple) model • Why math if the model is not precise (simplified)? With math: phenomenon ≈ model = step 1 = step 2 = ... = result, so only the modeling step loses precision. Without it: phenomenon ≈ model ≈ step 1 ≈ step 2 ≈ ... ≈ ?!, and the error grows with every step.

  4. Modeling • Substitute a complex real phenomenon with a simple model, which you can measure and manipulate formally • Keep only important properties (for this application) • Do this with text

  5. Modeling in IR: idea • Tag documents with fields • As in a (relational) DB: customer = {name, age, address} • Unlike DB, very many fields: individual words! • E.g., bag of words: {word1, word2, ...}: {3, 5, 0, 0, 2, ...} • Define a similarity measure between query and such a record • (Unlike DB) Rank (order), not retrieve (yes/no) • Justify your model (optional, but nice) • Develop math and algorithms for fast access • as relational algebra in DB
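
A minimal sketch of the bag-of-words idea from this slide, in Python; the vocabulary and sample text below are made up for illustration:

```python
from collections import Counter

def bag_of_words(text: str, vocabulary: list[str]) -> list[int]:
    """Represent a document as a vector of term counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]

# Hypothetical vocabulary; the result is a record like {3, 5, 0, 0, 2, ...}
vocab = ["retrieval", "model", "query", "index", "rank"]
print(bag_of_words("a model of retrieval where a model ranks each query", vocab))
# -> [1, 2, 1, 0, 0]
```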

  6. Taxonomy of IR systems

  7. Aspects of an IR system • IR model • Boolean, Vector, Probabilistic • Logical view of documents • Full text, bag of words, ... • User task • retrieval, browsing • These aspects are independent, though some combinations are more compatible than others

  8. Appropriate models

  9. Characterization of an IR model • D = {d_j}, collection of formal representations of docs • e.g., keyword vectors • Q = {q_i}, possible formal representations of user information need (queries) • F, framework for modeling these two: reason for the next • R(q_i, d_j): Q × D → ℝ, ranking function • defines ordering
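
To make the quadruple concrete, a minimal sketch, assuming keyword-weight vectors as the formal representations (all names here are hypothetical, not part of the model):

```python
from typing import Callable, Sequence

Doc = Sequence[float]                      # a formal representation d_j in D
Query = Sequence[float]                    # a formal representation q_i in Q
RankingFn = Callable[[Query, Doc], float]  # R(q_i, d_j): Q x D -> real numbers

def rank_collection(docs: Sequence[Doc], q: Query, R: RankingFn) -> list[int]:
    """Order document indices by decreasing R(q, d_j) -- the ordering the model defines."""
    return sorted(range(len(docs)), key=lambda j: R(q, docs[j]), reverse=True)
```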

  10. Specific IR models

  11. IR models • Classical • Boolean • Vector • Probabilistic (clear ideas, but some disadvantages) • Refined • Each one with refinements • Solve many of the problems of the “basic” models • Give good examples of possible developments in the area • Not investigated well • We can work on this

  12. Basic notions • Document: set of index terms • Mainly nouns • Maybe all words, then full text logical view • Term weights • some terms are better than others • terms less frequent in this doc and more frequent in other docs are less useful • Document → index-term vector {w_1j, w_2j, ..., w_tj} • weights of terms in the doc • t is the number of terms in all docs • weights of different terms are independent (simplification)

  13. Boolean model • Weights ∈ {0, 1} • Doc: set of words • Query: Boolean expression • R(q_i, d_j) ∈ {0, 1} • Good: • clear semantics, neat formalism, simple • Bad: • no ranking (→ data retrieval), retrieves too many or too few • difficult to translate User Information Need into query • No term weighting
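
A minimal sketch of Boolean retrieval, assuming documents are plain word sets and queries are nested AND/OR/NOT tuples (this query representation is an assumption for illustration, not from the slides):

```python
def boolean_match(doc_terms: set[str], query) -> bool:
    """Evaluate a query tree such as ("and", "mouse", ("not", "cat"))."""
    if isinstance(query, str):                  # a bare term
        return query in doc_terms
    op, *args = query
    if op == "and":
        return all(boolean_match(doc_terms, a) for a in args)
    if op == "or":
        return any(boolean_match(doc_terms, a) for a in args)
    if op == "not":
        return not boolean_match(doc_terms, args[0])
    raise ValueError(f"unknown operator: {op}")

# R(q, d) is 0 or 1: the document is retrieved or not, with no ranking.
print(boolean_match({"mouse", "keyboard"}, ("and", "mouse", ("not", "cat"))))  # True
```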

  14. Vector model • Weights (non-binary) • Ranking, much better results (for User Info Need) • R(q_i, d_j) = correlation between query vector and doc vector • E.g., cosine measure (there is a typo in the book)
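
The formula itself is on a slide image; here is the usual cosine measure as a sketch (the standard form, not a transcription of the slide): the similarity is the cosine of the angle between the two weight vectors.

```python
import math

def cosine(d: list[float], q: list[float]) -> float:
    """sim(q, d) = (d . q) / (|d| * |q|), the cosine of the angle between the vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

print(cosine([1.0, 2.0, 0.0], [1.0, 0.0, 1.0]))  # a partial match still gets a nonzero rank
```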

  15. Projection

  16. Weights • How are the weights w_ij obtained? Many variants. One way: TF-IDF balance • TF: Term frequency • How well is the term related to the doc? • If it appears many times, it is important • Proportional to the number of times it appears • IDF: Inverse document frequency • How important is the term to distinguish documents? • If it appears in many docs, it is not important • Inversely proportional to the number of docs where it appears • Contradictory. How to balance?

  17. TF-IDF ranking • TF: Term frequency • IDF: Inverse document frequency • Balance: TF × IDF • Other formulas exist. Art.
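
One common TF-IDF variant (normalized term frequency times the log of the inverse document frequency; other formulas exist, as the slide says), sketched in Python:

```python
import math
from collections import Counter

def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """w_ij = (f_ij / max_l f_lj) * log(N / n_i): frequent in this doc, rare overall."""
    N = len(docs)
    n = Counter()                      # n_i: number of docs containing term i
    for d in docs:
        n.update(set(d))
    weights = []
    for d in docs:
        f = Counter(d)
        f_max = max(f.values())
        weights.append({t: (c / f_max) * math.log(N / n[t]) for t, c in f.items()})
    return weights

docs = [["cat", "cat", "mouse"], ["mouse", "dog"], ["dog", "dog", "bird"]]
print(tf_idf(docs)[0])   # "cat" gets a high weight: frequent here, absent elsewhere
```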

  18. Advantages of vector model • One of the best known strategies • Improves quality (term weighting) • Allows approximate matching (partial matching) • Gives ranking by similarity (cosine formula) • Simple, fast • But: • Does not consider term dependencies • considering them in a bad way hurts quality • no known good way • No logical expressions (e.g., negation: “mouse & NOT cat”)

  19. Probabilistic model • Assumptions: • there is a set of “relevant” docs • docs have probabilities of being relevant • After a Bayes calculation: probabilities of terms to be important for defining relevant docs • Initial idea: interact with the user • Generate an initial set • Ask the user to mark some of them as relevant or not • Estimate the probabilities of keywords. Repeat • Can be done without the user • Just re-calculate the probabilities assuming the user’s acceptance is the same as the predicted ranking
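
A minimal sketch of the scoring step in the classic binary-independence formulation (one common way to write this model; the initial guesses follow the usual convention of P(k_i|R) = 0.5 and P(k_i|non-R) ≈ n_i/N, and are assumptions here):

```python
import math

def bim_score(doc_terms: set[str], query_terms: set[str],
              p_rel: dict[str, float], p_nonrel: dict[str, float]) -> float:
    """Sum over query terms present in the doc of log(p/(1-p)) + log((1-u)/u),
    where p = P(k_i | relevant) and u = P(k_i | non-relevant)."""
    score = 0.0
    for t in query_terms & doc_terms:
        p = p_rel.get(t, 0.5)      # initial guess: term equally likely in relevant docs
        u = p_nonrel.get(t, 0.1)   # initial guess: roughly n_i / N for the collection
        score += math.log(p / (1 - p)) + math.log((1 - u) / u)
    return score

# Re-estimating p and u from the top-ranked (assumed relevant) docs and repeating
# is the "without the user" iteration described above.
```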

  20. (Dis)advantages of Probabilistic model • Advantage: • Theoretical adequacy: ranks by probabilities • Disadvantages: • Need to guess the initial ranking • Binary weights, ignores frequencies • Independence assumption (not clear if bad) • Does not perform well (?)

  21. Alternative Set Theoretic models: Fuzzy set model • Takes into account term relationships (thesaurus) • Bible is related to Church • Fuzzy belonging of a term to a document • A document containing Bible also contains “a little bit of” Church, but not entirely • Fuzzy set logic applied to such fuzzy belonging • logical expressions with AND, OR, and NOT • Provides ranking, not just yes/no • Not investigated well. • Why not investigate it?
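
A minimal sketch of the fuzzy-membership idea, assuming a term-term correlation matrix c (e.g., built from a thesaurus or from co-occurrence); the algebraic form below is one common variant, not necessarily the one this course uses:

```python
def fuzzy_membership(term: str, doc_terms: list[str],
                     c: dict[tuple[str, str], float]) -> float:
    """mu(term, doc) = 1 - product over terms l in the doc of (1 - c[term, l])."""
    prod = 1.0
    for l in doc_terms:
        prod *= 1.0 - c.get((term, l), 1.0 if l == term else 0.0)
    return 1.0 - prod

# If c[("church", "bible")] = 0.6, a doc containing "bible" belongs
# "a little bit" (0.6) to the fuzzy set of documents about "church".
print(fuzzy_membership("church", ["bible", "verse"], {("church", "bible"): 0.6}))
```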

  22. Alternative Set Theoretic models: Extended Boolean model • Combination of Boolean and Vector • In comparison with Boolean model, adds “distance from query” • some documents satisfy the query better than others • In comparison with Vector model, adds the distinction between AND and OR combinations • There is a parameter (the degree of the norm) that adjusts the behavior between Boolean-like and Vector-like • It can even differ within one query • Not investigated well. Why not investigate it?
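
A minimal sketch of the p-norm idea behind this model (the exact formulation varies; here term weights x_i are assumed to lie in [0, 1] and p is the adjustable degree of the norm):

```python
def p_norm_or(xs: list[float], p: float) -> float:
    """OR combination: ((x_1^p + ... + x_m^p) / m)^(1/p)."""
    return (sum(x ** p for x in xs) / len(xs)) ** (1 / p)

def p_norm_and(xs: list[float], p: float) -> float:
    """AND combination: 1 - (((1-x_1)^p + ... + (1-x_m)^p) / m)^(1/p)."""
    return 1.0 - (sum((1.0 - x) ** p for x in xs) / len(xs)) ** (1 / p)

# With p = 1 both collapse to a plain average (Vector-like, AND = OR);
# as p grows, OR approaches max(xs) and AND approaches min(xs) (Boolean-like).
print(p_norm_or([0.9, 0.1], 1), p_norm_or([0.9, 0.1], 10))
```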

  23. Alternative Algebraic models: Generalized Vector Space model • Classical independence assumptions: • All combinations of terms are possible, none are equivalent (= basis in the vector space) • Pair-wise orthogonal: cos(k_i, k_j) = 0 • This model relaxes the pair-wise orthogonality: cos(k_i, k_j) ≠ 0 • Operates by combinations (co-occurrences) of index terms, not individual terms • More complex, more expensive, not clear if better • Not investigated well. Why not investigate it?

  24. Alternative Algebraic models: Latent Semantic Indexing model • Index by larger units, “concepts” = sets of terms used together • Retrieve a document that shares concepts with a relevant one (even if it does not contain query terms) • Group index terms together (map into a lower-dimensional space), so some terms become equivalent • Not exactly, but this is the idea • Eliminates unimportant details • Depends on a parameter (what details are unimportant?) • Not investigated well. Why not investigate it?
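
A minimal sketch of the dimensionality-reduction step, assuming a term-document matrix A (rows = terms, columns = docs) and a truncated SVD; the parameter k is the “what details are unimportant?” knob mentioned above:

```python
import numpy as np

def lsi_doc_vectors(A: np.ndarray, k: int) -> np.ndarray:
    """Project documents into k latent 'concept' dimensions via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return np.diag(s[:k]) @ Vt[:k, :]      # one k-dimensional column per document

# A query folded into the same k-dimensional space can be compared to the documents
# with the cosine measure, so a doc may match even if it shares no literal query terms.
```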

  25. Alternative Algebraic models: Neural Network model • NNs are good at matching • Iteratively uses the found documents as auxiliary queries • Spreading activation: terms → docs → terms → docs → terms → docs ... • Like a built-in thesaurus • First round gives the same result as the Vector model • No evidence if it is good • Not investigated well. Why not investigate it?

  26. Models for browsing • Flat browsing: String • Just a flat list, as on paper • No context cues provided • Structure guided: Tree • Hierarchy • Like a directory tree on a computer • Hypertext (Internet!): Directed graph • No limitations of sequential writing • Modeled by a directed graph: links from unit A to unit B • units: docs, chapters, etc. • A map (with the traversed path) can be helpful
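
The hypertext case is literally a directed graph; a tiny sketch with an adjacency list (the unit names are made up for illustration):

```python
# Directed graph of links between units (docs, chapters, ...); names are hypothetical.
links = {
    "intro": ["chapter1", "chapter2"],
    "chapter1": ["chapter2", "references"],
    "chapter2": ["references"],
    "references": [],
}

def traversed_path(start: str, choices: list[int]) -> list[str]:
    """Follow one outgoing link per step -- the kind of path a 'map' would display."""
    path, node = [start], start
    for i in choices:
        node = links[node][i]
        path.append(node)
    return path

print(traversed_path("intro", [0, 1]))   # ['intro', 'chapter1', 'references']
```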

  27. Research issues • How do people judge relevance? • ranking strategies • How to combine different sources of evidence? • What interfaces can help users to understand and formulate their Information Need? • user interfaces: an open issue • Meta-search engines: combine results from different Web search engines • Their results almost do not intersect • How to combine the rankings?

  28. Conclusions • Modeling is needed for formal operations • Boolean model is the simplest • Vector model is the best combination of quality and simplicity • TF-IDF term weighting • This (or similar) weighting is used in all further models • Many interesting and not well-investigated variations • possible future work

  29. Thank you! Till March 22, 6 pm
