INFO624 - Week 2 Models of Information Retrieval. Dr. Xia Lin Associate Professor College of Information Science and Technology Drexel University. Reviews of Last Week. Challenges of Information Retrieval Translate user’s information needs to queries. Match queries to stored information.
INFO624 - Week 2: Models of Information Retrieval • Dr. Xia Lin, Associate Professor • College of Information Science and Technology • Drexel University
Review of Last Week • Challenges of Information Retrieval • Translate the user’s information needs into queries. • Match queries to stored information. • Evaluate whether the query results match the user’s information needs. • Differences between • Data, information, and knowledge • Data retrieval and information retrieval
Assignment 1 • Some of my favorite Search Software Packages • IBM’s Content Management (high-cost) • AOL PLS Search Engine (free) • GreenStone Digital Library Software (open-source) • SWISH (open-source) • mnoGoSearch (free) • Apache Lucene (open-source components)
Documents • Documents are logical units of text • Units of records (text & other components) • Units that can be stored, retrieved, and displayed as a unique entity • Units of semantic entity • Units of text grouped together for a purpose • Units of unformatted text • Text as written by authors of documents.
Document Models • Documents need to be processed and represented in concise and identifiable formats/structures. • Documents are full of text. • Not every word of the text is meaningful for searching/retrieval. • Documents themselves do not have identifiable attributes such as authors and titles.
Figure 1.2: Logical view of a document: from full text to a set of index terms.
Document Representation • Documents should be represented to help users identify and receive information from the system. • to identify authors and titles • to identify subjects • to provide summaries/abstracts • to classify subject categories
Document Surrogates • Each document should have one or more short and descriptive labels/attributes • Level 1: • Title • Author • Keywords • Level 2: • Level 1 + abstract • Level 3: • Level 2 + full text
A Formal IR Model • An information retrieval model is a quadruple (D, Q, F, R(qi, dj)) where • D is a set composed of logical views (or representations) of the documents in the collection. • Q is a set composed of logical views (or representations) of the users' information needs. Such representations are called queries. • F is a framework for modeling document representations, queries, and their relationships. • R(qi, dj) is a ranking function which associates a real number with a query qi and a document representation dj. Such a ranking defines an ordering among the documents with regard to the query qi.
Computerized Indexing • Title indexing: • Sort all the titles alphabetically • Ignore a leading “a” or “the” • Convert all letters to uppercase. • Matching always starts from the beginning of the title (not from individual words). • Most early IR systems (such as library catalogs) used title indexing
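The title-indexing steps above can be sketched in a few lines of Python. This is an illustration only: the function names, the sample titles, and the list of leading articles are assumptions, not part of the lecture.

```python
# Leading articles ignored when building the sort key (an assumed list;
# the slide names only "a" and "the").
LEADING_ARTICLES = ("a", "the")


def title_key(title):
    """Return the sort key: the uppercase title without a leading article."""
    words = title.strip().split()
    if len(words) > 1 and words[0].lower() in LEADING_ARTICLES:
        words = words[1:]
    return " ".join(words).upper()


def build_title_index(titles):
    """Return (key, original title) pairs sorted alphabetically by key."""
    return sorted((title_key(t), t) for t in titles)


def search_by_prefix(index, query):
    """Match from the beginning of the title key, not on individual words."""
    prefix = query.strip().upper()
    return [title for key, title in index if key.startswith(prefix)]


# Hypothetical titles, for illustration only.
titles = [
    "The Art of Indexing",
    "A Guide to Retrieval",
    "Information Retrieval Basics",
]
index = build_title_index(titles)
print([key for key, _ in index])
print(search_by_prefix(index, "art"))        # matches: title starts with ART
print(search_by_prefix(index, "retrieval"))  # no match: word is not at the start
```

The last call returns nothing even though two titles contain "Retrieval", which is the limitation that motivates word indexing.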
Word indexing • Parse every individual word from the documents • First decision: What is a word? • Are digits words? • How about letter and digit combinations: B6, B12? • Is F-16 one word or two words? • Hyphens • Online, on-line, on line? • F-16 • Singular or plural? • List all the words alphabetically with pointers back to the documents: inverted indexing.
Inverted Indexing • Inverted indexing consists of an ordered list of indexing terms; each indexing term is associated with some document identification numbers. • Retrieval is done by first searching the ordered list to find the indexing term, then using the document identification numbers to locate the documents.
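A minimal sketch of word parsing and inverted indexing follows. The tokenization policy (lowercase everything, treat digits as word characters, keep hyphens inside a word) is one possible answer to the "what is a word?" questions, chosen here as an assumption; the sample documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words.

    Assumed policy: letters and digits form words, and hyphens are kept
    inside a word, so "F-16" is one term and "B12" is one term.
    """
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    # Sorting the terms gives the ordered list of indexing terms.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


# Hypothetical documents, for illustration only.
docs = {
    1: "The F-16 is a fighter aircraft",
    2: "Vitamin B12 and vitamin B6",
    3: "On-line retrieval of aircraft manuals",
}
index = build_inverted_index(docs)
print(index["aircraft"])         # document IDs that contain "aircraft"
print(index["f-16"])             # the hyphenated form is kept as one term
print(index.get("online", []))   # "online" does not match "on-line" under this policy
```

Under a different policy (for example, splitting on hyphens), "F-16" would be indexed as two terms and "on-line" would match "on" and "line" instead.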
Boolean Logic • Logical operators defined on sets • True and false: • A set is a collection of items with certain common characteristics. • Any item either belongs to the set (true) or does not belong to the set (false) • AND • Combines two sets, A and B, to create a smaller (or at least not larger) set C. • Any item in C must be in BOTH set A and set B. • OR • Union of two sets, A and B, to create a larger (or at least not smaller) set C. • Any item in C must be either in set A or in set B. • NOT • Excludes items from a set.
Example: • Given: A={1, 3, 7, 12, 14, 25, 36} B={1, 2, 3, 4, 5, 7, 8, 12, 13, 14, 15, 25, 26} C={2, 4, 6, 8, 10, 11, 12, 13, 14} • Derive: • A AND B • A OR B • A AND B AND C • (A AND B) NOT C • (A AND B) OR C • (A OR B) AND C • A AND (B OR C)
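The derivations in this exercise can be checked with Python's built-in set type, where `&` is intersection (AND), `|` is union (OR), and `-` is set difference (used here for NOT). The `results` dictionary is only a convenient way to print each expression next to its answer.

```python
A = {1, 3, 7, 12, 14, 25, 36}
B = {1, 2, 3, 4, 5, 7, 8, 12, 13, 14, 15, 25, 26}
C = {2, 4, 6, 8, 10, 11, 12, 13, 14}

results = {
    "A AND B": A & B,
    "A OR B": A | B,
    "A AND B AND C": A & B & C,
    "(A AND B) NOT C": (A & B) - C,
    "(A AND B) OR C": (A & B) | C,
    "(A OR B) AND C": (A | B) & C,
    "A AND (B OR C)": A & (B | C),
}

for expression, items in results.items():
    print(f"{expression}: {sorted(items)}")
```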
Boolean Logic • Venn Diagram • graphical representation of Boolean logic • A and (B or C) • A and B or (C and D)
Boolean Query • Terms connected by Boolean operators • The system retrieves a set of documents based on the Boolean logic of the query. • Examples: • (network or networks or structured or system or systems) and (information or retrieval)
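As a rough sketch, the example query can be evaluated against an inverted index by taking the union of postings inside each OR group and then intersecting the groups. The postings below are hypothetical and the helper name `docs_for_any` is an assumption.

```python
# Hypothetical postings: term -> set of document IDs (illustrative only).
postings = {
    "network": {1, 4},
    "systems": {2, 5},
    "structured": {3},
    "information": {1, 2, 6},
    "retrieval": {2, 3, 7},
}


def docs_for_any(terms):
    """OR: union of the postings of all listed terms (unknown terms add nothing)."""
    result = set()
    for term in terms:
        result |= postings.get(term, set())
    return result


# (network or networks or structured or system or systems)
group1 = docs_for_any(["network", "networks", "structured", "system", "systems"])
# (information or retrieval)
group2 = docs_for_any(["information", "retrieval"])

hits = group1 & group2  # AND: intersection of the two groups
print(sorted(hits))
```

Every document in `hits` is returned on equal footing; nothing in the set operations says which hit is the better match.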
Advantages of Boolean Search • Simple and specific • Effective • AND reduces the number of hits very quickly • OR expands the search scope • Strongly logic-based • Proven mathematical foundations
Problems of Boolean Search: • Boolean search is an exact search • It either retrieves or does not retrieve a document. • Requesting “computer” would not find “computing” unless more programming is done • No weighting can be done on terms • In the query A AND B, you can’t specify that A is more important than B.
No Ranking • Retrieved sets cannot be ordered based on the Boolean logic. • Every retrieved document is treated equally. • Possible order confusion • A AND B OR C
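The order confusion in "A AND B OR C" can be made concrete with the sets from the earlier Boolean example: the two possible groupings return different results. The sketch below uses Python sets; note that Python itself evaluates `&` before `|`, which is a property of the language and not necessarily of any particular search system.

```python
A = {1, 3, 7, 12, 14, 25, 36}
B = {1, 2, 3, 4, 5, 7, 8, 12, 13, 14, 15, 25, 26}
C = {2, 4, 6, 8, 10, 11, 12, 13, 14}

and_first = (A & B) | C   # read as (A AND B) OR C
or_first = A & (B | C)    # read as A AND (B OR C)

print(len(and_first), sorted(and_first))
print(len(or_first), sorted(or_first))

# Without parentheses, Python applies & before |.
unparenthesized = A & B | C
print(unparenthesized == and_first)
```

Writing the parentheses explicitly in a query removes the ambiguity regardless of how a given system orders its operators.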
Vectors • A numerical representation for a point in a multi-dimensional space. • (x1, x2, … … xn) • Dimensions of the space need to be defined • A measure of the space needs to be defined.
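Vector Representation of Document Space • Each indexing term is a dimension • Each document is a vector • Di = (ti1, ti2, ti3, ti4, ..., tin) • Dj = (tj1, tj2, tj3, tj4, ..., tjn) • Document similarity is defined as the cosine of the angle between the two vectors (the formula is missing from the extracted slide; cosine is the measure consistent with the similarity values in the worked example): $S(D_i, D_j) = \frac{\sum_{k=1}^{n} t_{ik}\, t_{jk}}{\sqrt{\sum_{k=1}^{n} t_{ik}^2}\;\sqrt{\sum_{k=1}^{n} t_{jk}^2}}$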
Example: • A document space is defined by three terms: • hardware, software, user • A set of documents is defined as: • A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1) • A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1) • A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1) • If the query is “hardware and software” • what documents should be retrieved?
In Boolean query matching: • documents A4 and A7 are retrieved by “ANDing” the two query terms • A1, A2, A4, A5, A6, A7, A8, and A9 are retrieved if the two query terms are “ORed” together. • In Vector query matching: • q=(1, 1, 0) • S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0 • S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5 • S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5 • Retrieved document set (in ranked order) = • {A4, A7, A1, A2, A5, A6, A8, A9}
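The similarity values in this example are consistent with cosine similarity between the query vector and each document vector. The sketch below reproduces them under that assumption; the function name `cosine` and the decision to drop zero-score documents from the ranked list are choices made for the illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Dimensions: (hardware, software, user)
docs = {
    "A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
    "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
    "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1),
}
q = (1, 1, 0)

scores = {name: cosine(q, vec) for name, vec in docs.items()}

# Keep documents with a nonzero score, highest first. sorted() is stable,
# so tied documents stay in their original order.
ranked = sorted((n for n in scores if scores[n] > 0), key=lambda n: -scores[n])

for name in ranked:
    print(name, round(scores[name], 2))
```

Unlike the Boolean result sets, the output is ordered: A4 matches the query exactly, while A7 scores lower because it also contains the term "user".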
Weights in the Vector Space • A main advantage of Vector representation is that items in vectors don’t have to be just 0 or 1 (true or false). • A1=(0.7, 0.5, 0.3) • A2=(0.5, 0.2, 0.7) • A3=(0.3, 0.6, 0.9) • A4=(0.7, 0.9, 1.0) • Queries may also be weighted: • Q=(0.7, 0.3, 0)
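To see how weights change the ranking, the weighted documents and query above can be scored in the same way. This sketch assumes cosine similarity as the matching function, which the slide does not state explicitly for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = {
    "A1": (0.7, 0.5, 0.3),
    "A2": (0.5, 0.2, 0.7),
    "A3": (0.3, 0.6, 0.9),
    "A4": (0.7, 0.9, 1.0),
}
Q = (0.7, 0.3, 0)

scores = {name: cosine(Q, vec) for name, vec in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

for name in ranked:
    print(name, round(scores[name], 3))
```

A1 ranks first because its weights are concentrated on the first term, which is also the term the query weights most heavily.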
TF and IDF • TF – term frequency • Number of times a term occurs in a document • DF – document frequency • Number of documents that contain the term. • IDF – inverse document frequency • = log(N/ni) • N – the total number of documents • ni – the number of documents that contain term i.
Salton’s Vector Space • A document is represented as a vector: • (W1, W2, … … , Wn) • Binary: • Wi = 1 if the corresponding term is in the document • Wi = 0 if the term is not in the document • TF: (Term Frequency) • Wi = tfi, where tfi is the number of times the term occurs in the document • TF*IDF: (Term Frequency times Inverse Document Frequency) • Wi = tfi * idfi, here with idfi = 1 + log(N/dfi), where dfi is the number of documents that contain term i and N is the total number of documents in the collection.
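A small sketch of the TF*IDF weighting, using the form Wi = tfi * (1 + log(N/dfi)). The natural logarithm is an assumption (the slide does not give the base), and the three tokenized documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return (vocabulary, one weight vector per document).

    Weight of term i in a document: tf_i * (1 + log(N / df_i)),
    using the natural logarithm (an assumed base).
    """
    n_docs = len(docs)
    term_counts = [Counter(doc) for doc in docs]
    vocab = sorted(set().union(*docs))
    # Document frequency: number of documents that contain the term.
    df = {t: sum(1 for counts in term_counts if t in counts) for t in vocab}
    vectors = [
        [counts[t] * (1 + math.log(n_docs / df[t])) for t in vocab]
        for counts in term_counts
    ]
    return vocab, vectors


# Hypothetical, already tokenized documents.
docs = [
    ["hardware", "software", "software"],
    ["software", "user"],
    ["software", "hardware", "network"],
]
vocab, vectors = tf_idf_vectors(docs)

print(vocab)
for vec in vectors:
    print([round(w, 2) for w in vec])
```

Here "software" occurs in every document, so its IDF factor is 1 and its weight equals its raw term frequency, while "network" occurs in only one document and is weighted up.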
In vector space, documents and queries are treated the same. • It is easier to do similarity search: • “find documents like this one” • It is easier to do document clustering: • “group documents into categories and subcategories” • It’s easier to display search results graphically • “Giving meaning to place or location in the multi-dimensional space”
Web Indexing • Most web indexing is vector-based indexing, with variations: • Robot indexing software keeps traversing the web to collect more pages and terms • Servers build a huge inverted index and vector index database • Search engines conduct different types of vector query matching • Only a few search engines implement true Boolean query matching
The real differences among search engines are • their index weighting schemes • their query processing methods • their ranking algorithms • None of these is published by any of the search engine firms.
Alternative IR Models • Probabilistic Model • Given a document d, how likely is the user to consider it relevant? • How likely is the user to consider it not relevant? • If these two are known, the similarity of document d and query q can be defined as the ratio of the two probabilities: • $S(d, q) = \frac{P(d \text{ is relevant to } q)}{P(d \text{ is not relevant to } q)}$
Examples: • If a document is 80% likely to be relevant to query q, what is its (probabilistic) similarity? • If a document is only 30% likely to be relevant, what is the similarity?
If there are 100 documents and 10 are relevant to a query, • what is the probability of relevance for a randomly selected document? • What is the similarity of this document to the query? • Any retrieval system must do much better than that. • In general, retrieval systems should retrieve those documents with S > 1
Advantages of the Probabilistic model • Documents can be ranked by their relevance probability. • Relevance probability can be improved through the interaction process. • Good mathematical model • Disadvantages: • Involves many assumptions • Not very practical
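The example questions can be worked through directly from the definition of S(d, q) as the probability of relevance divided by the probability of non-relevance. The function name `relevance_odds` is an assumption made for this sketch.

```python
def relevance_odds(p_relevant):
    """Probabilistic similarity: P(relevant) / P(not relevant)."""
    if not 0 <= p_relevant < 1:
        raise ValueError("p_relevant must be in [0, 1)")
    return p_relevant / (1 - p_relevant)


# A document that is 80% likely to be relevant.
print(round(relevance_odds(0.8), 2))

# A document that is only 30% likely to be relevant.
print(round(relevance_odds(0.3), 2))

# Random selection: 10 relevant documents out of 100.
p_random = 10 / 100
print(round(relevance_odds(p_random), 2))
```

S exceeds 1 exactly when the probability of relevance is above 0.5, which is why S > 1 is a natural retrieval threshold; a randomly selected document in the 100-document example falls far below it.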
Fuzzy Set Model • Fuzzy Set Theory • Extension of Boolean set theory • Instead of a binary membership definition, fuzzy set membership is continuously defined between 0 and 1. • Example: • {male students in our class} • {tall students in our class} • One is a Boolean set and one is a fuzzy set.
The set of retrieved documents should be considered as a fuzzy set. • Documents are not just relevant or not relevant. • Documents can be somewhat relevant. • Documents can be 80% likely to be relevant. • Good mathematical models, but not widely implemented and tested.
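The difference between Boolean and fuzzy membership can be sketched with the class example. The membership function for "tall" below is a simple linear ramp between two thresholds; the thresholds (160 cm and 190 cm), the function names, and the student records are all assumptions chosen for illustration.

```python
def male_membership(is_male):
    """Boolean set: membership is either 0 or 1."""
    return 1.0 if is_male else 0.0


def tall_membership(height_cm, low=160.0, high=190.0):
    """Fuzzy set: membership rises linearly from 0 at `low` to 1 at `high`.

    The thresholds are assumed values, not taken from the lecture.
    """
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    return (height_cm - low) / (high - low)


# Hypothetical students: (name, is_male, height in cm).
students = [("Ann", False, 175.0), ("Bob", True, 192.0), ("Cal", True, 163.0)]

for name, is_male, height in students:
    print(name, male_membership(is_male), round(tall_membership(height), 2))
```

Applied to retrieval, each document would carry a degree of membership in the "relevant" set (for example 0.8) in place of a yes/no decision.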
Latent Semantic Indexing Model • Map documents from a high-dimensional space to a lower-dimensional space, while maintaining document relationships. • For clustering • For visualization • It’s a popular advanced retrieval technique. • It’s computationally expensive.
Neural Network Model • Organize the document collection as a semantic network through learning • Use known queries/relevant documents to train the network, and later allow the network to predict relevance for new queries (supervised learning). • Use document-document relationships to “self-organize” the network and move relevant documents close to each other (unsupervised learning).
The Fusion Model • Retrieve documents based on text indexing (Boolean model, Vector Space Model, etc.) • Retrieve documents based on link models (citations, Google’s PageRank, etc.) • Retrieve documents based on classification models (classification schemes, thesauri, Yahoo categories, etc.) • “Fuse” the results together before responding to the user
Models for Browsing • Flat Model • No particular organization of materials • Hierarchical Model • Assign documents to a hierarchical structure. • Hypertext Model • Define appropriate links among related documents.