Information Representation: Vector Space and Beyond (Lecture 5)
Outline • Representation • Big Picture • Beyond Boolean • Leveraging Natural Word Distribution • Standard Word Weighting • Vector-Space (Foundation) • Refinement • Relevance Feedback • Probabilistic Approach • Stop-word list • Thesaurus • Clustering • Access • Public Web Resource (PubMed) • Customized Interfaces
The Big Picture (TBP) • Users seek information to satisfy some need …
TBP • Research so far has focused on scholarly information needs (to be satisfied through textual information) …
TBP • Assumptions: • Documents can be converted to computable structures • Users will transform (with or without machine assistance) their information needs to queries • Machines can accurately match documents to queries
A More Detailed Picture: The IR Process • [Diagram: a user's information need statement is transformed into a query or profile, which is matched against surrogates derived from the documents]
IR Process Decomposition: Documents • Documents cannot be searched as they are • Too many and too large • The transformation of documents into surrogates is a critical step: REPRESENTATION • Representation is generally conducted a priori • Aim: identify “key” terms or term patterns and associate them with documents • Aim: make searching more efficient; this may involve grouping, clustering, or classification operations
IR Process Decomposition: Users • User needs have to be constrained • They are too ambiguous as stated • They must be made isomorphic to document representations • The transformation of user needs depends on the QUERY MODEL used • Numerous approaches have been proposed and tested: • Exact vs. weighted • Keyword match • Boolean • Vector-space • Probabilistic
Representation - Boolean • Boolean search is also called exact-match search • It matches words in queries with words in the documents based on morphological comparisons (word forms) • It permits use of basic logical operators: AND, OR, NOT, etc.
Boolean • Exact-match search is easy to use! • The user’s terms can be used directly, and the response is easy to interpret • But exact-match searches have their limitations too ...
Limitations of Boolean • Terms may have synonyms or closely related terms that are morphologically different • For example, the term anxiety may be closely related to the term depression in the literature (database)
Limitations of Boolean • If the user is unaware of such relationships, or chooses to ignore them, the retrieved set may be incomplete! • In full-text searching a single occurrence may trigger retrieval -- degree of relevance not accounted for
Beyond Boolean • Identifying the “central” concepts in documents is difficult and may be inconsistent across indexers • Recognizing relationships among words is not always easy
Distribution of Words • Automated approaches were developed to: • index directly on the content of documents (not an interpretation of it) • take advantage of patterns of word usage to automatically identify “relevant” terms • multiple terms thus extracted can capture relationships • degree of relevance can also be established if weighted schemes are used
Word Distribution • Luhn (an IBM research scientist) proposed that documents should be indexed based on the words in the documents themselves • He based this assertion on the notion that the distribution of words among documents is not “random” but deterministic (predictable)
Word Distribution • According to Luhn, terms of extremely low or extremely high frequency can be ignored • The medium-frequency terms that remain can be selected as the index terms for individual documents, as in the sketch below
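A minimal sketch of this Luhn-style selection. The frequency cut-offs and toy token list are illustrative assumptions; Luhn gave no universal values:

```python
from collections import Counter

# Illustrative cut-offs for a toy example; real collections need tuning.
LOW, HIGH = 2, 3

def medium_frequency_terms(tokens):
    """Keep terms whose frequency falls between the two cut-offs."""
    counts = Counter(tokens)
    return sorted(t for t, c in counts.items() if LOW <= c <= HIGH)

tokens = ("the the the the anxiety anxiety depression depression "
          "comorbid").split()
print(medium_frequency_terms(tokens))  # ['anxiety', 'depression']
```

Here the very frequent term (the) and the singleton (comorbid) are dropped, leaving the medium-frequency terms as index terms.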
Standard Word Weighting • Salton seized on the idea proposed by Luhn and extended it • He developed the vector-space model for document representation • Simply, in this model documents and queries are represented as an array of values (a vector) • Then document and query vectors are matched for retrieval purposes
Vector-Space Model • Let us assume we have n index terms in a database; then every document in the database is represented as a vector of the following form:
Terms:    T1  T2  T3  T4  T5  …  Tn
Vector:  [W1  W2  W3  W4  W5  …  Wn]
Above, W1 is the weight corresponding to term T1, W2 is the weight corresponding to term T2, and so on.
Sample Documents • Two abstracts:
TI: The structure of negative emotions in a clinical sample of children and adolescents. SO: Journal of Abnormal Psychology PY: Feb98, Vol. 107 Issue 1, p74 IS: 12p NT: 0021843X AU: Chorpita, Bruce F.; Albano, Anne Marie; et al. AB: Presents a study which focuses on the factors associated with childhood anxiety and depression with the use of a structural equations/confirmatory factor-analytic approach. Reference to a sample of 216 children and adolescents with diagnoses of an anxiety disorder or comorbid anxiety and mood disorders; Suggestion of results; Discussion on the implications for the assessment of childhood negative emotions. CO: 276712
TI: Depression: A family affair. SO: Lancet PY: 01/17/98, Vol. 351 Issue 9097, p158 IS: 1p NT: 00995355 AU: Faraone, Stephen V.; Biederman, Joseph AB: Considers the studies of major depression and anxiety disorders. The findings with regard to depression being familial and having a genetic component to its complex etiology; Discusses the continuity between child and adult psychiatric disorders, psychiatric comorbidity and the underidentification and treatment of juvenile depression. CO: 116735
Simple Binary Representation • If we index using the two terms anxiety and depression, the representation for both of the previous documents would be:
Terms:    T1  T2  T3  T4  T5
Vector:  [ 0   0   0   1   1 ] = Document Vector
Assuming: 1) T4 = anxiety and T5 = depression; 2) terms T1, T2, and T3 are not present in the documents; 3) binary representation
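A minimal sketch of building such a binary vector, assuming a toy five-term vocabulary (the tokenizer and the placeholder terms t1–t3 are illustrative, not from the slides):

```python
# Toy vocabulary standing in for T1..T5 on the slide.
VOCABULARY = ["t1", "t2", "t3", "anxiety", "depression"]

def binary_vector(text):
    """1 if the vocabulary term occurs anywhere in the text, else 0."""
    tokens = set(text.lower().split())
    return [1 if term in tokens else 0 for term in VOCABULARY]

doc = "childhood anxiety and depression in a clinical sample"
print(binary_vector(doc))  # [0, 0, 0, 1, 1]
```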
Matching • When a user issues a query to the system, the query is also converted to a vector before matching is performed • Example: if the user enters the term anxiety as the query, the vector for this query would be:
Terms:    T1  T2  T3  T4  T5
Vector:  [ 0   0   0   1   0 ] = Query Vector
Matching • A simple matching technique called the inner product can be applied to compute similarity between a query vector and document vectors • Let’s assume we have another document in which the terms anxiety and depression do not appear. Then the vector for that document is:
Terms:    T1  T2  T3  T4  T5
Vector:  [ 1   0   0   0   0 ] = Another Document
Matching • The similarity computation for the other document produces a result of 0, as follows:
Similarity(query, document) = Q · D
   [0 0 0 1 0]  (Query Vector)
×  [1 0 0 0 0]  (Another Document)
=  0 + 0 + 0 + 0 + 0 = 0
Matching • The inner-product similarity computation for the original document vector produces a result of 1, as follows:
Similarity(query, document) = Q · D
   [0 0 0 1 0]  (Query Vector)
×  [0 0 0 1 1]  (Original Documents)
=  0 + 0 + 0 + 1 + 0 = 1
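A minimal sketch of the inner-product matching above, using the exact vectors from the example:

```python
def inner_product(query, doc):
    """Sum of pairwise products of query and document weights."""
    return sum(q * d for q, d in zip(query, doc))

query     = [0, 0, 0, 1, 0]  # "anxiety"
doc_orig  = [0, 0, 0, 1, 1]  # the two original abstracts
doc_other = [1, 0, 0, 0, 0]  # document lacking both terms

print(inner_product(query, doc_orig))   # 1 -> retrieved
print(inner_product(query, doc_other))  # 0 -> not retrieved
```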
Matching • If the retrieval rule is that the similarity result must be > 0 for a document to be considered relevant, the other document would not be retrieved • Note that if the user had entered depression instead of anxiety, the same two documents would have been retrieved • The two documents were “automatically” indexed using both terms
Improving Matching • Note that in the binary representation, a document containing a single occurrence of a significant term is treated the same as a document containing that term many times • To distinguish better between documents based on word frequencies, a different representation is needed
Improved Matching • Salton in fact suggested a weighting scheme more precise than binary representation • He proposed using term frequencies as the initial weights for individual terms • He then suggested that each weight be calibrated using the inverse document frequency of the term
Improved Matching • The formula for this approach is:
W_t = TF_t × IDF_t
where TF_t is the frequency of term t in the document, and
IDF_t = log( N / n_t )
with N = the number of documents in the database and n_t = the number of documents containing term t
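A minimal sketch of this weighting, assuming a toy three-document corpus (the corpus is illustrative; natural log is used here, though the slide does not fix a base):

```python
import math

# Illustrative toy corpus, pre-tokenized.
DOCS = [
    ["anxiety", "depression", "children"],
    ["depression", "depression", "genetic"],
    ["protein", "sequence"],
]

def tf_idf(term, doc, corpus=DOCS):
    """W_t = TF_t * log(N / n_t), per the slide's formula."""
    tf = doc.count(term)
    n_t = sum(1 for d in corpus if term in d)
    if n_t == 0:
        return 0.0
    return tf * math.log(len(corpus) / n_t)

print(tf_idf("depression", DOCS[1]))  # 2 * log(3/2) ≈ 0.81
```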
Improved Matching • The idea behind the inverse document frequency is that the true discrimination value for a term should be based on • The size of the document set (database size) • Overall distribution of the term in a database
Improved Matching • If a term appears many times in many documents, then it is considered to have “low” discrimination value • Conversely, if a term appears multiple times in a few documents and relatively few times in the others, then the term is said to have “high” discrimination value
Ranking • Using the same inner-product similarity computation, different similarity values are produced for different documents once term weights (frequencies) are considered • The output of the system can then be ranked by similarity value, as in the sketch below • Demo Break: SIFTER
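A minimal sketch of this ranking step. The TF-IDF weights and document names below are illustrative assumptions, not from the slides:

```python
def inner_product(q, d):
    return sum(a * b for a, b in zip(q, d))

query = [0, 0, 0, 1.2, 0]            # weighted query ("anxiety")
DOCS = {
    "doc1": [0, 0, 0, 2.4, 1.1],     # anxiety weighted heavily
    "doc2": [0, 0, 0, 0.8, 2.0],
    "doc3": [1.5, 0, 0, 0, 0],       # no query terms
}
ranked = sorted(DOCS, key=lambda n: inner_product(query, DOCS[n]),
                reverse=True)
print(ranked)  # ['doc1', 'doc2', 'doc3']
```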
Refinement • One simple refinement is to divide the inner-product result by a value based on the lengths of the query and document vectors (cosine similarity) -- this is a standard normalization approach that accounts for variability in document and query sizes
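A minimal sketch of cosine similarity as this length-normalized inner product:

```python
import math

def cosine_similarity(query, doc):
    """Inner product divided by the product of the vector lengths."""
    dot = sum(q * d for q, d in zip(query, doc))
    q_len = math.sqrt(sum(q * q for q in query))
    d_len = math.sqrt(sum(d * d for d in doc))
    if q_len == 0 or d_len == 0:
        return 0.0
    return dot / (q_len * d_len)

print(cosine_similarity([0, 0, 0, 1, 0], [0, 0, 0, 1, 1]))  # ≈ 0.707
```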
Refinement - Relevance Feedback • After one iteration of the retrieval cycle, when a document set has been retrieved for a given query, additional information can be provided to the system as “feedback” • One common approach is to ASK the USER to identify, from the retrieved set, the documents that the USER considers relevant
Relevance Feedback • Based on the feedback the user provides, the query can be modified • New terms appearing in the documents judged relevant are selected and added to the query, as in the sketch below
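A minimal sketch of one standard way to do this query modification, a simplified Rocchio-style update (the slides do not name a specific formula; the mixing weights are illustrative assumptions):

```python
ALPHA, BETA = 1.0, 0.75  # illustrative mixing weights

def feedback(query, relevant_docs):
    """Move the query toward the centroid of user-marked relevant docs."""
    n = len(relevant_docs)
    centroid = [sum(d[i] for d in relevant_docs) / n
                for i in range(len(query))]
    return [ALPHA * q + BETA * c for q, c in zip(query, centroid)]

query = [0, 0, 0, 1, 0]          # "anxiety"
relevant = [[0, 0, 0, 1, 1]]     # user judged the abstracts relevant
print(feedback(query, relevant)) # [0.0, 0.0, 0.0, 1.75, 0.75]
```

Note how the term depression, present only in the relevant document, enters the modified query with a nonzero weight.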
Relevance Feedback - Probabilistic Retrieval • Another approach involves re-weighting the terms in the query • A probabilistic retrieval formula takes into consideration not just word distribution in the overall collection, but the distribution of words in the relevant documents as well
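The slides do not commit to a particular formula; as one classic instantiation, here is a sketch of the Robertson/Sparck Jones relevance weight, which combines collection-wide and relevant-set term statistics:

```python
import math

def rsj_weight(N, n_t, R, r_t):
    """Robertson/Sparck Jones relevance weight.
    N: docs in the collection, n_t: docs containing term t,
    R: known relevant docs, r_t: relevant docs containing t.
    The 0.5 terms smooth away zero counts."""
    return math.log(((r_t + 0.5) / (R - r_t + 0.5)) /
                    ((n_t - r_t + 0.5) / (N - n_t - R + r_t + 0.5)))

# A term in 8 of 10 relevant docs but only 50 of 10,000 docs overall
# gets a large positive weight:
print(rsj_weight(N=10_000, n_t=50, R=10, r_t=8))  # ≈ 6.7
```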
Further Refinement - Search • Using a stop-word list, certain words can be removed from the query or the documents before vector generation (see the sketch below) • A hybrid approach to indexing, in which human-assigned index terms are treated as part of the document alongside the original content, may improve the vector representation
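A minimal sketch of stop-word removal before vector generation (the stop-word list is a tiny illustrative sample; real lists run to hundreds of words):

```python
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "with"}

def remove_stop_words(text):
    return [t for t in text.lower().split() if t not in STOP_WORDS]

print(remove_stop_words(
    "The structure of negative emotions in a clinical sample"))
# ['structure', 'negative', 'emotions', 'clinical', 'sample']
```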
Further Refinement • It is possible to use a controlled-vocabulary source such as a thesaurus to provide a “domain bound” on a document source, and then use term weighting to “customize” the document representations for that source • The controlled-vocabulary list can also be extremely useful to the user as a search aid
Further Refinement - Thesaurus • It is actually possible to automatically enhance/supplement the thesaurus itself in an on-going fashion
Further Refinement - Thesaurus • Start from a randomly sampled subset of the document source • TF-IDF can be used to select a subset of terms; then a document-document similarity measure can be used to cluster the documents • Documents with related terms will cluster together -- each cluster may represent an area that may or may not be covered in the thesaurus (a clustering sketch follows)
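A minimal sketch of the clustering step, using a simple greedy single-pass scheme over cosine similarities (the threshold and toy vectors are illustrative assumptions; the slides do not prescribe a particular algorithm):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def single_pass_cluster(vectors, threshold=0.5):
    """Assign each doc to the first cluster whose seed is similar
    enough; otherwise start a new cluster."""
    clusters = []  # list of (seed_vector, member_indices)
    for i, vec in enumerate(vectors):
        for seed, members in clusters:
            if cosine(vec, seed) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))
    return [members for _, members in clusters]

# Toy TF-IDF vectors: docs 0 and 1 share terms; doc 2 is unrelated.
vecs = [[1.0, 0.8, 0.0], [0.9, 1.0, 0.1], [0.0, 0.0, 1.0]]
print(single_pass_cluster(vecs))  # [[0, 1], [2]]
```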
Diversity of resources - Problems on the Other Side • Many sources exist • Certain sources are extremely large • Search language and manipulation can be complex • Sources are dynamic
Diversity of resources • To reduce the problems associated with source explosion, meta-search engines have been developed • A meta-search engine makes it possible to search multiple sources simultaneously
Source Complexity • However, certain authoritative sources, such as PubMed or the Protein Sequence database (Human Genome Project), are large and offer many options
Access - Metasearch • The Entrez project at the National Center for Biotechnology Information has taken a systematic approach to developing a search engine that supports search customization and the development of innovative interfaces • Entrez site: http://www.ncbi.nlm.nih.gov/Entrez/
Frontiers • Research is needed to accommodate diverse needs and user groups (e.g., Music IR is an emerging area) • More fundamental work is needed on representation (user needs as well as documents)
Frontiers • The most popular (and successful) approaches have been statistical rather than NLP-based • But statistical approaches are showing their age • There is evidence that NLP techniques can help improve performance … but with some caveats
Frontiers • IR usually deals with large (gigabyte- or million-document-scale) collections that are heterogeneous • NLP is hard to apply to content drawn from different domains • NLP resources (rather than techniques per se) appear to be more helpful in IR, e.g., dictionaries, lexicons, thesauri, ontologies, and taxonomies • Approaches are needed to combine NLP with statistical techniques • Research is needed to deal with the dynamic nature of use and with domain evolution
IR resources • Journals • ACM Transactions on Information Systems • Journal of the American Society for Information Science & Technology • Information Processing & Management • Journal of Information Retrieval • Magazines • www.dlib.org
IR Sources on the Web • ACM SIGIR • D-LIB • Digital Libraries Initiatives • Information Filtering