Introduction to Biomedical Informatics: Information Retrieval
Outline • Introduction: basic concepts • Document-query matching methods • TF-IDF • Latent semantic indexing • Evaluation methods • Web-scale retrieval • Example: PubMed • Additional Resources and Recommended Reading
Information Retrieval • Generic task: • User has a query Q expressed in some way (e.g., a set of keywords) • The user would like to find documents from some corpus D that are most relevant to Q • Information retrieval is the problem of automatically finding and ranking the most relevant documents in a corpus D, given a query Q • Examples: • Q = {lung cancer smoking}, D = 20 million papers in PubMed • Q = {pizza irvine}, D = all documents on the Web
General Issues in Document Querying • What representation language to use for docs and queries • How to measure similarity between Q and each document in D • How to rank the results for the user • Allowing user feedback (query modification) • How to evaluate and compare different IR algorithms/systems • How to compute the results in real-time (for interactive querying)
General Concepts • Corpus D consisting of N documents • Typically represented as an N x d matrix • Each document represented as a vector of d terms • E.g., entry i, j is the number of times term j occurs in document i • Query Q: • User poses a query to search D • Query is typically expressed as a vector of d terms • Query Q is expressed as a set of words, e.g., “data” and “mining” are both set to 1 and all other terms are set to 0 (so we can think of the query Q as a “pseudo-document”) • Key ideas: • Represent both documents and queries as vectors in some term-space • Matching a query with documents => defining a vector similarity measure
Querying Approaches • Exact-match query: return a list of all exact matches • Boolean match on text • query = “Irvine” AND “fun”: return all docs with “Irvine” and “fun” • Can generalize to Boolean functions, • e.g., NOT(Irvine OR Newport Beach) AND fun • Not so useful when there are many matches • E.g., “data mining” in Google returns millions of documents • Ranked queries: return a ranked list of most relevant matches • e.g., what record is most similar to a query Q? • Q could itself be a full document or a shorter version (e.g., 1 or a few words) - we will focus here on short (few-word) queries • Typical two-stage approach (e.g., in commercial search engines): • First use exact match to retrieve an initial set of documents • Then use more sophisticated similarity measures to rank documents in this set based on how relevant they are to the query Q
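To make the two-stage idea concrete, here is a minimal Python sketch: a Boolean exact match first, then ranking of the surviving candidates. The documents and the simple term-overlap score are invented purely for illustration, not taken from the slides.

```python
# Stage 1: Boolean exact match; Stage 2: rank candidates (toy example).

docs = {
    1: "irvine is fun and sunny",
    2: "newport beach is near irvine",
    3: "data mining is fun",
}

def boolean_match(query_terms, docs):
    """Stage 1: return ids of documents containing ALL query terms."""
    return [doc_id for doc_id, text in docs.items()
            if all(t in text.split() for t in query_terms)]

def rank(query_terms, candidate_ids, docs):
    """Stage 2: rank candidates by a toy score (query-term occurrence count)."""
    scores = {d: sum(docs[d].split().count(t) for t in query_terms)
              for d in candidate_ids}
    return sorted(scores, key=scores.get, reverse=True)

candidates = boolean_match(["irvine", "fun"], docs)   # exact match: doc 1 only
print(rank(["irvine", "fun"], candidates, docs))       # ranked list: [1]
```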
“Bag of Words” Representation for Documents • document: book, paper, WWW page, ... • term: word, word-pair, phrase, … (could be millions) • query Q = set of terms, e.g., “data” + “mining” • Full NLP (natural language processing) is too hard, so … • we want a (vector) representation for text which • retains maximum useful semantics • supports efficient distance computations between docs and Q • “bag of words” ignores word order, sentence structure, etc. • Nonetheless works well in practice and is widely used • Much more computationally efficient than dealing with word order • term values • Boolean (e.g., term in document or not); “bag of words” • real-valued (e.g., frequency of term in doc; relative to all docs) ...
Practical Issues • Tokenization • Convert document to list of word counts • word token = “any nonempty sequence of characters” • challenges: punctuation, equations, HTML, formatting, etc • Special parsers to handle these issues • Canonical forms, Stopwords, Stemming • Typically remove capitalization • But capitalization may be important for proper nouns, e.g., the name “Gene” v “gene” • Stopwords: • remove very frequent words (a, the, and…) – can use standard list • Can also remove very rare words • Stemming (next slide)
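A minimal tokenization sketch in Python, assuming a tiny illustrative stopword list; real systems use special-purpose parsers for punctuation, equations, HTML, and other formatting. Note how lowercasing conflates “Gene” and “gene”, as mentioned above.

```python
# Lowercase, split into word tokens, drop stopwords, and count terms.
import re
from collections import Counter

STOPWORDS = {"a", "the", "and", "of", "in"}   # tiny illustrative list

def tokenize(text):
    """Return a bag-of-words term-count vector for one document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

print(tokenize("The gene GENE and the Gene: a study of genes."))
# Counter({'gene': 3, 'study': 1, 'genes': 1})
```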
Stemming • Want to reduce all morphological variants of a word to a single index term • e.g. a document containing words like fish and fisher may not be retrieved by a query containing fishing (no fishing explicitly contained in the document) • Stemming - reduce words to their root form • e.g. fish – becomes a new index term • Porter stemming algorithm (1980) • relies on a preconstructed suffix list with associated rules • e.g. if suffix=IZATION and prefix contains at least one vowel followed by a consonant, replace with suffix=IZE • BINARIZATION => BINARIZE • Not always desirable: e.g., {university, universal} -> univers (in Porter’s) • Alternatives include WordNet which is a dictionary-based approach
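For illustration, a short sketch using NLTK's implementation of the Porter stemmer (assumes `pip install nltk`); the word list is just an example and shows both the useful conflation (fishing/fished) and the over-stemming case (university/universal) noted above.

```python
# Porter stemming of a few example words via NLTK.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for w in ["fishing", "fished", "university", "universal"]:
    print(w, "->", stemmer.stem(w))
# fishing -> fish, fished -> fish, university -> univers, universal -> univers
```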
Google n-gram Data • Data made available for research by Google in 2006 • Example trigrams (with counts):
ceramics collectables collectibles 55
ceramics collectables fine 130
ceramics collected by 52
ceramics collectible pottery 50
ceramics collectibles cooking 45
ceramics collection , 144
ceramics collection . 247
ceramics collection </S> 120
ceramics collection and 43
Document Similarity • Measuring similarity between 2 document term vectors x and y • wide variety of distance metrics: • Euclidean (L2) = sqrt(Σi (xi − yi)²) • L1 = Σi |xi − yi| • ... • weighted L2 = sqrt(Σi (wi xi − wi yi)²) • Cosine distance between docs: • often gives better results than Euclidean • normalizes relative to document length
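A small numpy sketch of these measures, applied to two rows of the toy document-term matrix on the next slide: documents with similar content but different lengths look far apart in Euclidean distance yet close in cosine distance.

```python
# Euclidean, L1, and cosine distance between two term vectors.
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def l1(x, y):
    return np.sum(np.abs(x - y))

def cosine_distance(x, y):
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([24, 21, 9, 0, 0, 3], dtype=float)   # d1 from the toy matrix
y = np.array([12, 16, 5, 0, 0, 0], dtype=float)   # d3: similar content, "shorter" doc
print(euclidean(x, y), l1(x, y), cosine_distance(x, y))
# Euclidean distance is large because d1 is longer; cosine distance is tiny.
```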
Distance matrices for toy document-term data (the slide shows the resulting Euclidean and cosine distance matrices)

TF doc-term matrix:
      t1  t2  t3  t4  t5  t6
d1    24  21   9   0   0   3
d2    32  10   5   0   3   0
d3    12  16   5   0   0   0
d4     6   7   2   0   0   0
d5    43  31  20   0   3   0
d6     2   0   0  18   7  16
d7     0   0   1  32  12   0
d8     3   0   0  22   4   2
d9     1   0   0  34  27  25
d10    6   0   0  17   4  23
TF-IDF Term Weighting Schemes • Not all terms in a query or document may be equally important... • TF (term frequency): term weight = number of times the term occurs in that document • problem: a term common to many docs => low discrimination, e.g., “medical” • IDF (inverse document frequency of a term) • nj documents contain term j, N documents in total • IDF = log(N / nj) • Favors terms that occur in relatively few documents • TF-IDF: TF(term) * IDF(term) • No real theoretical basis, but works very well empirically and is widely used
TF-IDF Example

TF doc-term matrix:
      t1  t2  t3  t4  t5  t6
d1    24  21   9   0   0   3
d2    32  10   5   0   3   0
d3    12  16   5   0   0   0
d4     6   7   2   0   0   0
d5    43  31  20   0   3   0
d6     2   0   0  18   7  16
d7     0   0   1  32  12   0
d8     3   0   0  22   4   2
d9     1   0   0  34  27  25
d10    6   0   0  17   4  23

IDF weights are (0.1, 0.7, 0.5, 0.7, 0.4, 0.7) (using natural logs, i.e., loge)
TF-IDF Example

Example: TF-IDF for term t1 in doc d1 = TF * IDF = 24 * log(10/9) ≈ 2.5

TF-IDF doc-term matrix (TF matrix as above; IDF weights (0.1, 0.7, 0.5, 0.7, 0.4, 0.7), natural logs):
      t1   t2    t3    t4  t5   t6
d1    2.5  14.6  4.6   0   0    2.1
d2    3.4  6.9   2.6   0   1.1  0
d3    1.3  11.1  2.6   0   0    0
d4    0.6  4.9   1.0   0   0    0
d5    4.5  21.5  10.2  0   1.1  0
...
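The following numpy sketch reproduces the IDF weights and the TF-IDF entries above from the toy TF matrix, using natural logs as stated on the slide.

```python
# Compute IDF and TF-IDF for the toy document-term matrix.
import numpy as np

tf = np.array([
    [24, 21,  9,  0,  0,  3],
    [32, 10,  5,  0,  3,  0],
    [12, 16,  5,  0,  0,  0],
    [ 6,  7,  2,  0,  0,  0],
    [43, 31, 20,  0,  3,  0],
    [ 2,  0,  0, 18,  7, 16],
    [ 0,  0,  1, 32, 12,  0],
    [ 3,  0,  0, 22,  4,  2],
    [ 1,  0,  0, 34, 27, 25],
    [ 6,  0,  0, 17,  4, 23],
], dtype=float)

N = tf.shape[0]
nj = (tf > 0).sum(axis=0)      # number of documents containing each term
idf = np.log(N / nj)           # rounds to (0.1, 0.7, 0.5, 0.7, 0.4, 0.7)
tfidf = tf * idf               # e.g., entry (d1, t1) = 24 * log(10/9) ~ 2.5
print(np.round(idf, 1))
print(np.round(tfidf[0], 1))   # first row: [2.5, 14.6, 4.6, 0, 0, 2.1]
```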
Simple Document Querying System • Queries Q = binary term vectors • Documents represented by TF-IDF weights • Cosine distance used for retrieval and ranking
Representing a Query as a Pseudo-Document Example query Q = [database index]
Example: Query Q = [t1 t3]

Q = (1, 0, 1, 0, 0, 0) as a pseudo-document; cosine similarity between Q and each document, using the TF and TF-IDF doc-term matrices above:
      TF    TF-IDF
d1    0.70  0.32
d2    0.77  0.51
d3    0.58  0.24
d4    0.60  0.23
d5    0.79  0.43
...
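A short sketch of this query example: Q = (1, 0, 1, 0, 0, 0) is scored against every document by cosine similarity under both TF and TF-IDF weighting. The toy matrix is repeated so the snippet runs on its own.

```python
# Score a query against all documents by cosine similarity (TF and TF-IDF).
import numpy as np

tf = np.array([[24,21,9,0,0,3], [32,10,5,0,3,0], [12,16,5,0,0,0],
               [6,7,2,0,0,0],   [43,31,20,0,3,0], [2,0,0,18,7,16],
               [0,0,1,32,12,0], [3,0,0,22,4,2],   [1,0,0,34,27,25],
               [6,0,0,17,4,23]], dtype=float)
idf = np.log(tf.shape[0] / (tf > 0).sum(axis=0))
tfidf = tf * idf

def cosine_scores(q, docs):
    """Cosine similarity between query vector q and every row of docs."""
    return docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))

q = np.array([1, 0, 1, 0, 0, 0], dtype=float)     # Q = [t1 t3]
print(np.round(cosine_scores(q, tf)[:5], 2))      # [0.7  0.77 0.58 0.6  0.79]
print(np.round(cosine_scores(q, tfidf)[:5], 2))   # [0.32 0.51 0.24 0.23 0.43]
```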
Synonymy and Polysemy • Synonymy • the same concept can be expressed using different sets of terms • e.g. bandit, brigand, thief • negatively affects recall (i.e., the number of relevant docs returned) • Polysemy • identical terms can be used in very different semantic contexts • bank • bear left at the zoo • time flies like an arrow • negatively affects precision (i.e., the relevance of the returned docs)
Latent Semantic Indexing • Approximate data in the original d-dimensional space by data in a k-dimensional space, where k << d • Find the k linear projections of the data that contain the most variance • Basic approach is known as principal component analysis or singular value decomposition • Also known as “latent semantic indexing” when applied to text • Captures dependencies among terms • In effect represents original d-dimensional basis with a k-dimensional basis • e.g., terms like SQL, indexing, query, could be approximated as coming from a single “hidden” term • Why is this useful? • Query contains “automobile”, document contains “vehicle” • can still match Q to the document since the 2 terms will be close in k-space (but not in original space), i.e., addresses synonymy problem
optional Singular Value Decomposition (SVD) • M = U S VT • M = n x d = original document-term matrix (the data) • U = n x d, each row = vector of weights for each document • S = d x d diagonal matrix of singular values • Columns of V (rows of VT) = new orthogonal basis for the data • Each singular value indicates how much of the data's variation is captured by the corresponding “basis” vector • Typically select just the first k basis vectors, k << d (also known as principal components; applied to text this is LSI, latent semantic indexing)
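A brief sketch of LSI via numpy's SVD on the toy TF matrix: keep the first k = 2 basis vectors and project both documents and a query into that 2-dimensional space. The printed basis vectors can be compared with v1 and v2 on the next slide (signs and exact values may differ depending on the weighting used).

```python
# LSI sketch: SVD of the toy TF matrix, keep k = 2 basis vectors, project.
import numpy as np

tf = np.array([[24,21,9,0,0,3], [32,10,5,0,3,0], [12,16,5,0,0,0],
               [6,7,2,0,0,0],   [43,31,20,0,3,0], [2,0,0,18,7,16],
               [0,0,1,32,12,0], [3,0,0,22,4,2],   [1,0,0,34,27,25],
               [6,0,0,17,4,23]], dtype=float)

U, s, Vt = np.linalg.svd(tf, full_matrices=False)   # tf = U @ np.diag(s) @ Vt
k = 2
Vk = Vt[:k]                     # first k right singular vectors = new basis
docs_k = tf @ Vk.T              # documents represented in k-dimensional space
q_k = np.array([1, 0, 1, 0, 0, 0], dtype=float) @ Vk.T   # query in the same space
print(np.round(Vk, 2))          # the k basis vectors (cf. v1, v2 on the next slide)
print(np.round(docs_k[:3], 1))  # first three documents in the reduced space
```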
optional • v1 = [0.74, 0.49, 0.27, 0.28, 0.18, 0.19] • v2 = [-0.28, -0.24, -0.12, 0.74, 0.37, 0.31] • New documents: D1 = “database” x 50, D2 = “SQL” x 50
Evaluating Retrieval Methods • Typically there is no real ground truth for a query, so how can we evaluate our algorithms? Is A better than B? • Academic research • Use small testbed data sets of documents where human labelers assign a binary label to each document in the corpus, in terms of its relevance to a specific query Q • repeat for different queries • very time-consuming! • Real-world (e.g., Web search) • Can use click data as a surrogate indicator for relevancy • Can generate very large amounts of training/test data per query • Both approaches are useful for precision, not so useful for recall
Precision versus Recall • Rank documents (numerically) with respect to the query • Compute precision and recall by thresholding the rankings • Precision = fraction of retrieved objects that are relevant • Recall = fraction of relevant objects that are retrieved • Tradeoff: high precision -> low recall, and vice-versa • Similar in concept to a receiver operating characteristic (ROC) • For multiple queries, precision for specific ranges of recall can be averaged (so-called “interpolated precision”)
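A short sketch of precision and recall at a rank cutoff k, with invented document ids and relevance judgments; sweeping k traces out the precision-recall curve on the next slide.

```python
# Precision and recall over the top-k retrieved documents (toy data).
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision and recall computed over the top-k ranked documents."""
    retrieved = ranked_ids[:k]
    hits = sum(1 for d in retrieved if d in relevant_ids)
    return hits / k, hits / len(relevant_ids)

ranked = [3, 1, 7, 2, 9, 5]      # documents sorted by retrieval score
relevant = {1, 2, 5, 8}          # judged relevant for this query
for k in (1, 3, 6):
    print(k, precision_recall_at_k(ranked, relevant, k))
# k=1 gives (0, 0); k=3 gives precision 1/3, recall 1/4; k=6 gives 0.5, 0.75
```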
Precision-Recall Curve (a form of ROC) • Instead of evaluating the entire curve, we can also look at, e.g., (a) precision at fixed recall (e.g., 10%) or (b) precision when precision = recall • In the curves shown, C is universally worse than A and B
TREC evaluations • Text REtrieval Conference (TREC) • Web site: trec.nist.gov • Annual impartial evaluation of IR systems • e.g., D = 1 million documents • TREC organizers supply contestants with several hundred queries Q • Each competing system provides its ranked list of documents • The union of the top 100 or so ranked documents from each system is then manually judged to be relevant or non-relevant for each query Q • Precision, recall, etc., are then calculated and systems compared
Other Examples of Evaluation Data Sets • Cranfield data • Number of documents = 1400 • 225 Queries, “medium length”, manually constructed “test questions” • Relevance = determined by expert committee (from 1968) • Newsgroups • Articles from 20 Usenet newsgroups • Queries = randomly selected documents • Relevance: is the document d in the same category as the query doc?
Related Types of Data • Sparse high-dimensional data sets with counts, like document-term matrices, are common in data mining, e.g., • “transaction data” • Rows = customers • Columns = products • Web log data (ignoring sequence) • Rows = Web surfers • Columns = Web pages • Recommender systems • Given some products from user i, suggest other products to the user • e.g., Amazon.com’s book recommender • Collaborative filtering: • use k-nearest-individuals as the basis for predictions • Many similarities with querying and information retrieval • e.g., use of cosine distance to normalize vectors
Practical Issues: Computation Speed • Say you are doing information retrieval at “Google scale” • e.g., 100 billion documents in the corpus, 1 million terms in your vocabulary • So given a new query Q, you have to compute 100 billion distance calculations, each involving 1 million terms • How can this be done in “near real-time” (e.g., 200 milliseconds)? • Sparse data structures • e.g., 3 columns: <docid, termid, count> • Vastly reduces memory requirements and speeds up search • Inverted index • List of sorted <termid, docid> pairs • useful for quickly finding only the docs that contain the query terms (stage 1) • Massively parallel processing • Different sets of docs on different processors; results are then pooled and ranked
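A minimal sketch of these sparse structures: (docid, termid, count) triples and an inverted index mapping each term to its posting list, with toy data. Intersecting posting lists implements the stage-1 exact match without touching documents that share no terms with the query.

```python
# Sparse (docid, termid, count) triples and an inverted index (toy data).
from collections import defaultdict

triples = [(1, "lung", 3), (1, "cancer", 2), (2, "cancer", 1), (2, "smoking", 4)]

inverted = defaultdict(list)          # termid -> list of docids containing it
for docid, termid, count in triples:
    inverted[termid].append(docid)

def candidate_docs(query_terms):
    """Stage 1: docs containing every query term (intersection of posting lists)."""
    postings = [set(inverted[t]) for t in query_terms]
    return set.intersection(*postings) if postings else set()

print(candidate_docs(["lung", "cancer"]))     # {1}
print(candidate_docs(["cancer", "smoking"]))  # {2}
```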
Aspects of Web-based Search • Additional information in Web documents • Link structure (e.g., PageRank algorithm) • HTML structure • Link/anchor text • Title text • Etc. • Can be leveraged for better retrieval • Additional issues in Web retrieval • Scalability: size of the “corpus” is huge (10s of billions of docs) • Constantly changing: • Crawlers to update document-term information • need schemes for efficiently updating indices • Evaluation is more difficult – how is relevance measured? How many documents in total are relevant?
Example: Google Search Engine • Offline: • Continuously crawl the Web to create index of Web documents • Create large-scale distributed inverted index • Real-time: a user issues a query q • Parallel processing used to find documents that match exactly to q (might be 1 million documents) • These documents are then scored based on 100 or more features • Scoring is typically a logistic regression model learned from past search data for this query q, where 1 = user clicked on a link and 0 = no click • Top 10 scoring links are displayed to the user • May be personalized (based on past search) and localized • All of this has to happen in about ½ a second!
Example: PubMed System http://www.ncbi.nlm.nih.gov/pubmed/ • PubMed • Free biomedical literature search service maintained by NCBI (NIH) • Over 21 million papers indexed • Abstracts, citations, etc for over 5000 life-science journals, back to 1948 • The most widely-used Web tool for searching the biomedical literature • Several million queries per day
PubMed Querying • Basic query = Boolean functions of keywords • E.g., Query = (stomach OR liver) AND cancer NOT smoking • Implicit ANDs are inserted between keywords • e.g., NOT smoking is really AND NOT smoking in the query above • Advanced search allows one to define queries on additional fields such as author, date, journal, MeSH term, language, etc. • The query is extended to include MeSH terms • If any keyword can be mapped to MeSH, then PubMed also retrieves all documents indexed by that MeSH term • Ranking is done in reverse chronological order • Queries often return many docs, e.g., over 247,000 docs for the query above
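PubMed can also be queried programmatically through NCBI's E-utilities. Below is a hedged sketch using the esearch endpoint with the Boolean query from this slide; the URL and parameter names follow NCBI's published API, but check the current E-utilities documentation before relying on exact field names or usage policies.

```python
# Query PubMed via NCBI E-utilities esearch and print the hit count and top PMIDs.
import json
import urllib.parse
import urllib.request

base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
params = {
    "db": "pubmed",
    "term": "(stomach OR liver) AND cancer NOT smoking",
    "retmode": "json",
    "retmax": 10,
}
with urllib.request.urlopen(base + "?" + urllib.parse.urlencode(params)) as r:
    result = json.load(r)["esearchresult"]

print(result["count"])      # total number of matching documents
print(result["idlist"])     # PMIDs of the top results
```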
Systems that go beyond PubMed (from Lu, PubMed and Beyond, Database, 2011)
Using Text Mining to Interpret Queries (from Krallinger, Valencia, and Hirschman, Genome Biology, 2008)
Further Reading
See the class Web page for various pointers.
Information retrieval in the health/biomedical context: Information Retrieval: A Health and Biomedical Perspective, W. Hersh, Springer, 2009.
Very useful reference on indexing and searching text: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edition, by Witten, Moffat, and Bell, Morgan Kaufmann, 1999.
Web-related document search: an excellent resource is Chapter 3, Web Search and Information Retrieval, in Mining the Web: Discovering Knowledge from Hypertext Data, S. Chakrabarti, Morgan Kaufmann, 2003.
Practical aspects of how real Web search engines work: http://searchenginewatch.com/
Latent semantic analysis applied to grading of essays: The debate on automated essay grading, M. Hearst et al., IEEE Intelligent Systems, September/October 2000.