LAST WEEK • Retrieval evaluation • Why? • How? • Recall and precision – Venn diagram & contingency table
WMES3103 INFORMATION RETRIEVAL • WEEK 5 • QUERY LANGUAGES AND OPERATION
QUERY LANGUAGES • Will cover the different kinds of queries sent to text retrieval systems. • Will show the different types of query that a user can formulate. • Normally, the main and most popularly used type of user query is the keyword-based retrieval.
Different queries are continuously sent to an information retrieval system (IRS). • Most query languages use the content (semantics) and the structure of the text (text syntax) to find the relevant documents. • At times, the IRS may fail to trace and retrieve the relevant documents. • Therefore, we need a number of techniques that enhance the query so that an acceptable level of relevant documents is retrieved.
QUERY LANGUAGES • e.g. use of a thesaurus, synonyms, stemming, stopwords, etc. • A keyword is a word that can be retrieved by an IRS. • The retrieval unit is the basic element which can be retrieved by the system as an answer to a query = also known as a document. • A retrieval unit can be a file, document, Web page, paragraph, or some other structural unit which contains the answer to the query.
Example : Keyword • Keyword used is : “artificial intelligence”
Example : Retrieval unit • website
Example : Retrieval unit • document
TYPES OF QUERY LANGUAGES • Keyword-based querying: single-word, context, Boolean, natural language • Pattern matching • Structural queries: form-like fixed, hypertext, hierarchical
KEYWORD-BASED QUERYING • Query = formulation of a user information need. • Query = a keyword or a number of keywords = a basic query • Documents containing such keywords are searched for in the IRS. • Keyword-based queries are popular because they are: • Intuitive • Easy to express • Fast to rank.
Single-word query • Simplest form of query that can be formulated in an IRS. • Text document = long sequences of words. • The IRS will look at the text and search for the word. • Result of a word query = a set of documents containing at least one of the words of the query. • Set of documents will be ranked according to the degree of similarity to the query.
Single-word query • Ranking is done via word occurrences inside the text • Most popularly used = term frequency = counts the number of times a word appears inside a document
Context query • Single-word queries are complemented with the ability to search for words in a given context = near other words. • Words which appear near other words may indicate a higher possibility of relevance than if they appear apart. • 2 types of queries • phrase • proximity
Phrase – sequence of single-word queries. • Proximity – more relaxed version of the phrase query. • A sequence of words or phrases is given together with a maximum allowed distance between them. • Distance is measured in characters or words, depending on the system.
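Term-frequency ranking for a single-word query can be sketched as follows. This is a minimal illustration, not any particular system's method: it assumes simple whitespace tokenisation, and the function names and the three sample documents are made up for the example.

```python
from collections import Counter

def term_frequency(word, text):
    """Count occurrences of `word` in `text` (case-insensitive, whitespace tokens)."""
    tokens = text.lower().split()
    return Counter(tokens)[word.lower()]

def rank_by_tf(word, docs):
    """Return (doc_id, tf) pairs for documents containing `word`, highest tf first."""
    scored = [(doc_id, term_frequency(word, text)) for doc_id, text in docs.items()]
    hits = [(doc_id, tf) for doc_id, tf in scored if tf > 0]
    # Sort by descending term frequency; break ties by document id.
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical mini-collection.
docs = {
    "d1": "retrieval systems rank documents",
    "d2": "query retrieval and more retrieval",
    "d3": "boolean operators",
}
print(rank_by_tf("retrieval", docs))  # [('d2', 2), ('d1', 1)]
```

Real systems usually combine term frequency with other factors (for example document length), which this sketch leaves out.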
Example : ABI-INFORM (CD) • W/n – first keyword must be within n words of the second keyword. • computer w/1 data = the word computer must be within 1 word of the word data = computer generated data, computer simulated data, data mining computer • PRE/n – first keyword must precede the second keyword by up to n words. • European pre/1 community = the word European must precede the word community by up to 1 word = European economic community, European flavoured community
Example : COMPENDEX • Search for a phrase = type in each keyword separated by a space = will search for the phrase with the 2 keywords next to each other and in the specified order, e.g. artificial intelligence • Desired proximity of keywords = specified with full stops between keywords, e.g. back..basics = back to basics, back to the basics • Keywords must appear in the same sentence = type in the keywords separated by an underscore, e.g. computer_medicine
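The W/n and PRE/n operators can be sketched in a few lines. This is an illustration only, not ABI-INFORM's implementation: it assumes n counts the words allowed between the two keywords (which is what the slide's examples suggest), uses whitespace tokenisation, and the function names `within` and `precedes` are made up here.

```python
def positions(word, tokens):
    """Return every index at which `word` occurs in the token list."""
    return [i for i, token in enumerate(tokens) if token == word]

def within(a, b, n, text):
    """a W/n b: a and b occur with at most n words between them, in either order."""
    tokens = text.lower().split()
    return any(0 < abs(i - j) <= n + 1
               for i in positions(a.lower(), tokens)
               for j in positions(b.lower(), tokens))

def precedes(a, b, n, text):
    """a PRE/n b: a occurs before b with at most n words between them."""
    tokens = text.lower().split()
    return any(0 < j - i <= n + 1
               for i in positions(a.lower(), tokens)
               for j in positions(b.lower(), tokens))

print(within("computer", "data", 1, "computer generated data"))               # True
print(within("computer", "data", 1, "data mining computer"))                  # True
print(precedes("European", "community", 1, "European economic community"))    # True
print(precedes("European", "community", 1, "community of European states"))   # False
```

With n = 0 both functions reduce to an adjacency test, which is how a two-word phrase query behaves.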
Boolean query • Oldest form of keyword query = use of Boolean operators • Typical Boolean query = words + operators. • Given 2 basic keyword queries : A and B • A or B – selects all documents with the word A or the word B (or both). • A and B – selects all documents with both A and B. • A not B – selects all documents with the word A but without the word B. • Represented by a Venn diagram
Natural Language • User determines which keywords are not useful for searching and should be eliminated. • Ranking for documents with these keywords would be very low.
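The three Boolean operators map directly onto set operations, which is also what the Venn diagram depicts. A minimal sketch, assuming a small inverted index built with whitespace tokenisation; the documents and the helper name `build_index` are made up for illustration.

```python
def build_index(docs):
    """Map each word to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for token in set(text.lower().split()):
            index.setdefault(token, set()).add(doc_id)
    return index

# Hypothetical mini-collection.
docs = {1: "cats and dogs", 2: "dogs only", 3: "cats only", 4: "birds"}
index = build_index(docs)

A = index.get("cats", set())   # documents containing "cats"
B = index.get("dogs", set())   # documents containing "dogs"

print(sorted(A | B))  # A or B  -> [1, 2, 3]
print(sorted(A & B))  # A and B -> [1]
print(sorted(A - B))  # A not B -> [3]
```

Note that a pure Boolean query returns an unranked set: a document either satisfies the expression or it does not.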
TARGET - Dialog
? target
Input search terms separated by spaces (e.g. DOG CAT FOOD).
You can enhance your TARGET search with the following options:
- PHRASES are enclosed in single quotes (e.g. ‘DOG FOOD’)
- SYNONYMS are enclosed in parentheses (e.g. (DOG CANINE))
- SPELLING variations are indicated with a ? (e.g. DOG? to search for DOG, DOGS)
- Terms that MUST be present are flagged with an asterisk (e.g. DOG *FOOD)
Q = QUIT H = HELP
? komodo dragon food diet nutrition
Your TARGET search request will retrieve up to 50 of the statistically relevant records.
Searching 1997 – 1998 records only …
Processing Complete
Your search retrieved 50 records
Press ENTER to browse results
C = Customize display Q = Quit H = Help
Pattern Matching • More specific query formulation • Retrieves pieces of text that have some property. • Used in the retrieval of text statistics, data extraction, etc. • A pattern is a set of syntactic features that must occur in a text segment. • A segment that fulfils the pattern specifications = pattern match
Pattern Matching • Interested in documents containing segments which match the given search pattern. • Each IRS will allow some degree of search pattern. • Very simple or very complex. • The more powerful the set of patterns allowed, the more involved are the queries that can be formulated by the user, and the more complex is the implementation of the search.
Pattern Matching • Words – a word in the text, most basic pattern. • Prefixes – the beginning of a text word – e.g. prefix comput will retrieve all documents containing words such as computers, computing, computation, computational, etc. • Suffixes – the termination of a text word – e.g. suffix ters will retrieve all documents containing words such as monsters, posters, potters, painters, etc.
Pattern Matching • Substrings – can appear anywhere within a text word – e.g. tal will retrieve all documents containing words such as coastal, talk, metallic, pedestal, etc. • Ranges – a pair of strings which matches any word lying between them in lexicographical order – e.g. the range between the words held and hold will retrieve strings such as hoax, hissing, helm, help, etc.
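The prefix, suffix, substring, and range patterns above can be tried out on a small vocabulary. A minimal sketch using plain string methods; the word list is made up, and an inclusive range is an assumption since the slides do not say whether the end points themselves match.

```python
# Hypothetical vocabulary extracted from a text collection.
vocabulary = [
    "coastal", "computers", "computing", "helm", "help", "hissing",
    "hoax", "house", "metallic", "monsters", "painters", "talk",
]

prefix_hits = [w for w in vocabulary if w.startswith("comput")]
suffix_hits = [w for w in vocabulary if w.endswith("ters")]
substring_hits = [w for w in vocabulary if "tal" in w]
# Python compares strings lexicographically, so a range is a pair of comparisons.
range_hits = [w for w in vocabulary if "held" <= w <= "hold"]

print(prefix_hits)     # ['computers', 'computing']
print(suffix_hits)     # ['computers', 'monsters', 'painters']
print(substring_hits)  # ['coastal', 'metallic', 'talk']
print(range_hits)      # ['helm', 'help', 'hissing', 'hoax']
```

Note that "computers" matches both the prefix and the suffix pattern, and that "house" falls outside the range because it sorts after "hold".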
Pattern Matching • Allowing errors – A word together with an error threshold • will retrieve all text words which are similar to the given word. • errors are caused by typing, spelling, etc. • most accepted model is the Levenshtein distance or edit distance.
Pattern Matching • Example : Edit distance between COLOR and COLOUR is 1, SURVEY and SURGERY is 2. Therefore, in the query, we must specify the maximum number of allowed errors for a word to match the pattern.
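The two distances quoted above can be checked with the standard dynamic-programming formulation of Levenshtein distance (insertions, deletions, and substitutions each cost 1). The sketch below is one common way to write it; the helper `fuzzy_match` and its word list are made up to show how an error threshold is applied.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, computed row by row."""
    prev = list(range(len(b) + 1))          # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def fuzzy_match(word, vocabulary, max_errors):
    """Return the vocabulary words within max_errors edits of `word`."""
    return [w for w in vocabulary if edit_distance(word, w) <= max_errors]

print(edit_distance("COLOR", "COLOUR"))    # 1 (insert U)
print(edit_distance("SURVEY", "SURGERY"))  # 2 (substitute V -> G, insert R)
print(fuzzy_match("color", ["colour", "colon", "cooler", "dollar"], 1))
# ['colour', 'colon']
```

Raising `max_errors` widens the match set quickly, which is why the threshold is normally kept small.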
Structural Queries • Based on the structure of the text • 3 structures – fixed, hypertext, hierarchical • The user queries the text based on its structure. • Query languages nowadays integrate both content and structural queries.
Example : UM Library OPAC records • Example of query : fi au ali and subject malaysia
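A fixed (form-like) structure means each record is a set of named fields, and a query restricts each keyword to one field. The sketch below reads the query above as "author field contains ali AND subject field contains malaysia"; that reading is an interpretation, since the slides do not spell out the OPAC syntax, and the records and the helper `field_contains` are invented for illustration.

```python
# Hypothetical catalogue records with a fixed set of fields.
records = [
    {"title": "Record A", "author": "Ali Ahmad", "subject": "Malaysia history"},
    {"title": "Record B", "author": "Ali Ahmad", "subject": "Indonesia history"},
    {"title": "Record C", "author": "Tan Mei", "subject": "Malaysia economy"},
]

def field_contains(record, field, word):
    """True if `word` appears as a whole word in the given field of the record."""
    return word.lower() in record.get(field, "").lower().split()

# Assumed meaning of the query: author = ali AND subject = malaysia.
matches = [r["title"] for r in records
           if field_contains(r, "author", "ali")
           and field_contains(r, "subject", "malaysia")]

print(matches)  # ['Record A']
```

The same keyword can give different results depending on the field it is applied to, which is the point of a structural query.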
Query Protocols • Protocol: a strict set of rules that govern the exchange of information between computer devices • Query languages used automatically by software applications to query text databases. • Some are standards for querying CD-ROMs or serve as intermediate languages to query library systems. • Not intended for human use = referred to as protocols and not languages.
Query Protocols • Z39.50 – queries bibliographical information using a standard interface between the client and the host database manager, which is independent of the client user interface and of the query database language at the host. Originally used for bibliographical information based on the MARC format. • WAIS – Wide Area Information Servers – popular before the Web – a network publishing protocol that can query databases through the Internet.
Protocols for CD-ROM • Allow for flexibility in data communication between primary information providers and end users. • Significant cost savings – allow access to a variety of information without the need to buy, install, and train users for different data retrieval applications. • 3 protocols have been recommended : • CCL (Common Command Language) • CD-RDx (Compact Disk Read only Data exchange) • SFQL (Structured Full-text Query Language)
QUERY OPERATIONS • Users find it difficult to formulate queries which are well-designed for retrieval purposes because they do not know the collection make-up and the retrieval environment. • Web search engines – users spend a lot of time reformulating their queries to get effective retrieval. • First query formulation – retrieve documents and examine for relevance – construct new improved query formulations – retrieve documents and examine for relevance – process is repeated until the user is satisfied.
QUERY OPERATIONS • 2 processes involved • expanding the original query with new terms • reweighting the terms in the expanded query. • 2 ways of improving initial query formulation • approaches based on feedback information from the user • approaches based on information derived from the set of documents initially retrieved (called the local set of documents)
User Relevance Feedback • Most popular query reformulation strategy. • User is presented with a list of retrieved documents, examines them, and marks those which are relevant. • Only the top 10 or 20 ranked documents need to be examined. • Separates them into relevant and non-relevant. • Select important terms attached to the retrieved and relevant documents only, and enhance the importance of these terms in the new query formulation.
User Relevance Feedback • Expect the new query to move towards the relevant documents and away from the non-relevant ones. • Advantages : • Protects the user from the details of the query reformulation process because all the user has to do is mark the relevant documents. • Breaks down the entire search process into a sequence of small steps which are easier to grasp. • Provides a controlled process designed to emphasise some terms and de-emphasise others.
Automatic Local Analysis • Documents retrieved for a given query are examined immediately to determine terms for query expansion. • Similar to the relevance feedback cycle but done without the assistance of the user – automatic. • Local feedback strategies are based on expanding the query with terms correlated to the query terms = local clusters built from the local document set.
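The slides do not name a formula for expanding and reweighting the query. One widely used choice for vector-space systems is the Rocchio formula, sketched below as an illustration only: the new query is alpha times the old query, plus beta times the average of the relevant documents, minus gamma times the average of the non-relevant ones. The weights (1.0, 0.75, 0.15), the dictionary representation, the sample vectors, and dropping non-positive weights are all assumptions of this sketch.

```python
from collections import defaultdict

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a reweighted, expanded query as a {term: weight} dictionary."""
    new_query = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    for doc in relevant:                       # pull towards relevant documents
        for term, weight in doc.items():
            new_query[term] += beta * weight / len(relevant)
    for doc in non_relevant:                   # push away from non-relevant ones
        for term, weight in doc.items():
            new_query[term] -= gamma * weight / len(non_relevant)
    # Assumption: terms whose weight is not positive are dropped.
    return {term: w for term, w in new_query.items() if w > 0}

# Hypothetical term-weight vectors for a query and the documents the user judged.
query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "cat": 1.0}, {"jaguar": 1.0, "habitat": 1.0}]
non_relevant = [{"jaguar": 1.0, "car": 1.0}]

result = rocchio(query, relevant, non_relevant)
print(sorted(result))  # ['cat', 'habitat', 'jaguar']
```

Here the query gains "cat" and "habitat" (expansion), "jaguar" gets a higher weight (reweighting), and "car" is excluded because it only appears in a non-relevant document.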
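A very small version of local analysis: take the documents retrieved for the query (the local set), score every other term by how strongly it co-occurs with each query term, and add the top-scoring terms to the query. The correlation used here, the sum over local documents of the product of the two term frequencies, is one simple choice among several; the function name, the parameter k, and the sample documents are made up, and no stopword removal or stemming is applied.

```python
from collections import Counter

def expand_query(query_terms, local_docs, k=2):
    """Add to the query the k terms most correlated with each query term."""
    doc_counts = [Counter(text.lower().split()) for text in local_docs]
    expanded = list(query_terms)
    for q in query_terms:
        scores = Counter()
        for counts in doc_counts:
            if q not in counts:
                continue
            for term, freq in counts.items():
                if term not in query_terms:
                    scores[term] += counts[q] * freq   # co-occurrence strength
        # Highest correlation first; ties broken alphabetically.
        best = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))[:k]
        expanded.extend(term for term, _ in best if term not in expanded)
    return expanded

# Hypothetical local set: the documents initially retrieved for the query "jaguar".
local_docs = ["jaguar cat habitat cat", "jaguar cat jungle", "jaguar speed"]
print(expand_query(["jaguar"], local_docs))  # ['jaguar', 'cat', 'habitat']
```

Unlike user relevance feedback, nothing here is judged by a person, so the quality of the expansion depends entirely on how good the initial retrieval was.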