This paper explores the application of Genetic Algorithms (GA) for optimizing queries in Internet Information Retrieval systems. It delves into the implementation of GA in search for relevant documents to user queries, focusing on recall and precision measures. The study showcases the integration of GA operators such as Selection, Fitness function, Crossover, and Mutation to enhance query optimization and retrieval efficiency. It also examines the state of the art in evolutionary learning of Boolean queries and features an innovative Boolean query reformulation interface for Information Retrieval. The text elaborates on chromosome encoding, tree structure representation, fitness function evaluation, and selection operators for genetic programming in improving search performance in IR. Furthermore, it discusses the potential of GA in revolutionizing query optimization techniques for efficient information retrieval systems.
Query Optimization by Genetic Algorithms
Suhail Owais, Pavel Kromer, Vaclav Snašel
Department of Computer Science, VŠB-Technical University of Ostrava, 17. listopadu 15, Ostrava - Poruba, Czech Republic
Outline
• Introduction
• Information Retrieval (IR)
• Genetic Algorithms (GA)
• Optimization
• State of the art
• IR and GA
• Experiments
• Conclusion
• Future Work
Information Retrieval
• In principle: given a set of documents and a user of those documents, the user formulates a question (request or query), and the answer is the subset of documents that satisfies the information need expressed by the question: the “relevant documents”.
• IR covers searching for information in documents, for documents in a collection, for metadata in documents, …
• Searching takes place in databases, or in hypertext-networked databases such as the Internet or an intranet.
Information Retrieval System (IRS)
• An IRS is concerned with:
  • responding to users' queries for information in text;
  • retrieving all documents relevant to the user query from a collection of documents, while retrieving as few non-relevant documents as possible.
[Figure: the collection of documents for a user query, showing the relevant documents, the retrieved documents, and their overlap, the relevant retrieved documents.]
IR Evaluation
The most common measures of retrieval effectiveness are:
• Precision: “the percentage of the retrieved documents that are relevant to the user query”
• Recall: “the percentage of the relevant documents that are retrieved”
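The two measures can be computed directly from the sets of retrieved and relevant documents. The sketch below is only an illustration: the function name `precision_recall` and the document identifiers are hypothetical, and the values are returned as fractions rather than percentages.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    # Convention chosen here (an assumption): an empty denominator gives 0.0.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 of them relevant,
# and 6 relevant documents in the whole collection.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d3", "d5", "d6", "d7"}
print(precision_recall(retrieved, relevant))  # (0.75, 0.5)
```

Precision falls when non-relevant documents are retrieved; recall falls when relevant documents are missed, so a query can score well on one measure and poorly on the other.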
Genetic Algorithm
• GAs draw on Darwinian evolution: they take optimization strategies that nature uses successfully and adapt them to mathematical optimization theory, in order to find the global optimum in a defined phase space.
• GAs are used in IR problems, especially for optimizing a Boolean query.
• GA operators: Selection, Fitness function, Crossover, and Mutation.
GA Flowchart Diagram
[Flowchart nodes: Start, Encoding, Initialize Population, Evaluate Fitness, Condition Satisfied? (Yes: Optimized Query, End; No: Selection, Crossover, Mutation, Regenerate New Offspring)]
Optimization
• The procedure or procedures used to make a system or design as effective or functional as possible, especially the mathematical techniques involved.
• The process of modifying a system to improve its efficiency. The system can be a single computer program, a collection of computers, or even an entire network such as the Internet.
State of the art 1
Evolutionary Learning of Boolean Queries by Multiobjective Genetic Programming
• Authors: Cordon et al., Springer-Verlag GmbH 2002
• Subject: Automatic derivation of Boolean queries by incorporating a Pareto-based multiobjective evolutionary approach, MOGA, into the genetic programming technique.
• Notes:
  • A query is represented as a parse tree with a maximum of 20 nodes.
  • Boolean operators used are AND, OR and NOT.
  • Maximum number of documents is 1400.
• Result: The proposed approach performed appropriately on seven queries of the well-known Cranfield collection in terms of absolute retrieval performance and of the quality of the obtained Paretos.
State of the art 2
An Appropriate Boolean Query Reformulation Interface for Information Retrieval Based on Adaptive Generalization
• Authors: Yoshioka et al., WIRI 2005, in conjunction with IEEE 2005, Tokyo, Japan
• Subject: Implement a user query interface that supports reformulation of IR queries by using abstract concepts.
• Notes:
  • The IR interface uses small numbers of query terms and concept categories with Boolean expressions.
  • A Boolean query is reformulated by using only words that exist in the original query.
  • Boolean operators used are AND and OR.
• Result: Proposed a new IR interface with Boolean query reformulation (ABRIR-AG). It finds complementary query terms that exist in relevant documents and reformulates Boolean query formulas to clarify the information need.
• ABRIR-AG: Appropriate Boolean query Reformulation for IR - Adaptive Generalization
IR and GA
• Collection (set) of documents
• Terms for document di
• Weighting function: Boolean, 1 if the term is in the document and 0 if it is not
• Example: w2 is not in document d2; w8 is in document d2
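A Boolean weighting function of this kind amounts to a membership test. The sketch below uses an invented three-document collection (the term sets and the helper name `weight` are assumptions, chosen only so that w2 is absent from d2 and w8 is present in it, as in the slide's example).

```python
# Hypothetical collection: each document is the set of terms it contains.
documents = {
    "d1": {"w1", "w2", "w5"},
    "d2": {"w3", "w8"},
    "d3": {"w2", "w8", "w9"},
}

# All terms in the collection, in a fixed order.
terms = sorted({term for doc in documents.values() for term in doc})


def weight(term, doc_id):
    """Boolean weight: 1 if the term occurs in the document, otherwise 0."""
    return 1 if term in documents[doc_id] else 0


# Document-term matrix with one row of 0/1 weights per document.
matrix = {doc_id: [weight(t, doc_id) for t in terms] for doc_id in documents}

print(terms)         # ['w1', 'w2', 'w3', 'w5', 'w8', 'w9']
print(matrix["d2"])  # [0, 0, 1, 0, 1, 0]
```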
Chromosome Encoding
• A query is a combination of terms from a set of terms and operators from a set of Boolean operators.
• A set of queries is encoded as chromosomes for genetic programming in prefix form, for example:
  • (w2 OR w6) AND (w9 AND w3) → prefix: AND (OR w2 w6) (AND w9 w3)
  • (w3 AND w4) XOR ((w5 AND w6) OR w8) → prefix: XOR (AND w3 w4) (OR (AND w5 w6) w8)
Tree Structure Representation
[Figure: parse trees of the two example queries]
• XOR (AND w3 w4) (OR (AND w5 w6) w8)
• AND (OR w2 w6) (AND w9 w3)
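A prefix query can be turned into a tree and evaluated against a document with a short recursive routine. This is a minimal sketch, not the authors' implementation: the nested-tuple tree layout and the names `parse_prefix` and `matches` are assumptions, and the terms in the query string are assumed to be separated by spaces.

```python
# Binary Boolean operators that appear in the example queries.
OPERATORS = {
    "AND": lambda a, b: a and b,
    "OR": lambda a, b: a or b,
    "XOR": lambda a, b: a != b,
}


def parse_prefix(text):
    """Parse a prefix query into nested tuples of the form (operator, left, right)."""
    # With binary operators only, parentheses are redundant in prefix form.
    tokens = text.replace("(", " ").replace(")", " ").split()

    def build(pos):
        token = tokens[pos]
        if token in OPERATORS:
            left, pos = build(pos + 1)
            right, pos = build(pos)
            return (token, left, right), pos
        return token, pos + 1  # a term is a leaf

    tree, end = build(0)
    if end != len(tokens):
        raise ValueError("trailing tokens in query")
    return tree


def matches(tree, doc_terms):
    """Evaluate the query tree against the set of terms in one document."""
    if isinstance(tree, str):
        return tree in doc_terms
    op, left, right = tree
    return OPERATORS[op](matches(left, doc_terms), matches(right, doc_terms))


query = parse_prefix("XOR (AND w3 w4) (OR (AND w5 w6) w8)")
print(query)  # ('XOR', ('AND', 'w3', 'w4'), ('OR', ('AND', 'w5', 'w6'), 'w8'))
print(matches(query, {"w3", "w4"}))        # True
print(matches(query, {"w3", "w4", "w8"}))  # False
```

Nested tuples keep each chromosome immutable, which makes it safe to share subtrees between individuals.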
Fitness function
• Recall and precision functions are used to evaluate chromosomes.
• rd: the relevance of document d (1 for relevant and 0 for non-relevant)
• fd: whether document d is retrieved (1 for retrieved and 0 for not retrieved)
• α and β: arbitrary weights, added specifically to the precision fitness function
Selection Operators
• From the population of chromosomes, the two best chromosomes, those with the highest fitness values for the precision or recall measure, are selected.
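The fitness formulas themselves are not reproduced in this text, so the sketch below rests on an assumption: recall fitness is taken as Σ rd·fd / Σ rd, and precision fitness as the weighted sum α·recall + β·precision, a form chosen because its maximum is α + β. The function names and the default weights (α = 0.25, β = 1.0, the values used in the experiments) are illustrative only.

```python
def recall_fitness(r, f):
    """r[d] = 1 if document d is relevant, f[d] = 1 if it is retrieved."""
    relevant = sum(r)
    hits = sum(rd * fd for rd, fd in zip(r, f))
    return hits / relevant if relevant else 0.0


def precision_fitness(r, f, alpha=0.25, beta=1.0):
    """Assumed form: alpha * recall + beta * precision (not confirmed by the slides)."""
    retrieved = sum(f)
    hits = sum(rd * fd for rd, fd in zip(r, f))
    precision = hits / retrieved if retrieved else 0.0
    return alpha * recall_fitness(r, f) + beta * precision


def select_best_two(population, fitness):
    """Return the two chromosomes with the highest fitness values."""
    return sorted(population, key=fitness, reverse=True)[:2]


# Hypothetical relevance and retrieval vectors over five documents.
r = [1, 1, 0, 1, 0]
f = [1, 0, 0, 1, 1]
print(recall_fitness(r, f))
print(precision_fitness(r, f))
print(precision_fitness(r, r))  # perfect retrieval: alpha + beta = 1.25
```

Under this assumed form, a query that retrieves exactly the relevant documents scores α + β rather than 1.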
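Crossover Operator
• Randomly choose one node position in each tree; the subtrees at these positions are exchanged.
[Figure: the two parent trees, with the chosen nodes labelled OR 4 and OR 1]
Exchange Subtrees
• Exchanging the chosen subtrees creates two new offspring.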
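Subtree exchange can be sketched over query trees stored as nested tuples `(operator, left, right)` with term strings as leaves. That representation, the helper names, and the choice to let any node (including the root or a leaf) be picked are assumptions made for this illustration.

```python
import random


def node_paths(tree, path=()):
    """Yield the path (tuple of child indexes) to every node in the tree."""
    yield path
    if isinstance(tree, tuple):
        for i in (1, 2):  # index 0 holds the operator name
            yield from node_paths(tree[i], path + (i,))


def get_subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree


def replace_subtree(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    children = list(tree)
    children[path[0]] = replace_subtree(tree[path[0]], path[1:], new)
    return tuple(children)


def crossover(parent_a, parent_b, rng=random):
    """Pick one random node in each parent and exchange the subtrees."""
    path_a = rng.choice(list(node_paths(parent_a)))
    path_b = rng.choice(list(node_paths(parent_b)))
    sub_a = get_subtree(parent_a, path_a)
    sub_b = get_subtree(parent_b, path_b)
    return (replace_subtree(parent_a, path_a, sub_b),
            replace_subtree(parent_b, path_b, sub_a))


parent_a = ("AND", ("OR", "w2", "w6"), ("AND", "w9", "w3"))
parent_b = ("XOR", ("AND", "w3", "w4"), ("OR", ("AND", "w5", "w6"), "w8"))
child_a, child_b = crossover(parent_a, parent_b, rng=random.Random(1))
print(child_a)
print(child_b)
```

Because whole subtrees are swapped, the two offspring together always contain exactly the terms and operators of the two parents.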
Mutation Operator
• One of the Boolean logical operators (AND, OR, XOR) is changed to a different one; the position is chosen at random.
• If no position is selected, no mutation is applied to that offspring.
[Figure: example offspring trees, with a node labelled AND 4]
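Operator mutation can be sketched as follows, again assuming query trees stored as nested tuples `(operator, left, right)`. Treating the mutation probability as a per-offspring chance is an assumption, as are the function names.

```python
import random

OPERATORS = ("AND", "OR", "XOR")


def operator_paths(tree, path=()):
    """Yield the path to every operator (internal) node in the tree."""
    if isinstance(tree, tuple):
        yield path
        for i in (1, 2):
            yield from operator_paths(tree[i], path + (i,))


def mutate(tree, p_mut=0.2, rng=random):
    """With probability p_mut, replace one random operator by a different one."""
    if rng.random() >= p_mut:
        return tree  # no position selected: the offspring is left unchanged
    paths = list(operator_paths(tree))
    if not paths:
        return tree  # a single-term query has no operator to mutate
    target = rng.choice(paths)

    def rebuild(node, path):
        if not path:
            new_op = rng.choice([op for op in OPERATORS if op != node[0]])
            return (new_op,) + node[1:]
        children = list(node)
        children[path[0]] = rebuild(node[path[0]], path[1:])
        return tuple(children)

    return rebuild(tree, target)


offspring = ("AND", ("OR", "w2", "w6"), ("AND", "w9", "w3"))
print(mutate(offspring, p_mut=1.0, rng=random.Random(0)))
```

Only operators change, so the terms of the query and the shape of the tree are preserved.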
Experiments
• Our genetic program was tested under the following conditions and limitations:
  • Two sets of queries, represented as trees in prefix form, used as two different initial populations
  • A Boolean model of a collection of documents
  • Different collections of documents
  • User query / request: w8 OR w2
Initial Populations
• The two initial populations differ in the sub-queries they contain:
• Initial Population 1 contains the sub-query
  • w8 AND w2
• Initial Population 2 contains the sub-queries
  • w8 AND w2
  • w8 OR w2
  • w8 XOR w2
[Figure: Initial Population 1 and Initial Population 2]
Variables initialization
• Crossover probability: 0.8
• Mutation probability: 0.2
• Population size (number of chromosomes): 8
• Maximum number of generations: 50
• α = 0.25
• β = 1.0
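These settings can be plugged into a generic evolutionary loop. The sketch below is one plausible arrangement, not the authors' program: replacing the two worst individuals with the offspring, stopping only at the generation limit, and the callable-based interface are all assumptions. `fitness`, `crossover`, and `mutate` are supplied by the caller.

```python
import random


def evolve(population, fitness, crossover, mutate,
           p_cross=0.8, p_mut=0.2, max_generations=50, rng=None):
    """Run a simple GA loop and return the fittest individual found."""
    rng = rng or random.Random()
    population = list(population)
    if len(population) < 3:
        raise ValueError("need at least 3 individuals")
    for _ in range(max_generations):
        # Selection: the two best chromosomes become the parents.
        population.sort(key=fitness, reverse=True)
        parent_a, parent_b = population[0], population[1]
        # Crossover is applied with probability p_cross.
        if rng.random() < p_cross:
            child_a, child_b = crossover(parent_a, parent_b)
        else:
            child_a, child_b = parent_a, parent_b
        # Mutation is applied to each offspring with probability p_mut.
        children = [mutate(c) if rng.random() < p_mut else c
                    for c in (child_a, child_b)]
        # Assumed replacement policy: offspring replace the two worst.
        population[-2:] = children
    return max(population, key=fitness)


# Toy run with integers standing in for queries (illustration only).
best = evolve(range(8), fitness=lambda x: x,
              crossover=lambda a, b: (a, b),
              mutate=lambda x: x + 1,
              rng=random.Random(0))
print(best)
```

With a population of 8, the best individual of each generation survives into the next, so the best fitness never decreases under this replacement policy.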
Document Collections
• Three different document collections, with different numbers of words and documents.
Notes on limitations
• Single-point crossover
• The mutation operator is applied only to the Boolean operators AND, OR, and XOR.
• The fitness function must be specified in the input data as either:
  • PrecisionFitness or
  • RecallFitness.
• The maximum value of PrecisionFitness is α + β,
  • so it may be greater than one (> 1);
  • it cannot be interpreted as the probability of retrieving a relevant document.
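The bound can be checked with a short calculation. It assumes that PrecisionFitness is a weighted sum of recall R and precision P, a form that is not shown in this text but is consistent with the stated maximum of α + β; the values α = 0.25 and β = 1.0 are the ones used in the experiments.

```latex
E_{\text{precision}} = \alpha R + \beta P,
\qquad 0 \le R \le 1,\quad 0 \le P \le 1
\;\Longrightarrow\;
E_{\text{precision}} \le \alpha + \beta = 0.25 + 1.0 = 1.25
```

The maximum is reached only when R = P = 1, that is, when the query retrieves exactly the relevant documents.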
Experiments
• A set of experiments was run, varying three factors:
• Initial population used
  • Initial Population 1 or
  • Initial Population 2
• Fitness function used
  • PrecisionFitness or
  • RecallFitness
• Collection used
  • Collection 1 or
  • Collection 2 or
  • Collection 3
Experimental Results Using IP 1
IP: Initial Population, FF: Fitness Function, R: Recall, P: Precision, V: Value
Experimental Results Using IP 2
IP: Initial Population, FF: Fitness Function, R: Recall, P: Precision, V: Value
[Figure: precision and recall diagrams by collection]
[Figure: precision and recall diagrams by initial population]
Conclusions
• The final population contains a set of individuals with the same fitness values; one of them is chosen at random as the optimized query.
• Selecting initial queries that contain different sub-queries similar to the user query increases the quality of the initial population, and this gave better results.
• Recall values in the final population were low, especially when precision was used as the fitness measure and the experiment was run over the largest collection.
• In many experiments, almost all members of the population reached the maximum values of precision and recall before the given number of generations was reached.
Future work
• Use more unweighted Boolean operators, such as ADJ and OF.
• Let mutation operate over all Boolean operators (AND, OR, XOR, ADJ, OF, and NOT).
• Improve the selection method for choosing the best individual from a set of queries with equal values of precision or recall.
• Apply a fuzzy-theory approach to this problem: use weights for terms in documents instead of Boolean weights.
Thanks for your attention
Suhail Owais: suhailowais@yahoo.com