Enhancing Internet Search Engines to Achieve Concept-based Retrieval F. Lu, T. Johnsten, V. Raghavan, and D. Traylor
Agenda • Information on the Internet. • Boolean Retrieval Model and the Internet. • Personalized Search. • Concept-Based Retrieval (RUBRIC / CS3). • CS3 and Boolean Search Engines. • Deep Web Sources. • Current & Future Work.
Information on the Internet • Large volume. • Rapid growth rate. • Wide variations in quality and type.
Boolean Retrieval Model and the Internet • Most Internet search engines are based on the Boolean Retrieval Model. • The Boolean Retrieval Model is relatively easy to implement. • Limitations: • Inability to assign weights to query or document terms. • Inability to rank retrieved documents. • Naïve users have difficulty using it.
Personalized Search • [Diagram. Labeled components: User Query, Personalized Engine, Query Processor, User Profile, General Profile, Query Augmentation, Search Engine, Search Results, Result Processor, Personalized Results.]
Concept-Based Retrieval • Addresses shortcomings of the Boolean Retrieval Model. • Search requests are specified in terms of concepts structured as rule-base trees.
Development of Rule-Base Trees (General) • Top-down refinement strategy. • Support for AND / OR relationships. • Support for user-defined weights.
Development of Rule-Base Trees (CS3) • Concept-Set Structuring System (CS3). • CS3 supports the creation, storage and modification of user-defined concepts. • Post-processing of the results of sub-queries. • CS3 user interface.
Evaluation of Rule-Base Trees (RUBRIC) • Run-time, bottom-up analysis. • Propagation of weight values (MIN / MAX). • Disadvantage of run-time analysis.
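The run-time, bottom-up evaluation described above can be sketched as follows. This is a minimal illustration, not the RUBRIC implementation: the `Node` class, the rule that an edge weight multiplies the child's score, and the weights 1.0 / 0.5 / 0.25 are all assumptions made for the example. AND nodes take the MIN of their children and OR nodes the MAX, as the slide states.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    """One node of a rule-base tree (hypothetical structure for illustration)."""
    label: str                      # concept name, or the term itself for a leaf
    op: str = "TERM"                # "AND", "OR", or "TERM" for a leaf
    children: List["Node"] = field(default_factory=list)
    weights: List[float] = field(default_factory=list)  # one weight per child edge


def evaluate(node: Node, doc_terms: Set[str]) -> float:
    """Score a document bottom-up: leaves test term presence, AND = MIN, OR = MAX."""
    if node.op == "TERM":
        return 1.0 if node.label in doc_terms else 0.0
    # Assumption: each child's score is scaled by the weight on its edge.
    scores = [w * evaluate(c, doc_terms) for c, w in zip(node.children, node.weights)]
    return min(scores) if node.op == "AND" else max(scores)


# Example tree with made-up weights.
tree = Node(
    "Hydrogeology", "OR",
    [
        Node("hydrogeology"),
        Node("dnapl"),
        Node("colloid transport", "AND",
             [Node("colloid"), Node("environmental transport")], [1.0, 1.0]),
    ],
    [1.0, 0.5, 0.25],
)

print(evaluate(tree, {"dnapl"}))                               # 0.5
print(evaluate(tree, {"colloid", "environmental transport"}))  # 0.25
print(evaluate(tree, {"colloid"}))                             # 0.0
```

Because the whole tree is walked once per document, the cost of this approach grows with both tree size and collection size.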
Evaluation of Rule-Base Trees (CS3) • Static, bottom-up analysis. • Construct Minimal Term Set (MTS). • Propagation of terms. • CS3 user-interface.
MTS: Minimal Term Set • An MTS for a topic is a set of terms such that a document containing every term in the set receives a retrieval status value (RSV) greater than 0; a document that satisfies no MTS receives an RSV of 0. • A topic can have more than one MTS. • A user can choose among those MTSs to run a search that suits their needs.
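One way to derive the MTSs of a topic statically, by propagating terms bottom-up through the rule-base tree, is sketched below. The tuple-based tree encoding, the function name `minimal_term_sets`, and the way weights are carried along (edge weight times child weight, MIN across an AND) are assumptions for illustration; the slides do not specify how CS3 does this.

```python
from itertools import product
from typing import FrozenSet, List, Tuple

# Hypothetical encoding: ("TERM", term) or (op, [(edge_weight, child), ...]),
# where op is "AND" or "OR" and every inner node has at least one child.
MTS = Tuple[FrozenSet[str], float]


def minimal_term_sets(node) -> List[MTS]:
    """Return (term set, weight) pairs by propagating terms from the leaves up."""
    op = node[0]
    if op == "TERM":
        return [(frozenset([node[1]]), 1.0)]
    if op not in ("AND", "OR"):
        raise ValueError(f"unknown node type: {op!r}")
    # Term sets of each child, with the edge weight applied (an assumption).
    per_child = [
        [(terms, w * weight) for terms, weight in minimal_term_sets(child)]
        for w, child in node[1]
    ]
    if op == "OR":
        # Any single child's term set is enough to satisfy an OR node.
        return [m for child_sets in per_child for m in child_sets]
    # AND: pick one term set from every child and merge them.
    merged = []
    for combo in product(*per_child):
        terms = frozenset().union(*(t for t, _ in combo))
        merged.append((terms, min(wt for _, wt in combo)))
    return merged


tree = ("OR", [
    (1.0, ("TERM", "hydrogeology")),
    (0.5, ("TERM", "dnapl")),
    (0.25, ("AND", [
        (1.0, ("TERM", "colloid")),
        (1.0, ("TERM", "environmental transport")),
    ])),
])

result = minimal_term_sets(tree)
for terms, weight in result:
    print(sorted(terms), weight)
```

For this tree the sketch yields three MTSs: {hydrogeology}, {dnapl}, and {colloid, environmental transport}. It does not prune a term set that happens to be a superset of another, which a full implementation would need to do to keep the sets minimal.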
CS3 and Boolean Search Engines • CS3 is designed to interface with existing Boolean search engines. • U.S. Department of Energy’s “Information-Bridge” search engine. • U.S. Department of Transportation’s “National Transportation Library” search engine.
System Architecture • [Architecture diagram. Labeled components: Client (Java applet), CORBA, CGI, Server (Java), Server (Java/C++), JDBC, Oracle, DOE InfoBridge, etc. …]
Information-Bridge and CS3 • Search request: Boolean vs. concept. • Output: non-ranked vs. ranked. • Calculation of RSV: • Given a document D and the set S of MTS expressions satisfied by D, the RSV of D equals the sum of the weights of all expressions in S plus the maximum weight in S.
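The RSV rule above translates directly into a few lines of code. In this sketch the function name `rsv`, the representation of an MTS as a (term set, weight) pair, and the sample weights and documents are assumptions made for the example.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple


def rsv(doc_terms: Set[str], mts_list: Iterable[Tuple[FrozenSet[str], float]]) -> float:
    """RSV = sum of the weights of all satisfied MTSs + the maximum such weight."""
    # An MTS is satisfied when every one of its terms appears in the document.
    satisfied = [weight for terms, weight in mts_list if terms <= doc_terms]
    if not satisfied:
        return 0.0
    return sum(satisfied) + max(satisfied)


# Made-up MTSs and weights for illustration.
mts_list = [
    (frozenset({"hydrogeology"}), 1.0),
    (frozenset({"dnapl"}), 0.5),
    (frozenset({"colloid", "environmental transport"}), 0.25),
]

docs: Dict[str, Set[str]] = {
    "d1": {"dnapl", "colloid", "environmental transport"},
    "d2": {"hydrogeology"},
    "d3": {"colloid"},
}

# Rank documents by descending RSV, which a plain Boolean engine cannot do.
ranking = sorted(docs, key=lambda d: rsv(docs[d], mts_list), reverse=True)
print(ranking)  # ['d2', 'd1', 'd3']
```

Here d2 scores 1.0 + 1.0 = 2.0, d1 scores (0.5 + 0.25) + 0.5 = 1.25, and d3 satisfies no MTS and scores 0.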
Information-Bridge and CS3 (Example) • Boolean search request (“Environmental Science Network” Form): • (“Hydrogeology” OR “Dnapl” OR (“Colloid*” AND “Environmental Transport”)). • Concept (CS3): • “Hydrogeology”. • Rule-Base Tree.
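To illustrate how a concept could be turned into a request a Boolean engine accepts, the sketch below renders a list of MTSs as a Boolean expression. The helper names are hypothetical and the output syntax simply mirrors the query string shown on this slide; the actual query syntax of Information-Bridge and the way CS3 issues its sub-queries are not given in the slides.

```python
from typing import Iterable, List


def mts_to_boolean(terms: Iterable[str]) -> str:
    """Render one non-empty MTS as a conjunction of quoted terms."""
    quoted = [f'"{t}"' for t in sorted(terms)]
    return quoted[0] if len(quoted) == 1 else "(" + " AND ".join(quoted) + ")"


def combined_query(mts_list: List[Iterable[str]]) -> str:
    """Join all MTSs of a concept with OR into a single Boolean request."""
    return "(" + " OR ".join(mts_to_boolean(t) for t in mts_list) + ")"


# Term sets taken from the Boolean request shown on the slide.
hydrogeology_mts = [
    {"Hydrogeology"},
    {"Dnapl"},
    {"Colloid*", "Environmental Transport"},
]

query = combined_query(hydrogeology_mts)
print(query)
# ("Hydrogeology" OR "Dnapl" OR ("Colloid*" AND "Environmental Transport"))
```

Sending each MTS as its own sub-query instead of one combined request would let the client record which MTSs each document satisfies, which is the information the RSV calculation needs.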
Deep Web Sources • Also referred to as the hidden Web or invisible Web. • Content resides in databases behind search forms, e.g. monster.com, louisiana1st.com, PubMed. • Deep Web pages are generated dynamically in response to submitted queries. • Not indexed by current search engines, which index content on the surface Web.
Deep Web Sources and Concept-based Retrieval • Deep Web in terms of size and quality: • Size (Deep Web) = 500 * Size (Surface Web). • Quality (Deep Web) = 1000 * Quality (Surface Web). • Queries submitted to deep Web sources are more stable than queries submitted to search engines. • Concept-based retrieval is therefore a natural fit for deep Web sources.
Current and Future Work • Conduct experiments to evaluate effectiveness (future). • Investigate alternative methods to compute RSVs [KADR00, KDR01*]. • Learn edge weights through relevance feedback [KR00]. • Thesaurus-based rule-base generation [KLR00].
Relevant URLs • [LJRT99*] • RaghavanHome • Publications since 1991 • www.allinonenews.com