Search Strategies based on cluster-based indexing and retrieval
www.sophiasearch.com
Our Philosophy of Search
• A document collection can be viewed as consisting of many hundreds of thousands of documents (typical in a medium-sized enterprise).
• Subgroups of documents are related to each other by their general themes, which are discovered as clusters (SOPHIA [1, 2]).
• These themes can be further broken down into individual topics, represented as sub-clusters.
• By automatically discovering the themes present in the collection and breaking them down into topics, we can create intuitive groupings of “semantically” similar documents and present these to users.
• We provide a topical overview of the structure of the collection that enhances browsing.
Our Philosophy of Search
• During search, a theme can be viewed as consisting of one or more related topics. Themes can be accessed from the theme panel view of the collection.
• Each topic contains one or more documents relevant to that topic (and therefore to the theme itself). Documents are accessed via the topic panel view of the collection.
• Users browse from theme level to topic level and then choose the documents that are relevant.
• Users have varying search requirements, and we believe we should provide tools to support them. We therefore offer 3 main search scenarios: Focused Search, Blanket Search and Query by Example.
General Overview of Search
• Irrespective of which search scenario you use, type your terms into the query panel.
• You will be presented with a list of themes relevant to your search terms (Theme panel view).
• Using the theme descriptions presented, click on the one you are most interested in. This takes you to the topic panel.
• The topic panel provides an overview of the theme’s topics and a list of the documents belonging to the most relevant topic.
• A document can be clicked on and read (Document panel view), or a new topic can be clicked to examine the documents it contains.
• At any time you can go back to the original theme descriptions to browse another theme or try another query.
Focused and Blanket Search
A facility for both specific and speculative search.
Focused Search is ideal for finding specific information, when you know what you are looking for.
Blanket Search is suited to finding general (diverse) topics related to the search terms, so that you can explore them and select the most relevant ones for deeper analysis.
Query by Example allows a search based on a given document rather than a keyword query.
Blanket Search
• Compare the query with each cluster’s centroid using JS-divergence (Jensen-Shannon divergence), treating the query as a probability distribution over terms.
• Rank clusters in order of increasing divergence.
• A possible extension is to add a diversity measure to the ranking, so that different themes relevant to the query also appear near the top.
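The Blanket Search ranking above can be sketched as follows. This is a minimal illustration, not the SOPHIA implementation: it assumes each cluster centroid is available as a term probability distribution (a dict from term to probability), and the function names `term_distribution`, `js_divergence` and `rank_clusters` are hypothetical.

```python
import math
from collections import Counter


def term_distribution(terms):
    """Turn a list of query terms into a probability distribution over terms."""
    counts = Counter(terms)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two term distributions."""
    divergence = 0.0
    for term in set(p) | set(q):
        p_t, q_t = p.get(term, 0.0), q.get(term, 0.0)
        m_t = 0.5 * (p_t + q_t)  # mixture distribution
        if p_t > 0:
            divergence += 0.5 * p_t * math.log2(p_t / m_t)
        if q_t > 0:
            divergence += 0.5 * q_t * math.log2(q_t / m_t)
    return divergence


def rank_clusters(query_terms, centroids):
    """Return cluster names ordered by increasing divergence from the query.

    `centroids` maps a cluster name to its term distribution (an assumed
    representation, chosen only for this sketch).
    """
    query = term_distribution(query_terms)
    scored = [(js_divergence(query, centroid), name)
              for name, centroid in centroids.items()]
    return [name for _, name in sorted(scored)]


centroids = {
    "sport": {"rugby": 0.5, "match": 0.5},
    "finance": {"bank": 0.6, "loan": 0.4},
}
ranking = rank_clusters(["rugby"], centroids)
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (no shared terms), so a cluster that shares no terms with the query always ranks last.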
Select the Relevant Theme
1) Read theme names and descriptions
2) Scroll to read them all
3) Select the relevant theme
Focused Search
• Use an inverted index of terms/phrases to identify the documents relevant to the query, based on the frequency of the query terms within the documents in which they occur.
• Rank clusters by the proportion of relevant documents they contain, highest first.
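The two Focused Search steps above can be sketched as follows. This is an illustration under stated assumptions rather than the actual system: documents are plain strings split on whitespace, a document counts as relevant when it contains every query term (the index stores term frequencies, but how they are weighted is not specified here), and all function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def relevant_documents(index, query_terms):
    """Documents containing every query term (an assumed notion of relevance)."""
    postings = [set(index.get(term.lower(), {})) for term in query_terms]
    return set.intersection(*postings) if postings else set()


def rank_clusters_by_proportion(clusters, relevant):
    """Order clusters by the share of their documents that are relevant."""
    scored = []
    for name, doc_ids in clusters.items():
        hits = len(relevant & set(doc_ids))
        proportion = hits / len(doc_ids) if doc_ids else 0.0
        scored.append((name, proportion))
    # Highest proportion first; ties broken by cluster name.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


docs = {
    "d1": "rugby match report",
    "d2": "rugby ulster",
    "d3": "bank loan",
    "d4": "rugby final",
}
clusters = {"sport": ["d1", "d2", "d4"], "mixed": ["d3", "d4"]}

index = build_inverted_index(docs)
relevant = relevant_documents(index, ["rugby"])
ranking = rank_clusters_by_proportion(clusters, relevant)
```

Ranking by proportion rather than by raw count favours small, tightly focused clusters over large clusters that merely happen to contain a few matching documents.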
How to Explore from the Topic Panel
Once you have selected a theme for further analysis you are presented with a matrix of topics. Each topic has a size, a description and a colour. The size gives a visual indication of the relevance of the topic to the query terms, enabling the user to focus very quickly on the best topics for document retrieval. The most relevant topics are presented at the top left-hand side of the matrix. The description helps you understand the content of the topic.
How to Explore from the Topic Panel
Topics of the same colour are closely related in content. We refer to topics of the same colour as presenting similar aspects of the theme to the user. If the colours of 2 topics are different, we say they cover slightly different aspects of the theme. Initially the most relevant topic in a theme is automatically selected. The documents it contains are listed on the right-hand side of the screen and can be viewed 10 at a time. Based on the topic descriptions, the user may want to click on other topics within this theme. This displays the documents of the newly selected topic. Documents have titles and summaries associated with them (based around the original query terms used for the search). Using this information, a document can be selected and clicked on to display its contents.
How to Explore from the Theme Panel
Once you have entered your search terms and have a list of relevant themes presented (as in the previous slide), you can use the functionality offered by the Theme panel view to explore further. Use the Theme descriptions (left-hand side) combined with the Topic descriptions (right-hand side) to determine the most useful Theme for further analysis. Click on a Theme to get a more detailed overview of the different Topics it contains.
Topic Panel #2
• Colour indicates aspect.
• Current document page.
• Document summary, with keywords highlighted.
• Selected Topic. Its size indicates its relevance to the query: a bigger Topic is more relevant to the specified query.
• Click to display the full document.
Focused Search
A query is entered as with Blanket Search; just make sure the Focused Search radio button is active. Use the – character to indicate words you want to exclude. For example, the query
Rugby –Ulster –Ravenhill
returns the clusters that have the highest proportion of documents that contain Rugby but not Ulster or Ravenhill. If you omit the – characters, you will get back documents that contain all 3 terms.
Themes are presented using the same interface as before. Clicking on a theme presents the topics it contains, as before.
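The exclusion syntax described above could be handled along these lines. This is a sketch only: the real query parser is not described in these slides, the function names are hypothetical, and accepting both the typed hyphen-minus and the en dash shown on the slide is an assumption.

```python
# Characters accepted as the exclusion prefix (assumption): the hyphen-minus
# a user would type, and the en dash (U+2013) as it appears on the slide.
EXCLUDE_PREFIXES = "-\u2013"


def parse_focused_query(query):
    """Split a query into terms to include and terms to exclude."""
    include, exclude = [], []
    for token in query.split():
        if len(token) > 1 and token[0] in EXCLUDE_PREFIXES:
            exclude.append(token[1:].lower())
        else:
            include.append(token.lower())
    return include, exclude


def matches(text, include, exclude):
    """True if the text has every included term and no excluded term."""
    words = set(text.lower().split())
    return (all(term in words for term in include)
            and not any(term in words for term in exclude))


include, exclude = parse_focused_query("Rugby \u2013Ulster \u2013Ravenhill")
keeps_general = matches("rugby world cup preview", include, exclude)
keeps_ulster = matches("ulster rugby at ravenhill", include, exclude)
```

Without the prefix characters, all three words land in the include list, so only documents containing all 3 terms match, as described above.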
Make the First Query
1) Select Focused Search
2) Type a query
3) Press the “Search” button
Theme Panel View
1) Select the relevant theme
Query by Example
• This is a powerful and unique feature of our search engine.
• It enables you to present an example document, or a portion of one, as a query to retrieve topically similar documents.
• First create a text file containing the content you want to use as your exemplar document (paste the content into Notepad, found under Accessories, then save it to disk).
• Click the Query by Example radio button.
• Use the Browse button to select the location of the newly created text document.
• Then press Search.
• Results are presented using the now familiar theme-based approach, where the themes containing the most documents related to the concepts of the query document are ranked highest.
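The ranking idea behind Query by Example can be sketched as follows. The slides do not say how a document is judged to be "related" to the example, so this sketch swaps in a simple stand-in: cosine similarity over raw term counts with a hypothetical `threshold` parameter. All function names and the sample data are illustrative assumptions.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lower-case the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_themes_by_example(example_text, themes, threshold=0.2):
    """Rank themes by how many of their documents resemble the example.

    `themes` maps a theme name to a list of document texts; `threshold`
    is an assumed cut-off for calling a document related.
    """
    query = term_counts(example_text)
    counts = {
        theme: sum(1 for doc in docs
                   if cosine_similarity(query, term_counts(doc)) >= threshold)
        for theme, docs in themes.items()
    }
    # Most related documents first; ties broken by theme name.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


themes = {
    "football": ["Arsenal lost the final to Barcelona",
                 "Lehmann was sent off in the final"],
    "finance": ["The bank raised interest rates"],
}
example = "Lehmann sent off as Arsenal lose final to Barcelona"
ranking = rank_themes_by_example(example, themes)
```

In practice the example text would be read from the file selected with the Browse button, for instance with `open(path, encoding="utf-8").read()`.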
Query by Example
1) Press to locate the query text file on your local disk
2) Directory and name of the query document
3) Click to find topically similar documents
Query by Example
(1) Query Document
He is being hailed this morning as a tragic figure who might just have stepped from a Wagnerian opera. The German papers today expressed their sympathy with Jens Lehmann, whose "moment of madness" in the Champions League final between Arsenal and Barcelona led to him being sent off in the 18th minute, ultimately leading to Arsenal's 2-1 defeat. The papers all agree that Lehmann deserved to be punished after plucking at the boot of Barcelona's Samuel Eto'o. But there was criticism also in Germany of the Norwegian referee's decision to give Lehmann the red card. "The cleverest decision of referee Terje Hauge would have been to give the advantage and allow the goal for Barcelona - and to have warned Lehmann, the German number one," the Berliner Zeitung wrote this morning. It added that Lehmann's sending off "decimated" his team, a fate that Arsenal had not really "deserved".
Topics in Highest Ranked Theme
• Most relevant topic to the query
• Most conceptually relevant documents to the query within the best topic
References
• [1] Niall Rooney, David Patterson, Mykola Galushka, Vladimir Dobrynin: A scaleable document clustering approach for large document corpora. Inf. Process. Manage. 42(5): 1163-1175 (2006)
• [2] Vladimir Dobrynin, David W. Patterson, Mykola Galushka, Niall Rooney: SOPHIA: an interactive cluster-based retrieval system for the OHSUMED collection. IEEE Transactions on Information Technology in Biomedicine 9(2): 256-265 (2005)