Visualization in Text Information Retrieval Ben Houston Exocortex Technologies www.exocortex.org Zack Jacobson CAC
The Starting Goal • The Original Project Goal Can we come up with a graphical way of representing search results that is superior to text-only displays? Other VITA project members: Els Goyette Olivier Dagenais Sarah Rosser
A Text IR Interaction Model [Diagram labels: Query(s), User Interface, IR Search Proxy, Results, Document Collection, Browsing]
A Quantification of User Needs • Specific resource. • Has a particular book, web page in mind. • Specific information. • Needs a book on a particular subject matter which contains particular information. • Specific knowledge. • Needs to know about an unfamiliar subject matter.
A Quantification of User Needs • Specific resource. • Specific information. • Specific knowledge. IR is good at these tasks. IMHO visualization would be an unneeded hindrance. Maybe there is an opportunity here. There is a lot of information to sift through.
Formalizing Knowledge Search • There is a hypothetical set of relevant documents which the user would like: Dr • The user attempts to get the set Dr by making an initial guess and then refining it through a series of queries: q1, q2, … qn. • We can think of it as an iterative evolutionary hill climber. • Serial subgoals of finding qn+1 such that P(Dr|qn+1) > P(Dr|qn) • Thus… How can we help the user maximize P(Dr|q) as quickly as possible?
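The hill-climbing view of query refinement can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the prototypes' method: the toy INDEX, the boolean-AND search, and the use of precision as a stand-in for P(Dr|q) are all hypothetical. The sketch also knows Dr up front, which a real user does not.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical toy index: term -> ids of documents containing that term.
INDEX: Dict[str, Set[int]] = {
    "fish": {1, 2, 3, 4, 5, 6},
    "sturgeon": {1, 2, 3},
    "recipe": {2, 3, 7},
    "river": {1, 4, 8},
}


def search(query: FrozenSet[str]) -> Set[int]:
    """Return the documents that contain every query term (boolean AND)."""
    if not query:
        return set()
    postings = [INDEX.get(term, set()) for term in query]
    return set.intersection(*postings)


def score(query: FrozenSet[str], relevant: Set[int]) -> float:
    """Assumed stand-in for P(Dr|q): share of returned documents that are in Dr."""
    results = search(query)
    return len(results & relevant) / len(results) if results else 0.0


def refine(query: FrozenSet[str], relevant: Set[int]) -> List[FrozenSet[str]]:
    """Greedy hill climbing: keep adding the one term that raises the score most."""
    current = query
    history = [current]
    while True:
        candidates = [current | {t} for t in sorted(INDEX) if t not in current]
        best = max(candidates, key=lambda q: score(q, relevant), default=current)
        if score(best, relevant) <= score(current, relevant):
            return history  # no neighbouring query is better: local optimum
        current = best
        history.append(current)
```

Each entry in the returned history is one query qn, and each step satisfies score(qn+1) > score(qn), mirroring the serial subgoal on the slide.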
Don’t forget… popular IR problems. • Difficulty in formulating effective queries. • Average number of terms per query is about 1.5. • Words do not have a 1:1 mapping to semantic concepts. • Determining the relevance ranking of an individual document. • Going past just words. • How do you deal with 1 billion documents? • Did you know it’s more than doubling every year? • Databases/indices of 500+ GB each.
Our efforts • 1st Try: Bar charts. (Even 3D bar charts!) • Naïve first attempts – we won’t mention those. • 2nd Try: Concept-document clustering in an information space. • Two prototypes: NetViz & AutoViz; more should be developed.
The Major “Neat” Features • Focus on concrete representation of the query. • Use data-mining techniques before visualization. • Visual summaries. • An active model for interaction. • Bridging the gaps between “serial” queries. • Widening / narrowing to get context.
Location, Color, Size, Shape • Show each concept in a meaningful spatial relationship. • Show the specific results positioned in relation to the concepts.
Display Intra-result Structure Clustering on implicit/latent trends
Visual Document Summaries Instead of Let’s show intra-document concept co-occurrence
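One way to compute intra-document concept co-occurrence is to count how often two concepts appear in the same sentence. The sketch below assumes that definition, along with a naive sentence splitter and single-word concepts; the slides do not say how the prototypes compute it.

```python
import re
from collections import Counter
from itertools import combinations
from typing import Iterable, Tuple


def cooccurrence(text: str, concepts: Iterable[str]) -> Counter:
    """Count concept pairs that share a sentence (assumed co-occurrence unit)."""
    wanted = {c.lower() for c in concepts}
    counts: Counter = Counter()
    # Naive sentence split on ., ! and ?; good enough for an illustration.
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = set(re.findall(r"[a-z]+", sentence))
        present = sorted(wanted & words)
        # Each unordered pair is stored once, in alphabetical order.
        counts.update(combinations(present, 2))
    return counts
```

The resulting pair counts could drive a small per-document graph or matrix, with heavier links for concepts that appear together more often.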
Exploring within a result set Highlighting and extracting subsets. Each document has a probability distribution among the different clusters.
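The idea of a per-document distribution over clusters, and of extracting a subset from it, can be sketched as follows. The similarity scores, the normalization, and the min_prob cut-off are assumptions made for illustration, not details taken from NetViz or AutoViz.

```python
from typing import Dict, List


def membership(similarities: Dict[str, float]) -> Dict[str, float]:
    """Normalize non-negative cluster similarities into a probability distribution."""
    total = sum(similarities.values())
    if total == 0:
        return {cluster: 0.0 for cluster in similarities}
    return {cluster: value / total for cluster, value in similarities.items()}


def extract_subset(
    docs: Dict[str, Dict[str, float]], cluster: str, min_prob: float
) -> List[str]:
    """Return ids of documents whose probability for `cluster` is at least min_prob."""
    return [
        doc_id
        for doc_id, similarities in docs.items()
        if membership(similarities).get(cluster, 0.0) >= min_prob
    ]
```

Highlighting a cluster then amounts to calling extract_subset with that cluster's name and a threshold the user can adjust.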
Exploring outside a result set (Slightly Hypothetical) • Present three things to the user • Where the user is. (The City) • What is at the location the user is at. (The Sights) • What are related/nearby places. (The Highways) There is a mockup of this available on my website: www.exocortex.org/~ben/trendanalysis2.html
Bridging Serial Queries • Instead of requiring a user to judge each query as a separate entity, why not let the user see what changes in the results as they refine their query? • Currently we do serial searching with backtracking. • A potentiator for non-serial methods of exploration in a (Bayesian) “concept space” network. • P(Dr|qn+1) > P(Dr|qn) → P(Dr|f(qn+1, qn)) > P(Dr|f(qn))
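Showing the user what changes between two consecutive queries can be as simple as a set difference over the result lists. The function name and the three categories below are hypothetical, chosen only to illustrate the idea.

```python
from typing import Dict, Iterable, List


def result_changes(previous: Iterable[str], current: Iterable[str]) -> Dict[str, List[str]]:
    """Compare the result sets of two consecutive queries."""
    prev, curr = set(previous), set(current)
    return {
        "kept": sorted(prev & curr),     # documents present in both result sets
        "added": sorted(curr - prev),    # documents that only the refined query returns
        "removed": sorted(prev - curr),  # documents that the refinement dropped
    }
```

A display could then fade out the removed documents and highlight the added ones, so the refined query is judged in the context of the previous one.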
Widening / Narrowing Scope • Allowing for interactive narrowing or widening of the display by filtering on document relevance.
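Narrowing and widening by relevance can be sketched as a single threshold applied to per-document relevance scores. The scores and the threshold parameter are assumed inputs; the slides do not describe how the prototypes score relevance.

```python
from typing import Dict, List


def visible_documents(relevance: Dict[str, float], threshold: float) -> List[str]:
    """Return ids of documents at or above the threshold, most relevant first."""
    shown = [doc_id for doc_id, value in relevance.items() if value >= threshold]
    return sorted(shown, key=lambda doc_id: -relevance[doc_id])
```

Raising the threshold narrows the display to the strongest matches; lowering it widens the view to bring in more context.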
Browser Integration… of course Seamless browser integration. (hypothetical) Spawning of browsers.
Results and Predictions “Ad hoc” Results • Extracting / presenting intra-result set structure is extremely effective. • There is value in breaking free from serial queries. • Provide landmarks and easy exploratory interaction models. More work needed. ??? • The current browser interface is really limiting. • The underlying engine is more critical than visualizations. • Visual document summaries need more work. • Overactive (hyperactive) interfaces are hard to learn. • Ben’s Future of Text IR • Visualization is usually a fix for insufficient data-mining / algorithm techniques (in text IR). • Intra-result set clustering works in text-only displays too. It will be integrated into existing text search engines. • The metaphor of exploring information space will become more popular.
Hmm… Sturgeon tastes good. Want to try it? Download the prototypes! NetViz http://www.exocortex.org/netviz AutoViz http://www.exocortex.org/autoviz Comments? Email me! ben@exocortex.org