Information Retrieval in Department 1
Visit of the Scientific Advisory Board
Saarbrücken, June 2nd – 3rd, 2005
Holger Bast
Max-Planck-Institut für Informatik (MPII)
Saarbrücken, Germany
How it got started …
• I shifted from formerly very theoretical work …
• … to information retrieval topics
• Over time a number of PhD/Master/Bachelor students joined in …
  Regis Newo, Christian Klein, Benedikt Grundmann, Ingmar Weber, Daniel Fischer, Christian Mortensen, Thomas Warken
  … and a lot of interaction with Gerhard Weikum's group
  Josiane Parreira, Debapriyo Majumdar
What we are doing …
• Motivation
  • even basic retrieval tasks are still far from being solved satisfactorily, e.g. searching my email
• Two main research areas in the past 2 years
  • Concept-based retrieval
  • Searching with Autocompletion
• This presentation
  • main idea behind these areas
  • lots of demos and examples
  • highlight two results
Concept-Based Retrieval
[Figure labels: a query; a document expressed in terms]
Hawaii, 2nd June 2004
Dear Pen Pal, I am writing to you from Hawaii. They have got internet access right on the beach here, isn't that great? I'll go surfing now!
your friend, CB
Equally dissimilar to query!
Concept-Based Retrieval
[Figure labels: a query; query expressed in concepts; document expressed in concepts; a document expressed in terms]
Concept-Based Retrieval
[Figure labels: a concept expressed in terms; document expressed in concepts; a document expressed in terms]
Concept-Based Retrieval
• matrix multiplication
• Finding concepts = approximate low-rank matrix decomposition
• The approximation actually adds to the precision
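The mapping from terms to concepts by matrix multiplication can be sketched in a few lines. This is a minimal illustration, not the method from the talk: the vocabulary, the two concepts, and every number in `CONCEPTS` are hand-picked assumptions rather than the output of a real decomposition, and `to_concepts` simply multiplies by the transpose of that matrix.

```python
import math

# Hypothetical toy vocabulary (assumption, for illustration only).
TERMS = ["internet", "web", "beach", "surfing"]

# Hypothetical term-by-concept matrix: each column is one concept
# expressed in terms (column 0 ~ "online", column 1 ~ "seaside").
# The values are hand-picked, not computed from any collection.
CONCEPTS = [
    [0.7, 0.0],  # internet
    [0.7, 0.0],  # web
    [0.0, 0.7],  # beach
    [0.1, 0.7],  # surfing (belongs a little to both concepts)
]


def term_vector(words):
    """Express a text as term counts over TERMS."""
    return [float(words.count(t)) for t in TERMS]


def to_concepts(vec):
    """Express a term vector in concepts: multiply by CONCEPTS transposed."""
    k = len(CONCEPTS[0])
    return [sum(CONCEPTS[i][c] * vec[i] for i in range(len(TERMS)))
            for c in range(k)]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


query = term_vector(["internet"])
doc_web = term_vector(["web", "surfing"])
doc_beach = term_vector(["beach", "surfing"])

# In term space the query shares no word with either document.
term_sims = (cosine(query, doc_web), cosine(query, doc_beach))

# In concept space the two documents are no longer equally dissimilar.
concept_sims = (cosine(to_concepts(query), to_concepts(doc_web)),
                cosine(to_concepts(query), to_concepts(doc_beach)))
```

With these made-up numbers, both term-space similarities are zero, while in concept space the "web" document scores clearly higher than the "beach" document.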
A Concrete Example
• 676 abstracts from the Max-Planck-Institute
  • for example: "We present two theoretically interesting and empirically successful techniques for improving the linear programming approaches, namely graph transformation and local cuts, in the context of the Steiner problem. We show the impact of these techniques on the solution of the largest benchmark instances ever solved."
• 3283 words (words like and, or, this, … removed)
• abstracts come from 5 departments: Algorithms, Logic, Graphics, CompBio, Databases
• reduce to 10 concepts
How many concepts? (Bast/Majumdar, SIGIR 2005)
[Figure: three plots of relatedness against number of concepts (x-axis 0 to 600), one per term pair: logic / logics, voronoi / diagram, logic / voronoi]
• Implicitly, the matrix decomposition assigns a relatedness score to each pair of terms
→ every fixed number of concepts is wrong!
→ we instead assess the shape of the curves!
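One possible way to write down such a relatedness score, as a sketch under assumptions the slide does not state: take the decomposition to be a rank-$k$ truncated SVD of the term-document matrix $A$, and take the score of terms $i$ and $j$ to be the corresponding entry of $U_k U_k^{\top}$. The symbols $A$, $U_k$, $\Sigma_k$, $V_k$ and $\mathrm{rel}_k$ are introduced here for illustration only.

```latex
A \approx U_k \Sigma_k V_k^{\top},
\qquad
\mathrm{rel}_k(i,j) \;=\; \bigl(U_k U_k^{\top}\bigr)_{ij}
\;=\; \sum_{c=1}^{k} U_{ic}\, U_{jc}
```

Under this reading, plotting $\mathrm{rel}_k(i,j)$ against the number of concepts $k$ gives one curve per term pair, and fixing $k$ amounts to reading every curve at a single point.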
Searching with Autocompletion
• An interactive search technology
  • suggests completions of the word that is currently being typed
  • along with that, hits are displayed (for the yet to be completed query)
Best understood by example, and you can try it yourself via the new MPII webpages
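Suggesting completions can be sketched as a lookup in a sorted vocabulary, where all words sharing a prefix form one contiguous range. This is a minimal sketch, not the system shown in the talk: the function names and the toy vocabulary are hypothetical, and the upper bound relies on the assumption that no word contains the largest Unicode code point.

```python
from bisect import bisect_left


def completion_range(sorted_vocab, prefix):
    """Return (lo, hi) such that sorted_vocab[lo:hi] are the completions of prefix.

    Assumes sorted_vocab is sorted and that no word contains U+10FFFF,
    which is used here as an upper sentinel.
    """
    lo = bisect_left(sorted_vocab, prefix)
    hi = bisect_left(sorted_vocab, prefix + "\U0010ffff")
    return lo, hi


def completions(sorted_vocab, prefix):
    """List all vocabulary words that start with prefix."""
    lo, hi = completion_range(sorted_vocab, prefix)
    return sorted_vocab[lo:hi]


# Hypothetical toy vocabulary.
vocab = sorted(["raghavan", "raghavans", "raghu", "ranking", "retrieval"])
suggested = completions(vocab, "ragh")
```

Each further keystroke only narrows the range, so the lookup stays two binary searches regardless of how many completions there are.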
Useful in many ways
• Learn about formulations used in the collection
  • e.g. "guestbook"
• Minimum of information required
  • e.g. people's names
• Gives stemming functionality (without stemmer)
  • e.g. "raghavans", "raghavan3", …
• Gives error-correction functionality (without error-correction)
  • e.g. "raghvan", "ragavan", …
• Database-like queries
  • e.g. publications by Kurt Mehlhorn
All this with a single functionality: no dictionary, no training, readily applicable to any collection
The core algorithmic problem
• Given
  • a set of documents D (the hits of the preceding part of the query)
  • a range of words W (all completions of the last word the user has started typing)
• Compute
  • the subset of documents D' ⊆ D that contain at least one word from W
  • the subset of words W' ⊆ W that occur in at least one document of D
  • typically |W'| << |W|
D = 17, 23, 48, 116, …
Bast/Mortensen/Weber: ~|W'| time per query
Ordinary Inverted Index: ~|W| time per query
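The problem statement can be made concrete with the ordinary inverted index baseline, which visits the inverted list of every word in W. This is only a sketch of that baseline; the slide does not describe how the Bast/Mortensen/Weber method works, so it is not reproduced here. The function name, the toy index, and the document ids other than the four listed for D are hypothetical.

```python
def complete_with_inverted_index(inverted_index, sorted_vocab, docs, word_range):
    """Baseline: compute D' and W' by visiting one inverted list per word in W.

    inverted_index maps each word to the ids of the documents containing it.
    word_range is (lo, hi), so that W = sorted_vocab[lo:hi].
    The number of lists visited is |W|, even if only few words end up in W'.
    """
    lo, hi = word_range
    doc_set = set(docs)
    d_prime = set()
    w_prime = []
    for word in sorted_vocab[lo:hi]:
        hits = doc_set.intersection(inverted_index[word])
        if hits:
            w_prime.append(word)
            d_prime |= hits
    return sorted(d_prime), w_prime


# Hypothetical toy index: word -> ids of documents containing it.
index = {
    "raghavan": [23, 99],
    "raghavans": [48],
    "raghu": [5, 7],
    "ragtime": [200],
}
vocab = sorted(index)

D = [17, 23, 48, 116]   # hits of the preceding part of the query
W = (0, len(vocab))     # all four words complete the prefix "rag"

D_prime, W_prime = complete_with_inverted_index(index, vocab, D, W)
```

Here four inverted lists are scanned although only two words survive in W', which is the gap between ~|W| and ~|W'| that the slide points at.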