Web Search APIs: The Next Generation. Panel Discussion at WWW 2009. Panelists: Nicholas Weininger, Google; Jordi Ribas, Microsoft; Vik Singh, Yahoo; Andrei Broder, Yahoo; Soumen Chakrabarti, IIT Bombay; plus serious users of search APIs in the audience. PLEASE FILL OUT THE SURVEY!
Web Search APIs: The Next Generation
Panel Discussion at WWW 2009
http://soumen.cse.iitb.ac.in/~soumen/doc/www2009/

Panelists
• Nicholas Weininger, Google
• Jordi Ribas, Microsoft
• Vik Singh, Yahoo
• Andrei Broder, Yahoo
• Soumen Chakrabarti, IIT Bombay
• Plus, serious users of search APIs in the audience

PLEASE FILL OUT THE SURVEY!
http://www.surveymonkey.com/s.aspx?sm=hTpnhl32_2bi1_2bZDqxRmgjzg_3d_3d
… or the “shorter” URL below:
http://soumen.cse.iitb.ac.in/~soumen/doc/www2009/

APIs today
• Keyword search, phrases, word include/exclude
• Vertical restriction by pages, sites, domains
• Other predicates over structured metadata
• Language, date, file type, other filters
• Image and news search
• Rank promotion/demotion
• Geocodes, hCard, hCalendar, metadata, …

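As an illustration of the features listed above, here is a minimal sketch of how a client might assemble such a query string. The function `build_query` and its parameters are hypothetical, and the operator spellings (`+word`, `-word`, quoted phrases, `site:`, `filetype:`) are assumed conventions; each real API defines its own syntax.

```python
def build_query(include=(), exclude=(), phrases=(), site=None, filetype=None):
    """Assemble a keyword query string from structured parts.

    The operator syntax used here is an assumption for illustration only.
    """
    parts = [f"+{word}" for word in include]      # words that must appear
    parts += [f"-{word}" for word in exclude]     # words that must not appear
    parts += [f'"{phrase}"' for phrase in phrases]  # exact phrases
    if site:
        parts.append(f"site:{site}")              # vertical restriction by site
    if filetype:
        parts.append(f"filetype:{filetype}")      # file type filter
    return " ".join(parts)


query = build_query(
    include=["search", "api"],
    exclude=["ads"],
    phrases=["panel discussion"],
    site="iitb.ac.in",
    filetype="pdf",
)
print(query)
```

Everything beyond this kind of string assembly (ranking, counts, sampling) stays inside the engine, which is the starting point for the questions on the following slides.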
The high-level question
Does a search API amount to
• A structured wrapper around black box matching, selection, scoring and ranking functions?
• Access paths to low-level data structures and indices like postings and page metadata?
• Access to indices over not only text but also annotations of named entities at the level of token segments?

The high-level question
Does a search API amount to …
• An extended query language with more control over structure and schema matches?
• The above, plus a set of scoring, ranking and sampling libraries with open and precise semantics?
• Plus sandboxed annotation, ranking and aggregation opportunities?

Expressing the query
• Control over case, abbreviations, senses
  • +cat +rm +ls +pneumonia
  • Murine Pneumonia Virus: Seroepidemiological Evidence of Widespread ... Neutralisation of respiratory syncytial virus by cat serum. ... Pringle, C. R., Wilkie, M. L. & Elliott, R. M. (1985). A ... Richardson-Wyatt, L. S. ...
• Control over lexical proximity
  • Same sentence, same paragraph, same HTML block vs. "..."
  • Combining proximity between matches

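The `+cat +rm +ls +pneumonia` example can be reproduced with a toy analyzer. The sketch below assumes, hypothetically, that the engine folds case and collapses dotted initials ("R. M." becomes "rm"); the slide does not say how any real engine tokenizes. The switches `fold_case` and `collapse_initials` and the helper `within_window` are illustrative names, not part of any existing API.

```python
import re

SNIPPET = (
    "Murine Pneumonia Virus: Seroepidemiological Evidence of Widespread ... "
    "Neutralisation of respiratory syncytial virus by cat serum. ... "
    "Pringle, C. R., Wilkie, M. L. & Elliott, R. M. (1985). A ... "
    "Richardson-Wyatt, L. S. ..."
)


def analyze(text, fold_case=True, collapse_initials=True):
    """Tokenize text under two hypothetical analyzer switches."""
    if collapse_initials:
        # Join runs of two or more dotted capitals: "R. M." becomes "RM".
        text = re.sub(
            r"\b(?:[A-Z]\.\s?){2,}",
            lambda m: re.sub(r"[.\s]", "", m.group(0)) + " ",
            text,
        )
    toks = re.findall(r"[A-Za-z0-9]+", text)
    return [t.lower() for t in toks] if fold_case else toks


def matches_all(text, terms, **switches):
    """True if every query term is a token of the analyzed text."""
    present = set(analyze(text, **switches))
    return all(term in present for term in terms)


def within_window(toks, a, b, max_gap):
    """True if some occurrences of a and b are at most max_gap tokens apart."""
    pos_a = [i for i, t in enumerate(toks) if t == a]
    pos_b = [i for i, t in enumerate(toks) if t == b]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)


QUERY = ["cat", "rm", "ls", "pneumonia"]
print(matches_all(SNIPPET, QUERY))                           # lenient analyzer
print(matches_all(SNIPPET, QUERY, fold_case=False))          # case preserved
print(matches_all(SNIPPET, QUERY, collapse_initials=False))  # initials kept apart
print(within_window(analyze(SNIPPET), "cat", "pneumonia", 5))
```

Under the lenient settings the virology snippet satisfies a query that was meant to be about Unix commands. Turning off either switch, or requiring the terms to fall within a small token window, rejects it, which is the kind of control the slide asks for.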
Expressing the query
• Embedding named entities and entity classes/types
• Catalog of entity names and types?
• Two-phase query protocol?
• Snippet/sentence-level vs. document-level focus

Low-level access primitives
• Consistent document counts
  • count(w1 ∨ w2) ≥ count(wi) ≥ count(w1 ∧ w2)
  • count(superType) ≥ count(subType) ≥ count(instance)
  • (Cannot do this unless types, instances are structured)
  • Within a time window? For a given user account?

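The two inequalities above can be checked mechanically on the client side. The sketch below is a minimal illustration: the toy `POSTINGS` index and the function names are hypothetical, and exact set sizes stand in for the estimated counts a real API would report.

```python
# Toy inverted index: term -> set of document ids (hypothetical data).
POSTINGS = {
    "search": {1, 2, 3, 5},
    "api": {2, 3, 4},
}


def boolean_counts_consistent(count_or, term_counts, count_and):
    """Check count(w1 OR w2) >= count(wi) >= count(w1 AND w2) for every wi."""
    return all(count_or >= c >= count_and for c in term_counts)


def chain_consistent(counts):
    """Check a non-increasing chain, e.g. superType >= subType >= instance."""
    return all(a >= b for a, b in zip(counts, counts[1:]))


c_or = len(POSTINGS["search"] | POSTINGS["api"])
c_and = len(POSTINGS["search"] & POSTINGS["api"])
term_counts = [len(POSTINGS["search"]), len(POSTINGS["api"])]

print(boolean_counts_consistent(c_or, term_counts, c_and))
print(chain_consistent([100, 40, 7]))    # made-up type/subtype/instance counts
print(chain_consistent([100, 120, 7]))   # a subtype larger than its supertype
```

Exact set operations always satisfy the first check; it becomes informative only when the counts are independent estimates, which is the situation the slide is concerned with.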
Low-level access primitives
• Sampling primitives
  • 5000 URLs/sentences/snippets uniformly at random from all that contain both a person name and ice hockey
  • Non-uniform samples using PageRank or ... ?
• Corpus and snippet access
  • 10 tokens on each side of person name
  • With all annotations associated with tokens

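A minimal sketch of the two primitives above, assuming the engine can stream the matching items once: uniform sampling is done with reservoir sampling (Algorithm R), a standard technique chosen here for illustration and not something the slides prescribe, and the snippet primitive is a plain token window. The names `reservoir_sample` and `context_window` are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng):
    """Return k items chosen uniformly at random from a stream of unknown length."""
    sample = []
    for n, item in enumerate(stream):
        if n < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, n)        # inclusive on both ends
            if j < k:
                sample[j] = item         # replace with probability k / (n + 1)
    return sample


def context_window(tokens, start, end, width=10):
    """Return up to `width` tokens on each side of the span tokens[start:end]."""
    left = tokens[max(0, start - width):start]
    right = tokens[end:end + width]
    return left, right


rng = random.Random(0)
matching_ids = range(100_000)            # stand-in for matching URLs or snippets
sample = reservoir_sample(matching_ids, 5000, rng)
print(len(sample))

tokens = "the coach praised Jane Doe after the ice hockey final".split()
print(context_window(tokens, 3, 5, width=2))   # span covers "Jane Doe"
```

A non-uniform variant (for example, weighted by PageRank) would need a weighted sampling scheme instead; that is left out here.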
Control over scoring and aggregation
• Queries over API vs. queries via textbox
  • Contributions from page text match, anchor text match, PageRank, site rank, spam rank, proximity credits ...
  • Appropriate for all/most API queries?
• Rank pages, or
  • Snippets, token spans, entities, quantities, ...?

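One way to picture caller-controlled scoring is a weighted sum over named signals, with the weights supplied per query. This is only a sketch: real engines do not necessarily combine signals linearly, and the feature names, weights and page values below are made up for illustration.

```python
def score(features, weights):
    """Weighted sum of signal values; signals without a weight contribute 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def rank(pages, weights):
    """Sort pages by descending score under the given weights."""
    return sorted(pages, key=lambda p: score(p["features"], weights), reverse=True)


PAGES = [
    {"url": "A", "features": {"text_match": 0.9, "pagerank": 0.2}},
    {"url": "B", "features": {"text_match": 0.5, "pagerank": 0.9}},
]

# Hypothetical weight profiles: a textbox-style default and an API override.
TEXTBOX_WEIGHTS = {"text_match": 1.0, "pagerank": 1.0}
API_WEIGHTS = {"text_match": 1.0}        # caller switches off PageRank

print([p["url"] for p in rank(PAGES, TEXTBOX_WEIGHTS)])
print([p["url"] for p in rank(PAGES, API_WEIGHTS)])
```

The same two pages come back in opposite orders under the two profiles, which is why the slide asks whether the textbox defaults suit API queries.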
Control over scoring and aggregation
• Type=professor, words={"quantum computing", encryption}
  • Lexical distance from mention of instance of professor to "quantum computing", to "encryption"
  • Mentions of same entity on multiple pages
• Type=duration, words={drive, Madrid, Barcelona}
  • How to aggregate evidence from multiple pages?
  • Evidence of relationships within snippets

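The professor example can be sketched as follows, assuming each mention arrives with the token distance to the nearest match of each query word. The decay function 1/(1 + distance), the summation across pages, and the placeholder entities are all assumptions made for illustration; the slides leave the aggregation rule open.

```python
from collections import defaultdict

# (entity, page, {query word: token distance from the mention to that word}).
# Entities, pages and distances are placeholders, not real data.
MENTIONS = [
    ("Prof. A", "page1", {"quantum computing": 3, "encryption": 12}),
    ("Prof. A", "page2", {"quantum computing": 1}),
    ("Prof. B", "page3", {"encryption": 2}),
]


def mention_score(distances):
    """Closer query words contribute more; words absent nearby contribute nothing."""
    return sum(1.0 / (1 + d) for d in distances.values())


def aggregate(mentions):
    """Sum mention scores per entity across pages and rank entities."""
    totals = defaultdict(float)
    for entity, _page, distances in mentions:
        totals[entity] += mention_score(distances)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


for entity, total in aggregate(MENTIONS):
    print(entity, round(total, 3))
```

For a quantity type such as duration, the per-entity sum would be replaced by some consensus over the extracted values (a median, for instance), which is again a design choice rather than something the slides settle.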
Points for discussion
• Do search APIs already give features beyond wrapping regular search?
• Usually there is a search API for internal use, and another, if any, for external use. What are the main differences between these?
• What do real API users ask for? What of those are provided and what are not? Reasons?
• Which advanced features mentioned above are practical? Which are not? Why?
• Where is the incentive?

Points for discussion
• Indexing, ranking and sampling algorithms needed to support search APIs today.
• Ditto, if we are to extend to a more structured language and other features you or I suggest.
• What can/should API users expect two years out regarding the evolution of APIs?
• APIs for ad engines: similarities and differences.