
TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets SIGMOD ‘11 C. Chen et al



  1. Pete Bohman, Adam Kunk • TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets • SIGMOD ‘11 • C. Chen et al.

  2. What is real-time search? What do you think as a class?

  3. Real-Time Search • Definition: A search mechanism capable of finding information in an online fashion as it is produced.

  4. Real-Time Search • In terms of real-time search, what does “online” mean? • Online means that a constant stream of input data is handled as it enters the system, contrary to batch processing • Bing Social Search

  5. Real-Time Search Input Data • Example of what kind of input data is considered for real-time search systems: • twittervision

  6. Real-time content • Microblogging - Entirely new type of data • Short temporal life span • Little to no context • Simple ideas, fast reporting of events • Metadata: time, location, social links • Less factual, more opinionated • Static posts • Furious input rate • Often no hyperlink structure, few traditional ranking factors

  7. Real-time vs. Conventional Search • Conventional Search Ranking • Relevance • Authority • Real-time Search Ranking • Relevance • Temporal immediacy • Popularity

  8. Real-time vs. Conventional Search • Conventional search input • Crawl the web periodically and update index • Web documents evolve • Incapable of crawling and indexing the entire web in real-time • Real-time search input • Stream of data. • No need to poll since the posts are static • What can we do with real-time search engines?

  9. Query analysis • Collecta real-time search engine • Analyzed ~1 Million queries • Continuous Queries • Monitor events by frequently resubmitting the same query • Different query categories

  10. Value of real-time search • The estimated value of real-time search is around $33 Million • Value derived from types of queries entered in real-time search systems • Utilized adwords to determine worth of keywords appearing in queries

  11. Applications of real-time search • TwitterStand: Real-time news reports • Crowd sourcing of first hand reports • Example: Coverage of MJ’s death

  12. Applications of real-time search • Real-time alert systems • Leverages tweet metadata (time, location) to raise alerts • Earthquake localization based on tweets

  13. Twitter Real-Time Alerts USGS Twitter Earthquake Detector

  14. Difficulties of Real-Time Search • Two factors: • Efficient indexing in order to provide for fast results • Effective ranking in order to return relevant results

  15. Indexing Background • RDBMS Indexing • Indexes built on columns commonly used in queries • Improves the speed of retrieval operations • Conventional Search (Inverted) Indexing • Crawl the web for documents • Map keywords to the documents containing those keywords • Unstructured data • If a document does not exist in the index, it will not appear in query results

  16. Real-Time Search Indexing • Index stream of data • Map keywords to tweets containing those keywords • Challenge • Processing the stream in a timely manner • 5,000 tweets per second
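
The keyword-to-tweet mapping on this slide can be illustrated with a toy inverted index. This is a minimal sketch, not the TI implementation: the class name `TweetIndex`, the whitespace tokenizer, and the assumption that tweet IDs increase over time are all illustrative choices that do not come from the slides.

```python
from collections import defaultdict


class TweetIndex:
    """Toy in-memory inverted index: keyword -> list of tweet IDs."""

    def __init__(self):
        self.postings = defaultdict(list)

    def add(self, tweet_id, text):
        # Naive lowercase/whitespace tokenization (an assumption for brevity).
        for term in set(text.lower().split()):
            self.postings[term].append(tweet_id)

    def search(self, query):
        # Return IDs of tweets containing every query term, highest ID first.
        # Assumes IDs grow over time, so highest ID means newest tweet.
        terms = query.lower().split()
        if not terms:
            return []
        ids = set(self.postings.get(terms[0], []))
        for term in terms[1:]:
            ids &= set(self.postings.get(term, []))
        return sorted(ids, reverse=True)


index = TweetIndex()
index.add(1, "Earthquake in Tokyo")
index.add(2, "Big earthquake felt downtown")
index.add(3, "Lunch in Tokyo")
print(index.search("earthquake"))
print(index.search("earthquake tokyo"))
```

Every call to `add` touches one posting list per distinct term, which is why doing this for thousands of tweets per second becomes the bottleneck the slide describes.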

  17. TI Indexing • Not feasible to index every incoming tweet immediately • Selective indexing based on results that are most likely to appear in queries • Distinguished tweets indexed in real-time • Noisy tweets indexed by batch process

  18. TI Tweet Classification • Observation • Users are only interested in the top-K results for a query • Distinguished tweet • A tweet that belongs in the top-K result set of a previous query • Noisy tweet • A tweet that does not appear in the top-K results for any of the system’s previous queries
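
The distinguished/noisy split can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: `CachedQuery`, `classify`, the value `K = 2`, and the use of a single query-independent score per tweet are all hypothetical. Each cached query keeps a min-heap of its current top-K scores, and a tweet is distinguished if it would enter the top-K of at least one matching query.

```python
import heapq

K = 2  # assumed size of the top-K result set


class CachedQuery:
    """A previously seen query plus the scores of its current top-K tweets."""

    def __init__(self, terms):
        self.terms = set(terms)
        self.top_scores = []  # min-heap holding at most K scores

    def would_enter_top_k(self, score):
        return len(self.top_scores) < K or score > self.top_scores[0]

    def record(self, score):
        if len(self.top_scores) < K:
            heapq.heappush(self.top_scores, score)
        elif score > self.top_scores[0]:
            heapq.heapreplace(self.top_scores, score)


def classify(tweet_terms, score, queries):
    """Return 'distinguished' if the tweet enters any cached query's top-K."""
    distinguished = False
    for q in queries:
        # A tweet matches a query when it contains all of the query's terms.
        if q.terms <= tweet_terms and q.would_enter_top_k(score):
            q.record(score)
            distinguished = True
    return "distinguished" if distinguished else "noisy"


queries = [CachedQuery(["earthquake"])]
stream = [
    ({"earthquake", "tokyo"}, 0.9),
    ({"earthquake"}, 0.8),
    ({"lunch"}, 0.99),
    ({"earthquake", "small"}, 0.1),
]
labels = [classify(terms, score, queries) for terms, score in stream]
print(labels)
```

In this sketch, distinguished tweets would go to the real-time index immediately, while noisy tweets would be queued for the batch process.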

  19. TI Indexing • Must limit the size of the query set • 1.6 Billion twitter queries per day

  20. Query set optimization • Observation • 20% of queries represent 80% of user requests • Therefore • Zipf’s distribution is used to statistically limit the number of queries that tweets are compared against
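
One way to picture the query set reduction: keep only the most frequent queries until they account for a target share of all requests. The function name `select_query_set`, the 0.8 coverage default, and the sample log are assumptions made for illustration; the slides do not say how TI chooses its cutoff.

```python
from collections import Counter


def select_query_set(query_log, coverage=0.8):
    """Pick the most frequent queries until they cover `coverage` of requests."""
    counts = Counter(query_log)
    total = sum(counts.values())
    selected, covered = [], 0
    for query, n in counts.most_common():
        if total and covered / total >= coverage:
            break
        selected.append(query)
        covered += n
    return selected


# Hypothetical skewed log: a few queries dominate, as a Zipf-like law predicts.
log = ["earthquake"] * 6 + ["election"] * 2 + ["lunch", "cats"]
print(select_query_set(log))
```

With a skewed log like this one, half of the distinct queries already cover 80% of the requests, so incoming tweets only need to be compared against that smaller set.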

  21. Real-Time Search Ranking • How does ranking differ from traditional web ranking? • There are no social relationships in traditional web pages • Typical web search engines rank based on links to a site, and links from a site • Website links are not the same as social networking links

  22. Real-Time Search Ranking • Ranking is not necessary in an RDBMS • An RDBMS does not favor certain data over others based on select criteria • An RDBMS essentially treats all data in the database the same

  23. TI Ranking • Ranking function comprised of: 1) User’s PageRank • Combination of user weight (defaulted to 1) and how many followers they have (popularity) 2) Timestamp (self-explanatory) 3) Similarity between tweet and the query

  24. TI Ranking • Ranking function also comprised of: 4) Popularity of the topic • Determined by large tweet trees • Popularity of tree is equal to the sum of the U-PageRank values of all tweets in the tree Tweet Tree Structure
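
The four ranking components can be combined in many ways, and the slides do not give the formula. The sketch below is therefore a hypothetical combination: a weighted sum of the user's rank, tree popularity, and similarity, divided by an age factor. Only `tree_popularity` follows the slide directly (the sum of the U-PageRank values of all tweets in the tree); the `Tweet` fields, the weights, and the hourly decay are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Tweet:
    author_rank: float   # U-PageRank of the author (assumed precomputed)
    timestamp: float     # seconds since epoch
    similarity: float    # tweet/query similarity in [0, 1]
    replies: list = field(default_factory=list)  # child tweets in the tree


def tree_popularity(root):
    """Sum of the U-PageRank values of every tweet in the tweet tree."""
    return root.author_rank + sum(tree_popularity(r) for r in root.replies)


def score(tweet, root, now, w_user=1.0, w_pop=1.0, w_sim=1.0):
    # Hypothetical combination: weighted sum damped by the tweet's age in hours.
    age_hours = max(now - tweet.timestamp, 0.0) / 3600.0
    quality = (w_user * tweet.author_rank
               + w_pop * tree_popularity(root)
               + w_sim * tweet.similarity)
    return quality / (1.0 + age_hours)


reply = Tweet(author_rank=0.2, timestamp=3600.0, similarity=0.5)
root = Tweet(author_rank=0.5, timestamp=0.0, similarity=0.9, replies=[reply])
print(tree_popularity(root))
print(score(root, root, now=3600.0), score(reply, root, now=3600.0))
```

Dividing by age is just one way to express temporal immediacy; an exponential decay or a hard time window would fit the slide's description equally well.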

  25. TI Ranking Comparison Vs. Time Rank TI Rank

  26. What are others doing?

  27. What are others doing? • Facebook • Real-Time Feed

  28. Implications/Conclusion • Real-time search engines must provide: • “Online” algorithms to handle constant input • Relevant search results • Results of a query are no longer static

  29. Implications/Conclusion • TI makes use of two concepts in its real-time search of Twitter: • Selective Indexing • A form of partial indexing; TI can’t afford to index every incoming tweet immediately due to the large volume of input • Ranking • Ranking is a known technique, but microblogging applications call for new ranking algorithms

  30. References • TI: An Efficient Indexing Mechanism for Real-Time Search on Tweets • http://www.comp.nus.edu.sg/~ooibc/sigmod11ti.pdf • Real Time Search User Behavior • http://faculty.ist.psu.edu/jjansen/academic/jansen_real_time_search.pdf • TwitterRank: Finding Topic-Sensitive Influential Twitterers • http://ink.library.smu.edu.sg/cgi/viewcontent.cgi?article=1503&context=sis_research • Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors • http://ymatsuo.com/papers/www2010.pdf • TwitterStand: News in Tweets • http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.148.1477&rep=rep1&type=pdf • Learning Effective Ranking Functions for Newsgroup Search • http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.92.5556&rep=rep1&type=pdf • TwitterSearch: A Comparison of Microblog Search and Web Search • http://www.stanford.edu/~dramage/papers/twitter-wsdm11.pdf • TwitterVision • http://twittervision.com/ • Bing Social • http://www.bing.com/social • Real time search on the web: Queries, topics, and economic value • http://collecta.com/RealTimeSearch.pdf

  31. Discussion Questions • 1) What do you think is the most innovative technique in the TI approach that led to real-time microblog search results?

  32. Discussion Questions • 2) Given the partial indexing optimization provided in the paper, how do you think Google could optimize their indexing algorithm in order to capture the newest content on the web?

  33. Discussion Questions • 3) TI makes use of a ranking function in order to select tweets based on various user characteristics. What would you change about the ranking function, if anything?
