
Modeling and Managing Content Changes in Text Databases


Presentation Transcript


  1. Modeling and Managing Content Changes in Text Databases
  Alexandros Ntoulas (UCLA), Panos Ipeirotis (New York University), Junghoo Cho (UCLA), Luis Gravano (Columbia University)

  2. Metasearchers Provide Access to Text Databases
  • Large number of hidden-web databases available
  • Contents not accessible through Google
  • Need to query each database separately
  • Broadcasting queries to all databases not feasible (~100,000 DBs)
  [Figure: a metasearcher routes the query “thrombopenia” to databases such as PubMed, NYTimes Archives, and USPTO]

  3. Metasearchers Provide Access to Text Databases
  Database selection relies on simple content summaries: vocabulary and word frequencies.
  [Figure: the metasearcher decides where to route “thrombopenia” by comparing summary frequencies: PubMed 26,887 · NYTimes Archives 42 · USPTO 0]

  4. Extracting Content Summaries from Text Databases
  For hidden-web databases (query-only access):
  • Send queries to the database
  • Retrieve top matching documents
  • Use the document sample as the database representative
  For “crawlable” databases:
  • Retrieve documents by following links (crawling)
  • Stop when all documents are retrieved
  Content summary contains:
  • Words in the sample (or crawl)
  • Document frequency of each word in the sample (or crawl)
  Example: PubMed (11,868,552 documents)
  Word          #Documents
  aids          123,826
  cancer        1,598,896
  heart         706,537
  hepatitis     124,320
  thrombopenia  26,887
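A minimal sketch of the query-based sampling step, assuming a hypothetical search_database(query, k) helper that returns the top-k matching documents; an illustration, not the paper’s exact procedure:

```python
import random
from collections import Counter

def sample_summary(search_database, query_pool, num_queries=100, k=4):
    """Build an approximate content summary of a query-only database.

    search_database: hypothetical callable(query, k) returning the top-k
                     matching documents, each as an iterable of words
    query_pool:      candidate single-word probe queries
    Returns a mapping word -> document frequency within the sample.
    """
    sample = []
    for query in random.sample(query_pool, min(num_queries, len(query_pool))):
        sample.extend(search_database(query, k))
    summary = Counter()
    for doc in sample:
        summary.update(set(doc))  # count each word once per document
    return summary
```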

  5. Never-update Policy
  • Current practice: construct the summary once, never update it
  • The extracted (old) summary may:
  • Miss new words (from new documents)
  • Contain obsolete words (from deleted documents)
  • Provide inaccurate frequency estimates
  Example, NY Times archive:
  Word      Old summary (Oct 29, 2004)   Current (Mar 29, 2005)
  tsunami   (0)                          250
  recount   2,302                        (0)
  grokster  2                            78

  6. Research Challenge
  Updating summaries is costly!
  Challenge:
  • Maintain good quality of summaries, and
  • Minimize the number of updates
  • If summaries do not change → problem solved!
  • If summaries change → estimate the rate of change and schedule updates

  7. Outline
  • Do content summaries change over time?
  • Which database properties affect the rate of change?
  • How to schedule updates with constrained resources?

  8. Data for our Study: 152 Web Databases
  • Randomly picked from the Open Directory
  • Multiple domains
  • Multiple topics
  • Searchable (to construct summaries by querying)
  • Crawlable (to retrieve full contents)
  Examples: www.wsj.com, www.intellihealth.com, www.fda.gov, www.si.edu, …

  9. Data for our Study: 152 Web Databases
  Study period: Oct 2002 – Oct 2003
  • 52 weekly snapshots for each database
  • ~5 million pages in each snapshot
  • 65 GB per snapshot (3.3 TB total)
  For each week and each database, we built:
  • A complete summary (by scanning all pages)
  • An approximate summary (by query-based sampling)

  10. Measuring Changes over Time
  • Recall: how many words in the current summary are also in the old (extracted) summary?
  • Shows how well old summaries cover the current (unknown) vocabulary
  • Higher values are better
  • Precision: how many words in the old (extracted) summary are still in the current summary?
  • Shows how many obsolete words exist in the old summaries
  • Higher values are better
  Results are for complete summaries (similar for approximate ones); a sketch of the two measures follows below.
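A minimal sketch of the two measures over the summaries’ vocabularies (summaries as word → frequency mappings); the toy values are hypothetical:

```python
def vocabulary_recall_precision(old_summary, current_summary):
    """Recall: fraction of the current vocabulary covered by the old summary.
    Precision: fraction of the old vocabulary still in the current summary."""
    old, cur = set(old_summary), set(current_summary)
    common = old & cur
    return len(common) / len(cur), len(common) / len(old)

# Toy example with hypothetical summaries:
old = {"recount": 2302, "grokster": 2}
cur = {"tsunami": 250, "grokster": 78}
recall, precision = vocabulary_recall_precision(old, cur)  # 0.5, 0.5
```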

  11. Summaries over Time: Conclusions
  • Databases (and their summaries) are not static
  • The quality of old summaries deteriorates over time
  • Quality decreases for both complete and approximate content summaries (see paper for details)
  How often should we refresh the summaries?

  12. Outline
  • Do content summaries change over time?
  • Which database properties affect the rate of change?
  • How to schedule updates with constrained resources?

  13. Survival Analysis
  Survival analysis: a collection of statistical techniques for predicting “the time until an event occurs”
  • Initially used to measure the length of survival of patients under different treatments (hence the name)
  • Used to measure the effect of different parameters (e.g., weight, race) on survival time
  • We want to predict the “time until the next update” and find database properties that affect this time

  14. Survival Analysis for Summary Updates
  • “Survival time of summary”: the time until the current database summary is “sufficiently different” from the old one (i.e., an update is required)
  • The old summary changes at time t if: KL divergence(current, old) > τ, where τ is a change-sensitivity threshold
  • Survival analysis estimates the probability that a database summary changes within time t (a sketch of the change test follows below)
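A minimal sketch of the change test, with add-one smoothing so that words missing from one summary do not produce infinite divergence (the smoothing choice is an assumption, not necessarily the paper’s):

```python
import math

def kl_divergence(current, old):
    """KL(current || old) between the word distributions of two
    summaries (word -> document frequency mappings)."""
    vocab = set(current) | set(old)
    c_total = sum(current.values()) + len(vocab)
    o_total = sum(old.values()) + len(vocab)
    kl = 0.0
    for w in vocab:
        p = (current.get(w, 0) + 1) / c_total  # add-one smoothing
        q = (old.get(w, 0) + 1) / o_total
        kl += p * math.log(p / q)
    return kl

def needs_update(current, old, tau):
    return kl_divergence(current, old) > tau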

  15. Modeling Goals
  • Goal: estimate a database-specific survival time distribution
  • The exponential distribution S(t) = exp(−λt) is common for survival times
  • λ captures the rate of change
  • Need to estimate λ for each database
  • Preferably, infer λ from database properties (with no “training”)
  • Intuitive (and wrong) approach: data + multiple regression
  • The study contains a large number of “incomplete” observations
  • The target variable S(t) is typically not Gaussian

  16. Survival Times and “Incomplete” Data
  [Figure: “survival times” for a database, by week; several observations are marked as “censored” cases that reach week 52, the end of the study]
  • Many observations are “incomplete” (aka “censored”)
  • Censored data give partial information (the database did not change)

  17. Using “Censored” Data
  [Figure: three fitted curves S(t): best fit ignoring censored data, best fit using censored data “as-is”, and best fit properly using censored data]
  • By ignoring censored cases we get (under)estimates → we perform more update operations than needed
  • By using censored cases “as-is” we get (again) underestimates
  • Survival analysis “extends” the lifetime of “censored” cases (see the sketch below)
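For intuition, a minimal sketch of the exponential case: with right-censoring, the maximum-likelihood rate counts only observed change events but keeps the censored exposure time, which is exactly what “extends the lifetime” of censored cases. The numbers are hypothetical.

```python
def fit_exponential_rate(durations, observed):
    """MLE of the exponential rate λ under right-censoring:
    λ = (# observed change events) / (total time under observation).
    Censored rows contribute exposure time but no event."""
    return sum(observed) / sum(durations)

# Hypothetical data: three summaries changed after 10, 25, and 40 weeks;
# two were still unchanged at week 52 (censored, observed = 0).
durations = [10, 25, 40, 52, 52]
observed = [1, 1, 1, 0, 0]
lam = fit_exponential_rate(durations, observed)
mean_weeks_to_change = 1 / lam  # ≈ 59.7 weeks; dropping the censored rows
                                # would give 25 weeks, an underestimate
```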

  18. Database Properties and Survival Times
  For our analysis, we use Cox proportional hazards regression:
  • Uses “censored” data effectively (i.e., the database did not change within time T)
  • Derives the effect of database properties on the rate of change
  • E.g., “if you double the size of a database, it changes twice as fast”
  • No assumptions about the form of the survival function
  A sketch of such a fit appears below.
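A minimal sketch of such a fit using the lifelines library; the column names and values are hypothetical placeholders for the database properties the talk examines.

```python
import pandas as pd
from lifelines import CoxPHFitter

# One row per database: time until the summary changed (or was censored).
df = pd.DataFrame({
    "weeks":    [10, 25, 40, 52, 52],  # survival time in weeks
    "changed":  [1, 1, 1, 0, 0],       # 0 = censored (no change by week 52)
    "log_size": [5.1, 6.8, 7.4, 4.2, 4.9],
    "is_gov":   [0, 0, 0, 1, 1],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="weeks", event_col="changed")
cph.print_summary()  # hazard ratios: how each property scales the rate of change
```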

  19. Cox PH Regression Results
  Examined the effect of:
  • Change-sensitivity threshold τ (higher τ → longer survival)
  • Topic (does not matter, except for health-related sites)
  • Size (larger databases change faster!)
  • Number of words (does not matter)
  • Differences between summaries extracted in consecutive weeks (sites that changed frequently in the past change frequently in the future)
  • Domain (details on the next slide)

  20. Baseline Survival Functions by Domain
  Effect of domain:
  • GOV changes more slowly than any other domain
  • EDU changes fast in the short term, but more slowly in the long term
  • COM and other commercial sites change faster than the rest

  21. Results of Cox PH Analysis
  • Cox PH analysis gives a formula for predicting the time between updates for any database (standard form below)
  • The rate of change depends on:
  • domain
  • database size
  • history of change
  • threshold τ
  By knowing the time between updates, we can schedule update operations better!
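For reference, the Cox model in its standard form, where x collects the database properties (domain indicators, size, change history, threshold τ), h₀ is the baseline hazard, and S₀ the baseline survival function:

```latex
h(t \mid \mathbf{x}) = h_0(t)\, e^{\beta^\top \mathbf{x}},
\qquad
S(t \mid \mathbf{x}) = S_0(t)^{\,e^{\beta^\top \mathbf{x}}}
```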

  22. Outline
  • Do content summaries change over time?
  • Which database properties affect the rate of change?
  • How to schedule updates with constrained resources?

  23. Deriving an Update Policy
  • Naïve policy:
  • Updates all databases at the same time (i.e., assumes identical change rates)
  • Suboptimal use of resources
  • Our policy:
  • Uses the change rate as predicted by survival analysis
  • Exploits database-specific estimates of the rate of change

  24. Scheduling Updates
  • With plentiful resources, we update sites according to their rate of change
  • When resources are constrained, we update sites that change “too frequently” less often (one plausible allocation is sketched below)
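As a toy illustration (a naive heuristic, not the paper’s exact policy): split a fixed weekly update budget across databases in proportion to their estimated rates of change.

```python
def schedule_updates(rates, budget):
    """Split a weekly update budget across databases in proportion to
    their estimated change rates λ (naive heuristic, for illustration).

    rates:  mapping database -> estimated changes per week
    budget: total number of updates we can afford per week
    Returns database -> updates per week (interval = 1 / frequency).
    """
    total = sum(rates.values())
    return {db: budget * lam / total for db, lam in rates.items()}

freqs = schedule_updates(
    {"www.wsj.com": 0.9, "www.fda.gov": 0.1, "www.si.edu": 0.25}, budget=5
)
# Fast-changing sites get most of the budget; with scarcer resources,
# sites that change "too frequently" are necessarily updated less often
# than their change rate alone would demand.
```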

  25. Scheduling Results
  • Clever scheduling improves the quality of summaries (according to KL divergence, precision, and recall)
  • Our policy allows users to optimally select change thresholds according to the available resources, or vice versa (see paper)

  26. Updating Content Summaries: Contributions
  • Extensive experimental study (1 year, 152 databases): established the need to periodically update statistics (summaries) for text databases
  • Change frequency model: showed that database characteristics can predict the time between updates
  • Scheduling algorithms: devised update policies that exploit the “survival model” and use available resources efficiently

  27. Current and Future Work
  • Current:
  • Compared with machine learning techniques
  • Applied the technique to web crawling
  • Future:
  • Apply survival analysis to refreshing DB statistics (materialized views, index statistics, …)
  • Examine the efficiency of survival analysis models
  • Create generative models for modeling database changes

  28. Thank you! Questions?

  29. Related Work
  • Brewington & Cybenko, WWW9 and IEEE Computer, 2000
  • Cho & Garcia-Molina, VLDB 2000, SIGMOD 2000, ACM TOIT 2003
  • Coffman et al., Journal of Scheduling, 1998
  • Olston & Widom, SIGMOD 2002

  30. Measuring Changes over Time
  • KL divergence: how similar is the word distribution in the old and current summaries?
  • Identical summaries: KL = 0
  • Higher values are worse
  Results are for complete summaries (similar for approximate ones); the formula is given below.
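Treating the current and old summaries as word distributions p and q over the vocabulary V:

```latex
KL(p \,\|\, q) = \sum_{w \in V} p(w) \log \frac{p(w)}{q(w)}
```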
