Modeling and Managing Content Changes in Text Databases
Alexandros Ntoulas, UCLA
Panos Ipeirotis, New York University
Junghoo Cho, UCLA
Luis Gravano, Columbia University
Metasearchers Provide Access to Text Databases
• Large number of hidden-web databases available
• Contents not accessible through Google
• Need to query each database separately
• Broadcasting queries to all databases not feasible (~100,000 DBs)
[Figure: a metasearcher routes the query "thrombopenia" to hidden-web databases such as PubMed, NYTimes Archives, and USPTO]
Metasearchers Provide Access to Text Databases
• Database selection relies on simple content summaries: vocabulary, word frequencies
[Figure: the metasearcher inspects each database's content summary to decide where to send "thrombopenia" — PubMed lists 26,887 documents containing the word, NYTimes Archives 42, USPTO 0]
Extracting Content Summaries from Text Databases
For hidden-web databases (query-only access):
• Send queries to the database
• Retrieve top matching documents
• Use the document sample as the database representative
For "crawlable" databases:
• Retrieve documents by following links (crawling)
• Stop when all documents are retrieved
A content summary contains:
• Words in the sample (or crawl)
• Document frequency of each word in the sample (or crawl)
Example — PubMed (11,868,552 documents):
Word          #Documents
aids          123,826
cancer        1,598,896
heart         706,537
hepatitis     124,320
thrombopenia  26,887
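To make the query-based step concrete, here is a minimal sketch of sampling-based summary construction. The `db.search()` wrapper and the seed words are assumptions for illustration only; real hidden-web interfaces vary per database.

```python
import collections
import random

def sample_content_summary(db, seed_words, num_queries=100, top_k=4):
    """Build an approximate content summary by query-based sampling.

    `db` is any object with a .search(word) method returning a list of
    (doc_id, text) matches -- a hypothetical wrapper around the
    database's query form.
    """
    sampled_docs = {}               # doc_id -> set of words in the document
    vocabulary = list(seed_words)   # pool of candidate query words
    for _ in range(num_queries):
        word = random.choice(vocabulary)
        for doc_id, text in db.search(word)[:top_k]:   # top matching docs
            words = set(text.lower().split())
            sampled_docs[doc_id] = words
            vocabulary.extend(words)                   # grow the query pool
    # Document frequency of each word within the sample
    df = collections.Counter()
    for words in sampled_docs.values():
        df.update(words)
    return df
```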
Never-update Policy
• Current practice: construct the summary once, never update it
• The extracted (old) summary may:
  • Miss new words (from new documents)
  • Contain obsolete words (from deleted documents)
  • Provide inaccurate frequency estimates
Example — NY Times archive summaries:
Word      Oct 29, 2004   Mar 29, 2005
tsunami   (0)            250
recount   2,302          (0)
grokster  2              78
Research Challenge
Updating summaries is costly!
Challenge:
• Maintain good quality of the summaries, and
• Minimize the number of updates
• If summaries do not change → problem solved!
• If summaries change → estimate the rate of change and schedule updates
Outline
• Do content summaries change over time?
• Which database properties affect the rate of change?
• How to schedule updates with constrained resources?
Data for our Study: 152 Web Databases
• Randomly picked from the Open Directory
• Multiple domains
• Multiple topics
• Searchable (to construct summaries by querying)
• Crawlable (to retrieve the full contents)
Examples: www.wsj.com, www.intellihealth.com, www.fda.gov, www.si.edu, …
Data for our Study: 152 Web Databases
Study period: Oct 2002 – Oct 2003
• 52 weekly snapshots for each database
• Approx. 5 million pages in each snapshot
• 65 GB per snapshot (3.3 TB total)
For each week and each database, we built:
• A complete summary (by scanning all pages)
• An approximate summary (by query-based sampling)
Measuring Changes over Time
• Recall: How many words in the current summary are also in the old (extracted) summary?
  • Shows how well old summaries cover the current (unknown) vocabulary
  • Higher values are better
• Precision: How many words in the old (extracted) summary are still in the current summary?
  • Shows how many obsolete words exist in the old summaries
  • Higher values are better
Results are for complete summaries (similar for approximate ones).
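As a quick illustration (my own sketch using unweighted vocabulary overlap; the paper also considers frequency-weighted variants), recall and precision reduce to set intersections over the two vocabularies:

```python
def summary_recall(old_summary, current_summary):
    """Fraction of the current vocabulary covered by the old summary."""
    old_words, cur_words = set(old_summary), set(current_summary)
    return len(old_words & cur_words) / len(cur_words)

def summary_precision(old_summary, current_summary):
    """Fraction of the old vocabulary that is still current."""
    old_words, cur_words = set(old_summary), set(current_summary)
    return len(old_words & cur_words) / len(old_words)

# Toy example with the NY Times words from the earlier slide
old = {"recount": 2302, "grokster": 2}
cur = {"tsunami": 250, "grokster": 78}
print(summary_recall(old, cur), summary_precision(old, cur))  # 0.5 0.5
```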
Summaries over Time: Conclusions
• Databases (and their summaries) are not static
• The quality of old summaries deteriorates over time
• Quality decreases for both complete and approximate content summaries (see paper for details)
How often should we refresh the summaries?
Outline
• Do content summaries change over time?
• Which database properties affect the rate of change?
• How to schedule updates with constrained resources?
Survival Analysis
Survival analysis: a collection of statistical techniques for predicting "the time until an event occurs"
• Initially used to measure the length of survival of patients under different treatments (hence the name)
• Also used to measure the effect of different parameters (e.g., weight, race) on survival time
• We want to predict the "time until the next update" and find the database properties that affect this time
Survival Analysis for Summary Updates
• "Survival time of a summary": time until the current database summary is "sufficiently different" from the old one (i.e., an update is required)
• The old summary changes at time t if: KL divergence(current, old) > τ, where τ is a change-sensitivity threshold
• Survival analysis estimates the probability that a database summary changes within time t
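A small sketch of this change test, assuming summaries are word → frequency dictionaries and using simple additive smoothing to avoid zero probabilities (the paper's exact estimation details may differ):

```python
import math

def to_distribution(freqs, vocab, eps=1e-9):
    """Turn word -> frequency counts into a smoothed distribution over vocab."""
    total = sum(freqs.values()) + eps * len(vocab)
    return {w: (freqs.get(w, 0) + eps) / total for w in vocab}

def kl_divergence(current, old):
    """KL(current || old) between two word-frequency summaries."""
    vocab = set(current) | set(old)
    p = to_distribution(current, vocab)
    q = to_distribution(old, vocab)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab)

def needs_update(current, old, tau):
    """The summary 'dies' the first time the divergence exceeds tau."""
    return kl_divergence(current, old) > tau
```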
Modeling Goals
• Goal: estimate a database-specific survival time distribution
• The exponential distribution S(t) = exp(-λt) is a common model for survival times
  • λ captures the rate of change
  • Need to estimate λ for each database
  • Preferably, infer λ from database properties (with no "training")
• Intuitive (and wrong) approach: data + multiple regression
  • The study contains a large number of "incomplete" observations
  • The target variable S(t) is typically not Gaussian
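Under the exponential model, the maximum-likelihood estimate of λ with right-censored observations has a simple closed form: the number of observed changes divided by the total time under observation. A minimal sketch (an illustration, not the paper's exact estimator):

```python
def estimate_change_rate(durations, observed):
    """MLE of the exponential rate lambda with right censoring:
    lambda_hat = (#observed changes) / (total time under observation).

    durations: weeks until the summary changed, or until the study ended
    observed:  True if a change was seen, False if the case is censored
    """
    events = sum(observed)
    total_time = sum(durations)
    return events / total_time

# Three databases: two changed (weeks 10 and 20), one censored at week 52
lam = estimate_change_rate([10, 20, 52], [True, True, False])
print(lam)       # 2 / 82 ~ 0.024 changes per week
print(1 / lam)   # mean time between updates ~ 41 weeks
```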
Survival Times and "Incomplete" Data
[Figure: "survival times" for a database, plotted week by week; an X marks an observed change, and cases with no change by week 52 (end of study) are censored]
• Many observations are "incomplete" (aka "censored")
• Censored data give partial information (the database did not change)
Using "Censored" Data
[Figure: three fitted survival curves S(t) — best fit ignoring censored data, best fit using censored data "as-is", and best fit using censored data properly]
• By ignoring censored cases we get underestimates → we perform more update operations than needed
• By using censored cases "as-is" we (again) get underestimates
• Survival analysis "extends" the lifetime of the "censored" cases
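One way to see the difference is to fit a survival curve with an estimator that handles censoring properly. The sketch below uses the lifelines library and made-up data; it is one possible tool, not the paper's prescribed one:

```python
from lifelines import KaplanMeierFitter

durations = [10, 20, 35, 52, 52]              # weeks until change or end of study
observed  = [True, True, True, False, False]  # False = censored at week 52

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=observed)
print(kmf.survival_function_)  # estimated S(t); censored cases extend, not end, lifetimes
```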
Database Properties and Survival Times
For our analysis, we use Cox Proportional Hazards regression:
• Effectively uses "censored" data (i.e., the database did not change within time T)
• Derives the effect of database properties on the rate of change
  • E.g., "if you double the size of a database, it changes twice as fast"
• Makes no assumptions about the form of the survival function
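A sketch of what such a regression looks like with lifelines' CoxPHFitter; the covariates and values below are hypothetical stand-ins for the properties the study examines:

```python
import pandas as pd
from lifelines import CoxPHFitter

# Hypothetical rows: one per database, with survival time, censoring flag,
# and candidate covariates (log of size, change-sensitivity threshold tau, ...)
df = pd.DataFrame({
    "weeks":    [5, 10, 20, 30, 35, 45, 52, 52],
    "changed":  [1, 1, 1, 1, 1, 1, 0, 0],        # 0 = censored at end of study
    "log_size": [7.9, 7.3, 6.8, 6.1, 5.9, 5.2, 4.8, 4.6],
    "tau":      [0.5, 0.5, 0.5, 1.0, 0.5, 1.0, 1.0, 1.0],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="weeks", event_col="changed")
cph.print_summary()  # hazard ratios: effect of each property on the rate of change
```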
Cox PH Regression Results
We examined the effect of:
• Change-sensitivity threshold τ: higher τ → longer survival
• Topic: does not matter, except for health-related sites
• Size: larger databases change faster!
• Number of words: does not matter
• Differences between summaries extracted in consecutive weeks: sites that changed frequently in the past change frequently in the future
• Domain: details in the next slide
Baseline Survival Functions by Domain
Effect of domain:
• GOV changes more slowly than any other domain
• EDU changes fast in the short term, but more slowly in the long term
• COM and other commercial sites change faster than the rest
Results of Cox PH Analysis
• Cox PH analysis gives a formula for predicting the time between updates for any database
• The rate of change depends on:
  • domain
  • database size
  • history of change
  • threshold τ
By knowing the time between updates, we can schedule update operations better!
Outline
• Do content summaries change over time?
• Which database properties affect the rate of change?
• How to schedule updates with constrained resources?
Deriving an Update Policy
• Naïve policy:
  • Updates all databases at the same time (i.e., assumes identical change rates)
  • Suboptimal use of resources
• Our policy:
  • Uses the change rate as predicted by survival analysis
  • Exploits database-specific estimates of the rate of change
Scheduling Updates
• With plentiful resources, we update sites according to their rate of change
• When resources are constrained, we update sites that change "too frequently" less often (see the sketch below)
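A toy version of the allocation step, assuming the survival model has already produced a predicted change rate per database. The sub-linear (square-root) weighting is a heuristic stand-in for the paper's optimization, chosen because it deliberately shortchanges very fast changers:

```python
import math

def schedule_updates(change_rates, budget):
    """Allocate a weekly update budget across databases.

    change_rates: db_name -> predicted changes per week (from the survival model)
    budget:       total updates we can afford per week

    With plentiful resources, update each database at its predicted change
    rate. Under constraint, allocate proportionally to the square root of the
    rate, so sites that change 'too frequently' get relatively fewer updates.
    """
    total = sum(change_rates.values())
    if total <= budget:
        return dict(change_rates)          # plentiful resources: match the rates
    weights = {db: math.sqrt(r) for db, r in change_rates.items()}
    scale = budget / sum(weights.values())
    return {db: w * scale for db, w in weights.items()}

rates = {"wsj.com": 2.0, "fda.gov": 0.2, "si.edu": 0.05}
print(schedule_updates(rates, budget=1.0))
# wsj.com gets ~0.68 updates/week instead of the ~0.89 a proportional split would give
```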
Scheduling Results
• Clever scheduling improves the quality of the summaries (according to KL, precision, and recall)
• Our policy allows users to optimally select change thresholds according to the available resources, or vice versa (see paper)
Updating Content Summaries: Contributions
• Extensive experimental study (1 year, 152 databases): established the need to periodically update statistics (summaries) for text databases
• Change frequency model: showed that database characteristics can predict the time between updates
• Scheduling algorithms: devised update policies that exploit the "survival model" and use the available resources efficiently
Current and Future Work
• Current:
  • Compared with machine learning techniques
  • Applied the technique to web crawling
• Future:
  • Apply survival analysis to refreshing database statistics (materialized views, index statistics, …)
  • Examine the efficiency of survival analysis models
  • Create generative models of database changes
Thank you! Questions?
Related Work
• Brewington & Cybenko, WWW9 and IEEE Computer, 2000
• Cho & Garcia-Molina, VLDB 2000, SIGMOD 2000, ACM TOIT 2003
• Coffman et al., Journal of Scheduling, 1998
• Olston & Widom, SIGMOD 2002
Measuring Changes over Time
• KL divergence: How similar is the word distribution in the old and current summaries?
  • Identical summaries: KL = 0
  • Higher values are worse
Results are for complete summaries (similar for approximate ones).