
Web Warehouse: Non-Transparent Cache with Weak Storage Capacity Bound

Yahiko Kambayashi, Kai Cheng, Sinotaro Hirano. Graduate School of Informatics, Kyoto University, Japan.


Presentation Transcript


  1. Web Warehouse: Non-Transparent Cache with Weak Storage Capacity Bound
     Yahiko Kambayashi, Kai Cheng, Sinotaro Hirano
     Graduate School of Informatics, Kyoto University, Japan

  2. Motivation

  3. Background: What Can Data Management Technologies Contribute?
     Cache/Data duplication is important
     [Figure: Internet content grows x9/year; Internet backbone bandwidth grows x1.4/year]
     Web Characteristics
     • Open
     • Everyone Can Publish Content Freely
     • No Centralized “Data Dictionary”
     • Biased Usage
     • Dynamically Changing “Hot-Spots”
     • The Web Content Doubles Every 3-6 Months, while Bandwidth Grows Only x1.4/Year
     Result: Increasing Gap Between Increasing Traffic and Available Bandwidth

  4. Cache is Everywhere!! Caches in Multi-Tiered Web Architecture
     [Figure: Client Caches, Proxy Caches, Internet, Front-End Caches, Web Server Caches, Application Server, Mid-Tier Caches, DB]

  5. Major Results

  6. Cache and Web: Assumptions on cache algorithms are no longer true
     Traditional Cache
     • Simple and Fast Algorithms (e.g., LRU) are Required
     • Strict Limitation on Storage Size
     • Transparency (You Cannot See the Contents for Efficient Use)
     • All data are treated equally when retrieved
     Web Environment
     • Complicated Algorithms are Permitted
     • Disks can be Used and Cache Size is No Longer a Limiting Factor
     • Data in the Cache can be used (Non-Transparent)
     • Most of the cache storage is occupied by unimportant data

  7. Databases/Data Warehouses and Web: Cache+Database may be suitable for web applications
     Traditional Databases
     • A large amount of selected data is stored following the DB schema
     • Data are shared and a user-friendly query interface is supplied
     • The DB is designed according to properties of the data, not the applications. Application-specific characteristics are handled by query processors
     • Query processors usually do not use statistics of past queries
     Web Environment
     • A large amount of data should be shared
     • Dynamically Changing Hot Spots should be handled by an Advanced Self-Organizing Structure (Composite Pages, Linked Pages)
     • Similar Contents Tend to be Accessed in Succession
     • Usage data are important and should be shared. They are also used to define dynamically changing priorities

  8. Contribution of This Paper
     1. Dynamically changing, strongly biased hot-spot data: Self-Organizing Capability. Priority is determined by past data usage and popular topics. Topic Sensors detect global hot topics in the Web.
     2. Handling of link structures: Priority is determined by the logical web structure.
     3. Small reuse ratio in web cache / many similar topics: Non-transparency. Use of DB systems for web cache.
     4. Priority-based file organization: Unlike LRU (where priority is determined at the last moment), the initial value of priority should be determined when a web page is retrieved. Data with high priority will be located in fast-access storage.
     5. Application-sensitive data management: Store and manage both data and usage data (metadata). Usage/popularity-aware queries.

  9. Data Management Systems
     [Figure labels: Data Sources, Selection, DW, Modeling, DB, Storage, FS, Retrieval]
     Past Usage Patterns are not used in conventional DBMS

  10. System Overview
     • A New Data System, Called Capacity Bound-free Web Warehouse (CBFWW)
     • A Cache without Capacity Bounds, Capable of Storing All Important Data
     • A Data Management System with Priority Decision and Usage Data
     [Figure labels: Priority-Based Data Selection; Web Object Hierarchy Model; Self-Organizing Storage Management; Popularity-Aware Query/Navigation; Priority Decision; Description; Topic Sensor; Storage; Retrieval; Self-Organization]

  11. Overview of Web Cache

  12. History of Web Cache Research
     Existing Research Focuses on
     • Getting More Hits (Collaborative Cache, Caching the Uncacheable)
     • Increasing Freshness of Each Hit (Consistency Management)
     History
     • Cache Algorithms (90s)
       • Replacement Algorithms, e.g., LRU, LFU, LRU-SIZE, SIZE, etc.
       • As Storage Space is No Longer a Limiting Factor: “Publish No More Papers on Cache Replacement Algorithms” (Panel Discussion, 2001 Web Cache Workshop)
     • Consistency Management, e.g., Client Polling, Server Invalidation
     • Caching of Uncacheable Contents
       • E.g., Using Proxylets, Active Cache (P. Cao, et al.)
     • Collaborative Caching
       • Hierarchical Cache (e.g., Harvest Project)

  13. Characteristics of Web Cache

  14. Factors for Web Cache Evaluation
     Traditional Factor
     • Recency: The More Recently an Object was Used, the More Likely It will be Used Again
     New Factors
     • Popularity: The More Popular an Object has been, the More Likely It will Get More Accesses in the Future
     • Size: Caching a Larger Object may Displace Many Smaller Ones
     • Update Frequency
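The four factors on this slide can be combined into a single keep-or-evict score. The following is only a minimal sketch: the slides do not give a formula, so the field names, the exponential recency decay, the `half_life` parameter, and the way the terms are multiplied are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class CachedObject:
    url: str
    size_bytes: int
    last_access: float      # time of the most recent access, in seconds
    access_count: int       # number of accesses seen so far
    updates_per_day: float  # observed update frequency at the origin


def cache_score(obj: CachedObject, now: float, half_life: float = 3600.0) -> float:
    """Return a score where higher means 'more worth keeping'.

    Illustrative weighting only (not taken from the slides):
    recency decays exponentially, popularity grows logarithmically,
    large and frequently updated objects are penalized.
    """
    recency = 0.5 ** ((now - obj.last_access) / half_life)
    popularity = math.log1p(obj.access_count)
    size_penalty = math.log2(obj.size_bytes + 2)
    update_penalty = 1.0 + obj.updates_per_day
    return recency * popularity / (size_penalty * update_penalty)
```

With this form, a small object that was used recently outranks a large object of equal popularity that has not been touched for a while, which matches the intent of the recency and size factors.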

  15. Algorithms for Web Cache

  16. Web Warehouse

  17. The Architecture of Web Warehouse
     [Figure components: Topic Sensor, Web Requester (Proxy), Recommender, Topic Manager, Constraint Manager, Data Analyzer, Priority Manager, Query Processor, Storage Manager, Version Manager, Data/Usage stores]
     Storage Levels
     • Memory
     • Disks
     • Tertiary Storage

  18. Most Contents in Web Caches are Never Reused: Data warehouse capability is required
     • 70% of HTML Files are Never Reused
     • Zipf’s Distribution
     • Data Obtained From a Large ISP, Kyoto-Inet
     • Only HTML Documents Considered
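A figure such as the 70% above can be measured from a proxy request log. The sketch below assumes the log has already been reduced to a plain list of requested URLs and treats an object as "never reused" when it was requested exactly once; both the input format and the function names are hypothetical, not the authors' tooling.

```python
from collections import Counter


def never_reused_fraction(requested_urls):
    """Fraction of distinct objects that were requested exactly once."""
    counts = Counter(requested_urls)
    if not counts:
        return 0.0
    requested_once = sum(1 for c in counts.values() if c == 1)
    return requested_once / len(counts)


def rank_frequency(requested_urls):
    """(rank, request count) pairs, most popular first.

    Plotting these on log-log axes is the usual way to check
    for a Zipf-like distribution.
    """
    counts = Counter(requested_urls)
    return [(rank, c) for rank, (_, c) in enumerate(counts.most_common(), start=1)]
```

For example, in the log `["a", "a", "a", "b", "c", "d"]` three of the four distinct objects are requested once, so the never-reused fraction is 0.75.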

  19. Data for Web Warehouse

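  20. Contents Corresponding to Logical Pages
     lgPath = d1, d2, d3 (Frequently-Used Path Toward d3)
     Logical Document: Title = (Anch_text1 + Anch_text2 + title), Body
     [Figure: d1 links to d2 through a hyperlink with anchor text Anch_text1; d2 links to d3 through a hyperlink with anchor text Anch_text2; d3 has a Title and a Body]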
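The construction of a logical document from a path can be sketched as follows. This assumes that "+" means concatenation with a space and that the body of the logical document is the body of the last physical document on the path; the slide does not state either point explicitly, and the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class PhysicalDoc:
    url: str
    title: str
    body: str


def logical_document(path, anchor_texts):
    """Build a logical document for a frequently used path d1 -> ... -> dn.

    anchor_texts[i] is the anchor text of the hyperlink from path[i]
    to path[i + 1], so there is one fewer anchor text than documents.
    """
    if len(anchor_texts) != len(path) - 1:
        raise ValueError("need exactly one anchor text per hyperlink on the path")
    target = path[-1]
    # Title = Anch_text1 + Anch_text2 + ... + title of the target page.
    title = " ".join(list(anchor_texts) + [target.title])
    return {
        "lg_path": [d.url for d in path],
        "title": title,
        "body": target.body,  # assumption: body comes from the target page
    }
```

The anchor texts add the context a reader saw on the way to d3, which is why the same physical page can yield different logical documents for different paths.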

  21. Physical Documents
     • Container
       • Textual Content
       • Anchor
       • Placeholder for Other Media
     • Components
       • Media Files Other than Text
     • Use Counter
     Contentof(d) = <title, body>, for (physical) document d
     Both Container and Components are Called Raw Data

  22. Organization of Web Data in CBFWW
     Data Organization Based on Locality of Reference
     • Page and Embedded Objects: Access to a Page Causes Its Embedded Components to be Accessed
     • Page and Linked Pages: Access to a Page Makes Linked Pages More Likely to be Accessed
     • Page and Similar Pages: Access to a Page Entails Interest in Similar Pages
     Object Hierarchy
     • Semantic Region (Topic) -- Cluster of Similar Logical Pages
     • Logical Pages -- Frequently-Used Path to a Physical Page
     • Physical Pages -- Composite Page: Container(1)+Components(M)
     • Raw Data -- Indivisible Web Objects (e.g., Files)

  23. Computation of Priority: Priority for various storage levels

  24. Priority Decision When Retrieved
     • LRU determines the priority at the last moment
     • A Semantic Region (R) is a Cluster of Semantically Close Logical Documents
     • Each Document Belongs to Exactly One Cluster
     • A New Document Belongs to the Cluster whose Centroid is the Nearest
     • The Number of Semantic Regions is Given
     • Existing High-Performance Single-Pass Randomized K-Median Clustering Algorithms can be Adopted (e.g., LSEARCH)
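The nearest-centroid rule for a newly retrieved document can be sketched in a few lines. This only illustrates the assignment step, not LSEARCH itself; representing documents as dense numeric feature vectors and using Euclidean distance are assumptions, since the slides do not name a representation or a distance measure.

```python
import math


def assign_region(doc_vector, centroids):
    """Return the index of the semantic region whose centroid is nearest.

    doc_vector and each centroid are equal-length sequences of floats
    (for example, term weights). Euclidean distance is an assumption.
    """
    if not centroids:
        raise ValueError("at least one semantic region is required")
    return min(range(len(centroids)),
               key=lambda i: math.dist(doc_vector, centroids[i]))
```

Because each document is assigned at retrieval time, the region's priority is available immediately as an initial priority for the new page, rather than only after the page has accumulated its own usage history.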

  25. Topic Sensor
     • Priority Decision by Global Popularity
     • Analysis of data provided by a provider, Kyoto-Inet
       • Very popular web pages are influenced by news on TV and in newspapers
       • In particular, web pages related to some local events are accessed only during a short period of time
     • Priority by past usage alone is not enough
     • Topic Sensor finds important topics from news sites
     • Contents Graph: keywords alone are not enough, so keywords are expressed together with their co-occurrence relationships

  26. Similarity by Concept Graphs
     • Keywords → Keyphrases
     • Co-occurrence → Association rules
     Concept Graphs for Extracting Topics, by Y. Lee and Y. Kambayashi, 2002

  27. Priority Decision by Usage History
     • History of Keyword Usage: a popular web search technique
     • Depending on the interval and selected patterns, priority values will be different
     • Various kinds of priority functions can be defined using past usage data. They can be dynamically modified depending on the occupation rate of storage.
     [Figure: Change of Topic Popularity w.r.t. Usage Patterns. Six panels plot Freq over the intervals M, W, D against the Average. Panels: [Going Down], [No Change], [Going Up], [No Change], [Obsolete Topic], [New Topic]. Other labels: Hot, New, Obsolete]
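One possible reading of the usage-pattern figure is a rule that compares access frequency over three windows with a global average. The sketch below assumes that M, W and D stand for per-day access rates over the last month, week and day; the slide does not define them, and the thresholds, the `tolerance` parameter, and the exact rules are illustrative guesses rather than the authors' priority function.

```python
def classify_trend(month, week, day, average, tolerance=0.1):
    """Classify a topic from access rates over three windows.

    month, week, day: accesses per day over the last month, week, day
    (an assumed interpretation of M, W, D in the figure).
    average: access rate of a typical topic, used as the reference level.
    """
    def rising(older, newer):
        return newer > older * (1 + tolerance)

    def falling(older, newer):
        return newer < older * (1 - tolerance)

    if rising(month, week) and rising(week, day):
        # Rising from below the average looks like a newly emerging topic.
        return "new topic" if month < average else "going up"
    if falling(month, week) and falling(week, day):
        # Falling below the average looks like a topic becoming obsolete.
        return "obsolete topic" if day < average else "going down"
    return "no change"
```

A priority function could then boost "new topic" and "going up" labels and demote "obsolete topic", with the size of the adjustment tied to how full the storage is.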

  28. Experiments and Prototype: Top 10 Search Results for “Sports”, Jan. 14 ~ Feb. 14, 2002
     (a) Usage-Blind Search (b) Usage-Aware Search
     [Figure annotations: 2002 World Cup; Skiing Season; Local Baseball News. Gray: Disappeared Items. Red: New Items]

  29. Consistency Management
     • Consistency: Data in CBFWW Should be Kept Up to Date with Data in Origin Sites
     With Usage Data Available, Consistency Management Can be Done Adaptively, Depending on
     • Frequency of Updates: How Often the Data are Updated
     • Frequency of Reference: How Often the Data are Used
     • Time Interval of Reference: When the Data are Used (day or night)
     Similar to the View Selection Problem
     [Figure labels: Materialized view; Computational view; Updates; References]
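The view-selection analogy suggests a simple decision rule: keep a refreshed copy of data that is read at least as often as it changes, and fetch rarely read data on demand. The sketch below is one way to express that rule; the policy names, the comparison of the two rates, and the refresh interval formula are assumptions for illustration, not the method described on the slide.

```python
def refresh_policy(updates_per_day, references_per_day):
    """Return (policy, refresh_interval_hours).

    'materialized': keep a local copy and refresh it proactively.
    'on_demand':    do not refresh in advance; revalidate when requested
                    (the analogue of a computed view).
    The interval is None when no periodic refresh is scheduled.
    """
    if references_per_day == 0:
        return ("on_demand", None)       # nobody reads it: do not spend bandwidth
    if updates_per_day == 0:
        return ("materialized", None)    # never changes: the copy stays valid
    if references_per_day >= updates_per_day:
        return ("materialized", 24.0 / updates_per_day)
    return ("on_demand", None)           # changes more often than it is read
```

The third factor on the slide, the time of day at which data are referenced, could be used to schedule the proactive refreshes shortly before the usual access period; that refinement is left out here.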

  30. Storage Management: Priority Management
     • Priorities Based On
       • Sizes (Raw Data and Physical Pages)
       • Recency (All Objects)
       • Frequency (All Objects)
       • Link Structure Based Ranks (Physical Pages)
       • Importance of Topics Obtained from Topic Sensor
     • Priorities of Lower Level Objects Depend On Those on Upper Levels
       • Raw Data Can be Higher in Priority when Belonging to a High Priority Physical Page
       • Physical Pages Can be Higher in Priority when Belonging to a High Priority Logical Page
       • Logical Pages Can be Higher in Priority when Belonging to a High Priority Topic
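The dependence of lower-level priorities on upper-level ones can be sketched as a top-down propagation along the hierarchy topic, logical page, physical page, raw data. The additive form and the `inherit` factor are assumptions; the slides only state that a lower-level object can rank higher when its parent ranks high.

```python
def propagate(own_priorities, inherit=0.5):
    """Compute effective priorities along one chain of the object hierarchy.

    own_priorities: priorities computed from each object's own usage,
    ordered from the topmost level (topic) down to the lowest (raw data).
    Each object receives its own priority plus a share of its parent's
    effective priority.
    """
    effective = []
    parent = 0.0
    for own in own_priorities:
        parent = own + inherit * parent
        effective.append(parent)
    return effective
```

For a chain with own priorities `[8.0, 2.0, 1.0, 0.0]`, the result is `[8.0, 6.0, 4.0, 2.0]`: a raw data object with no usage of its own still receives a non-zero priority because it belongs to a hot topic.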

  31. Storage Organization

  32. Storage Management Mappings: Self-Organizing Storage Management
     • Adaptively Mapping the Object Hierarchy to the Storage Hierarchy
     • Mapping Based On Priorities of Data Objects
     • Data Migration to Higher Levels Does Not Delete the Physical Data in Lower Storage Levels
       • Data in Main Memory have Exact Copies on Disk
       • Data on Disks have Backup Copies in the Tertiary Storage
     [Figure: Priorities at All Levels (Raw Data, Physical Pages, Logical Pages, Semantic Regions) Mapped onto the Storage Hierarchy]
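The inclusion property described above (memory contents are also on disk, disk contents are also in tertiary storage) can be sketched as a placement function over object priorities. Counting capacity in slots instead of bytes and treating tertiary storage as unbounded are simplifying assumptions, and the function name is hypothetical.

```python
def place(priorities, memory_slots, disk_slots):
    """Map objects to storage levels by priority, keeping lower-level copies.

    priorities: dict mapping object id to priority (higher is better).
    Returns a dict of sets. Promotion to a faster level never removes
    the copy held at the slower levels.
    """
    ranked = sorted(priorities, key=priorities.get, reverse=True)
    tertiary = set(ranked)                                # everything is kept
    disk = set(ranked[:disk_slots])                       # top objects also on disk
    memory = set(ranked[:min(memory_slots, disk_slots)])  # hottest also in memory
    return {"memory": memory, "disk": disk, "tertiary": tertiary}
```

Because the levels are nested, demoting an object from memory is just a matter of dropping the memory copy; no data has to be written back to disk.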

  33. Level of Details
     • Data in CBFWW Can be Preprocessed to Provide Different Data Formats and Levels of Detail to Users
     • E.g., If the Size of A is Very Big, We May not be Able to Store It on the Same Storage Device
     • We can Generate A’, which Contains only the Word/Phrase Information of A. Since A’ is Small, It can be Stored at the Level where A would Otherwise be Placed, Although A Should be Stored as well. A’ can be Regarded as an Index for A. For Pictures, We may be Able to Use Pictures of Low Resolution.
     • Transcoding: Generating New Formats for Original Data
     • Summarizing: Generating a Text-Only Summary of Original Data

  34. Queries for Data and Usage Data

  35. Queries to Data Objects in CBFWW
     • A Salient Feature that Distinguishes CBFWW from a Cache is its Query Capability
     • Caches Do Not Allow Direct Use of Cached Data
     • Caches Cause the Majority of Data to be Wasted
       • Our Analysis of ISP Data Reveals that Nearly 70% of Cached Contents are Never Reused
       • The Rareness (Reverse of Frequency) Also Obeys a Zipf-Like Distribution
     • Using Usage Information Maintained by the System, We Can Introduce New Queries
       • Popularity-Aware Queries
       • Guided Navigation
       • Topic Sensor (What is Popular and How Popular)
     [Figure labels: Usage; Results with Popularity]

  36. Popularity-Aware Queries
     • Assume an OQL (Object Query Language)-Like Language, Extended with the Following Modifiers (Like DISTINCT in SQL) and Variables
       • Modifiers: MRU, LRU, MFU, LFU
       • Variables: lastref, firstref
     A CBFWW Enables Popularity-Aware Queries, e.g.,
       SELECT MRU p.oid, p.title
       FROM Physical_Page p
       WHERE p.title MENTION "data warehouse"
     This Finds the Most Recently Retrieved Physical Pages Whose Titles Contain the Phrase “Data Warehouse”.
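The intended effect of the MRU, LRU, MFU and LFU modifiers can be emulated in a few lines. This sketch assumes that a modifier orders the matching objects by last reference time or reference count, and that MENTION is a case-insensitive phrase match; the slides do not define the semantics precisely, and the `refcount` field and `limit` parameter are added for illustration.

```python
from dataclasses import dataclass


@dataclass
class PhysicalPage:
    oid: int
    title: str
    firstref: float  # time of first reference
    lastref: float   # time of most recent reference
    refcount: int    # number of references (assumed usage field)


# Each modifier maps to (sort key, descending?).
_ORDERINGS = {
    "MRU": (lambda p: p.lastref, True),    # most recently used first
    "LRU": (lambda p: p.lastref, False),   # least recently used first
    "MFU": (lambda p: p.refcount, True),   # most frequently used first
    "LFU": (lambda p: p.refcount, False),  # least frequently used first
}


def select(pages, modifier, phrase, limit=10):
    """Emulate: SELECT <modifier> p.oid, p.title ... WHERE p.title MENTION phrase."""
    key, descending = _ORDERINGS[modifier]
    hits = [p for p in pages if phrase.lower() in p.title.lower()]
    hits.sort(key=key, reverse=descending)
    return [(p.oid, p.title) for p in hits[:limit]]
```

Calling `select(pages, "MRU", "data warehouse")` corresponds to the example query on the slide: it returns the matching pages with the most recently referenced ones first.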

  37. Queries for Logical Pages
     • Queries for Logical Pages are Useful for Finding Cut-Paths when Searching for Information, e.g.,
       SELECT MFU l.path
       FROM Logical_Page l
       WHERE end_at(l.oid) IN (
         SELECT p.oid
         FROM Physical_Page p
         WHERE p.url="http://www-db.cs.wisc.edu/cidr/");
     This Finds the Most Frequently Traversed Paths that Lead to the Home Page of the CIDR Conference.

  38. Experiments and Prototype Implementation
     • Experiments
       • Demonstrate the Limit of the Cache-Only Approach: the Majority of Cache Data is Never Reused
     • Prototype Implementation
       • Show the Benefit of Managing History-Rich Web Data: Develop a Usage-Aware Search Engine
     [Figure: the Usage-Aware Search Engine takes Queries with Usage Constraints and returns Results, using Indices of Keywords for a Set of Documents Sampled from Proxy Logs and the Frequency of Reference for those Documents]

  39. Experiments and Prototype: Usage-Aware Web Search
     [Figure: Usage-Aware and Usage-Blind search results for the query “sports”]
     Usage Data from a Large ISP: Kyoto-Inet (Jan. 14, 2002 ~ Feb. 14, 2002)

  40. Conclusion
     • To Meet the Challenges Posed by the Web, We Proposed to Incorporate the Data Selection Capability of Caches into Data Management, and Developed a New Data System Called Capacity Bound-free Web Warehouse (CBFWW)
     • We Have Addressed the Following Issues Involved in the System
       • An Architecture
       • Data Management
       • Storage Management
       • Query Using Usage Data
     • We are Currently Developing a Prototype to be Used by a Provider, Kyoto-Inet.
