
The Web in the Year 2010: Challenges and Opportunities for Database Research


Presentation Transcript


  1. The Web in the Year 2010: Challenges and Opportunities for Database Research
  Gerhard Weikum, weikum@cs.uni-sb.de, http://www-dbs.cs.uni-sb.de

  2. Importance of Database Technology

  3. What Have We Done to the Web? Brave New World vs. Back to Reality:
  • Information at your fingertips → flooded by junk & ads
  • Electronic commerce → unreliable services
  • Interactive TV → poor responsiveness and vulnerability to load surges
  • Digital libraries → needles in haystacks
  • Terabyte servers → success stories require special care & investment

  4. The Grand Challenge: Service Quality Guarantees
  "Our ability to analyze and predict the performance of the enormously complex software systems ... [is] painfully inadequate" (PITAC Report)
  • Money-back Performance Guarantees
  • Continuous Service Availability
  • Guaranteed Search Result Quality
  Observation: the importance of quality guarantees is not limited to the Web → DFG graduate program at U Saarland.
  Prediction for 2010: either we will have succeeded in achieving these qualities, or we will face world-wide information chaos!

  5. Outline
  ✓ Why I’m Speaking Here
  • Observations and Predictions
  • Money-back Performance Guarantees
  • Continuous Service Availability
  • Guaranteed Search Result Quality
  • Summary of My Message

  6. From Best Effort to Performance Guarantees
  Example: "Internal Server Error. Our system administrator has been notified. Please try again later." "Check Availability (Look-Up Will Take 8-25 Seconds)"
  Observations:
  • Web service performance is best-effort only
  • Response time is unacceptable during peak load, because of queueing delays
  • Performance is mostly unpredictable!
  Users (and providers) need performance guarantees! Unacceptably slow servers are like unavailable servers. With a huge number of clients, guarantees may be stochastic.
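
  As an illustration of why queueing delays wreck peak-load response times (this model is an assumption of mine, not from the slides), the textbook M/M/1 queue has mean response time 1/(mu - lambda) for arrival rate lambda and service rate mu, which blows up as utilization approaches 1:

    # Illustrative M/M/1 model (assumed here; the slide names no model):
    # mean response time 1/(mu - lam) explodes as utilization lam/mu -> 1.
    def mm1_response_time(lam: float, mu: float) -> float:
        """Mean response time of an M/M/1 queue (lam, mu in requests/s)."""
        if lam >= mu:
            raise ValueError("server is saturated (lam >= mu)")
        return 1.0 / (mu - lam)

    mu = 100.0  # server capacity: 100 requests/s
    for lam in (50.0, 90.0, 99.0):
        ms = 1000 * mm1_response_time(lam, mu)
        print(f"utilization {lam / mu:.0%}: mean response time {ms:.0f} ms")

  At 50% utilization the mean response time is 20 ms; at 99% it is already a full second, which is why best-effort servers degrade so sharply under load surges.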

  7. Example: Video (& Audio) Server
  • Partitioning of continuous data objects with variable bit rate into fragments of constant time length T
  • Periodic scheduling in rounds of duration T (0, T, 2T, 3T, 4T, ...) → fragment streams with deadlines for QoS
  • Admission control to ensure QoS ("yes, go ahead" / "no way"), based on a stochastic model: the N admitted streams must be serviced within one round,
    T_serv = Σ_{i=1..N} (T_seek,i + T_rot,i + T_trans,i) ≤ T,
    with the overload probability bounded via a Chernoff bound
  • Auto-configure server: admission control, #disks, etc.
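
  One way such an admission test could be sketched (the per-stream service-time distribution and the tolerance eps below are hypothetical; the slide gives no numbers): admit the (N+1)-st stream only while the Chernoff bound min over theta > 0 of exp(-theta*T) * E[exp(theta*X)]^(N+1) on P[T_serv > T] stays below eps.

    import math

    # Hypothetical distribution of one stream's per-round service time
    # (seek + rotation + transfer), as (seconds, probability) pairs.
    DIST = [(0.010, 0.5), (0.015, 0.3), (0.025, 0.2)]

    def chernoff_overload_prob(n_streams: int, T: float) -> float:
        """Chernoff bound on P[sum of n i.i.d. service times > T],
        minimized over a simple grid of theta values."""
        best = 1.0
        for k in range(1, 2001):
            theta = float(k)
            mgf = sum(p * math.exp(theta * x) for x, p in DIST)
            log_bound = n_streams * math.log(mgf) - theta * T
            if log_bound < 0:  # only exponentiate safe (negative) values
                best = min(best, math.exp(log_bound))
        return best

    T, eps = 1.0, 1e-6  # round length 1 s, tolerated overload probability
    n = 1
    while chernoff_overload_prob(n + 1, T) <= eps:
        n += 1  # keep admitting streams while the bound holds
    print(f"admit up to {n} concurrent streams per round")

  This admits noticeably more streams than a worst-case calculation (which here would allow only 40), while still keeping the per-round deadline-miss probability below eps.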

  8. Observations and Predictions
  Observations:
  • stochastic modeling is a crucial asset, but realistic modeling is difficult and sometimes impossible
  • resource dedication can simplify the problem
  Predictions for 2010:
  • "Web engineering" for end-to-end QoS will rediscover stochastic modeling, or it will fail
  • special-purpose, self-tuning servers with predictable performance
  • stochastic guarantees for all data and services, e.g., of the form: response time ≤ t with probability ≥ p
  • "low-hanging fruit" engineering: 90% solution with 10% intellectual effort
  • money-back guarantees after a trial phase; alerting as early as possible about necessary resource upgrades

  9. Outline
  ✓ Why I’m Speaking Here
  ✓ Observations and Predictions
  ✓ Money-back Performance Guarantees
  • Continuous Service Availability
  • Guaranteed Search Result Quality
  • Summary of My Message

  10. Vector Space Model for Content Relevance
  • Query: a set of weighted features; documents are feature vectors
  • The search engine applies a similarity metric → ranking by descending relevance

  11. Vector Space Model for Content Relevance (cont.)
  • Feature weights computed, e.g., using the tf*idf formula
  • Generalizes to multimedia search
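
  A minimal sketch of the model (the toy corpus is hypothetical, and cosine similarity is assumed as the similarity metric, which the slide leaves open):

    import math
    from collections import Counter

    docs = {  # hypothetical toy corpus
        "d1": "chernoff bound large deviation theorem",
        "d2": "fermat last theorem proof",
        "d3": "chernoff theorem moment generating functions",
    }

    N = len(docs)
    # document frequency of each term, for the idf part
    df = Counter(t for text in docs.values() for t in set(text.split()))

    def tfidf(text: str) -> dict:
        """Weight each term by tf * log(N / df); terms unseen in the corpus are dropped."""
        tf = Counter(text.split())
        return {t: f * math.log(N / df[t]) for t, f in tf.items() if t in df}

    def cosine(u: dict, v: dict) -> float:
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    vecs = {name: tfidf(text) for name, text in docs.items()}
    query = tfidf("chernoff theorem")
    for name in sorted(vecs, key=lambda d: cosine(query, vecs[d]), reverse=True):
        print(name, round(cosine(query, vecs[name]), 3))

  Note how idf already does useful work: "theorem" occurs in every document and so gets weight zero, leaving "chernoff" to drive the ranking.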

  12. Link Analysis for Content Authority
  • Additionally consider in-degree and out-degree of Web nodes:
    AuthorityRank(d_i) := stationary visit probability of d_i in a random walk on the Web
  • Ranking by descending relevance & authority
  • Reconciliation of relevance and authority based on ad-hoc weighting
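
  The stationary visit probability can be computed by power iteration, as in PageRank; here is a minimal sketch over a hypothetical four-page link graph (the damping factor for random jumps is my assumption, not on the slide):

    def authority_rank(links: dict, damping: float = 0.85, iters: int = 50) -> dict:
        """Power iteration for the stationary visit probability of a
        random walk that follows links with prob. `damping` and jumps
        to a uniformly random page otherwise."""
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iters):
            new = {p: (1.0 - damping) / n for p in pages}
            for p, outs in links.items():
                if outs:
                    share = damping * rank[p] / len(outs)
                    for q in outs:
                        new[q] += share
                else:  # dangling page: redistribute its mass uniformly
                    for q in pages:
                        new[q] += damping * rank[p] / n
            rank = new
        return rank

    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}  # toy graph
    print(authority_rank(links))  # "c", with the most in-links, ranks highest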

  13. Web Search Engines: State of the Art
  Query q = "Chernoff theorem". Top results:
  • AltaVista: Fermat's last theorem. Previous topic. Next topic. ... URL: www-groups.dcs.st-and.ac.uk/~history/His...st_theorem.html
  • Northern Light: J. D. Biggins - Publications. Articles on the Branching Random Walk. http://www.shef.ac.uk/~st1jdb/bibliog.html
  • Lycos: SIAM Journal on Computing, Volume 26, Number 2, Contents. Fail-Stop Signatures ... http://epubs.siam.org/sam-bin/dbq/toc/SICOMP/26/2
  • Excite: The Official Web Site of Playboy Lingerie Model Mikki Chernoff. http://www.mikkichernoff.com/
  • Google: ...strong convergence \cite{Chernoff}. \begin{theorem}\label{T1} Let... http://mpej.unige.ch/mp_arc/p/00-277
  • Yahoo: Moment-generating Functions; Chernoff's Theorem; The Kullback-... http://www.siam.org/catalog/mcc10/bahadur.htm
  • MathSearch: No matches found.

  14. But There Is Hope
  Observation: starting from an (intellectually maintained) directory, or from promising but still unsatisfactory query results, and exploring the neighborhood of these URLs would eventually lead to useful documents. But intellectual time is expensive!
  Research avenues:
  • Leverage advanced IR: automatic classification
  • Organize information and leverage IT megatrends: XML

  15. Ontologies and Statistical Learning for Content Classification
  • Ontology of categories, e.g.: Science → Mathematics → Algebra, ...; → Probability and Statistics → Hypothesis Testing, Large Deviation, ...
  • Feature space: term frequencies f_j
  • Naive Bayes classifier with multinomial prior and parameters p_1k, ..., p_|F|k, q_k estimated from a training sample: assign each new document to the highest-likelihood category, i.e., the category k maximizing q_k · Π_j p_jk^(f_j)
  • Or other classifiers (e.g., SVM)
  • Good for query expansion and user relevance feedback
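
  A minimal multinomial Naive Bayes sketch (the tiny training sample, the category names, and the Laplace smoothing constant are illustrative assumptions):

    import math
    from collections import Counter, defaultdict

    train = [  # hypothetical training sample: (category, document text)
        ("probability", "chernoff bound large deviation"),
        ("probability", "hypothesis testing likelihood"),
        ("algebra", "group ring field homomorphism"),
    ]

    vocab = {t for _, text in train for t in text.split()}
    prior = Counter(cat for cat, _ in train)   # counts behind q_k
    counts = defaultdict(Counter)              # term counts behind p_jk
    for cat, text in train:
        counts[cat].update(text.split())

    def classify(text: str, alpha: float = 1.0) -> str:
        """Return the category k maximizing log q_k + sum_j f_j * log p_jk,
        with Laplace-smoothed estimates of p_jk."""
        best, best_lp = None, -math.inf
        for cat in prior:
            total = sum(counts[cat].values())
            lp = math.log(prior[cat] / sum(prior.values()))  # log q_k
            for t in text.split():
                if t in vocab:
                    p = (counts[cat][t] + alpha) / (total + alpha * len(vocab))
                    lp += math.log(p)
            if lp > best_lp:
                best, best_lp = cat, lp
        return best

    print(classify("chernoff theorem large deviation"))  # -> "probability"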

  16. www.links2go.com: Chernoff theorem

  17. For Better Search Quality: XML to the Rescue
  Semistructured data: elements, attributes, links organized as a labeled graph. [The slide renders the document below as such a graph, with a link from the difficult hike to a trip report.]

    <travelguide>
      <place> Zion National Park
        <location> Utah </location>
        <activities> hiking, canyoneering </activities>
        <lodging>
          <lodge price="$80-140"> Zion Lodge ... </lodge>
          <motel price="$55"> ... </motel>
        </lodging>
        <hikes>
          <hike type="backcountry" level="difficult"> Kolob Creek ... class 5.2 ... </hike>
          ...
        </hikes>
      </place>
      <place> Yosemite NP
        <location> California </location>
        <activities> hiking, climbing ...

  18. Querying XML
  Regular expressions over path labels + logical conditions over element contents.
  Example query: difficult hikes in affordable national parks

    Select P
    From //travelguide.com
    Where place Like "%Park%" As P
    And P.#.lodging.#.(price|rate) < $70
    And P.#.hike+.level? Like "%difficult%"

  [The slide again shows the travelguide graph; the trip report linked from the hike reads "... many technical obstacles ... 15 feet dropoff ... need 100 feet rope ...".]
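
  For comparison, part of this query can be approximated in standard XPath 1.0, which lacks the #, +, and ? path operators of the slide's research language; a rough sketch, assuming the third-party lxml library and that the slide-17 document is stored in a file named travelguide.xml (both assumptions of mine):

    from lxml import etree  # assumed: lxml is installed

    # Rough XPath 1.0 analogue of the slide's query; '//' stands in for
    # the arbitrary-depth '#' steps, and the (price|rate) alternative is
    # reduced to @price for brevity.
    doc = etree.parse("travelguide.xml")  # hypothetical file name
    hits = doc.xpath(
        "//place[contains(., 'Park')]"
        "[.//lodging/*[number(translate(@price, '$ ', '')) < 70]]"
        "[.//hike[@level = 'difficult']]"
    )
    for place in hits:
        print(place.text.strip())  # e.g. "Zion National Park"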

  19. XXL: Reconciling XML and IR Technologies
  Result ranking of XML data based on semantic similarity.
  Example query (as above, but with ~ similarity conditions):

    Select P
    From //all-roots
    Where ~place ~ "Park" As P
    And P.#.~lodge.#.~price < $70
    And P.#.~hike ~ "difficult"
    And P.#.~activities ~ "climbing"

  This matches both the <travelguide> document of slide 17 and a differently tagged document such as:

    <outdoors>
      <region> Zion NP
        <state> Utah </state>
        <things-to-do> hiking, canyoneering </things-to-do>
        <campground fee="$15"> ... </campground>
        <backcountry-trips>
          <trip> Kolob Creek ... challenging hike ... </trip>
          ...
      </region>
      ...
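
  A toy illustration of the ~ (similarity) operator on element names (the similarity table, scores, and threshold are invented for illustration; the slide does not specify XXL's actual ontology-based measure):

    # Invented tag-name similarities standing in for an ontology-based measure.
    SIM = {
        ("place", "region"): 0.9,
        ("lodge", "campground"): 0.7,
        ("hike", "trip"): 0.8,
        ("activities", "things-to-do"): 0.9,
    }

    def tag_sim(a: str, b: str) -> float:
        if a == b:
            return 1.0
        return SIM.get((a, b), SIM.get((b, a), 0.0))

    def similar_tags(query_tag: str, doc_tags: list, threshold: float = 0.6) -> list:
        """Tags matching ~query_tag, with scores that would feed into
        the overall result ranking."""
        scored = [(t, tag_sim(query_tag, t)) for t in doc_tags]
        return sorted([ts for ts in scored if ts[1] >= threshold],
                      key=lambda ts: -ts[1])

    print(similar_tags("place", ["region", "travelguide", "outdoors"]))
    # -> [('region', 0.9)]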

  20. Ontologies, Statistical Learning, and XML: The Big Synergy
  Research avenue:
  • build domain-specific and personalized ontologies → leverage the XML momentum!
  • automatically classify XML documents into the ontology
  • expand queries by mapping them into the ontology and adding category-specific path conditions, e.g.: #.math?.#.(~large deviation)?.#.theorem.~Chernoff
  • exploit user feedback
  Research issues:
  • Which kind of ontology (tree, lattice, HOL, ...)?
  • Which feature selection for classification? Which classifier?
  • Information-theoretic foundation? A theory of "optimal" search results?

  21. The Mega Challenge: Scalability
  Observations:
  • search engines cover only the "surface web": about 1 billion documents, 20 TBytes
  • most data is in the "deep web" behind gateways: about 500 billion documents, 8 PBytes
  Research avenue: future search engines need a new paradigm for a new world with > 90% of information in XML and a "deep web" keeping > 90% of information behind gateways.

  22. Predictions for 2010
  Future search engines will combine:
  • pre-computation, enhanced with richer forms of indexing,
  • additional path traversal starting from index seeds (topic-specific crawling with "semantic" pattern matching), and
  • dynamic creation of subqueries at gateways
  → XXX (Cross-Web XML Explorer), a.k.a. Deep Search.
  They will be able to find, for every search, within one day and with < 1 min of intellectual effort, the results that the best human experts can find with infinite time. We should have a theory of search result "optimality" and should carry out large-scale experiments.

  23. Outline
  ✓ Why I’m Speaking Here
  ✓ Observations and Predictions
  ✓ Money-back Performance Guarantees
  ✓ Continuous Service Availability
  ✓ Guaranteed Search Result Quality
  • Summary of My Message

  24. Strategic Research Avenues
  Inspired by Jim Gray's Turing Award lecture: trouble-free systems, always up, world memex.
  Challenges for 2010:
  • Self-tuning servers with response-time guarantees, by reviving stochastic modeling and combining it with "low-hanging fruit" engineering
  • Continuously available servers, by a new theory of recovery contracts for multi-tier federations, in combination with better engineering
  • A breakthrough on search quality (incl. an optimality theory?) from the synergy of ontologies, statistical learning, and XML
  Conceivable killer arguments:
  • Infinite RAM & network bandwidth and zero latency for free
  • Smarter people don't need a better Web
