The Web in the Year 2010: Challenges and Opportunities for Database Research
Gerhard Weikum
weikum@cs.uni-sb.de
http://www-dbs.cs.uni-sb.de
What Have We Done to the Web?

Brave New World: information at your fingertips, electronic commerce, interactive TV, digital libraries, terabyte servers.

Back to Reality: flooded by junk & ads, unreliable services, poor responsiveness and vulnerability to load surges, needles in haystacks, success stories that require special care & investment.
The Grand Challenge: Service Quality Guarantees
• Money-back performance guarantees
• Continuous service availability
• Guaranteed search result quality

"Our ability to analyze and predict the performance of the enormously complex software systems ... are painfully inadequate" (PITAC Report)

Observation: the importance of quality guarantees is not limited to the Web (cf. the DFG graduate program at U Saarland).

Prediction for 2010: either we will have succeeded in achieving these qualities, or we will face world-wide information chaos!
Outline
• Why I'm Speaking Here: Observations and Predictions
• Money-back Performance Guarantees
• Continuous Service Availability
• Guaranteed Search Result Quality
• Summary of My Message
From Best Effort to Performance Guarantees

Observations:
• Web service performance is best-effort only.
• Response times become unacceptable during peak load because of queueing delays.
• Performance is largely unpredictable!

Examples:
"Internal Server Error. Our system administrator has been notified. Please try again later."
"Check Availability (Look-Up Will Take 8-25 Seconds)"

Users (and providers) need performance guarantees! Unacceptably slow servers are as bad as unavailable servers. With a huge number of clients, guarantees may have to be stochastic.
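The queueing-delay observation above can be made concrete with the simplest textbook model. The following is an illustrative sketch (the talk does not prescribe a model): in an M/M/1 queue the mean response time is R = 1 / (mu - lambda), which explodes as utilization approaches 1, so a server that feels fast at 50% load becomes unusable near saturation.

```python
# Illustrative sketch, assuming an M/M/1 queue (not a model from the talk):
# mean response time R = 1 / (mu - lambda) blows up near full utilization.
def mm1_response_time(arrival_rate: float, service_rate: float) -> float:
    """Mean response time of an M/M/1 queue; requires arrival_rate < service_rate."""
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: arrival rate >= service rate")
    return 1.0 / (service_rate - arrival_rate)

# A server handling up to 100 requests/s: moderate load vs. peak load.
normal = mm1_response_time(arrival_rate=50.0, service_rate=100.0)  # 0.02 s
peak = mm1_response_time(arrival_rate=99.0, service_rate=100.0)    # 1.0 s
```

A 2x increase in load here costs a 50x increase in response time, which is exactly why best-effort service collapses under load surges.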
Example: Video (& Audio) Server

• Continuous data objects with variable bit rate are partitioned into fragments of constant time length T.
• The server schedules fragment streams periodically, in rounds of duration T (0, T, 2T, 3T, 4T, ...), with per-round deadlines for QoS.
• Admission control ensures QoS: a new client is admitted ("yes, go ahead") only if deadlines can still be met; otherwise it is rejected ("no way").

Stochastic model: the total service time for N concurrent streams in one round,

  T_serv = sum_{i=1}^{N} (T_seek,i + T_rot,i + T_trans,i),

must stay below the round length T with high probability; a Chernoff bound yields the admission criterion.

• Auto-configure the server: admission control, number of disks, etc.
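The admission test above can be sketched in a few lines. This is a simplified illustration, not the talk's actual scheme: it uses Hoeffding's inequality (a Chernoff-type tail bound) and assumes each stream's per-round disk time lies in a known interval [lo, hi] with mean mu; all numeric parameters below are made up for the example.

```python
import math

# Sketch of stochastic admission control via a Chernoff-type (Hoeffding) bound.
# Assumption (not from the talk): each stream's per-round disk service time is
# an independent random variable in [lo, hi] seconds with mean mu.
def overload_prob_bound(n: int, mu: float, lo: float, hi: float, T: float) -> float:
    """Hoeffding upper bound on P(total service time of n streams > round length T)."""
    total_mean = n * mu
    if T <= total_mean:
        return 1.0  # bound is vacuous once the mean alone exceeds the round
    return math.exp(-2.0 * (T - total_mean) ** 2 / (n * (hi - lo) ** 2))

def max_admissible_streams(mu: float, lo: float, hi: float,
                           T: float, eps: float = 1e-3) -> int:
    """Largest n for which the deadline-miss probability bound stays below eps."""
    n = 0
    while overload_prob_bound(n + 1, mu, lo, hi, T) <= eps:
        n += 1
    return n

# Example: 1-second rounds, streams needing 10-40 ms of disk time (mean 20 ms).
n_max = max_admissible_streams(mu=0.02, lo=0.01, hi=0.04, T=1.0, eps=1e-3)
```

Note the gap between the deterministic worst case (T / hi = 25 streams) and the stochastic guarantee, which admits more streams while still bounding the deadline-miss probability; that gap is the economic argument for stochastic QoS.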
Observations and Predictions

Observations:
• Stochastic modeling is a crucial asset, but realistic modeling is difficult and sometimes impossible.
• Resource dedication can simplify the problem.

Predictions for 2010:
• "Web engineering" for end-to-end QoS will rediscover stochastic modeling, or it will fail.
• Special-purpose, self-tuning servers with predictable performance.
• "Low-hanging fruit" engineering: a 90% solution with 10% intellectual effort.
• Stochastic guarantees for all data and services, e.g., money-back guarantees after a trial phase and asap alerting about necessary resource upgrades.
Outline
• Why I'm Speaking Here: Observations and Predictions
• Money-back Performance Guarantees
• Continuous Service Availability
• Guaranteed Search Result Quality
• Summary of My Message
Vector Space Model for Content Relevance

• Documents are feature vectors; a query is a set of weighted features.
• Term weights are computed, e.g., using the tf*idf formula.
• The search engine ranks documents by descending relevance, according to a similarity metric between query and document vectors.
• The model generalizes to multimedia search.
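The vector-space ranking above can be sketched concretely. This is a minimal illustration under standard assumptions (tf*idf weighting with idf = log(N/df), cosine similarity), not the weighting variant used by any particular engine:

```python
import math
from collections import Counter

# Minimal vector-space sketch: documents become tf*idf-weighted term vectors,
# ranked by cosine similarity. Weighting details are illustrative assumptions.
def tfidf_vectors(docs):
    """docs: list of whitespace-tokenized strings -> list of {term: weight} dicts."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d.split()))  # document frequencies
    vecs = []
    for d in docs:
        tf = Counter(d.split())
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

A query is treated as a short pseudo-document and compared against every document vector; the ranking is the list of documents sorted by descending cosine score.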
Link Analysis for Content Authority

• Additionally consider the in-degree and out-degree of Web nodes.
• Authority Rank(d_i) := stationary visit probability of d_i in a random walk on the Web.
• The search engine ranks documents by descending relevance & authority, reconciling relevance and authority by ad hoc weighting.
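The "stationary visit probability in a random walk" can be computed by power iteration, as in PageRank. The following is an illustrative sketch (the toy graph and the damping factor are assumptions, not data from the talk):

```python
# Sketch of authority ranking as the stationary distribution of a random walk
# with random jumps (PageRank-style). Graph and damping factor are illustrative.
def authority_rank(links, damping=0.85, iters=100):
    """links: dict node -> list of out-neighbors. Returns {node: probability}."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1.0 - damping) / n for u in nodes}  # random-jump mass
        for u in nodes:
            out = links[u]
            if out:
                share = damping * rank[u] / len(out)
                for v in out:
                    nxt[v] += share
            else:
                # dangling node: redistribute its mass uniformly
                for v in nodes:
                    nxt[v] += damping * rank[u] / n
        rank = nxt
    return rank

# Toy web: c links to both a and b, so b (linked from a and c) gains authority.
ranks = authority_rank({"a": ["b"], "b": ["c"], "c": ["a", "b"]})
```

The resulting probabilities are then combined with the vector-space relevance score; as the slide notes, that combination is an ad hoc weighting rather than a principled model.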
Web Search Engines: State of the Art

Query q = "Chernoff theorem":
• AltaVista: Fermat's last theorem. Previous topic. Next topic. ... (www-groups.dcs.st-and.ac.uk/~history/His...st_theorem.html)
• Northern Light: J. D. Biggins, Publications. Articles on the Branching Random Walk (http://www.shef.ac.uk/~st1jdb/bibliog.html)
• Lycos: SIAM Journal on Computing, Volume 26, Number 2, Contents. Fail-Stop Signatures ... (http://epubs.siam.org/sam-bin/dbq/toc/SICOMP/26/2)
• Excite: The Official Web Site of Playboy Lingerie Model Mikki Chernoff (http://www.mikkichernoff.com/)
• Google: ...strong convergence \cite{Chernoff}. \begin{theorem}\label{T1} Let... (http://mpej.unige.ch/mp_arc/p/00-277)
• Yahoo: Moment-generating Functions; Chernoff's Theorem; The Kullback-... (http://www.siam.org/catalog/mcc10/bahadur.htm)
• MathSearch: No matches found.
But There Is Hope

Observation: starting from an (intellectually maintained) directory, or from promising but still unsatisfactory query results, and exploring the neighborhood of these URLs would eventually lead to useful documents. But intellectual time is expensive!

Research avenues:
• Leverage advanced IR: automatic classification.
• Organize information and leverage IT megatrends: XML.
Ontologies and Statistical Learning for Content Classification

• Ontology of categories, e.g.: Science > Mathematics > Probability and Statistics > Hypothesis Testing > Large Deviation; Science > Mathematics > Algebra; ...
• Feature space: term frequencies f_j.
• New documents are assigned to the highest-likelihood category, e.g., by a Naive Bayes classifier with a multinomial prior and parameters p_1k, ..., p_|F|k, q_k estimated from a training sample; other classifiers (e.g., SVM) apply as well.
• Good for query expansion and user relevance feedback.
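A toy version of the multinomial Naive Bayes classifier named on the slide can be sketched as follows; the training examples, Laplace smoothing, and two-category setup are illustrative assumptions, not details from the talk:

```python
import math
from collections import Counter, defaultdict

# Toy multinomial Naive Bayes with Laplace smoothing (illustrative sketch).
# Training samples are (category, text) pairs; features are term frequencies.
def train(samples):
    cat_docs = Counter(c for c, _ in samples)       # for priors q_k
    term_counts = defaultdict(Counter)              # for estimates of p_jk
    vocab = set()
    for c, text in samples:
        for t in text.split():
            term_counts[c][t] += 1
            vocab.add(t)
    return cat_docs, term_counts, vocab, len(samples)

def classify(model, text):
    """Assign the document to the highest-likelihood category."""
    cat_docs, term_counts, vocab, n = model
    best, best_lp = None, float("-inf")
    for c in cat_docs:
        lp = math.log(cat_docs[c] / n)              # log prior q_k
        total = sum(term_counts[c].values())
        for t in text.split():
            # Laplace-smoothed estimate of p_jk = P(term j | category k)
            lp += math.log((term_counts[c][t] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```

In the setting of the slide, the predicted category is then used for query expansion: a query such as "Chernoff theorem" inherits the category-specific context of, say, the Large Deviation node of the ontology.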
For Better Search Quality: XML to the Rescue

Semistructured data: elements, attributes, and links organized as a labeled graph.

Example:
<travelguide>
  <place> Zion National Park
    <location> Utah </>
    <activities> hiking, canyoneering </activities>
    <lodging>
      <lodge price = $80-140> Zion Lodge ... </>
      <motel price = $55> ... </>
    </lodging>
    <hikes>
      <hike type=backcountry, level=difficult> Kolob Creek ... class 5.2 ... trip report ... </hike>
      ...
  </place>
  <place> Yosemite NP
    <location> California </>
    <activities> hiking, climbing ...
Querying XML

Query language: regular expressions over path labels + logical conditions over element contents.

Example query, "difficult hikes in affordable national parks":

Select P
From //travelguide.com
Where place Like "%Park%" As P
And P.#.lodging.#.(price|rate) < $70
And P.#.hike+.level? Like "%difficult%"

On the example document, this matches the Zion place (motel price=$55; backcountry hike Kolob Creek with level=difficult). A further condition over P.#.tripreport could match trip-report text such as "... many technical obstacles ... 15 feet dropoff ... need 100 feet rope ...".
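The flavor of such path queries can be approximated with standard tooling. The sketch below is not the query language from the slide: it hardwires the example query ("affordable place with a difficult hike") against an assumed, well-formed variant of the slide's travel-guide document, using Python's xml.etree.ElementTree:

```python
import xml.etree.ElementTree as ET

# Assumed, cleaned-up variant of the slide's example document (element names
# and attribute placement are illustrative assumptions).
DOC = """
<travelguide>
  <place name="Zion National Park">
    <location>Utah</location>
    <lodging><motel price="55">Budget Motel</motel></lodging>
    <hikes><hike level="difficult">Kolob Creek</hike></hikes>
  </place>
  <place name="Yosemite NP">
    <location>California</location>
    <lodging><lodge price="120">Yosemite Lodge</lodge></lodging>
    <hikes><hike level="easy">Valley Loop</hike></hikes>
  </place>
</travelguide>
"""

def affordable_difficult_parks(xml_text, max_price=70.0):
    """Hardwired analogue of: lodging price < $70 AND a hike with level=difficult."""
    root = ET.fromstring(xml_text)
    results = []
    for place in root.findall("place"):
        prices = [float(e.get("price")) for e in place.find("lodging")]
        has_hard_hike = any(h.get("level") == "difficult"
                            for h in place.find("hikes").findall("hike"))
        if has_hard_hike and min(prices) < max_price:
            results.append(place.get("name"))
    return results
```

What the slide's language adds over this hand-written traversal is exactly the regular path expressions (`#`, `+`, `?`, `(price|rate)`): the query stays valid when the path to the price or the hike varies across documents, whereas the code above breaks on any structural change.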
XXL: Reconciling XML and IR Technologies

Result ranking of XML data based on semantic similarity.

Example query, "difficult hikes in affordable national parks":

Select P
From //all-roots
Where ~place ~ "Park" As P
And P.#.~lodge.#.~price < $70
And P.#.~hike ~ "difficult"
And P.#.~activities ~ "climbing"

With similarity conditions (~), the query matches heterogeneous sources: both the <travelguide> document above (place Zion National Park, motel price=$55, backcountry hike Kolob Creek, level=difficult) and a differently tagged <outdoors> document:

<outdoors>
  <region> Zion NP
    <state> Utah </>
    <things-to-do> hiking, canyoneering </things-to-do>
    <campground fee = $15> ... </>
    <backcountry trips>
      <trip> Kolob Creek ... challenging hike ... </trip>
      ...
  </region>
  ...
Ontologies, Statistical Learning, and XML: The Big Synergy

Research avenue:
• Build domain-specific and personalized ontologies; leverage the XML momentum!
• Automatically classify XML documents into the ontology.
• Expand queries by mapping them into the ontology and adding category-specific path conditions, e.g.: #.math?.#.(~large deviation)?.#.theorem.~Chernoff
• Exploit user feedback.

Research issues:
• Which kind of ontology (tree, lattice, HOL, ...)?
• Which feature selection for classification? Which classifier?
• Information-theoretic foundation? A theory of "optimal" search results?
The Mega Challenge: Scalability

Observations:
• Search engines cover only the "surface web": about 1 billion documents, 20 TBytes.
• Most data lives in the "deep web" behind gateways: about 500 billion documents, 8 PBytes.

Research avenue: future search engines need a new paradigm for a world with > 90% of information in XML and a "deep web" holding > 90% of information behind gateways.
Predictions for 2010

Future search engines, XXX (Cross-Web XML Explorer), aka Deep Search, will combine:
• pre-computation, enhanced with richer forms of indexing,
• additional path traversal starting from index seeds (topic-specific crawling with "semantic" pattern matching), and
• dynamic creation of subqueries at gateways.

They will be able to find, for every search, within one day and with < 1 min of intellectual effort, the results that the best human experts could find with infinite time. We should have a theory of search result "optimality" and should carry out large-scale experiments.
Outline
• Why I'm Speaking Here: Observations and Predictions
• Money-back Performance Guarantees
• Continuous Service Availability
• Guaranteed Search Result Quality
• Summary of My Message
Strategic Research Avenues

Inspired by Jim Gray's Turing Award lecture: trouble-free systems, always up, world memex.

Challenges for 2010:
• Self-tuning servers with response-time guarantees, by reviving stochastic modeling and combining it with "low-hanging fruit" engineering.
• Continuously available servers, by a new theory of recovery contracts for multi-tier federations in combination with better engineering.
• A breakthrough on search quality (incl. an optimality theory?) from the synergy of ontologies, statistical learning, and XML.

Conceivable killer arguments:
• Infinite RAM & network bandwidth and zero latency for free.
• Smarter people don't need a better Web.