500 likes | 663 Views
Web 2.0 Data Analysis . Daniel Deutch. Data Management. “Data management is the development, execution and supervision of plans, policies, programs and practices that control , protect , deliver and enhance the value of data and information assets”
E N D
Web 2.0 Data Analysis Daniel Deutch
Data Management “Data management is the development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets” (DAMA Data Management Body of Knowledge) A major success: the relational model of databases
Relational Databases • Developed by Codd (1970), who won the Turing award for the model • Huge success and impact: • The vast majority of organizational datatoday is stored in relational databases • Implementations include MS SQL Server, MS excel, Oracle DB, mySQL,… • 2 Turing award winners (Edgar F. Codd and Jim Gray) • Basic idea: data is organized in tables (=relations) • Relations can be derived from other relations using a set of operations calledthe relational algebra • On which SQL is largely based
Research in Data(base) Management • 1970- : Relational Databases (tables). • Indexing, Tuning, Query Languages, Optimizations, Expressive Power,…. • ~20 years ago: Emergence of the Web and research on Web data • XML, text database, web graph…. • Google is a product of this research (by Stanford’s PhD students Brin and Page) • Recent years: hot topics include distributed databases, data privacy, data integration, social networks, web applications, crowdsourcing,trust,… • Foundations taken from “classical” database research • Theoretical foundations with practical impact
Web 2.0 • “Old” web (“Web 1.0”): static pages • News, encyclopedic knowledge... • No, or very little, interactive process between the web-page and the user. • Web 2.0: A term very broadly used for web-sites that use new technologies (Ajax, JS..), allowing interaction with the user. • “Network as platform" computing • The “participatory Web”
Web 2.0 • “Old” web (“Web 1.0”): static pages • News, encyclopedic knowledge... • No, or very little, interactive process between the web-page and the user. • Web 2.0: A term very broadly used for web-sites that use new technologies (Ajax, JS..), allowing interaction with the user. • “Network as platform" computing • The “participatory Web”
Data is all around • Web graph • “Social graph” • Pictures, Videos, notifications, messages.. • Data that the application processes • Advertisments • Even the application structure itself
Need to Analyze • Huge amount of data out there • Est. 13.68 billion web-pages and counting • Half a billion tweets per day and counting • An average user “sees” about 600 tweets per day • Most of it is irrelevant for you, some is incorrect
Static vs. Dynamic data • For “static” web data: problem addressed by search engines • Allow users to specify criteria for search (e.g. keywords), and rank the results. • For “dynamic” web 2.0 data (tweets, notifications, web applications): • No satisfying solution yet. • Data volume is constantly increasing.
Filter, Rank, Explain • Filter • Select the portion of data that is relevant • Group similar results • Rank • Rank data by trustworthiness, relevance, recency... • Present highest-rank first • Explain • An explanation of whyis the data considered relevant/highly-ranked • An explanation of howhas the data propagated • “Why do I see this?”
Approach • Leverage knowledge from “classic” database research • Account for the new challenges • Do so in a generic manner • Leverage unique features such as collaborative contribution, distribution, etc.
sname sid=sid cid=cid name=“Mary” Courses Takes Students Data model Query language Students Select… From… Where… Physical Storage Indexing Distribution ...
Foundations • Model • Query Language • Query evaluation algorithms • Prototype implementation and optimizations • Getting Data and Testing
Two concrete examples • Analysis of Web Applications • Analysis of Social Networks
Analysis Questions • For the application owner: • Can the user make a reservation without giving credit card details? • What is the probability that a user would choose a category for which there are no available products? • For the application user: • How can I get the best deals (by navigating)? • How can I navigate in an optimal way (minimal number of clicks)
Modeling • Model should be declarative, intuitive • Abstract representation • Allows readability, portability, applicability, optimizations… • Tradeoffbetween expressivity and complexity
Effects guiding the execution • Dynamic effects such as the choice of which link to follow • Probability/cost value can be associated with every such choice • Cumulative: if the same choice is made twice, double the cost • Static effects such as available products in the database • Static in the sense that it is constant in a single execution • Thus non-cumulative
Filter, Rank, Explain • Filter • Choose “interesting” navigation sequences • Rank • Rank different sequences by length, purchase cost,… • Explain • Why are these the best sequences? What do they entail in terms of purchase? How can they be refined?
Query Language • Desired executions/navigation sequences are typically expressed in temporal logic • Database analysis is typically done through a First-order-based query language (e.g. SQL) • A query language for data-dependent application should combine both. • While allowing for efficient evaluation. • LTL-FO
Query Evaluation • Problem: Given a process specification and a query, evaluatethe query • To find all executions that satisfy the criteria • To find the probability that the criteria are satisfied • To find the minimal cost execution realizing the criteria • …
Interesting problems are hard (sometimes) • The (corresponding boolean) problems are NP-hard • Very unlikely that they can be solved efficiently for worst case input • This is often the case when modeling real-life problems • To be considered, but not to be feared!
What do you do with a theoretically hard problem? • Find restricted, yet realistic, cases that are “easy”, i.e. admit efficient worst case algorithms • This also allows to have a better understanding of “where is the hardness” • Example: If there are only “few” static choices, the problem becomes easy • Design optimization techniques for practically efficient implementations
Optimization • One idea is based on the notion of data provenance • Originally for “classic” database queries, provenance is a “concise” representation of the query computation • Here we generalize the approach for the temporal queries • Allows for a generic approach and many optimizations • A system prototype for provenance-aware analysis of processes is under development
How do we get data? • Some applications (e.g. Yahoo! Shopping) allows a (limited)interface to their database. • Application structure can be obtained by crawling. • Allows to prove that the model is realistic, and to test feasibility of algorithms on real-life data scale
Open problems • Uncertainty and partial knowledge • Data updates by the application • Incremental maintenance • How do modifications to the database can be accounted for without restarting the analysis?
Analysis of Social Networks • Social network common model of disseminating data is in a “push” mode • Facebook’s news feed, Twitter’s tweets… • Huge load of information, and it continues to grow • No standard way of searching and ranking such posts
A Machine Learning Approach • “Learn” users preferences • Ask them whether they like a particular post/tweet, or use existing mechanisms • Try to generalize by extracting characteristic features (e.g. topics) • Use the generalized function to predict future preferences
A data management approach • Allow users to tell the system their preferences, through a “query” language. • One family of such languages are rule-based languages • Prolog, Datalog,... • Simpler than standard programming languages • Allows for easier analysis
Rule based language(Logic Programming) descendent(X,Y):- parent(X,Y) descendent(X,Y):- parent(X,Z),descendent(Z,Y)
For tweeter See(tweet):-Tweets(Tova,tweet) See(tweet):-Tweets(x,tweet), related(x,"basketball") Attending(party):- count[friend(x),like(x,party)]>10
Learning • Unlike standard database querying, some of the predicates may be difficult to evaluate • How do we know that a post is related to basketball? • What can we do?
What can we do • Text analysis • Problem: An ”I love basketball” tweet is not really a good match • Use tagging if exists • Ask other peers in the network! • Can also be used to propose predicates to users
Influence • Predict effects and allow users to influence friends • "How do I get at least 5 of my friends to come with the party with me?" • Need them to (1) see the party, (2) click "attending" • "How do I get at least 1000 people to go to a political rally?" • Can be answered by an analysis of the rules!
Interface for Specifying Rules • Cannot expect Facebook/Twitter users to be able to program • Need to design an interface • The interface components can be dynamically designed according to answers by other peers • An ongoing research
Conclusions • Web 2.0 data poses new challenges in research • Data is dynamic, access mode is unique, analysis goals different than in classic databases • Lots of research on theoretical questions with strong practical applications