Web Data Management • Dr. Daniel Deutch
Web Data • The web has revolutionized our world • Data is everywhere • This constitutes great potential • But also a lot of challenges • Web data is huge, unstructured, heterogeneous, and partially incorrect • Just the ingredients of a fun topic!
Goals • Searching for relevant web pages • E.g. given keywords • Understanding the results • Ranking the results • Combining results from different sources • E.g. social networks + search history • Combining rankings • Recommendations • Movies, restaurants, ...
Types of Data On the Web • Text • XML • Tables • Hyperlinks • Semantic tags • …
Challenges • Scale • The web is huge • Heterogeneous sources • Different models and analysis techniques need to be designed • Uncertainty • Many errors (intentional or not) in the data • Many errors in understanding the data • Probabilistic modeling will be needed
Ingredients (Unordered) • Web Data Types • Semi-structured • Structured • Unstructured • Modeling & Storage • XML, text and relational DB representation • XML Typing & querying • Text models • Search and Retrieval • Crawling • Querying • Information Retrieval and Extraction (basics)
Text Analysis • POS tagging • Ranking • HITS algorithm • Google PageRank • Rank Aggregation and Top-K algorithms • Recommendations • Collaborative Filtering • The Netflix million-dollar challenge (the Netflix Prize)
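Example (ranking) • To give a flavor of the ranking topic listed above, here is a minimal sketch of PageRank computed by power iteration. The toy link graph, the function name `pagerank`, the damping factor 0.85, and the even redistribution of rank from pages without outgoing links are all illustrative assumptions, not material taken from the course.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by power iteration.

    links: dict mapping each page to the list of pages it links to
           (a hypothetical toy representation of the web graph).
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from the uniform distribution

    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page with no outgoing links spreads its
                # rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked from a, b and d.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this toy graph the page with the most incoming links ends up on top and the page nobody links to ends up last, which is the intuition behind link-based ranking.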
Semantic Web • Ontologies • Data Integration • Deriving semantic information • Wikipedia as an example • Web Services and Business Processes • BPEL, WSDL standards • Orchestration • Mashups • Analysis
Advanced Topics (time permitting) • Querying the deep web • Online advertisements • Models • Algorithms • Distributed Data Management • MapReduce and Pig Latin
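Example (distributed data management) • The MapReduce item above can be illustrated with a single-machine sketch of the map, shuffle, and reduce phases applied to word counting. The helper names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` and the two sample documents are hypothetical; a real MapReduce system would run these phases in parallel across many machines.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the partial counts collected for one word."""
    return word, sum(counts)


def word_count(documents):
    """Run the three phases sequentially over a dict of documents."""
    pairs = [
        pair
        for doc_id, text in documents.items()
        for pair in map_phase(doc_id, text)
    ]
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


# Hypothetical input documents.
docs = {"d1": "web data is huge", "d2": "web data is heterogeneous"}
counts = word_count(docs)
print(counts)
```

Because the map and reduce functions only look at one document or one key at a time, each phase could in principle be split across machines without changing the result.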
Resources • Website • Accessible from http://cs.tau.ac.il/~danielde • Slides, exercises, links • Book • http://webdam.inria.fr/Jorge/index.php • Free full version available online • Papers • Links will be provided when relevant
Your Duties • 70% Final Exam • 30% Exercises • Including programming tasks