Query Processing in Data Integration + a (corny) ending • 4/30 • Project 3 due today • Demos (to the TA) as scheduled • FHW+presentation due 5/8 • Agenda today: • 3:15-3:30: Soft Joins • 3:30-4:00: Query processing in data integration • 4:00-4:30: End review
May 8th 2:40-4:30pm • Each student gives a 5 min presentation • 19x5=95min • Also bring a hard copy of your review with you • 15min buffer + wrapup • I’ll get refreshments; you keep us all awake.
What is the problem that the paper is addressing? Why is the problem interesting? • What is the solution that the authors propose? • What is your criticism of the solution presented? • How is it related to what we learned in the class?
Course Outcomes • After this course, you should be able to answer: • How do search engines work, and why are some better than others? • Can the web be seen as a collection of (semi)structured databases? • If so, can we adapt database technology to the Web? • Can useful patterns be mined from the pages/data of the web? What did you think these were going to be?? REVIEW
Main Topics • Approximately three halves plus a bit: • Information retrieval • Information integration/Aggregation • Information mining • other topics as permitted by time REVIEW
Adapting old disciplines for Web-age • Information (text) retrieval: scale of the web; hypertext/link structure; authority/hub computations • Databases: multiple databases; heterogeneous, access limited, partially overlapping; network (un)reliability • Datamining [Machine Learning/Statistics/Databases]: learning patterns from large scale data REVIEW
Topics Covered • Introduction (1) • Text retrieval; vectorspace ranking (3) • Correlation analysis & Latent Semantic Indexing (2) • Indexing; Crawling; Exploiting tags in web pages (2) • Social Network Analysis (2) • Link Analysis in Web Search (A/H; Pagerank) (3+) • Clustering (2) • Text Classification (1) • Filtering/Recommender Systems (2) • Why do we even care about databases in the context of web (1) • XML and handling semi-structured data + Semantic Web standards (3) • Information Extraction (2) • Information/data Integration (2+) • Discussion Classes: ~3+
Big Idea 1 Finding “Sweet Spots” in computer-mediated cooperative work • It is possible to get by with techniques blithely ignorant of semantics, when you have humans in the loop • All you need is to find the right sweet spot, where the computer plays a pre-processing role and presents “potential solutions” • …and the human very gratefully does the in-depth analysis on those few potential solutions • Examples: • The incredible success of “Bag of Words” model! • Bag of letters would be a disaster ;-) • Bag of sentences and/or NLP would be good • ..but only to your discriminating and irascible searchers ;-)
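The "bag of words vs. bag of letters" point can be made concrete with a minimal sketch. The query and documents below are made-up toy strings (an assumption for illustration, not course data), and cosine similarity over raw counts stands in for the vector-space ranking covered earlier in the course.

```python
import math
from collections import Counter

def word_bag(text):
    """Bag of words: counts of lowercased, whitespace-separated tokens."""
    return Counter(text.lower().split())

def letter_bag(text):
    """Bag of letters: counts of individual alphabetic characters."""
    return Counter(ch for ch in text.lower() if ch.isalpha())

def cosine(a, b):
    """Cosine similarity between two count vectors stored as Counters."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy data: one related document, one unrelated near-anagram.
query = "link analysis"
related = "link analysis ranks pages"
unrelated = "silky annals"

for name, bagger in (("words", word_bag), ("letters", letter_bag)):
    print(name,
          round(cosine(bagger(query), bagger(related)), 3),
          round(cosine(bagger(query), bagger(unrelated)), 3))
```

With words as the unit, the unrelated document shares no terms with the query and scores zero; with letters as the unit, it looks almost identical to the query, which is why the letter-level bag sits on the wrong side of the sweet spot.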
Collaborative Computing AKA Brain Cycle Stealing AKA Computizing Eyeballs Big Idea 2 • A lot of exciting research related to web currently involves “co-opting” the masses to help with large-scale tasks • It is like “cycle stealing”, except we are stealing “human brain cycles” (the most idle of the computers if there is ever one ;-)) • Remember the mice in the Hitchhiker’s Guide to the Galaxy? (..who were running a mass-scale experiment on the humans to figure out the question..) • Collaborative knowledge compilation (wikipedia!) • Collaborative Curation • Collaborative tagging • Paid collaboration/contracting • Many big open issues • How do you pose the problem such that it can be solved using collaborative computing? • How do you “incentivize” people into letting you steal their brain cycles?
Tapping into the Collective Unconscious AKA “Wisdom of the Crowds” Big Idea 3 • Another thread of exciting research is driven by the realization that the Web is not random at all! • It is written by humans • …so analyzing its structure and content allows us to tap into the collective unconscious .. • Meaning can emerge from syntactic notions such as “co-occurrences” and “connectedness” • Examples: • Analyzing term co-occurrences in the web-scale corpora to capture semantic information (today’s paper) • Analyzing the link-structure of the web graph to discover communities • DoD and NSA are very much into this as a way of breaking terrorist cells • Analyzing the transaction patterns of customers (collaborative filtering)
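As a rough illustration of meaning emerging from co-occurrence, the sketch below scores term pairs by the Jaccard overlap of the documents they appear in. The four-document corpus is invented for this example, and Jaccard overlap is just one simple co-occurrence measure chosen here; it is not claimed to be the method of the paper discussed in class.

```python
from collections import defaultdict

# Hypothetical toy corpus; each string stands in for one web page.
docs = [
    "jaguar car engine",
    "car engine repair",
    "jaguar cat jungle",
    "cat jungle prey",
]

def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for term in set(text.split()):
            index[term].add(doc_id)
    return index

def jaccard(index, term_a, term_b):
    """Co-occurrence strength: shared documents / documents with either term."""
    docs_a = index.get(term_a, set())
    docs_b = index.get(term_b, set())
    union = docs_a | docs_b
    return len(docs_a & docs_b) / len(union) if union else 0.0

index = build_index(docs)
for pair in (("car", "engine"), ("jaguar", "car"), ("car", "cat")):
    print(pair, round(jaccard(index, *pair), 3))
```

Terms that always appear together ("car", "engine") score highest, the ambiguous "jaguar" is only partly tied to "car", and "car" and "cat" never co-occur, all without the program knowing what any of the words mean.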
If you don’t take Autonomous/Adversarial Nature of the Web into account, then it is gonna getcha.. Big Idea 4 • Most “first-generation” ideas of the web make too generous an assumption of the “good intentions” of the source/page/email creators. The reasonableness of this assumption is increasingly going to be called into question as the Web evolves in an uncontrolled manner… • Controlling creation rights removes the very essence of scalability of the web. Instead we have to factor in adversarial nature.. • Links can be manipulated to change page importance • So we need “trust rank” • Fake annotations can be added to pages and images • So we need ESP-game like self-correcting annotations.. • Fake/spam mails can be sent (and the nature of the spam mails can be altered to defeat simple spam classifiers…) • So we need adversarial classification techniques • Fake pages (in large numbers) can be created and put on the web (although, as of now, I don’t yet see the economic motive for this) • So we cannot see the web as the collective unconscious.. and co-occurrence may not imply semantic proximity.
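To see how links can be manipulated to change page importance, here is a small sketch using a plain power-iteration PageRank on a toy graph. The graph, the page names, and the five "farm" pages are all hypothetical, and the damping factor of 0.85 is the conventional default rather than anything stated in these slides. This shows only the attack; it does not implement trust rank.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Plain PageRank by power iteration; links maps page -> list of out-links."""
    nodes = sorted(set(links) | {v for outs in links.values() for v in outs})
    n = len(nodes)
    rank = {page: 1.0 / n for page in nodes}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in nodes}
        for page in nodes:
            outs = links.get(page, [])
            # A page with no out-links spreads its rank evenly over all pages.
            targets = outs if outs else nodes
            share = damping * rank[page] / len(targets)
            for dest in targets:
                new_rank[dest] += share
        rank = new_rank
    return rank

# Hypothetical honest web: nobody links to "target".
honest = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "target": ["a"]}

# Same web after an adversary adds five fake pages that all link to "target".
spammed = {**honest, **{f"farm{i}": ["target"] for i in range(5)}}

honest_rank = pagerank(honest)
spammed_rank = pagerank(spammed)
print("target before link farm:", round(honest_rank["target"], 4))
print("target after link farm: ", round(spammed_rank["target"], 4))
```

Even though the total rank is now spread over more pages, the target's score rises purely because of pages the adversary controls, which is the kind of manipulation that motivates trust-aware ranking.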
Anatomy may be likened to a harvest-field. • First come the reapers, who, entering upon untrodden ground, cut down great store of corn from all sides of them. These are the early anatomists of Europe • Then come the gleaners, who gather up ears enough from the bare ridges to make a few loaves of bread. Such were the anatomists of last century. • Last of all come the geese, who still contrive to pick up a few grains scattered here and there among the stubble, and waddle home in the evening, poor things, cackling with joy because of their success. Gentlemen, we are the geese. --John Barclay, Scottish anatomist
Information Integration on Web still rife with uncut corn • Unlike anatomy of Barclay’s day, Web is still young. We are just figuring out how to tap its potential • …You have great stores of uncut corn in front of you. • …… go cut some of your own!