Data Integration for the Relational Web. Katsarakis Michalis
Data Integration for the Relational Web. Presentation of the paper: Michael J. Cafarella, Alon Halevy, and Nodira Khoussainova. 2009. Data integration for the relational web. Proc. VLDB Endow. 2, 1 (August 2009), 1090-1101, for the needs of the course hy562. Katsarakis Michalis
Octopus system in one slide (figure: example table with columns Country, Name, Institute)
Octopus system in one slide • Search • Find relations relevant to user’s query string • Cluster similar tables together • Context • Enrich relations with data from the surrounding text • Extend • Adorn an existing relation with additional data columns derived from other relations
Index • Integration Operators • Algorithms • Implementation at Scale • Experiments • Related Work • Conclusions
Integration Operators · Algorithms · Implementation at Scale · Experiments · Related Work · Conclusions → Integration Operators
Search Operator (figure: keyword query string → relevance ranking over the extracted set of relations → ordered list of relevant relations → clustering → ordered list of clusters of relations)
Search Operator (2) • The Search operator finds relevant data on the Web and then clusters the results. • Each member table of a cluster is a concrete table that contributes to the cluster's schema relation.
Context Operator (figure: extracted relation T plus T's source web page → Context → T enriched with new columns)
Context Operator (2) (figure: example table adorned with Course id and Semester values recovered from the page context)
Context Operator (3) • Data values that hold for every tuple are generally "projected out" of the table and appear only in the Web page's surrounding text. • Context takes as input a single extracted table T and modifies it to contain additional columns, using data recovered from T's source Web page.
Extend Operator (figure: relation T, join column c, and topic keyword k → Extend → extended relation T')
Extend Operator (2) • Enables the user to add more columns to a table by performing a join. • Takes a column c of table T and a topic keyword k as input. • Returns one or more columns whose values are described by k. • The new columns added to T do not necessarily come from a single data source; Extend gathers data from a large number of sources. • It can also gather data from tables whose column label differs from k, or that have no label at all.
Integration Operators · Algorithms · Implementation at Scale · Experiments · Related Work · Conclusions → Algorithms
Algorithms • Search • Ranking • Clustering • Context • Extend • Search: • Rank the tables by relevance to the user's query. • Cluster other related tables around each top-ranking search result.
Ranking Algorithms • SimpleRank • Transmits the user's search query to a Web search engine, obtains the URL ordering, and presents the extracted data in that order. • Drawbacks: • Ranks each whole page, not the individual datasets on that page. • E.g., a person's home page may contain HTML lists that merely serve as navigation to other pages. • When multiple datasets are present on a Web page, SimpleRank relies on in-page ordering (i.e., the order of appearance). • Any metadata about the HTML lists exists only in the surrounding text, not in the table itself. • Cannot count hits between the query and a specific table's metadata.
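As a rough illustration of the SimpleRank flow just described, here is a minimal Python sketch; web_search and extract_tables are hypothetical helpers standing in for the search engine and the table extractor, not APIs from the paper.

```python
# Minimal sketch of SimpleRank (hypothetical helpers).
# web_search(query)     -> list of URLs in the engine's relevance order
# extract_tables(url)   -> list of tables found on that page, in page order

def simple_rank(query, web_search, extract_tables):
    """Return extracted tables ordered by the engine's URL ranking,
    and within each page by in-page order of appearance."""
    ranked_tables = []
    for url in web_search(query):            # engine's relevance order
        for table in extract_tables(url):    # in-page order of appearance
            ranked_tables.append(table)
    return ranked_tables
```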
Ranking Algorithms (2) • SCPRank
Ranking Algorithms (3) • SCPRank • Uses symmetric conditional probability (SCP) to measure the correlation between a cell in the extracted database and a query term: how likely the terms q and c are to appear together in a document. • SCPRank scores the whole table, not a single cell: • It sends the query to the search engine, extracting a candidate set of tables. • It then computes per-column scores, each of which is the sum of the per-cell SCP scores in that column. • The table's overall score is the maximum of its per-column scores. • Finally, it sorts the tables by score and returns a ranked list. • SCP scores are time-consuming to compute, so: • Compute scores only for the first r rows of every candidate table. • Approximate SCP scores on a small subset of the Web corpus.
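A sketch of the SCPRank scoring described above. The doc_freq helper (returning the fraction of sampled Web documents containing the given term or terms) and the simple table API (columns()/cells()) are assumptions of this sketch, and the default value of r is illustrative.

```python
def scp(q, cell_text, doc_freq):
    """Symmetric conditional probability between query q and a cell value:
    scp(q, c) = p(q, c)^2 / (p(q) * p(c)), where p(.) is the fraction of
    documents containing the term(s). doc_freq(*terms) is a hypothetical helper."""
    p_q = doc_freq(q)
    p_c = doc_freq(cell_text)
    p_qc = doc_freq(q, cell_text)            # fraction containing both terms
    if p_q == 0 or p_c == 0:
        return 0.0
    return (p_qc ** 2) / (p_q * p_c)

def scprank_score(table, query, doc_freq, r=5):
    """Score a candidate table: sum per-cell SCP scores within each column
    (first r rows only), then take the maximum per-column score."""
    column_scores = []
    for column in table.columns():           # assumed table API
        cells = column.cells()[:r]           # only the first r rows are scored
        column_scores.append(sum(scp(query, c, doc_freq) for c in cells))
    return max(column_scores) if column_scores else 0.0

# Candidate tables from the search engine would then be sorted by this score:
# ranked = sorted(candidates, key=lambda t: scprank_score(t, query, doc_freq), reverse=True)
```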
Embedded Appendix: symmetric conditional probability • Let s be a term. Then p(s) is the fraction of Web documents that contain s. • Similarly, p(s1, s2) is the fraction of documents containing both s1 and s2. • The SCP between a query q and the text in a data cell c is defined as: scp(q, c) = p(q, c)² / (p(q) · p(c)). • It indicates how likely the terms q and c are to appear together in a document.
Clustering Algorithms • TextCluster • Computes the tf-idf cosine distance between the text of table a and the text of table b. • SizeCluster • Computes a column-to-column similarity score that measures the difference in mean string length between the columns. • The overall table-to-table similarity score for a pair of tables is the sum of the per-column scores for the best column-to-column matching. • ColumnCluster • Similar to SizeCluster, but it computes a tf-idf cosine distance using only the text found in the two columns being compared.
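A simplified sketch of the SizeCluster idea just described: per-column similarity from the difference in mean string length, combined over a column-to-column matching. The greedy matching strategy used here is a simplification of this sketch, not necessarily the paper's exact matching procedure.

```python
def mean_len(column):
    """Mean string length of the values in a column (a list of values)."""
    return sum(len(str(v)) for v in column) / max(len(column), 1)

def column_similarity(col_a, col_b):
    """Higher (less negative) when the mean string lengths are close."""
    return -abs(mean_len(col_a) - mean_len(col_b))

def size_cluster_similarity(table_a, table_b):
    """Sum of per-column scores over a greedy best column-to-column matching.
    Tables are represented as lists of columns; greedy matching is an assumption."""
    remaining = list(table_b)
    total = 0.0
    for col_a in table_a:
        if not remaining:
            break
        best = max(remaining, key=lambda col_b: column_similarity(col_a, col_b))
        total += column_similarity(col_a, best)
        remaining.remove(best)
    return total
```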
Embedded Appendix: tf-idf • term frequency-inverse document frequency • reflects how important a word is to a document in a collection or corpus • highest when the term occurs many times within a small number of documents • lower when the term occurs fewer times in a document, or occurs in many documents • lowest when the term occurs in virtually all documents
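For reference, a minimal tf-idf computation over a toy corpus; this is the textbook formulation, not necessarily the exact weighting used in the paper.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """tf-idf = (term frequency in the document) * log(N / document frequency)."""
    tf = Counter(doc_tokens)[term] / max(len(doc_tokens), 1)
    df = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

# Example: "relational" is frequent in one document but rare in the corpus,
# so it receives a comparatively high weight there.
corpus = [["relational", "web", "tables"], ["web", "search"], ["web", "pages"]]
print(tf_idf("relational", corpus[0], corpus))
```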
Context Algorithms • SignificantTerms • Examines the source page of the extracted table and returns the k terms that have the highest tf-idf values and do not appear in the extracted data. • RVP (Related View Partners) • Looks beyond the source page. • Operating on table T, it obtains a large number of candidate related-view tables by using each value in T as a parameter for a new Web search. • It then filters out tables that are unrelated to T's source page by removing all tables that do not contain at least one value from SignificantTerms(T). • It collects the data values in the remaining tables, ranks them by frequency of occurrence, and returns the k highest-ranked values.
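A minimal sketch of the SignificantTerms step described above; tokenization and the tfidf_score callable (which scores a term against the source page and corpus, e.g. using a tf-idf helper like the one sketched earlier) are assumptions of this sketch.

```python
def significant_terms(source_page_tokens, table_values, tfidf_score, k=5):
    """Return the k terms from the table's source page with the highest tf-idf
    scores that do not already appear in the extracted table data.
    tfidf_score(term) is a hypothetical scoring callable."""
    table_tokens = {str(v).lower() for v in table_values}
    candidates = {t for t in source_page_tokens if t.lower() not in table_tokens}
    ranked = sorted(candidates, key=tfidf_score, reverse=True)
    return ranked[:k]
```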
Context Algorithms (2) • Hybrid • Exploits the fact that the two algorithms above are complementary in nature: • SignificantTerms finds context terms that RVP misses, and RVP discovers context terms that SignificantTerms misses. • Hybrid returns the context terms that appear in the result of either algorithm.
Extend Algorithms (figure: JoinTest produces an ordered list of joinable tables; candidates pass if their Jaccardian distance to T's join column is at or below a constant threshold)
Extend Algorithms (2) • JoinTest • Combines Web search and key-matching to perform schema matching. • Uses Jaccardian distance to measure the compatibility between the values of T's column c and each column of each candidate table. • If the distance is at or below a constant threshold t, the tables are considered joinable. • All tables that pass this threshold are sorted by relevance to keyword k.
Embedded Appendix: Jaccardian Distance • Jaccard similarity coefficient • measures similarity between sample sets: J(A, B) = |A ∩ B| / |A ∪ B| • Jaccardian distance • measures dissimilarity between sample sets: 1 − J(A, B)
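A small sketch of the Jaccardian distance and the JoinTest compatibility check described above; the threshold value used here is illustrative, not the paper's tuned constant.

```python
def jaccard_distance(values_a, values_b):
    """1 - |A ∩ B| / |A ∪ B| over the sets of (normalized) column values."""
    a = {str(v).lower() for v in values_a}
    b = {str(v).lower() for v in values_b}
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)

def is_joinable(join_column, candidate_column, t=0.8):
    """JoinTest keeps a candidate column whose Jaccardian distance to T's join
    column is at or below a constant threshold (the value of t is illustrative)."""
    return jaccard_distance(join_column, candidate_column) <= t
```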
Extend Algorithms (3) (figure: MultiJoin issues a Web search for every pair (c.cell, k) with topic keyword k, clusters the resulting relations, and orders the clusters by relevance and JoinScore)
Extend Algorithms (4) • MultiJoin • Attempts to join each tuple in the source table T with a potentially different table. • Can handle the case where there is no single joinable table. • Issues a distinct Web search query for every (c.cell, k) pair. • Clusters the results. • Ranks the clusters using a combination of the relevance scores of the ranked tables and a join score for the cluster. • JoinScore counts how many unique values from T's column c elicited tables in the cluster via the Web search step.
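A rough, high-level sketch of the MultiJoin flow described above; web_search_tables, cluster_tables, and relevance are hypothetical helpers standing in for the components introduced on the earlier slides.

```python
def multijoin(join_column_cells, keyword, web_search_tables, cluster_tables, relevance):
    """For each cell of T's join column, search for tables about (cell, keyword),
    cluster all results, and rank clusters by JoinScore and relevance.
    JoinScore counts how many distinct source cells elicited a table in the cluster."""
    hits = []                                        # (source_cell, candidate_table) pairs
    for cell in join_column_cells:
        for table in web_search_tables(f"{cell} {keyword}"):
            hits.append((cell, table))

    clusters = cluster_tables([t for _, t in hits])  # e.g., SizeCluster-style similarity

    def join_score(cluster):
        return len({cell for cell, table in hits if table in cluster})

    return sorted(clusters,
                  key=lambda cl: (join_score(cl), relevance(cl)),
                  reverse=True)
```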
Integration Operators · Algorithms · Implementation at Scale · Experiments · Related Work · Conclusions → Implementation at Scale
Implementation at Scale • Question: can Octopus ever provide low latencies for a mass audience? • Challenges • Traditional relevance-based Web search challenges • Non-adjacent SCP computations, for the Search SCPRank algorithm • Multi-query Web searches, for the Context RVP algorithm and the Extend MultiJoin algorithm • Search engines can afford to spend a huge amount of resources to quickly process a single query, but the same is not true for one Octopus user who issues tens of thousands of queries • Case 1: two small prototype back-end systems • Case 2: approximation techniques to make the computation feasible
Non-adjacent SCP computations • It is not feasible to precompute word-pair statistics: even just for pairs of tokens, each sampled document would yield O(w²) unique token combinations • Solution: a miniature search engine that fits entirely in memory • 100 GiB of RAM spread over 100 machines • A few billion Web pages • Hit-count numbers are not absolutely precise (memory is saved by representing document sets using Bloom filters)
Embedded Appendix: Bloom Filter • A Bloom filter is a space-efficient probabilistic data structure used to test whether an element is a member of a set • A query can return: • "possibly in the set" (may be a false positive) • "definitely not in the set"
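A toy Bloom filter illustrating why the hit counts become approximate: membership queries may return false positives but never false negatives. The hash construction and sizes here are illustrative only.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.bits = bytearray(num_bits)
        self.num_bits = num_bits
        self.num_hashes = num_hashes

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-1 digests of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

docs_with_term = BloomFilter()
docs_with_term.add("doc-42")
print("doc-42" in docs_with_term)   # True
print("doc-7" in docs_with_term)    # False (with high probability)
```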
Multi-query Web searches • The naive Context RVP algorithm implementation requires r·d Web searches • r: number of tables processed by Context • d: average number of sampled non-numeric data cells in each table • Even with d kept at fairly low values (e.g., 30), RVP offers a real gain in quality • MultiJoin poses a smaller problem, as it needs only one query per row
Integration Operators · Algorithms · Implementation at Scale · Experiments · Related Work · Conclusions → Experiments
Experiments • The goal is to evaluate the quality of the results generated by each Octopus operator • Collecting queries • Collected a diverse query load from Web users using Amazon Mechanical Turk. Each user suggested: • a topic for a data table • 2 distinct URLs that provide example tables
Ranking Experiments • Ran the ranking phase of Search on each of the 52 collected queries, first using SimpleRank and then SCPRank • Two judges, drawn from Amazon Mechanical Turk, labeled each table's relevance to the query on a scale of 1 to 5 • A table was marked as relevant only when both judges gave it a score of 4 or higher
Ranking Experiments (2) • Results • SCPRank performs substantially better than SimpleRank, especially in the top-2 case • The extra computational overhead clearly offers real gains in result quality
Clustering Experiments • Issued the queries and obtained a sorted list of tables using SCPRank • The best table for each result was manually chosen and used as the "center" input to the clustering system • Cluster quality was assessed by computing the percentage of queries for which a k-sized cluster contains a table that is "highly similar" to the center • Whether a table is "highly similar" was determined by asking two users from Amazon Mechanical Turk to rate the similarity of the pair on a scale of 1 to 5 • A table was marked as "highly similar" only when both judges gave a score of 4 or higher
Clustering Experiments (2) • Results • k is the cluster size: the system has only k "guesses" to find a table that is similar to the center • Little variance in quality across all algorithms
Context Experiments • Used the top-1 relevant table per query • Two of the authors manually reviewed each table's source page, noting terms that appeared to be useful context values • The values noted by both reviewers were added to the test set of true context values • Within the test set, there is a median of 3 true context values per table • Measured the percentage of tables for which a true context value is included in the top-k context terms generated by each algorithm
Context Experiments (2) • Results • Context can adorn a table with useful data from the surrounding text over 80% of the time • Although the outputs of RVP and SignificantTerms are not disjoint, RVP is able to discover new context terms that were missed by SignificantTerms • SignificantTerms does not yield the best output quality, but it is still efficient and very easy to implement
Extend Experiments • A small number of queries that appeared to be Extend-able were chosen • The top-1 ranked "relevant" table returned from Search was used • The join column c and topic keyword k were chosen by hand, opting for values that appeared to be amenable to Extend processing
Extend Experiments (2) • Results • JoinTest (which tries to find a single satisfactory table) only found extended tuples in 3 cases: • Countries • US Cities • UK Political Parties • In these 3 cases, 60% of the tuples were extended • MultiJoin found extended data for all cases • On average, 33% of the source tuples were extended • MultiJoin has a lower rate of tuple extension than JoinTest • MultiJoin finds an average of 45.5 correct extension values for every successfully extended source tuple • MultiJoin's per-tuple approach offers flexibility: fewer rows may be extended, but at least some data can be found
Experiments Summary • It is possible to obtain high-quality results for all three Octopus operators • Even with imperfect outputs, Octopus improves the productivity of the user • Promising areas of future research • Output quality • Algorithmic runtime performance
Integration Operators · Algorithms · Implementation at Scale · Experiments · Related Work · Conclusions → Related Work