Recovering Semantics of Tables on the Web
Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Fei Wu, Gengxin Miao, Chung Wu
2011, VLDB
Xunnan Xu
Problems to Solve
• Annotating tables (recovering their semantics)
  • The table's title may be missing
  • The subject column may be missing
  • Relevant context may sit far from the table, or be absent entirely
• Improving table search
  • "Bloom period (Property) of shrubs (Class)" – the kind of query this paper focuses on
  • "Color (Property) of Azalea (Instance)"
Classify Items Using Databases
• isA database
  • Berlin is a city.
  • CSCI572 is a course.
• relation database
  • Microsoft is headquartered in Redmond.
  • San Francisco is located in California.
• Why is this useful?
  • Tables are structured, so the well-known ("popular") values in a column can help identify the less familiar ones (a small lookup sketch follows).
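A minimal sketch of how the two databases can be treated as lookup structures; the entries and function names below are illustrative, not the paper's actual data or code.

```python
# Illustrative lookup structures (assumed for this sketch, not the paper's data):
# the isA database maps an instance to its candidate classes, and the relation
# database maps (subject, object) pairs to the predicates seen between them.
isa_db = {
    "berlin": {"city", "capital"},
    "csci572": {"course"},
}

relation_db = {
    ("microsoft", "redmond"): {"is headquartered in"},
    ("san francisco", "california"): {"is located in"},
}

def classes_of(instance: str) -> set:
    """Candidate class labels for a cell value, e.g. 'Berlin' -> {'city', 'capital'}."""
    return isa_db.get(instance.lower(), set())

print(classes_of("Berlin"))  # {'city', 'capital'}
```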
Construction of isA Database
• Extract pairs from web pages with patterns like: <[..] Class(C) [such as|including] Instance(I) [and|,|.]>
• Easy? Not really…
• To find the boundary of a Class: take noun phrases whose last word is a plural noun and that neither contain nor are contained in another noun phrase
  • "Michigan counties such as …"
  • "Among the lovely cities …"
• To find the boundary of an Instance: I must occur as an entire query in the query logs (a rough extraction sketch follows)
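A rough Python sketch of the pattern extraction, assuming one regular expression stands in for the paper's boundary checks; in the real pipeline the class boundary (plural head noun, no nested noun phrase) and the instance boundary (whole-query check against the query logs) are verified separately.

```python
import re

# Simplified Hearst-style pattern: <Class such as|including Instance [and|,|.]>
# The class group must end in a plural-looking "s"; everything else about
# boundary detection is omitted in this sketch.
PATTERN = re.compile(
    r"(?P<cls>\b[\w ]+?s)\s+(?:such as|including)\s+(?P<inst>[\w ]+?)(?=,|\.|\s+and\b)",
    re.IGNORECASE,
)

def extract_pairs(sentence: str):
    """Yield (instance, class) candidates found in one sentence."""
    for m in PATTERN.finditer(sentence):
        yield m.group("inst").strip().lower(), m.group("cls").strip().lower()

print(list(extract_pairs("Michigan counties such as Allegan, Barry and Berrien.")))
# [('allegan', 'michigan counties')]
```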
Improvements
• Mine more instances
  • "headquartered in I" => I is a city
• Handle duplicate sentences
  • Sentence fingerprint: a hash of the first 250 characters
• Score the pairs (sketched below):
  • Score(I, C) = Size({Pattern(I, C)})² × Freq(I, C)
  • {Pattern(I, C)} – the set of distinct patterns that extracted the pair
  • Freq(I, C) – the number of times the pair appears
  • Similar in spirit to tf-idf
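A short sketch of the de-duplication and scoring steps above; the input format (a list of (instance, class, pattern) triples) and the choice of MD5 are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

def sentence_fingerprint(sentence: str) -> str:
    """Hash of the first 250 characters, used to skip duplicate sentences."""
    return hashlib.md5(sentence[:250].encode("utf-8")).hexdigest()

def score_pairs(extractions):
    """Score(I, C) = Size({Pattern(I, C)})^2 * Freq(I, C)."""
    patterns = defaultdict(set)  # (I, C) -> distinct patterns that extracted it
    freq = defaultdict(int)      # (I, C) -> number of appearances
    for instance, cls, pattern in extractions:
        patterns[(instance, cls)].add(pattern)
        freq[(instance, cls)] += 1
    return {pair: len(patterns[pair]) ** 2 * freq[pair] for pair in freq}

triples = [("berlin", "city", "such as"), ("berlin", "city", "including"),
           ("berlin", "city", "such as")]
print(score_pairs(triples))  # {('berlin', 'city'): 12}  i.e. 2^2 * 3
```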
Construction of Relation Database
• TextRunner was used to extract the relations
  • TextRunner is a research project at the University of Washington.
  • It uses a Conditional Random Field (CRF) to detect relations among noun phrases.
  • A CRF is a popular model in machine learning: pre-defined feature functions are applied to the phrase and their weighted combination is normalized into a probability (0 to 1) for a labeling of the sentence.
• Example feature function:
  • f(sentence, i, label_i, label_i-1) = 1 if word i is "in" and label i-1 is an adjective, otherwise 0
  • => "Microsoft is headquartered in beautiful Redmond."
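An illustrative CRF-style feature function mirroring the example on this slide; the function name and the label tags ("ADJ", "VERB", "PREP") are assumptions for the sketch, not TextRunner's actual interface.

```python
# One binary feature: fires when the current word is "in" and the previous
# word's label is an adjective. A real CRF combines many such features with
# learned weights and normalizes them into a probability in [0, 1].
def f_in_after_adjective(sentence, i, label_i, label_prev):
    return 1 if sentence[i].lower() == "in" and label_prev == "ADJ" else 0

tokens = "Microsoft is headquartered in beautiful Redmond".split()
print(f_in_after_adjective(tokens, 3, "PREP", "VERB"))  # 0: previous label is a verb
print(f_in_after_adjective(tokens, 3, "PREP", "ADJ"))   # 1: fires for an adjective
```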
Assign Labels to Instances
• Assumptions
  • If many values in a column are assigned to a class, the remaining values in that column are very likely to belong to it as well.
  • The best label is the one most likely to have "produced" the observed values in the column (maximum likelihood hypothesis).
• Definitions
  • vj – the j-th value in the column; V – the set of values
  • Li – a candidate label for the column; L(A) – the best label for column A
  • U(Li, V) – the score of label Li when assigned to the set of values V
Assign Labels to Instances
• According to the maximum likelihood assumption:
  • L(A) = argmax_Li Pr[v1, …, vn | Li]
• After applying Bayes' rule to each value and normalizing (values assumed independent given the label):
  • U(Li, V) = Pr[v1, …, vn | Li] = Ks × ∏j ( Pr[Li | vj] / Pr[Li] )
  • where Ks is the normalization factor (the same for every candidate label)
• Pr[Li] -> estimated from the scores in the isA database
• Pr[Li | vj] -> score(vj, Li) / ∑k score(vj, Lk)
• Done? (a scoring sketch follows)
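A sketch of the scoring rule derived above; `prior` and `cond` are placeholder callables for Pr[Li] (from the isA-database scores) and Pr[Li | vj]. The constant Ks is dropped because it is identical for every candidate label, and logs are used only for numerical stability.

```python
import math

def label_score(label, values, prior, cond):
    """log U(Li, V) = sum_j ( log Pr[Li | vj] - log Pr[Li] ), up to the constant Ks.
    cond must return strictly positive values; the smoothing on the next slide handles zeros."""
    return sum(math.log(cond(label, v)) - math.log(prior(label)) for v in values)

def best_label(candidates, values, prior, cond):
    """L(A): the candidate label with the highest score for the column's values."""
    return max(candidates, key=lambda l: label_score(l, values, prior, cond))
```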
Assign Labels to Instances
• What if the correct label does not exist in the database at all?
• What if some popular instance–label pair has a much higher score than the rest?
• The final equation for Pr[Li | vj] uses smoothing and prevents zero probabilities.
• For the final results, only labels scoring above a certain threshold are kept.
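A generic additive-smoothing sketch; this is not the paper's actual formula (the slide does not reproduce it), only an illustration of the two goals stated above: never return a zero probability, and damp instance–label pairs whose raw scores dominate.

```python
def smoothed_cond(score, value, label, candidates, alpha=0.01):
    """Pr[label | value] with additive smoothing over all candidate labels.
    score(value, label) is the raw isA-database score; alpha is a small pseudo-count."""
    total = sum(score(value, l) for l in candidates)
    return (score(value, label) + alpha) / (total + alpha * len(candidates))
```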
Results – Label Quality and Quantity
• Gold standard
  • Labels are manually evaluated by human annotators
  • vital > okay > incorrect
  • Allegan, Barry, Berrien -> Michigan counties (vital)
  • Allegan, Barry, Berrien -> Illinois counties (incorrect)
• Relation quality
  • 128 binary relations evaluated against the gold standard
Table Search Results Comparison
• Results are fetched automatically but compared manually:
  • 100 queries, top 5 results each – 500 results in total
  • Results were shuffled and rated by 3 judges in a single-blind setup
• Ratings:
  • right on – has all the requested information: a large number of instances of the class with values for the property
  • relevant – has information about only some of the instances, or about properties closely related to the queried property
  • irrelevant
• Candidates
  • TABLE – the method in this paper
  • GOOG – results from google.com
  • GOOGR – top 1000 results from Google, intersected with the table corpus
  • DOCUMENT – a document-based approach
Table Search Results Comparison
[Results chart: fraction of queries rated (a) right on, (b) right on or relevant, (c) right on or relevant and in a table]