Recovering Semantics of Tables on the Web
Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Fei Wu, Gengxin Miao, Chung Wu
2011, VLDB
Xunnan Xu
Problems to Solve
• Annotating tables (recovering their semantics)
  • The table's title may be missing
  • The subject column may be missing
  • Relevant context may sit far from the table, or be absent entirely
• Improving table search
  • "Bloom period (Property) of shrubs (Class)" – the kind of query this paper focuses on
  • "Color (Property) of Azalea (Instance)"
Classify Items Using Databases
• isA database
  • Berlin is a city.
  • CSCI572 is a course.
• relation database
  • Microsoft is headquartered in Redmond.
  • San Francisco is located in California.
• Why is this useful?
  • Tables are structured, so the well-known ("popular") values in a column can help identify the less familiar ones (a small lookup sketch follows).
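A minimal sketch of how the two databases can be treated as lookup structures; the entries and function names below are illustrative, not the paper's actual data or code.

```python
# Illustrative lookup structures (assumed for this sketch, not the paper's data):
# the isA database maps an instance to its candidate classes, and the relation
# database maps (subject, object) pairs to the predicates seen between them.
isa_db = {
    "berlin": {"city", "capital"},
    "csci572": {"course"},
}

relation_db = {
    ("microsoft", "redmond"): {"is headquartered in"},
    ("san francisco", "california"): {"is located in"},
}

def classes_of(instance: str) -> set:
    """Candidate class labels for a cell value, e.g. 'Berlin' -> {'city', 'capital'}."""
    return isa_db.get(instance.lower(), set())

print(classes_of("Berlin"))  # {'city', 'capital'}
```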
Construction of isA Database
• Extract pairs from web pages with patterns like: <[..] Class(C) [such as|including] Instance(I) [and|,|.]>
• Easy? Not really…
• To find the boundary of a Class: take noun phrases whose last word is a plural noun and that neither contain nor are contained in another noun phrase
  • "Michigan counties such as …"
  • "Among the lovely cities …"
• To find the boundary of an Instance: I must occur as an entire query in the query logs (a rough extraction sketch follows)
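A rough Python sketch of the pattern extraction, assuming one regular expression stands in for the paper's boundary checks; in the real pipeline the class boundary (plural head noun, no nested noun phrase) and the instance boundary (whole-query check against the query logs) are verified separately.

```python
import re

# Simplified Hearst-style pattern: <Class such as|including Instance [and|,|.]>
# The class group must end in a plural-looking "s"; everything else about
# boundary detection is omitted in this sketch.
PATTERN = re.compile(
    r"(?P<cls>\b[\w ]+?s)\s+(?:such as|including)\s+(?P<inst>[\w ]+?)(?=,|\.|\s+and\b)",
    re.IGNORECASE,
)

def extract_pairs(sentence: str):
    """Yield (instance, class) candidates found in one sentence."""
    for m in PATTERN.finditer(sentence):
        yield m.group("inst").strip().lower(), m.group("cls").strip().lower()

print(list(extract_pairs("Michigan counties such as Allegan, Barry and Berrien.")))
# [('allegan', 'michigan counties')]
```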
Improvements
• Mine more instances
  • "headquartered in I" => I is a city
• Handle duplicate sentences
  • Sentence fingerprint: a hash of the first 250 characters
• Score the pairs (sketched below):
  • Score(I, C) = Size({Pattern(I, C)})² × Freq(I, C)
  • {Pattern(I, C)} – the set of distinct patterns that extracted the pair
  • Freq(I, C) – the number of times the pair appears
  • Similar in spirit to tf-idf
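A short sketch of the de-duplication and scoring steps above; the input format (a list of (instance, class, pattern) triples) and the choice of MD5 are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict

def sentence_fingerprint(sentence: str) -> str:
    """Hash of the first 250 characters, used to skip duplicate sentences."""
    return hashlib.md5(sentence[:250].encode("utf-8")).hexdigest()

def score_pairs(extractions):
    """Score(I, C) = Size({Pattern(I, C)})^2 * Freq(I, C)."""
    patterns = defaultdict(set)  # (I, C) -> distinct patterns that extracted it
    freq = defaultdict(int)      # (I, C) -> number of appearances
    for instance, cls, pattern in extractions:
        patterns[(instance, cls)].add(pattern)
        freq[(instance, cls)] += 1
    return {pair: len(patterns[pair]) ** 2 * freq[pair] for pair in freq}

triples = [("berlin", "city", "such as"), ("berlin", "city", "including"),
           ("berlin", "city", "such as")]
print(score_pairs(triples))  # {('berlin', 'city'): 12}  i.e. 2^2 * 3
```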
Construction of Relation Database
• TextRunner was used to extract the relations
  • TextRunner is a research project at the University of Washington.
  • It uses a Conditional Random Field (CRF) to detect relations among noun phrases.
  • A CRF is a popular model in machine learning: pre-defined feature functions are applied to the phrase and their weighted combination is normalized into a probability (0 to 1) for a labeling of the sentence.
• Example feature function:
  • f(sentence, i, label_i, label_i-1) = 1 if word i is "in" and label i-1 is an adjective, otherwise 0
  • => "Microsoft is headquartered in beautiful Redmond."
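An illustrative CRF-style feature function mirroring the example on this slide; the function name and the label tags ("ADJ", "VERB", "PREP") are assumptions for the sketch, not TextRunner's actual interface.

```python
# One binary feature: fires when the current word is "in" and the previous
# word's label is an adjective. A real CRF combines many such features with
# learned weights and normalizes them into a probability in [0, 1].
def f_in_after_adjective(sentence, i, label_i, label_prev):
    return 1 if sentence[i].lower() == "in" and label_prev == "ADJ" else 0

tokens = "Microsoft is headquartered in beautiful Redmond".split()
print(f_in_after_adjective(tokens, 3, "PREP", "VERB"))  # 0: previous label is a verb
print(f_in_after_adjective(tokens, 3, "PREP", "ADJ"))   # 1: fires for an adjective
```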
Assign Labels to Instances
• Assumptions
  • If many values in a column are assigned to a class, the remaining values in that column are very likely to belong to it as well.
  • The best label is the one most likely to have "produced" the observed values in the column (maximum likelihood hypothesis).
• Definitions
  • vj – the j-th value in the column; V – the set of values
  • Li – a candidate label for the column; L(A) – the best label for column A
  • U(Li, V) – the score of label Li when assigned to the set of values V
Assign Labels to Instances
• According to the maximum likelihood assumption:
  • L(A) = argmax_Li Pr[v1, …, vn | Li]
• After applying Bayes' rule to each value and normalizing (values assumed independent given the label):
  • U(Li, V) = Pr[v1, …, vn | Li] = Ks × ∏j ( Pr[Li | vj] / Pr[Li] )
  • where Ks is the normalization factor (the same for every candidate label)
• Pr[Li] -> estimated from the scores in the isA database
• Pr[Li | vj] -> score(vj, Li) / ∑k score(vj, Lk)
• Done? (a scoring sketch follows)
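A sketch of the scoring rule derived above; `prior` and `cond` are placeholder callables for Pr[Li] (from the isA-database scores) and Pr[Li | vj]. The constant Ks is dropped because it is identical for every candidate label, and logs are used only for numerical stability.

```python
import math

def label_score(label, values, prior, cond):
    """log U(Li, V) = sum_j ( log Pr[Li | vj] - log Pr[Li] ), up to the constant Ks.
    cond must return strictly positive values; the smoothing on the next slide handles zeros."""
    return sum(math.log(cond(label, v)) - math.log(prior(label)) for v in values)

def best_label(candidates, values, prior, cond):
    """L(A): the candidate label with the highest score for the column's values."""
    return max(candidates, key=lambda l: label_score(l, values, prior, cond))
```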
Assign Labels to Instances
• What if the correct label does not exist in the database at all?
• What if some popular instance–label pair has a much higher score than the rest?
• The final equation for Pr[Li | vj] uses smoothing and prevents zero probabilities.
• For the final results, only labels scoring above a certain threshold are kept.
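A generic additive-smoothing sketch; this is not the paper's actual formula (the slide does not reproduce it), only an illustration of the two goals stated above: never return a zero probability, and damp instance–label pairs whose raw scores dominate.

```python
def smoothed_cond(score, value, label, candidates, alpha=0.01):
    """Pr[label | value] with additive smoothing over all candidate labels.
    score(value, label) is the raw isA-database score; alpha is a small pseudo-count."""
    total = sum(score(value, l) for l in candidates)
    return (score(value, label) + alpha) / (total + alpha * len(candidates))
```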
Results – Label Quality and Quantity
• Gold standard
  • Labels are manually evaluated by human annotators
  • vital > okay > incorrect
  • Allegan, Barry, Berrien -> Michigan counties (vital)
  • Allegan, Barry, Berrien -> Illinois counties (incorrect)
• Relation quality
  • 128 binary relations evaluated against the gold standard
Table Search Results Comparison
• Results are fetched automatically but compared manually:
  • 100 queries, top 5 results each – 500 results in total
  • Results were shuffled and rated by 3 judges in a single-blind setup
• Ratings:
  • right on – has all the requested information: a large number of instances of the class with values for the property
  • relevant – has information about only some of the instances, or about properties closely related to the queried property
  • irrelevant
• Candidates
  • TABLE – the method in this paper
  • GOOG – results from google.com
  • GOOGR – top 1000 results from Google, intersected with the table corpus
  • DOCUMENT – a document-based approach
Table Search Results Comparison
[Results chart: fraction of queries rated (a) right on, (b) right on or relevant, (c) right on or relevant and in a table]