Web Mining
Two Key Problems • PageRank • Web Content Mining
PageRank • Intuition: solve the recursive equation: “a page is important if important pages link to it.” • Formally: importance = the principal eigenvector of the stochastic matrix of the Web. • A few fixups needed.
Stochastic Matrix of the Web • Enumerate pages. • Page i corresponds to row and column i. • M [i,j ] = 1/n if page j links to n pages, including page i ; 0 if j does not link to i. • M [i,j ] is the probability we’ll next be at page i if we are now at page j.
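The definition above can be turned into a few lines of code. A minimal sketch (Python with numpy; the three-page link structure is the Yahoo/Amazon/M’soft example used on the next slides):

# Sketch: build the column-stochastic matrix M from an out-link list.
import numpy as np

pages = ["y", "a", "m"]                                    # y = Yahoo, a = Amazon, m = M'soft
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

n = len(pages)
idx = {p: k for k, p in enumerate(pages)}
M = np.zeros((n, n))
for j, targets in out_links.items():
    for i in targets:
        M[idx[i], idx[j]] = 1.0 / len(targets)             # page j spreads 1/n to each page it links to

print(M)
# [[0.5 0.5 0. ]
#  [0.5 0.  1. ]
#  [0.  0.5 0. ]]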
Example • Suppose page j links to 3 pages, one of which is page i. Then M[i, j] = 1/3 (diagram: an arrow from j to i labeled 1/3).
Random Walks on the Web • Suppose v is a vector whose ith component is the probability that we are at page i at a certain time. • If we follow a link from i at random, the probability distribution for the page we are then at is given by the vector M v.
Random Walks --- (2) • Starting from any vector v, the limit M (M (…M (M v) …)) is the distribution of page visits during a random walk. • Intuition: pages are important in proportion to how often a random walker would visit them. • The math: limiting distribution = principal eigenvector of M = PageRank.
Example: The Web in 1839 • Three pages: Yahoo (y), Amazon (a), M’soft (m). The stochastic matrix M (column = current page, row = next page):
      y     a     m
y    1/2   1/2    0
a    1/2    0     1
m     0    1/2    0
Simulating a Random Walk • Start with the vector v = [1,1,…,1] representing the idea that each Web page is given one unit of importance. • Repeatedly apply the matrix M to v, allowing the importance to flow like a random walk. • Limit exists, but about 50 iterations is sufficient to estimate final distribution.
Example • Equations v = M v:
y = y/2 + a/2
a = y/2 + m
m = a/2
• Iterating from v = [1, 1, 1]:
y: 1, 1, 5/4, 9/8, ..., 6/5
a: 1, 3/2, 1, 11/8, ..., 6/5
m: 1, 1/2, 3/4, 1/2, ..., 3/5
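A minimal sketch of this power iteration in Python (numpy assumed), reproducing the sequence above:

# Sketch: repeatedly apply M to v = [1, 1, 1] for the 1839 example.
import numpy as np

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
v = np.array([1.0, 1.0, 1.0])
for _ in range(50):                # about 50 iterations is sufficient here
    v = M @ v
print(v)                           # approaches [6/5, 6/5, 3/5] = [1.2, 1.2, 0.6]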
Solving The Equations • Because there are no constant terms, these 3 equations in 3 unknowns do not have a unique solution. • Add in the fact that y +a +m = 3 to solve. • In Web-sized examples, we cannot solve by Gaussian elimination; we need to use relaxation (= iterative solution).
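For an example this small, “adding in the fact that y + a + m = 3” amounts to replacing one redundant equation with the sum constraint. A sketch (numpy assumed):

# Sketch: solve (I - M) v = 0 together with y + a + m = 3.
import numpy as np

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
A = np.eye(3) - M
A[2, :] = [1.0, 1.0, 1.0]          # replace the redundant third equation with the sum constraint
b = np.array([0.0, 0.0, 3.0])
print(np.linalg.solve(A, b))       # [1.2, 1.2, 0.6], i.e., [6/5, 6/5, 3/5]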
Real-World Problems • Some pages are “dead ends” (have no links out). • Such a page causes importance to leak out. • Other (groups of) pages are spider traps (all out-links are within the group). • Eventually spider traps absorb all importance.
Microsoft Becomes Dead End • Now m has no out-links, so its column is all zeros:
      y     a     m
y    1/2   1/2    0
a    1/2    0     0
m     0    1/2    0
Example • Equations v = M v:
y = y/2 + a/2
a = y/2
m = a/2
• Iterating from v = [1, 1, 1]:
y: 1, 1, 3/4, 5/8, ..., 0
a: 1, 1/2, 1/2, 3/8, ..., 0
m: 1, 1/2, 1/4, 1/4, ..., 0
M’soft Becomes Spider Trap • Now m links only to itself:
      y     a     m
y    1/2   1/2    0
a    1/2    0     0
m     0    1/2    1
Example • Equations v = M v:
y = y/2 + a/2
a = y/2
m = a/2 + m
• Iterating from v = [1, 1, 1]:
y: 1, 1, 3/4, 5/8, ..., 0
a: 1, 1/2, 1/2, 3/8, ..., 0
m: 1, 3/2, 7/4, 2, ..., 3
Google Solution to Traps, Etc. • “Tax” each page a fixed percentage at each iteration. • Add the same constant to all pages. • Models a random walk with a fixed probability of going to a random place next.
Example: Previous with 20% Tax • Equations v = 0.8 (M v) + 0.2:
y = 0.8 (y/2 + a/2) + 0.2
a = 0.8 (y/2) + 0.2
m = 0.8 (a/2 + m) + 0.2
• Iterating from v = [1, 1, 1]:
y: 1, 1.00, 0.84, 0.776, ..., 7/11
a: 1, 0.60, 0.60, 0.536, ..., 5/11
m: 1, 1.40, 1.56, 1.688, ..., 21/11
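A sketch of the taxed iteration v ← 0.8 (M v) + 0.2 on the spider-trap matrix (numpy assumed), reproducing the limit above:

# Sketch: "taxed" PageRank iteration on the spider-trap example.
import numpy as np

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])    # column m: M'soft links only to itself
v = np.array([1.0, 1.0, 1.0])
for _ in range(50):
    v = 0.8 * (M @ v) + 0.2        # 20% tax, added back equally to every page
print(v)                           # approaches [7/11, 5/11, 21/11]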
General Case • In this example, because there are no dead-ends, the total importance remains at 3. • In examples with dead-ends, some importance leaks out, but total remains finite.
Solving the Equations • Because there are constant terms, we can expect to solve small examples by Gaussian elimination. • Web-sized examples still need to be solved by relaxation.
Speeding Convergence • Newton-like prediction of where components of the principal eigenvector are heading. • Take advantage of locality in the Web. • Each technique can reduce the number of iterations by 50%. • Important --- PageRank takes time!
Web Content Mining • The Web is perhaps the single largest data source in the world. • Much of the Web (content) mining is about • Data/information extraction from semi-structured objects and free text, and • Integration of the extracted data/information • Due to the heterogeneity and lack of structure, mining and integration are challenging tasks.
Wrapper induction • Using machine learning to generate extraction rules. • The user marks the target items in a few training pages. • The system learns extraction rules from these pages. • The rules are applied to extract target items from other pages. • Many wrapper induction systems, e.g., • WIEN (Kushmerick et al, IJCAI-97), • Softmealy (Hsu and Dung, 1998), • Stalker (Muslea et al. Agents-99), • BWI (Freitag and McCallum, AAAI-00), • WL2 (Cohen et al. WWW-02). • IDE (Liu and Zhai, WISE-05) • Thresher (Hogue and Karger, WWW-05)
Stalker: A wrapper induction system (Muslea et al. Agents-99) E1: 513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515 E2: 90 Colfax, <b>Palms</b>, Phone (800) 508-1570 E3: 523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293 E4: 403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008 We want to extract area code. • Start rules: R1: SkipTo(() R2: SkipTo(-<b>) • End rules: R3: SkipTo()) R4: SkipTo(</b>)
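To make the rules concrete, here is a simplified sketch of applying a start/end landmark pair in Python (illustrative only; Stalker’s actual rule language and backward-scanning end rules are richer than this):

# Sketch: a simplified SkipTo-style extractor (not Stalker's implementation).
def extract(text, start_landmark, end_landmark):
    s = text.find(start_landmark)            # start rule: skip to just after the start landmark
    if s < 0:
        return None
    s += len(start_landmark)
    e = text.find(end_landmark, s)           # end rule: stop just before the end landmark
    if e < 0:
        return None
    return text[s:e]

e1 = "513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515"
e2 = "90 Colfax, <b>Palms</b>, Phone (800) 508-1570"
print(extract(e2, "(", ")"))                 # '800'  (rules R1 and R3)
print(extract(e1, "-<b>", "</b>"))           # '800'  (rules R2 and R4)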
Learning extraction rules • Stalker uses sequential covering to learn extraction rules for each target item. • In each iteration, it learns a perfect rule that covers as many positive items as possible without covering any negative items. • Once a positive item is covered by a rule, the whole example is removed. • The algorithm ends when all the positive items are covered. The result is an ordered list of all learned rules.
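A generic skeleton of that loop (a sketch only; Stalker’s candidate generation and refinement, shown on the next slides, are not spelled out here):

# Sketch: generic sequential covering; the helper functions are placeholders.
def sequential_covering(examples, generate_candidates, is_perfect, covers):
    rules, remaining = [], list(examples)
    while remaining:
        # Find a "perfect" rule: covers some positive items, no negative ones.
        # (Stalker would refine/specialize candidates if none qualifies.)
        rule = next((r for r in generate_candidates(remaining)
                     if is_perfect(r, remaining)), None)
        if rule is None:
            break
        rules.append(rule)
        remaining = [ex for ex in remaining if not covers(rule, ex)]
    return rules                             # ordered list of learned rules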
Rule induction through an example Training examples: E1: 513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515 E2: 90 Colfax, <b>Palms</b>, Phone (800) 508-1570 E3: 523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293 E4: 403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008 We learn a start rule for the area code. • Assume the algorithm starts with E2. It creates three initial candidate rules from the token that immediately precedes the target (“(” in E2) and its two wildcard generalizations: • R1: SkipTo(() • R2: SkipTo(Punctuation) • R3: SkipTo(Anything) • R1 is perfect: it covers two positive examples and no negative examples.
Rule induction (cont …) E1: 513 Pico, <b>Venice</b>, Phone 1-<b>800</b>-555-1515 E2: 90 Colfax, <b>Palms</b>, Phone (800) 508-1570 E3: 523 1st St., <b>LA</b>, Phone 1-<b>800</b>-578-2293 E4: 403 La Tijera, <b>Watts</b>, Phone: (310) 798-0008 • R1 covers E2 and E4, which are removed. E1 and E3 need additional rules. • Three candidates are created: • R4: SkipTo(<b>) • R5: SkipTo(HtmlTag) • R6: SkipTo(Anything) • None is perfect. Refinement is needed. • Stalker chooses R4 to refine, i.e., to add additional symbols to specialize it. • It will find R7: SkipTo(-<b>), which is perfect.
Limitations of Supervised Learning • Manual labeling is labor intensive and time consuming, especially if one wants to extract data from a huge number of sites. • Wrapper maintenance is very costly if Web sites change frequently: • It is necessary to detect when a wrapper stops working properly. • Any change may make existing extraction rules invalid. • Re-learning is needed, and most likely manual re-labeling as well.
The RoadRunner System (Crescenzi et al. VLDB-01) • Given a set of positive examples (multiple sample pages), each containing one or more data records. • From these pages, generate a wrapper as a union-free regular expression (i.e., no disjunction). The approach • To start, a sample page is taken as the wrapper. • The wrapper is then refined by solving mismatches between the wrapper and each sample page, which generalizes the wrapper.
Compare with wrapper induction • No manual labeling, but RoadRunner needs a set of positive pages generated from the same template • (not necessary if a single page contains multiple data records) • It produces a wrapper for whole pages, not for individual data records, and a Web page can have many pieces of irrelevant information. Issues of automatic extraction • Hard to handle disjunctions. • Hard to generate attribute names for the extracted data. • Extracted data from multiple sites needs integration, whether manual or automatic.
Relation Extraction • Assumptions: • No single source contains all the tuples. • Each tuple appears on many web pages. • Components of a tuple appear “close” together, e.g.: • Foundation, by Isaac Asimov • Isaac Asimov’s masterpiece, the <em>Foundation</em> trilogy • There are repeated patterns in the way tuples are represented on web pages.
Naïve approach • Study a few websites and come up with a set of patterns, e.g., regular expressions:
letter = [A-Za-z. ]
title = letter{5,40}
author = letter{10,30}
pattern: <b>(title)</b> by (author)
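In Python, the naïve pattern above might be written as a regular expression like this (a sketch; the sample page text is made up):

# Sketch: the naive <b>(title)</b> by (author) pattern as a Python regex.
import re

# letter = [A-Za-z. ], title = letter{5,40}, author = letter{10,30}
pattern = re.compile(r"<b>([A-Za-z. ]{5,40})</b> by ([A-Za-z. ]{10,30})")

page = "A classic: <b>Foundation</b> by Isaac Asimov"
for title, author in pattern.findall(page):
    print(title, "|", author)                # Foundation | Isaac Asimov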
Problems with naïve approach • A pattern that works on one web page might produce nonsense when applied to another • So patterns need to be page-specific, or at least site-specific • Impossible for a human to exhaustively enumerate patterns for every relevant website • Will result in low coverage
Better approach (Brin) • Exploit the duality between patterns and tuples: • Find tuples that match a set of patterns. • Find patterns that match a lot of tuples. • DIPRE (Dual Iterative Pattern Relation Extraction) • (diagram: tuples are matched against pages to find patterns; patterns generate new tuples, and the cycle repeats)
DIPRE Algorithm • R ← SampleTuples • e.g., a small set of <title, author> pairs • O ← FindOccurrences(R) • Occurrences of tuples on web pages • Keep some surrounding context • P ← GenPatterns(O) • Look for patterns in the way tuples occur • Make sure patterns are not too general! • R ← MatchingTuples(P) • Return or go back to Step 2
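A high-level skeleton of the loop (a sketch; FindOccurrences, GenPatterns, and MatchingTuples are left as placeholder functions):

# Sketch: DIPRE bootstrapping loop with the three steps as stub parameters.
def dipre(seed_tuples, corpus, find_occurrences, gen_patterns, matching_tuples,
          max_rounds=5, target_size=10000):
    tuples = set(seed_tuples)                          # R: e.g., a few (title, author) pairs
    for _ in range(max_rounds):
        occurrences = find_occurrences(tuples, corpus) # O: tuple occurrences plus context
        patterns = gen_patterns(occurrences)           # P: reject patterns that are too general
        tuples |= set(matching_tuples(patterns, corpus))
        if len(tuples) >= target_size:                 # return once the relation is big enough
            break
    return tuples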
Web query interface integration • Many integration tasks: • Integrating Web query interfaces (search forms) • Integrating extracted data • Integrating textual information • Integrating ontologies (taxonomies) • … • We only introduce integration of query interfaces. • Many web sites provide forms to query the deep Web. • Applications: meta-search and meta-query
Global Query Interface • (figure: a unified query interface built from the search forms of united.com, airtravel.com, delta.com, and hotwire.com)
Synonym Discovery (He and Chang, KDD-04) • Discover synonym attributes across interfaces, e.g., Author ↔ Writer, Subject ↔ Category. • Example schemas: • S1: author, title, subject, ISBN • S2: writer, title, category, format • S3: name, title, keyword, binding • Pairwise attribute correspondence: S1.author ↔ S3.name, S1.subject ↔ S2.category • vs. holistic model discovery: match all schemas at once, grouping {author, writer, name} and {subject, category}.
Schema matching as correlation mining Across many sources: • Synonym attributes are negatively correlated • synonym attributes are semantic alternatives for each other • thus, they rarely co-occur in query interfaces • Grouping attributes are positively correlated • grouping attributes semantically complement each other • thus, they often co-occur in query interfaces
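A minimal sketch of measuring such co-occurrence across interfaces (the schemas are hypothetical, and plain counts stand in for the correlation measure actually used in the paper):

# Sketch: co-occurrence counting across query-interface schemas.
# Rarely co-occurring attributes are synonym candidates; frequently
# co-occurring ones are candidates for grouping.
from itertools import combinations
from collections import Counter

schemas = [
    {"author", "title", "subject", "ISBN"},
    {"writer", "title", "category", "format"},
    {"name", "title", "keyword", "binding"},
    {"author", "title", "category"},
]

single = Counter(a for s in schemas for a in s)
pair = Counter(frozenset(p) for s in schemas for p in combinations(sorted(s), 2))

def cooccurrence_rate(a, b):
    # Fraction of interfaces containing both attributes, out of those containing either.
    both = pair[frozenset((a, b))]
    either = single[a] + single[b] - both
    return both / either if either else 0.0

print(cooccurrence_rate("author", "writer"))   # 0.0 -> never co-occur: synonym candidates
print(cooccurrence_rate("author", "title"))    # 0.5 -> co-occur often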