180 likes | 254 Views
Growing Parallel Paths for Entity-Page Retrieval. Tim Weninger , Cindy Xide Lin, and Jiawei Han. Department of Computer Science University of Illinois Urbana-Champaign, Urbana, IL Work Submitted to VLDB'10. Problem: Entity Page Retrieval. Given: Reference page.
E N D
Growing Parallel Paths for Entity-Page Retrieval Tim Weninger, Cindy Xide Lin, and Jiawei Han Department of Computer Science University of Illinois Urbana-Champaign, Urbana, IL Work Submitted to VLDB'10
Problem: Entity Page Retrieval Given: Reference page
Problem: Entity Page Retrieval …Can We find Entity Pages of the same Type?
Problem: Entity Page Retrieval …Can We find Entity Pages of the same Type?
Definitions: Defn 1: Root to link path: ◊ - hrefX contains HTML-TABLE-TR1—TD-hrefX Defn 2: Parallel Links: Share a root to link path. i.e., lists of links Defn 3: Intra-page parallel paths: ◊ - hrefCǁ ◊ - hrefB ◊ - hrefCǁ ◊ - hrefX
Definitions: Defn 5: Parallel Web site paths Share intra or inter-page parallel paths across multiple pages Defn 4: Inter-page parallel ◊ - hrefCin Page A ǁ ◊ - hrefWin Page B
Properties of Parallel Paths Prop. 1: Equal Path Length Property: Parallel paths must contain the same number of pages. Prop. 2: Parallel Page Property: The test of two paths being in parallel is equivalent to the result of tests of respective pages. Prop. 3: Equal Page Length Property: Parallel paths must have the same number of nodes across pages.
Properties of Parallel Paths Prop. 4: Divergent Path Property: Parallel Paths can extend through separate pages Prop. 5: Early Termination Property: The test of two paths can be terminated at the first occurrence of a dissimilar node
Finding Paths Naive Method Can be very costly Growing Parallel Paths First find example path Then grow paths which are in parallel to the example Repeat with alternate paths This makes magic happen
Repeating with alternate paths k-shortest paths Do k-shortest path search. Explore all of these paths Removing links After exploring a path remove the edges from the graph
Interpreting the Output Side Effect of Repeating with Alternate paths Given: Jiawei Han Result: Jiawei Han 40 Cheng Zhai 38 Kevin Chang 38 Dan Roth 32 VikramAdve 4 Roy Campbell 3 …
Interpreting the Output Side Effect of Path Finding What does the link labels on the path tell us about the entity First path People Faculty Jiawei Han Personal Site Second path Research Data Mining
Experiments Top 25 CS Departments in US (according to US News) Find all professors United States Congress Find all senators, representatives, and committees UIUC only Find all courses Final all research groups Baseline Google’s find similar search (essentially TFIDF-type ranking)
Conclusions and Future Work Given a reference page and an example entity type we can retrieve all entity pages of the same type Implications: We can use this for information integration Search, retrieval can be enhanced Shortcomings: Most errors due to incorrect list finding