Web Mining for Extracting Relations
Negin Nejati
Relation Extraction
• (James Gleick, Chaos: Making a New Science)
• (Charles Dickens, Great Expectations)
• (William Shakespeare, The Comedy of Errors)
• (Isaac Asimov, The Robots of Dawn)
• (David Brin, Startide Rising)
Relation: (author, title)
DIPRE Algorithm
S = SampleTuples
While size(S) < T
    O = FindOccurrences(S)
    P = GenPatterns(O)
    S = MatchingTuples(P)
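The loop above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the original DIPRE implementation: the toy PAGES data, the three helper bodies, and the choice to keep only the text between title and author as a "pattern" are all hypothetical. The sketch also keeps the seeds by taking a union with the newly matched tuples and stops early when nothing new is found, which the slide does not specify.

```python
import re

# Toy stand-in for web pages (made-up data for illustration only).
PAGES = [
    "Great Expectations, by Charles Dickens. Startide Rising, by David Brin.",
    "The Robots of Dawn, by Isaac Asimov. Foundation, by Isaac Asimov.",
]

def find_occurrences(tuples, pages):
    """Collect the short text found between a known title and its author."""
    middles = []
    for author, title in tuples:
        for page in pages:
            m = re.search(re.escape(title) + r"(.{1,10}?)" + re.escape(author), page)
            if m:
                middles.append(m.group(1))
    return middles

def gen_patterns(middles):
    """Assumption: each distinct middle string is used as a pattern as-is."""
    return set(middles)

def matching_tuples(patterns, pages):
    """Apply each pattern to every page and return the (author, title) pairs."""
    found = set()
    title_re = r"([A-Z][\w']*(?: [\w']+)*)"   # capitalized phrase
    author_re = r"([A-Z]\w+ [A-Z]\w+)"        # two capitalized words
    for middle in patterns:
        regex = re.compile(title_re + re.escape(middle) + author_re)
        for page in pages:
            for title, author in regex.findall(page):
                found.add((author, title))
    return found

def dipre(seeds, pages, target):
    """Grow the tuple set until it reaches `target` or stops changing."""
    tuples = set(seeds)
    while len(tuples) < target:
        occurrences = find_occurrences(tuples, pages)
        patterns = gen_patterns(occurrences)
        new = matching_tuples(patterns, pages)
        if new <= tuples:      # no new tuples: avoid looping forever
            break
        tuples |= new
    return tuples

result = dipre({("Charles Dickens", "Great Expectations")}, PAGES, target=4)
print(sorted(result))
```

Because the only pattern learned here is the literal middle text, the sketch also shows the proximity assumption discussed on the next slide: a title that is not written next to its author is never found.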
Pattern Generation
• Existing methods assume that the components of a tuple appear close together (e.g. “Foundation, by Isaac Asimov”).
• This is a very strong assumption (e.g. it misses all the titles listed on the author’s webpage).
• Less popular relations with limited data sources suffer more (for some relations this is not the typical appearance, e.g. (service, price)).
Using Heuristics
• We are looking for (author, title) pairs.
• It is very likely that the works of an author are presented as lists or tables.
• Such lists and tables usually have helpful titles such as: bibliography, selected work, novels, stories, etc.
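One way the heuristic above might look in code is sketched below with Python's standard html.parser module. It collects list items that follow a heading containing one of the keywords from the slide. The class name ListUnderHeading, the function candidate_titles, the restriction to h1–h3 headings and <li> items (tables are not handled), and the sample PAGE are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Keywords taken from the slide; matching is a simple substring test.
KEYWORDS = ("bibliography", "selected work", "novels", "stories")

class ListUnderHeading(HTMLParser):
    """Collect <li> text that appears under a heading matching KEYWORDS."""

    def __init__(self):
        super().__init__()
        self.heading = ""        # text of the most recent heading
        self.in_heading = False
        self.in_item = False
        self.items = []          # list items found under a matching heading

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.in_heading, self.heading = True, ""
        elif tag == "li" and any(k in self.heading.lower() for k in KEYWORDS):
            self.in_item = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3"):
            self.in_heading = False
        elif tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_heading:
            self.heading += data
        elif self.in_item:
            self.items[-1] += data

def candidate_titles(html):
    """Return the text of list items under headings that look like work lists."""
    parser = ListUnderHeading()
    parser.feed(html)
    return [item.strip() for item in parser.items]

# Made-up author page: only the list under "Novels" should be picked up.
PAGE = """
<h2>Biography</h2><ul><li>Born in 1812</li></ul>
<h2>Novels</h2><ul><li><i>Great Expectations</i> (1860-1861)</li><li><i>Oliver Twist</i></li></ul>
"""
print(candidate_titles(PAGE))
```

Each returned item is only a candidate occurrence; it still has to be matched against known titles before it can contribute to a pattern.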
New Algorithm
[Figure: finding occurrences of the seed pair (Charles Dickens, Great Expectations)]
New Algorithm
Group occurrences using edit distance and generate patterns:
<LI><I><A HREF="/WIKI/CHAOS:_MAKING_A_NEW_SCIENCE" TITLE="title">title</A></I> (VIKING PENGUIN, 1987)</LI>
&
<LI><I><A HREF="/WIKI/GREAT_EXPECTATIONS" TITLE="title">title</A></I> (1860–1861)</LI>
[<LI><I><A HREF="/WIKI/, " TITLE="title">title</A></I> (, )</LI>]
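The grouping step can be sketched as below. This is an illustrative sketch rather than the presenter's code: the greedy grouping against the first member of each group, the max_ratio threshold, and the use of difflib.SequenceMatcher to keep the fragments two occurrences share are assumptions. The occurrence strings are adapted from the slide (with a plain hyphen in the date range), plus one made-up table row to show a second group.

```python
from difflib import SequenceMatcher

def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def group_occurrences(occurrences, max_ratio=0.5):
    """Greedily put each occurrence in the first group whose first member is close."""
    groups = []
    for occ in occurrences:
        for group in groups:
            rep = group[0]
            if edit_distance(occ, rep) <= max_ratio * max(len(occ), len(rep)):
                group.append(occ)
                break
        else:
            groups.append([occ])
    return groups

def common_fragments(a, b, min_len=2):
    """Return the substrings shared by a and b, in order, as a crude pattern."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [a[m.a:m.a + m.size] for m in matcher.get_matching_blocks()
            if m.size >= min_len]

OCCURRENCES = [
    '<LI><I><A HREF="/WIKI/CHAOS:_MAKING_A_NEW_SCIENCE" TITLE="title">title</A></I> (VIKING PENGUIN, 1987)</LI>',
    '<LI><I><A HREF="/WIKI/GREAT_EXPECTATIONS" TITLE="title">title</A></I> (1860-1861)</LI>',
    '<TD>title</TD><TD>1987</TD>',   # made-up occurrence in a different layout
]

groups = group_occurrences(OCCURRENCES)
for group in groups:
    if len(group) > 1:
        print(common_fragments(group[0], group[1]))
```

The two list-item occurrences fall into one group and the table row into another; the fragments kept for the first group are the parts of the markup that do not depend on the particular title or publication details.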
Pattern Generation (An Alternative)
• [Charles Dickens, James Gleick, William Shakespeare, ….]
• “List of authors”
• New titles
• Run patterns on result pages
• New authors
Results
• DIPRE: 5 seeds, 3 patterns, 4047 pairs
• The proposed algorithm: 5 seeds, 2 patterns, 2596 pairs
Further Investigations
• Study the effects of including the titles of the lists and tables in the patterns.
• Study the qualitative differences between these two methods.