Detecting Sequences and Cycles of Web Pages Narayan L. Bhamidipati and Sankar K. Pal Indian Statistical Institute Kolkata
Contents • Introduction • Objective • Significance • Procedure • Experiments • Future directions
The Web: A Directed Graph • (V, A) • Vertices: Web pages • V = {v1, v2, …, vN} • Arcs: Hyperlinks • A = {eij : vj → vi} • Path: p1.p2. … .pn with an arc from pi to pi+1 for each i • Cycle: a path with pn = p1
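The path and cycle definitions on this slide can be checked mechanically. The following is a minimal sketch, assuming the graph is stored as a mapping from each page to the set of pages it links to; the names `Graph`, `is_path`, `is_cycle` and the toy graph are illustrative and do not come from the talk.

```python
from typing import Dict, List, Set

# Assumed representation: page -> set of pages it links to.
Graph = Dict[str, Set[str]]


def is_path(graph: Graph, pages: List[str]) -> bool:
    """True if every consecutive pair (p_i, p_{i+1}) is joined by an arc."""
    return len(pages) >= 2 and all(
        nxt in graph.get(cur, set()) for cur, nxt in zip(pages, pages[1:])
    )


def is_cycle(graph: Graph, pages: List[str]) -> bool:
    """True if `pages` is a path whose last page equals its first."""
    return is_path(graph, pages) and pages[0] == pages[-1]


# Hypothetical three-page web: a -> b -> c -> a.
toy: Graph = {"a": {"b"}, "b": {"c"}, "c": {"a"}}
```

With this toy graph, `["a", "b", "c"]` is a path and `["a", "b", "c", "a"]` is a cycle, while `["a", "c"]` is neither because there is no arc from `a` to `c`.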
Sequences of Web Pages • Paths consisting of adjacent web pages • Order sensitive • A surfer may follow one such sequence when browsing pages
Cycles of Web Pages • http://www.stanford.edu/ • http://www.stanford.edu/home/atoz/letterw.html • http://www.stanford.edu/group/wellspring/ • http://www.stanford.edu/group/wellspring/yahoo_spotlight.html • http://www.yahoo.com/ • http://dir.yahoo.com/Education/ • http://dir.yahoo.com/Education/Higher_Education/ • http://dir.yahoo.com/Education/Higher_Education/Colleges_and_Universities/ • http://dir.yahoo.com/Education/Higher_Education/Colleges_and_Universities/United_States/ • http://www.stanford.edu
What are we looking for? • Particular kinds of sequences and cycles • Regular • Consisting of similar units • Units having similar relationships to one another • Reasonably sized
Why are these Sequences and Cycles Interesting ? • Individual units form a single object • These were intended to be together • They collectively include the complete information • Despite being part of a collection, individuality is maintained
Significance of Detecting Such Sequences and Cycles • Compression • Merge groups of pages • Fewer pages, fewer links • Pre-fetching • Know where the surfer wants to be next • Fetch the page(s) before they are requested • Saves time • Errors: pre-fetching the wrong pages
Significance of Detecting Such Sequences and Cycles (Contd.) • Fair comparison • Comparison independent of how content is presented • Content split across multiple pages should be treated as equivalent to the same content on a single page • Better retrieval • Retrieval independent of the presentation • Output a set of pages instead of a single one as a match
Improved Retrieval • Retrieve only the portions of interest • Instead of whole (huge) documents • Avoid rewarding pages merely for having more content
How to Detect Sequences and Cycles of Web Pages? • Find navigational links • Find consecutive pages • Define the property that the elements of a sequence must satisfy • Identify subsequences (or units) • Concatenate the units • Check for cycles
Finding Navigational Links: Background • The purpose of a link may be • Navigation • Reference • Advertisement • Links between pages on the same server are treated as navigational • Such links have also been treated as noise
Finding Navigational Links: Our Method • Avoid assuming that links on the same server are navigational links • Navigational links appear mostly either at the top or at the bottom of a page • Navigational links are generally huddled together • Less text and fewer images around such links
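One way to turn these heuristics into code is sketched below. It is only an illustration under stated assumptions, not the method used in the talk: position is measured by the amount of non-anchor text seen before a link, links are "huddled" when at most `max_gap` characters of text separate them, and the thresholds `max_gap`, `min_run` and `edge_fraction` are invented defaults. Images are ignored for brevity.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class LinkScanner(HTMLParser):
    """Records each link with the amount of non-anchor text seen before it."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, int]] = []  # (href, text chars before link)
        self.text_seen = 0
        self._anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._anchor_depth += 1
            href = dict(attrs).get("href")
            if href:
                self.links.append((href, self.text_seen))

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth:
            self._anchor_depth -= 1

    def handle_data(self, data):
        # Anchor text is not counted as text "around" the links.
        if not self._anchor_depth:
            self.text_seen += len(data.strip())


def navigational_links(html: str, max_gap: int = 20, min_run: int = 2,
                       edge_fraction: float = 0.25) -> List[str]:
    """Return hrefs of links that are huddled together near the top or bottom."""
    scanner = LinkScanner()
    scanner.feed(html)
    scanner.close()
    total = scanner.text_seen

    # Group links into runs separated by little surrounding text.
    runs: List[List[Tuple[str, int]]] = []
    for link in scanner.links:
        if runs and link[1] - runs[-1][-1][1] <= max_gap:
            runs[-1].append(link)
        else:
            runs.append([link])

    found: List[str] = []
    for run in runs:
        if len(run) < min_run:
            continue
        near_top = run[-1][1] <= edge_fraction * total
        near_bottom = run[0][1] >= (1 - edge_fraction) * total
        if near_top or near_bottom:
            found.extend(href for href, _ in run)
    return found


# Hypothetical page: a navigation bar at the top and bottom, one reference link in the body.
nav_bar = '<a href="prev.html">Previous</a> | <a href="next.html">Next</a>'
sample = (nav_bar + "<p>" + "x" * 400 + ' <a href="ref.html">a reference</a> '
          + "y" * 400 + "</p>" + nav_bar)
```

On the sample page the two navigation bars are reported and the isolated reference link in the body is not. Note that the same-server test plays no part here, so links across servers can be flagged as well.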
Advantages and Limitations • Simple and fast • Navigational links across servers are also identified • Heuristics will not always work; fall back on more sophisticated methods when they fail
Units of the Sequences • ABC is a unit if C is “related” to B in the same way as B is “related” to A • “related” is defined in terms of how they are linked • Relation is stored as “position” of the link • Several ways of defining “position”
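The unit definition can be illustrated with one particular choice of "position". The sketch below assumes that position means the index of a link within the source page's ordered list of navigational links; the slide notes that several definitions are possible, so this is just one of them. The function name `find_units` and the toy pages are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed representation: page -> ordered list of its navigational link targets.
Links = Dict[str, List[str]]


def find_units(links: Links) -> List[Tuple[str, str, str]]:
    """Return triples (A, B, C) where A links to B at the same position
    at which B links to C."""
    units = []
    for a, a_targets in links.items():
        for pos, b in enumerate(a_targets):
            b_targets = links.get(b, [])
            if pos < len(b_targets):
                c = b_targets[pos]
                # Skip degenerate triples such as A -> B -> A.
                if len({a, b, c}) == 3:
                    units.append((a, b, c))
    return units


# Hypothetical tutorial: every chapter links to the contents page first, then
# to the next chapter. "toc" itself is treated as not crawled.
chapters: Links = {
    "ch1": ["toc", "ch2"],
    "ch2": ["toc", "ch3"],
    "ch3": ["toc", "ch4"],
    "ch4": ["toc"],
}
```

Here `find_units(chapters)` yields the units `(ch1, ch2, ch3)` and `(ch2, ch3, ch4)`, because in each case the "next chapter" link sits at the same position on both pages.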
Combining the units into sequences • Units: DEF, BCD, ABC, CDE • Combined sequence: ABCDEF
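The example on this slide can be reproduced with a short sketch. It assumes that two units are joined when the last two pages of one are the first two pages of the other, and that each pair of pages has at most one continuation; `combine_units` is an illustrative name, not the authors' routine.

```python
from typing import Dict, List, Sequence, Tuple


def combine_units(units: Sequence[Tuple[str, str, str]]) -> List[List[str]]:
    """Chain overlapping triples (x, y, z) into maximal sequences."""
    # (x, y) -> z; assumes at most one continuation per pair.
    nxt: Dict[Tuple[str, str], str] = {(a, b): c for a, b, c in units}
    suffixes = {(b, c) for _, b, c in units}

    sequences = []
    for a, b, _ in units:
        if (a, b) in suffixes:
            continue  # this unit continues an earlier one; not a starting point
        seq = [a, b]
        used = set()
        pair = (a, b)
        # Extend while a continuation exists; `used` guards against looping.
        while pair in nxt and pair not in used:
            used.add(pair)
            seq.append(nxt[pair])
            pair = (seq[-2], seq[-1])
        sequences.append(seq)
    return sequences


units = [tuple("DEF"), tuple("BCD"), tuple("ABC"), tuple("CDE")]
combined = ["".join(seq) for seq in combine_units(units)]
```

The order in which the units arrive does not matter. A set of units that forms a pure cycle has no starting unit and so yields no sequence here; cycles need a separate check.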
Cycle detection • Existing cycle detection algorithms • Cycle detection in number theory • Special case of cycle detection in graph theory • Stack based algorithm
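As an illustration of the stack-based idea, the sketch below follows a single successor link per page and keeps the pages visited so far on a stack. The single-successor assumption is what makes this the special case familiar from number theory, where a function is iterated until a value repeats. The function `find_cycle` and the successor map are hypothetical and are not taken from the talk.

```python
from typing import Dict, List, Optional


def find_cycle(successor: Dict[str, str], start: str) -> Optional[List[str]]:
    """Follow successor links from `start`; return the cycle as a path whose
    last page equals its first, or None if the walk ends without repeating."""
    stack: List[str] = []
    position: Dict[str, int] = {}  # page -> index on the stack
    page: Optional[str] = start
    while page is not None:
        if page in position:
            # The page is already on the stack: everything from its first
            # occurrence onwards forms the cycle.
            return stack[position[page]:] + [page]
        position[page] = len(stack)
        stack.append(page)
        page = successor.get(page)
    return None


# Hypothetical "next page" links: a -> b -> c -> d -> b.
successor = {"a": "b", "b": "c", "c": "d", "d": "b"}
```

Starting from `a`, the walk returns `["b", "c", "d", "b"]`: the lead-in page `a` is dropped and the result matches the earlier definition of a cycle as a path with pn = p1.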
Improvements and Speedups • Trust the “rel” information provided by the authors of the pages • Use keywords like “next” and “previous” to infer the relationships • Exploit the naming conventions of the pages
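The first two speedups might look like the sketch below. It assumes that "rel" refers to the HTML `rel` attribute on `<link>` and `<a>` elements, and that anchor text is matched exactly against a small keyword table; the `KEYWORDS` table, the `RelScanner` class and the rule that `rel` wins over anchor text are all illustrative choices.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional

# Assumed keyword table: rel value or anchor text -> relationship name.
KEYWORDS = {"next": "next", "prev": "prev", "previous": "prev"}


class RelScanner(HTMLParser):
    """Collects next/prev hints from rel attributes and from anchor text."""

    def __init__(self) -> None:
        super().__init__()
        self.by_rel: Dict[str, str] = {}
        self.by_text: Dict[str, str] = {}
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        for rel in (attributes.get("rel") or "").lower().split():
            if rel in KEYWORDS:
                self.by_rel.setdefault(KEYWORDS[rel], href)
        if tag == "a":
            self._href, self._text = href, []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            label = "".join(self._text).strip().lower()
            if label in KEYWORDS:
                self.by_text.setdefault(KEYWORDS[label], self._href)
            self._href = None


def navigation_hints(html: str) -> Dict[str, str]:
    """Return {'next': href, 'prev': href} hints; rel information wins."""
    scanner = RelScanner()
    scanner.feed(html)
    scanner.close()
    return {**scanner.by_text, **scanner.by_rel}


# Hypothetical page whose anchor text and rel information disagree on "next".
page = ('<link rel="next" href="ch3.html">'
        '<a href="ch1.html">Previous</a> <a href="other.html">Next</a>')
```

For the sample page the result is `{"prev": "ch1.html", "next": "ch3.html"}`: the anchor text supplies the previous page, and the author's `rel` information overrides the anchor text for the next page.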
Experimental Results • Data • Toy data: Python tutorial in HTML • Tutorial split into several chapters and sections • Several cycles • Mutilated data • Certain pages deleted (missing links) • 100% detection in all cases
Other experiments planned • Real test: unorganized web pages • Difficulties: • Finding navigational links • Noise (advertisements, etc.) • Dynamically generated pages • Will the relationships hold?
Leads us to … • Concatenate detected sequences for analysis • Modify retrieval mechanism • Return sets of pages as results • Improve mirror/duplicate detection
Future Work • Consider other relations • Unifying framework ? • Improve identification of navigational links