170 likes | 291 Views
Structural Web Search Using a Graph-Based Discovery System. Nitish Manocha, Diane J. Cook, and Lawrence B. Holder University of Texas at Arlington cook@cse.uta.edu http://www-cse.uta.edu/~cook. Structured Web Search. Existing search engines use linear feature match
E N D
Structural Web Search Using a Graph-Based Discovery System Nitish Manocha, Diane J. Cook, and Lawrence B. Holder University of Texas at Arlington cook@cse.uta.edu http://www-cse.uta.edu/~cook
Structured Web Search • Existing search engines use linear feature match • Web contains structural information as well • Hyperlink information • Web viewed as a graph [Kleinberg] • Subdue searches based on structure • Use as foundation of a structural search engine • Incorporation of WordNet allows for synonym match
T1 C1 S1 T2 T3 T4 S2 S3 S4 SUBDUE • Discovers structural patterns in input graphs • A substructure is connected subgraph • An instance of a substructure is a subgraph that is isomorphic to substructure definition • Pattern discovery, classification, clustering Input Database Substructure S1 (graph form) Compressed Database triangle shape C1 S1 object R1 R1 on square S1 S1 S1 shape object
Subdue Algorithm • Start with individual vertices • Keep only best substructures on queue • Expand substructure by adding edge/vertex • Compress graph and repeat to generate hierarchical description • Optional use of background knowledge
Inexact Graph Match • Some variations may occur between instances • Want to abstract over minor differences • Difference = cost of transforming one graph to make it isomorphic to another • Match if cost/size < threshold
Application Domains • Protein data • Human Genome DNA data • Spatial-temporal domains • Earthquake data • Aircraft Safety and Reporting System • Telecommunications data • Program source code • Web data
page Represent Web as Graph • Breadth-first search of domain to generate graph • Nodes represent pages / documents • Edges represent hyperlinks • Additional nodes represent document keywords subdue texas projects word word university work hyperlink page parallel group learning robotics planning
Instructor Postscript | PDF http http Teaching Robotics Research Robotics Publication Robotics WebSubdue’s Structural Search • Formulate query as graph • Use Subdue’s predefined substructure option to search for instances of query
Query: Find all pages which link to a page containing term ‘Subdue’ • Subgraph vertices: • 1 page • URL: http://cygnus.uta.edu • 7 page • URL: http://cygnus.uta.edu/projects.html • Subdue • [1->7] hyperlink • [7->8] word Subdue word hyperlink page page /* Vertex ID Label */ s v 1 page v 2 page v 3 Subdue /* Edge Vertex 1 Vertex 2 Label */ d 1 2 hyperlink d 2 3 word
WebSubdue 22 instances AltaVista Query “host:www-cse.uta.edu AND image:next_motif.gif AND image:up_motif.gif AND image:previous_motif.gif.” 12 instances Search for Presentation Pages page hyperlink hyperlink hyperlink page page page hyperlink hyperlink
Search for Reference Pages page • Search for page with at least 35 in links • WebSubdue found 5 pages in www-cse • AltaVista cannot perform this type of search hyperlink hyperlink hyperlink … page page page
Inclusion of WordNet • When generating graph • Use common stopword list • When searching for subgraph instances • Morphology functions • October = Oct • teaching = teach • Synsets • Optional allowance of synonyms
Search for pages on ‘jobs in computer science’ • Inexact match: allow one level of synonyms • WebSubdue found 33 matches • Words include employment, work, job, problem, task • AltaVista found 2 matches page word word word jobs computer science
WebSubdue found 3 hub (and 3 authority) pages AltaVista cannot perform this type of search Inexact match applied with threshold = 0.2 (4.2 transformations allowed) WebSubdue found 13 matches page page page HUBS hyperlink page page page word word word AUTHORITIES algorithms algorithms algorithms Search for ‘authority’ hub and authority pages
word page box Subdue Learning from Web Data • Distinguish professors’ and students’ web pages • Learned concept (professors have “box” in address field) • Distinguish online stores and professors’ web pages • Learned concept (stores have more levels in graph) page page page page page page page
Conclusions • WebSubdue can be used to search for structural web data • Could be enhanced with additional WordNet features such as synset path length • Efficient structural search necessary for future of web search tools
To Learn More cygnus.uta.edu/subdue cook@cse.uta.edu http://www-cse.uta.edu/~cook