How to Cha-Cha: Looking under the hood of the Cha-Cha Intranet Search Engine. Marti Hearst, SIMS SIMposium, April 21, 1999
This Talk • Overview of goals • System implementation details • Not: • UI evaluation • related work • etc.
People • Principals: Mike Chen and Marti Hearst • Early coding: Jason Hong • Early UI evaluation: Jimmy Lin, Mike Chen • Current UI evaluation: Shiang-Ling Chen
Cha-Cha Goals • Better Intranet search • integrate searching and browsing • provide context for search results • familiarize users with the site structure • UI • minimal browser requirement • widely usable HTML interface • build on user familiarity with existing systems
Intranet Search • Documents used in a large, diverse Intranet, e.g., • University.edu • Corporation.com • Government.gov • Hypothesis: It is meaningful to group search results according to organizational structure
Cha-Cha and Source Selection • Shows available sources • Sources are major web sites • User may want to navigate the source rather than go directly to the search hits • Gives hints about relative importance of various sources • Reveals the structure of the site while tightly integrating this structure with search • Users tell us anecdotally that the outline view is useful for finding starting points
System Overview • Collect shortest paths for each page • global paths: from the root of the domain • local paths: from the root of the server • Select “the best” path based on the query • User interaction with the system: 1. the user sends a query to Cha-Cha; 2. Cha-Cha passes the query to Cheshire; 3. Cheshire returns hits; 4. Cha-Cha selects paths & generates HTML; 5. the HTML results are returned to the user
Current Status • Over 200,000 pages indexed • About 2500 queries/weekday • Less than 3 sec/query on average • Five subdomains using it as site search engine • eecs • millennium project • sims • law • career center
Overview of Cha-Cha Preprocessing • Crawl entire Intranet • Store copies of pages locally • 200,000 pages on the UCB Intranet • Revisit all the pages (this time on disk) • Create metadata for each page • Compute the shortest hyperlink path from a designated root page to every web page • both global and local paths • Index all the pages • Using Cheshire II (Ray Larson, SIMS) • Index full text, titles, and shortest paths separately
Web Crawling Algorithm • Start with a list of servers to crawl • for UCB, simply start with www.berkeley.edu • Restrict the crawl to certain domain(s) • *.berkeley.edu • Obey the Robots Exclusion standard • Follow hyperlinks only • do not read local filesystems • links are placed on a queue • traversal is breadth-first
Web Crawling Algorithm (cont.) • Interpret the HTML on each web page • Record the text of the page in a file on disk • Make a list of all the pages that this page links to (outlinks) • Follow those links one at a time, repeating this procedure for each page found, until no unexplored pages are left • links are placed on a queue • traversal is breadth-first • URLs that have been crawled are stored in a hash table in memory, to avoid repeats (sketched below)
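A minimal sketch of this crawl loop in Python (illustrative only; fetch_page and extract_links stand in for the real HTTP and HTML-parsing code, and save_text_to_disk is a hypothetical page store):

    from collections import deque
    from urllib.parse import urlparse
    import hashlib, os

    def save_text_to_disk(url, html, outdir="pages"):
        # hypothetical stand-in for Cha-Cha's on-disk page store
        os.makedirs(outdir, exist_ok=True)
        name = hashlib.sha1(url.encode()).hexdigest()
        with open(os.path.join(outdir, name + ".html"), "w", encoding="utf-8") as f:
            f.write(html)

    def crawl(start_url, allowed_domain, fetch_page, extract_links):
        queue = deque([start_url])   # links are placed on a queue
        seen = {start_url}           # hash table of crawled URLs, to avoid repeats
        while queue:
            url = queue.popleft()    # FIFO queue gives breadth-first traversal
            html = fetch_page(url)
            save_text_to_disk(url, html)
            for link in extract_links(html, url):
                host = urlparse(link).hostname or ""
                # restrict the crawl to the target domain, e.g. *.berkeley.edu
                if host.endswith(allowed_domain) and link not in seen:
                    seen.add(link)
                    queue.append(link)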
Custom Web Crawler • Special considerations • full coverage • web search engines don’t go very deep • web search engines skip problematic sites • search on “Berdahl” at Snap: 430 hits • search on “Berdahl” on Cha-Cha: XXX hits • Solution • tag each URL with a retry counter • if a server is down, put the URL at the end of the queue and decrement its retry counter • if the counter reaches 0, give up on the URL
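The retry counter could look like this (a sketch under the assumptions above; queue entries are (url, retries_left) pairs, and fetch_page is assumed to raise IOError when a server is down):

    from collections import deque

    def fetch_all(queue, fetch_page):
        # queue: deque of (url, retries_left) pairs
        # yields (url, html) for each successful fetch
        while queue:
            url, retries_left = queue.popleft()
            try:
                yield url, fetch_page(url)
            except IOError:                # server down or unreachable
                if retries_left > 0:
                    # back of the queue, with the retry counter decremented
                    queue.append((url, retries_left - 1))
                # when the counter reaches 0, the URL is simply dropped

    # usage: for url, html in fetch_all(deque([(u, 3) for u in starts]), fetch_page): ...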
Custom Web Crawler • Special considerations • servers with multiple names • info.berkeley.edu == www.sims.berkeley.edu • solution: • hash the home page of the server into a table • whenever a new server is found, compare its homepage to those in the table • if a duplicate, record the new server’s name as being the same as the original server’s
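A sketch of the home-page-hashing idea (the data structures are illustrative, not the production code):

    import hashlib

    canonical_by_hash = {}   # home-page hash -> canonical server name
    alias_of = {}            # duplicate server name -> canonical server name

    def register_server(server, fetch_page):
        home = fetch_page("http://" + server + "/")
        digest = hashlib.md5(home.encode()).hexdigest()
        if digest in canonical_by_hash:
            # same home page as a known server
            # (e.g. info.berkeley.edu == www.sims.berkeley.edu)
            alias_of[server] = canonical_by_hash[digest]
        else:
            canonical_by_hash[digest] = server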
Cha-Cha Metadata • Information about web pages • Title • Length • Inlinks • Outlinks • Shortest paths from a root home page
Metafile Generator • Main task: find shortest path information • Two passes: global and local • Global pass: • start with the main home page H (www.berkeley.edu) • find the shortest path from H to every page in the system • for each page, keep track of how far it is from H • also keep track of the path that got you there • store this information in a disk-based storage manager (we use Sleepycat's Berkeley DB) • if a page is re-encountered via a path with a shorter distance, record that distance and the new path • when this is done, write out a metafile for each page
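The core of the global pass is an ordinary breadth-first search; a sketch, with an in-memory dict standing in for the Berkeley DB store:

    from collections import deque

    def global_paths(root, outlinks):
        # outlinks(url) returns the outlink list recorded for the page during the crawl
        info = {root: (0, [root])}    # url -> (distance from root, shortest path)
        queue = deque([root])
        while queue:
            url = queue.popleft()
            dist, path = info[url]
            for out in outlinks(url):
                # breadth-first order means the first visit already follows a
                # shortest path, so later, shorter re-encounters cannot occur
                if out not in info:
                    info[out] = (dist + 1, path + [out])
                    queue.append(out)
        return info                   # one metafile is then written per entry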
Metafile Generator (cont.) • Local pass: • start with a list of all the servers found during the crawl • for each server S • find shortest path from S to every page in the system • do this the same way as in the global pass but store the results in a different database • when done, write out a metafile for each page, in a different directory than for the global pass
Metafile Generator (cont.) • Combine local and global path information • Purpose: locality should “trump” global paths, but not all local pages are reachable locally • example: the shortest path from www.berkeley.edu to www.sims.berkeley.edu/~hearst is: www.berkeley.edu -> search.berkeley.edu -> cha-cha.berkeley.edu -> www.sims.berkeley.edu/~hearst • but we want my home page to be under the SIMS faculty listing • solution: let local trump global
Metafile Generator (cont.) • Combine local and global path information • How to do it: • go through the metafiles in the global directory • for each metafile • if there already is a metafile for that URL in the local directory, skip this metafile • otherwise (there is no metafile for this URL locally), copy the metafile into the local directory • Why not just use local metafiles? • some pages are not linked to within their own domain • e.g., a student association hosted within a particular student’s domain
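The combine step itself is small; a sketch assuming one metafile per URL-derived filename in each directory:

    import os, shutil

    def merge_metafiles(global_dir, local_dir):
        for name in os.listdir(global_dir):
            target = os.path.join(local_dir, name)
            if not os.path.exists(target):
                # no local metafile exists: the page is not linked to within
                # its own domain, so fall back to the global path
                shutil.copy(os.path.join(global_dir, name), target)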
Sample Cha-Cha Metadata file

<METAFILE>
<Url>http://www.sims.berkeley.edu/</Url>
<Title>Welcome to SIMS</Title>
<Date>null</Date>
<Size>4865</Size>
<!-- INLINKS -->
<InlinkCount>1</InlinkCount>
<Inlinks>http://www-resources.berkeley.edu/nhpteaching/</Inlinks>
<!-- OUTLINKS -->
<OutlinkCount>21</OutlinkCount>
<Outlinks>http://www.sims.berkeley.edu/about.html
http://www.sims.berkeley.edu/search.html
http://www.sims.berkeley.edu/events/conferences/
http://www.sims.berkeley.edu/resources/sites.html
http://www.sims.berkeley.edu/people/masters.html
Cha-Cha Metadata File, cont.

<!-- SHORTEST_PATHS -->
<Depth>2</Depth>
<ShortestPathsCount>1</ShortestPathsCount>
<ShortestPaths>Welcome to UC Berkeley
http://www.berkeley.edu/
UC Berkeley Teaching Units
http://www-resources.berkeley.edu/nhpteaching/
</ShortestPaths>
<!-- MIRROR URLS -->
<MirrorCount>0</MirrorCount>
<!-- DATA_FILE -->
<File>/projects/cha-cha/development/data/done/text/www.sims.berkeley.edu/index.html</File>
</METAFILE>
CHESHIRE II • Search back-end for Cha-Cha • Ray Larson et al. ASIS 95, JASIS 96 • CHESHIRE II system: • Full Service Full Text Search • Client/Server architecture • Z39.50 IR protocol • Interprets documents written in SGML • Probabilistic Ranking • Flexible data representation
CHESHIRE II (cont.) • A big advantage of Cheshire: • don’t have to write a special parser for special document types • instead, simply create one DTD and the system takes care of parsing the metafiles for us • A related advantage: • can create indexes on individual components of the document • allows efficient title search, home page search, domain-based search, without extra programming
Cha-Cha Document Type Definition

<!SGML "ISO 8879:1986" -- --
CHARSET
BASESET "ISO 646:1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
DESCSET
  0    9   UNUSED
  9    2   9
  11   2   UNUSED
  13   1   13
  14   18  UNUSED
  32   95  32
  127  1   UNUSED
BASESET "ISO Registration Number 100//CHARSET ECMA-94 Right Part of Latin Alphabet Nr. 1//ESC 2/13 4/1"
DESCSET
  128  32  UNUSED
  160  95  32
  255  1   UNUSED
Cha-Cha DTD, cont. (parts omitted)

<!doctype METADATA [
<!-- This is a DTD for metadata records extracted from the HTML files
     in the cha-cha system. The tagging is simple with nothing particular
     about it. The structure has been kept flat within the individual
     records. The only somewhat interesting thing is the TEXT-REF tag
     which is used to contain a reference to the full text of the entry
     stored in raw HTML form. -->
<!ELEMENT METADATA o o (METAFILE*)>
Cha-Cha DTD, cont. (parts omitted)

<!-- We allow most elements to occur any number of times in any order -->
<!-- this is because there is little consistency in the actual usage. -->
<!ELEMENT METAFILE - - (URL, TITLE, DATE, SIZE,
    INLINKCOUNT, INLINKS, OUTLINKCOUNT, OUTLINKS,
    DEPTH?, SHORTESTPATHSCOUNT?, SHORTESTPATHS?,
    MIRRORCOUNT?, MIRRORURLS?, TYPE?, DOMAIN?, FILE?)>
<!-- We won't make any assumptions about content... all PCDATA -->
<!ELEMENT URL - o (#PCDATA)>
<!ELEMENT DATE - o (#PCDATA)>
<!ELEMENT TITLE - o (#PCDATA)>
<!ELEMENT SIZE - o (#PCDATA)>
<!ELEMENT INLINKCOUNT - o (#PCDATA)>
<!ELEMENT INLINKS - o (#PCDATA)>
<!ELEMENT OUTLINKCOUNT - o (#PCDATA)>
<!ELEMENT OUTLINKS - o (#PCDATA)>
<!ELEMENT DEPTH - o (#PCDATA)>
<!ELEMENT SHORTESTPATHSCOUNT - o (#PCDATA)>
<!ELEMENT SHORTESTPATHS - o (#PCDATA)>
Responding to the User Query • User searches on “pam samuelson” • Search Engine looks up documents indexed with one or both terms in its inverted index • Search Engine looks up titles and shortest paths in the metadata index • User Interface combines the information and presents the results as HTML
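In outline, that flow might look like this (function names are hypothetical; the real back end is Cheshire II reached over Z39.50):

    def answer_query(terms, text_index, meta_index, render_outline_html):
        hits = text_index.search(terms)       # ranked URLs from the inverted index
        results = []
        for url in hits:
            meta = meta_index.lookup(url)     # title and shortest paths for the page
            results.append((url, meta["title"], meta["shortest_paths"]))
        return render_outline_html(results)   # the UI combines everything into HTML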
Building the Outline View • Main issue: how to combine shortest paths • There are approximately three shortest paths per web page • We assume users do not want to see the same page multiple times • Strategy: • Group hits together within the hierarchy • Try to avoid showing subhierarchies with singleton hits • This assumption is based in part on evidence from our earlier clustering research that relevant documents tend to cluster near one another
Building the Outline View (cont.) • Goals of the algorithm: • (i) Group (recursively) as many pages together within a subhierarchy as possible • avoid (recursively) branches that terminate in only one hit (leaf) • (ii) Remove as many internal nodes as possible while still retaining at least one valid path to every leaf • (iii) Remove as many edges as possible while retaining at least one path to every leaf
Building the Outline View (cont.) • To achieve these goals we need a non-standard graph algorithm • To do it properly, every possible subset of nodes at depth D should be considered to determine the minimal subset that covers all nodes at depth D+1 • This is inefficient: it would require 2^k checks for k nodes at depth D • Instead, we use a heuristic approach that approximates the optimal result
Building the Outline View (cont.) • First, a top-down pass • record the depth of each node and the number of children it links to directly • Second, a bottom-up pass • identify the deepest nodes (the leaves) • let P be the set of nodes that are parents of leaves • sort P in ascending order by how many active children each node links to at the level below • a node is active if it has not been eliminated
Building the Outline View (cont.) • Bottom-up pass, continued • every node is a candidate to be eliminated • nodes with the fewest children are eliminated first • because of goal (i) • for each candidate C, if C links to one or more active nodes at depth D+1 that are not covered by any other active node, then C cannot be eliminated; otherwise, C is removed from the active list (sketched below) • After level D is processed, no active node at depth D covers only nodes that are also covered by another active node at depth D
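One way to realize the elimination rule for a single level (a sketch; the graph representation is illustrative):

    def prune_level(parents, children_of, active):
        # parents: active nodes at depth D; children_of[n]: set of n's children at D+1
        # candidates with the fewest active children are tried first (goal i)
        for node in sorted(parents, key=lambda n: len(children_of[n] & active)):
            needed = False
            for child in children_of[node] & active:
                covered_elsewhere = any(p != node and p in active and
                                        child in children_of[p] for p in parents)
                if not covered_elsewhere:
                    needed = True      # this node is the only active cover for child
                    break
            if not needed:
                active.discard(node)   # every child is covered by another node
        return active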
Building the Outline View (cont.) • Retaining rank ordering • Build the tree by first inserting the highest-ranked hit (leaf) • As more leaves are added, more parts of the hierarchy are added, but the order in which the parts were added is retained (see the sketch below) • When the hierarchy has been built, it is traversed to create the HTML listing
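A compact way to get this behavior (a sketch that leans on Python dicts preserving insertion order):

    def build_outline(ranked_paths):
        # ranked_paths: the shortest path for each hit, highest-ranked hit first
        root = {}
        for path in ranked_paths:
            node = root
            for step in path:
                # new branches are appended where first needed; branches created
                # by earlier (higher-ranked) hits keep their position
                node = node.setdefault(step, {})
        return root   # traverse in insertion order to emit the HTML listing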
Summary • Better user interfaces for search should: • help users understand starting points/sources • place search results into an organizing context • One (of many) approaches • Cha-Cha: simultaneously browse and search, with intranet site context • Future work • special handling for short queries • spelling-correction suggestions • smarter paths