Reducing Human Interactions in Web Directory Searches
• ORI GERSTEL - Cisco
• SHAY KUTTEN - Technion
• EDUARDO SANY LABER - PUC-Rio
• RACHEL MATICHIN and DAVID PELEG - The Weizmann Institute of Science
• ARTUR ALVES PESSOA - UFF
• CRISTON SOUZA - PUC-Rio
The big problem Find information in a large collection of data. There are two basic ways to handle information finding. One is a "flat" approach, which views the information as a nonhierarchical structure and provides a query language to extract the relevant data from the database; an example of this approach on the web is the Google search engine. The other method is based on a hierarchical index to the database according to a taxonomy of categories; an example of such an index on the web is Yahoo.
The "small" problem Create an optimal set of hotlinks for the hierarchical approach. In this approach it is necessary to traverse a path in the taxonomy tree from the root to the desired node. Human engineering considerations further aggravate this problem, since it is very hard to choose an item from a long list (typical convenient sizes are 7–10 items). Thus, the degree of the taxonomy tree should be rather low and its average depth, therefore, high. Another problem in the hierarchical approach is that the depth of an item in the taxonomy tree is not based on its access pattern. As a result, items with very high access frequency may require long access paths each time they are needed, while "unpopular" items may still be very accessible. It is desirable to find a solution that does not change the taxonomy tree itself, since this taxonomy is likely to be meaningful and useful to the user.
The solution A partial solution to this problem is currently used on the web: a list of "hot" pointers that appears at the top level of the index tree and leads directly to the most popular items. We refer to a link from a hotlist to its destination as a hotlink. This approach is not scalable, in the sense that only a small number of items can appear in such a list. The current article studies a generalization of this "hotlist" approach that allows such lists at multiple levels of the index tree, not just the top level. The resulting structure is termed a hotlink-enhanced index structure (or enhanced structure for short). The article also addresses the optimization problem faced by the index designer, namely, to find a set of hotlinks that minimizes the expected number of links (either tree edges or hotlinks) traversed by a user from the root to a leaf.
The solution cont... There are many applications for such hotlink-enhanced index structures. A partial list includes:
• a web index (such as Yahoo), with multilevel hotlists in which the access statistics are influenced by the accesses of all users to various sites;
• a personalized web index, in which the browser records personalized statistics;
• large library index systems;
• application menu trees, which are currently designed by the application developer or statically customized to the needs of a user; by adding hotlists, the application can learn the usage pattern of the user and adjust to changes in this pattern;
• file systems, in which the static tree structure of the file system can be augmented by links to frequently accessed subdirectories or files.
Assumptions The model considered here assumes that the user has no knowledge of the enhanced index structure, which forces the user to follow a greedy strategy. In this user model, the user starts at the root of the tree and, at each node in the enhanced structure, can infer from the link labels which of the tree edges or hotlinks available at the current page leads to the page that is closest, in the original tree structure, to the desired destination. The user always chooses that link. In other words, the user is not aware of hotlinks that may exist at other pages of the tree; the user's estimate of the quality of a tree edge or hotlink is based solely on the position of its endpoint in the original tree.
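The greedy user model above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's code: the tree is given as hypothetical parent/children dicts, hotlinks as a dict mapping a node to the list of its hotlink destinations, and the user repeatedly takes the link whose endpoint lies on the original root-to-target path and is deepest.

```python
def depth(parent, v):
    """Depth of node v in the original tree (root has depth 0)."""
    d = 0
    while parent[v] is not None:
        v = parent[v]
        d += 1
    return d

def is_ancestor(parent, a, v):
    """True if a is an ancestor of v (or a == v) in the original tree."""
    while v is not None:
        if v == a:
            return True
        v = parent[v]
    return False

def greedy_path(parent, children, hotlinks, root, target):
    """Greedy user: at each node, follow the tree edge or hotlink whose
    endpoint is the deepest ancestor of the target in the ORIGINAL tree."""
    path = [root]
    node = root
    while node != target:
        options = children.get(node, []) + hotlinks.get(node, [])
        # Keep only links that lead toward the target in the original tree.
        toward = [u for u in options if is_ancestor(parent, u, target)]
        # Greedy choice: the endpoint closest to the target, i.e. the deepest.
        node = max(toward, key=lambda u: depth(parent, u))
        path.append(node)
    return path
```

On a small example tree r → a → c → x with a hotlink r → x, the greedy path to x uses the hotlink directly, whereas without hotlinks the user walks the full tree path.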
The algorithm Given a tree T with n nodes representing an index, a hotlink is an edge that does not belong to the tree. The hotlink starts at some node v and ends at (or leads to) some node u that is a descendant of v. We assume without loss of generality that u is not a child of v. Each leaf x of T has a weight p(x), representing the proportion of the user's visits to that leaf compared with the total set of user visits. Hence, if normalized, p(x) can be interpreted as the probability that a user wants to access the leaf x. Another parameter of the problem is an integer K, specifying an upper bound on the number of hotlinks that may start at any given node (there is no a priori limit on the number of hotlinks that lead to a given node).
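The constraints on a single hotlink can be made concrete with a small validity check. This is a hypothetical helper, not from the paper; it assumes the same parent-pointer representation as above and verifies the three conditions: u is a proper descendant of v, u is not a child of v, and at most K hotlinks leave v.

```python
def is_proper_descendant(parent, u, v):
    """True if u is a descendant of v with u != v, walking parent pointers."""
    node = parent.get(u)
    while node is not None:
        if node == v:
            return True
        node = parent.get(node)
    return False

def can_add_hotlink(parent, hotlinks, v, u, K):
    """Check whether hotlink (v, u) respects the constraints stated above."""
    if not is_proper_descendant(parent, u, v):
        return False                        # hotlinks only point to descendants
    if parent.get(u) == v:
        return False                        # a hotlink to a child is redundant
    return len(hotlinks.get(v, [])) < K     # at most K hotlinks may leave v
```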
The algorithm Let S be a set of hotlinks constructed on the tree (obeying the bound of K outgoing hotlinks per node) and let D_S(v) denote the greedy path (including hotlinks) from the root to node v. The expected number of operations needed to reach an item is

f(T, p, S) = Σ over leaves x of T of p(x) · |D_S(x)|.

The problem of optimizing this function is referred to as the hotlink enhancement problem. Two different static problems arise, according to whether the probability distribution p is known to us in advance. Assuming a known distribution, our goal is to find a set of hotlinks S which minimizes f(T, p, S) and achieves the optimal cost

f*(T, p) = min over S of f(T, p, S).

Such a set is termed an optimal set of hotlinks.
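The objective f(T, p, S) is a straightforward weighted average; a minimal sketch, assuming the leaf weights p(x) and the greedy path lengths |D_S(x)| have already been computed (e.g. by the greedy user model):

```python
def expected_cost(weights, path_lengths):
    """f(T, p, S) = sum over leaves x of p(x) * |D_S(x)|,
    with the weights normalized into a probability distribution."""
    total = sum(weights.values())
    return sum(weights[x] * path_lengths[x] for x in weights) / total
```

For example, with two leaves of weights 3 and 1 reached over 1 and 2 links respectively, the expected cost is (3·1 + 1·2)/4 = 1.25 links.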
The algorithm cont... On the other hand, under the unknown distribution assumption, the worst-case expected access cost for a tree T with a set of hotlinks S is

f(T, S) = max over distributions p of f(T, p, S),

and our goal is to find a set of hotlinks S minimizing f(T, S) and achieving the optimal cost

f*(T) = min over S of f(T, S).
The algorithm cont... The algorithm uses dynamic programming together with the greedy-user assumption to limit the search space. The authors also show how to generalize the solution to trees of arbitrary degree and to hotlink enhancement schemes that allow up to K hotlinks per node; the algorithm can thus be used for trees with unbounded degree. For an input tree T with n nodes, the running time and space requirements grow with the height of the tree; thus, the algorithm runs in polynomial time for trees of logarithmic depth.
Experiments The motivation is twofold: 1. to show that adding hotlinks to websites indeed provides considerable gains in terms of the user's expected path length; 2. to argue that the algorithm proposed in this work for the hotlink enhancement problem (with known probabilities) allows a practical implementation. All of our experiments were executed on a Xeon 2.4 GHz machine with 1GB of RAM. In addition, we allowed at most one hotlink leaving each node in all experiments.
Results In order to have a clear idea of the improvement produced by an algorithm, we present our results in terms of the gain metric. The gain of an algorithm for an instance (T, p) is given by (f(T, p, ∅) − f(T, p, S)) / f(T, p, ∅), where S is the set of hotlinks the algorithm adds to the tree T. Hotlink Assignment to the puc-rio.br Domain. Of special interest in the instance obtained from PUC's domain is the fact that it has actual access probabilities. We note that this instance is one of the biggest we tested, with 17,379 nodes and H = 8. The optimal solution obtained by the algorithm provides a gain of 18.62%. The algorithm executed in less than 2 seconds and consumed around 10MB of memory.
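The gain metric is the relative reduction in expected access cost; a minimal sketch with hypothetical cost values (the paper reports only the resulting percentage, not the underlying costs):

```python
def gain(cost_without, cost_with):
    """Gain of a hotlink assignment: (f(T, p, empty) - f(T, p, S)) / f(T, p, empty)."""
    return (cost_without - cost_with) / cost_without
```

So if hotlinks reduce the expected path length from, say, 4.0 links to 3.0 links, the gain is 0.25, i.e. 25%.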
Results Hotlink Assignment to Actual Websites with Zipf's Distribution. A critical aspect of the algorithm is its memory usage for the dynamic programming table. With 23MB, the algorithm provided the optimal solution for 82 of the 84 instances. The 2 hard instances (not solved with 23MB) were those with the greatest heights. One of the hard instances has 10,484 nodes and height 16, requiring 488MB. The other has 512,484 nodes and height 14, requiring more than 1GB.
Conclusion We have presented new exact algorithms for several variations of the problem of assigning hotlinks to hierarchically arranged data, such as web directories. For most of these variations, we have proved that the proposed algorithms run efficiently from a theoretical point of view; for one of them, the running time is polynomial if the depth of the tree is logarithmic. We have run experiments to evaluate both the efficiency and the efficacy of the algorithm that solves the problem for known distributions with at most one hotlink leaving each node. These experiments show that a significant improvement in the expected number of accesses per search can be achieved in websites using this algorithm. In addition, the proposed algorithm consumed a reasonable amount of computational resources to obtain optimum hotlink assignments.