Memex: A Browsing Assistant for Collaborative Archiving and Mining of Surf Trails. Soumen Chakrabarti, Sandeep Srivastava, Mallela Subramanyam, Mitul Tiwari. Indian Institute of Technology Bombay.
Sources of Web information
• Sources already exploited
  • Text on pages (keyword search)
  • Links between pages (popularity rating)
  • Topic taxonomies (query expansion)
• Sources not exploited enough yet
  • Public surfing history
  • Public bookmarks
• Collaboration is central to hypertext
• Lack of trust limits collaboration on the Web
Our goals
• Infrastructure to support spontaneous formation of topic-based collaborative Web communities
  • Browsing assistant client
  • Community server
• Mining algorithms for personal and community level topic management and collaborative resource discovery
• Extensible API for plugging in additional hypertext analysis tools
1: Create a Memex account (password sent by email)
2: Install the Memex applet signing certificate and visit the applet page
3: Allow the Memex client to attach to your Web browser
4: Log on to the Memex server
Memex client applet attaches to browser (screenshot callouts: function tabs, privacy choice)
Preparing to import initial bookmarks
Bookmarks imported
For Memex to suggest an initial topic organization, select all bookmarks…
…and send them to the clustering tab
Switch to the clustering tab; URLs to be clustered appear here
Submit the URLs to the server-side Memex clustering demon
Check later if the server has completed the clustering task
Two top-level clusters about software and music
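The slides do not say which algorithm the server-side clustering demon uses. As a minimal sketch, assuming each bookmark arrives with some page text, a bottom-up clustering over bag-of-words cosine similarity could yield top-level groups like the two above. The URLs, page texts, and function names here are hypothetical.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts for one page (whitespace tokenizer, illustrative only)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster(pages, k):
    """Repeatedly merge the two most similar clusters until k remain.

    Each cluster is (list_of_urls, summed_term_vector), so similarity is
    measured between cluster centroids. This is an assumed stand-in, not
    the algorithm the Memex server is known to use.
    """
    clusters = [([url], vectorize(text)) for url, text in pages.items()]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = cosine(clusters[i][1], clusters[j][1])
                if best is None or s > best[0]:
                    best = (s, i, j)
        _, i, j = best
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return [sorted(urls) for urls, _ in clusters]

# Hypothetical bookmarks with made-up page text.
pages = {
    "http://example.org/gcc": "compiler software source code linux",
    "http://example.org/emacs": "editor software source code lisp",
    "http://example.org/jazz": "music jazz album artist",
    "http://example.org/opera": "music opera artist concert",
}
groups = cluster(pages, 2)
```

With this toy input the two software pages end up in one group and the two music pages in the other; real bookmarks would need better tokenization and term weighting.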
Expanding the software cluster to study it in more detail
User can freely reorganize URL placement using cut-and-paste
Moving an entire folder from the cluster tab…
…to the folder tab together with example URLs
Folder names can be edited as per taste; this also gives Memex additional clues about the folder’s contents
New folders can be created to hold clusters found in the cluster tab
A topic hierarchy which is too detailed for the user can be flattened
Groups of closely related URLs can be moved back to folders in the folder tab
Memex helps the user derive a starting topic hierarchy from unstructured bookmarks
The user then continues browsing in multiple sessions. Relevant pages found by other members of the community and made public are available for collaborative surfing
If permission is granted, the Memex applet monitors the trail that the surfer follows and uploads it to the server for further analysis and mining
Such surf trails together with page contents are valuable inputs to the Memex server-side hypertext mining and resource discovery demons
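The slides do not describe the client's trail format. As a sketch under that caveat, a recorder could buffer page visits only while the user's privacy choice allows it and hand batches to an upload callback. `TrailRecorder`, its fields, and the `upload` callback are hypothetical names, not part of the Memex client.

```python
from dataclasses import dataclass, field

@dataclass
class TrailRecorder:
    share_trail: bool = False                     # the user's privacy choice
    pending: list = field(default_factory=list)   # visits not yet uploaded

    def on_visit(self, url, referrer=None):
        """Record one step of the surf trail, but only if permission is granted."""
        if self.share_trail:
            self.pending.append({"url": url, "referrer": referrer})

    def flush(self, upload):
        """Pass buffered visits to `upload` (an assumed callback) and clear the buffer."""
        batch, self.pending = self.pending, []
        if batch:
            upload(batch)
        return len(batch)

recorder = TrailRecorder()
recorder.on_visit("http://example.org/private")   # ignored: sharing is off
recorder.share_trail = True
recorder.on_visit("http://example.org/b", referrer="http://example.org/a")
uploaded = []
sent = recorder.flush(uploaded.extend)
```

Keeping the referrer with each visit preserves the link structure of the trail, which is what distinguishes a trail from a flat history list.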
In the background, the Memex classifier finds the most suitable folder for each history item. History is never deleted (disk is cheap). When the user refreshes the view, surf history from others and from herself appears categorized into the user’s familiar topic tree. ‘?’ indicates that Memex is not sure about the folder assignment. Users can easily correct mistakes, and each correction becomes additional valuable training data.
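The slides do not name the classifier. As an assumed stand-in, a multinomial naive Bayes model with add-one smoothing can assign a page to a folder and raise the ‘?’ flag when the best folder's posterior falls below a threshold. The class name, the threshold value, and the training texts are hypothetical.

```python
import math
from collections import Counter, defaultdict

class FolderClassifier:
    def __init__(self, unsure_below=0.8):
        self.term_counts = defaultdict(Counter)   # folder -> term frequencies
        self.doc_counts = Counter()               # folder -> number of training pages
        self.vocab = set()
        self.unsure_below = unsure_below          # assumed confidence threshold

    def train(self, folder, text):
        """Add one labeled page; user corrections can be fed back through this."""
        terms = text.lower().split()
        self.term_counts[folder].update(terms)
        self.doc_counts[folder] += 1
        self.vocab.update(terms)

    def classify(self, text):
        """Return (best_folder, unsure), where unsure corresponds to the '?' marker."""
        terms = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for folder, counts in self.term_counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[folder] / total_docs)
            for t in terms:
                score += math.log((counts[t] + 1) / denom)
            log_scores[folder] = score
        # Normalize in a numerically safe way to get a posterior for the best folder.
        top = max(log_scores.values())
        weights = {f: math.exp(s - top) for f, s in log_scores.items()}
        best = max(weights, key=weights.get)
        confidence = weights[best] / sum(weights.values())
        return best, confidence < self.unsure_below

clf = FolderClassifier()
clf.train("Software", "linux compiler source code")
clf.train("Music", "jazz album artist concert")
confident = clf.classify("linux compiler source code")
doubtful = clf.classify("weather forecast")
```

A page made of unseen terms scores equally in every folder, so it comes back flagged as unsure; calling `train` with the user's corrected folder is how a correction would become training data in this sketch.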
Automatic collaborative classification also lets users return to a topic-restricted surfing context quickly, and replay the last few surfing actions within that topic of interest.
Personalized topic-based history management is far superior to the one-dimensional history list provided by popular browsers
Users can switch topics with a single click, and browsing is not limited by the linear “back and forward” paradigm supported by browsers.
A flexible interactive search lets the user locate any page ever visited from anywhere using this account, combining content with popularity, site selections and timeliness
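How Memex combines these signals is not stated in the slides. The sketch below is one hypothetical way to do it: a term-frequency content score, boosted by the log of the visitor count and by an exponentially decaying freshness factor, with an optional site restriction. The `Visit` fields, the formula, and the 30-day half-life are all assumptions.

```python
import math
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass
class Visit:
    url: str
    text: str         # page content as archived
    visitors: int     # how many community members visited (assumed field)
    age_days: float   # time since the last visit (assumed field)

def score(visit, query, half_life_days=30.0):
    """Content match, boosted by popularity and timeliness (assumed formula)."""
    terms = query.lower().split()
    words = visit.text.lower().split()
    content = sum(words.count(t) for t in terms) / (len(words) or 1)
    popularity = math.log1p(visit.visitors)
    freshness = 0.5 ** (visit.age_days / half_life_days)
    return content * (1.0 + popularity) * (1.0 + freshness)

def search(visits, query, site=None):
    """Rank matching visits, optionally restricted to hosts ending in `site`."""
    scored = []
    for v in visits:
        if site and not urlparse(v.url).netloc.endswith(site):
            continue
        s = score(v, query)
        if s > 0:
            scored.append((s, v))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [v for _, v in scored]

# Hypothetical archive entries.
visits = [
    Visit("http://a.example.org/x", "java applet security", visitors=1, age_days=300),
    Visit("http://b.example.org/y", "java applet signing", visitors=10, age_days=1),
    Visit("http://c.example.org/z", "underwater hockey", visitors=5, age_days=2),
]
ranked = [v.url for v in search(visits, "java applet")]
only_a = [v.url for v in search(visits, "java applet", site="a.example.org")]
```

Multiplying the factors means a page with no content match never surfaces on popularity alone; an additive blend would be an equally plausible design.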
Close integration of the Memex client with the browser is non-trivial to implement but adds greatly to comfort and ease of use
Memex system diagram (figure not reproduced). Labels: browser; visit; attach; download of the client JAR; running client applet; Memex server; event-handler servlets; search, recommendation, folder and context functions; mining demons (taxonomy synthesis, resource discovery, classification, clustering); archive (relational metadata, text index, topic models). Caption: Memex client-server protocol and workload sharing negotiations.
Document workflow (figure not reproduced). Labels: browser; Memex client; page visit and bookmarking events logged; NODE table; crawler; per-document version queue (push new version, pop and discard old version); demon registry; demons: search indexer, classifier service, clustering service, garbage collector.
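The workflow labels mention a per-document version queue, a demon registry, and a garbage collector, but not how they interact. One plausible reading, sketched here with hypothetical names, is that each new version is pushed onto its document's queue, every registered demon is notified, and a collector later discards all but the newest versions.

```python
from collections import defaultdict, deque

class DocumentWorkflow:
    def __init__(self):
        self.queues = defaultdict(deque)   # url -> versions, oldest first
        self.demons = {}                   # demon name -> callable(url, content)

    def register(self, name, handler):
        """Add a demon (e.g. an indexer or classifier) to the registry."""
        self.demons[name] = handler

    def push_version(self, url, content):
        """Queue a newly fetched version and notify every registered demon."""
        self.queues[url].append(content)
        for handler in self.demons.values():
            handler(url, content)

    def collect_garbage(self, keep=1):
        """Pop and discard old versions, keeping the newest `keep` per document."""
        dropped = 0
        for versions in self.queues.values():
            while len(versions) > keep:
                versions.popleft()
                dropped += 1
        return dropped

workflow = DocumentWorkflow()
indexed = []
workflow.register("search indexer", lambda url, content: indexed.append((url, content)))
workflow.push_version("http://example.org/page", "version 1")
workflow.push_version("http://example.org/page", "version 2")
dropped = workflow.collect_garbage()
```

A real server would presumably notify demons asynchronously and only discard a version once every demon has consumed it; this sketch notifies them inline for brevity.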
Autonomous topic organization
• Bookmarks often collected into topics
• Surfers use personal topic organization
• One-size-fits-all taxonomy inadequate
  • Many topics over-developed for most of us
    • http://dmoz.org/Sports/Hockey/Underwater_Hockey/
  • But deeper interests often underdeveloped
  • Structure reorganization also desirable
• Best taxonomy depends on community behavior as well as page content
Autonomy and collaboration
• Personalization ≠ picking Yahoo nodes
• Complex relations between topics
• Need “simplest common ground”
  • Coalesce similar topics where possible…
  • …without sacrificing individual taste
Figure (not reproduced): topic trees of User1, User2, User3 and Yahoo over topics such as Sports, Cycling, Hiking, Biz, Shops and Bikeshops, illustrating subsumption and tree ‘inversion’.
Taxonomy synthesis example
• Generating themes makes map simpler
• But distorts contents of original folders
• Joint optimization gives best themes
Figure (not reproduced): folders Media, Broadcasting, Entertainment and Studios containing kpfa.org, bbc.co.uk, kron.com, channel4.com, kcbs.com, foxmovies.com, miramax.com and lucasfilms.com; themes ‘Radio’, ‘Television’ and ‘Movies’; folders may share a document, a folder, or terms.
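The slide credits joint optimization for the best themes and gives no algorithm. The sketch below is a much simpler greedy stand-in, shown only to make the idea of coalescing folders concrete: folders from different users join the same theme when their URL sets overlap enough (Jaccard similarity). The user names, folder contents, and threshold are hypothetical.

```python
def jaccard(a, b):
    """Overlap between two URL sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def synthesize_themes(folders, threshold=0.3):
    """Greedily coalesce folders that share documents.

    folders maps (user, folder_name) to a set of URLs. Each theme keeps the
    union of its members' URLs. This is an assumed simplification, not the
    joint optimization mentioned on the slide.
    """
    themes = []
    for key, urls in folders.items():
        for theme in themes:
            if jaccard(theme["urls"], urls) >= threshold:
                theme["urls"] |= urls
                theme["members"].append(key)
                break
        else:
            themes.append({"urls": set(urls), "members": [key]})
    return themes

# Hypothetical assignment of the example sites to three users' folders.
folders = {
    ("user1", "Media"): {"kpfa.org", "bbc.co.uk", "kron.com"},
    ("user2", "Broadcasting"): {"bbc.co.uk", "kron.com", "channel4.com", "kcbs.com"},
    ("user3", "Studios"): {"foxmovies.com", "miramax.com", "lucasfilms.com"},
}
themes = synthesize_themes(folders)
```

Because a theme is the union of its members, it is simpler than the original folders but no longer matches any one user's folder exactly, which is the distortion the slide warns about.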
Summary and project status
• Collaborative resource discovery and topic management system
• Testbed for hypertext mining research
• Signed Java2 client
  • Netscape 4.5+ available
  • IE5+ planned
• Server for Unix and Windows
  • IBM UDB, Berkeley DB, servlets
  • Non-trivial to install and manage
  • Simple-to-use RPMs being planned
• http://www.cse.iitb.ernet.in/~soumen