Using Memex to archive and mine community Web browsing experience
Soumen Chakrabarti, Sandeep Srivastava, Mallela Subramanyam, Mitul Tiwari
Indian Institute of Technology Bombay

Information sources on the Web
• Web page contents
  • Early keyword search engines
• Hyperlink structure
  • Later engines: Google, Raging Search
• Searching behavior
  • Search sites monitor clicks on search results
• Browsing behavior
  • Easily captured in stand-alone hypermedia
  • Need software infrastructure for the Web

Personal Memex
• Archiving is feasible
  • ~25 GB in a lifetime
• Why archive?
  • Recall past events
  • Create a ‘profile’
  • Correlate with sites, directories, searches
• Challenges
  • Flexible architecture
  • Analysis techniques
“Your husband died, but here is his Memex” (from Jim Gray’s Turing Award Lecture)

Searching the personal Memex
• Keyword search (never lose a page)
• Advanced queries
  • Recreate my recent surfing history w.r.t. the topic ‘bicycling’
  • Extract from the MIT Web site all pages that match my ‘compiler research’ profile
• Topic taxonomy plays a central role
  • Characterized by bookmark folders
  • More familiar than ‘universal’ directories

Archiving architecture choices
• Bookmarks only or all click history
• Installed application or plug-in
  • Closer integration, e.g. with COM
• CGI and JavaScript
  • Slow, hard to monitor all clicks
• Applet-servlet
  • Portable, better UI compared to HTML
• Proxy or wiretap
  • Proxy involves configuring browser

Memex block diagram
[Figure: block diagram of the browser and the Memex server, connected by the Memex client-server protocol and workload sharing negotiations. Labels: running client applet (client JAR); Visit, Search, Folder, Context, Attach, Download, Recommendation; event-handler servlets; mining demons (taxonomy synthesis, resource discovery, classification, clustering); archive (relational metadata, text index, topic models).]

Document workflow
[Figure: workflow diagram. Components: browser, Memex client, NODE table, crawler, per-document version queue, demon registry, search indexer, classifier service, clustering service, garbage collector. Annotations: page visit and bookmarking events logged; push new version; pop and discard old version.]

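One possible reading of the version queue in the workflow above, as a minimal Python sketch. The slide only names the components; the class and method names (`VersionQueue`, `push`, `mark_done`, `collect_garbage`) and the rule that an old version may be discarded once every registered demon has processed it are assumptions made for illustration.

```python
from collections import deque


class VersionQueue:
    """Per-document queue of fetched versions (hypothetical structure)."""

    def __init__(self, demons):
        # Names of the registered demons, e.g. indexer and classifier.
        self.demons = set(demons)
        # Each entry is [version_id, text, demons_still_pending].
        self.versions = deque()
        self.next_id = 0

    def push(self, text):
        """Crawler side: append a newly fetched version of the document."""
        version_id = self.next_id
        self.versions.append([version_id, text, set(self.demons)])
        self.next_id += 1
        return version_id

    def mark_done(self, demon, version_id):
        """Demon side: record that `demon` has processed a version."""
        for entry in self.versions:
            if entry[0] == version_id:
                entry[2].discard(demon)

    def collect_garbage(self):
        """Pop old versions that no demon still needs; keep the newest one."""
        discarded = []
        while len(self.versions) > 1 and not self.versions[0][2]:
            discarded.append(self.versions.popleft()[0])
        return discarded


queue = VersionQueue(["search indexer", "classifier"])
old = queue.push("first crawl of the page")
new = queue.push("second crawl of the page")
queue.mark_done("search indexer", old)
queue.mark_done("classifier", old)
discarded_ids = queue.collect_garbage()  # the old version is dropped
```

Keeping the newest version even after all demons have seen it is a design choice of this sketch, so that the archive always holds one copy of each document.
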
Folder tab
• Valuable user input and feedback on topics and example documents
[Screenshot annotations: file-manager-like interface; ‘?’ indicates automatic placement by Memex classifier; user cuts and pastes to correct or reinforce the Memex classifier; privacy choice.]

Context tab
• Replay of recent browsing context restricted to chosen topic
• Better mobility than one-dimensional history provided by popular browsers
[Screenshot annotations: choice of topic context; active browser monitoring and dynamic layout of new incremental context graph.]

Search tab
• Search using keyword and visit statistics
• “Find the paper about collaborative filtering I was reading a month back”

Mining issues
• Two relations
  • occurs_in(term, document)
  • bookmarked_into(document, folder)
  • (Ignore hyperlinks for now)
• Document classification and clustering
  • Exploit ‘bookmarked_into’
• Taxonomy synthesis
  • Reconcile folders from a community of users into coherent themes

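The two relations above can be joined to obtain per-folder term counts, which is the raw material a folder classifier would train on. The following is a small Python sketch under assumed data: the tuples, the document IDs, and the helper name `folder_term_counts` are hypothetical, and a repeated `(term, document)` tuple is taken to mean a repeated occurrence.

```python
from collections import Counter, defaultdict

# occurs_in(term, document): toy tuples, one per term occurrence.
occurs_in = [
    ("bike", "d1"), ("gear", "d1"),
    ("bike", "d2"),
    ("film", "d3"),
]

# bookmarked_into(document, folder): toy tuples.
bookmarked_into = [
    ("d1", "Cycling"),
    ("d2", "Cycling"),
    ("d3", "Movies"),
]


def folder_term_counts(occurs_in, bookmarked_into):
    """Join the two relations on document and count terms per folder."""
    terms_of = defaultdict(list)
    for term, doc in occurs_in:
        terms_of[doc].append(term)
    counts = defaultdict(Counter)
    for doc, folder in bookmarked_into:
        counts[folder].update(terms_of[doc])
    return counts


counts = folder_term_counts(occurs_in, bookmarked_into)
# counts["Cycling"] holds bike: 2, gear: 1
```
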
Taxonomy synthesis: motivation
• Autonomy vs collaboration
• Personalization ≠ picking folders from Yahoo
• Complex relations between users’ interests
• Need the “simplest common ground”
[Figure: folder trees of User1, User2, User3, and Yahoo over topics such as Sports, Cycling, Hiking, Biz, Shops, and Bikeshops, illustrating subsumption and tree ‘inversion’.]

Taxonomy synthesis: intuition
[Figure: graph linking folders (Media, Broadcasting, Entertainment, Studios) to documents (kpfa.org, bbc.co.uk, kron.com, channel4.com, kcbs.com, foxmovies.com, lucasfilms.com, miramax.com). Annotations: share documents, share folder, share terms.]

Taxonomy synthesis: intuition
[Figure: the same folders (Media, Broadcasting, Entertainment, Studios) and documents (kpfa.org, bbc.co.uk, kron.com, channel4.com, kcbs.com, foxmovies.com, lucasfilms.com, miramax.com), now connected through theme nodes (Radio, TV, Movies).]

Trade-off
• Using theme nodes can simplify graph
  • Shannon encoding of folder or theme ID
• Increases distortion of term distribution
  • Kullback-Leibler (KL) distance of distorted folder w.r.t. ‘true’ folder
• Compare cost in bits

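Both sides of the trade-off above are measured in bits, which is what makes them comparable. The sketch below shows the two standard quantities involved: the Shannon code length of an ID with a given probability, and the KL distance between two term distributions. The function names and the toy distributions are assumptions for illustration; how the slides combine these quantities into a total cost is not specified here.

```python
import math


def code_length_bits(prob):
    """Shannon code length, in bits, of an ID that occurs with probability prob."""
    return -math.log2(prob)


def kl_bits(p, q):
    """KL(p || q) in bits. Assumes q is nonzero wherever p is nonzero."""
    return sum(pv * math.log2(pv / q[t]) for t, pv in p.items() if pv > 0)


# Toy term distributions: the 'true' folder and a distorted (merged) version.
true_folder = {"radio": 0.5, "news": 0.5}
distorted = {"radio": 0.25, "news": 0.75}

id_cost = code_length_bits(0.25)              # 2.0 bits for a 1-in-4 theme ID
distortion = kl_bits(true_folder, distorted)  # about 0.2075 bits per term
no_distortion = kl_bits(true_folder, true_folder)  # 0.0
```
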
Algorithm BestSingle
• Pool all documents
• Find bottom-up hierarchical clustering (HAC) using text only
• Map each original folder to the one HAC node at the smallest KL distance
• Low mapping cost, high distortion
[Figure: folders (Media, Entertainment, Broadcasting, Studios) mapped onto nodes of the HAC tree built over the documents.]

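The BestSingle steps above can be sketched in a few lines of Python. This is a toy illustration, not the authors' implementation: the cosine-similarity merge rule, the additive smoothing constant `eps`, the direction of the KL distance, and the example documents and folders are all assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def hac_nodes(docs):
    """Bottom-up clustering; returns every tree node as (doc_ids, pooled_counts)."""
    active = [(frozenset([d]), Counter(c)) for d, c in docs.items()]
    nodes = list(active)
    while len(active) > 1:
        # Merge the most similar pair of active clusters (assumed linkage).
        i, j = max(combinations(range(len(active)), 2),
                   key=lambda ij: cosine(active[ij[0]][1], active[ij[1]][1]))
        merged = (active[i][0] | active[j][0], active[i][1] + active[j][1])
        active = [n for k, n in enumerate(active) if k not in (i, j)] + [merged]
        nodes.append(merged)
    return nodes


def smoothed(counts, vocab, eps=0.01):
    """Term distribution with additive smoothing so KL stays finite."""
    total = sum(counts.values()) + eps * len(vocab)
    return {t: (counts.get(t, 0) + eps) / total for t in vocab}


def kl_bits(p, q):
    return sum(pv * math.log2(pv / q[t]) for t, pv in p.items())


def best_single(docs, folders):
    """Map each folder to the single HAC node at the smallest KL distance."""
    vocab = sorted({t for c in docs.values() for t in c})
    nodes = hac_nodes(docs)
    mapping = {}
    for name, members in folders.items():
        pooled = sum((Counter(docs[d]) for d in members), Counter())
        p = smoothed(pooled, vocab)
        best = min(nodes, key=lambda n: kl_bits(p, smoothed(n[1], vocab)))
        mapping[name] = best[0]
    return mapping


docs = {
    "d1": {"radio": 2, "news": 1}, "d2": {"radio": 1, "news": 2},
    "d3": {"film": 2, "studio": 1}, "d4": {"film": 1, "studio": 2},
}
folders = {"Broadcasting": ["d1", "d2"], "Studios": ["d3", "d4"],
           "Media": ["d1", "d2", "d3", "d4"]}
mapping = best_single(docs, folders)
```

In this toy data each folder coincides exactly with one tree node, so the distortion is zero; with real folders the single-node restriction is what produces the high distortion noted on the slide.
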
PatchHAC and Bicriteria
• PatchHAC:
  • Start with BestSingle
  • Greedily introduce additional mappings from folders to HAC nodes
• Bicriteria:
  • Start with each document a theme
  • Collapse greedily while total code length decreases

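The Bicriteria loop above (start with one theme per document, merge greedily while the total code length drops) can be sketched as follows. The cost function `total_bits` is a toy stand-in chosen for this illustration, not the formula from the talk: it charges log2(number of themes) bits per folder-to-theme link, plus the KL distortion of each folder weighted by its term count. All names and data are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def smoothed(counts, vocab, eps=0.01):
    total = sum(counts.values()) + eps * len(vocab)
    return {t: (counts.get(t, 0) + eps) / total for t in vocab}


def kl_bits(p, q):
    return sum(pv * math.log2(pv / q[t]) for t, pv in p.items())


def total_bits(themes, docs, folders, vocab):
    """Toy two-part cost: mapping bits plus distortion bits (an assumption)."""
    id_bits = math.log2(len(themes)) if len(themes) > 1 else 0.0
    bits = 0.0
    for members in folders.values():
        # Themes needed to cover this folder's documents.
        used = [t for t in themes if t & set(members)]
        bits += len(used) * id_bits
        exact = sum((Counter(docs[d]) for d in members), Counter())
        approx = sum((Counter(docs[d]) for t in used for d in t), Counter())
        bits += sum(exact.values()) * kl_bits(smoothed(exact, vocab),
                                              smoothed(approx, vocab))
    return bits


def bicriteria(docs, folders):
    vocab = sorted({t for c in docs.values() for t in c})
    themes = [frozenset([d]) for d in docs]  # each document starts as a theme
    cost = total_bits(themes, docs, folders, vocab)
    while len(themes) > 1:
        best = None
        for a, b in combinations(themes, 2):
            trial = [t for t in themes if t not in (a, b)] + [a | b]
            c = total_bits(trial, docs, folders, vocab)
            if best is None or c < best[0]:
                best = (c, trial)
        if best[0] >= cost:  # no merge lowers the total cost: stop
            break
        cost, themes = best
    return themes, cost


docs = {
    "d1": {"radio": 2, "news": 1}, "d2": {"radio": 1, "news": 2},
    "d3": {"film": 2, "studio": 1}, "d4": {"film": 1, "studio": 2},
}
folders = {"Broadcasting": ["d1", "d2"], "Studios": ["d3", "d4"]}
themes, cost = bicriteria(docs, folders)
```

On this toy input the loop stops at two themes, one per folder: merging them further would save the remaining ID bits but add more distortion than it saves.
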
Conclusion
• Recording history is feasible and useful
  • Few kilobytes per day per user
• Bookmark taxonomies are a valuable source of information; can be…
  • Integrated into dynamic community-specific taxonomies
  • Used to drive discovery and collaboration
• Memex can guide peer proxy caches
  • Cooperative caching between departments

Software
• Demo: www.cs.berkeley.edu/~soumen
• Client: Signed Swing/JFC applet
  • Netscape 4.5+ (IE, HotJava planned)
• Server: DB2 + Berkeley DB + Servlets
• Infrastructure for plugging in research prototypes using the Demon API
  • Clustering, classification, visualization
  • Collaborative filtering and recommendation

Related work
• Archiving, searching, categorization
  • Vistabar (Alta Vista)
  • Bookmark organizer (IBM Haifa)
  • PowerBookmarks (NEC)
  • Purple Yogi
  • Netscape roaming access, Backflip
• Mining
  • Attribute similarity via external probes
  • Non-linear dynamical systems