Using Memex to archive and mine community Web browsing experience

Soumen Chakrabarti, Sandeep Srivastava, Mallela Subramanyam, Mitul Tiwari • Indian Institute of Technology Bombay

Information sources on the Web • Web page contents • Early keyword search engines • Hyperlink structure • Later engines: Google, Raging Search • Searching behavior • Search sites monitor clicks on search results • Browsing behavior • Easily captured in stand-alone hypermedia • Need software infrastructure for the Web

Personal Memex • Archiving is feasible • ~25 GB in a lifetime • Why archive? • Recall past events • Create a ‘profile’ • Correlate with sites, directories, searches • Challenges • Flexible architecture • Analysis techniques • “Your husband died, but here is his Memex” (from Jim Gray’s Turing Award Lecture)

Searching the personal Memex • Keyword search (never lose a page) • Advanced queries • Recreate my recent surfing history w.r.t. the topic ‘bicycling’ • Extract from the MIT Web site all pages that match my ‘compiler research’ profile • Topic taxonomy plays a central role • Characterized by bookmark folders • More familiar than ‘universal’ directories

Archiving architecture choices • Bookmarks only or all click history • Installed application or plug-in • Closer integration, e.g. with COM • CGI and Javascript • Slow, hard to monitor all clicks • Applet-servlet • Portable, better UI compared to HTML • Proxy or wiretap • Proxy involves configuring browser

Memex block diagram [Figure: block diagram. Browser side: running client applet (client JAR). Server side: event-handler servlets, mining demons, and an archive holding relational metadata, a text index, and topic models. Other labels: Visit, Folder, Context, Search, Attach, Download, Recommendation, Classification, Clustering, Taxonomy synthesis, Resource discovery. The two sides are linked by the Memex client-server protocol and workload sharing negotiations.]

Document workflow [Figure: workflow diagram. Labels: Browser; Memex client; page visit and bookmarking events logged; NODE table; Crawler; per-document version queue (push new version, pop and discard old version); Demon Registry; Search indexer; Classifier service; Clustering service; Garbage collector.]
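The per-document version queue in this workflow can be illustrated with a minimal sketch. It assumes that new versions are pushed onto a queue, that each registered demon processes versions in order, and that the garbage collector discards a version only after every demon has seen it. These semantics, and every name in the code (`VersionQueue`, `process`, `collect_garbage`), are hypothetical and not taken from the Memex implementation.

```python
from collections import deque


class VersionQueue:
    """Per-document queue of versions (hypothetical sketch, not the Memex code)."""

    def __init__(self, demons):
        self.versions = deque()      # (version number, content), oldest first
        self.next_version = 0
        # Highest version number each registered demon has processed so far.
        self.progress = {name: -1 for name in demons}

    def push(self, content):
        """Push a new version, e.g. after the crawler fetches the page again."""
        self.versions.append((self.next_version, content))
        self.next_version += 1

    def process(self, demon, handler):
        """Let one demon catch up on every version it has not yet seen."""
        for number, content in self.versions:
            if number > self.progress[demon]:
                handler(content)
                self.progress[demon] = number

    def collect_garbage(self):
        """Pop old versions that all demons have processed; keep the newest one."""
        done = min(self.progress.values())
        discarded = 0
        while len(self.versions) > 1 and self.versions[0][0] <= done:
            self.versions.popleft()
            discarded += 1
        return discarded
```

Under these assumptions, a slow demon (for example, a clustering service) holds old versions in the queue until it catches up, which is one plausible reason for keeping a queue per document instead of a single current copy.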

Folder tab • Valuable user input and feedback on topics and example documents • File manager-like interface • User cuts and pastes to correct or reinforce the Memex classifier • ‘?’ indicates automatic placement by Memex classifier • Privacy choice

Context tab • Choice of topic context • Replay of recent browsing context restricted to chosen topic • Better mobility than the one-dimensional history provided by popular browsers • Active browser monitoring and dynamic layout of new incremental context graph

Search tab • Search using keyword and visit statistics • “Find the paper about collaborative filtering I was reading a month back”
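A query such as the one quoted above combines a keyword match with a remembered visit time. The following is a minimal sketch of that idea, assuming each archived visit is a record with `url`, `text`, and `visited` fields; the field names, the `search` function, and the ranking by distance from the remembered date are all illustrative assumptions, not the Memex search interface.

```python
from datetime import datetime, timedelta


def search(visits, keywords, now, around_days_ago, window_days=10):
    """Return URLs of pages that contain all keywords and were visited
    within window_days of the remembered time, closest in time first.

    visits: list of dicts with hypothetical keys 'url', 'text', 'visited'.
    """
    target = now - timedelta(days=around_days_ago)
    results = []
    for visit in visits:
        words = set(visit["text"].lower().split())
        if not all(k.lower() in words for k in keywords):
            continue
        offset = abs(visit["visited"] - target).days
        if offset <= window_days:
            results.append((offset, visit["url"]))
    return [url for offset, url in sorted(results)]
```

A call like `search(visits, ["collaborative", "filtering"], now, around_days_ago=30)` would then approximate "a month back"; the window width is a free parameter chosen here only for illustration.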

Mining issues • Two relations • occurs_in(term, document) • bookmarked_into(document, folder) • (Ignore hyperlinks for now) • Document classification and clustering • Exploit ‘bookmarked_into’ • Taxonomy synthesis • Reconcile folders from a community of users into coherent themes
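To make the two relations concrete, the sketch below stores them as lists of pairs and uses `bookmarked_into` as training labels for a text classifier. The classifier is a multinomial naive Bayes with add-one smoothing, chosen here only as a simple stand-in; the slides do not say which classifier Memex uses. The sketch also assumes each document is bookmarked into at most one folder, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(occurs_in, bookmarked_into):
    """occurs_in: list of (term, document) pairs.
    bookmarked_into: list of (document, folder) pairs."""
    folder_of = dict(bookmarked_into)        # assumes one folder per document
    term_counts = defaultdict(Counter)       # folder -> term -> count
    for term, doc in occurs_in:
        if doc in folder_of:
            term_counts[folder_of[doc]][term] += 1
    doc_counts = Counter(folder_of.values()) # folder -> number of documents
    vocabulary = {term for term, _ in occurs_in}
    return term_counts, doc_counts, vocabulary


def classify(terms, model):
    """Return the folder with the highest smoothed log-probability."""
    term_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_folder, best_score = None, -math.inf
    for folder, n_docs in doc_counts.items():
        counts = term_counts[folder]
        total = sum(counts.values())
        score = math.log(n_docs / total_docs)
        for t in terms:
            # Add-one smoothing keeps unseen terms from zeroing the score.
            score += math.log((counts[t] + 1) / (total + len(vocabulary)))
        if score > best_score:
            best_folder, best_score = folder, score
    return best_folder
```

Pages that were visited but never bookmarked have no entry in `bookmarked_into`; they are the ones a classifier like this would place automatically.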

Taxonomy synthesis: motivation • Autonomy vs collaboration • Personalization picking folders from Yahoo • Complex relations between users’ interests • Need the “simplest common ground” [Figure: folder trees of User1, User2, User3, and Yahoo, with folders such as Cycling, Sports, Biz, Shops, Hiking, and Bikeshops, illustrating subsumption and tree ‘inversion’.]

Taxonomy synthesis: intuition [Figure: folders (Media, Broadcasting, Entertainment, Studios) linked to documents (kpfa.org, bbc.co.uk, kron.com, channel4.com, kcbs.com, foxmovies.com, lucasfilms.com, miramax.com). Annotations: share documents, share folder, share terms.]

Taxonomy synthesis: intuition [Figure: the same folders (Media, Broadcasting, Entertainment, Studios) and documents (kpfa.org, bbc.co.uk, kron.com, channel4.com, kcbs.com, foxmovies.com, lucasfilms.com, miramax.com), now connected through theme nodes: Radio, TV, Movies.]

Trade-off • Using theme nodes can simplify graph • Shannon encoding of folder or theme ID • Increases distortion of term distribution • Kullback-Leibler (KL) distance of distorted folder w.r.t. ‘true’ folder • Compare cost in bits
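Both sides of this trade-off can be measured in bits. The sketch below shows the two standard quantities: the Shannon code length of an ID used with a given probability, and the KL distance between two term distributions using base-2 logarithms. How the slides combine the two into one total cost is not spelled out, so the sketch stops at computing each quantity; the helper names and the add-one smoothing are assumptions.

```python
import math
from collections import Counter


def term_distribution(terms, vocabulary, smoothing=1.0):
    """Smoothed term distribution over a fixed vocabulary."""
    counts = Counter(terms)
    total = len(terms) + smoothing * len(vocabulary)
    return {t: (counts[t] + smoothing) / total for t in vocabulary}


def kl_bits(p, q):
    """KL distance D(p || q) in bits; q must be nonzero wherever p is."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p if p[t] > 0)


def id_cost_bits(probability):
    """Shannon code length, in bits, of an ID used with this probability."""
    return -math.log2(probability)
```

Here `p` plays the role of the ‘true’ folder's term distribution and `q` the distorted one obtained by representing the folder through a theme node; merging folders into fewer themes shortens the IDs but tends to raise `kl_bits(p, q)`.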

Algorithm BestSingle • Pool all documents • Find bottom-up hierarchical clustering (HAC) using text only • Map each original folder to the one HAC node at the smallest KL distance • Low mapping cost, high distortion [Figure: folders Media, Entertainment, Broadcasting, and Studios mapped onto nodes of a HAC tree built over the documents.]
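The three BestSingle steps can be sketched as follows. The sketch makes several assumptions that the slide does not state: clusters are merged by highest cosine similarity of raw term counts, term distributions use add-one smoothing, and the KL distance is taken from the folder's distribution to the node's. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def hac_nodes(docs):
    """Bottom-up clustering on text only; returns the document set of every
    node (leaves and internal nodes). docs: document ID -> list of terms."""
    active = {frozenset([d]): Counter(terms) for d, terms in docs.items()}
    nodes = list(active)
    while len(active) > 1:
        # Merge the two most similar active clusters.
        a, b = max(combinations(active, 2),
                   key=lambda pair: cosine(active[pair[0]], active[pair[1]]))
        merged = a | b
        active[merged] = active.pop(a) + active.pop(b)
        nodes.append(merged)
    return nodes


def distribution(doc_ids, docs, vocabulary):
    """Add-one smoothed term distribution of a set of documents."""
    counts = Counter(t for d in doc_ids for t in docs[d])
    total = sum(counts.values()) + len(vocabulary)
    return {t: (counts[t] + 1) / total for t in vocabulary}


def kl_bits(p, q):
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p)


def best_single(docs, folders):
    """Map each folder to the single HAC node at the smallest KL distance.
    folders: folder name -> set of document IDs bookmarked into it."""
    vocabulary = {t for terms in docs.values() for t in terms}
    nodes = hac_nodes(docs)
    mapping = {}
    for name, members in folders.items():
        p = distribution(members, docs, vocabulary)
        mapping[name] = min(
            nodes, key=lambda n: kl_bits(p, distribution(n, docs, vocabulary)))
    return mapping
```

Because every folder is forced onto exactly one node, the mapping is cheap to describe, but a folder whose documents are scattered across the tree ends up represented by a poorly matching node, which is the high distortion the slide refers to.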

PatchHAC and Bicriteria • PatchHAC: • Start with BestSingle • Greedily introduce additional mappings from folders to HAC nodes • Bicriteria: • Start with each document a theme • Collapse greedily while total code length decreases

Conclusion • Recording history is feasible and useful • Few kilobytes per day per user • Bookmark taxonomies are a valuable source of information; can be… • Integrated into dynamic community-specific taxonomies • Used to drive discovery and collaboration • Memex can guide peer proxy caches • Cooperative caching between departments

Software • Demo: www.cs.berkeley.edu/~soumen • Client: Signed Swing/JFC applet • Netscape 4.5+ (IE, HotJava planned) • Server: DB2 + Berkeley DB + Servlets • Infrastructure for plugging in research prototypes using the Demon API • Clustering, classification, visualization • Collaborative filtering and recommendation

Related work • Archiving, searching, categorization • Vistabar (Alta Vista) • Bookmark organizer (IBM Haifa) • PowerBookmarks (NEC) • Purple Yogi • Netscape roaming access, Backflip • Mining • Attribute similarity via external probes • Non-linear dynamical systems
