
Using Memex to archive and mine community Web browsing experience

Soumen Chakrabarti, Sandeep Srivastava, Mallela Subramanyam, Mitul Tiwari. Indian Institute of Technology Bombay.

Presentation Transcript


  1. Using Memex to archive and mine community Web browsing experience
     Soumen Chakrabarti, Sandeep Srivastava, Mallela Subramanyam, Mitul Tiwari
     Indian Institute of Technology Bombay

  2. Information sources on the Web
     • Web page contents
       • Early keyword search engines
     • Hyperlink structure
       • Later engines: Google, Raging Search
     • Searching behavior
       • Search sites monitor clicks on search results
     • Browsing behavior
       • Easily captured in stand-alone hypermedia
       • Need software infrastructure for the Web

  3. Personal Memex
     • Archiving is feasible
       • ~25 GB in a lifetime
     • Why archive?
       • Recall past events
       • Create a ‘profile’
       • Correlate with sites, directories, searches
     • Challenges
       • Flexible architecture
       • Analysis techniques
     “Your husband died, but here is his Memex” (from Jim Gray’s Turing Award Lecture)

  4. Searching the personal Memex
     • Keyword search (never lose a page)
     • Advanced queries
       • Recreate my recent surfing history w.r.t. the topic ‘bicycling’
       • Extract from the MIT Web site all pages that match my ‘compiler research’ profile
     • Topic taxonomy plays a central role
       • Characterized by bookmark folders
       • More familiar than ‘universal’ directories

  5. Archiving architecture choices
     • Bookmarks only or all click history
     • Installed application or plug-in
       • Closer integration, e.g. with COM
     • CGI and JavaScript
       • Slow, hard to monitor all clicks
     • Applet-servlet
       • Portable, better UI compared to HTML
     • Proxy or wiretap
       • Proxy involves configuring browser

  6. Memex block diagram
     [Diagram. Labels: Browser; Memex server; running client applet; client JAR; event-handler servlets; Visit, Search, Attach, Folder, Download, Context; mining demons; taxonomy synthesis, resource discovery, recommendation, classification, clustering; archive; relational metadata, text index, topic models; Memex client-server protocol and workload sharing negotiations.]

  7. Document workflow
     [Diagram. Labels: Browser; Memex client; page visit and bookmarking events logged; NODE table; crawler; per-document version queue (push new version, pop and discard old version); Demon Registry; search indexer; classifier service; clustering service; garbage collector.]

  8. Folder tab
     • Valuable user input and feedback on topics and example documents
     • User cuts and pastes to correct or reinforce the Memex classifier
     • ‘?’ indicates automatic placement by Memex classifier
     • File manager-like interface
     • Privacy choice

  9. Context tab
     • Replay of recent browsing context restricted to chosen topic
     • Choice of topic context
     • Better mobility than one-dimensional history provided by popular browsers
     • Active browser monitoring and dynamic layout of new incremental context graph

  10. Search tab
      • Search using keyword and visit statistics
      • “Find the paper about collaborative filtering I was reading a month back”
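The example query on this slide combines a keyword match with a rough visit date. A minimal sketch of that combination follows, assuming a hypothetical in-memory `Visit` record and a hypothetical `search_visits` helper; neither name comes from Memex, and ranking by visit count is just one plausible reading of "visit statistics".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Visit:
    """One logged page visit (hypothetical record layout)."""
    url: str
    text: str              # page text captured at visit time
    visited_at: datetime


def search_visits(visits, keywords, around, window=timedelta(days=10)):
    """Return URLs whose text contains every keyword and that were
    visited within `window` of `around`, most-visited first."""
    wanted = [k.lower() for k in keywords]
    counts = {}
    for v in visits:
        in_window = abs(v.visited_at - around) <= window
        matches = all(k in v.text.lower() for k in wanted)
        if in_window and matches:
            counts[v.url] = counts.get(v.url, 0) + 1
    # Sort by descending visit count, then by URL for a stable order.
    return sorted(counts, key=lambda u: (-counts[u], u))
```

"A month back" then becomes `around=datetime.now() - timedelta(days=30)`, with `window` controlling how fuzzy the recollection is allowed to be.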

  11. Mining issues
      • Two relations
        • occurs_in(term, document)
        • bookmarked_into(document, folder)
        • (Ignore hyperlinks for now)
      • Document classification and clustering
        • Exploit ‘bookmarked_into’
      • Taxonomy synthesis
        • Reconcile folders from a community of users into coherent themes
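To make the two relations concrete, the sketch below joins `occurs_in` and `bookmarked_into` into per-folder term counts and uses them for a deliberately naive folder suggestion. The rows are made up for illustration, and `folder_term_counts` and `suggest_folder` are hypothetical helpers, not the Memex classifier.

```python
from collections import Counter, defaultdict

# Toy rows for the two relations; all values are invented for illustration.
occurs_in = [("bike", "d1"), ("gear", "d1"), ("bike", "d2"), ("radio", "d3")]
bookmarked_into = [("d1", "Cycling"), ("d2", "Cycling"), ("d3", "Media")]


def folder_term_counts(occurs_in, bookmarked_into):
    """Join the two relations on document to get term counts per folder."""
    terms_of = defaultdict(list)
    for term, doc in occurs_in:
        terms_of[doc].append(term)
    counts = defaultdict(Counter)
    for doc, folder in bookmarked_into:
        counts[folder].update(terms_of[doc])
    return counts


def suggest_folder(doc_terms, counts):
    """Naive stand-in for a classifier: pick the folder whose bookmarked
    documents share the most term occurrences with the new document."""
    def overlap(folder):
        return sum(counts[folder][t] for t in doc_terms)
    # Iterating folders in sorted order makes ties deterministic.
    return max(sorted(counts), key=overlap)
```

With the toy rows above, `suggest_folder(["bike"], folder_term_counts(occurs_in, bookmarked_into))` picks the folder whose documents mention "bike" most often.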

  12. Taxonomy synthesis: motivation
      • Autonomy vs collaboration
      • Personalization picking folders from Yahoo
      • Complex relations between users’ interests
      • Need the “simplest common ground”
      [Diagram. Folder trees for User1, User2, User3 and Yahoo, with labels Cycling, Sports, Biz, Sports, Sports, Shops, Hiking, Cycling, Bikeshops, Bikeshops; annotations: Subsumption, Tree ‘inversion’.]

  13. Taxonomy synthesis: intuition
      • Share documents
      • Share folder
      • Share terms
      [Diagram. Folders: Media, Broadcasting, Entertainment, Studios. Documents: kpfa.org, bbc.co.uk, kron.com, channel4.com, kcbs.com, foxmovies.com, lucasfilms.com, miramax.com.]

  14. Taxonomy synthesis: intuition
      [Diagram. Folders: Media, Broadcasting, Entertainment, Studios. Themes: Radio, TV, Movies. Documents: kpfa.org, bbc.co.uk, kron.com, channel4.com, kcbs.com, foxmovies.com, lucasfilms.com, miramax.com.]

  15. Trade-off
      • Using theme nodes can simplify graph
        • Shannon encoding of folder or theme ID
      • Increases distortion of term distribution
        • Kullback-Leibler (KL) distance of distorted folder w.r.t. ‘true’ folder
      • Compare cost in bits
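Both sides of this trade-off can be measured in bits, which is what makes them comparable. The sketch below is an illustration under stated assumptions rather than the authors' cost function: the Laplace smoothing, the helper names, and the way the two terms are added in `total_cost_bits` are all assumptions. It relies only on two standard facts: an ID with probability p costs -log2(p) bits under a Shannon code, and the KL distance in base 2 is the expected number of extra bits per term.

```python
import math


def distribution(counts, vocab, alpha=1.0):
    """Laplace-smoothed term distribution over a fixed vocabulary
    (the smoothing is an assumption; it keeps the KL distance finite)."""
    total = sum(counts.get(t, 0) for t in vocab) + alpha * len(vocab)
    return {t: (counts.get(t, 0) + alpha) / total for t in vocab}


def kl_bits(p, q):
    """KL distance of distribution p w.r.t. q, in bits."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p if p[t] > 0)


def code_length_bits(prob):
    """Shannon code length, in bits, of an ID that occurs with probability prob."""
    return -math.log2(prob)


def total_cost_bits(theme_prob, folder_dist, theme_dist, n_terms):
    """Assumed combination: bits to name the theme, plus the extra bits
    paid for describing n_terms folder terms with the theme's distribution."""
    return code_length_bits(theme_prob) + n_terms * kl_bits(folder_dist, theme_dist)
```

Mapping a folder to a theme that matches it exactly leaves only the ID cost. A coarser theme shared by many folders is cheaper to name but adds distortion, so comparing the two totals picks the better mapping.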

  16. Algorithm BestSingle
      • Pool all documents
      • Find bottom-up hierarchical clustering (HAC) using text only
      • Map each original folder to the one HAC node at the smallest KL distance
      • Low mapping cost, high distortion
      [Diagram. Folders Media, Entertainment, Broadcasting and Studios mapped onto a HAC tree built over the documents.]
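The three steps above can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the merge criterion (symmetrised KL between smoothed term distributions), the smoothing constant, and the function names `hac_nodes` and `best_single` are assumptions chosen to keep the example small.

```python
import math
from collections import Counter
from itertools import combinations


def dist(counts, vocab, alpha=0.5):
    """Smoothed term distribution over vocab (smoothing is an assumption)."""
    total = sum(counts[t] for t in vocab) + alpha * len(vocab)
    return {t: (counts[t] + alpha) / total for t in vocab}


def kl_bits(p, q):
    """KL distance of p w.r.t. q, in bits."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p)


def hac_nodes(doc_terms):
    """Bottom-up clustering on text only. Returns the term counts of every
    tree node (leaves and merges), keyed by the sorted tuple of its documents."""
    active = {(d,): Counter(terms) for d, terms in doc_terms.items()}
    nodes = dict(active)
    vocab = sorted({t for c in active.values() for t in c})

    def gap(pair):
        # Symmetrised KL between the two clusters' term distributions.
        pa, pb = dist(active[pair[0]], vocab), dist(active[pair[1]], vocab)
        return kl_bits(pa, pb) + kl_bits(pb, pa)

    while len(active) > 1:
        a, b = min(combinations(sorted(active), 2), key=gap)
        merged = active.pop(a) + active.pop(b)
        key = tuple(sorted(a + b))
        active[key] = merged
        nodes[key] = merged
    return nodes, vocab


def best_single(doc_terms, folders):
    """Map each folder to the single HAC node at the smallest KL distance."""
    nodes, vocab = hac_nodes(doc_terms)
    mapping = {}
    for folder, docs in folders.items():
        counts = sum((Counter(doc_terms[d]) for d in docs), Counter())
        p = dist(counts, vocab)
        mapping[folder] = min(sorted(nodes),
                              key=lambda n: kl_bits(p, dist(nodes[n], vocab)))
    return mapping
```

`doc_terms` maps a document ID to its list of terms and `folders` maps a folder name to its document IDs. Because every folder is forced onto exactly one tree node, the mapping is cheap to describe, but a folder that cuts across the text-only clusters will sit far from its node, which is the distortion the slide refers to.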

  17. PatchHAC and Bicriteria
      • PatchHAC:
        • Start with BestSingle
        • Greedily introduce additional mappings from folders to HAC nodes
      • Bicriteria:
        • Start with each document a theme
        • Collapse greedily while total code length decreases

  18. Conclusion
      • Recording history is feasible and useful
        • Few kilobytes per day per user
      • Bookmark taxonomies are a valuable source of information; can be…
        • Integrated into dynamic community-specific taxonomies
        • Used to drive discovery and collaboration
      • Memex can guide peer proxy caches
        • Cooperative caching between departments

  19. Software
      • Demo: www.cs.berkeley.edu/~soumen
      • Client: Signed Swing/JFC applet
        • Netscape 4.5+ (IE, HotJava planned)
      • Server: DB2 + Berkeley DB + Servlets
      • Infrastructure for plugging in research prototypes using the Demon API
        • Clustering, classification, visualization
        • Collaborative filtering and recommendation

  20. Related work
      • Archiving, searching, categorization
        • Vistabar (Alta Vista)
        • Bookmark organizer (IBM Haifa)
        • PowerBookmarks (NEC)
        • Purple Yogi
        • Netscape roaming access, Backflip
      • Mining
        • Attribute similarity via external probes
        • Non-linear dynamical systems
