This presentation discusses the concept of creating a semantically rich aggregate view of information on the web by transforming hyperlinked bags of words. It explores instances, instance representation, domain, usage studies, extraction techniques, application optimization, challenges, related work, and future developments.
A Web of Concepts • Dalvi et al. • Presented by Andrew Zitzelberger
Vision • Transform hyperlinked bags of words into a semantically rich aggregate view of information on the web.
Concept • Things of interest • Searching for information • Accomplishing a task • Reservations, etc.
Instances • Record of a concept • Restaurant • Gochi (19980 Homestead Rd Cupertino CA) • Academia? • Publications, research institutions
Instance Representation • Loosely-structured record (lrec) • Attribute-key, value pairs • Unique id field • Entity matching problem • Metadata • Attribute list
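A minimal sketch of how an lrec might be modeled, assuming attribute keys map to lists of values and metadata is a free-form dictionary. The class name `LRec`, its field names, the id format, and the `source` metadata key are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class LRec:
    """Loosely-structured record: a unique id plus attribute-key, value pairs."""
    id: str                      # unique id field (format here is hypothetical)
    concept: str                 # the concept this record is an instance of
    attributes: Dict[str, List[Any]] = field(default_factory=dict)
    metadata: Dict[str, Any] = field(default_factory=dict)

    def add(self, key: str, value: Any) -> None:
        # No fixed schema: any key may be added, and a key may hold several values.
        self.attributes.setdefault(key, []).append(value)

    def attribute_list(self) -> List[str]:
        # The attribute keys currently present on this record, in sorted order.
        return sorted(self.attributes)


gochi = LRec(id="restaurant:gochi", concept="restaurant")
gochi.add("name", "Gochi")
gochi.add("address", "19980 Homestead Rd Cupertino CA")
gochi.metadata["source"] = "hypothetical-example"
```

Keeping values in lists lets a record hold conflicting or repeated values for the same key, which is one way to defer the entity matching problem until later.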
Domain • Set of related concepts • Academic community domain = {publications, people, conferences}
Usage Study: Instance vs. Concept Search • yelp.com • Month of queries resulting in a click (restaurants) • 59% specific business URL • 19% search URL, either specific business or group • 11% specific group URL
Usage Study: Concept Attribute Search • Remove restaurant name and location information from query • Co-occurring words: • Menu (3%), coupons (1.8%), online, weekly specials, locations (1.5%) • Nutrition, to go, delivery, careers, cod
Usage Study: Aggregation Value • 59% clicked on at least one other URL • 35% clicked on at least two other URLs • Small manual evaluation indicates pages are often about the same business.
Usage Study: Concepts vs. Browsing • 42% of homepage visits are from search engine • Immediately following URL • 11.5% location • 9% menu • 1% coupons • 10.5% of user trails contain more than one distinct instance of the restaurant concept
Extraction • Create new records from the web • Information extraction • Linking • Analysis • Metadata tagging (cuisine type)
Domain-centric vs. Site-centric Extraction • Site-centric extraction • Wrappers for page structure • Probabilistic models (CRF) • Domain-centric extraction • Fields of interest • Statistical properties (single zip code, etc.) • Structure components (lists, link relationships)
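The "single zip code" statistical property can be sketched as a page-level check: a page that mentions exactly one distinct zip code is more likely to describe a single business than a listing. The sketch below assumes US-style addresses in which a two-letter state abbreviation precedes the zip code; the function names and sample strings are hypothetical.

```python
import re

# Require a state abbreviation before the zip code so that five-digit
# street numbers (e.g. "12345 Main St") are not counted as zip codes.
STATE_ZIP_RE = re.compile(r"\b[A-Z]{2}\s+(\d{5})(?:-\d{4})?\b")


def distinct_zip_codes(text):
    """Return the set of distinct five-digit zip codes found in the text."""
    return {match.group(1) for match in STATE_ZIP_RE.finditer(text)}


def looks_like_single_business_page(text):
    """Heuristic signal only: exactly one distinct zip code on the page."""
    return len(distinct_zip_codes(text)) == 1


single_page = "Example Bistro, 12345 Main St, Springfield, IL 62701"
listing_page = "Cafe A, Springfield, IL 62701 | Cafe B, Chicago, IL 60601"
```

This is one weak signal; a domain-centric extractor would presumably combine it with the other cues on the slide, such as lists and link relationships.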
Domain-centric Extraction • Aggregator mining • Learn from extracted knowledge (similar menus) • Matching • Text is “about” a record (restaurant review)
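One simple way to approximate "text is about a record" is token overlap between the text and each record's attribute values. This is a stand-in technique for illustration, not the matching method from the paper; the record ids, attribute names, review text, and the 0.5 threshold are all assumptions.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def about_score(text, record):
    """Average, over attributes, of the fraction of value tokens found in the text."""
    text_tokens = tokens(text)
    fractions = []
    for value in record.values():
        value_tokens = tokens(value)
        if value_tokens:
            fractions.append(len(value_tokens & text_tokens) / len(value_tokens))
    return sum(fractions) / len(fractions) if fractions else 0.0


def best_match(text, records, threshold=0.5):
    """Id of the highest-scoring record, or None if no score reaches the threshold."""
    best_id, best_score = None, 0.0
    for record_id, record in records.items():
        score = about_score(text, record)
        if score > best_score:
            best_id, best_score = record_id, score
    return best_id if best_score >= threshold else None


# Hypothetical records and review text.
records = {
    "r1": {"name": "Example Bistro", "city": "Springfield"},
    "r2": {"name": "Sample Sushi", "city": "Chicago"},
}
review = "Great dinner at Example Bistro in Springfield last night."
```

Here `best_match(review, records)` selects `"r1"`, since both of its attribute values appear in the review and none of the values for `"r2"` do.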
Application: Session Optimization • User understanding • Historical modeling • Session modeling • Content understanding • Example: Birks • Birks and Mayors (luxury jewelers) vs. Birk’s Steakhouse
Application: Browse Optimization • Alternatives: (Restaurants) • Similar type of cuisine • Similar location • Similar quality • Augmentations: (Camera) • Batteries • Memory cards
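The "alternatives" idea for restaurants can be sketched as ranking other records by how many of the three criteria (cuisine, location, quality) they share with the record being viewed. The field names `cuisine`, `city`, and `rating`, the rating-gap threshold, and the sample data are hypothetical; a real system would presumably use richer similarity measures.

```python
def alternatives(target, candidates, max_rating_gap=0.5):
    """Ids of candidates sharing at least one criterion with the target, best first."""
    scored = []
    for candidate in candidates:
        if candidate["id"] == target["id"]:
            continue  # a record is not an alternative to itself
        shared = 0
        if candidate["cuisine"] == target["cuisine"]:
            shared += 1  # similar type of cuisine
        if candidate["city"] == target["city"]:
            shared += 1  # similar location
        if abs(candidate["rating"] - target["rating"]) <= max_rating_gap:
            shared += 1  # similar quality
        if shared:
            scored.append((shared, candidate["id"]))
    # Most shared criteria first; ties broken by id for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [record_id for _, record_id in scored]


# Hypothetical restaurant records.
restaurants = [
    {"id": "a", "cuisine": "japanese", "city": "X", "rating": 4.5},
    {"id": "b", "cuisine": "japanese", "city": "X", "rating": 4.4},
    {"id": "c", "cuisine": "italian", "city": "X", "rating": 3.0},
    {"id": "d", "cuisine": "mexican", "city": "Y", "rating": 2.0},
]
suggested = alternatives(restaurants[0], restaurants)
```

Augmentations (batteries or memory cards for a camera) would need a different relation, such as an explicit "accessory of" link between records, rather than attribute similarity.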
Concept • Search Result Pages – show multiple records • Concept Pages – information about an instance • Article Pages – a piece of authored text
Advertising • Increase in targeted advertisements • Target concepts rather than keywords
Challenges • Transfer learning • Transfer extractor knowledge • Tracking uncertainty • Accuracy issues • “Web of concepts is not a one time affair” • Wrapper problems • Concept updates • Relevance Measures • User satisfaction
Related Work • Information Extraction/Integration Systems • Dataspace Systems • Semantic Web
Future Work • Enrich representation model • Path storage to data • Provenance, versions, uncertainty • Hierarchical relationships (containment or inheritance) • Ranking of disparate sources