Data-Specific Web Search

By: Aditya Mantri

Data on the web • The web is no longer only a means of sharing data and information. • It now comprises social networking sites, wikis, and blogs that facilitate creativity, collaboration, and sharing among users. • Steady rise in the amount and significance of non-traditional data. • For example, a simple blog is plain textual commentary, but it is now not uncommon to see photoblogs, sketchblogs, vlogs, MP3 blogs, and podcasts.

Data on the web • Beyond textual data, the web now contains, and can be searched for: • Multimedia (images, audio, video, animations, etc.) • Blogs • News • Scientific/Research papers • Source Code • Jobs • Travel • Health • Classifieds

Multimedia (MM) • Fundamental IR and search techniques cannot be applied directly to MM: • Multiple modalities: text, audio, still images, and video. • A query in the form of words cannot be matched directly against the raw multimedia file. • Size of MM data • Methods of storage and indexing need to be efficient. • Therefore, structural issues such as storage and networking, as well as intelligent content analysis, need to be addressed.

Multimedia (MM) • Basic challenge: understanding the user’s query. • It is important to process raw multimedia and convert it into high-level semantics.

Multimedia (MM) • Text-based search: ‘query by word’ • Metadata (filename, captions, tags) • Cues from text and HTML source code • Cues from image content (color, image size, file type, etc.) • Content-based search: used when text annotations are nonexistent or incomplete; ‘query by similarity’ or ‘query by example’ • Image (shape, color, texture) • Video (object motion, spatio-temporal relations) • Audio (humming for music, sampling rate, pitch, brightness, bandwidth) • Relevance feedback: queries are entered using either of the above methods, and the results returned are used to refine the user's query.
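As a rough illustration of 'query by example' on the color feature listed above, the sketch below ranks a small collection by histogram intersection against an example image. It is only a minimal sketch under stated assumptions: images are represented as plain lists of (r, g, b) tuples, and the function names, the bin count, and the toy collection are all hypothetical rather than taken from the presentation.

```python
from collections import Counter

def color_histogram(pixels, bins=4):
    """Normalized coarse RGB histogram; pixels is a non-empty list of (r, g, b) in 0..255."""
    step = 256 // bins
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {cell: n / total for cell, n in counts.items()}

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: the histogram mass the two images share."""
    return sum(min(v, h2.get(cell, 0.0)) for cell, v in h1.items())

def query_by_example(example_pixels, collection, bins=4):
    """Return (score, name) pairs, most similar first."""
    q = color_histogram(example_pixels, bins)
    scored = [
        (histogram_intersection(q, color_histogram(pixels, bins)), name)
        for name, pixels in collection.items()
    ]
    return sorted(scored, reverse=True)

# Hypothetical toy collection: each "image" is just a short list of pixels.
collection = {
    "sunset": [(250, 120, 30)] * 8 + [(40, 20, 60)] * 2,
    "forest": [(20, 140, 40)] * 9 + [(90, 60, 30)] * 1,
    "beach": [(240, 200, 120)] * 6 + [(40, 120, 220)] * 4,
}
example = [(245, 110, 20)] * 7 + [(30, 30, 70)] * 3
ranking = query_by_example(example, collection)
print(ranking[0][1])  # name of the most similar image
```

A real system would combine several features (shape, texture) and index the histograms instead of scanning every image per query.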

Multimedia (Research) • Closely related to research in the MM IR field. • Paradigm shift in content analysis: domain-specific knowledge is used to bridge the semantic gap between features and semantic concepts, which makes solutions very domain specific. • Content mining and knowledge discovery • Automated content-based image and video annotation • Better indexing techniques

Blogs • “A website or page that is the product of (generally) an individual or of non-commercial origin that uses a date-limited or diary format, and which is updated either daily or at least regularly with new information about a subject, range of subjects, or personal details.” • Key differences to note: • Temporal information • Connected, community-based collections (the blogosphere) • Personal nature (adds to the subjectivity) • Varying structure

Blogs • Search strategy • Use basic keyword search. • Then use clustering to reduce the number of results returned. • Optionally, use interconnections between blogs to follow a conversation and gauge the importance of a topic in the blog. • Research • Temporal mining • Extraction of opinions from blogs • Domain-specific weblogs, using machine learning techniques and probabilistic models.
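The first two steps of the search strategy above (keyword search, then clustering to shrink the result list) can be sketched as follows. This is a minimal sketch under assumptions: the greedy Jaccard-based clustering, the 0.3 threshold, the function names, and the sample posts are illustrative choices, not the method used by any particular blog search engine.

```python
import re

def tokens(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_search(posts, query):
    """Ids of posts sharing at least one token with the query."""
    q = tokens(query)
    return [pid for pid, text in posts.items() if q & tokens(text)]

def jaccard(a, b):
    """Token-set overlap in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(posts, ids, threshold=0.3):
    """Greedy clustering: join the first cluster whose representative is similar enough."""
    clusters = []  # each cluster is a list of ids; the first id is its representative
    for pid in ids:
        t = tokens(posts[pid])
        for c in clusters:
            if jaccard(t, tokens(posts[c[0]])) >= threshold:
                c.append(pid)
                break
        else:
            clusters.append([pid])
    return clusters

# Hypothetical sample posts.
posts = {
    "a": "new camera review sharp lens great battery",
    "b": "camera review lens sharp battery average",
    "c": "travel diary lost my camera in rome",
    "d": "cooking pasta tonight",
}
hits = keyword_search(posts, "camera")
groups = cluster(posts, hits)
print(groups)
```

Here the two camera reviews fall into one group and the travel post into another, so the user would see two entries instead of three hits.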

News • News search engines, or news aggregators, compile syndicated web content such as news articles from various reliable sources. • Search engines differ in the way they … • … crawl or index the news articles (some aggregators just scrape headlines) • … calculate the relevance of a story based on credibility • … involve human intervention, if any

News • Research: • Address the sheer volume of content an aggregator returns to the user for a keyword; clustering mechanisms have been proposed. • Select credible articles by automatically filtering out incorrect ones. • A good metric is ‘commonality’. • Other metrics such as ‘bias’ and ‘objectivity’ have also been proposed.
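The slide does not define 'commonality', so the sketch below shows only one plausible reading, stated here as an assumption: an article's commonality is its average token overlap with reports of the same story from other sources, so an article whose content few other sources share scores low. The function names and the sample reports are hypothetical.

```python
import re

def tokens(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Token-set overlap in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def commonality(article, others):
    """Assumed definition: mean Jaccard overlap with the other sources' reports."""
    if not others:
        return 0.0
    t = tokens(article)
    return sum(jaccard(t, tokens(o)) for o in others) / len(others)

# Hypothetical reports of the same event from three sources.
reports = {
    "source_a": "bridge closed after storm damage on monday",
    "source_b": "storm damage closed the bridge on monday",
    "source_c": "bridge closed after alien landing on monday",
}
scores = {
    name: commonality(text, [o for other, o in reports.items() if other != name])
    for name, text in reports.items()
}
least_common = min(scores, key=scores.get)
print(least_common)
```

A low score only flags an article as an outlier for further checking; it does not by itself show the article is wrong.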

Common Issues and trends • Query formulation • Methods such as using domain semantics, represented as ontologies, to specify and formulate queries • Improving relevance feedback • Improvements to the basic search infrastructure: crawling and indexing. • Improving and utilizing general web search trends such as personalization and better UI design. • Improvements to link analysis.
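Relevance feedback, listed above as an area for improvement, is classically implemented with the Rocchio update, which moves the query vector toward documents the user marked relevant and away from those marked non-relevant. The sketch below uses that standard technique as an illustration, not as the method of any system discussed here; the weights (alpha, beta, gamma) are common textbook defaults, and the term-weight dictionaries are hypothetical.

```python
from collections import Counter

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio update over sparse term-weight dicts; negative weights are dropped."""
    new_q = Counter()
    for term, w in query.items():
        new_q[term] += alpha * w
    for doc in relevant:
        for term, w in doc.items():
            new_q[term] += beta * w / len(relevant)  # add the relevant centroid
    for doc in nonrelevant:
        for term, w in doc.items():
            new_q[term] -= gamma * w / len(nonrelevant)  # subtract the non-relevant centroid
    return {term: w for term, w in new_q.items() if w > 0}

# Hypothetical feedback round for the ambiguous query "jaguar".
query = {"jaguar": 1.0}
relevant = [{"jaguar": 1, "car": 2}, {"jaguar": 1, "engine": 1, "car": 1}]
nonrelevant = [{"jaguar": 1, "cat": 2}]
updated = rocchio(query, relevant, nonrelevant)
print(sorted(updated, key=updated.get, reverse=True))
```

After one round the query has picked up "car" and "engine" from the relevant results, while "cat" never enters because its weight would be negative.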

Unique, data-specific challenges • Multimedia • Bridging the semantic gap: allowing users to make queries in their own terminology. • Creating multimodal analysis and retrieval algorithms that exploit the synergy between the various media, including text and context information. • Effective browsing and summarization techniques need to be addressed. • Creation of high-performance indexes.

Unique, data-specific challenges • Blogs • Need to understand relationships between the title, body, and comments to create better clustering algorithms for selective blog search. • Need to understand the structure of a blog. • Blogs can be written using abbreviations and slang, in one or multiple paragraphs, formally or informally, etc. • Community structure and time stamping of blogs need to be studied to extract cohesive discussions.

Unique, data-specific challenges • News • News search engine technology is still considered a black art. • Indexing is a major challenge. • It is difficult to crawl and extract snippets from sources that have varying structures and patterns. • The clustering techniques used to perform similarity matches need to be enhanced to avoid presenting a flat list of search results to the user. • It is difficult to devise methods for determining the relevance of a news article from a non-credible source.

Conclusion • There is a definite and growing need to investigate and research issues related to data-specific web search. • Data-specific web search is quite different from traditional web search. • Multimedia web search differs primarily due to its multimodal nature and data size. • Blog and news searches differ mainly due to their amorphous structure and temporal nature. • ‘Search for meaning’ is significant across all of web search: rather than requiring keyword input, allowing users to make queries in their own terminology is becoming important.

References • Articles from Wikipedia • Wall, Aaron, comp. "Search Engine History." 18 Feb. 2008 <http://www.searchenginehistory.com/>. • Wikipedia. "Web search engine." <http://en.wikipedia.org/wiki/Web_search>. • Manning, Christopher, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge UP, 2008. 18 Feb. 2008 <http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html>. • Sokullu, Emre. "Search 2.0 - What's Next?" 13 Dec. 2006. • Horrigan, John B. "For many home broadband users, the internet is a primary news source." Pew Internet & American Life Project, 22 Mar. 2006. <http://www.pewinternet.org/pdfs/PIP_News.and.Broadband.pdf>. • "Multimedia Information Retrieval – Challenges." ACM SIGMM. <http://sigmm.utdallas.edu/Members/nicu/mir/challenges/>. • Tjondronegoro, D., and A. Spink. "Web search engine multimedia functionality." Information Processing and Management: an International Journal 44(1): 340-357, 2008. • Hanjalic, Alan, Nicu Sebe, and Edward Chang. "Multimedia Content Analysis, Management and Retrieval: Trends and Challenges." Multimedia Content Analysis, Management, and Retrieval 2006. • Wikipedia. "Blog." <http://en.wikipedia.org/wiki/Weblog>. • Li, Beibei, and Shuting Xu. "Enhancing Clustering Blog Documents by Utilizing Author/Reader Comments." ACM Southeast Regional Conference. Proceedings of the 45th annual southeast regional conference. • Bradley, Phil. "Search Engines: Weblog search engines." <http://www.ariadne.ac.uk/issue36/search-engines>. • Chen, Yun, Flora S. Tsai, and Kap Luk Chan. "Blog search and mining in the business domain." 2007. • Lycos Retriever. "Google News." <http://www.lycos.com/info/google-news.html>. • Yan, Wang, Guo, Yao, Lv, Wang. "The Optimization in News Search Engine Using Formal Concept Analysis." • Nagura, Ryosuke, Yohei Seki, Noriko Kando, and Masaki Aono. "A method of rating the credibility of news documents on the web." 2006.

Thanks!
