Data-Specific Web Search By: Aditya Mantri
Data on the web • The web is no longer only a means to share data and information. • It comprises social networking sites, wikis, and blogs that facilitate creativity, collaboration, and sharing among users. • Steady rise in the amount and significance of non-traditional data. • For example, a simple blog is plain textual commentary, but it is now not uncommon to see photoblogs, sketchblogs, vlogs, MP3 blogs, and podcasts.
Data on the web • Other than textual data, the web now contains, and can be searched for: • Multimedia (images, audio, video, animations, etc.) • Blogs • News • Scientific/research papers • Source code • Jobs • Travel • Health • Classifieds
Multimedia (MM) • Fundamental IR and search techniques cannot be applied to MM: • Multiple modalities: text, audio, still images, and video. • A query in the form of words cannot be matched directly against the raw multimedia file. • Size of MM data • Methods of storage and indexing need to be efficient • Therefore, structural issues such as storage and networking, as well as intelligent content analysis, need to be addressed.
Multimedia (MM) • Basic challenge: understanding the user’s query. • It is important to process raw multimedia and convert it into high-level semantics.
Multimedia (MM) • Text-based search: ‘query by word’ • Metadata (filename, captions, tags) • Cues from text and HTML source code • Cues from image content (color, image size, file type, etc.) • Content-based search, used when text annotations are nonexistent or incomplete: ‘query by similarity’ or ‘query by example’ • Image (shape, color, texture) • Video (object motion, spatio-temporal relations) • Audio (humming for music, sampling rate, pitch, brightness, bandwidth) • Relevance feedback: the query is entered using either of the above methods, and the results returned are used to improve the user’s query.
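As a rough illustration of ‘query by example’ for images, the sketch below ranks a toy collection by color similarity to an example image. It assumes images are already decoded into lists of RGB tuples and uses histogram intersection as the similarity measure; the function names and the sample data are hypothetical, and real systems would combine color with shape and texture features.

```python
from collections import Counter


def color_histogram(pixels, bins_per_channel=4):
    """Quantize RGB pixels into coarse bins and return a normalized histogram."""
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = sum(counts.values())
    return {bin_: n / total for bin_, n in counts.items()}


def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: the mass the two normalized histograms share."""
    return sum(min(v, h2.get(k, 0.0)) for k, v in h1.items())


def query_by_example(example_pixels, collection):
    """Rank images (name -> pixel list) by color similarity to the example."""
    query_hist = color_histogram(example_pixels)
    scores = {
        name: histogram_intersection(query_hist, color_histogram(pixels))
        for name, pixels in collection.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy "images" as flat lists of RGB tuples (hypothetical data).
collection = {
    "sunset": [(250, 120, 30)] * 8 + [(200, 60, 20)] * 2,
    "forest": [(20, 140, 40)] * 9 + [(90, 60, 30)] * 1,
    "beach": [(240, 200, 120)] * 5 + [(250, 120, 30)] * 5,
}
example = [(245, 110, 25)] * 10  # a mostly orange example image
ranking = query_by_example(example, collection)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

The coarse binning is a deliberate choice: it makes slightly different shades of the same color fall into the same bin, so the match is by similarity rather than by exact pixel values.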
Multimedia (Research) • Closely related to research in the MM IR field. • Paradigm shift in content analysis: using domain-specific knowledge to bridge the semantic gap between features and very specific semantic concepts. • Content mining and knowledge discovery • Automated content-based image and video annotation • Better indexing techniques
Blogs • “A website or page that is the product of (generally) an individual or of non-commercial origin that uses a date-limited or diary format, and which is updated either daily or at least regularly with new information about a subject, range of subjects, or personal details.” • Key differences to note: • Temporal information • Connected, community-based collections (the blogosphere) • Personal nature (adds to the subjectivity) • Varying structure
Blogs • Search strategy • Use a basic keyword search. • Then use clustering to reduce the number of results returned. • Optionally, use the interconnections between blogs to follow a conversation and gauge the importance of a topic in the blog. • Research • Temporal mining • Extraction of opinions from blogs • Domain-specific weblogs, using machine learning techniques and probabilistic models.
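The first two steps of this search strategy can be sketched as follows. This is only a minimal illustration: it assumes posts are plain strings, uses Jaccard word overlap as the similarity measure, and groups results with a greedy single-pass clustering. The function names, the threshold, and the sample posts are all hypothetical, not a method named in the slides.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_search(posts, query):
    """Step 1: keep posts that contain every query term."""
    terms = tokens(query)
    return [post for post in posts if terms <= tokens(post)]


def jaccard(a, b):
    """Word-set overlap in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(posts, threshold=0.3):
    """Step 2: greedy clustering; a post joins the first cluster whose
    first (seed) post is similar enough, otherwise it starts a new one."""
    clusters = []
    for post in posts:
        for group in clusters:
            if jaccard(tokens(post), tokens(group[0])) >= threshold:
                group.append(post)
                break
        else:
            clusters.append([post])
    return clusters


posts = [
    "python web search tutorial for beginners",
    "python web search tutorial for experts",
    "my python snake ate a mouse during a web search",
    "holiday photos from the beach",
]
hits = keyword_search(posts, "python search")
groups = cluster(hits)
for group in groups:
    print(len(group), "post(s), e.g.:", group[0])
```

Showing one representative per cluster instead of every hit is what reduces the number of results the user has to scan.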
News • News search engines, or news aggregators, compile syndicated web content such as news articles from various reliable sources. • Search engines differ in the way that they … • … crawl or index the news articles (some aggregators just scrape headlines) • … calculate the relevance of a story based on its credibility • … involve any human intervention
News • Research: • Solve the problem of the sheer volume of content the aggregator returns to the user for a keyword; some sort of clustering mechanism is proposed. • Select credible articles by automatically filtering out incorrect ones. • A good metric is ‘commonality’ • Other metrics such as ‘bias’ and ‘objectivity’ have also been proposed.
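The slides do not define ‘commonality’, so the sketch below rests on an assumed reading: an article scores higher when other, independent sources report similar content. For each other source it takes the best Jaccard word overlap with the article and averages those values. The function names, source labels, and sample headlines are hypothetical.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Word-set overlap in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def commonality(article, articles):
    """Assumed metric: mean, over all other sources, of the best
    similarity between this article and that source's articles."""
    source, text = article
    best_per_source = {}
    for other_source, other_text in articles:
        if other_source == source:
            continue  # a source cannot corroborate itself
        sim = jaccard(tokens(text), tokens(other_text))
        best_per_source[other_source] = max(
            sim, best_per_source.get(other_source, 0.0)
        )
    if not best_per_source:
        return 0.0
    return sum(best_per_source.values()) / len(best_per_source)


# (source, text) pairs; hypothetical data.
articles = [
    ("wire_a", "central bank raises interest rates by half a point"),
    ("wire_b", "central bank raises interest rates half a point today"),
    ("wire_c", "bank raises interest rates by half a point"),
    ("blog_x", "aliens secretly control the central bank"),
]
scores = {src: commonality((src, txt), articles) for src, txt in articles}
for src, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{src}: {score:.2f}")
```

An aggregator could filter out articles whose score falls below some cutoff, though any such cutoff would need tuning on real data.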
Common issues and trends • Query formulation • Methods such as using domain semantics, represented as ontologies, to specify and formulate queries • Improving relevance feedback • Improvements to the basic search infrastructure: crawling and indexing. • Improving and utilizing general web search trends such as personalization, better UI design, etc. • Improvements to link analysis.
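To make relevance feedback concrete, here is a minimal sketch of the Rocchio algorithm, a classic formulation from the IR literature; the slides do not name a specific method, so this is an illustrative choice. Queries and documents are assumed to be term-weight dictionaries, and the default weights (alpha, beta, gamma) are commonly quoted starting values rather than tuned settings.

```python
from collections import defaultdict


def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward the centroid of documents the user marked
    relevant and away from the centroid of those marked non-relevant."""
    updated = defaultdict(float)
    for term, weight in query.items():
        updated[term] += alpha * weight
    for doc in relevant:
        for term, weight in doc.items():
            updated[term] += beta * weight / len(relevant)
    for doc in non_relevant:
        for term, weight in doc.items():
            updated[term] -= gamma * weight / len(non_relevant)
    # Terms that end up with non-positive weight are dropped.
    return {term: w for term, w in updated.items() if w > 0}


# Hypothetical feedback round for the ambiguous query "jaguar".
query = {"jaguar": 1.0}
relevant = [
    {"jaguar": 1.0, "cat": 1.0, "jungle": 1.0},
    {"jaguar": 1.0, "cat": 1.0},
]
non_relevant = [{"jaguar": 1.0, "car": 1.0}]

new_query = rocchio(query, relevant, non_relevant)
print(sorted(new_query.items(), key=lambda kv: kv[1], reverse=True))
```

After one round the query gains the terms "cat" and "jungle" from the relevant results, while "car" is pushed out, which is how the returned results improve the user's query.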
Unique, data-specific challenges • Multimedia • Bridging the semantic gap: allowing users to make queries in their own terminology. • Creating multimodal analysis and retrieval algorithms that exploit the synergy between the various media, including text and context information. • Effective browsing and summarization techniques are needed. • Creation of high-performance indexes.
Unique, data-specific challenges • Blogs • Need to understand the relationships between the title, body, and comments to create better clustering algorithms for selective blog search. • Need to understand the structure of a blog • Blogs can be written using abbreviations and slang, in one or multiple paragraphs, formally or informally, etc. • The community structure and time stamping of blogs need to be studied to extract cohesive discussions.
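One simple way to use the title, body, and comments together is to weight each field differently when building a post's term vector, then compare posts with cosine similarity as input to clustering. The sketch below assumes posts are dictionaries with those three fields; the field weights are arbitrary illustrative values, not figures from the slides or the cited work.

```python
import math
import re
from collections import Counter

# Assumed weights: title terms count most, comment terms least.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0, "comments": 0.5}


def term_vector(post, weights=FIELD_WEIGHTS):
    """Weighted bag of words built from the post's fields."""
    vector = Counter()
    for field, weight in weights.items():
        for term in re.findall(r"[a-z0-9]+", post.get(field, "").lower()):
            vector[term] += weight
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical posts.
post_a = {
    "title": "Camera review",
    "body": "I tested the new lens this weekend",
    "comments": "great camera, which lens?",
}
post_b = {
    "title": "Lens shopping",
    "body": "looking for a camera lens under budget",
    "comments": "try the new one",
}
post_c = {
    "title": "Pasta recipe",
    "body": "boil water and add salt",
    "comments": "tasty",
}

sim_ab = cosine(term_vector(post_a), term_vector(post_b))
sim_ac = cosine(term_vector(post_a), term_vector(post_c))
print(f"a~b: {sim_ab:.2f}  a~c: {sim_ac:.2f}")
```

Here the two photography posts score well above the unrelated one, partly because reader comments reinforce the shared terms.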
Unique, data-specific challenges • News • News search engine technology is still considered a black art. • Indexing is a major challenge • It is difficult to crawl and extract snippets from sources that have varying structures and patterns. • The clustering techniques used to perform similarity matches need to be enhanced to avoid presenting a flat list of search results to the user. • It is difficult to devise methods to determine the relevance of a news article from non-credible sources.
Conclusion • There is a definite and growing need to investigate and research the issues related to data-specific web search. • Data-specific web search is quite different from traditional web search. • Multimedia web search differs primarily due to its multimodal nature and size. • Blog and news searches differ mainly due to their amorphous structure and temporal nature. • ‘Search for meaning’ is significant across all of web search: rather than searching by entering keywords, allowing users to make queries in their own terminology is becoming important.
References • Articles from Wikipedia • Wall, Aaron, comp. "Search Engine History." 18 Feb. 2008. <http://www.searchenginehistory.com/>. • Wikipedia. "Web search engine." <http://en.wikipedia.org/wiki/Web_search>. • Manning, Christopher, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge UP, 2008. 18 Feb. 2008. <http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html>. • Emre Sokullu. "Search 2.0 - What's Next?" 13 December 2006. • John B. Horrigan. "For many home broadband users, the internet is a primary news source." 22 March 2006. Pew Internet & American Life Project. <http://www.pewinternet.org/pdfs/PIP_News.and.Broadband.pdf>. • ACM SIGMM. "Multimedia Information Retrieval – Challenges." <http://sigmm.utdallas.edu/Members/nicu/mir/challenges/>. • Tjondronegoro, D., and A. Spink. "Web search engine multimedia functionality." Information Processing and Management: an International Journal 44(1): 340-357, 2008. • Alan Hanjalic, Nicu Sebe, and Edward Chang. "Multimedia Content Analysis, Management and Retrieval: Trends and Challenges." Multimedia Content Analysis, Management, and Retrieval 2006. • Wikipedia. "Blog." <http://en.wikipedia.org/wiki/Weblog>. • Beibei Li and Shuting Xu. "Enhancing Clustering Blog Documents by Utilizing Author/Reader Comments." Proceedings of the 45th Annual Southeast Regional Conference, ACM. • Phil Bradley. "Search Engines: Weblog search engines." <http://www.ariadne.ac.uk/issue36/search-engines>. • Yun Chen, Flora S. Tsai, and Kap Luk Chan. "Blog search and mining in the business domain." 2007. • Lycos Retriever. "Google News." <http://www.lycos.com/info/google-news.html>. • Yan, Wang, Guo, Yao, Lv, and Wang. "The Optimization in News Search Engine Using Formal Concept Analysis." • Ryosuke Nagura, Yohei Seki, Noriko Kando, and Masaki Aono. "A method of rating the credibility of news documents on the web." 2006.