Text Information Retrieval and Applications – Advanced Topics. By J. H. Wang May 27, 2009. Outline. Advanced Retrieval Technologies Cross-Language Information Retrieval Multimedia Information Retrieval Semantic Retrieval Applications to IR Advanced Google Meta Search
Text Information Retrieval and Applications – Advanced Topics By J. H. Wang May 27, 2009
Outline • Advanced Retrieval Technologies • Cross-Language Information Retrieval • Multimedia Information Retrieval • Semantic Retrieval • Applications to IR • Advanced Google • Meta Search • Search Result Clustering
Advanced Retrieval Technologies • Cross-Language Information Retrieval (CLIR) • Multimedia IR (image, speech, music, video) • Semantic retrieval (XML, Semantic Web)
Cross-Language Information Retrieval • Cross Language Information Retrieval (CLIR) -- A technology enabling users to query in one language and retrieve relevant documents written or indexed in another language
Cross Language Web Search • A technology enabling users to query in one language and retrieve relevant Web pages written or indexed in another language
Why “Cross-Language”? • Source: Global Reach (global-reach.biz/globstats)
Top Ten Languages Used in the Web Source: Internet World Stats (Mar. 31, 2009) More and more non-English users!
Web Content • More and more non-English pages • Source: Network Wizards Internet Domain Survey (Jan. 1999)
Chart of Web Content (by Language) [Source: Vilaweb.com, as quoted by eMarketer (Feb. 2001)] • Total Web pages: 313 B • English 68.4% • Japanese 5.9% • German 5.8% • Chinese 3.9% • French 3.0% • Spanish 2.4% • Russian 1.9% • Italian 1.6% • Portuguese 1.4% • Korean 1.3% • Other 4.6%
Language Percent of Public Sites • English 72% • German 7% • Japanese 6% • Spanish 3% • French 3% • Italian 2% • Dutch 2% • Chinese 2% • Korean 1% • Portuguese 1% • Russian 1% • Polish 1% [Source: OCLC, 2002]
Web Users and Pages (10 years ago) • Challenge of scalability! • Total users: 800M • Chinese users: 110M, including 87M (CN), 4.9M (HK), 11.6M (TW), 2.9M (MY), 2.14M (SG), 1.5M (US), and others • Source: Global Reach, 2004
Number of Chinese Web Pages • 10,030,000,000 pages • Scalability problem!
Number of Web Pages • The world’s largest search engine? • Chart: Billions of Textual Documents Indexed, December 1995-September 2003 • KEY: GG=Google, ATW=AllTheWeb, INK=Inktomi, TMA=Teoma, AV=AltaVista • Source: Search Engine Watch (Nov. 2004)
Number of Web Pages • Estimated size: • Web pages in the world: 19.2 billion pages (indexed by Yahoo as of August 2005) • Websites in the world: 70,392,567 websites (indexed by Netcraft as of August 2005) • Web pages per website: 273 (rounded to the nearest whole number) • Updated estimate: • 231,510,169 distinct websites (as found by the Netcraft Web Server Survey in April 2009) • 63.2 billion pages (231,510,169 websites × 273 pages per website) [Source: http://news.netcraft.com/archives/web_server_survey.html] [Source: http://www.boutell.com/newfaq/misc/sizeofweb.html]
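The estimate on this slide is plain arithmetic, so it can be re-derived as a quick check. The sketch below uses only the figures quoted on the slide; the variable names are illustrative.

```python
# Figures quoted on the slide.
pages_2005 = 19_200_000_000   # pages indexed by Yahoo, August 2005
sites_2005 = 70_392_567       # websites counted by Netcraft, August 2005
sites_2009 = 231_510_169      # websites counted by Netcraft, April 2009

# Average pages per website, rounded to the nearest whole number.
pages_per_site = round(pages_2005 / sites_2005)

# Extrapolate the 2005 ratio to the 2009 website count.
pages_2009 = sites_2009 * pages_per_site

print(pages_per_site)
print(f"{pages_2009 / 1e9:.1f} billion pages")
```

The extrapolation assumes the 2005 pages-per-website ratio still held in 2009, which is why the result is only a rough estimate.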
Number of Web Pages • 1 trillion unique URLs (We knew the web was big, by Jesse Alpert & Nissan Hajaj, Software Engineers, Web Search Infrastructure Team, 25 July 2008) • 19,200,000,000 pages (Mayer, Tim, 8 August 2005, Our Blog is Growing Up And So Has Our Index) • 320,000,000 pages (World Wide Web is 320 million and growing, BBC News Sci/Tech, 3 April 1998.) • 1,000,000,000 pages (Internet. How much information? 2000. Regents of the University of California.) • 800,000,000 pages (Maran, Ruth, and Paul Whitehead. "Web Pages." Internet and World Wide Web Simplified, 3rd ed. Foster City: IDG Books Worldwide, 1999. ) • 8,034,000,000 pages (Miller, Colleen. web sites: number of pages. NEC Research, IDC.) [Source: http://hypertextbook.com/facts/2007/LorantLee.shtml]
Challenge of Cross-Language Web Search • Existing CLIR systems mostly rely on bilingual dictionaries and dictionary lookup • Translations for 81% of the search terms could not be found in common English-Chinese translation dictionaries • Examples: 中央處理器 (CPU), 電子商務 (E-commerce), 個人數位助理 (PDA), 雅虎 (Yahoo), 太空總署 (NASA), 星際大戰 (Star Wars), 非典型肺炎 (SARS), …
Challenge • Existing CLIR systems mostly rely on bilingual dictionaries and dictionary lookup • Translations for 81% of the search requests could not be found in common English-Chinese translation dictionaries • How can effective translations be found automatically for query terms not included in a dictionary?
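To make the dictionary-lookup limitation concrete, here is a minimal sketch of dictionary-based query translation that also reports the out-of-dictionary rate. The toy dictionary, the sample query terms, and the function names are assumptions for illustration; they are not taken from any system mentioned in the slides.

```python
# Toy bilingual dictionary (hypothetical entries, for illustration only).
DICTIONARY = {
    "porcelain": ["瓷器", "瓷", "陶瓷"],
    "museum": ["博物館"],
}

def translate_query(terms, dictionary):
    """Look up each term; return (translations, out_of_dictionary_terms)."""
    translated, unknown = {}, []
    for term in terms:
        key = term.lower()
        if key in dictionary:
            translated[term] = dictionary[key]
        else:
            unknown.append(term)
    return translated, unknown

def oov_rate(terms, dictionary):
    """Fraction of terms that have no dictionary entry."""
    if not terms:
        return 0.0
    _, unknown = translate_query(terms, dictionary)
    return len(unknown) / len(terms)

query_terms = ["porcelain", "NASA", "SARS", "Yahoo"]
translated, unknown = translate_query(query_terms, DICTIONARY)
print(translated)
print(unknown, oov_rate(query_terms, DICTIONARY))
```

Proper nouns and new terms fall into the `unknown` list, which is exactly the set that a Web-mining approach would then need to translate.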
Query Translation & CLIR in DL (diagram, built up over several slides) • A Chinese query (瓷器) drives mono-lingual document search over Chinese digital libraries • Possible global use • An English query (Porcelain?) cannot be matched directly: need for CLIR services • Query translation: Porcelain → 瓷器/瓷/陶瓷 • Cost-ineffective to construct translation dictionaries • Taking the Web as an online corpus to deal with translation of unknown terms • Online term translation suggestions: National Palace Museum → 故宮/故宮博物院 • Auto-generated translation lexicons for English/Japanese/Korean queries
CLIR • Conventional approach to query translation • Parallel documents as the corpus • Assume long queries • Problems of CLIR in digital libraries • No corpus for cross-lingual training • Short queries • “Out-of-dictionary” terms • Ex: proper nouns, new terminology, …
Translation Lexicon Construction for CLIR • To use the Web as the corpus for query translation • Web mining techniques • Anchor-text-based [ACM TOIS ‘04, ACM TALIP ‘02] • Search-result-based [JCDL ‘04] • To extract terms from real document collections as possible queries • Term extraction method [SIGIR ‘97]
Web Mining Approach to Term Translation Extraction • LiveTrans: http://wkd.iis.sinica.edu.tw/LiveTrans/ • Diagram: a source query (Academia Sinica) goes to the LiveTrans engine, which mines anchor texts and search results from the Web and returns target translations (中央研究院/中研院)
National Palace Museum vs. 故宮博物院: Search-Result Page Noise • Mixed-language characteristic in Chinese pages • How to extract translation candidates? • Which candidates to choose?
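One simple way to approach the two questions above is to collect the Chinese strings that appear alongside the English query in search-result snippets and rank them by how many snippets contain them. The sketch below is only an illustration under strong assumptions: the snippets are invented, a "candidate" is any run of CJK ideographs delimited by non-CJK characters, and ranking is by raw snippet count. It is not the method of the search-result-based work cited in these slides.

```python
import re
from collections import Counter

# Runs of CJK unified ideographs (a simplifying stand-in for term extraction).
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")

def candidate_translations(query, snippets, min_len=2):
    """Rank CJK runs by the number of snippets that also contain the query."""
    counts = Counter()
    needle = query.lower()
    for snippet in snippets:
        if needle not in snippet.lower():
            continue  # ignore snippets that do not mention the query
        # Count each distinct run once per snippet.
        for run in set(CJK_RUN.findall(snippet)):
            if len(run) >= min_len:
                counts[run] += 1
    return counts.most_common()

# Invented mixed-language snippets, for illustration only.
snippets = [
    "國立故宮博物院 National Palace Museum 官方網站",
    "故宮博物院 (National Palace Museum) 參觀資訊",
    "National Palace Museum 故宮博物院 門票",
    "台北旅遊 Taipei travel guide",
]

ranked = candidate_translations("National Palace Museum", snippets)
print(ranked[0])
```

Real pages rarely delimit terms this cleanly, so a practical system needs proper term extraction and a stronger association measure than raw counts.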
Yahoo vs. 雅虎 -- Anchor-Text Set • Anchor text (link text) • The descriptive text of a link on a Web page • Anchor-text set • A set of anchor texts pointing to the same page (URL) • Multilingual translations • Yahoo/雅虎/야후 • America/美国/アメリカ • Anchor-text-set corpus • A collection of anchor-text sets • Example: anchor texts pointing to http://www.yahoo.com • 야후-USA (Korea) • アメリカのYahoo! (Japan) • 美国雅虎 (China) • 雅虎搜尋引擎 (Taiwan) • Yahoo Search Engine • Yahoo! America
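The anchor-text-set idea can be sketched in a few lines: group anchor texts by target URL, then treat anchor texts that share a target with the source term as translation candidates. This is a minimal sketch that ranks by raw co-occurrence counts, which is a simplification and not the model from the cited papers. The `example.org` URLs and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def build_anchor_text_sets(links):
    """Group anchor texts by the URL they point to."""
    sets = defaultdict(set)
    for anchor_text, url in links:
        sets[url].add(anchor_text)
    return dict(sets)

def translation_candidates(source_term, anchor_sets):
    """Count anchor texts that share a target URL with the source term."""
    counts = Counter()
    for texts in anchor_sets.values():
        if source_term in texts:
            for text in texts:
                if text != source_term:
                    counts[text] += 1
    return counts.most_common()

# (anchor text, target URL) pairs; the example.org URLs are made up.
links = [
    ("Yahoo", "http://www.yahoo.com"),
    ("雅虎", "http://www.yahoo.com"),
    ("야후", "http://www.yahoo.com"),
    ("Yahoo", "http://example.org/portal-list"),
    ("雅虎", "http://example.org/portal-list"),
    ("America", "http://example.org/usa"),
    ("美国", "http://example.org/usa"),
]

anchor_sets = build_anchor_text_sets(links)
print(translation_candidates("Yahoo", anchor_sets))
```

Because every anchor text in a set describes the same page, candidates come out already aligned across languages without a parallel corpus.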
Term Translation Extraction from Different Resources • Resources: anchor-text corpus (collected by a Web spider) and search-result pages (returned by a search engine) • Steps: term extraction, similarity estimation • Example: source query National Palace Museum → target translations 國立故宮博物院, 故宮, 故宮博物院
Multimedia IR • Different forms of information need • Image retrieval • Speech information retrieval • Music information retrieval • Video information retrieval
Image Retrieval • Content-based • Query by image content • Query by example (以圖找圖, “find images with an image”) • Similarity in visual features • Color, texture, shape, … • Relevance feedback • Text-based • Annotation
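As an illustration of query by example using a color feature, the sketch below compares normalized RGB histograms with histogram intersection and ranks a small collection against a query image. It assumes images are given as plain lists of (r, g, b) tuples; the tiny "images" and all function names are made up for the example, and real systems combine color with texture and shape features.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Normalized RGB histogram; pixels is a list of (r, g, b) in 0..255."""
    step = 256 / bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        rb, gb, bb = (min(int(v / step), bins_per_channel - 1) for v in (r, g, b))
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1
    total = len(pixels)
    return [h / total for h in hist] if total else hist

def histogram_intersection(h1, h2):
    """Similarity in [0, 1] for two normalized histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def query_by_example(query_pixels, collection):
    """Return image names ordered from most to least similar to the query."""
    q = color_histogram(query_pixels)
    scored = [(histogram_intersection(q, color_histogram(p)), name)
              for name, p in collection.items()]
    return [name for _, name in sorted(scored, reverse=True)]

# Tiny made-up "images" as pixel lists.
collection = {
    "red": [(250, 10, 10)] * 4,
    "mixed": [(250, 10, 10)] * 3 + [(10, 10, 250)],
    "blue": [(10, 10, 250)] * 4,
}
ranking = query_by_example([(255, 0, 0)] * 4, collection)
print(ranking)
```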
Content-Based Image Retrieval (CBIR) • Example systems • CIRES (Content-based Image Retrieval System): http://amazon.ece.utexas.edu/~qasim/research.htm • SIMPLIcity: http://www-db.stanford.edu/IMAGE/ • National Museum of History: http://210.201.141.12/cgi-bin/cbir-query.cgi?tid=-1 • …
Relevance Feedback (RF) • Figure: a query image and the similar images retrieved without RF • Source: Dr. Cheng
Similar Images Using Relevance Feedback • Figure: the same query image and the similar images retrieved using RF
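The slides do not say which feedback method the demo uses. As one classic possibility, the sketch below shows Rocchio-style relevance feedback, which moves the query feature vector toward images the user marked relevant and away from those marked non-relevant. The weights `alpha`, `beta`, and `gamma` are conventional illustrative defaults, and clipping negative weights to zero is a common convention rather than a requirement.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an updated query vector from user relevance judgments."""
    dims = len(query)

    def centroid(vectors):
        if not vectors:
            return [0.0] * dims
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]

    rel_c = centroid(relevant)
    non_c = centroid(non_relevant)
    # Negative weights are clipped to zero.
    return [max(0.0, alpha * query[i] + beta * rel_c[i] - gamma * non_c[i])
            for i in range(dims)]

# Two-dimensional toy feature vectors.
updated = rocchio([1.0, 0.0], relevant=[[0.0, 1.0]], non_relevant=[[1.0, 0.0]])
print(updated)
```

Re-running the similarity search with the updated vector is what produces the refined result list after feedback.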
Automatic Image Annotation • Problem 1: which keywords should describe a new image? • Use visual similarity to images in image banks with annotations • Example annotations: “polar bear, ice, snow”; “white bear, snow, tundra”; “polar bears, snow, fight”
Spoken Document Retrieval • Spoken document retrieval • Indexing speech messages using speech recognition • Retrieving relevant messages for a text/speech query • Techniques • Document Processing: acoustic change detection, speech/non-speech detection, Mandarin/non-Mandarin detection, story segmentation, speaker recognition/clustering • Speech Recognition • Indexing/Retrieval
SoVideo http://slam.iis.sinica.edu.tw/demo.htm
Music Information Retrieval • Finding a song by similar melody • Query by singing • Query by humming • Singer identification • Background noise • Singer voice model • Demo: • http://slam.iis.sinica.edu.tw/demo.htm
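Query by humming is often illustrated with melodic contour matching: reduce each melody to a string of up/down/repeat symbols, then find the stored song whose contour is closest in edit distance. The sketch below assumes the hummed input has already been transcribed to a sequence of MIDI note numbers, which is the hard part in practice. It is a generic illustration and not the method behind the demo linked above.

```python
def contour(pitches):
    """Parsons-style contour: U (up), D (down), R (repeat) between notes."""
    return "".join(
        "U" if cur > prev else "D" if cur < prev else "R"
        for prev, cur in zip(pitches, pitches[1:])
    )

def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def find_song(hummed_pitches, songs):
    """Return the song whose contour is closest to the hummed contour."""
    q = contour(hummed_pitches)
    return min(songs, key=lambda name: edit_distance(q, contour(songs[name])))

# Opening notes as MIDI numbers.
songs = {
    "Twinkle Twinkle Little Star": [60, 60, 67, 67, 69, 69, 67],
    "Ode to Joy": [64, 64, 65, 67, 67, 65, 64, 62],
}
# The same tune hummed a whole tone higher still matches by contour.
match = find_song([62, 62, 69, 69, 71, 71, 69], songs)
print(match)
```

Using contour rather than absolute pitch makes the match insensitive to the key the user happens to hum in.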
Video Information Retrieval • Difference with CBIR • Temporal information • Structural organization • Complexity of querying system • Techniques • Video segmentation • Keyframe identification
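Video segmentation and keyframe identification can be sketched with a simple baseline: declare a shot boundary wherever consecutive frame histograms differ by more than a threshold, then take the middle frame of each shot as its keyframe. The frame histograms, the threshold value, and the middle-frame rule are all assumptions for illustration; the slides do not specify a particular technique.

```python
def histogram_difference(h1, h2):
    """Sum of absolute bin differences between two normalized histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def detect_shot_boundaries(frame_histograms, threshold=0.5):
    """Indices i where frame i starts a new shot."""
    return [
        i for i in range(1, len(frame_histograms))
        if histogram_difference(frame_histograms[i - 1], frame_histograms[i]) > threshold
    ]

def keyframes(frame_histograms, boundaries):
    """Pick the middle frame of each shot as its keyframe."""
    if not frame_histograms:
        return []
    starts = [0] + boundaries
    ends = boundaries + [len(frame_histograms)]
    return [(s + e - 1) // 2 for s, e in zip(starts, ends)]

# Five made-up frames with two-bin histograms: a cut occurs at frame 3.
frames = [[0.9, 0.1], [0.88, 0.12], [0.9, 0.1], [0.1, 0.9], [0.12, 0.88]]
cuts = detect_shot_boundaries(frames)
print(cuts, keyframes(frames, cuts))
```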
Semantic Retrieval • HTML vs. XML • Semantic Web (Agent, Ontology, RDF)
Common Language of the Web • HTML • Link: Pi → Pj • URL (URI), anchor text • Part-of • Example: the anchor texts “National Taiwan University” and “NTU” both link to http://www.ntu.edu.tw/
Link Analysis – Hubs & Authorities in PageRank • Figure: example link graph with page scores 100, 53, 50, 50, 50, 3, 9, 3, 3
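Link-based scores like those in the figure can be computed by power iteration. The sketch below is a simplified PageRank with a damping factor and uniform handling of dangling pages; the three-page graph and the parameter values are made up. Note that "hubs" and "authorities" are terms from the HITS algorithm, which this sketch does not compute.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Toy graph: A and B both link to C, and C links back to A.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page C ranks highest because it receives links from both other pages, while B ranks lowest because nothing links to it.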
Current Web Search • Keyword-based search (e.g., Google) • Full text indexing • Page authority (link analysis) • Page popularity (query log and user clicks) • Problems • Not specific • Data in pages have no semantic annotations • Ex: Yo-Yo Ma’s most recent CD • No topic disambiguation • Documents on different topics are mixed together • Ex: Yo-Yo Ma’s CDs, concerts, biography, gossip, …
Search on Semantic Web • Metadata search • To increase precision and flexibility • Topic-based search • To help contextualize queries and overlay results in terms of a knowledge base
XML (Extensible Markup Language) • More flexible tags • DTD (Document Type Definition) • Definition of the tags
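To show what semantic tags buy over plain HTML, the sketch below answers the earlier example query, "Yo-Yo Ma’s most recent CD", directly from tagged data. The catalog, its tag names, titles, and dates are all invented for illustration, and the DTD appears only as a comment because Python's `xml.etree.ElementTree` parses XML but does not validate it against a DTD.

```python
import xml.etree.ElementTree as ET

# A DTD for this hypothetical catalog could declare the tags like this:
#   <!ELEMENT catalog (cd*)>
#   <!ELEMENT cd (artist, title, released)>
#   <!ELEMENT artist (#PCDATA)>
#   <!ELEMENT title (#PCDATA)>
#   <!ELEMENT released (#PCDATA)>

# Invented data: titles and dates are placeholders, not real releases.
CATALOG = """
<catalog>
  <cd><artist>Yo-Yo Ma</artist><title>Album A</title><released>2007-03-01</released></cd>
  <cd><artist>Yo-Yo Ma</artist><title>Album B</title><released>2008-10-14</released></cd>
  <cd><artist>Another Artist</artist><title>Album C</title><released>2009-01-20</released></cd>
</catalog>
"""

def most_recent_cd(xml_text, artist):
    """Return the title of the artist's latest CD, or None if there is none."""
    root = ET.fromstring(xml_text)
    cds = [cd for cd in root.findall("cd") if cd.findtext("artist") == artist]
    if not cds:
        return None
    # ISO dates (YYYY-MM-DD) sort correctly as strings.
    latest = max(cds, key=lambda cd: cd.findtext("released"))
    return latest.findtext("title")

print(most_recent_cd(CATALOG, "Yo-Yo Ma"))
```

A keyword search over untagged pages cannot tell a release date from any other date on the page, which is the "not specific" problem noted for current Web search.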