Fast Phrase Querying With Combined Indexes

Fast Phrase Querying With Combined Indexes HUGH E. WILLIAMS, JUSTIN ZOBEL, and DIRK BAHLE School of Computer Science & Information Technology RMIT University, Australia TOIS, Volume 22 , Issue 4 (October 2004)

2003 SCI Journal

ACM Transactions on Information Systems (TOIS), Volume 22, Issue 4 (October 2004) • Qualitative decision making in adaptive presentation of structured information. Ronen I. Brafman, Carmel Domshlak, Solomon E. Shimony. Pages: 503–539 • Analysis of lexical signatures for improving information persistence on the World Wide Web. Seung-Taek Park, David M. Pennock, C. Lee Giles, Robert Krovetz. Pages: 540–572 • Fast phrase querying with combined indexes. Hugh E. Williams, Justin Zobel, Dirk Bahle. Pages: 573–594 • Information systems interoperability: What lies beneath? Jinsoo Park, Sudha Ram. Pages: 595–632

Abstract • Search engines need to evaluate queries extremely fast (low disk overheads) • A significant proportion of the queries are phrases, indicating that some of the query terms must be ordered and adjacent • Nextword indexes (indexes are twice as large) • Special-purpose phrase indexes • Combined version with inverted files: additional space overhead is only 26%

Inverted lists: no practical alternatives • Phrase queries • If the inverted lists involve a common word, phrase search becomes very slow • Computer architecture: "make the common case fast", a fundamental law called Amdahl's Law

PROPERTIES OF QUERIES • Gather large numbers of queries and see how users choose to express their information needs • 8.3% were phrase queries ("xx oo") • 41% of the remaining queries matched a phrase • 8.4% included one of {the, to, of} • 14.4% included one of the top-20 common terms • Structural common terms • Common, but played an important role • Back to stopping issues • Originally, the 122,438 queries matched 309 × 10^6 documents • With 3 stopped words they match 390 × 10^6, with 20 stopped words 490 × 10^6, and with 254 stopped words 1693 × 10^6 • Median number of words: 2; average 2. • 34% have 3 words or more; 1.3% have 6 words or more • 0.4% of phrase queries have {the, to, of} at the end • No queries of 4 or more words terminate with a common term • In a short query ending in a common term, the other words are usually common too

Phrase query evaluation Test Data

Inverted index • No practical alternatives • Term indexing, 2-level, a list of postings • Document identifier • In-document frequency • And a list of offsets: $\{d, f_{d,t}, [o_1, \ldots, o_{f_{d,t}}]\}$ • Stopping • Complete phrase indexes
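The posting layout on this slide can be illustrated with a minimal sketch. The function name `build_inverted_index`, the whitespace tokenizer, and the toy documents are assumptions made for illustration; they are not taken from the paper.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to postings of the form (d, f_dt, [o_1, ..., o_f_dt])."""
    # term -> doc id -> word offsets of the term in that document
    positions = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        # Assumption: lowercase whitespace tokenization, no stopping.
        for offset, term in enumerate(text.lower().split()):
            positions[term][doc_id].append(offset)
    index = {}
    for term, by_doc in positions.items():
        # The in-document frequency f_dt is the length of the offset list.
        index[term] = [(d, len(offs), offs) for d, offs in sorted(by_doc.items())]
    return index

docs = ["the cat sat on the mat", "the dog sat"]
index = build_inverted_index(docs)
print(index["the"])  # postings for a common word are the longest
```

Common words such as "the" occur in nearly every document, so their postings lists dominate the cost of reading the index from disk.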

Sorted Phrase Algorithm • Starts from a superset of candidate matches, which is progressively pruned; an n-term phrase needs n fetching steps and n-1 merging steps
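A rough sketch of this pruning idea follows. It assumes terms are processed from rarest to most common, and it uses a hand-written toy index of the hypothetical form `term -> {doc_id: [offsets]}`. The function name `phrase_query` is illustrative and this is not the authors' implementation.

```python
def phrase_query(index, phrase):
    """Return {doc_id: [phrase start offsets]} for exact phrase matches."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return {}
    # Assumption: order term slots by document count, rarest first.
    order = sorted(range(len(terms)), key=lambda i: len(index[terms[i]]))
    first = order[0]
    # Fetch 1: the rarest term gives the initial superset of candidate starts.
    candidates = {
        d: {o - first for o in offs if o - first >= 0}
        for d, offs in index[terms[first]].items()
    }
    # Each remaining term costs one fetch and one merge that prunes candidates.
    for i in order[1:]:
        postings = index[terms[i]]
        pruned = {}
        for d, starts in candidates.items():
            if d in postings:
                offs = set(postings[d])
                kept = {s for s in starts if s + i in offs}
                if kept:
                    pruned[d] = kept
        candidates = pruned
    return {d: sorted(s) for d, s in candidates.items() if s}

# Toy index for: 0 "the cat sat on the mat", 1 "the dog sat", 2 "saw the cat"
toy_index = {
    "the": {0: [0, 4], 1: [0], 2: [1]},
    "cat": {0: [1], 2: [2]},
    "sat": {0: [2], 1: [2]},
    "dog": {1: [1]},
    "on": {0: [3]},
    "mat": {0: [5]},
    "saw": {2: [0]},
}
print(phrase_query(toy_index, "the cat"))
```

Starting from the rarest term keeps the candidate set small before the long list of a common word is merged, but that long list still has to be fetched, which is the cost the later index designs try to avoid.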

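Considerations • Add a little more additional information to the inverted lists; letting the CPU decode it is fine • In the past, CPU cycles were the scarcer resource • Today disk access needs to be more efficient, so once the useful information has been read in one go, the CPU can do the rest quickly • Where is the new tradeoff?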

Phrase Indexes • Partial Phrase Indexes • Phrases that were frequently searched in the past can be used as index entries

Nextword Indexes
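The slide gives only the heading, so the following is a hedged sketch of the general idea as it is usually described: for each first word, store the words that follow it, with positional postings for each pair. The names `build_nextword_index` and `lookup_pair`, the nested-dictionary layout, and the toy documents are assumptions for illustration.

```python
from collections import defaultdict

def build_nextword_index(docs):
    """firstword -> nextword -> {doc_id: [offsets of firstword]}."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        # Record every adjacent word pair with the offset of its first word.
        for offset in range(len(words) - 1):
            index[words[offset]][words[offset + 1]][doc_id].append(offset)
    return index

def lookup_pair(index, first, nxt):
    """Answer a two-word phrase query from a single pair list."""
    if first not in index or nxt not in index[first]:
        return {}
    return {d: list(offs) for d, offs in index[first][nxt].items()}

docs = ["the cat sat on the mat", "the dog sat", "saw the cat"]
nextword = build_nextword_index(docs)
print(lookup_pair(nextword, "the", "cat"))
```

A two-word phrase is answered from one short pair list instead of two full postings lists, at the price of storing a list for every distinct word pair.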

Combined Inverted and Nextword

Combined Inverted and Nextword Indexes • Combined Inverted and Phrase Indexes • Three-Way Index Combination

ACM Computing Surveys (CSUR), Volume 36, Issue 1 (March 2004) • Advances in dataflow programming languages. Wesley M. Johnston, J. R. Paul Hanna, Richard J. Millar. Pages: 1–34 • Image Retrieval from the World Wide Web: Issues, Techniques, and Systems. M. L. Kherfi, D. Ziou, A. Bernardi. Pages: 35–67 • Line drawing, leap years, and Euclid. Mitchell A. Harris, Edward M. Reingold. Pages: 68–80

ACM Computing Surveys (CSUR), Volume 35, Issue 4 (December 2003) • An analysis of XML database solutions for the management of MPEG-7 media descriptions. Utz Westermann, Wolfgang Klas. Pages: 331–373 • A survey of Web cache replacement strategies. Stefan Podlipnig, Laszlo Böszörmenyi. Pages: 374–398 • Face recognition: A literature survey. W. Zhao, R. Chellappa, P. J. Phillips, A. Rosenfeld. Pages: 399–458

Intelligent Systems, IEEE, Volume 19, Issue 4, July-Aug. 2004 • Ontology versioning in an ontology management framework. Noy, N.F.; Musen, M.A. Pages: 6–13 • Guest Editors' Introduction: Semantic Web Services. Payne, T.; Lassila, O. Pages: 14–15 • Automatically composed workflows for grid environments • ODE SWS: a framework for designing and composing semantic Web services • KAoS policy management for semantic Web services • Filtering and selecting semantic Web services with interactive composition techniques • Authorization and privacy for semantic Web services • Value Webs: using ontologies to bundle real-world services …

ACM Transactions on Information Systems (TOIS), Volume 22, Issue 3 (July 2004) • Relevance models to help estimate document and query parameters. David Bodoff. Pages: 357–380 • Efficient mining of both positive and negative association rules. Xindong Wu, Chengqi Zhang, Shichao Zhang. Pages: 381–405 • Trustworthy 100-year digital objects: Evidence after every witness is dead. Henry M. Gladney. Pages: 406–436 • PocketLens: Toward a personal recommender system. Bradley N. Miller, Joseph A. Konstan, John Riedl. Pages: 437–476 • Distributed content-based visual information retrieval system on peer-to-peer networks. Irwin King, Cheuk Hang Ng, Ka Cheung Sia. Pages: 477–501

Each person's scan takes about 15 to 20 minutes • Plan: two people per week, with the order assigned randomly by computer

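Using search engines to assist wrong-character (typo) detection • 姍姍來遲: google 3130, openfind 7100 • 珊珊來遲: google 2410, openfind 737 • Features of the classifier • Naive: just use the page count directly • Complex: check whether the top returned URLs and summaries themselves contain other wrong characters • Successfully exploits local context information • Application • Correcting wrong characters • Building a database of wrong characters • Finding which sites consistently misspell, and from that judging those sites unreliable • Issues: • How to detect the typo candidates? • How to get the initial gold standard for evaluation? • Academic sites should use words more precisely... trust them for now • Some variants are in common use, not errors • Can the approach be extended to other languages?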
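The naive page-count feature can be sketched in a few lines, using the counts quoted on this slide. The function names `likely_correct` and `minority_share` are hypothetical, and treating the more frequent variant as the correct one is exactly the naive assumption the slide describes, not a validated classifier.

```python
def likely_correct(counts):
    """Naive rule: the variant with the highest page count is taken as correct."""
    return max(counts, key=counts.get)

def minority_share(counts):
    """Fraction of all hits that go to the less frequent variants."""
    total = sum(counts.values())
    return 1 - max(counts.values()) / total

# Page counts as quoted on the slide.
google = {"姍姍來遲": 3130, "珊珊來遲": 2410}
openfind = {"姍姍來遲": 7100, "珊珊來遲": 737}

for name, counts in (("google", google), ("openfind", openfind)):
    print(name, likely_correct(counts), round(minority_share(counts), 3))
```

The two engines agree on the winner but differ a lot in how large the minority share is, which illustrates why some widely used variants are hard to label as errors from counts alone.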

(LDC) Chinese Gigaword • Authors: David Graff, Ke Chen • Data Source(s): newswire • Project(s): EARS, TIDES • Distribution: 1 DVD(s) • Membership Year(s): 2003 • Non-member Price: US$2500 • Central News Agency of Taiwan (cna) • Xinhua News Agency of Beijing • Mandarin Chinese News Text • TREC Mandarin • TDT Multilanguage Text corpora • UTF-8 character encoding

The Stanford NLP group includes: • Professors • Chris Manning, Computer Science and Linguistics • Dan Jurafsky, Linguistics • 2 linguistics postdocs (1 Chinese) • 8 PhD students (3 visiting from other schools) • 4 master's students, 1 assistant, 10 graduates

Topics • NER • Topic Detecting, Summary • QA (林川傑) • X • 林其青 • 林其青 • 楊宸彥, NTU tagger • Multilingual IR • X • Transliteration • IR with Image, Media, Web • Bio info • Opinion • Computational Semantics • Named Entity Recognition (NER) and Information Extraction (IE) • The Stanford Edinburgh Entity Recognition (SEER) Project • Shallow Semantic Parsing • Question Answering (QA) • Knowledge Representation from Text • The NLKR Project: solving natural-language logic puzzles • Thesaurus Induction • Word Sense Disambiguation (WSD) • Parsing & Tagging • Probabilistic Parsing, Part-of-speech (POS) tagging • Multilingual NLP • Chinese NLP, Arabic NLP, German NLP • Unsupervised Induction of Linguistic Structure • Grammar Induction • Morphology & Phonology Induction • Thesaurus Induction • Other • Personalized PageRank algorithms • Clustering Models • Computational Lexicography • Text Categorization • Discriminative Models

The Stanford Parser • Java implementations of probabilistic natural language parsers, both highly optimized PCFG and dependency parsers, and a lexicalized PCFG parser • The Stanford Classifier • A Java implementation of conditional loglinear model classification (a.k.a., maximum entropy models) • The Stanford POS Tagger • A Java implementation of a maximum-entropy part-of-speech (POS) tagger • QuASI Software
