Fast Phrase Querying With Combined Indexes. HUGH E. WILLIAMS, JUSTIN ZOBEL, and DIRK BAHLE. School of Computer Science & Information Technology, RMIT University, Australia. TOIS, Volume 22, Issue 4 (October 2004). 2003 SCI Journal. ACM Transactions on Information Systems (TOIS).
ACM Transactions on Information Systems (TOIS) Volume 22, Issue 4 (October 2004) • Qualitative decision making in adaptive presentation of structured information. Ronen I. Brafman, Carmel Domshlak, Solomon E. Shimony. Pages: 503-539 • Analysis of lexical signatures for improving information persistence on the World Wide Web. Seung-Taek Park, David M. Pennock, C. Lee Giles, Robert Krovetz. Pages: 540-572 • Fast phrase querying with combined indexes. Hugh E. Williams, Justin Zobel, Dirk Bahle. Pages: 573-594 • Information systems interoperability: What lies beneath? Jinsoo Park, Sudha Ram. Pages: 595-632
Abstract • Search engines need to evaluate queries extremely fast (low disk overheads) • A significant proportion of queries are phrases, indicating that some of the query terms must be ordered and adjacent • Nextword indexes (indexes are twice as large) • Special-purpose phrase indexes • Combined version with inverted files: additional space overhead is only 26%
Inverted list: no practical alternatives • Phrase queries • If the inverted lists include common words, phrase search becomes very slow • Computer architecture: "make the common case fast", the fundamental law known as Amdahl's Law
PROPERTIES OF QUERIES • Gather large numbers of queries and see how users choose to express their information needs • 8.3% were phrase queries “xx oo” • 41% of the remaining queries matched a phrase • 8.4% included one of {the, to, of} • 14.4% included one of the top-20 common terms • Structural common terms • Common, but they play an important role • Back to stopping issues • Originally, the 122,438 queries matched 309 × 10^6 documents • Stopping 3 words matches 390 × 10^6, stopping 20 words matches 490 × 10^6, stopping 254 words matches 1693 × 10^6 • Median number of words: 2; average 2. • 34% have 3 words or more; 1.3% have 6 words or more • 0.4% of phrase queries have {the, to, of} at the end • No queries of 4 or more words terminate with a common term • In a short query ending in a common term, the other terms are usually common too
Inverted index • No practical alternatives • Term indexing, 2-level, a list of postings • Document identifier • In-document frequency • A list of offsets: <d, f_{d,t}, [o_1, ..., o_{f_{d,t}}]> • Stopping • Complete phrase indexes
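The posting layout on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_index`, whitespace tokenisation and lowercasing are not from the paper, and a real index would store the lists compressed on disk.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to postings (d, f_dt, offsets), one per document.

    Assumptions for illustration: documents are plain strings,
    tokens are whitespace-separated, and document ids are list positions.
    """
    positions = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for offset, term in enumerate(text.lower().split()):
            positions[term][doc_id].append(offset)
    # f_dt is simply the number of offsets recorded for term t in document d.
    return {
        term: [(d, len(offs), offs) for d, offs in sorted(by_doc.items())]
        for term, by_doc in positions.items()
    }

docs = ["the cat sat on the mat", "the dog sat"]
index = build_index(docs)
print(index["the"])
```

The common word "the" already has the longest list in this tiny collection, which is the situation the stopping and combined-index slides are concerned with.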
Sorted Phrase Algorithm • Start from a superset of candidates, which is progressively pruned; needs n fetching steps and n-1 merging steps
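A minimal sketch of the pruning idea, assuming the index is an in-memory dict of the form term -> {document: offsets}. The name `phrase_query` and the choice to order lists by the number of documents they contain are assumptions for illustration; the slide only states that n lists are fetched and merged in n-1 steps.

```python
def phrase_query(index, phrase):
    """Return {doc: [start offsets]} for exact occurrences of the phrase.

    index: {term: {doc_id: [offsets]}} (assumed in-memory layout).
    """
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return {}
    # Fetch order: rarest term first, so the candidate superset starts small.
    order = sorted(enumerate(terms), key=lambda p: len(index[p[1]]))
    first_pos, first_term = order[0]
    candidates = {
        d: {o - first_pos for o in offs if o >= first_pos}
        for d, offs in index[first_term].items()
    }
    # n - 1 merging steps, each pruning the candidate phrase starts.
    for pos, term in order[1:]:
        postings = index[term]
        merged = {}
        for d, starts in candidates.items():
            if d in postings:
                kept = starts & {o - pos for o in postings[d]}
                if kept:
                    merged[d] = kept
        candidates = merged
    return {d: sorted(s) for d, s in candidates.items() if s}

toy_index = {
    "new": {0: [0, 5], 1: [3]},
    "york": {0: [1], 1: [7]},
}
print(phrase_query(toy_index, "new york"))
```

Each candidate is stored as the offset where the whole phrase would start, so a merge step reduces to a set intersection per document.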
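Considerations • Add more additional information to the inverted lists; letting the CPU decode it is fine • In the past, CPU cycles were more precious • Today disk access needs to be more efficient, so once the useful information has been read in one pass, the CPU can do the rest quickly • Where is the new tradeoff?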
Phrase Indexes • Partial Phrase Indexes • Frequently searched past queries can be used as index entries
Combined Inverted and Nextword Indexes • Combined Inverted and Phrase Indexes • Three-Way Index Combination
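A rough sketch of the first combination, inverted plus nextword, may help. It assumes that nextword lists (firstword, nextword) -> offsets are kept only when the firstword is common, and that everything else goes through the ordinary inverted index. The common-word set {the, to, of} comes from the query statistics slide; using exactly these words, the function names, and the in-memory dict layout are assumptions for illustration, not the paper's implementation.

```python
from collections import defaultdict

COMMON = {"the", "to", "of"}  # assumed set of common firstwords

def build_combined(docs, common=COMMON):
    """Build an inverted index plus nextword lists for common firstwords."""
    inverted = defaultdict(lambda: defaultdict(list))
    nextword = defaultdict(lambda: defaultdict(list))
    for d, text in enumerate(docs):
        words = text.lower().split()
        for o, w in enumerate(words):
            inverted[w][d].append(o)
            if w in common and o + 1 < len(words):
                nextword[(w, words[o + 1])][d].append(o)
    return inverted, nextword

def combined_phrase_query(inverted, nextword, phrase, common=COMMON):
    """Return {doc: [phrase start offsets]} using both indexes."""
    terms = phrase.lower().split()
    lookups, covered = [], set()
    # Common words that are not last are resolved through short pair lists.
    for i, t in enumerate(terms[:-1]):
        if t in common:
            lookups.append((nextword.get((t, terms[i + 1]), {}), i))
            covered.update((i, i + 1))
    # Words not covered by any pair fall back to the inverted index.
    for i, t in enumerate(terms):
        if i not in covered:
            lookups.append((inverted.get(t, {}), i))
    lookups.sort(key=lambda entry: len(entry[0]))  # shortest list first
    result = None
    for postings, pos in lookups:
        starts = {d: {o - pos for o in offs if o >= pos}
                  for d, offs in postings.items()}
        if result is None:
            result = starts
        else:
            result = {d: result[d] & starts[d]
                      for d in result.keys() & starts.keys()}
        result = {d: s for d, s in result.items() if s}
    return {d: sorted(s) for d, s in (result or {}).items()}

docs = ["the lord of the rings", "lord of war and the rings"]
inverted, nextword = build_combined(docs)
print(combined_phrase_query(inverted, nextword, "lord of the rings"))
```

In this sketch the long lists for "of" and "the" are never read on their own: the query touches the pair lists ("of", "the") and ("the", "rings") and the ordinary list for "lord".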
ACM Computing Surveys (CSUR) Volume 36, Issue 1 (March 2004) • Advances in dataflow programming languages. Wesley M. Johnston, J. R. Paul Hanna, Richard J. Millar. Pages: 1-34 • Image Retrieval from the World Wide Web: Issues, Techniques, and Systems. M. L. Kherfi, D. Ziou, A. Bernardi. Pages: 35-67 • Line drawing, leap years, and Euclid. Mitchell A. Harris, Edward M. Reingold. Pages: 68-80
ACM Computing Surveys (CSUR) Volume 35, Issue 4 (December 2003) • An analysis of XML database solutions for the management of MPEG-7 media descriptions. Utz Westermann, Wolfgang Klas. Pages: 331-373 • A survey of Web cache replacement strategies. Stefan Podlipnig, Laszlo Böszörmenyi. Pages: 374-398 • Face recognition: A literature survey. W. Zhao, R. Chellappa, P. J. Phillips, A. Rosenfeld. Pages: 399-458
Intelligent Systems, IEEE Volume: 19, Issue: 4, Year: July-Aug. 2004 • Ontology versioning in an ontology management framework. Noy, N.F.; Musen, M.A. Page(s): 6-13 • Guest Editors' Introduction: Semantic Web Services. Payne, T.; Lassila, O. Page(s): 14-15 • Automatically composed workflows for grid environments • ODE SWS: a framework for designing and composing semantic Web services • KAoS policy management for semantic Web services • Filtering and selecting semantic Web services with interactive composition techniques • Authorization and privacy for semantic Web services • Value Webs: using ontologies to bundle real-world services …
ACM Transactions on Information Systems (TOIS) Volume 22, Issue 3 (July 2004) • Relevance models to help estimate document and query parameters. David Bodoff. Pages: 357-380 • Efficient mining of both positive and negative association rules. Xindong Wu, Chengqi Zhang, Shichao Zhang. Pages: 381-405 • Trustworthy 100-year digital objects: Evidence after every witness is dead. Henry M. Gladney. Pages: 406-436 • PocketLens: Toward a personal recommender system. Bradley N. Miller, Joseph A. Konstan, John Riedl. Pages: 437-476 • Distributed content-based visual information retrieval system on peer-to-peer networks. Irwin King, Cheuk Hang Ng, Ka Cheung Sia. Pages: 477-501
Each person's journal scan takes about 15 to 20 minutes • Tentatively two people per week, scheduled randomly by computer
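Application: using search engines to assist wrong-character (typo) detection • 姍姍來遲: google 3130, openfind 7100 • 珊珊來遲: google 2410, openfind 737 • Features of the classifier • Naïve: just use the page count directly • Complex: check whether the top returned URLs and summaries themselves contain other wrong characters • Successfully exploits local context information • Application • Correcting wrong characters • Building a database of commonly miswritten characters • Finding out which websites frequently contain wrong characters, and from that judging those sites to be unreliable • Issues: • How to detect the typo candidates? • How to get the initial gold standard for evaluation? • Academic websites should use words more precisely... trust them for now • Some variants are in common accepted use and are not errors • Can the approach be extended to other languages?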
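The naïve page-count feature can be written down directly. The sketch below assumes the counts have already been collected by hand (it does not call any search engine API), and the function name `pick_by_page_count` is hypothetical. The numbers are the ones quoted on the slide.

```python
def pick_by_page_count(counts):
    """For each engine, return the spelling variant with the most hits.

    counts: {variant: {engine: page_count}}, collected beforehand.
    """
    engines = {e for per_engine in counts.values() for e in per_engine}
    return {
        e: max(counts, key=lambda v: counts[v].get(e, 0))
        for e in sorted(engines)
    }

# Page counts quoted on the slide.
counts = {
    "姍姍來遲": {"google": 3130, "openfind": 7100},
    "珊珊來遲": {"google": 2410, "openfind": 737},
}
votes = pick_by_page_count(counts)
print(votes)
```

Both engines favour the same variant here, but the Google margin is much narrower than the Openfind one, which is one reason the slide also lists a more complex feature based on the returned URLs and summaries.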
(LDC) Chinese Gigaword • Authors: David Graff, Ke Chen • Data Source(s): newswire • Project(s): EARS, TIDES • Distribution: 1 DVD(s). • Membership Year(s): 2003 • Non-member Price: US$2500 • Central News Agency of Taiwan (cna) • Xinhua News Agency of Beijing • Mandarin Chinese News Text • TREC Mandarin • TDT Multilanguage Text corpora • UTF-8 character encoding
The Stanford NLP group includes: • Professors • Chris Manning, Computer Science and Linguistics • Dan Jurafsky, Linguistics • 2 linguistics postdocs (1 Chinese) • 8 PhD students (3 visiting from other schools) • 4 master's students, 1 assistant, 10 graduates
Topics • NER • Topic Detecting, Summary • QA (林川傑) • X • 林其青 • 林其青 • 楊宸彥, NTU tagger • Multilingual IR • X • Transliteration • IR with Image, Media, Web • Bio info • Opinion • Computational Semantics • Named Entity Recognition (NER) and Information Extraction (IE) • The Stanford Edinburgh Entity Recognition (SEER) Project • Shallow Semantic Parsing • Question Answering (QA) • Knowledge Representation from Text • The NLKR Project: solving natural-language logic puzzles • Thesaurus Induction • Word Sense Disambiguation (WSD) • Parsing & Tagging • Probabilistic Parsing, Part-of-speech (POS) tagging • Multilingual NLP • Chinese NLP, Arabic NLP, German NLP • Unsupervised Induction of Linguistic Structure • Grammar Induction • Morphology & Phonology Induction • Thesaurus Induction • Other • Personalized PageRank algorithms • Clustering Models • Computational Lexicography • Text Categorization • Discriminative Models
The Stanford Parser • Java implementations of probabilistic natural language parsers, both highly optimized PCFG and dependency parsers, and a lexicalized PCFG parser • The Stanford Classifier • A Java implementation of conditional loglinear model classification (a.k.a., maximum entropy models) • The Stanford POS Tagger • A Java implementation of a maximum-entropy part-of-speech (POS) tagger • QuASI Software