Search Engines Information Retrieval in Practice

Search Engines Information Retrieval in Practice Ch5. Ranking with Indexes

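Chapter 5 covers • First half (through 5.4 Compression) • The types of inverted index and the information each type holds • Algorithms for compressing the lists • Second half (성빈) • How to build an inverted index • How to compute a document ranking for a query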

Outline • 5.1 Overview • 5.2 Abstract Model of Ranking • 5.3 Inverted Indexes • 3 types of inverted indexes • fields and extents • etc • 5.4 Compression • concept • bit-aligned • byte-aligned • compression in practice

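5.1 Overview • Inverted Index • A general term for the basic data structures used in text search • The details vary with the ranking function • A few well-performing inverted index variants are used by most engines • The formats of the inverted indexes in use differ little • This is because the functions that compute document scores have similar forms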

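5.1 Overview • Why use a specialized structure • An array alone can do the job • Drawbacks • An unsorted array is slow to search • A sorted array is slow to insert into • Hash tables, tree structures • More complex than arrays • Strong in both search and insertion speed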

5.1 Overview (cont.) • Strengths by data structure • List: a good fit when items only need to be stored • Hash table: fast lookup • B-tree, priority queue (heap): support more complex operations

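5.2 Abstract Model of Ranking • The ranking model that underlies the discussion of inverted indexes • Topical features: weights of the words related to the query • Quality features: values related to the quality of the document itself • A document with no incoming links that has not been updated for a long time gets a low matching score for a query • Feature functions are not covered here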

5.2 Abstract Model of Ranking • cont.

5.2 Abstract Model of Ranking • cont. • Expanded query • Parameters (how much weight each feature is given) • Score = (topical features) + (quality features) • 303.01 = {(9.7*5.2)+(4.2*3.4)+(22.1*9.9)} + {(14*1.2)+(3*0.9)}
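The score on this slide is a weighted sum of the form R(Q, D) = sum over i of g_i(Q) * f_i(D). A minimal sketch follows. The feature names (`t1`, `q1`, ...) are hypothetical, and it assumes the first number in each product is the document feature value and the second is the query weight.

```python
def rank_score(query_weights, doc_features):
    """R(Q, D) = sum_i g_i(Q) * f_i(D), skipping features whose query weight is zero."""
    return sum(
        g * doc_features.get(name, 0.0)
        for name, g in query_weights.items()
        if g != 0.0
    )

# Hypothetical feature names; the numbers are the ones shown on the slide.
# Assumption: t* are topical features, q* are quality features.
query_weights = {"t1": 5.2, "t2": 3.4, "t3": 9.9, "q1": 1.2, "q2": 0.9}
doc_features = {"t1": 9.7, "t2": 4.2, "t3": 22.1, "q1": 14.0, "q2": 3.0}

score = rank_score(query_weights, doc_features)
print(round(score, 2))  # 303.01
```

Iterating over the query weights rather than the document features keeps the loop short when only a few query features are non-zero.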

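5.2 Abstract Model of Ranking • cont. • A query can in principle have millions of features, so summing over all of them would be costly • In a real search engine most g_i(Q) values are zero, so the sum only runs over the few non-zero query features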

5.3 Inverted Indexes • 3 types of inverted indexes • fields and extents • etc

5.3 Inverted Indexes • Inverted index • Regarded as the most efficient and robust index structure in modern search engines • Organized by index term • Each term is associated with an inverted list of relevant data • Each entry in a list is called a posting • A posting stores a pointer to the associated document (or a location in it) • Inverted lists are usually ordered by document number • More complex ranking functions need more information • Extend the index with additional data • Costs extra memory and computation • Enables more effective ranking

5.3 Inverted Indexes • Inverted index (figure labels: index term, posting, pointer) • simple • with counts • with positions • fields and extents • other issues

5.3 Inverted Indexes Example “Collection”

5.3 Inverted Indexes • Simple inverted index • Each index term maps to a list of document numbers • No within-document frequency • idf can be computed

5.3 Inverted Indexes • Inverted index with counts • Each posting is "doc no : tf" • Stores the within-document frequency (tf) • idf can be computed, plus tf information • Supports better ranking algorithms

5.3 Inverted Indexes • Inverted index with positions • Each posting is "doc no, position" • Stores the position within the document (which word it is, counting from the start) • Enables proximity and phrase matching • Supports proximity matches
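The three index types differ only in how much each posting records. The sketch below builds all three from one pass over a tiny collection. The two documents and the whitespace tokenizer are assumptions for illustration, not the example collection from the slides.

```python
from collections import defaultdict

def build_indexes(docs):
    """docs: {doc_no: text}. Returns (simple, with_counts, with_positions)."""
    occurrences = defaultdict(dict)  # term -> {doc_no: [word positions]}
    for doc_no in sorted(docs):      # keeps every list ordered by document number
        for pos, term in enumerate(docs[doc_no].lower().split(), start=1):
            occurrences[term].setdefault(doc_no, []).append(pos)

    simple = {t: list(by_doc) for t, by_doc in occurrences.items()}
    with_counts = {
        t: [(d, len(ps)) for d, ps in by_doc.items()]
        for t, by_doc in occurrences.items()
    }
    with_positions = {
        t: [(d, p) for d, ps in by_doc.items() for p in ps]
        for t, by_doc in occurrences.items()
    }
    return simple, with_counts, with_positions

# Hypothetical two-document collection.
docs = {
    1: "tropical fish include fish found in tropical environments",
    2: "fish live in water",
}
simple, with_counts, with_positions = build_indexes(docs)
print(simple["fish"])              # [1, 2]
print(with_counts["fish"])         # [(1, 2), (2, 1)]
print(with_positions["tropical"])  # [(1, 1), (1, 7)]
```

The length of a term's simple list is its document frequency, which is why idf is available even without counts.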

5.3 Inverted Indexes • Proximity matching • ex) tropical fish (within a window of 5 words) • Postings as doc, position • tropical: 1,1 2,1 2,13 • fish: 1,2 1,8 2,3 3,1 • result: Doc# 1, Doc# 2 • Phrase matching is proximity matching with a window size of 2
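Proximity matching over a positional index can be sketched as follows. The postings are illustrative (doc, position) pairs, and the function name and the window convention (two words fit in a window of w words when their positions differ by less than w) are assumptions chosen so that a window of 2 means adjacent words.

```python
def proximity_match(postings_a, postings_b, window):
    """Return the sorted doc numbers where both terms occur within `window` words."""
    positions_b = {}
    for doc, pos in postings_b:
        positions_b.setdefault(doc, []).append(pos)

    hits = set()
    for doc, pos_a in postings_a:
        for pos_b in positions_b.get(doc, []):
            if abs(pos_a - pos_b) < window:
                hits.add(doc)
    return sorted(hits)

# Illustrative postings as (doc, position).
tropical = [(1, 1), (2, 1), (2, 13)]
fish = [(1, 2), (1, 8), (2, 3), (3, 1)]

print(proximity_match(tropical, fish, window=5))  # [1, 2]
print(proximity_match(tropical, fish, window=2))  # [1]
```

This version ignores word order. A strict phrase match would additionally require the second term's position to be exactly one greater than the first's.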

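5.3 Inverted Indexes • Fields and Extents • Query: Tropical fish • Document A: the title is "Tropical fish"; the phrase appears a moderate number of times in the body • Document B: the title is "Welcome to Mauritius"*; tropical fish appears very often in the body • *Mauritius: a resort island in the Indian Ocean, next to Madagascar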

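5.3 Inverted Indexes • Fields and Extents • Query: Tropical fish • The "Welcome to Mauritius" document mentions tropical fish more often, but people prefer the document that has Tropical fish in its title • So specific fields of a document have to be taken into account in ranking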

5.3 Inverted Indexes • Fields and Extents • Method 1 (extent list): build a separate inverted list for the field of interest • fish: 1,1 1,4 2,7 • title: 1:(1,3) 2:(1,5) • Method 2: mark the field in each posting (here 0 is body, 1 is title, 2 is author) • fish: 1:(1,1) 1:(4,0) 2:(7,0) • donald: 1:(1,2)
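With an extent list, checking whether a term occurrence falls inside a field is a range test. The sketch below assumes extents are stored per document as (start, end) word positions with an inclusive end; the function names and the sample numbers are illustrative.

```python
def in_extent(position, extents):
    """True if `position` lies inside any (start, end) extent, end inclusive."""
    return any(start <= position <= end for start, end in extents)

def field_matches(term_postings, extent_list):
    """Keep only the (doc, position) postings that fall inside the field's extents."""
    return [
        (doc, pos)
        for doc, pos in term_postings
        if in_extent(pos, extent_list.get(doc, []))
    ]

# Illustrative data: term postings as (doc, position),
# title extents as {doc: [(start, end)]}.
fish = [(1, 1), (1, 4), (2, 7)]
title = {1: [(1, 3)], 2: [(1, 5)]}

print(field_matches(fish, title))  # [(1, 1)]
```

Only the occurrence at position 1 of document 1 lies inside a title, so a ranker could give that document a title-match boost.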

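5.3 Inverted Indexes • Other issues • (1) Scores: the document score computation that would otherwise happen at query time (how strongly the document relates to the term) is done in advance • Advantage: when scoring is expensive, the total cost of query processing goes down • Disadvantage: scores are computed when the index is built, so even if the collection changes, the outdated values stay in place until the index is updated • (2) Ordering: postings are sorted by score instead of by document number • e.g. fish: 1:2.2 3:3.6 becomes fish: 3:3.6 1:2.2 • Useful for picking the top K documents for a single-word query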
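The benefit of score ordering for single-word queries can be seen in a small sketch. The function names are hypothetical and the postings are (doc, precomputed score) pairs.

```python
import heapq

def top_k_doc_ordered(postings, k):
    """List ordered by document number: every posting has to be inspected."""
    return heapq.nlargest(k, postings, key=lambda p: p[1])

def top_k_score_ordered(postings, k):
    """List ordered by descending score: the first k postings are the answer."""
    return postings[:k]

doc_ordered = [(1, 2.2), (3, 3.6)]
score_ordered = sorted(doc_ordered, key=lambda p: p[1], reverse=True)

print(top_k_doc_ordered(doc_ordered, 1))      # [(3, 3.6)]
print(top_k_score_ordered(score_ordered, 1))  # [(3, 3.6)]
```

Both calls return the same document, but the score-ordered list only reads a prefix of the inverted list.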

5.4 Compression • concept • bit-aligned • byte-aligned • compression in practice

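5.4 Compression • Concept • Reduce the size of the inverted lists • Throughput without compression • The processor can handle p postings/sec; memory can supply m postings/sec • The number of postings actually processed per second is min(p, m) • m > p: the processor is almost never idle • m < p: the processor has to wait for postings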

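5.4 Compression • Concept • Reduce the size of the inverted lists • Throughput with compression (decoding) • With compression ratio r (e.g. r = 2 means two postings fit in the space of one), memory supplies m*r postings per second • The processor decodes and handles d*p postings per second (d < 1) • The number processed per second is min(mr, dp) • A good compression algorithm raises r a lot and lowers d only a little • The code must be unambiguous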
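The throughput model above fits in one line of code. The rates and factors below are made-up numbers chosen only to show the trade-off between r and d.

```python
def postings_per_second(m, p, r=1.0, d=1.0):
    """Throughput = min(m * r, d * p).

    m: postings/sec memory can supply (uncompressed)
    p: postings/sec the processor can handle (uncompressed)
    r: compression ratio, d: decompression factor (d < 1 when decoding costs time)
    """
    return min(m * r, d * p)

# Hypothetical system where memory is the bottleneck (m < p).
m, p = 100, 400
print(postings_per_second(m, p))                # 100.0  no compression
print(postings_per_second(m, p, r=3.0, d=0.9))  # 300.0  cheap, moderate compression
print(postings_per_second(m, p, r=8.0, d=0.5))  # 200.0  heavy compression, slow decode
```

In this example the heavier scheme saves more space but ends up slower, because the processor becomes the bottleneck once d*p drops below m*r.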

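5.4 Compression • Concept • Example: delta encoding • Uses d-gaps • If a word appears in documents 1, 5, 9, 18, 23, 24, 30, the list is encoded as the differences between adjacent document numbers: 1, 4, 4, 9, 5, 1, 6 • (d1, d2, d3) => (d1, d2-d1, d3-d2) • The benefit shows when document numbers are large: (24, 30) => (6) • In practice most words occur in only a few documents, so their lists hold large document numbers, and the d-gaps are smaller than the numbers they replace: (109, 3875, 4328, 6195, 7187, …) => (109, 3766, 453, 1867, 992, …) • [The saving comes from the number of digits needed to store each value in binary]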
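Delta encoding and its inverse are short enough to sketch directly. The function names are illustrative; the input is assumed to be a list of document numbers in increasing order.

```python
from itertools import accumulate

def to_dgaps(doc_nos):
    """Replace each document number with its difference from the previous one."""
    return [cur - prev for prev, cur in zip([0] + doc_nos[:-1], doc_nos)]

def from_dgaps(gaps):
    """Running sum of the gaps restores the original document numbers."""
    return list(accumulate(gaps))

doc_nos = [1, 5, 9, 18, 23, 24, 30]
gaps = to_dgaps(doc_nos)
print(gaps)              # [1, 4, 4, 9, 5, 1, 6]
print(from_dgaps(gaps))  # [1, 5, 9, 18, 23, 24, 30]
```

Delta encoding does not shrink anything by itself. It only makes the numbers smaller so that a variable-length code can store them in fewer bits.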

5.4 Compression • Bit-Aligned Codes • Simple code: unary code • Encode the number k as k 1s followed by a single 0 (the 0 at the end makes the code unambiguous)

5.4 Compression • Bit-Aligned Codes • Simple code: unary code • Unary code is efficient for small numbers but quickly becomes expensive • 1023 takes 10 bits in binary but 1024 bits in unary • Binary is efficient for large numbers, but the boundaries between entries are ambiguous • Suppose we encode 0 1 0 2 0 3 0 and decide to write 00 with just 1 bit • 00 01 00 10 00 11 00 • becomes 0 01 0 10 0 11 0 • i.e. 0010100110 • When decoding, this can also be read as 0 01 01 0 0 11 0 • The decoded values are then 0 1 1 0 0 3 0 (original: 0 1 0 2 0 3 0)
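Unary coding can be sketched with bit strings (Python `str` values of "0" and "1", used here for readability rather than real bit packing). The function names are illustrative.

```python
def unary_encode(k):
    """k ones followed by a terminating zero."""
    return "1" * k + "0"

def unary_decode(bits):
    """Split a concatenation of unary codes; each run of ones before a zero is a value."""
    return [len(run) for run in bits.split("0")[:-1]]

stream = "".join(unary_encode(k) for k in [0, 1, 3])
print(stream)                    # 0101110
print(unary_decode(stream))      # [0, 1, 3]
print(len(unary_encode(1023)))   # 1024
```

Because every code ends in exactly one 0, the decoder always knows where a value stops, which is the property the shortened binary example lacks.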

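5.4 Compression • Bit-Aligned Codes • Elias-γ code • To encode k, split it into two parts, kd and kr, and write them as "unary binary": kd in unary, then kr in binary • kd = floor(log2 k) is the number of binary bits that follow the unary part; kr = k - 2^kd • Binary alone does not show where a value ends, so the bit count is given up front • Weak for large numbers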
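A minimal Elias-γ sketch using bit strings follows. It assumes k >= 1 (the code has no representation for 0), kd = floor(log2 k) and kr = k - 2^kd; the function names are illustrative.

```python
def elias_gamma_encode(k):
    """Unary(kd) followed by kr written in exactly kd binary digits."""
    if k < 1:
        raise ValueError("Elias-gamma encodes positive integers only")
    kd = k.bit_length() - 1          # floor(log2 k)
    kr = k - (1 << kd)               # remainder below the leading power of two
    binary = format(kr, "b").zfill(kd) if kd else ""
    return "1" * kd + "0" + binary

def elias_gamma_decode(bits):
    """Decode a concatenation of Elias-gamma codes into a list of integers."""
    values, i = [], 0
    while i < len(bits):
        kd = 0
        while bits[i] == "1":        # unary part: count the ones
            kd += 1
            i += 1
        i += 1                       # skip the terminating zero
        kr = int(bits[i:i + kd], 2) if kd else 0
        i += kd
        values.append((1 << kd) + kr)
    return values

print(elias_gamma_encode(1))   # 0
print(elias_gamma_encode(6))   # 11010
print(elias_gamma_encode(15))  # 1110111
stream = "".join(elias_gamma_encode(k) for k in [1, 2, 6])
print(elias_gamma_decode(stream))  # [1, 2, 6]
```

Each value costs 2*kd + 1 bits, roughly twice its binary length, which is why the code becomes less attractive as numbers grow.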

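5.4 Compression • Byte-Aligned Codes • Processors are built to handle whole bytes efficiently, so these codes are fast in practice • V-byte code • Of the 8 bits in each byte, the low 7 bits hold numeric data and the highest bit (0 or 1) marks whether this is the last byte of the number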
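A v-byte sketch follows. It assumes the high bit is set to 1 on the last byte of each number and that the most significant 7-bit group comes first; other variants of the scheme flip one or both conventions. The function names are illustrative.

```python
def vbyte_encode(n):
    """Encode one non-negative integer as 7-bit groups, high bit set on the final byte."""
    if n < 0:
        raise ValueError("v-byte encodes non-negative integers only")
    groups = [n & 0x7F]
    n >>= 7
    while n:
        groups.append(n & 0x7F)
        n >>= 7
    groups.reverse()        # most significant group first
    groups[-1] |= 0x80      # flag the last byte
    return bytes(groups)

def vbyte_decode(data):
    """Decode a concatenation of v-byte codes into a list of integers."""
    values, current = [], 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if byte & 0x80:     # last byte of this number
            values.append(current)
            current = 0
    return values

print(vbyte_encode(1).hex())    # 81
print(vbyte_encode(127).hex())  # ff
print(vbyte_encode(128).hex())  # 0180
print(vbyte_encode(180).hex())  # 01b4
print(vbyte_decode(bytes.fromhex("81860 1b4".replace(" ", ""))))  # [1, 6, 180]
```

Numbers below 128 cost one byte each, so v-byte pairs well with d-gaps, which are usually small.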

5.4 Compression • Compression in practice (Galago) • Uses word position information • (document, count, [positions]) • (1,2,[1,7]) (2,3,[6,17,197]) • (1) document-gap • (1,2,[1,7]) (2,3,[6,17,197]) • => (1,2,[1,7]) (1,3,[6,17,197]) • (2) position-gap • (1,2,[1,7]) (1,3,[6,17,197]) • => (1,2,[1,6]) (1,3,[6,11,180]) => 1,2,1,6,1,3,6,11,180 • (3) v-byte • 81 82 81 86 81 83 86 8B 01 B4
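Steps (1) and (2) above can be sketched as a single pass that applies both gap transforms and flattens the result into the integer sequence that is then v-byte encoded. This is an illustration of the steps on the slide, not Galago's actual code; the function name and input shape are assumptions.

```python
def flatten_with_gaps(postings):
    """postings: [(doc, [positions])] sorted by doc, positions ascending.

    Emits doc-gap, count, then position-gaps for each posting.
    """
    out, prev_doc = [], 0
    for doc, positions in postings:
        out.append(doc - prev_doc)    # (1) document gap
        out.append(len(positions))    # count
        prev_pos = 0
        for pos in positions:
            out.append(pos - prev_pos)  # (2) position gap
            prev_pos = pos
        prev_doc = doc
    return out

postings = [(1, [1, 7]), (2, [6, 17, 197])]
print(flatten_with_gaps(postings))  # [1, 2, 1, 6, 1, 3, 6, 11, 180]
```

The count is stored explicitly so that a decoder knows how many position gaps to read before the next document gap begins.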
