3. Weighting and Matching against Indices
2007.1.20. Artificial Intelligence Lab, 송승미
Text: Finding Out About, pages 60-104
Microscopic Semantics and the Statistics of Communication
• Table 3.1: English letter frequency
• Character frequencies are useful for solving simple ciphers and crosswords.
UZQSOVUOHXMOPVGPOZPEVSGZWSZOPF PESXUDBMETSXAIZVUEPHZHMDZSHZOWS FPAPPDTSVPQUZWTMXUZUHSXEPTEPOPD ZSZUFPOMBZWPFUPZHMDJUDTMOHMQ
Frequency count of the ciphertext: P occurs 16 times and Z occurs 14 times, so they most likely stand for E and T.
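As a rough illustration (not part of the original slides), the frequency count above can be reproduced with a short Python sketch; the ciphertext string is copied from the slide.

```python
from collections import Counter

# Ciphertext from the slide (a simple substitution-cipher example).
ciphertext = (
    "UZQSOVUOHXMOPVGPOZPEVSGZWSZOPF"
    "PESXUDBMETSXAIZVUEPHZHMDZSHZOWS"
    "FPAPPDTSVPQUZWTMXUZUHSXEPTEPOPD"
    "ZSZUFPOMBZWPFUPZHMDJUDTMOHMQ"
)

# Count how often each cipher letter appears.
counts = Counter(ciphertext)

# The most frequent cipher letters are likely to stand for the most
# frequent English letters (E, T, ...), per Table 3.1.
for letter, n in counts.most_common(5):
    print(letter, n)
```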
In this Chapter…
• What are we counting?
• What does the distribution of frequency occurrences across this level of features tell us about the pattern of their use?
• What can we tell about the meaning of these features, based on such statistics?
• How can we find meaning in text?
• How are such attempts to be distinguished?
Remember Zipf
• The linguist George Kingsley Zipf counted every word appearing in English-language books and tabulated their frequencies.
• The words Americans use most often: the (1000) → of (500) → and (250) → to (125).
• Only a handful of words are used very frequently; most other words occur at similarly low rates.
F(w): the number of times word w occurs anywhere in the corpus
• Sort the vocabulary by frequency and assign ranks.
Ex) r = 1 → the most frequently occurring word; r = 2 → the next most frequently used word
Zipf's law
• Empirical observation
• F(r): frequency of the rank-r word
• F(r) = C / r^α, with α ≈ 1 and C ≈ 0.1
(A mathematical derivation of Zipf's law appears in Chapter 5.)
Zipfian Distribution of AIT Words
• Word frequency as a function of its frequency rank
• Log/log plot
• Nearly linear
• Negative slope
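A minimal sketch of this rank/frequency analysis, assuming a corpus is available as one plain-text string (the AIT corpus itself is not included here); `zipf_check` is a hypothetical helper name, and C ≈ 0.1 with α ≈ 1 follows the formula on the previous slide.

```python
from collections import Counter

def zipf_check(text, top=10):
    """Rank words by frequency and compare with the Zipf prediction F(r) = C / r."""
    words = text.lower().split()
    counts = Counter(words)
    total = sum(counts.values())

    # Sort by descending frequency; rank 1 is the most frequent word.
    ranked = counts.most_common()
    for rank, (word, freq) in enumerate(ranked[:top], start=1):
        observed = freq / total      # observed relative frequency F(r)
        predicted = 0.1 / rank       # Zipf's law with C ≈ 0.1, α ≈ 1
        print(f"{rank:>3}  {word:<12} observed={observed:.4f}  zipf={predicted:.4f}")
```

Plotting rank against frequency on log/log axes for a large corpus gives the nearly linear, negative-slope curve described above.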
Principle of Least Effort
• Words as tools
• Unification: authors would like to always use a single word
• Diversification: readers would like a unique word for each purpose
• Vocabulary balance: use existing words and avoid coining new ones
WWW surfing behavior • A recent example of Zipf-like distributions
A Statistical Basis for Keyword Meaning
• Noise words occur very frequently.
• Non-noise words: internal keywords and external keywords.
Word Occurrence as a Poisson Process
• Function words (of, the, but) occur essentially at random throughout arbitrary text, so their counts fit a Poisson model.
• Content words occur in bursts, concentrated in the documents that are actually about them, and so deviate from the Poisson model.
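A sketch of the Poisson comparison, assuming the word's occurrences are scattered independently across documents; the counts used in the example (1000 documents, 500 occurrences) are invented for illustration.

```python
import math

def poisson_doc_freq(total_occurrences, n_docs):
    """Expected number of documents containing a word if its occurrences
    were scattered at random (Poisson) across the corpus."""
    lam = total_occurrences / n_docs       # mean occurrences per document
    p_at_least_one = 1 - math.exp(-lam)    # Poisson P(count >= 1)
    return n_docs * p_at_least_one

# Hypothetical counts: a function word's observed document frequency sits
# close to this prediction, while a bursty content word appears in far
# fewer documents than the Poisson model expects.
n_docs = 1000
print(poisson_doc_freq(500, n_docs))
```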
Resolving Power (1/2)
• Repetition as an indication of emphasis
• Resolving power = the ability of a word to discriminate content
• Maximal at middle ranks
• Thresholds filter out the others: high-frequency noise words and low-frequency rare words
Resolving Power (2/2)
• Rare words that occur only a handful of times do not help discriminate documents in general.
• Words that appear in too many documents carry little value for distinguishing and representing them.
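A minimal sketch of the two-sided frequency filter described above; the cut-off values `low_cutoff` and `high_cutoff` are arbitrary placeholders, not thresholds from the text.

```python
from collections import Counter

def filter_by_resolving_power(tokens, low_cutoff=2, high_cutoff=0.05):
    """Keep middle-frequency words: drop rare words (fewer than low_cutoff
    occurrences) and noise words (more than high_cutoff of all tokens)."""
    counts = Counter(tokens)
    total = len(tokens)
    keep = {
        word for word, n in counts.items()
        if n >= low_cutoff and n / total <= high_cutoff
    }
    return keep
```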
Language Distribution
• Exhaustivity: number of topics indexed
• Specificity: ability to describe the FOA information need precisely
• Index: a balance between user and corpus
• Not too exhaustive, not too specific
Exhaustivity ≈ N(terms) assigned to a document
• Exhaustive ▷ high recall, low precision
• Document-oriented "representation" bias
Specificity ≈ 1 / N(documents) assigned the same term
• Specific ▷ low recall, high precision
• Query-oriented "discrimination" bias
Weighting the Index Relation
• Weight: the strength of the association between a keyword and a document, expressed as a single real number.
Informative Signals vs. Noise Words
• The least informative words (noise words) occur uniformly across the corpus. Ex) the
• Informative signals provide the measure used to weight the keyword-document relation.
Hypothetical Word Distributions: a word concentrated in a few documents (a pattern that rarely happens by chance) contrasted with a word spread in a uniform distribution across the corpus.
Inverse Document Frequency
• Up to this point we have counted word occurrences; what we would really like to know is the number of documents containing a keyword ▷ IDF
• IDF (Inverse Document Frequency): the inverse of the number of documents in the corpus in which keyword k appears
• Comparison in terms of documents, not just word occurrences
• High IDF: few documents contain keyword k; low IDF: many documents contain it
Vector Space
• A vector tells us how far, and in which direction, a point lies from some reference position.
• The vector space model was introduced to put the question of how to compute similarity between documents on a more mathematical footing.
• Consider a simple two-dimensional plane based on just two index terms, 정보 (information) and 검색 (retrieval), as coordinate axes.
• Document 1: D1(0.8, 0.3)
• Document 2: D2(0.2, 0.7)
• Query "정보검색" (information retrieval): Q(0.4, 0.8)
• Which document is closer to the query, Document 1 or Document 2?
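To work the two-term example numerically, here is a minimal sketch using cosine similarity, one standard vector-space measure of closeness; the coordinates are taken from the slide above.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Coordinates from the two-term (정보, 검색) example.
d1 = (0.8, 0.3)
d2 = (0.2, 0.7)
q  = (0.4, 0.8)

print("sim(Q, D1) =", round(cosine(q, d1), 3))
print("sim(Q, D2) =", round(cosine(q, d2), 3))
# In this example D2 points in nearly the same direction as the query,
# so its cosine similarity is higher.
```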
• With three keywords (a three-dimensional term space), the document closest to the query is D1.
Calculating TF-IDF Weighting
• TF: term frequency
• IDF: inverse document frequency
• idf_k = log(N_doc / D_k)
• w_kd = f_kd * idf_k
• f_kd: the frequency with which keyword k occurs in document d
• N_doc: the total number of documents in the corpus
• D_k: the number of documents containing keyword k
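A direct sketch of the formula above; the example counts are invented, and the logarithm base is left as the natural log since the slide's "log" does not specify one (base 2 or 10 is also common).

```python
import math

def tf_idf(f_kd, d_k, n_doc):
    """w_kd = f_kd * log(N_doc / D_k), following the formula above."""
    return f_kd * math.log(n_doc / d_k)

# Hypothetical corpus: 1000 documents, keyword appearing in 10 of them,
# occurring 3 times in the document being weighted.
print(tf_idf(f_kd=3, d_k=10, n_doc=1000))
```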
Variants of this weighting, such as inverse squared and probabilistic frequency formulations, are also used.