3. Weighting and Matching against Indices. 2007.1.20. Artificial Intelligence Lab, Song Seung-mi. Text: Finding Out About, pages 60-104.
Microscopic Semantics and the Statistics of Communication • Table 3.1 English Letter Frequency • Character frequencies are good for simple ciphers and crosswords. UZQSOVUOHXMOPVGPOZPEVSGZWSZOPF PESXUDBMETSXAIZVUEPHZHMDZSHZOWS FPAPPDTSVPQUZWTMXUZUHSXEPTEPOPD ZSZUFPOMBZWPFUPZHMDJUDTMOHMQ • Frequency count result: P: 16, Z: 14, so these two letters most likely stand for E and T.
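The frequency count on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the text: the function name `letter_frequencies` and the short `sample` string are hypothetical, and the sample is not the ciphertext shown above.

```python
from collections import Counter


def letter_frequencies(ciphertext: str) -> list[tuple[str, int]]:
    """Return (letter, count) pairs, most frequent first."""
    # Ignore spaces and punctuation; treat upper and lower case alike.
    letters = [ch for ch in ciphertext.upper() if ch.isalpha()]
    return Counter(letters).most_common()


# Hypothetical toy input, chosen only to show the shape of the result.
sample = "ZPPZQPZP"
ranked = letter_frequencies(sample)
print(ranked)
```

For a simple substitution cipher, the top-ranked ciphertext letters are the first candidates for the most common English letters such as E and T.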
In this Chapter… • What are we counting? • What does the distribution of frequency occurrences across this level of features tell us about the pattern of their use? • What can we tell about the meaning of these features, based on such statistics? • How can we find meaning in text? • How are such attempts to be distinguished?
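Remember Zipf • Linguist George Kingsley Zipf • Counted every word appearing in English-language books and tabulated the frequencies • The words Americans use most often: the(1000) → of(500) → and(250) → to(125) • Only a small number of words are used frequently; most other words are used a similarly small number of times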
F(w) : the number of times word w occurs anywhere in the corpus • Sort the vocabulary according to frequency • Ex) r = 1 → the most frequently occurring word, r = 2 → the next most frequently used word
Zipf’s law • Empirical observation • F(r) : frequency of the word at rank r • $F(r) = C / r^{\alpha}$, with $\alpha \approx 1$ and $C \approx 0.1$ ( Mathematical derivation of Zipf’s law – chapter 5 )
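As a small numeric illustration of the formula, the sketch below evaluates F(r) = C / r^α for the first few ranks. The function name `zipf_frequency` is hypothetical; the defaults C = 0.1 and α = 1 are the approximate values quoted on the slide, and the result is read here as a relative frequency, which is an assumption.

```python
def zipf_frequency(rank: int, c: float = 0.1, alpha: float = 1.0) -> float:
    """Predicted frequency of the word at a given rank under Zipf's law."""
    if rank < 1:
        raise ValueError("rank starts at 1 for the most frequent word")
    return c / rank ** alpha


# Predicted frequencies for the five most frequent words.
for r in range(1, 6):
    print(r, zipf_frequency(r))
```

With α = 1 the predicted frequency is inversely proportional to rank, so the rank-2 word is expected about half as often as the rank-1 word and the rank-10 word about a tenth as often.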
Zipfian Distribution of AIT Words • Word frequency as a function of its frequency rank • Log/log plot • Nearly linear • Negative slope
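The "nearly linear, negative slope" observation can be checked by fitting a straight line to log(frequency) against log(rank). The sketch below uses an ordinary least-squares fit in plain Python; `loglog_slope` is a hypothetical helper, and the `ideal` data are synthetic counts generated from an exact 1/r law, not the AIT corpus.

```python
import math


def loglog_slope(frequencies: list[float]) -> float:
    """Least-squares slope of log(frequency) against log(rank)."""
    ranked = sorted(frequencies, reverse=True)  # rank 1 = most frequent
    if len(ranked) < 2 or ranked[-1] <= 0:
        raise ValueError("need at least two strictly positive frequencies")
    xs = [math.log(r) for r in range(1, len(ranked) + 1)]
    ys = [math.log(f) for f in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic counts that follow F(r) = 1000 / r exactly.
ideal = [1000 / r for r in range(1, 51)]
slope = loglog_slope(ideal)
print(slope)
```

On data that follow Zipf's law, the fitted slope estimates the negative of the exponent α, so a slope close to -1 corresponds to α ≈ 1.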
Principle of Least Effort • Words as tools • Unification • Authors would like to use a single word, always • Diversification • Readers would like a unique word for each purpose • Vocabulary balance • Use existing words and avoid coining new ones
WWW surfing behavior • A recent example of Zipf-like distributions
A Statistical Basis for Keyword Meaning • Noise words: occur very frequently • Non-noise words • Internal keywords • External keywords
Word occurrence as a Poisson Process • Function words: of, the, but • Occur randomly throughout arbitrary text • Content words
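If a function word really does occur at random, the number of times it appears in a stretch of text can be modelled with the Poisson distribution, P(k) = e^(-λ) λ^k / k!. The sketch below only evaluates that standard formula; the helper name `poisson_pmf` and the rate of 2.0 expected occurrences per passage are assumptions made for illustration.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k occurrences when lam are expected on average."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Assumed rate: a function word expected 2.0 times per passage.
rate = 2.0
for k in range(5):
    print(k, poisson_pmf(k, rate))
```

A word whose observed counts depart strongly from this prediction is not occurring at random, which is one way to tell content words apart from function words.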
Resolving Power(1/2) • Repetition as an indication of emphasis • Resolving power = Ability of words to discriminate content • Maximal at middle rank • Thresholds to filter others • High frequency noise words • Low frequency, rare words
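The two thresholds can be sketched as a frequency band filter: words above the upper cutoff are treated as noise, words below the lower cutoff as too rare, and the middle band is kept. The function name `mid_frequency_terms` and the cutoff values `low` and `high` are hypothetical; the slide does not give concrete thresholds.

```python
from collections import Counter


def mid_frequency_terms(tokens: list[str], low: int = 2, high: int = 5) -> set[str]:
    """Keep words whose corpus frequency lies between the two thresholds."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if low <= count <= high}


# Toy token stream: one noise word, one mid-frequency word, one rare word.
tokens = ["the"] * 10 + ["index"] * 3 + ["zipf"]
kept = mid_frequency_terms(tokens)
print(kept)
```

In practice the cutoffs would be tuned to the corpus, since the useful middle band depends on corpus size and vocabulary.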
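Resolving Power(2/2) • Rare words that are used very infrequently: they do not help distinguish documents in general. • Words that appear too often in documents: they carry little value for distinguishing or representing documents.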
Language Distribution • Exhaustivity : Number of topics indexed • Specificity : Ability to describe FOA information need precisely • Index : A balance between user and corpus • Not too exhaustive, not too specific
Exhaustivity ≈ N(terms) assigned to a document • Exhaustive ▷ high recall, low precision • Document-oriented “representation” bias • Specificity ≈ 1 / N(documents) assigned the same term • Specific ▷ low recall, high precision • Query-oriented “discrimination” bias
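Both quantities can be read directly off an index that maps each document to its assigned terms. The sketch below is one possible reading of the two proportionalities above; the function names and the toy `index` are hypothetical.

```python
def exhaustivity(index: dict[str, set[str]], doc: str) -> int:
    """Number of terms assigned to a document."""
    return len(index[doc])


def specificity(index: dict[str, set[str]], term: str) -> float:
    """Inverse of the number of documents assigned the same term."""
    n_docs = sum(1 for terms in index.values() if term in terms)
    if n_docs == 0:
        raise ValueError(f"term {term!r} is not assigned to any document")
    return 1 / n_docs


# Toy index: document id -> set of assigned terms.
index = {
    "d1": {"zipf", "law"},
    "d2": {"law"},
    "d3": {"law", "index", "weight"},
}
print(exhaustivity(index, "d3"))
print(specificity(index, "law"), specificity(index, "zipf"))
```

A term assigned to every document has the lowest specificity, while a term assigned to a single document has the highest.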
Weighting the Index Relation • Weight – strength of association with a single real number • The strength of the relationship between keyword and document.
Informative Signals vs. Noise words • The least informative words (noise words) • Occur uniformly across the corpus • Ex) the • Informative signals • Used to weight the keyword-document relation
Hypothetical Word Distributions • A uniform distribution rarely happens
Inverse Document Frequency • Up to this point. • What we would really like to know is the number of documents containing a keyword ▷ IDF • IDF (Inverse Document Frequency) • The inverse of the number of documents, out of the whole corpus, in which keyword k appears • Comparison in terms of documents, not just word occurrences • High IDF: few documents contain keyword k. Low IDF: many documents contain it.
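Vector Space • With vectors we can measure how far, and in which direction, a point lies from a reference position • The model was introduced as a more mathematical approach to the question of how to compute similarity between documents.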
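[Figure: axes "information" and "retrieval", showing Document 1, Document 2, and the query] • Consider simply a 2-dimensional plane based on two index terms. • (information, retrieval) coordinate system • Document 1 : D1(0.8, 0.3) • Document 2 : D2(0.2, 0.7) • Query "information retrieval" : Q(0.4, 0.8) • Which is closer to the query, Document 1 or Document 2?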
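The question on this slide can be worked through numerically. The slide does not say which closeness measure it intends, so the sketch below assumes two common choices, cosine similarity and Euclidean distance; the function names are hypothetical and the coordinates are the ones given above.

```python
import math


def cosine_similarity(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    """Straight-line distance between two points."""
    return math.dist(a, b)


# Coordinates from the slide, in the (information, retrieval) plane.
d1 = (0.8, 0.3)
d2 = (0.2, 0.7)
q = (0.4, 0.8)

print("cosine:", cosine_similarity(d1, q), cosine_similarity(d2, q))
print("distance:", euclidean_distance(d1, q), euclidean_distance(d2, q))
```

Under both measures D2 comes out closer to Q: its cosine similarity is about 0.98 against about 0.73 for D1, and its Euclidean distance is about 0.22 against about 0.64.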
Three keywords • The document closest to the query is D1
Calculating TF-IDF Weighting • TF – Term frequency • IDF – Inverse document frequency • $idf_k = \log(N_{doc} / D_k)$ • $w_{kd} = f_{kd} \cdot idf_k$ • $f_{kd}$ : the frequency with which keyword k occurs in document d • $N_{doc}$ : the total number of documents in the corpus • $D_k$ : the number of documents containing keyword k
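The two formulas above can be sketched directly in Python. This is a minimal illustration under stated assumptions: the slide does not give the base of the logarithm, so the natural log is used here, and the function name `tf_idf` and the toy `corpus` are hypothetical.

```python
import math
from collections import Counter


def tf_idf(corpus: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Compute w_kd = f_kd * log(N_doc / D_k) for every keyword in every document."""
    n_doc = len(corpus)

    # D_k: number of documents containing keyword k (each document counts once).
    doc_freq = Counter()
    for tokens in corpus.values():
        doc_freq.update(set(tokens))

    weights = {}
    for doc_id, tokens in corpus.items():
        term_freq = Counter(tokens)  # f_kd: occurrences of k in document d
        weights[doc_id] = {
            k: f * math.log(n_doc / doc_freq[k]) for k, f in term_freq.items()
        }
    return weights


# Toy corpus: document id -> list of keywords.
corpus = {
    "d1": ["zipf", "law", "law"],
    "d2": ["law", "index"],
}
weights = tf_idf(corpus)
print(weights)
```

A keyword that appears in every document, such as "law" in this toy corpus, gets idf = log(1) = 0 and therefore zero weight however often it occurs, which matches the idea that uniformly distributed words carry no discriminating information.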
inverse squared probabilistic frequency