
3. Weighting and Matching against Indices


Presentation Transcript


  1. 3. Weighting and Matching against Indices 2007.1.20. Artificial Intelligence Lab, 송승미 Text: Finding Out About, pp. 60–104

  2. Microscopic Semantics and the Statistics of Communication • Table 3.1 English Letter Frequency • Character frequencies are good for simple ciphers and crosswords. UZQSOVUOHXMOPVGPOZPEVSGZWSZOPF PESXUDBMETSXAIZVUEPHZHMDZSHZOWS FPAPPDTSVPQUZWTMXUZUHSXEPTEPOPD ZSZUFPOMBZWPFUPZHMDJUDTMOHMQ • Frequency count result: P occurs 16 times and Z occurs 14 times, so they are the most likely candidates for E and T.
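A minimal sketch of the letter-frequency count this slide describes, in Python; the ciphertext string is copied from the slide, and the variable names are illustrative.

```python
from collections import Counter

# Ciphertext from the slide, concatenated into a single string.
ciphertext = (
    "UZQSOVUOHXMOPVGPOZPEVSGZWSZOPF"
    "PESXUDBMETSXAIZVUEPHZHMDZSHZOWS"
    "FPAPPDTSVPQUZWTMXUZUHSXEPTEPOPD"
    "ZSZUFPOMBZWPFUPZHMDJUDTMOHMQ"
)

# Count each ciphertext letter and print the most common ones.
# The slide reports P and Z as the two most frequent, so by the English
# letter frequencies of Table 3.1 they are good candidates for E and T.
for letter, count in Counter(ciphertext).most_common(5):
    print(letter, count)
```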

  3. In this Chapter… • What are we counting? • What does the distribution of frequency occurrences across this level of features tell us about the pattern of their use? • What can we tell about the meaning of these features, based on such statistics? • How can we find meaning in text? • How are such attempts to be distinguished?

  4. Remember Zipf • Linguist George Kingsley Zipf • Counted every word occurring in English-language books to measure their frequencies • The words used most often in American English: the (1000) → of (500) → and (250) → to (125) • Only a handful of words are used very frequently; most other words occur a similarly small number of times

  5. F(w): the number of times word w occurs anywhere in the corpus • Sort the vocabulary according to frequency Ex) r = 1 → the most frequently occurring word, r = 2 → the next most frequently used word

  6. Zipf’s law • Empirical observation • F(r): frequency of the rank-r word • F(r) = C / r^α, with α ≈ 1 and C ≈ 0.1 (mathematical derivation of Zipf’s law – chapter 5)
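As a rough illustration of F(r) = C / r^α, here is a short Python sketch (the helper name `zipf_table` and the toy corpus are made up for the example) that ranks words by frequency and prints the observed relative frequency next to the Zipf prediction with α = 1 and C = 0.1.

```python
from collections import Counter

def zipf_table(tokens, top=5):
    """Rank words by frequency and compare F(r) with the Zipf prediction C / r."""
    n = len(tokens)
    for r, (word, freq) in enumerate(Counter(tokens).most_common(top), start=1):
        observed = freq / n        # observed relative frequency of the rank-r word
        predicted = 0.1 / r        # Zipf's law with alpha = 1 and C = 0.1
        print(f"{r:3d}  {word:10s}  observed={observed:.3f}  zipf={predicted:.3f}")

# Toy corpus; in practice the tokens would come from a real collection such as AIT.
zipf_table("the cat sat on the mat and the dog sat on the rug".split())
```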

  7. Zipfian Distribution of AIT Words • Word frequency as a function of its frequency rank • Log/log plot • Nearly linear • Negative slope

  8. Principle of Least Effort • Words as tools • Unification: authors would like to always use a single word • Diversification: readers would like a unique word for each purpose • Vocabulary balance: use existing words and avoid coining new ones

  9. WWW surfing behavior • A recent example of Zipf-like distributions

  10. A Statistical Basis for Keyword Meaning • Noise words occur very frequently • Non-noise words divide into internal keywords and external keywords

  11. Word Occurrence as a Poisson Process • Function words: of, the, but • Occur randomly throughout arbitrary text • Content words, by contrast, cluster around the passages that discuss their topics
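A hedged sketch of the Poisson view of function-word occurrence: if a word really does occur at random with average rate λ per document, the fraction of documents expected to contain it at least once is 1 − e^(−λ). The function names and the counts below are illustrative, not taken from the text.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def expected_fraction_containing(total_occurrences, n_docs):
    """Fraction of documents expected to contain the word at least once,
    if its occurrences were scattered at random (Poisson) across documents."""
    lam = total_occurrences / n_docs
    return 1.0 - poisson_pmf(0, lam)

# A function word with 5,000 occurrences over 1,000 documents is expected
# in virtually every document; a content word with 50 occurrences in only ~5%.
print(expected_fraction_containing(5000, 1000))   # ~0.993
print(expected_fraction_containing(50, 1000))     # ~0.049
```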

  12. Resolving Power (1/2) • Repetition as an indication of emphasis • Resolving power = the ability of words to discriminate content • Maximal at middle ranks • Thresholds filter out the others: • High-frequency noise words • Low-frequency, rare words

  13. Resolving Power (2/2) • Very rare words, used only a handful of times, do not help in general document discrimination. • Words that appear in too many documents carry little meaning for distinguishing and representing documents.
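The two thresholds of slides 12–13 can be sketched as a simple filter that keeps only mid-frequency words; the cutoff values below are arbitrary placeholders, not values from the text.

```python
from collections import Counter

def mid_frequency_vocabulary(tokens, min_count=2, max_relative_freq=0.05):
    """Keep words of middling rank: drop rare words (little help in general
    discrimination) and very frequent noise words (little resolving power)."""
    n = len(tokens)
    counts = Counter(tokens)
    return {
        word for word, count in counts.items()
        if count >= min_count and count / n <= max_relative_freq
    }

# Usage: vocab = mid_frequency_vocabulary(corpus_tokens)
```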

  14. Language Distribution • Exhaustivity: the number of topics indexed • Specificity: the ability to describe the FOA information need precisely • Index: a balance between user and corpus • Not too exhaustive, not too specific

  15. Exhaustivity ≈ N(terms assigned to a document) • Exhaustive ▷ high recall, low precision • Document-oriented “representation” bias • Specificity ≈ 1 / N(documents assigned the same term) • Specific ▷ low recall, high precision • Query-oriented “discrimination” bias

  16. Specificity/Exhaustivity Trade-Offs

  17. Indexing Graph

  18. Weighting the Index Relation • Weight: the strength of the association between a keyword and a document, expressed as a single real number

  19. Informative Signals vs. Noise Words • The least informative words (noise words) occur uniformly across the corpus • Ex) the • Informative signals provide the measure used to weight the keyword–document relation

  20. Hypothetical Word Distributions • Figure: a word that occurs only rarely vs. a word with a uniform distribution

  21. Inverse Document Frequency • Up to this point we have counted raw word frequencies; what we would really like to know is the number of documents containing a keyword ▷ IDF • IDF (Inverse Document Frequency): the reciprocal of the number of documents, out of the whole corpus, in which keyword k appears • A comparison in terms of documents, not just word occurrences • High IDF: few documents contain keyword k; low IDF: many documents contain it
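A small numeric illustration of the IDF idea on this slide, using the logarithmic form given on slide 25; the corpus and document counts are made up.

```python
import math

def idf(n_docs, docs_containing_k):
    """idf_k = log(N_doc / D_k): large when few documents contain keyword k."""
    return math.log(n_docs / docs_containing_k)

# In a 10,000-document corpus, a keyword found in 10 documents is far more
# discriminating than one found in 5,000.
print(idf(10_000, 10))     # ~6.9  -> high IDF, few documents contain k
print(idf(10_000, 5_000))  # ~0.69 -> low IDF, many documents contain k
```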

  22. Vector Space • A vector lets us measure how far, and in what direction, something lies from a reference position • Introduced to give a more mathematical approach to the question of how to compute similarity between documents

  23. Figure: Document 1, Document 2, and the query in the (information, retrieval) plane • For simplicity, consider a two-dimensional plane based on just two index terms: the (information, retrieval) coordinate system • Document 1: D1(0.8, 0.3) • Document 2: D2(0.2, 0.7) • Query “information retrieval”: Q(0.4, 0.8) • Which is closer to the query, Document 1 or Document 2?
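One common way to answer the slide's question is to compare the angles between the query and document vectors (cosine similarity); the helper below is a sketch using the coordinates on this slide. By this measure D2 comes out closer to Q; the three-keyword example on the next slide uses a different set of vectors.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

d1 = (0.8, 0.3)  # Document 1 in the (information, retrieval) plane
d2 = (0.2, 0.7)  # Document 2
q = (0.4, 0.8)   # query "information retrieval"

print(cosine(d1, q))  # ~0.73
print(cosine(d2, q))  # ~0.98 -> D2 is the closer document for these coordinates
```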

  24. With three keywords • The document closest to the query is D1

  25. Calculating TF-IDF Weighting • TF – term frequency • IDF – inverse document frequency • idf_k = log(N_doc / D_k) • w_kd = f_kd × idf_k • f_kd: the frequency with which keyword k occurs in document d • N_doc: the total number of documents in the corpus • D_k: the number of documents containing keyword k
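A minimal, self-contained sketch of the w_kd = f_kd × idf_k computation from this slide; the helper name and the toy documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf_weights(documents):
    """Return, for each document d, the weights w_kd = f_kd * log(N_doc / D_k)."""
    n_doc = len(documents)
    # D_k: the number of documents containing keyword k.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        f = Counter(doc)  # f_kd: occurrences of keyword k in document d
        weights.append({k: f_kd * math.log(n_doc / doc_freq[k])
                        for k, f_kd in f.items()})
    return weights

docs = [
    "information retrieval finds documents".split(),
    "the index relates keywords to documents".split(),
    "weighting uses term frequency and inverse document frequency".split(),
]
print(tf_idf_weights(docs)[0])
```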

  26. SMART Weighting Specification

  27. Weighting variants: inverse, squared, probabilistic, frequency
