
WBIA Review




Presentation Transcript


  1. WBIA Review http://net.pku.edu.cn/~wbia 黄连恩 hle@net.pku.edu.cn School of Information Engineering, Peking University 12/24/2013

  2. Bow-tie: Strongly Connected Component (SCC) Core; Upstream (IN): the core can’t reach IN; Downstream (OUT): OUT can’t reach the core; Disconnected components; Tendrils & Tubes

  3. Power-law: Nature seems to create bell curves (a range around an average); human activity seems to create power laws (popularity skewing)

  4. Power Law Distribution – Examples. From “Graph structure in the Web” (AltaVista crawl, 1999)

  5. Web Graph. Exercise: how would you store the Web graph?
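
A minimal sketch (not from the slides) of one possible answer: store the Web graph as an adjacency list over integer page IDs; the class name WebGraph and the URL-to-ID mapping are illustrative assumptions.

```python
# Minimal sketch: store the Web graph as an adjacency list keyed by
# integer page IDs, which also makes the out-degrees needed by PageRank
# easy to read off.
from collections import defaultdict

class WebGraph:
    def __init__(self):
        self.ids = {}                        # URL -> integer ID
        self.urls = []                       # integer ID -> URL
        self.out_links = defaultdict(list)   # ID -> list of destination IDs

    def _id(self, url):
        if url not in self.ids:
            self.ids[url] = len(self.urls)
            self.urls.append(url)
        return self.ids[url]

    def add_edge(self, src_url, dst_url):
        self.out_links[self._id(src_url)].append(self._id(dst_url))

    def out_degree(self, url):
        return len(self.out_links[self.ids[url]])

g = WebGraph()
g.add_edge("http://a.example", "http://b.example")
g.add_edge("http://a.example", "http://c.example")
print(g.out_degree("http://a.example"))  # 2
```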

  6. PageRank: why and how does it work?

  7. Random walker model (figure: a page v with in-links from pages u1–u5)

  8. Damping Factor: β is chosen between 0.1 and 0.2 and is called the damping factor (Page & Brin, 1997). G = (1 − β)·L^T + (β/N)·1_N is called the Google Matrix.

  9. Solving a small example: take β = 0.15, so G = 0.85·L^T + (0.15/11)·1_N. Start from P0 = (1/11, 1/11, …)^T and iterate P1 = G·P0, P2 = G·P1, … Power iteration converges (after about 50 iterations) to P = (0.033, 0.384, 0.343, 0.039, 0.081, 0.039, 0.016, …)^T. You can try this in MATLAB.
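
Since the slide's 11-node link matrix is not reproduced in the transcript, here is a hedged power-iteration sketch on a hypothetical 4-page graph with the same β = 0.15; the out_links dictionary is invented for illustration.

```python
# Minimal PageRank power-iteration sketch (not the slide's 11-node graph).
import numpy as np

beta = 0.15
# Hypothetical adjacency: out_links[i] = pages that page i links to.
out_links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
N = 4

# Column-stochastic link matrix L^T: column i spreads page i's rank
# evenly over its out-links.
LT = np.zeros((N, N))
for i, dests in out_links.items():
    for j in dests:
        LT[j, i] = 1.0 / len(dests)

G = (1 - beta) * LT + beta / N * np.ones((N, N))  # Google matrix

p = np.full(N, 1.0 / N)        # P0 = (1/N, ..., 1/N)^T
for _ in range(50):            # power iteration, 50 steps as on the slide
    p = G @ p
print(np.round(p, 3))          # PageRank vector (sums to 1)
```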

  10. Exercise: write pseudocode for the PageRank algorithm.

  11. HITS (Hyperlink Induced Topic Search). A page with high reputation (large in-degree) has high authority; a page that links to many reputable pages (large out-degree) is a strong hub (directory-like). How to compute them? Power iteration on the mutual updates a = E^T·h and h = E·a, where E is the adjacency matrix of the link graph.

  12. Authority and Hub scores. For each page u ∈ V(q), define two scores a[u] and h[u], its authority and hub (directory) scores respectively. The two are defined mutually: the a value of a page u depends on the h values of the pages v that point to it, and the h value of a page u depends on the a values of the pages v that it points to.
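
A minimal HITS power-iteration sketch; the edge list is hypothetical (the slides' query-specific subgraph V(q) is not given), and L2 normalization per step is one common convention.

```python
# Minimal HITS sketch: iterate a = E^T h, h = E a and normalize each step.
import numpy as np

# Hypothetical directed edges (i -> j means page i links to page j).
edges = [(0, 2), (1, 2), (2, 3), (3, 2), (1, 3)]
N = 4
E = np.zeros((N, N))
for i, j in edges:
    E[i, j] = 1.0

a = np.ones(N)   # authority scores
h = np.ones(N)   # hub scores
for _ in range(50):
    a = E.T @ h                # authority: sum of hub scores pointing in
    h = E @ a                  # hub: sum of authority scores pointed to
    a /= np.linalg.norm(a)     # normalize to keep the values bounded
    h /= np.linalg.norm(h)
print(np.round(a, 3), np.round(h, 3))
```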

  13. Web Spam • Term spamming • Manipulating the text of web pages in order to appear relevant to queries • Link spamming • Creating link structures that boost page rank or hubs and authorities scores

  14. TrustRank. Expecting that good pages point to other good pages, all pages reachable from a good seed page in M or fewer steps are marked as good. t = b·L^T·t + (1 − b)·d/|d| (figure: a 7-node example graph with good and bad pages marked)

  15. TrustRank in Action. Select the seed set using inverse PageRank: s = [2, 4, 5, 1, 3, 6, 7]. Invoke the oracle on the top L (= 3) pages. Populate the static score distribution vector d = [0, 1, 0, 1, 0, 0, 0] and normalize it: d = [0, 1/2, 0, 1/2, 0, 0, 0]. Calculate TrustRank scores using biased PageRank with trust dampening and trust splitting: t = b·L^T·t + (1 − b)·d/|d|. Results: [0, 0.18, 0.12, 0.15, 0.13, 0.05, 0.05] (figure: the 7-node example graph annotated with these trust scores)
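
A hedged sketch of the biased-PageRank step used by TrustRank; the 7-node graph's edges are not recoverable from the transcript, so the out_links below are invented, and b = 0.85 is an assumed value.

```python
# Minimal TrustRank sketch: the same power iteration as PageRank, but
# teleporting only to trusted seed pages via the biased vector d.
import numpy as np

b = 0.85
N = 7
# Hypothetical out-links, 0-based page IDs (not the slide's graph).
out_links = {0: [1], 1: [2, 3], 2: [1], 3: [4], 4: [5], 5: [6], 6: [4]}

LT = np.zeros((N, N))
for i, dests in out_links.items():
    for j in dests:
        LT[j, i] = 1.0 / len(dests)

d = np.array([0, 1, 0, 1, 0, 0, 0], dtype=float)  # oracle-labelled good seeds
d /= d.sum()                                       # normalized: [0, 1/2, 0, 1/2, 0, 0, 0]

t = d.copy()
for _ in range(50):
    t = b * (LT @ t) + (1 - b) * d   # t = b * L^T * t + (1 - b) * d / |d|
print(np.round(t, 3))
```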

  16. Tokenization • Friends, Romans, Countrymen, lend me your ears; • Friends | Romans | Countrymen | lend | me | your | ears • Token: an instance of a sequence of characters that are grouped together as a useful semantic unit for processing • Type: the class of all tokens containing the same character sequence • Term: a (normalized) type that is included in the system dictionary

  17. Stemming and lemmatization • Stemming: a crude heuristic process that chops off the ends of words, e.g. democratic → democa • Lemmatization: uses vocabulary and morphological analysis to return the base form of a word (the lemma), e.g. democratic → democracy, sang → sing

  18. Porter stemmer • The most common algorithm for stemming English • 5 phases of word reduction • SSES → SS (caresses → caress) • IES → I (ponies → poni) • SS → SS • S → (drop) (cats → cat) • EMENT → (drop) (replacement → replac, but cement → cement)
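
A small sketch of just the step-1a suffix rules quoted above (not the full five-phase Porter algorithm):

```python
# Minimal sketch of the Porter step-1a suffix rules from the slide.
def porter_step_1a(word):
    if word.endswith("sses"):
        return word[:-2]          # SSES -> SS   (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]          # IES  -> I    (ponies -> poni)
    if word.endswith("ss"):
        return word               # SS   -> SS
    if word.endswith("s"):
        return word[:-1]          # S    ->      (cats -> cat)
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```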

  19. Bag of words model • A document can now be viewed as the collection of terms in it and their associated weight • Mary is smarter than John • John is smarter than Mary • Equivalent in the bag of words model

  20. Term frequency and weighting • A word that appears often in a document is probably very descriptive of what the document is about • Assign to each term in a document a weight that depends on the number of occurrences of that term in the document • Term frequency (tf) • Assign the weight to be equal to the number of occurrences of term t in document d

  21. Inverse document frequency • idf_t = log(N / df_t), where N is the number of documents in the collection (natural log here) • N = 1000; df[the] = 1000; idf[the] = 0 • N = 1000; df[some] = 100; idf[some] = 2.3 • N = 1000; df[car] = 10; idf[car] = 4.6 • N = 1000; df[merger] = 1; idf[merger] = 6.9
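
A few lines that reproduce the idf numbers above, assuming the natural logarithm (which matches 2.3, 4.6 and 6.9):

```python
# Reproduces the idf values on the slide with idf = ln(N / df).
import math

N = 1000
for term, df in [("the", 1000), ("some", 100), ("car", 10), ("merger", 1)]:
    idf = math.log(N / df)
    print(f"idf[{term}] = {idf:.1f}")
```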

  22. tf.idf weighting • Highest when t occurs many times within a small number of documents • Thus lending high discriminating power to those documents • Lower when the term occurs fewer times in a document, or occurs in many documents • Thus offering a less pronounced relevance signal • Lowest when the term occurs in virtually all documents

  23. tf × idf term weights. The tf × idf weight is computed from: term frequency (tf), or wf, some measure of term density in a document; and inverse document frequency (idf), which expresses how important (rare) a term is. The raw value is idf_t = 1/df_t; as with tf, it is usually smoothed, e.g. idf_t = log(N/df_t). For each term in a document, compute its tf.idf weight: w_{t,d} = tf_{t,d} × log(N/df_t)

  24. Document vector space representation • Each document is viewed as a vector with one component corresponding to each term in the dictionary • The value of each component is the tf-idf score for that word • For dictionary terms that do not occur in the document, the weights are 0
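
A minimal sketch of this vector representation, assuming raw tf × log(N/df) weights and a tiny invented collection:

```python
# Minimal sketch of tf-idf document vectors: one component per dictionary
# term, tf * log(N/df) weights, 0 for terms absent from the document.
import math
from collections import Counter

docs = [
    "new york times",
    "new york post",
    "los angeles times",
]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
N = len(docs)
df = {t: sum(t in doc for doc in tokenized) for t in vocab}

def tfidf_vector(doc_tokens):
    tf = Counter(doc_tokens)
    return [tf[t] * math.log(N / df[t]) for t in vocab]

for doc, toks in zip(docs, tokenized):
    print(doc, "->", [round(w, 2) for w in tfidf_vector(toks)])
```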

  25. Documents as vectors. Each document j can be viewed as a vector with one dimension per term, whose value is the tf.idf weight. So we have a vector space: terms are the axes, documents live in this space. It is high-dimensional: even with stemming, there may be 20,000+ dimensions.

  26. Cosine similarity: sim(d1, d2) = (d1 · d2) / (|d1| |d2|)

  27. Cosine similarity. The “closeness” of vectors d1 and d2 can be measured by the angle between them; concretely, the cosine of the angle is used as the vector similarity. Vectors are normalized by length. (figure: document vectors d1 and d2 at angle θ in a space with term axes t1, t2, t3)
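
A short cosine-similarity sketch over two hypothetical term-weight vectors, length-normalizing via the Euclidean norm as described above:

```python
# Minimal cosine similarity between two term-weight vectors.
import math

def cosine(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

d1 = [0.0, 2.3, 1.2, 0.0]   # e.g. tf-idf weights over a 4-term dictionary
d2 = [1.1, 2.3, 0.0, 0.0]
print(round(cosine(d1, d2), 3))
```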

  28. Jaccard coefficient. Resemblance: r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|. It is symmetric and reflexive, but not transitive and not a metric. Note that r(A, A) = 1, but r(A, B) = 1 does not mean A and B are identical! It forgives any number of occurrences and any permutation of the terms. Resemblance distance: d(A, B) = 1 − r(A, B)

  29. Shingling. A contiguous subsequence contained in D is called a shingle. Given a document D, we define its w-shingling S(D, w) as the set of all unique shingles of size w contained in D. D = (a, rose, is, a, rose, is, a, rose); S(D, 4) = {(a, rose, is, a), (rose, is, a, rose), (is, a, rose, is)}. “a rose is a rose is a rose” => a_rose_is_a rose_is_a_rose is_a_rose_is. Why shingling? S(D, 4) vs. S(D, 1). What is a good value for w?

  30. Shingling & Jaccard Coefficient • Doc1 = "to be or not to be, that is a question!" • Doc2 = "to be a question or not" • Let window size w = 2 • Resemblance r(A, B) = ?
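
A hedged sketch of the exercise setup: build the 2-shingle sets and compute the resemblance; the lowercase, punctuation-stripping tokenizer is an assumption.

```python
# Minimal sketch: w-shingle sets plus Jaccard resemblance.
import re

def shingles(text, w):
    tokens = re.findall(r"[a-z]+", text.lower())
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(a, b):
    return len(a & b) / len(a | b)

doc1 = "to be or not to be, that is a question!"
doc2 = "to be a question or not"
s1, s2 = shingles(doc1, 2), shingles(doc2, 2)
print(round(resemblance(s1, s2), 3))
```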

  31. Random permutation. Let Ω be a set (e.g. 1..N). Pick a permutation π: Ω → Ω uniformly at random, e.g. π = {3, 7, 1, 4, 6, 2, 5}. For A = {2, 3, 6}, MIN(π(A)) = ?
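
A tiny sketch of the min-hash question above; interpreting the permutation as π(i) = pi[i−1] is an assumption about the slide's notation.

```python
# Minimal min-hashing sketch: apply the random permutation pi to the set A
# and take the minimum value.
pi = [3, 7, 1, 4, 6, 2, 5]            # permutation of {1..7}: pi(i) = pi[i-1]
A = {2, 3, 6}
min_hash = min(pi[i - 1] for i in A)  # MIN(pi(A))
print(min_hash)                       # pi(2)=7, pi(3)=1, pi(6)=2 -> 1
```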

  32. Inverted index. For each term T, store the list of (IDs of) the documents containing T. Dictionary → postings: 中国 → 2, 4, 8, 16, 32, 64, 128; 文化 → 1, 2, 3, 5, 8, 13, 21, 34; 留学生 → 13, 16. Postings are sorted by docID (more later on why).

  33. Inverted Index • with counts • supports better ranking algorithms
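
A minimal sketch of an inverted index with counts, reusing the example terms from slide 32; the toy documents are invented.

```python
# Minimal inverted index with counts: for each term, postings of
# (docID, term frequency), kept sorted by docID.
from collections import defaultdict, Counter

docs = {
    1: "中国 文化",
    2: "中国 文化 文化",
    3: "留学生 文化",
}

index = defaultdict(list)   # term -> [(docID, tf), ...]
for doc_id in sorted(docs):
    for term, tf in Counter(docs[doc_id].split()).items():
        index[term].append((doc_id, tf))

print(index["文化"])   # [(1, 1), (2, 2), (3, 1)]
```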

  34. Sec. 6.4 VS-based Retrieval. (table of term-weighting scheme combinations; columns headed ‘n’ are acronyms for weight schemes.) Why is the base of the log in idf immaterial?

  35. Sec. 6.4 tf-idf example: lnc.ltc. Document: "car insurance auto insurance"; Query: "best car insurance". (worked lnc.ltc weighting table and document length omitted.) Exercise: what is N, the number of docs? Score = 0 + 0 + 0.27 + 0.53 = 0.8

  36. Singular Value Decomposition • Apply SVD to the term-document matrix: W_{t×d} = T Σ D^T, with T of size t×r, Σ of size r×r, and D^T of size r×d • r is the rank of the matrix • Σ is the diagonal matrix of singular values (in descending order) • D and T have orthonormal (unit-length, mutually orthogonal) columns (T^T T = I, D^T D = I) • the squared singular values are the eigenvalues of W W^T; the columns of D and T are the eigenvectors of W^T W and W W^T, respectively

  37. Latent Semantic Model • The LSI retrieval process: the query is mapped/projected into the LSI D^T space, known as “folding in” • Since W = T Σ D^T, if q projected into the D^T space is q', then q = T Σ q'^T, and hence q' = (Σ^{-1} T^{-1} q)^T = q^T T Σ^{-1} • Folding in therefore means multiplying a document/query vector by T Σ^{-1} • The document vectors of the collection are given by D^T • The two are compared by dot product to compute similarity
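
A hedged LSI sketch with NumPy: SVD of a small invented term-document matrix, rank-k truncation, folding a query in with q' = qᵀ T Σ⁻¹, and dot-product scoring against the document vectors.

```python
# Minimal LSI sketch (illustration only, not the slides' data).
import numpy as np

# Hypothetical term-document count matrix W (terms x documents).
W = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)

T, s, Dt = np.linalg.svd(W, full_matrices=False)        # W = T diag(s) D^T
k = 2
T_k, S_k, D_k = T[:, :k], np.diag(s[:k]), Dt[:k, :].T   # rank-k truncation

q = np.array([1, 0, 1, 0], dtype=float)   # query as a term vector
q_folded = q @ T_k @ np.linalg.inv(S_k)   # fold in: q' = q^T T_k S_k^-1

scores = D_k @ q_folded                   # dot product with document vectors
print(np.round(scores, 3))
```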

  38. Stochastic Language Models. A statistical model for generating text: a probability distribution over strings in a given language. The probability of a string is the product of per-word conditional probabilities given the model M, e.g. P(w1 w2 w3 w4 | M) = P(w1 | M) · P(w2 | M, w1) · P(w3 | M, w1 w2) · P(w4 | M, w1 w2 w3)

  39. Unigram model • likely topics • Bigram model • grammaticality

  40. Bigram Model • Approximate P(unicorn | the mythical) by P(unicorn | mythical) • Markov assumption: the probability of a word depends only on a limited history • Generalization: the probability of a word depends only on the previous n words • trigrams, 4-grams, … • the higher n is, the more data is needed to train • backoff models…

  41. A Simple Example: bigram model • P(I want to eat Chinese food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) P(<end> | food)
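
A small sketch of this bigram decomposition; the conditional probabilities in the table are made up purely for illustration.

```python
# Minimal bigram-model sketch with hypothetical conditional probabilities.
probs = {
    ("<start>", "I"): 0.25, ("I", "want"): 0.32, ("want", "to"): 0.65,
    ("to", "eat"): 0.26, ("eat", "Chinese"): 0.02, ("Chinese", "food"): 0.56,
    ("food", "<end>"): 0.10,
}

sentence = ["<start>", "I", "want", "to", "eat", "Chinese", "food", "<end>"]
p = 1.0
for prev, word in zip(sentence, sentence[1:]):
    p *= probs[(prev, word)]   # P(word | prev)
print(p)
```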

  42. LM-based Retrieval • Ranking formula: P(q | M_d) = ∏_{t ∈ q} P(t | M_d) • Using the maximum-likelihood estimate: P_mle(t | M_d) = tf_{t,d} / L_d, where M_d is the language model of document d, tf_{t,d} is the raw tf of term t in document d, and L_d is the total number of tokens in document d • Unigram assumption: given a particular language model, the query terms occur independently

  43. Laplace smoothing • Also called add-one smoothing • Just add one to all the counts! • Very simple • MLE estimate: P_MLE(w) = c(w) / N • Laplace estimate: P_Laplace(w) = (c(w) + 1) / (N + V), where N is the total number of tokens and V the vocabulary size
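
A short sketch comparing the MLE and Laplace (add-one) estimates on a toy corpus; treating the vocabulary size V as the observed types plus one unseen word is an assumption.

```python
# Minimal add-one (Laplace) smoothing sketch on a toy corpus.
from collections import Counter

tokens = "to be or not to be".split()
counts = Counter(tokens)
N = len(tokens)                 # total tokens
V = len(counts) + 1             # vocabulary size incl. one unseen word (assumption)

def p_mle(w):
    return counts[w] / N

def p_laplace(w):
    return (counts[w] + 1) / (N + V)

for w in ["to", "be", "question"]:   # "question" is unseen
    print(w, round(p_mle(w), 3), round(p_laplace(w), 3))
```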

  44. Mixture model smoothing P(w | d) = λ·P_mle(w | M_d) + (1 − λ)·P_mle(w | M_c) • The parameter λ is important • A high λ makes retrieval “conjunctive-like”, which suits short queries • A low λ is better for long queries • Tune λ to optimize performance, e.g. make it depend on document length (cf. Dirichlet prior or Witten-Bell smoothing)

  45. Example Document collection (2 documents) d1: Xerox reports a profit but revenue is down d2: Lucent narrows quarter loss but revenue decreases further Model: MLE unigram from documents; λ = ½ Query: revenue down P(Q|d1) = [(1/8 + 2/16)/2] x [(1/8 + 1/16)/2] = 1/8 x 3/32 = 3/256 P(Q|d2) = [(1/8 + 2/16)/2] x [(0 + 1/16)/2] = 1/8 x 1/32 = 1/256 Ranking: d1 > d2
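
The example above can be checked directly; this sketch rebuilds the MLE document and collection models with λ = ½ and reproduces 3/256 and 1/256.

```python
# Reproduces the slide's mixture-model example (lambda = 1/2).
from collections import Counter

d1 = "Xerox reports a profit but revenue is down".lower().split()
d2 = "Lucent narrows quarter loss but revenue decreases further".lower().split()
collection = d1 + d2
lam = 0.5

def p_mixture(word, doc):
    p_doc = Counter(doc)[word] / len(doc)
    p_col = Counter(collection)[word] / len(collection)
    return lam * p_doc + (1 - lam) * p_col

def score(query, doc):
    p = 1.0
    for w in query.lower().split():
        p *= p_mixture(w, doc)
    return p

print(score("revenue down", d1))   # 3/256 ~ 0.0117
print(score("revenue down", d2))   # 1/256 ~ 0.0039
```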

  46. What is relative entropy? • KL divergence/relative entropy

  47. Kullback-Leibler Divergence • Relative entropy between two distributions: D(P ‖ Q) = Σ_x P(x) log( P(x) / Q(x) ) • The cost in bits of coding with Q when the true distribution is P

  48. Kullback-Leibler Divergence
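
A minimal KL-divergence sketch using base-2 logarithms (bits), with invented distributions P and Q:

```python
# Minimal KL-divergence sketch: D(P || Q) = sum_x P(x) * log2(P(x)/Q(x)).
import math

def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.25, 0.25]
Q = [1 / 3, 1 / 3, 1 / 3]
print(round(kl_divergence(P, Q), 4))   # >= 0, and 0 only when P == Q
```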

  49. Precision and Recall. Precision: the fraction of retrieved documents that are relevant = P(relevant | retrieved). Recall: the fraction of relevant documents that are retrieved = P(retrieved | relevant). Precision P = tp / (tp + fp); Recall R = tp / (tp + fn)
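
A short sketch computing precision and recall from hypothetical retrieved and relevant document-ID sets:

```python
# Minimal precision/recall computation from retrieved and relevant sets.
retrieved = {1, 2, 3, 5, 8}
relevant = {2, 3, 4, 8, 9, 10}

tp = len(retrieved & relevant)   # relevant and retrieved
fp = len(retrieved - relevant)   # retrieved but not relevant
fn = len(relevant - retrieved)   # relevant but missed

precision = tp / (tp + fp)       # 3 / 5
recall = tp / (tp + fn)          # 3 / 6
print(precision, recall)
```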
