WBIA Review http://net.pku.edu.cn/~wbia 黄连恩 hle@net.pku.edu.cn School of Information Engineering, Peking University 12/24/2013
Bow-tie
• Strongly Connected Component (SCC): the core
• Upstream (IN): the core can't reach IN
• Downstream (OUT): OUT can't reach the core
• Disconnected components
• Tendrils & Tubes
Power-law
• Nature seems to create bell curves (ranges around an average)
• Human activity seems to create power laws (popularity skewing)
Power-Law Distribution: Examples
• From "Graph structure in the Web" (AltaVista crawl, 1999)
Web Graph
• Exercise: how would you store the Web graph? (one possible answer is sketched below)
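A minimal sketch (in Python) of one common answer: store the graph as adjacency lists over dense integer page IDs. The URLs and links below are made up for illustration; real systems additionally compress the docID-sorted lists (gap encoding, etc.).

```python
# A minimal sketch of storing a Web graph as adjacency lists over integer IDs.
# The URLs and link structure here are hypothetical.
from collections import defaultdict

class WebGraph:
    def __init__(self):
        self.url_to_id = {}                  # map each URL to a dense integer ID
        self.urls = []                       # reverse map: ID -> URL
        self.out_links = defaultdict(list)   # adjacency lists keyed by source ID

    def node_id(self, url):
        if url not in self.url_to_id:
            self.url_to_id[url] = len(self.urls)
            self.urls.append(url)
        return self.url_to_id[url]

    def add_link(self, src_url, dst_url):
        self.out_links[self.node_id(src_url)].append(self.node_id(dst_url))

g = WebGraph()
g.add_link("http://a.example/", "http://b.example/")
g.add_link("http://b.example/", "http://a.example/")
print(dict(g.out_links))   # {0: [1], 1: [0]}
```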
PageRank: why and how does it work?
Random walker model
(figure: a random surfer at node v with neighbouring nodes u1-u5)
Damping Factor
• β is chosen between 0.1 and 0.2 and is called the damping factor (Page & Brin, 1997)
• G = (1-β)·L^T + (β/N)·1_N is called the Google matrix (1_N is the N×N all-ones matrix)
Solving a small example
• With β = 0.15 and N = 11 pages: G = 0.85·L^T + (0.15/11)·1_N
• p_0 = (1/11, 1/11, …)^T, p_1 = G·p_0, p_2 = G·p_1, …
• Power iteration converges (about 50 iterations) to p = (0.033, 0.384, 0.343, 0.039, 0.081, 0.039, 0.016, …)^T
• You can try this in MATLAB; a Python sketch is shown below
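A minimal power-iteration sketch following the G = (1-β)·L^T + (β/N)·1_N formulation above. The 4-node graph is hypothetical, since the slide's 11-page example graph is not reproduced here.

```python
# Power iteration for PageRank with teleportation (damping factor beta).
import numpy as np

def pagerank(out_links, beta=0.15, iters=50):
    n = len(out_links)
    # Column-stochastic link matrix L^T: entry [j, i] = 1/outdeg(i) if i -> j
    M = np.zeros((n, n))
    for i, targets in enumerate(out_links):
        for j in targets:
            M[j, i] = 1.0 / len(targets)
    p = np.full(n, 1.0 / n)
    for _ in range(iters):
        # teleport term (beta/N)*1_N*p equals beta/N in each entry when sum(p)=1
        p = (1 - beta) * M @ p + beta / n
    return p

# hypothetical graph: 0 -> 1,2; 1 -> 2; 2 -> 0; 3 -> 2
print(pagerank([[1, 2], [2], [0], [2]]))
```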
HITS (Hyperlink-Induced Topic Search)
• A page pointed to by many prestigious pages (large in-degree) has high authority
• A page that points to many prestigious pages (large out-degree) is a strong hub (directory)
• How to compute? Power iteration on the mutual update rules below
Authority and Hub scores
• For each page u ∈ V(q), two scores are defined: a[u] and h[u], its authority and hub (directory) scores
• The two are defined in terms of each other:
• the a value of a page u depends on the h values of the pages v pointing to it
• the h value of a page u depends on the a values of the pages v it points to
(a sketch of the iteration follows below)
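A small Python sketch of the HITS iteration just described: a = A^T h and h = A a, with L2 normalisation after every step. The tiny graph is hypothetical.

```python
# HITS power iteration on an adjacency matrix A (A[i, j] = 1 if page i links to j).
import numpy as np

def hits(out_links, iters=50):
    n = len(out_links)
    A = np.zeros((n, n))
    for i, targets in enumerate(out_links):
        A[i, targets] = 1.0
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(iters):
        a = A.T @ h                      # authority: sum of hub scores of in-neighbours
        h = A @ a                        # hub: sum of authority scores of out-neighbours
        a /= np.linalg.norm(a)
        h /= np.linalg.norm(h)
    return a, h

auth, hub = hits([[1, 2], [2], [0], [2]])   # hypothetical 4-node graph
print(auth, hub)
```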
Web Spam
• Term spamming
• Manipulating the text of web pages in order to appear relevant to queries
• Link spamming
• Creating link structures that boost PageRank or hub and authority scores
TrustRank
• Expecting that good pages point to other good pages, all pages reachable from a good seed page in M or fewer steps are marked as good
• t = β·L^T·t + (1-β)·d/|d|
(figure: example graph with good pages 1-4 and bad pages 5-7)
TrustRank in Action
• Select the seed set using inverse PageRank: s = [2, 4, 5, 1, 3, 6, 7]
• Invoke the oracle on the top L (= 3) seeds
• Populate the static score distribution vector: d = [0, 1, 0, 1, 0, 0, 0]
• Normalize the distribution vector: d = [0, 1/2, 0, 1/2, 0, 0, 0]
• Calculate TrustRank scores using biased PageRank with trust dampening and trust splitting: t = β·L^T·t + (1-β)·d/|d|
• Results: t = [0, 0.18, 0.12, 0.15, 0.13, 0.05, 0.05]
(figure: the 7-node example graph annotated with these scores)
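A hedged Python sketch of the biased-PageRank step t = β·L^T·t + (1-β)·d/|d|. The graph and the oracle-approved seed set below are hypothetical, not the slide's 7-node example.

```python
# Biased PageRank with a trust vector d built from oracle-approved seed pages.
import numpy as np

def trustrank(out_links, good_seeds, beta=0.85, iters=50):
    n = len(out_links)
    M = np.zeros((n, n))                     # column-stochastic L^T
    for i, targets in enumerate(out_links):
        for j in targets:
            M[j, i] = 1.0 / len(targets)
    d = np.zeros(n)
    d[list(good_seeds)] = 1.0                # seeds the oracle labelled as good
    d /= d.sum()                             # normalise: d / |d|
    t = d.copy()
    for _ in range(iters):
        t = beta * M @ t + (1 - beta) * d    # trust flows out from the seeds
    return t

print(trustrank([[1], [2], [0], [2]], good_seeds=[0, 1]))   # hypothetical graph
```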
Tokenization
• Friends, Romans, Countrymen, lend me your ears;
• Friends | Romans | Countrymen | lend | me | your | ears
• Token: an instance of a sequence of characters that are grouped together as a useful semantic unit for processing
• Type: the class of all tokens containing the same character sequence
• Term: a type that is included in the system dictionary (normalized)
Stemming and lemmatization
• Stemming
• Crude heuristic process that chops off the ends of words
• democratic → democa
• Lemmatization
• Uses vocabulary and morphological analysis; returns the base form of a word (the lemma)
• democratic → democracy
• sang → sing
Porter stemmer
• Most common algorithm for stemming English
• 5 phases of word reduction, e.g.:
• SSES → SS: caresses → caress
• IES → I: ponies → poni
• SS → SS: (unchanged)
• S → (dropped): cats → cat
• EMENT → (dropped): replacement → replac, but cement → cement
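To see these rules in action, one option (assuming the NLTK package is installed) is its PorterStemmer implementation of the algorithm:

```python
# Applying the Porter stemmer to the slide's examples via NLTK (pip install nltk).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "cats", "replacement", "cement"]:
    print(word, "->", stemmer.stem(word))
# Expected (roughly): caress, poni, cat, replac, cement
```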
Bag of words model • A document can now be viewed as the collection of terms in it and their associated weight • Mary is smarter than John • John is smarter than Mary • Equivalent in the bag of words model
Term frequency and weighting
• A word that appears often in a document is probably very descriptive of what the document is about
• Assign to each term in a document a weight that depends on the number of occurrences of that term in the document
• Term frequency (tf)
• Assign the weight to be equal to the number of occurrences of term t in document d
Inverse document frequency
• idf_t = log(N / df_t), where N is the number of documents in the collection (natural log in the examples below)
• N = 1000; df[the] = 1000; idf[the] = 0
• N = 1000; df[some] = 100; idf[some] = 2.3
• N = 1000; df[car] = 10; idf[car] = 4.6
• N = 1000; df[merger] = 1; idf[merger] = 6.9
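A quick Python check of the numbers above, assuming the natural-log form idf_t = ln(N / df_t):

```python
# Reproducing the idf examples with a natural-log idf.
import math

N = 1000
for term, df in [("the", 1000), ("some", 100), ("car", 10), ("merger", 1)]:
    print(term, round(math.log(N / df), 1))   # 0.0, 2.3, 4.6, 6.9
```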
tf.idf weighting
• Highest when t occurs many times within a small number of documents
• Thus lending high discriminating power to those documents
• Lower when the term occurs fewer times in a document, or occurs in many documents
• Thus offering a less pronounced relevance signal
• Lowest when the term occurs in virtually all documents
tf x idf term weights
• The tf x idf weight formula combines:
• term frequency (tf), or wf, some measure of term density in a document
• inverse document frequency (idf), expressing the importance (rarity) of a term
• The raw value idf_t = 1/df_t is, as before, usually smoothed with a log: idf_t = log(N/df_t)
• For each term in a document, compute its tf.idf weight: w_t,d = tf_t,d × log(N/df_t)
Document vector space representation • Each document is viewed as a vector with one component corresponding to each term in the dictionary • The value of each component is the tf-idf score for that word • For dictionary terms that do not occur in the document, the weights are 0
Documents as vectors
• Each document j can be viewed as a vector; each term is a dimension, with value tf.idf
• So we have a vector space
• terms are axes
• docs live in this space
• High-dimensional: even with stemming, there may be 20,000+ dimensions
Cosine similarity
• The "closeness" of vectors d1 and d2 can be measured by the angle between them
• Concretely, the cosine of the angle θ is used as the vector similarity
• Vectors are normalized by length
(figure: vectors d1 and d2 at angle θ in the term space spanned by t1, t2, t3)
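A minimal Python sketch of the cosine computation; the two toy vectors stand in for tf.idf document vectors:

```python
# Cosine similarity between two term-weight vectors.
import numpy as np

def cosine(d1, d2):
    d1, d2 = np.asarray(d1, float), np.asarray(d2, float)
    return float(d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2)))

print(cosine([0.5, 0.8, 0.0], [0.9, 0.3, 0.1]))   # 1.0 would mean identical direction
```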
Jaccard coefficient
• Resemblance: r(A,B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|
• Symmetric, reflexive, not transitive, not a metric
• Note r(A,A) = 1, but r(A,B) = 1 does not mean A and B are identical!
• Forgives any number of occurrences and any permutation of the terms
• Resemblance distance: d(A,B) = 1 - r(A,B)
Shingling
• A contiguous subsequence contained in D is called a shingle
• Given a document D, we define its w-shingling S(D, w) as the set of all unique shingles of size w contained in D
• D = (a, rose, is, a, rose, is, a, rose)
• S(D, 4) = {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is)}
• "a rose is a rose is a rose" => a_rose_is_a  rose_is_a_rose  is_a_rose_is
• Why shingling? Compare S(D,4) vs. S(D,1)
• What is a good value for w?
Doc1= "to be or not to be, that is a question!" Doc2= "to be a question or not" Shingling & Jaccard Coefficient • Let windows size w = 2, • Resemblancer(A,B) = ?
Random permutation
• Let Ω be a set (e.g. 1..N)
• Pick a permutation π: Ω → Ω uniformly at random
• π = {3, 7, 1, 4, 6, 2, 5}
• A = {2, 3, 6}
• MIN(π(A)) = ? (see the sketch below)
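A sketch of the question above, assuming π = {3, 7, 1, 4, 6, 2, 5} lists the images π(1), π(2), …, π(7) in order:

```python
# Apply the permutation pi to the set A and take the minimum (the min-hash of A under pi).
pi = {1: 3, 2: 7, 3: 1, 4: 4, 5: 6, 6: 2, 7: 5}
A = {2, 3, 6}
print(min(pi[x] for x in A))   # pi(A) = {7, 1, 2}, so MIN(pi(A)) = 1
```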
Inverted index
• For each term T: store a list of the (IDs of) documents containing T
• Dictionary on the left, postings lists on the right:
• 中国 → 2, 4, 8, 16, 32, 64, 128
• 文化 → 1, 2, 3, 5, 8, 13, 21, 34
• 留学生 → 13, 16
• Postings are sorted by docID (more later on why)
Inverted Index • with counts • supports better ranking algorithms
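A minimal Python sketch of such an index: each term maps to a docID-sorted postings list of (docID, tf) pairs, which is what tf-based ranking needs. The two tiny documents are made up.

```python
# Inverted index with counts.
from collections import Counter, defaultdict

def build_index(docs):                      # docs: {docID: text}
    index = defaultdict(list)
    for doc_id in sorted(docs):             # keep postings sorted by docID
        for term, tf in Counter(docs[doc_id].lower().split()).items():
            index[term].append((doc_id, tf))
    return index

index = build_index({1: "to be or not to be", 2: "to be a question"})
print(index["to"])   # [(1, 2), (2, 1)]
print(index["be"])   # [(1, 2), (2, 1)]
```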
Sec. 6.4: VS-based Retrieval
• (SMART weighting table omitted; the column headings are acronyms for the weight schemes)
• Why is the base of the log in idf immaterial?
Sec. 6.4: tf-idf example: lnc.ltc
• Document: car insurance auto insurance
• Query: best car insurance
• (worked weighting table omitted; document length normalization applied)
• Exercise: what is N, the number of docs?
• Score = 0 + 0 + 0.27 + 0.53 = 0.8
Singular Value Decomposition
• Apply SVD to the term-document matrix: W_td = T Σ D^T, where T is t×r, Σ is r×r, D^T is r×d
• r is the rank of the matrix
• Σ is the diagonal matrix of singular values (in descending order)
• T and D have orthonormal columns (T^T T = I, D^T D = I)
• Σ² holds the eigenvalues of W W^T; the columns of T are eigenvectors of W W^T and the columns of D are eigenvectors of W^T W
Latent Semantic Model
• LSI retrieval process:
• The query is mapped/projected into the LSI document space D^T, called "folding in": with W = T Σ D^T, if q projected into the D^T space is q', then q = T Σ q'^T
• Hence q' = (Σ^-1 T^-1 q)^T = q^T T Σ^-1 (using T^-1 = T^T, since T has orthonormal columns)
• Folding in therefore means multiplying the document/query vector by T Σ^-1
• The document vectors of the collection are the columns of D^T
• Query and documents are compared by dot product
Stochastic Language Models
• A statistical model used to generate text
• Probability distribution over strings in a given language
• For a string s = w1 w2 w3 w4 generated by model M:
• P(s | M) = P(w1 | M) · P(w2 | M, w1) · P(w3 | M, w1 w2) · P(w4 | M, w1 w2 w3)
• Unigram model: captures likely topics
• Bigram model: captures (some) grammaticality
Bigram Model
• Approximate P(w_n | w_1 … w_n-1) by P(w_n | w_n-1)
• e.g. approximate P(unicorn | the mythical) by P(unicorn | mythical)
• Markov assumption: the probability of a word depends only on a limited history
• Generalization: the probability of a word depends only on the previous n words
• trigrams, 4-grams, …
• the higher n is, the more data is needed to train
• backoff models …
A Simple Example: bigram model
• P(I want to eat Chinese food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) P(<end> | food)
LM-based Retrieval
• Ranking formula: P(Q | M_d) = ∏_{t ∈ Q} P(t | M_d)
• Use the maximum likelihood estimate: P_mle(t | M_d) = tf_t,d / len(d)
• M_d: language model of document d
• tf_t,d: raw tf of term t in document d
• len(d): total number of tokens in document d
• Unigram assumption: given a particular language model, the query terms occur independently
Laplace smoothing
• Also called add-one smoothing
• Just add one to all the counts!
• Very simple
• MLE estimate: P_mle(w_i) = c_i / N
• Laplace estimate: P_lap(w_i) = (c_i + 1) / (N + V), where V is the vocabulary size
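A tiny Python sketch contrasting the MLE and add-one (Laplace) estimates; the toy counts are made up:

```python
# MLE vs. Laplace (add-one) unigram estimates.
from collections import Counter

tokens = "to be or not to be".split()
counts = Counter(tokens)
N = len(tokens)                     # total tokens
V = len(counts)                     # vocabulary size

for w in ["to", "question"]:        # "question" is unseen
    mle = counts[w] / N
    laplace = (counts[w] + 1) / (N + V)
    print(w, mle, round(laplace, 3))   # unseen words get non-zero probability
```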
Mixture model smoothing
• P(w | d) = λ · P_mle(w | M_d) + (1 - λ) · P_mle(w | M_c)
• The parameter λ is very important
• A high λ makes the query "conjunctive-like", suitable for short queries
• A low λ is better for long queries
• Tune λ to optimize performance, e.g. make it depend on document length (cf. Dirichlet prior or Witten-Bell smoothing)
Example
• Document collection (2 documents)
• d1: Xerox reports a profit but revenue is down
• d2: Lucent narrows quarter loss but revenue decreases further
• Model: MLE unigram from documents; λ = ½
• Query: revenue down
• P(Q|d1) = [(1/8 + 2/16)/2] × [(1/8 + 1/16)/2] = 1/8 × 3/32 = 3/256
• P(Q|d2) = [(1/8 + 2/16)/2] × [(0 + 1/16)/2] = 1/8 × 1/32 = 1/256
• Ranking: d1 > d2
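A short Python sketch that reproduces the numbers above (λ = ½, MLE unigram document and collection models):

```python
# Jelinek-Mercer-style mixture smoothing on the two example documents.
d1 = "Xerox reports a profit but revenue is down".lower().split()
d2 = "Lucent narrows quarter loss but revenue decreases further".lower().split()
collection = d1 + d2
lam = 0.5

def p_smoothed(word, doc):
    p_doc = doc.count(word) / len(doc)                 # P_mle(w | M_d)
    p_col = collection.count(word) / len(collection)   # P_mle(w | M_c)
    return lam * p_doc + (1 - lam) * p_col

for name, doc in [("d1", d1), ("d2", d2)]:
    score = 1.0
    for w in "revenue down".split():
        score *= p_smoothed(w, doc)
    print(name, score)   # d1: 3/256 ≈ 0.0117, d2: 1/256 ≈ 0.0039
```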
What is relative entropy? • KL divergence/relative entropy
Kullback-Leibler Divergence
• Relative entropy between two distributions: D_KL(P || Q) = Σ_x P(x) · log( P(x) / Q(x) )
• Cost in bits of coding using Q when the true distribution is P (when the log is base 2)
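A tiny Python sketch of the definition, using base-2 logs so the result is in bits; the two distributions are made up:

```python
# KL divergence D(P || Q) = sum_x P(x) * log2(P(x) / Q(x)).
import math

def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.25, 0.25]
Q = [1/3, 1/3, 1/3]
print(kl_divergence(P, Q))   # extra bits per symbol when coding P with a code built for Q
```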
Precision and Recall
• Precision: fraction of retrieved documents that are relevant = P(relevant | retrieved)
• Recall: fraction of relevant documents that are retrieved = P(retrieved | relevant)
• Precision P = tp / (tp + fp)
• Recall R = tp / (tp + fn)
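A minimal Python sketch computing both measures from made-up retrieved and relevant document-ID sets:

```python
# Precision and recall from set overlap.
retrieved = {1, 2, 3, 4, 5}
relevant = {2, 4, 6, 8}

tp = len(retrieved & relevant)          # relevant documents that were retrieved
fp = len(retrieved - relevant)          # retrieved but not relevant
fn = len(relevant - retrieved)          # relevant but missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(precision, recall)                # 0.4, 0.5
```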