Genetic Learning for Information Retrieval. Andrew Trotman, Computer Science. 365 * 24 * 60 / 40 = 13,140. Genetic Learning: the core algorithm (crossover, mutation, reproduction; fitness proportionate selection), genetic algorithms (chromosome is an array), genetic programming.
Genetic Learning for Information Retrieval
Andrew Trotman
Computer Science
365 * 24 * 60 / 40 = 13,140

Genetic Learning
• The Core Algorithm
  • Crossover, Mutation, Reproduction
  • Fitness proportionate selection
• Genetic Algorithms
  • Chromosome is an array
• Genetic Programming
  • Chromosome is an abstract syntax tree
[Figure: crossover (X) of the chromosomes {A B C D E F} and {1 2 3 4 5 6}]

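The core algorithm on this slide can be sketched for array chromosomes as follows. This is a minimal illustration, not the implementation used in the talk: single-point crossover, per-gene random-reset mutation, the population size, the mutation rate, and copying the best individual unchanged as the "reproduction" step are all assumptions made for the example.

```python
import random

def roulette_select(population, fitnesses, rng):
    """Fitness proportionate selection: P(individual) = fitness / total fitness."""
    total = sum(fitnesses)
    if total <= 0:                      # degenerate case: fall back to uniform choice
        return rng.choice(population)
    pick = rng.uniform(0, total)
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if pick <= running:
            return individual
    return population[-1]               # guard against floating-point round-off

def crossover(a, b, rng):
    """Single-point crossover: swap the tails of two equal-length arrays."""
    point = rng.randrange(1, len(a))    # cut strictly inside the chromosome
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(chromosome, rng, rate=0.05):
    """Replace each gene by a fresh random value with probability `rate`."""
    return [rng.random() if rng.random() < rate else gene for gene in chromosome]

def evolve(fitness, length=6, pop_size=20, generations=25, seed=0):
    """Run the loop: evaluate, reproduce, select, cross over, mutate."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        fitnesses = [fitness(c) for c in population]
        best = population[max(range(pop_size), key=fitnesses.__getitem__)]
        next_gen = [best]               # reproduction (assumed): copy the best unchanged
        while len(next_gen) < pop_size:
            mother = roulette_select(population, fitnesses, rng)
            father = roulette_select(population, fitnesses, rng)
            for child in crossover(mother, father, rng):
                next_gen.append(mutate(child, rng))
        population = next_gen[:pop_size]
    return max(population, key=fitness)

# The two parents from the slide's crossover figure.
child_1, child_2 = crossover(list("ABCDEF"), [1, 2, 3, 4, 5, 6], random.Random(1))
best = evolve(sum)                      # toy fitness: maximise the sum of the genes
```

Each child starts with genes from one parent and ends with genes from the other, which is what the {A B C D E F} X {1 2 3 4 5 6} figure depicts.
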
Information Retrieval (Text)
• Online Systems
  • Dialog, LexisNexis, etc.
• Web Systems
  • Alta Vista, Excite, Google, etc.
• Scientific Literature Systems
  • CiteSeer, PubMed, BioMedNet, etc.
• Question:
  • How should scientific literature be ranked?
    • Less time searching / More time researching
    • Higher exposure for “good” work

How Google Works
• PageRank
  • Document ranking from PageRank
  • A document’s PageRank is some factor (d) of the rank of incoming citations
  • A document’s influence is some factor of its rank and its outgoing citations
• Characteristics of Scientific Literature
  • Citations unidirectional (backwards in time)
  • 12-month publication cycle
  • Scientific citation “cliques”

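The PageRank idea above can be sketched as a power iteration. This is the commonly published formulation, not code from the talk; the damping value d = 0.85, the fixed iteration count, the even redistribution of rank from documents with no outgoing citations, and the toy citation graph are assumptions for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each document to the list of documents it cites."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by documents that cite nothing is spread evenly (assumption).
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {node: (1 - d) / n + d * dangling / n for node in nodes}
        for source, targets in links.items():
            if targets:
                # A document's influence: its rank split over its outgoing citations.
                share = d * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Citations only point backwards in time: newer papers cite older ones.
citations = {"new": ["mid", "old"], "mid": ["old"], "old": []}
ranks = pagerank(citations)
```

In this toy graph the oldest paper collects the most rank and the newest the least, which illustrates why unidirectional citations and a long publication cycle matter when ranking scientific literature this way.
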
How IR works
• Indexing
  • Build the dictionary
  • Construct the Postings (<d,f> pairs)
• Searching
  • Look up terms in dictionary
  • Boolean resolution
  • Rank on density (probability, vector space, etc.)
• Performance
  • Recall and precision
Records:
  Record1: Of Otago
  Record2: Otago University
  Record3: Otago
  Record4: Of
Dictionary and postings:
  OF <1,1><4,1>
  OTAGO <1,1><2,1><3,1>
  UNIVERSITY <2,1>

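The indexing and searching steps can be sketched with the four records from this slide. The function names, whitespace tokenisation, and upper-casing of terms are assumptions for the example; Record1 contains both "Of" and "Otago", so it appears in the postings of both terms.

```python
from collections import Counter, defaultdict

def build_index(records):
    """Build a dictionary mapping each term to its postings, a list of <d,f> pairs."""
    index = defaultdict(list)
    for doc_id, text in sorted(records.items()):
        for term, freq in sorted(Counter(text.upper().split()).items()):
            index[term].append((doc_id, freq))
    return dict(index)

def boolean_and(index, terms):
    """Boolean resolution: documents that contain every query term."""
    doc_sets = [{d for d, _ in index.get(term.upper(), [])} for term in terms]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []

def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

records = {1: "Of Otago", 2: "Otago University", 3: "Otago", 4: "Of"}
index = build_index(records)
found = boolean_and(index, ["Otago", "University"])
```

Ranking on density would then order the documents returned by the Boolean step using the frequencies stored in the postings.
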
Structured-IR
• Sci-Lit documents have structure
  • Title, abstract, conclusions, etc.
• <d,f> becomes <d,p,f>
Tag ids: doc:1 docid:2 place:3 name:4 cntry:5 sport:6 rank:7
<doc><docid>1</docid><place><name>University of Otago</name></place><cntry>New Zealand</cntry></doc>
<doc><docid>2</docid><cntry>New Zealand</cntry><sport>sailing</sport></doc>
<doc><docid>3</docid><place><name>University of Otago</name><rank>top</rank></place></doc>

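The move from <d,f> to <d,p,f> postings can be sketched with the three example documents and the tag ids from this slide. Treating p as the id of the innermost tag that encloses the term is an assumption made here; the talk does not spell out the exact definition. The function name is also hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Tag ids as given on the slide.
TAG_IDS = {"doc": 1, "docid": 2, "place": 3, "name": 4,
           "cntry": 5, "sport": 6, "rank": 7}

def index_structured(xml_docs):
    """Return term -> list of <d,p,f>: document id, tag id, term frequency."""
    counts = Counter()
    for xml in xml_docs:
        root = ET.fromstring(xml)
        doc_id = int(root.findtext("docid"))
        for element in root.iter():
            if element.tag == "docid":
                continue                # the identifier itself is not indexed
            for word in (element.text or "").upper().split():
                counts[(word, doc_id, TAG_IDS[element.tag])] += 1
    index = defaultdict(list)
    for (term, d, p), f in sorted(counts.items()):
        index[term].append((d, p, f))
    return dict(index)

docs = [
    "<doc><docid>1</docid><place><name>University of Otago</name></place>"
    "<cntry>New Zealand</cntry></doc>",
    "<doc><docid>2</docid><cntry>New Zealand</cntry><sport>sailing</sport></doc>",
    "<doc><docid>3</docid><place><name>University of Otago</name>"
    "<rank>top</rank></place></doc>",
]
structured_index = index_structured(docs)
```

With the position recorded, a search can distinguish "Otago" inside a name element from the same word elsewhere in a document.
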
Using Structure in Ranking
• Documents have structure
  • Title, Abstract, Conclusions, etc.
• Weight each structure on “importance”
  • Title higher than Abstract higher than …
• How to choose the weights
  • Specified in the query (XIRQL)
  • Query feedback
  • Learn with a Genetic Algorithm
• Adapt ranking model to use structure
  • Each tree node is a locus
  • Weights are genes

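A rough sketch of how per-structure weights could enter a ranking function: each structure (tree node) is a locus, and the chromosome holds one weight per locus. The scoring rule here, a plain weighted sum of term frequencies, is a stand-in chosen for brevity and not the probabilistic or vector space model adapted in the talk; the postings and weights are made-up illustration values.

```python
def weighted_rank(postings_by_term, query_terms, weights):
    """Score each document as the sum of weight[p] * f over matching <d,p,f> postings."""
    scores = {}
    for term in query_terms:
        for d, p, f in postings_by_term.get(term, []):
            scores[d] = scores.get(d, 0.0) + weights[p] * f
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical postings: the term occurs once in a title and twice in an abstract.
postings = {"OTAGO": [(1, "title", 1), (2, "abstract", 2)]}

uniform = {"title": 1.0, "abstract": 1.0}       # structure ignored
title_heavy = {"title": 3.0, "abstract": 1.0}   # one candidate chromosome

flat_ranking = weighted_rank(postings, ["OTAGO"], uniform)
weighted_ranking = weighted_rank(postings, ["OTAGO"], title_heavy)
```

Changing the genes changes the order of the results, so a genetic algorithm can search for weights whose ranking scores best on a set of training queries.
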
Experiment Results
• 50 training queries
• 50 evaluation queries
• 25 generations
• Probabilistic IR
  • 75.5% queries improved
  • 6.7% increase in MAP (8.8% max)
• Vector Space IR
  • 61% queries improved
  • 4.7% increase in MAP (5.4% max)

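MAP (mean average precision) is the measure reported in these results. A minimal sketch of the usual definition follows; the function names and the two toy queries are assumptions for illustration and are unrelated to the experiment's data.

```python
def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k at which a relevant document appears."""
    hits = 0
    total = 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs is a list of (ranked results, set of relevant documents), one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two toy queries (made-up judgements).
runs = [
    ([1, 2, 3, 4], {1, 3}),   # relevant at ranks 1 and 3: AP = (1/1 + 2/3) / 2
    ([5, 6], {6}),            # relevant at rank 2:        AP = 1/2
]
map_score = mean_average_precision(runs)
```

A ranking function that moves relevant documents nearer the top raises the average precision of a query, and so raises MAP over the query set.
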
Ranking Algorithms
• Multitude exist
  • Probability, vector space, Boolean
  • Several published nomenclatures
  • Over 100,000 “published” algorithms
• Purpose
  • Put relevant documents first
  • Sorting
  • Performance measures with precision
• Sources
  • Some guy thought it up

Experiment Results
• 50 training queries
• 50 evaluation queries
• 31 runs
• Weekend time limit
• Compare to Probabilistic
  • 67% queries improved
  • 15% increase in MAP

Function Comparison
• Vector Space
• Probability
• Learned
  w_dq = Σ_{t ∈ q} (((((((((U / sqrt(sqrt(nt))) / (mq / sqrt((((Lq / (sqrt(sqrt(Ld)) / sqrt((U / nc)))) * min(mq, N)) / sqrt(((((((Tmax / sqrt(U)) / sqrt((((log2(sqrt(nt)) / sqrt(nt)) / sqrt(Umax)) / (M / nc)))) / sqrt((U / nc))) - uq) / mq) / sqrt(nt))))))) / sqrt((log(Tmax) / nc))) / sqrt(nt)) / sqrt(nt)) / sqrt((Lq / sqrt(((sqrt((sqrt(sqrt(Ld)) / sqrt((min(mq, sqrt((((log(Tmax) / nc) / sqrt(Umax)) / (mq / sqrt(((N * min((sqrt(nc) / sqrt(U)), Ld)) / sqrt(N))))))) / sqrt(Ld))))) / sqrt((Tmax / nc))) / sqrt(nt)))))) / sqrt((min(mq, N) / nc))) / sqrt((log(Tmax) / nc))) / sqrt(nt))

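A learned function like the one above is an abstract syntax tree built from a small set of operators (/, *, -, sqrt, log, log2, min) over collection statistics. The sketch below shows one way such a tree could be represented and evaluated. The nested-tuple encoding and the "protected" operators (division by zero returns 1, sqrt and log take the absolute value) are common genetic programming conventions assumed here, not details confirmed by the talk, and the variable values are arbitrary.

```python
import math

def protected_div(a, b):
    """Division that returns 1.0 instead of failing on a zero denominator."""
    return a / b if b != 0 else 1.0

def protected_sqrt(a):
    return math.sqrt(abs(a))

def protected_log(a, base=math.e):
    return math.log(abs(a), base) if a != 0 else 0.0

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": protected_div,
    "min": min,
    "sqrt": protected_sqrt,
    "log": protected_log,
    "log2": lambda a: protected_log(a, 2),
}

def evaluate(node, env):
    """Evaluate a tree: a variable name, a numeric constant, or (operator, *children)."""
    if isinstance(node, str):
        return env[node]
    if isinstance(node, (int, float)):
        return float(node)
    op, *children = node
    return OPS[op](*(evaluate(child, env) for child in children))

# Innermost fragment of the learned function: U / sqrt(sqrt(nt)).
fragment = ("/", "U", ("sqrt", ("sqrt", "nt")))
value = evaluate(fragment, {"U": 8.0, "nt": 16.0})   # arbitrary example values
```

Crossover in genetic programming swaps subtrees between two such trees, and mutation replaces a subtree with a randomly generated one, which is how expressions as deeply nested as the learned function arise.
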
Conclusions
• Using document structure improved ranking
• Structure weights can be learned with a GA
• GP can be used to learn ranking functions
Speculation
• Combining GA and GP to learn a structure ranking algorithm will outperform GA or GP alone

Random Numbers
• Are your results an artifact of your random number generator?
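One simple way to probe this question is to repeat a stochastic experiment under several seeds and look at the spread of the outcomes. The sketch below assumes the experiment is a function that takes its own random number generator; `toy_experiment` is a hypothetical placeholder, not the learning run from the talk.

```python
import random
import statistics

def repeat_with_seeds(experiment, seeds):
    """Run `experiment` once per seed; return the mean and standard deviation."""
    scores = [experiment(random.Random(seed)) for seed in seeds]
    return statistics.mean(scores), statistics.stdev(scores)

def toy_experiment(rng):
    """Placeholder: a real run would train and evaluate a ranking function."""
    return rng.random()

mean_score, spread = repeat_with_seeds(toy_experiment, range(5))
```

If the improvement being reported is small compared with the spread across seeds, it may owe more to the random number generator than to the method.
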