Artificial Intelligence and Data Mining in Information Retrieval
Presentation by Volker Rehberg, University of Konstanz, 31.01.2011
Agenda: Definitions, Indexing, Classification, Clustering, Feedback & Ranking, Conclusion
Agenda
Overall goal: present the most important AI and data mining methods for information retrieval and where they have the most valuable impact.
• Definitions: information retrieval vs. artificial intelligence and data mining
• Indexing and dimension reduction: term vector model, dimension reduction by PCA, latent semantic analysis
• Classification: support vector machines, Bayes classifier, fuzzy classification
• Clustering: query reformulation, document clustering for presentation
• Relevance feedback and ranking with neural networks
• Summary and conclusion
Definitions
Information retrieval: “Information retrieval (IR) is finding material (usually documents) … that satisfies an information need from within large collections …” (Christopher Manning [1])
Four distinct phases [2]:
• indexing
• query formulation
• comparison
• feedback
Definitions
Artificial intelligence: “It is the science and engineering of making intelligent machines.” (John McCarthy [3])
• Weak AI hypothesis: machines can behave intelligently.
• Strong AI hypothesis: machines are really able to think. [4]
Data mining: “Data mining is … generating knowledge from data and its presentation. It … originated in statistics or artificial intelligence and should be applicable to large databases …” (Wolfgang Ertel [5])
Term Vector Model
• Each document is a Boolean or numeric vector recording which words appear in it.
• Documents are similar if they lie close together in the vector space.
• Retrieval ranks documents by the distance or angle between the query vector and each document vector.
• The term-document matrix has a very large number of dimensions.
• Problem: high processing cost
• Hence the need to reduce dimensions [1]
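Retrieval by angle can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization and raw term counts; the three documents and the helper names `term_vector` and `cosine_similarity` are invented for the example and do not come from the slides.

```python
import math
from collections import Counter

def term_vector(text):
    """Map a text to a sparse term-frequency vector (term -> count)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_a * norm_b)

# Hypothetical toy collection.
docs = {
    "d1": "data mining finds patterns in data",
    "d2": "information retrieval finds documents",
    "d3": "cooking recipes for pasta",
}
query = term_vector("data mining")

# Rank documents by decreasing similarity to the query.
ranking = sorted(
    docs,
    key=lambda d: cosine_similarity(query, term_vector(docs[d])),
    reverse=True,
)
```

Only `d1` shares terms with the query, so it is ranked first; documents without any query term get a similarity of 0, which is the limitation that dimension reduction and LSI address.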
Dimension Reduction by PCA
Advantage: processing on a reduced and uncorrelated matrix.
PCA is a rotation of the data in space (into a new coordinate system), so that:
• the first dimensions store most of the information
• the 1st dimension has the highest eigenvalue (importance)
• Two main approaches: variance approach vs. error approach
• We get a set of uncorrelated dimensions ordered by relevance
• After that we can reduce the dimensionality [6]
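The variance approach can be illustrated with a small pure-Python sketch that finds only the first principal component by power iteration on the covariance matrix. The data points, the function name `pca_first_component`, and the choice of power iteration are assumptions made for this example; a real system would typically use a linear algebra library. Power iteration also assumes the start vector is not orthogonal to the leading eigenvector, which holds for this data.

```python
import math

def pca_first_component(rows, iterations=100):
    """Return (direction, eigenvalue, means, total_variance) for the
    first principal component of the given data rows."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    centered = [[r[j] - means[j] for j in range(dims)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1)
            for j in range(dims)] for i in range(dims)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * dims
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * cov[i][j] * v[j]
                     for i in range(dims) for j in range(dims))
    total_variance = sum(cov[i][i] for i in range(dims))
    return v, eigenvalue, means, total_variance

# Hypothetical 2-D data lying close to a straight line.
data = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8)]
direction, eigenvalue, means, total_variance = pca_first_component(data)

# Reduce each point from 2 dimensions to 1 by projecting onto the component.
reduced = [sum((x[j] - means[j]) * direction[j] for j in range(len(x)))
           for x in data]
explained = eigenvalue / total_variance  # share of variance that is kept
```

Because the points lie almost on a line, the first dimension alone keeps nearly all of the variance, which is why the second one can be dropped.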
Latent Semantic Indexing
LSI reduces the dimensions of the term-document matrix by grouping terms into concepts.
Advantages:
• Dimension reduction of the term-document matrix
• A document can be retrieved even if it does not contain the query words
Method:
• LSI applies singular value decomposition to the term-document matrix
• It models concepts related to terms and concepts related to documents. [1]
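The decomposition behind LSI can be written out in the standard truncated-SVD form. The symbols are chosen here for illustration and are not taken from the slides: $A$ is the $m \times n$ term-document matrix, $k$ is the number of concepts that are kept, and $q$ is a query term vector.

```latex
A = U \Sigma V^{\top}, \qquad
A_k = U_k \Sigma_k V_k^{\top}, \qquad
\hat{q} = \Sigma_k^{-1} U_k^{\top} q
```

Here $U_k$ relates terms to concepts, $V_k$ relates documents to concepts, and $\Sigma_k$ holds the $k$ largest singular values. The query is mapped into concept space as $\hat{q}$ and compared with the document representations there, for example by cosine similarity, so a document can match without sharing a literal query word.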
Classification
Classification is assigning an object (e.g. a text or audio document) to a distinct class.
It benefits indexing, query formulation, comparison, and ranking.
Example benefits for information retrieval:
• Identify the language of a document
• Identify spam pages and do not index them [1]
• Identify collocations (e.g. “New York”, “Data Mining”) and index them together
• Categorization in library cataloging systems or of newswire stories [7]
• Categorization of multimedia content (e.g. audio, video) [8]
• Classify the query to see which text “category” the user is searching for
• Relevance ranking (classes relevant / non-relevant) [1]
Classification
Several methods:
• Support Vector Machines
• Naïve Bayes
• K-Nearest Neighbour
• Decision Trees
• Neural Networks
• Genetic Algorithms
Supervised (often used) vs. unsupervised learning (rarely used):
• Unsupervised learning requires no training but is much more computation-intensive than supervised schemes. [1]
Support Vector Machines
Divide documents into 2 classes by drawing a hyperplane with maximum margin to the support vectors.
Application: video retrieval [8]
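A rough sketch of the idea, assuming a linear soft-margin SVM trained by subgradient descent on the hinge loss. This only approximates the maximum-margin hyperplane and is not the method used in [8]; the toy points, the learning rate, and the function names `train_linear_svm` and `predict` are invented for the example.

```python
def train_linear_svm(samples, labels, rate=0.1, reg=0.01, epochs=200):
    """Fit weights w and bias b by subgradient descent on the
    regularized hinge loss (labels must be +1 or -1)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Regularization shrinks w, which corresponds to a wider margin.
            w = [wi * (1.0 - rate * reg) for wi in w]
            if margin < 1.0:
                # The sample lies inside the margin: move the hyperplane.
                w = [wi + rate * y * xi for wi, xi in zip(w, x)]
                b += rate * y
    return w, b

def predict(w, b, x):
    """Return the class (+1 or -1) on whose side of the hyperplane x lies."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical, linearly separable document features.
samples = [(2.0, 2.0), (3.0, 3.0), (3.0, 2.0),
           (-2.0, -2.0), (-3.0, -3.0), (-2.0, -3.0)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(samples, labels)
predictions = [predict(w, b, x) for x in samples]
```

The points closest to the hyperplane, here (2, 2) and (-2, -2), are the ones that trigger updates late in training; they play the role of the support vectors.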
Bayes Classifier
The theorem states: $P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$ [9]
• P(A) is the prior probability of A (also called “unconditional” probability)
• P(A|B) is the conditional probability of A, given B (also called posterior probability)
• P(B|A) is the conditional probability of B, given A (also called likelihood)
• P(B) is the prior probability of B
Example: spam filter
• Was used in the 1960s for the first ranking systems
• Implementation of the probabilistic model [10]
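The spam filter case can be worked through numerically. The probabilities below are made-up illustrative values, not figures from the slides, and `posterior` is a hypothetical helper; A stands for "the mail is spam" and B for "the mail contains a given word".

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B),
    with P(B) expanded over A and not-A."""
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / evidence

# Assumed values for illustration only.
p_spam = 0.2              # P(A): share of all mail that is spam
p_word_given_spam = 0.5   # P(B|A): the word appears in a spam mail
p_word_given_ham = 0.05   # P(B|not A): the word appears in a normal mail

p_spam_given_word = posterior(p_spam, p_word_given_spam, p_word_given_ham)
```

With these numbers P(B) = 0.5 · 0.2 + 0.05 · 0.8 = 0.14, so the posterior is 0.1 / 0.14 ≈ 0.71: seeing the word raises the spam probability from 20% to about 71%.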
Fuzzy Classification
• In standard logic: something is either true or not true
• In fuzzy logic: something can be true or false to a certain degree [6]
Example applications and benefits for IR:
• Fuzzy classification (e.g. relevant / not relevant ranking)
• Fuzzy clustering
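A degree of truth can be modelled with a membership function. The sketch below assumes a piecewise-linear function over a similarity score in [0, 1]; the thresholds 0.2 and 0.8 and the name `membership_relevant` are arbitrary choices for illustration.

```python
def membership_relevant(score, low=0.2, high=0.8):
    """Degree (0.0 to 1.0) to which a document with the given
    similarity score belongs to the fuzzy class 'relevant'."""
    if score <= low:
        return 0.0   # clearly not relevant
    if score >= high:
        return 1.0   # clearly relevant
    # Linear transition between the two thresholds.
    return (score - low) / (high - low)

def membership_not_relevant(score):
    """Standard fuzzy complement: 1 minus the membership degree."""
    return 1.0 - membership_relevant(score)

degrees = [membership_relevant(s) for s in (0.1, 0.5, 0.9)]
```

A document with score 0.5 is relevant to a degree of about 0.5 rather than being forced into one of two classes, and the degree itself can serve as a ranking value.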
Clustering
Clustering is finding groups of similar objects.
Term clustering: cluster search terms by their appearance in documents and add similar terms to the query.
Advantage:
• Query expansion -> increased recall
Document clustering: find similar documents with respect to their relevance to information needs. [1]
Advantages:
• Retrieve similar documents
• Improved presentation of documents
Query Reformulation by Clustering
• Query = “training” & “aircraft captain”
• Does not return documents that contain only the term “pilot”
• Increase recall by adding the related term “pilot” to the query
Association cluster: terms that often appear in the same documents are semantically related or synonyms [11]
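An association cluster can be approximated by counting how often two terms occur in the same document. The sketch below is a simplified illustration: the four documents, the threshold `min_count`, and the function names are assumptions for the example, and real systems normalize the counts rather than using raw frequencies.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence(documents):
    """Count, for every pair of terms, in how many documents both appear."""
    counts = defaultdict(int)
    for text in documents:
        terms = sorted(set(text.lower().split()))
        for a, b in combinations(terms, 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts

def expand_query(query_terms, counts, min_count=2):
    """Append every term that co-occurs at least min_count times
    with some query term."""
    vocabulary = {a for a, _ in counts}
    expanded = list(query_terms)
    for candidate in sorted(vocabulary):
        if candidate in query_terms:
            continue
        if any(counts.get((q, candidate), 0) >= min_count for q in query_terms):
            expanded.append(candidate)
    return expanded

# Hypothetical toy collection.
docs = [
    "pilot training for aircraft captain",
    "aircraft captain and pilot licence",
    "pilot training school",
    "captain of the football team",
]
expanded = expand_query(["training", "aircraft", "captain"], cooccurrence(docs))
```

In this collection “pilot” co-occurs twice with each query term, so it is the only term added, and the expanded query now also matches documents that mention only “pilot”.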
Document Clustering for Presentation
www.yippy.com
Advantage: resolve ambiguity through semantic clustering
Ranking and Feedback with Neural Networks (NN)
Neurons are connected into a network.
Neurons have:
• Input and output values
• Activation function
• Weighted connections
Implementations exist for:
• Vector space model
• Probabilistic model
• Boolean model
Ranking and Feedback with NN
3 layers of neurons connected through weights:
• Input layer: query
• Hidden layer: terms
• Output layer: documents
Query: propagation from input to hidden layer
Feedback: backpropagation
Supports relevance ranking and relevance feedback [12] [13]
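The three-layer idea can be sketched in a strongly simplified form. The example assumes linear activations and replaces full backpropagation with a delta-rule update on the term-to-document weights only; the weights, the learning rate, and the function names `rank` and `feedback` are invented for illustration and do not reproduce the models in [12] or [13].

```python
def rank(query_terms, weights):
    """Forward pass: the query activates its term nodes (activation 1),
    and each document node sums its weighted term inputs."""
    scores = {doc: sum(term_weights.get(t, 0.0) for t in query_terms)
              for doc, term_weights in weights.items()}
    order = sorted(scores, key=scores.get, reverse=True)
    return order, scores

def feedback(query_terms, weights, doc, relevant, rate=0.5):
    """Delta-rule update: move the document's activation towards
    1 (relevant) or 0 (not relevant) for this query."""
    target = 1.0 if relevant else 0.0
    activation = sum(weights[doc].get(t, 0.0) for t in query_terms)
    error = target - activation
    for t in query_terms:
        # Each active term node has activation 1, so the step is rate * error.
        weights[doc][t] = weights[doc].get(t, 0.0) + rate * error

# Hypothetical term -> document connection weights.
weights = {
    "d1": {"neural": 0.6, "network": 0.3},
    "d2": {"neural": 0.2, "ranking": 0.7},
}
query = ["neural", "ranking"]

order_before, _ = rank(query, weights)

# The user judges d1 relevant and d2 not relevant.
feedback(query, weights, "d1", relevant=True)
feedback(query, weights, "d2", relevant=False)

order_after, _ = rank(query, weights)
```

Before feedback `d2` is ranked above `d1`; after the two judgments the weights have shifted and `d1` comes first for the same query.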
Summary and Conclusion
• AI & DM have a strong influence on all processes of information retrieval
• Multimedia retrieval is currently one of the most interesting fields of research here.
Interested in AI/DM? -> Go to the courses:
• “Data Mining” 1 & 2
• “Computational Methods for Document Analysis”
Resources
• [1] Manning, Christopher; Raghavan, Prabhakar; Schütze, Hinrich: “Introduction to Information Retrieval”
• [2] Lewis, D.D.: “Learning in intelligent information retrieval”, Proceedings of the International Workshop on Machine Learning (Evanston, Illinois), pp. 235–239, 1991
• [3] McCarthy, John: “What is Artificial Intelligence?”, www-formal.stanford.edu/jmc/whatisai/node, 2007
• [4] Russell, Stuart; Norvig, Peter: “Artificial Intelligence: A Modern Approach”, 3rd edition, Prentice Hall, 2010
• [5] Ertel, Wolfgang: “Grundkurs Künstliche Intelligenz: eine praxisorientierte Einführung”, 2nd ed., Vieweg und Teubner, 2009
• [6] Berthold, Michael; Hand, David J.: “Intelligent Data Analysis: An Introduction”, 2nd ed., 2007
• [7] Cunningham, S.J.; Littin, J.N.; Witten: “Applications of machine learning in information retrieval”, Annual Review of Information Science and Technology, edited by M.E. Williams, pp. 341–419, American Society for Information Science
• [8] Blanken, Henk; Vries, Arjen P.; Blok, Henk Ernst; Feng, Ling: “Multimedia Retrieval”, Springer, 2007
• [9] Mansmann, F.; Berthold, M.; Keim, D.: “Data Mining Foundations: Finding Explanations”, lecture slides, Universität Konstanz, 2011
• [10] Stock, Wolfgang G.: “Information Retrieval: Informationen suchen und finden”, Oldenbourg Wissenschaftsverlag, 2007
• [11] Eigenstuhler, Gerald; Hubmann, Alexander; Wischounig, Daniel: “Information Search and Retrieval Vorlesungsblock 05: Query Reformulation, AI in Information Retrieval”, Graz: Institut für Informationssysteme und Computer Medien, http://www.iicm.tu-graz.ac.at/isr/vo/inhalte/block_05/block05.htm#automatic_local_analysis, Jan 2010
• [12] Sigel, Christian: “Inferenznetzwerke und Neuronale Netze im Information Retrieval”, Johannes Gutenberg-Universität Mainz, 2010, http://www.informatik.uni-mainz.de/lehre/ir/seminar-wise-0910/Sigel-INNN-Folien.pdf
• [13] Chen, Hsinchun: “Machine Learning for Information Retrieval: Neural Networks, Symbolic Learning, and Genetic Algorithms”, Journal of the American Society for Information Science, 46(3):194–216, 1995