ARISTOTLE UNIVERSITY OF THESSALONIKI NATURAL LANGUAGE REQUEST ANALYSIS COMPONENT HYPERGEO 1st technical verification
Goal: to develop a document retrieval system based on statistical natural language processing.
Steps implemented:
• Corpus acquisition
• Corpus preprocessing and feature extraction
• Creation of word category map
• Development of baseline document retrieval system
Hypergeo corpus profile
Hypergeo corpus themes
Corpus processing and feature vector extraction
• Corpus processing
  • Corpus checking & merging
  • Text processing (HTML & plain text cleaning)
  • Stemming (Porter with stop-list)
• Feature vector extraction
  • Stem frequencies computation
  • Vocabulary construction
  • Bigram generation
  • Manipulation of stem frequencies & bigram files
  • Collection of contextual statistics (average context vector)
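The feature extraction steps above can be illustrated with a minimal Python sketch. It is not the project's code. The toy stop-list, the function names, and the sample documents are all assumptions. Porter stemming is left out, so tokens are treated as if they were already stems. The slide does not define the "average context vector", so the sketch assumes one plausible reading: for each word, the normalised distribution of its left neighbours concatenated with that of its right neighbours.

```python
import re
from collections import Counter

# Toy stop-list (assumption); the slide's actual stop-list is not given.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words (no stemming here)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def term_frequencies(docs_tokens):
    """Count how often each token occurs over the whole corpus."""
    counts = Counter()
    for tokens in docs_tokens:
        counts.update(tokens)
    return counts


def bigram_counts(docs_tokens):
    """Count adjacent token pairs within each document."""
    counts = Counter()
    for tokens in docs_tokens:
        counts.update(zip(tokens, tokens[1:]))
    return counts


def average_context_vectors(bigrams, vocab):
    """Assumed definition: left-neighbour distribution + right-neighbour distribution."""
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    left = {w: [0.0] * n for w in vocab}
    right = {w: [0.0] * n for w in vocab}
    for (w1, w2), count in bigrams.items():
        right[w1][index[w2]] += count  # w2 follows w1
        left[w2][index[w1]] += count   # w1 precedes w2

    def normalise(vec):
        total = sum(vec)
        return [x / total for x in vec] if total else vec

    return {w: normalise(left[w]) + normalise(right[w]) for w in vocab}


# Hypothetical two-document corpus.
docs = ["The museum of the city", "A city museum tour"]
tokens = [tokenize(d) for d in docs]
freqs = term_frequencies(tokens)
vocab = sorted(freqs)
bigrams = bigram_counts(tokens)
contexts = average_context_vectors(bigrams, vocab)
```

Each context vector has length 2 × |vocabulary|, which grows quickly on a real corpus. That is one reason a dimensionality reduction step is useful before training a map.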
Word category map creation
• Fast winner search
• Random projections for dimensionality reduction
• Word category map (first results)
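The two techniques named above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's implementation. The random projection uses a Gaussian matrix scaled by 1/sqrt(d_out), which is one common choice. The winner search is the plain exhaustive version, since the slide does not say which speed-up the "fast" variant uses. All function names and dimensions are hypothetical.

```python
import math
import random


def random_projection_matrix(d_in, d_out, seed=0):
    """d_out x d_in matrix of Gaussian entries, scaled by 1/sqrt(d_out)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d_in)] for _ in range(d_out)]


def project(matrix, x):
    """Map a d_in-dimensional vector to d_out dimensions (matrix-vector product)."""
    return [sum(r * v for r, v in zip(row, x)) for row in matrix]


def find_winner(weights, x):
    """Exhaustive winner search: the map node whose weight vector is closest to x.

    weights maps a (row, col) grid position to that node's weight vector.
    """
    def squared_distance(w):
        return sum((a - b) ** 2 for a, b in zip(w, x))

    return min(weights, key=lambda node: squared_distance(weights[node]))


# Hypothetical sizes: 1000-dimensional context vectors reduced to 50 dimensions.
R = random_projection_matrix(d_in=1000, d_out=50)
x = [0.0] * 1000
x[3], x[42] = 0.5, 0.5
y = project(R, x)

# Hypothetical 3 x 3 map with random weight vectors.
rng = random.Random(1)
som = {(r, c): [rng.gauss(0.0, 1.0) for _ in range(50)]
       for r in range(3) for c in range(3)}
winner = find_winner(som, som[(1, 2)])
```

Exhaustive search costs one distance computation per node for every input. Any faster scheme trades some of that work for an approximation or for extra bookkeeping.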
Work done so far and future objectives
Some of the characteristic nodes on the network:
• Node (1,13): clarksburg divers glasgow hump monasteri part portug roman romn tel
• Node (9,13): burjasot catacomb centr cid comfort culmin histor miss novemb pelagio piedra pride silesia stit
• Node (11,15): abdul arco belmaco creme fresco got gothic liter masstourist mediev mourn palac riunion rua splendid
• Node (13,14): american central montreal perch process produc raymond romant scienc serv triumphal unusu upgrad venic
• Node (22,4): artwork impoverish museo personalis rainer spindleruv tourism veranda vouli
• Node (23,14): backpack born calahorra citi highland huelva marri mountain nice pseudo student victor
• Node (24,15): anaya andov basqu beauti build doric extremadura fu goddess ibiza magic monasterio pulpit stone visit
• Node (25,15): arrebatacapa bedroom danc mobil orlean peplo street therapeut torno tour
Baseline document retrieval system
• Statistics: collection frequency, term frequency and document length
• Query terms are given by the user
• Stemming of the query terms (Simple and Porter stemmers)
• Lookup of each query term in the structure that holds the combined weight for each term-document pair
• Document score calculation: sum of the combined weights of all the query terms that occur in the document
• Document ranking, in a mode chosen by the user:
  • a. by estimated score
  • b. by i) the number of query terms that appear in the document and then ii) its estimated score
HYPERGEO 1st technical verification, 20/07/2000
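The scoring and ranking steps above can be sketched as follows. The slide does not give the combined weight formula, so the sketch assumes a BM25-style weight built from the three listed statistics. It uses a collection frequency weight of log N - log n, the term frequency, and the document length normalised by the average length. The constants K1 and B, the function names, and the sample documents are assumptions. Query terms are taken as already stemmed.

```python
import math
from collections import Counter

K1, B = 2.0, 0.75  # tuning constants (assumed values, not from the slides)


def build_index(docs_tokens):
    """Return {term: {doc_id: combined_weight}} under the assumed weighting."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(tokens) for tokens in docs_tokens) / n_docs
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))  # number of documents containing each term

    index = {}
    for doc_id, tokens in enumerate(docs_tokens):
        ndl = len(tokens) / avg_len  # normalised document length
        for term, tf in Counter(tokens).items():
            cfw = math.log(n_docs) - math.log(doc_freq[term])
            cw = cfw * tf * (K1 + 1) / (K1 * ((1 - B) + B * ndl) + tf)
            index.setdefault(term, {})[doc_id] = cw
    return index


def retrieve(index, query_terms, by_match_count=False):
    """Rank documents by score (mode a) or by match count, then score (mode b)."""
    scores = Counter()
    matches = Counter()
    for term in set(query_terms):
        for doc_id, cw in index.get(term, {}).items():
            scores[doc_id] += cw   # document score: sum of combined weights
            matches[doc_id] += 1   # number of distinct query terms present
    if by_match_count:
        key = lambda d: (matches[d], scores[d])
    else:
        key = lambda d: scores[d]
    return sorted(scores, key=key, reverse=True)


# Hypothetical corpus of four token lists.
docs = [
    ["museum", "city", "tour", "bus", "hotel", "map", "guide"],
    ["museum", "museum", "museum"],
    ["city", "beach"],
    ["beach", "market"],
]
index = build_index(docs)
ranking_a = retrieve(index, ["museum", "city"])
ranking_b = retrieve(index, ["museum", "city"], by_match_count=True)
```

On this toy corpus the two modes disagree. The short document that repeats "museum" has the highest score and leads in mode a. The longer document that contains both query terms leads in mode b.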
Recall-precision graph for the query “museum”
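The points of a recall-precision graph are normally computed from a ranked result list and a set of relevance judgements. The sketch below shows the standard calculation. The document identifiers and judgements are made up for illustration and are not the results reported on the slide.

```python
def recall_precision_points(ranking, relevant):
    """Return (recall, precision) at each rank where a relevant document appears."""
    relevant = set(relevant)
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points


# Hypothetical ranked list and relevance judgements.
points = recall_precision_points(ranking=[3, 1, 4, 2, 5], relevant={3, 2})
```

Here the first relevant document is found at rank 1 and the second at rank 4, which gives the points (0.5, 1.0) and (1.0, 0.5).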