A search engine for a mixture of European languages
Overview • Introduction • Design • Data • Features • Document alignment • Language detection and query translation • Link analysis • Evaluation • Future work • API - Interface Demo
Introduction • Goals • Approach • Project name
Goals • Build a cross-language search engine for a large collection of European web documents • Participate in WebCLEF • Create topics • Submit runs • Do something extra
Main Challenge • Deal with multiple languages • The search engine will have to • Accept queries in multiple languages • Return results in multiple languages
Approach • Feature-based • Take an existing retrieval engine • Extend it with several features • Document alignment (dictionaries) • Language detection and query translation • Link analysis • User interface + project website
Project Name • MELANGE • Multi European LANGuage Engine
Overview • Introduction • Design • Data • Features • Document alignment • Language detection and query translation • Link analysis • Evaluation • Future work • API - Interface Demo
Approach • Focus on cross-lingual features • The goal was not to build a search engine from scratch • We used an existing search engine to jump-start our development
Terrier • Open-source Java search engine • Modular design • Clean and clear interfaces • Easy to understand • Designed to be extended: new modules can be plugged in • Several ranking models built in
Apache Cocoon • Online document publishing framework • Open-source Java project • XML-based • Highly flexible ‘pipeline’ system
Features • Query language classification • Query translation using dictionaries • Use PageRank to improve scores • Feature-rich user interface
Interface Use Cases The user can: • Submit a query • Specify their native language, then submit • Optionally: specify a filter for preferred languages in the results
Melange Architecture • One ‘search’ pipeline in Cocoon • Features are integrated as ‘transformers’ on this pipeline
Overview • Introduction • Design • Data • Features • Document alignment • Language detection and query translation • Link analysis • Evaluation • Future work • API - Interface Demo
About the EuroGOV dataset • Pages of government websites from 27 domains: at be cy cz de dk ee es eu.int fi fr gr hu ie it lt lu lv mt nl pl pt ru se si sk uk • 86 gigabytes of raw data • 11 gigabytes compressed • Includes crude language detection and a list of exact duplicates identified by MD5 hashes
EuroGOV domain files • Every domain consists of 1 to 27 compressed files of 6-220 MB each: se/001.gz se/002.gz se/003.gz se/004.gz se/005.gz si/001.gz sk/001.gz sk/002.gz sk/003.gz • These files contain multiple documents (up to 25,000) in pseudo-XML. “This might smell like XML but it will not be XML.”
Bad news: pseudo-XML Nothing is escaped: • & appears unescaped in content • Content may contain nested <![CDATA[ or ]]> • Even worse: the URL attribute can contain a " which is not escaped!

<EuroGOV:doc url="http://www.regeringen.se/" id="Ese-001-35"
    md5="659b462005b40f04bde5946b2beaad71"
    fetchDate="Wed Sep 22 10:57:39 MEST 2004" contentType="text/html">
  <EuroGOV:content>
    <![CDATA[ ... content ... ]]>
  </EuroGOV:content>
</EuroGOV:doc>
URLs are very unclean • Bad URL attribute:

url="http://www.micr.cz/scripts/detail.php?id=1410">"

should have been

url="http://www.micr.cz/scripts/detail.php?id=1410"

• And:

url="http://www.bmgs.bund.de/deu/txt/service/links/="/deu/txt/index_1766.cfm""

should have been

url="http://www.bmgs.bund.de/deu/txt/index_1766.cfm"

but became

url="http://www.bmgs.bund.de/deu/txt/service/links/"
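Because of these quirks, a real XML parser cannot be used. A minimal sketch of the tolerant approach, assuming one regular expression per document and a simple repair for the unescaped-quote URLs (the regex and helper names are ours, not the actual module's):

import gzip
import re

# A tolerant regular expression instead of an XML parser, since the format
# is not real XML. Anchoring the content match on the closing element tag
# (not just on ']]>') keeps nested ']]>' sequences inside the content from
# ending the match early; backtracking lets attrs absorb stray '>' and '"'.
DOC_RE = re.compile(
    r'<EuroGOV:doc\s+(?P<attrs>.*?)>\s*<EuroGOV:content>\s*<!\[CDATA\['
    r'(?P<content>.*?)\]\]>\s*</EuroGOV:content>\s*</EuroGOV:doc>',
    re.DOTALL)

def clean_url(url):
    # Repair the unescaped-quote artifact: keep everything before the first
    # stray quote, so '...detail.php?id=1410">' becomes '...detail.php?id=1410'
    return url.split('"')[0]

raw = gzip.open('se/001.gz', 'rt', encoding='latin-1').read()
for match in DOC_RE.finditer(raw):
    attrs, content = match.group('attrs'), match.group('content')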
Python data tools • The Terrier indexing mechanism was designed to be fast • Parsers can be replaced by your own classes • But it is designed for sequential term-by-term processing only • And it only supports parsing a document ID and the document content, not the URL, content type, etc. • Solution: a Python EuroGOV parser module
Python EuroGOV parser design Class EuroGOVProcessor with overridable methods: • processStart • domainStart: “be” • domainFileCheck: “001.gz” • documentHeaderCheck: check by URL, document ID only • documentProcess • … • domainFileCheck: “002.gz” • … • domainEnd • processEnd
DocumentProcess • Inside the documentProcess event, an instance of the Document class is passed • Supports extraction of: • url, id, HTTP content-type header, date, md5 • content format (HTML, PDF, Word) • codepage • URL extraction • HTML -> text • HTML tag extraction (e.g. get the content of the <title> tag)
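A minimal sketch of how such a processor might be subclassed; the event names are the ones listed above, while the Document accessor names and the run() entry point are assumptions for illustration:

class TitleLister(EuroGOVProcessor):
    # documentHeaderCheck / documentProcess are the events from the
    # previous slides; everything else here is hypothetical
    def documentHeaderCheck(self, url, doc_id):
        # Cheap filter on header fields, before the content is decoded
        return 'scripts' not in url

    def documentProcess(self, doc):
        # doc is a Document instance as described above
        print(doc.id, doc.url, doc.tag('title'))   # hypothetical accessors

TitleLister().run('EuroGOV/')   # hypothetical driver entry point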
Tools created • Extracting clean text with unambiguous language detection • Extracting link structure from the dataset • Quickly extracting a single document • Dataset language reclassification • Converting the dataset into something indexable by Terrier TREC classes • Snippet server (getting the raw document in real-time)
Thanks to… • The BeautifulSoup module, very valuable for robust HTML parsing: “You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like.” • Psyco, a specializing compiler for Python (like a Java JIT compiler), making unmodified programs run 2-100x faster
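A sketch of how these fit together; this one is deliberately Python 2-era, since Psyco only exists for 32-bit Python 2 and this is the old pre-bs4 BeautifulSoup import:

import psyco; psyco.full()                 # JIT-compiles the hot Python code
from BeautifulSoup import BeautifulSoup    # parses even badly broken HTML

soup = BeautifulSoup(open('page.html').read())
title = soup.title.string if soup.title else ''
text = ' '.join(soup.findAll(text=True))   # strip all tags -> indexable text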
Indexing with Terrier • Converted the EuroGOV dataset into the TREC file format (with Unicode support) • The raw text (annotated with language), compressed, is 2.5 GB instead of 11 GB • Terrier supposedly stands for “Terabyte retrieval engine”, but it hung with an OutOfMemory exception after 80% • Increasing the Java VM size from 512 MB to 2 GB ‘fixed’ it • The same had to be done for the retrieval engine…
Terrier indexing: metadata • Added support for extra metadata per document: PageRank and language • Requires rewriting classes such as DocumentIndex, DocumentIndexBuilder, etc. • Variable-length fields cannot be added • DocumentIndex contains a very nice assumption: documents must be indexed in lexicographical order of the document ID string, otherwise lookup by this string will not work
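For illustration, a sketch of the conversion step with that assumption respected: write the documents in TREC format, sorted lexicographically by document ID. The function shape is assumed, not the actual tool, and the <LANG> field stands in for our extra metadata:

def write_trec(docs, out):
    # docs: iterable of (doc_id, language, text) tuples; sorting them keeps
    # Terrier's lookup-by-DOCNO working (see the assumption above)
    for doc_id, lang, text in sorted(docs):
        out.write('<DOC>\n<DOCNO>%s</DOCNO>\n' % doc_id)
        out.write('<LANG>%s</LANG>\n' % lang)   # extra per-document metadata
        out.write('%s\n</DOC>\n' % text)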
Statistics • Constructing the dataset in TREC format took about a day on 7 staff machines in parallel (about 4 days on 1 machine) • True indexing with Terrier took 24 hours • Index size: 7.4 GB • Serving snippets requires uncompressed dataset (86 GB)
Overview • Introduction • Design • Data • Features • Document alignment • Language detection and query translation • Link analysis • Evaluation • Future work • API - Interface Demo
Overview Document Alignment • Introduction • Goal • Approach • Results
Introduction • Sentence level • Word level • Dataset
Goal • A simple, automatic chain for building dictionaries from two texts that are known to be translations of each other • Easy-to-use, self-explanatory dictionaries
Approach • Step 1: collect data (buildDocs.java) • Step 2: prepare data for sentence alignment (plain2align.py) • Step 3: sentence alignment (align) [Gale & Church] • Step 4: rewrite the output of Step 3 (align2giza.py) • Step 5: prepare the output of Step 4 for word alignment (plain2snt.out) • Step 6: produce word classes (mkcls) [Och] • Step 7: word alignment (GIZA++ v2) [Och]
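A sketch of gluing these steps together from Python; the script and binary names are the ones listed above, but the file names and exact argument conventions are assumptions for illustration:

import subprocess

def run(cmd):
    print('>> ' + ' '.join(cmd))
    subprocess.check_call(cmd)

run(['java', 'buildDocs'])                             # Step 1: collect data
run(['python', 'plain2align.py', 'nl.txt', 'en.txt'])  # Step 2: prepare
run(['./align', 'nl.prep', 'en.prep'])                 # Step 3: Gale & Church
run(['python', 'align2giza.py'])                       # Step 4: rewrite output
run(['./plain2snt.out', 'nl', 'en'])                   # Step 5: .vcb/.snt files
run(['./mkcls', '-pnl', '-Vnl.classes'])               # Step 6: word classes
run(['./mkcls', '-pen', '-Ven.classes'])
run(['./GIZA++', '-S', 'nl.vcb', '-T', 'en.vcb',       # Step 7: word alignment
     '-C', 'nl_en.snt'])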
Results • http://student.science.uva.nl/~eigenman/ii/dict.php
Overview • Introduction • Design • Data • Features • Document alignment • Language detection and query translation • Link analysis • Evaluation • Future work • API - Interface Demo
Query Translation - Introduction • Once the query's language is detected, it has to be translated • Into a subset of the languages the user has selected (all supported EU languages by default) • Reliable offline dictionaries could not be found • Travlang's ERGANE: vocabularies too limited • FREELANG: can be edited online by anyone, and its data cannot be accessed without a GUI • How about online dictionaries?
Query Translation – Online translating tools • Google's Translation Tools, BabelFish, WorldLingo, etc. • Slower, with limited language support • Have to connect to the URL for every query • Typically support 8-9 languages within the EU • WorldLingo was the translator of choice • Offers a Textbox Translator and a Website Translator • Available in 9 major EU languages, which restricted MELANGE's multilingual support from then on • Performs some phrase matching • Uses HTML forms for input/output
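Since the service is driven through an HTML form, querying it amounts to an HTTP POST plus some scraping. A heavily hedged sketch: the endpoint URL and field names below are invented placeholders, not WorldLingo's real interface:

from urllib.parse import urlencode
from urllib.request import urlopen

def translate(text, src, tgt):
    # Endpoint and field names are placeholders, not the real form's
    params = urlencode({'text': text, 'srclang': src, 'trglang': tgt})
    page = urlopen('http://translator.example.com/translate',
                   params.encode('ascii')).read()
    # The translation comes back embedded in an HTML page and still has to
    # be scraped out (e.g. with BeautifulSoup) before use
    return page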
Query Expansion • Use the local dictionaries, built by the Document Alignment team, to expand the query • They contain probabilities of word matches across documents in different languages • They often contain synonyms or related words • Append words with relatively high probability to the query translation • Term-based, as opposed to online translators, which are phrase-based
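A minimal sketch of the expansion step, assuming the dictionaries map a source term to (translation, probability) pairs; the data shape and the 0.3 cutoff are our assumptions:

def expand(query_terms, dictionary, threshold=0.3):
    # dictionary: {'house': [('huis', 0.8), ('woning', 0.4), ('kamer', 0.1)]}
    expanded = []
    for term in query_terms:
        for translation, prob in dictionary.get(term, []):
            if prob >= threshold:          # keep only confident matches
                expanded.append(translation)
    return expanded

# expand(['house'], dictionary) -> ['huis', 'woning']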
Language Detection • Has been studied for the last few years • Considered a solved problem • Techniques: • Word lists • Common words • N-grams
Basic Process • Standard machine learning process: train a model per language on labelled text, then classify unseen text
Reinvent the Wheel? • We decided to implement our own language detector • Reasons: • We weren't convinced by the performance of freely available tools • Learning factor! • New approach
Our Approach • Need to detect the language of short queries accurately • Decided to use character n-grams, specifically trigrams
Common N-Gram Techniques • Extract all n-grams from text and order by frequency
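For reference, a sketch of that classic frequency-profile approach (parameter choices are ours): count every character n-gram and keep the top of the ranking as the language profile; classification compares a text's profile against per-language profiles.

def ngram_profile(text, n=3, top=300):
    # Count every character n-gram, then rank by frequency
    counts = {}
    padded = ' ' + text.lower() + ' '
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        counts[gram] = counts.get(gram, 0) + 1
    ranked = sorted(counts, key=counts.get, reverse=True)
    return ranked[:top]   # the profile: most frequent n-grams first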
N-Grams Our Way • Inspired by stochastic language models • Uses a probabilistic way to define a syntax • Rules are stored in the form of n-grams: (CONTEXT, WORD, PROBABILITY) • Can be used for generating strings • Can also be used to calculate the probability that a string was generated by a grammar
The Basic Idea • Define a stochastic language model for generating words in each language • Classify by comparing probabilities that a word was generated by a language
Example • Training text: “Test text” • Trigram rules have the form (CONTEXT, CHAR, PROBABILITY) = (c_{n-2} c_{n-1}, c_n, P(c_n | c_{n-2} c_{n-1})) • The probabilities calculated from the training text are: (^^, t, 1.0) (^t, e, 1.0) (te, s, 0.5) (te, x, 0.5) (es, t, 1.0) (ex, t, 1.0) (st, ' ', 1.0) (xt, ' ', 1.0), where ^ pads the start of a word and ' ' marks its end
Classification • The probability that a word w = c_1 … c_m was generated is the product of its trigram probabilities: P(w) = P(c_1 | ^^) · P(c_2 | ^ c_1) · … · P(' ' | c_{m-1} c_m) • For the previous example: P(test) = 1.0 · 1.0 · 0.5 · 1.0 · 1.0 = 0.5 (and likewise P(text) = 0.5) • Classification: assign the language whose model gives the word, and hence the query, the highest probability
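A minimal sketch of this model (the class shape and API are ours); training it on “Test text” reproduces the probabilities above, e.g. prob('test') = 0.5:

class TrigramModel:
    def __init__(self):
        self.counts = {}   # context 'c1c2' -> {next_char: count}

    def train(self, text):
        for word in text.lower().split():
            padded = '^^' + word + ' '   # pad the start, mark the end
            for i in range(2, len(padded)):
                ctx, ch = padded[i-2:i], padded[i]
                bucket = self.counts.setdefault(ctx, {})
                bucket[ch] = bucket.get(ch, 0) + 1

    def prob(self, word):
        # Product of P(c_n | c_{n-2} c_{n-1}) over the padded word
        p = 1.0
        padded = '^^' + word.lower() + ' '
        for i in range(2, len(padded)):
            seen = self.counts.get(padded[i-2:i], {})
            total = sum(seen.values())
            if not total or padded[i] not in seen:
                return 0.0   # the zero-frequency problem (next slide)
            p *= float(seen[padded[i]]) / total
        return p

def classify(word, models):
    # models: {'en': TrigramModel, 'nl': TrigramModel, ...}
    return max(models, key=lambda lang: models[lang].prob(word))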
Implementation - Issues • Speed • Encoding • Machine precision (multiplying many small probabilities underflows) • Noisy data • Zero frequency (trigrams never seen in training)
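For illustration, a sketch of the standard fixes for the last two numeric issues, under our own parameter choices: work in the log domain to avoid underflow, and apply add-one smoothing so unseen trigrams do not zero out the score:

import math

def log_prob(model, word, alphabet_size=256):
    # model is a TrigramModel as sketched on the previous slide
    score = 0.0
    padded = '^^' + word.lower() + ' '
    for i in range(2, len(padded)):
        seen = model.counts.get(padded[i-2:i], {})
        total = sum(seen.values())
        # Add-one (Laplace) smoothing: unseen trigrams keep a small
        # non-zero probability instead of collapsing the product to 0
        p = (seen.get(padded[i], 0) + 1.0) / (total + alphabet_size)
        score += math.log(p)   # summing logs never underflows
    return score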