A search engine for a mixture of European languages
Overview • Introduction • Design • Data • Features • Document alignment • Language detection and query translation • Link analysis • Evaluation • Future work • API - Interface Demo
Introduction • Goals • Approach • Project name
Goals • Build a cross-language search engine for a large collection of European web documents • Participate in WebCLEF • Create topics • Submit runs • Do something extra
Main Challenge • Deal with multiple languages • The search engine will have to • Accept queries in multiple languages • Return results in multiple languages
Approach • Feature-based • Take an existing retrieval engine • Extend it with several features • Document alignment (dictionaries) • Language detection and query translation • Link analysis • User interface + project website
Project Name • MELANGE • Multi European LANGuage Engine
Overview • Introduction • Design • Data • Features • Document alignment • Language detection and query translation • Link analysis • Evaluation • Future work • API - Interface Demo
Approach • Focus on cross-lingual features • The goal was not to build a search engine from scratch • We used an existing search engine to jump-start our development
Terrier • Open-source Java search engine • Modular design • Clean and clear interfaces • Easy to understand • Designed to be extended: new modules can be plugged in • Several ranking models built in
Apache Cocoon • Online document publishing framework • Open-source Java project • XML-based • Highly flexible ‘pipeline’ system
Features • Query language classification • Query translation using dictionaries • Use PageRank to improve scores • Feature-rich user interface
Interface Use Cases The user can: • Submit a query • Specify their native language, then submit • Optionally: specify a filter for preferred languages in the results
Melange Architecture • One ‘search’ pipeline in Cocoon • Features are integrated as ‘transformers’ on this pipeline
Overview • Introduction • Design • Data • Features • Document alignment • Language detection and query translation • Link analysis • Evaluation • Future work • API - Interface Demo
About the EuroGOV dataset • Pages of government websites from 27 domains: at be cy cz de dk ee es eu.int fi fr gr hu ie it lt lu lv mt nl pl pt ru se si sk uk • 86 gigabytes of raw data • 11 gigabytes compressed • Includes crude language detection and a list of exact duplicates identified by MD5 hashes
EuroGOV domain files • Every domain consists of 1 to 27 compressed files of 6-220 MB each: se/001.gz se/002.gz se/003.gz se/004.gz se/005.gz si/001.gz sk/001.gz sk/002.gz sk/003.gz • These files contain multiple documents (up to 25,000) in pseudo-XML. “This might smell like XML but it will not be XML.”
Bad news: pseudo-XML Nothing is escaped: • & appears unescaped in content • Content may contain nested <![CDATA[ or ]]> • Even worse: the URL attribute can contain a " which is not escaped!

<EuroGOV:doc url="http://www.regeringen.se/" id="Ese-001-35"
    md5="659b462005b40f04bde5946b2beaad71"
    fetchDate="Wed Sep 22 10:57:39 MEST 2004" contentType="text/html">
  <EuroGOV:content>
    <![CDATA[ ... content ... ]]>
  </EuroGOV:content>
</EuroGOV:doc>
URLs are very unclean • Bad URL attribute:

url="http://www.micr.cz/scripts/detail.php?id=1410">"

should have been

url="http://www.micr.cz/scripts/detail.php?id=1410"

• And:

url="http://www.bmgs.bund.de/deu/txt/service/links/="/deu/txt/index_1766.cfm""

should have been

url="http://www.bmgs.bund.de/deu/txt/index_1766.cfm"

but became

url="http://www.bmgs.bund.de/deu/txt/service/links/"
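Because of these quirks, a real XML parser cannot be used. A minimal sketch of the tolerant approach, assuming one regular expression per document and a simple repair for the unescaped-quote URLs (the regex and helper names are ours, not the actual module's):

import gzip
import re

# A tolerant regular expression instead of an XML parser, since the format
# is not real XML. Anchoring the content match on the closing element tag
# (not just on ']]>') keeps nested ']]>' sequences inside the content from
# ending the match early; backtracking lets attrs absorb stray '>' and '"'.
DOC_RE = re.compile(
    r'<EuroGOV:doc\s+(?P<attrs>.*?)>\s*<EuroGOV:content>\s*<!\[CDATA\['
    r'(?P<content>.*?)\]\]>\s*</EuroGOV:content>\s*</EuroGOV:doc>',
    re.DOTALL)

def clean_url(url):
    # Repair the unescaped-quote artifact: keep everything before the first
    # stray quote, so '...detail.php?id=1410">' becomes '...detail.php?id=1410'
    return url.split('"')[0]

raw = gzip.open('se/001.gz', 'rt', encoding='latin-1').read()
for match in DOC_RE.finditer(raw):
    attrs, content = match.group('attrs'), match.group('content')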
Python data tools • The Terrier indexing mechanism was designed to be fast • Parsers can be replaced by your own classes • But it is designed for sequential term-by-term processing only • And it only supports parsing a document ID and the document content, not the URL, content type, etc. • Solution: a Python EuroGOV parser module
Python EuroGOV parser design Class EuroGOVProcessor with overridable methods: • processStart • domainStart: “be” • domainFileCheck: “001.gz” • documentHeaderCheck: check by URL, document ID only • documentProcess • … • domainFileCheck: “002.gz” • … • domainEnd • processEnd
DocumentProcess • Inside the documentProcess event, an instance of the Document class is passed • Supports extraction of: • url, id, HTTP content-type header, date, md5 • content format (HTML, PDF, Word) • codepage • URL extraction • HTML -> text • HTML tag extraction (e.g. get the content of the <title> tag)
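A minimal sketch of how such a processor might be subclassed; the event names are the ones listed above, while the Document accessor names and the run() entry point are assumptions for illustration:

class TitleLister(EuroGOVProcessor):
    # documentHeaderCheck / documentProcess are the events from the
    # previous slides; everything else here is hypothetical
    def documentHeaderCheck(self, url, doc_id):
        # Cheap filter on header fields, before the content is decoded
        return 'scripts' not in url

    def documentProcess(self, doc):
        # doc is a Document instance as described above
        print(doc.id, doc.url, doc.tag('title'))   # hypothetical accessors

TitleLister().run('EuroGOV/')   # hypothetical driver entry point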
Tools created • Extracting clean text with unambiguous language detection • Extracting link structure from the dataset • Quickly extracting a single document • Dataset language reclassification • Converting the dataset into something indexable by Terrier TREC classes • Snippet server (getting the raw document in real-time)
Thanks to… • The BeautifulSoup module, very valuable for robust HTML parsing: “You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like.” • Psyco, a specializing compiler for Python (like a Java JIT compiler), making unmodified programs run 2-100x faster
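A sketch of how these fit together; this one is deliberately Python 2-era, since Psyco only exists for 32-bit Python 2 and this is the old pre-bs4 BeautifulSoup import:

import psyco; psyco.full()                 # JIT-compiles the hot Python code
from BeautifulSoup import BeautifulSoup    # parses even badly broken HTML

soup = BeautifulSoup(open('page.html').read())
title = soup.title.string if soup.title else ''
text = ' '.join(soup.findAll(text=True))   # strip all tags -> indexable text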
Indexing with Terrier • Converted the EuroGOV dataset into the TREC file format (with Unicode support) • The raw text (annotated with language), compressed, is 2.5 GB instead of 11 GB • Terrier supposedly stands for “Terabyte retrieval engine”, but it hung with an OutOfMemory exception after 80% • Increasing the Java VM size from 512 MB to 2 GB ‘fixed’ it • The same had to be done for the retrieval engine…
Terrier indexing: metadata • Added support for extra metadata per document: PageRank and language • Requires rewriting classes such as DocumentIndex, DocumentIndexBuilder, etc. • Variable-length fields cannot be added • DocumentIndex contains a very nice assumption: documents must be indexed in lexicographical order of the document ID string, otherwise lookup by this string will not work
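For illustration, a sketch of the conversion step with that assumption respected: write the documents in TREC format, sorted lexicographically by document ID. The function shape is assumed, not the actual tool, and the <LANG> field stands in for our extra metadata:

def write_trec(docs, out):
    # docs: iterable of (doc_id, language, text) tuples; sorting them keeps
    # Terrier's lookup-by-DOCNO working (see the assumption above)
    for doc_id, lang, text in sorted(docs):
        out.write('<DOC>\n<DOCNO>%s</DOCNO>\n' % doc_id)
        out.write('<LANG>%s</LANG>\n' % lang)   # extra per-document metadata
        out.write('%s\n</DOC>\n' % text)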
Statistics • Constructing the dataset in TREC format took about a day on 7 staff machines in parallel (about 4 days on 1 machine) • True indexing with Terrier took 24 hours • Index size: 7.4 GB • Serving snippets requires uncompressed dataset (86 GB)
Overview • Introduction • Design • Data • Features • Document alignment • Language detection and query translation • Link analysis • Evaluation • Future work • API - Interface Demo
Overview Document Alignment • Introduction • Goal • Approach • Results
Introduction • Sentence level • Word level • Dataset
Goal • A simple, automatic chain for building dictionaries from two texts that are known to be translations of each other • Easy-to-use, self-explanatory dictionaries
Approach • Step 1: collect data (buildDocs.java) • Step 2: prepare data for sentence alignment (plain2align.py) • Step 3: sentence alignment (align) [Gale & Church] • Step 4: rewrite the output of Step 3 (align2giza.py) • Step 5: prepare the output of Step 4 for word alignment (plain2snt.out) • Step 6: produce word classes (mkcls) [Och] • Step 7: word alignment (GIZA++ v2) [Och]
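A sketch of gluing these steps together from Python; the script and binary names are the ones listed above, but the file names and exact argument conventions are assumptions for illustration:

import subprocess

def run(cmd):
    print('>> ' + ' '.join(cmd))
    subprocess.check_call(cmd)

run(['java', 'buildDocs'])                             # Step 1: collect data
run(['python', 'plain2align.py', 'nl.txt', 'en.txt'])  # Step 2: prepare
run(['./align', 'nl.prep', 'en.prep'])                 # Step 3: Gale & Church
run(['python', 'align2giza.py'])                       # Step 4: rewrite output
run(['./plain2snt.out', 'nl', 'en'])                   # Step 5: .vcb/.snt files
run(['./mkcls', '-pnl', '-Vnl.classes'])               # Step 6: word classes
run(['./mkcls', '-pen', '-Ven.classes'])
run(['./GIZA++', '-S', 'nl.vcb', '-T', 'en.vcb',       # Step 7: word alignment
     '-C', 'nl_en.snt'])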
Results • http://student.science.uva.nl/~eigenman/ii/dict.php
Overview • Introduction • Design • Data • Features • Document alignment • Language detection and query translation • Link analysis • Evaluation • Future work • API - Interface Demo
Query Translation - Introduction • Once the query's language is detected, it has to be translated • Into a subset of the languages the user has selected (all supported EU languages by default) • Reliable offline dictionaries could not be found • Travlang's ERGANE: vocabularies too limited • FREELANG: can be edited online by anyone, and its data cannot be accessed without a GUI • How about online dictionaries?
Query Translation – Online translating tools • Google's Translation Tools, BabelFish, WorldLingo, etc. • Slower, with limited language support • Have to connect to the URL for every query • Typically support 8-9 languages within the EU • WorldLingo was the translator of choice • Offers a Textbox Translator and a Website Translator • Available in 9 major EU languages, which restricted MELANGE's multilingual support from then on • Performs some phrase matching • Uses HTML forms for input/output
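Since the service is driven through an HTML form, querying it amounts to an HTTP POST plus some scraping. A heavily hedged sketch: the endpoint URL and field names below are invented placeholders, not WorldLingo's real interface:

from urllib.parse import urlencode
from urllib.request import urlopen

def translate(text, src, tgt):
    # Endpoint and field names are placeholders, not the real form's
    params = urlencode({'text': text, 'srclang': src, 'trglang': tgt})
    page = urlopen('http://translator.example.com/translate',
                   params.encode('ascii')).read()
    # The translation comes back embedded in an HTML page and still has to
    # be scraped out (e.g. with BeautifulSoup) before use
    return page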
Query Expansion • Use the local dictionaries, built by the Document Alignment team, to expand the query • They contain probabilities of word matches across documents in different languages • They often contain synonyms or related words • Append words with relatively high probability to the query translation • Term-based, as opposed to online translators, which are phrase-based
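A minimal sketch of the expansion step, assuming the dictionaries map a source term to (translation, probability) pairs; the data shape and the 0.3 cutoff are our assumptions:

def expand(query_terms, dictionary, threshold=0.3):
    # dictionary: {'house': [('huis', 0.8), ('woning', 0.4), ('kamer', 0.1)]}
    expanded = []
    for term in query_terms:
        for translation, prob in dictionary.get(term, []):
            if prob >= threshold:          # keep only confident matches
                expanded.append(translation)
    return expanded

# expand(['house'], dictionary) -> ['huis', 'woning']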
Language Detection • Has been studied for the last few years • Considered a solved problem • Techniques: • Word lists • Common words • N-grams
Basic Process • Standard machine learning process: train a model per language on labelled text, then classify unseen text
Reinvent the Wheel? • We decided to implement our own language detector • Reasons: • We weren't convinced by the performance of freely available tools • Learning factor! • New approach
Our Approach • Need to detect the language of short queries accurately • Decided to use character n-grams, specifically trigrams
Common N-Gram Techniques • Extract all n-grams from text and order by frequency
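For reference, a sketch of that classic frequency-profile approach (parameter choices are ours): count every character n-gram and keep the top of the ranking as the language profile; classification compares a text's profile against per-language profiles.

def ngram_profile(text, n=3, top=300):
    # Count every character n-gram, then rank by frequency
    counts = {}
    padded = ' ' + text.lower() + ' '
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        counts[gram] = counts.get(gram, 0) + 1
    ranked = sorted(counts, key=counts.get, reverse=True)
    return ranked[:top]   # the profile: most frequent n-grams first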
N-Grams Our Way • Inspired by stochastic language models • Uses a probabilistic way to define a syntax • Rules are stored in the form of n-grams: (CONTEXT, WORD, PROBABILITY) • Can be used for generating strings • Can also be used to calculate the probability that a string was generated by a grammar
The Basic Idea • Define a stochastic language model for generating words in each language • Classify by comparing probabilities that a word was generated by a language
Example • Training text: “Test text” • Trigram rules have the form (CONTEXT, CHAR, PROBABILITY) = (c_{n-2} c_{n-1}, c_n, P(c_n | c_{n-2} c_{n-1})) • The probabilities calculated from the training text are: (^^, t, 1.0) (^t, e, 1.0) (te, s, 0.5) (te, x, 0.5) (es, t, 1.0) (ex, t, 1.0) (st, ' ', 1.0) (xt, ' ', 1.0), where ^ pads the start of a word and ' ' marks its end
Classification • The probability that a word w = c_1 … c_m was generated is the product of its trigram probabilities: P(w) = P(c_1 | ^^) · P(c_2 | ^ c_1) · … · P(' ' | c_{m-1} c_m) • For the previous example: P(test) = 1.0 · 1.0 · 0.5 · 1.0 · 1.0 = 0.5 (and likewise P(text) = 0.5) • Classification: assign the language whose model gives the word, and hence the query, the highest probability
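A minimal sketch of this model (the class shape and API are ours); training it on “Test text” reproduces the probabilities above, e.g. prob('test') = 0.5:

class TrigramModel:
    def __init__(self):
        self.counts = {}   # context 'c1c2' -> {next_char: count}

    def train(self, text):
        for word in text.lower().split():
            padded = '^^' + word + ' '   # pad the start, mark the end
            for i in range(2, len(padded)):
                ctx, ch = padded[i-2:i], padded[i]
                bucket = self.counts.setdefault(ctx, {})
                bucket[ch] = bucket.get(ch, 0) + 1

    def prob(self, word):
        # Product of P(c_n | c_{n-2} c_{n-1}) over the padded word
        p = 1.0
        padded = '^^' + word.lower() + ' '
        for i in range(2, len(padded)):
            seen = self.counts.get(padded[i-2:i], {})
            total = sum(seen.values())
            if not total or padded[i] not in seen:
                return 0.0   # the zero-frequency problem (next slide)
            p *= float(seen[padded[i]]) / total
        return p

def classify(word, models):
    # models: {'en': TrigramModel, 'nl': TrigramModel, ...}
    return max(models, key=lambda lang: models[lang].prob(word))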
Implementation - Issues • Speed • Encoding • Machine precision (multiplying many small probabilities underflows) • Noisy data • Zero frequency (trigrams never seen in training)
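For illustration, a sketch of the standard fixes for the last two numeric issues, under our own parameter choices: work in the log domain to avoid underflow, and apply add-one smoothing so unseen trigrams do not zero out the score:

import math

def log_prob(model, word, alphabet_size=256):
    # model is a TrigramModel as sketched on the previous slide
    score = 0.0
    padded = '^^' + word.lower() + ' '
    for i in range(2, len(padded)):
        seen = model.counts.get(padded[i-2:i], {})
        total = sum(seen.values())
        # Add-one (Laplace) smoothing: unseen trigrams keep a small
        # non-zero probability instead of collapsing the product to 0
        p = (seen.get(padded[i], 0) + 1.0) / (total + alphabet_size)
        score += math.log(p)   # summing logs never underflows
    return score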