Solving Italian Crosswords Using the Web
Giovanni Angelini, Marco Ernandes and Marco Gori
DII, University of Siena
http://airgroup.dii.unisi.it
http://webcrow.dii.unisi.it
Solving Crosswords
Crossword puzzles are probably the most popular language game. Solving them requires a combination of skills:
• A good comprehension of the clues, which are sometimes expressed in an ambiguous or tricky way.
• A large knowledge base.
• A clever heuristic to fill the puzzle correctly.
Problems such as solving crosswords from clues are considered AI-complete.
The idea
Attack crosswords (within competition time limits) by making use of the Web, which is an extremely rich and self-updating repository of human knowledge. Try to endow real-life concepts with semantics using:
• The Web.
• Search engines.
• Information retrieval and machine learning techniques.
Two main sub-problems
• Clue answering: the aim is to associate each clue with the correct answer word. For each clue, a ranked list of candidate solutions has to be generated.
• Grid filling: a Constraint Satisfaction Problem (CSP). From each clue list a candidate has to be chosen and inserted into the crossword puzzle, trying to satisfy the intrinsic constraints.
Clue Answering
Clue answering differs from Question Answering in several ways:
• There is no standard interrogative form.
• There is an intrinsic and intentional ambiguity.
• The topic of the clue can be either factoid or non-factoid.
• There is a unique and precise correct answer: a single word or a compound word.
• Very high recall is required: a candidate list that misses the target can be disastrous for grid filling.
Generating the candidate lists
Modules for generating the candidate lists:
• The Web Search Module: finds answers by exploiting the Web and search engines.
• The Data-Based Module: returns possible candidates by exact and partial matching against the clues of previously solved crosswords.
• The Rule-Based Module: deals with clues whose answers have no semantic relation to the clue but are cryptically hidden inside the clue itself.
• The Dictionary Module: increases the overall coverage of clue answering.
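The exact and partial matching performed by the Data-Based Module could look roughly like the sketch below. It is only an illustration under assumptions not stated on the slide: the function names, the Jaccard token overlap used as the partial-match score, the `min_overlap` threshold and the toy clue database are all hypothetical.

```python
import re


def tokenize(clue):
    """Lower-case a clue and split it into a set of word tokens."""
    return set(re.findall(r"\w+", clue.lower()))


def lookup(clue, length, solved, min_overlap=0.5):
    """Rank answers of previously solved clues by token overlap with `clue`.

    `solved` is a list of (old_clue, answer) pairs. An exact match scores 1.0;
    partial matches score their Jaccard overlap (an assumed similarity measure).
    """
    query = tokenize(clue)
    scores = {}
    for old_clue, answer in solved:
        if len(answer) != length:  # the answer must fit the slot
            continue
        old = tokenize(old_clue)
        union = query | old
        overlap = len(query & old) / len(union) if union else 0.0
        if overlap >= min_overlap:
            scores[answer] = max(scores.get(answer, 0.0), overlap)
    # Best score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical toy database of solved clues.
solved = [
    ("Capital of Italy", "rome"),
    ("Capital of Peru", "lima"),
    ("Italian river", "arno"),
]
print(lookup("Capital of Italy", 4, solved))
```

The exact match ranks first, and a clue that shares only part of its wording still contributes a lower-ranked candidate, which helps keep recall high.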
The Web Search Module
There are four tasks:
• The retrieval of useful web documents.
• The extraction of the answer candidates from these documents.
• The scoring/filtering of the candidate lists.
• The estimation of the list confidence.
Retrieving useful documents
• Each clue C = t1 t2 … tn generates a maximum of 2 queries: Q1 = <t1 and t2 and … tn>, Q2 = <t1 or t2 or … tn>. Non-informative words are removed from the queries.
• The first n ranked documents are downloaded (this step is time consuming).
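The query generation described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the `build_queries` name, the tiny English stop-word list and the `AND`/`OR` operator spelling are assumptions, since the real operator syntax depends on the search engine used.

```python
import re

# Hypothetical stop-word list standing in for the "non-informative words".
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "or"}


def build_queries(clue, stop_words=STOP_WORDS):
    """Return at most two queries for a clue: Q1 (all terms) and Q2 (any term)."""
    tokens = re.findall(r"\w+", clue.lower())
    terms = [t for t in tokens if t not in stop_words]
    if not terms:
        return []
    q1 = " AND ".join(terms)
    q2 = " OR ".join(terms)
    # With a single informative term both queries coincide, so keep only one.
    return [q1] if q1 == q2 else [q1, q2]


print(build_queries("The capital of Italy"))
```

Q1 favours precision (documents containing every term), while Q2 acts as a higher-recall fallback when few documents match all the terms.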
Extracting and ranking the candidates
• The documents are analysed by a parser that outputs plain ASCII text.
• This text is passed to a list generator that extracts the words (or compound words) of the correct length.
• The extracted words are then passed to two submodules: a statistical filter, based on IR techniques, and a morphological filter, based on machine learning and NLP techniques.
Finally, the score-probability for each candidate w is: [equation not preserved]
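The list-generator step above (extracting words or compound words of the correct length) could be sketched as below. The function name and the `max_words` limit on compound size are assumptions made for illustration; the slides do not say how compounds are formed, so adjacent words are simply concatenated here.

```python
import re


def extract_candidates(text, length, max_words=2):
    """Collect single words and adjacent-word compounds with exactly `length` letters."""
    # [^\W\d_]+ matches runs of letters only, including accented ones.
    words = re.findall(r"[^\W\d_]+", text.lower())
    candidates = set()
    for i in range(len(words)):
        compound = ""
        for j in range(i, min(i + max_words, len(words))):
            compound += words[j]
            if len(compound) == length:
                candidates.add(compound)
            elif len(compound) > length:
                break  # adding more words can only make it longer
    return candidates


print(sorted(extract_candidates("New York is a big city", 7)))
```

The generator deliberately over-produces (here it also yields a spurious compound next to the plausible one); pruning such noise is the job of the statistical and morphological filters.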
The Statistical filter
Score of the candidate w inside document Di retrieved with query Qn: [equation not preserved]
The distance between word wk and query Qn inside Di: [equation not preserved]
Target position
Figure: frequency of the target in the first n positions of the candidate list, in relation to its length, with and without the Web Search Module (WSM).
Merging the lists and filling the grid
• Merging: all the candidate lists for a slot are merged into a single list.
• Grid filling: the goal is to assign a word to each slot so as to maximize the similarity between the final configuration and the target solution. We adopted the maximum probability function: [equation not preserved]
Due to the time restrictions and the complexity of the problem, we chose a CSP version of WA* as the solving algorithm.
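To make the grid-filling step more concrete, here is a generic weighted A* (WA*) search over slot assignments. It is a sketch under stated assumptions, not WebCrow's solver: the objective is assumed to be the product of the chosen candidates' probabilities, and the data layout (`candidates`, `crossings`), the `weight` value and the slot names are hypothetical. Probabilities must be positive and every candidate list non-empty.

```python
import heapq
import itertools
import math


def fill_grid(candidates, crossings, weight=1.5):
    """Assign one word per slot, favouring high-probability consistent fillings.

    candidates: {slot: [(word, prob), ...]} with prob > 0
    crossings:  [(slot_a, idx_a, slot_b, idx_b), ...] cells shared by two slots
    Returns {slot: word}, or None if no consistent filling exists.
    """
    slots = sorted(candidates)
    # Cost = -log(prob): minimizing the summed cost maximizes the product.
    cost = {s: [(-math.log(p), w) for w, p in candidates[s]] for s in slots}
    # h[k]: optimistic cost of the slots not yet assigned (best word of each).
    h = [0.0] * (len(slots) + 1)
    for k in range(len(slots) - 1, -1, -1):
        h[k] = h[k + 1] + min(c for c, _ in cost[slots[k]])

    def consistent(assign, slot, word):
        """Check that `word` agrees with already placed words on shared cells."""
        for a, i, b, j in crossings:
            if a == slot and b in assign and assign[b][j] != word[i]:
                return False
            if b == slot and a in assign and assign[a][i] != word[j]:
                return False
        return True

    tie = itertools.count()  # tie-breaker so the heap never compares dicts
    frontier = [(weight * h[0], next(tie), 0.0, {})]
    while frontier:
        _, _, g, assign = heapq.heappop(frontier)
        k = len(assign)
        if k == len(slots):
            return assign
        slot = slots[k]
        for c, word in cost[slot]:
            if consistent(assign, slot, word):
                new = dict(assign)
                new[slot] = word
                f = g + c + weight * h[k + 1]  # WA*: f = g + weight * h
                heapq.heappush(frontier, (f, next(tie), g + c, new))
    return None


# Two 3-letter slots that share their first cell.
candidates = {
    "1A": [("cat", 0.6), ("dog", 0.4)],
    "1D": [("dot", 0.7), ("cow", 0.3)],
}
crossings = [("1A", 0, "1D", 0)]
print(fill_grid(candidates, crossings))
```

With `weight=1` this is plain A* and returns a maximum-probability filling; a weight above 1 reaches a complete grid sooner at the price of possibly missing the optimum, which is the kind of trade-off a competition time limit calls for.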
The data set
The crossword collection is partitioned into five subsets:
• T1ord: examples of ordinary difficulty from La Settimana Enigmistica.
• T1dif: examples designed for skilled cruciverbalists, from La Settimana Enigmistica.
• T2new: crosswords published in 2004 in La Repubblica.
• T2old: crosswords published in 2001-2003 in La Repubblica.
• T3: a miscellany of examples from crossword-specialized web sites.
Experimental results
Over the full test set, the system achieves 68.8% correct words and 79.9% correct letters. Extending the time limit to 45 minutes increases performance by 7% on average.
Conclusions
• Promising results: the version of WebCrow discussed here is basic, but it has already given very promising results.
• Web-search approach: the web-search approach has proved to be very consistent.
• Many interesting problems: we believe the approach could suit all problems in which semantics and interpretation play an important role.
• Expert modules: WebCrow's overall architecture allows several expert modules to be plugged in to improve the system's performance.
Our Objectives
• WebCrow vs humans: one of our main objectives is to build a system that is competitive with human experts in solving crosswords, and hopefully to challenge masters in a real competition.
• Crosswords in different languages: we aim for a system capable of solving crosswords in different languages by exploiting language-independent and data-driven techniques, such as machine learning, while avoiding (or limiting) the pre-compiled rules commonly used by question answering systems.