Synonymous Paraphrasing Using WordNet and Internet. Igor A. Bolshakov & Alexander Gelbukh, Center for Computing Research, National Polytechnic Institute, Mexico City, Mexico. {igor,gelbukh}@cic.ipn.mx
Contents • Synopsis • Absolute and Relative Synonyms • Collocations • Evaluations of Collocations via Internet • Types of Synonymous Paraphrasing • Algorithm of Interactive Paraphrasing • An Experiment on Text Paraphrasing • Another Application: Style Evaluation • Yet Another Application: Linguistic Steganography
Synopsis – 1 We propose a method of synonymous paraphrasing of a text based on • WordNet synonymy data and • Internet statistics of stable word combinations (collocations). Given a text, we look for words or word sequences in it for which WordNet provides synonyms, and substitute them with such synonyms only if the latter form valid collocations with the surrounding words according to the statistics gathered from Google
Synopsis – 2 We present two important applications of local synonymous paraphrasing: • Style checking and correction: automatic evaluation and computer-aided improvement of writing style with regard to various aspects • Steganography: hiding of additional information in the given text by special selection of collocationally verified synonyms
Absolute and Relative Synonymy: in general • Text variations that preserve the meaning of the whole text are called synonymous paraphrasings • There are global and local types of synonymous paraphrasing • Local paraphrasing only replaces separate words (those which have synonyms), preserving the word order and the number of words • Synonyms are words or multiwords that can replace each other in some class of contexts with insignificant change of the whole text's meaning • A synonymy dictionary consists of groups of words considered synonyms of each other • WordNet contains a type of synonymy dictionary • There are absolute and relative synonyms
Absolute and Relative Synonyms: Examples • Relative synonyms - {(to) schedule, plan, design, map out, project, lay on, scheme} - {rollercoaster, big dipper, Russian mountains} • Absolute synonyms - {sofa, settee} - {United States of America, United States, USA, US} - {former president, ex-president}
Synonymy Dictionary: we need • A synonymy dictionary such as the one in WordNet or EuroWordNet • A specially compiled dictionary of absolute synonyms that contains all the above-mentioned types of English equivalents. Our algorithms look up the absolute synonymy subdictionary first
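The lookup order described on this slide can be sketched as follows. This is only an illustration: the two toy dictionaries reuse the example groups from the previous slide, and the function name `lookup_synonyms` is hypothetical. A real system would query WordNet and a hand-compiled subdictionary of absolute synonyms.

```python
# Toy data taken from the examples slide; stand-ins for the real dictionaries.
ABSOLUTE_SYNONYMS = [
    {"sofa", "settee"},
    {"United States of America", "United States", "USA", "US"},
    {"former president", "ex-president"},
]
RELATIVE_SYNSETS = [
    {"schedule", "plan", "design", "map out", "project", "lay on", "scheme"},
    {"rollercoaster", "big dipper", "Russian mountains"},
]


def lookup_synonyms(word):
    """Return (kind, candidates), trying the absolute subdictionary first.

    Hypothetical helper: kind is "absolute", "relative" or None.
    """
    for kind, groups in (("absolute", ABSOLUTE_SYNONYMS),
                         ("relative", RELATIVE_SYNSETS)):
        merged = set()
        for group in groups:
            if word in group:
                merged |= group  # union of all groups containing the word
        if merged:
            return kind, sorted(merged - {word})
    return None, []
```

Absolute synonyms are returned as soon as they are found, so the relative synsets are consulted only for words that have no absolute equivalents.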
Collocations: in general • A collocation is a syntactically connected and semantically compatible pair of content (i.e. non-functional) words • Syntactic connectedness is understood as in dependency grammars (I. Mel'čuk) • Examples of English collocations are: full-length dress, well expressed, to briefly expose, to pick up the knife, to listen to the radio, energy field, to promise to marry, to flatly reject • Collocation components are connected to each other directly or through auxiliary words
Collocation Databases: For English, collocation databases exist only in printed form. The best is: Oxford Collocations Dictionary for Students of English. Oxford University Press, 2003. In this paper we treat the Google search engine as a collocation database
Evaluations of Collocations via Google: in general • Google statistics on occurrences of words or word sequences are given as the number of web pages containing these items, however many times they occur on each page • There are only two ways to evaluate the occurrence numbers of a collocation by giving its components: • in quotation marks (underestimation) • without them (overestimation) • It is necessary to propose a heuristic measure in between these two • It is also necessary to introduce a threshold to exclude marginal situations
Evaluations of Collocations via Google Statistics on synonymous collocations with project
Evaluations of Collocations via Google: Collocations with synonyms of departments: departments 42%, offices 15%, services 43%
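The slides do not say which in-between measure is used, so the sketch below is an assumption and not the authors' formula. It takes the geometric mean of the quoted count (underestimate) and the unquoted count (overestimate), then normalizes the estimates over a synonym group into shares comparable to the percentages above. The function names and the zeroing of shares below the threshold are likewise illustrative.

```python
import math


def collocation_estimate(quoted_hits, unquoted_hits):
    """Assumed in-between measure: geometric mean of the two page counts."""
    if quoted_hits < 0 or unquoted_hits < 0:
        raise ValueError("page counts must be non-negative")
    return math.sqrt(quoted_hits * unquoted_hits)


def appropriateness(estimates, threshold):
    """Turn per-synonym estimates into shares of the group total.

    Shares below the threshold are set to 0 so that marginal
    candidates are excluded (an assumed reading of the threshold).
    """
    total = sum(estimates.values())
    if total == 0:
        return {v: 0.0 for v in estimates}
    shares = {v: e / total for v, e in estimates.items()}
    return {v: (s if s >= threshold else 0.0) for v, s in shares.items()}
```

The geometric mean always lies between the two counts, which is the only property the slide requires; any other monotone interpolation would fit the description equally well.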
Types of Synonymous Paraphrasing • Text compression: the shortest synonyms are taken • Text canonization: the most frequently used synonyms are taken • Text simplification: synonyms more intelligible for language-impaired persons are taken (special marks of colloquialism are needed) • Conformistic variations: synonyms are taken at random according to their Internet distribution • Individualistic variations: nearly marginal synonyms within the Internet distribution are taken
Algorithm of Interactive Paraphrasing
Ask mode: one of {compression, canonization, simplification, conformistic, individualistic}
Ask marginality threshold in (0,1) and sensitivity threshold in (0,1)
For each content word or multiword w which is a member of a synset
    Let S = union of all relevant synsets for w
    For each word v in S
        If its appropriateness a(v) < marginality threshold then set score(v) = 0
        else
            If mode = compression then set score(v) = 1 / length(v)
            If mode = canonization then set score(v) = a(v)
            If mode = simplification then set score(v) as described in S. 5
            If mode = conformistic then set score(v) = random from 0 to a(v)
            If mode = individualistic then set score(v) = 1 / a(v)
    If score(w) / max over v in S of score(v) < sensitivity threshold then suggest to the user all variants v in S with score(v) > 0, in the order of score(v)
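A minimal Python sketch of the scoring loop above, under stated assumptions: appropriateness values a(v) are supplied by the caller as a dictionary, the simplification mode is left out because its scoring is defined elsewhere, and "in the order of score(v)" is read as descending order. The function names are hypothetical.

```python
import random


def score_candidates(candidates, mode, marginality, rng=random):
    """Score each synonym v given its appropriateness a(v).

    candidates: dict mapping word -> a(v) in [0, 1].
    """
    scores = {}
    for v, a in candidates.items():
        if a < marginality or a <= 0:
            scores[v] = 0.0            # marginal candidate, never suggested
        elif mode == "compression":
            scores[v] = 1.0 / len(v)   # shorter is better
        elif mode == "canonization":
            scores[v] = a              # more frequent is better
        elif mode == "conformistic":
            scores[v] = rng.uniform(0.0, a)
        elif mode == "individualistic":
            scores[v] = 1.0 / a        # rarer (but not marginal) is better
        else:
            raise ValueError(f"unsupported mode: {mode}")
    return scores


def suggest(word, candidates, mode, marginality, sensitivity, rng=random):
    """Return variants to offer the user for `word`, best first (assumed order)."""
    scores = score_candidates(candidates, mode, marginality, rng)
    best = max(scores.values(), default=0.0)
    if best == 0.0:
        return []
    if scores.get(word, 0.0) / best < sensitivity:
        kept = [v for v, s in scores.items() if s > 0]
        return sorted(kept, key=lambda v: scores[v], reverse=True)
    return []
```

For example, feeding the shares 0.42, 0.15 and 0.43 for departments, offices and services in canonization mode with a marginality threshold of 0.2 drops offices, and a sensitivity threshold of 0.99 is then just strict enough to trigger a suggestion, since 0.42 / 0.43 is about 0.977.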
An Experiment on Text Paraphrasing: The source text with possible replacements. The Georgian foreign minister (foreign office head) is scheduled (planned, designed, mapped out, projected, laid on, schemed) to meet (have a meeting, rendezvous) with the heads (chiefs, top executives) of various (different, diverse) Russian departments (offices, services) and with a deputy of Russian foreign minister (foreign office head). “Issues (problems, questions, items) concerning (pertaining, touching, regarding) the future (coming, prospective) contacts at the higher (high-rank) level will be discussed (considered, debated, parleyed, ventilated, reasoned, negotiated, talked about) in the course of the meeting (receptions, buzz sessions, interviews),” said Georgian ambassador to Russia Zurab Abashidze. The Georgian foreign minister (foreign office head) will be in (visit) Moscow on a private (privy) visit (trip), the Russian Foreign Ministry reported (communicated, informed, conveyed, announced).
An Experiment on Text Paraphrasing: The text with conformistic variations. The Georgian foreign office head is planned to have a meeting with the heads of diverse Russian offices and with a deputy of Russian foreign office head. “Questions touching the future contacts at the high-rank level will be debated in the course of the interviews,” said Georgian ambassador to Russia Zurab Abashidze. The Georgian foreign minister will visit Moscow on a private trip, the Russian Foreign Ministry informed.
Style Evaluation: for Compressibility
Set Compressibility to 0
For each content word w in the text
    Set S = union of all relevant synsets containing w
    Remove from S the members v below the marginality threshold
    Let v0 be the shortest word in S
    Increase Compressibility by length(w) - length(v0)
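The compressibility measure above can be sketched in a few lines. The data layout (a dictionary from each word to its candidates with appropriateness values) and the function name are assumptions. The sketch also keeps w itself among the candidates so that the gain for a word is never negative, which the slide does not state explicitly.

```python
def compressibility(words, synonyms, marginality):
    """Total number of characters that could be saved by local paraphrasing.

    words: content words of the text, in order.
    synonyms: dict word -> {candidate: appropriateness}; hypothetical layout.
    """
    total = 0
    for w in words:
        candidates = {
            v for v, a in synonyms.get(w, {}).items() if a >= marginality
        }
        candidates.add(w)  # assumption: the source word is always admissible
        shortest = min(candidates, key=len)
        total += len(w) - len(shortest)
    return total
```

With the illustrative shares used earlier, "departments" (11 letters) can be replaced by "services" (8 letters) once "offices" is removed as marginal, which contributes 3 to the total.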
Linguistic Steganography: Two Inputs • The information I to be hidden, merely as a bit sequence • Any natural language text of the minimal length of approximately 250 per bit of I. The text is orthographically correct and semantically “common” (not a sequence of proper names, numbers, rhymes, etc.)
Linguistic Steganography: Algorithm • Search of synonyms: single words or multiwords that have their own synsets are found • Formation of synonymy groups: unions of all relevant synsets are formed • Collocational verification of synonyms: each member of the current group containing relative synonyms is tested, together with its context words, as a potential collocation by Google statistics, and all inappropriate options are discarded • Enciphering: the current group is cut in length to the nearest power of 2, i.e. to 2^p members; the next p-bit piece s of I is taken; the s-th synonym replaces the source synonym • Reagreement
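The enciphering step can be illustrated with the sketch below. It assumes that each group is an already verified list of distinct synonyms in an order known to both sender and receiver, that "nearest power of 2" means the largest power of 2 not exceeding the group size, and that a short final piece of I is padded with zeros. The function names are hypothetical, and the reagreement step is not modelled.

```python
def group_capacity(group):
    """Number of bits p a group can carry: largest p with 2**p <= len(group)."""
    if not group:
        raise ValueError("a synonymy group must not be empty")
    return len(group).bit_length() - 1


def embed_bits(groups, bits):
    """Pick one synonym per group so that the choices encode the bit string."""
    chosen, pos = [], 0
    for group in groups:
        p = group_capacity(group)
        chunk = bits[pos:pos + p].ljust(p, "0")  # zero-pad a short tail
        s = int(chunk, 2) if p else 0            # index of the synonym to use
        chosen.append(group[s])
        pos += p
    return chosen


def extract_bits(groups, words):
    """Recover the bit string from the synonyms found in the text."""
    out = []
    for group, word in zip(groups, words):
        p = group_capacity(group)
        if p:
            out.append(format(group.index(word), f"0{p}b"))
    return "".join(out)
```

A group of three synonyms carries one bit and a group of four carries two, so hiding the bits 110 in the groups [departments, offices, services] and [issues, problems, questions, items] selects "offices" and "questions". The receiver needs the length of I separately in order to drop any padding.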
Linguistic SteganographyMore detail in the paper: Bolshakov, I.A. A Method of Linguistic Steganography Based on Collocation-Proven Synonymy. In: Proceedings of International Information Hiding Workshop IH2004, Toronto, Canada, May 2004. Lecture Notes in Computer Science, Springer, 2004 (now available only in the preprint form)
Thank you! Igor A. Bolshakov igor@cic.ipn.mx