Genetic Programming Applied to Natural Language Processing. Dr. Nguyen Quang Uy Faculty of Information Technology Military Technical Academy Hanoi, Vietnam. Outline. An Introduction to Genetic Programming Genetic Programming Applied to Natural Language Processing Text Summarization
Genetic Programming Applied to Natural Language Processing Dr. Nguyen Quang Uy Faculty of Information Technology Military Technical Academy Hanoi, Vietnam
Outline • An Introduction to Genetic Programming • Genetic Programming Applied to Natural Language Processing • Text Summarization • Information Retrieval • Text Classification
What is Genetic Programming (GP)? • Genetic Programming is an evolutionary paradigm, inspired by biological evolution, for finding solutions that perform a user-defined task
Genetic Programming: The History • Initiated during the 1980s (Cramer 1985, Schmidhuber 1987) • Broadened by Koza (1992) • More than 5000 researchers and more than 7000 publications (Langdon, GP bibliography: http://www.cs.bham.ac.uk/~wbl/biblio)
Genetic Programming: The Algorithm • Step 0: Stochastically generate the initial population, P(0) • Repeat • Step 1: Evaluate the fitness (how good it is) of each individual in population P(t) • Step 2: Select parents from P(t) based on their fitness • Step 3: Apply stochastic variation operators to the parents to get P(t+1) • Until termination conditions are satisfied
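The loop above can be sketched generically. This is a minimal illustration, not the presenter's implementation: the names `evolve`, `init`, `fitness`, `select`, `vary` and `best_of_three` are hypothetical, lower fitness is assumed to be better, and the toy individuals are plain integers rather than program trees.

```python
import random

def evolve(init, fitness, select, vary, pop_size=50, max_gens=100, seed=0):
    """Generic evolutionary loop. Assumes lower fitness is better and 0 is perfect."""
    rng = random.Random(seed)
    population = [init(rng) for _ in range(pop_size)]            # Step 0
    for _ in range(max_gens):
        scores = [fitness(ind) for ind in population]            # Step 1
        if min(scores) == 0:                                     # termination condition
            break
        parents = [select(population, scores, rng)               # Step 2
                   for _ in range(pop_size)]
        population = [vary(p, rng) for p in parents]             # Step 3
    return min(population, key=fitness)

def best_of_three(population, scores, rng):
    """Pick three individuals at random and return the fittest of them."""
    picks = rng.sample(range(len(population)), 3)
    return population[min(picks, key=lambda i: scores[i])]

# Toy run: individuals are integers and the target value is 42.
best = evolve(
    init=lambda rng: rng.randint(0, 100),
    fitness=lambda x: abs(x - 42),
    select=best_of_three,
    vary=lambda x, rng: x + rng.choice([-1, 0, 1]),
)
print(best)
```

In real GP the `init` and `vary` arguments would build and modify program trees; only those two pieces change, the loop stays the same.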
Tree Based Genetic Programming • Various structures are used to represent the solution in GP, but tree structures are the most well known. • Original Idea: Cramer (1985); Schmidhuber (1987); Koza (1992) • [Figure: tree with root + and leaves 1 and 2]
The Advantages of Tree Based Representation • Allows GP to find the structure of the solutions • Allows computer programs to be evolved
Tree Structure • Nonterminal nodes (function nodes) are nodes that have at least one child. • Terminal nodes are the nodes that make up the leaves of a tree. • [Figure: example tree with nodes +, *, sin, 1, 2, 2]
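A small sketch of how such a tree might be stored and evaluated. The nested-tuple encoding, the names `FUNCTIONS`, `evaluate` and `terminals`, and the exact shape of `example` (read here as (1 * 2) + sin(2) from the symbols on the slide) are assumptions made for illustration.

```python
import math

# A tree is either a terminal (a number or a variable name) or a tuple
# (function_name, child, child, ...). Function nodes have at least one child.
FUNCTIONS = {
    "+": lambda a, b: a + b,
    "*": lambda a, b: a * b,
    "sin": math.sin,
}

def evaluate(tree, env=None):
    """Recursively evaluate a tree; env maps variable names to values."""
    env = env or {}
    if isinstance(tree, tuple):                 # nonterminal (function) node
        name, *children = tree
        return FUNCTIONS[name](*(evaluate(c, env) for c in children))
    if isinstance(tree, str):                   # variable terminal
        return env[tree]
    return tree                                 # constant terminal

def terminals(tree):
    """Return the leaves of the tree, left to right."""
    if isinstance(tree, tuple):
        return [t for child in tree[1:] for t in terminals(child)]
    return [tree]

example = ("+", ("*", 1, 2), ("sin", 2))
print(evaluate(example), terminals(example))
```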
Genetic Programming Components • Terminal Set • Work as set of primitive data types • Constants • Parameterless functions • Input Values • Function set • Set of available functions • Often tailored specifically for the needs of the program domain.
Sufficiency & Closure • Function and terminal sets must satisfy the principles of closure and sufficiency: • Closure: • Every function f must be capable of accepting the value of every terminal t from the terminal set and the output of every function f from the function set. • Sufficiency: • A solution to the problem at hand must exist in the space of programs created from the function set and terminal set.
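Closure is often enforced by "protecting" functions that are undefined for some inputs. The sketch below shows the widely used protected division; the name `protected_div` and the choice of 1.0 as the fallback value are conventions assumed here, not something stated on the slide.

```python
def protected_div(a, b):
    """Return a / b, or 1.0 when b is zero, so any subtree output is acceptable."""
    return a / b if b != 0 else 1.0

print(protected_div(6, 3), protected_div(1, 0))
```

With this wrapper a division node can sit above any subtree, including one that evaluates to zero, without crashing the evaluation of the individual.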
Genetic Programming Initialisation • Typical: ramped half-and-half initialisation • Ramped: • Choose a lower and upper bound for tree depth • Generate trees with maximum depths distributed uniformly between these bounds • Half and half: • 50% full trees • At depth bound, nodes chosen uniformly randomly from constant symbols • Elsewhere, nodes chosen randomly from the function symbols • 50% grow trees • At depth bound, nodes chosen randomly from constant symbols • Elsewhere, nodes chosen randomly from all symbols
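A possible sketch of ramped half-and-half under the description above. The symbol sets, the tuple encoding of trees, and the way individuals are assigned to depths and methods are illustrative assumptions.

```python
import random

FUNCTIONS = {"+": 2, "-": 2, "*": 2}    # name -> arity (illustrative set)
TERMINALS = ["x", 1, 2]                 # illustrative terminal symbols

def full(depth, rng):
    """Full method: function symbols everywhere except at the depth bound."""
    if depth == 0:
        return rng.choice(TERMINALS)
    name = rng.choice(list(FUNCTIONS))
    return (name, *(full(depth - 1, rng) for _ in range(FUNCTIONS[name])))

def grow(depth, rng):
    """Grow method: any symbol above the depth bound, terminals at the bound."""
    if depth == 0:
        return rng.choice(TERMINALS)
    symbol = rng.choice(list(FUNCTIONS) + TERMINALS)
    if symbol in FUNCTIONS:
        return (symbol, *(grow(depth - 1, rng) for _ in range(FUNCTIONS[symbol])))
    return symbol

def depth_of(tree):
    return 1 + max(depth_of(c) for c in tree[1:]) if isinstance(tree, tuple) else 0

def ramped_half_and_half(pop_size, min_depth, max_depth, rng):
    n_depths = max_depth - min_depth + 1
    population = []
    for i in range(pop_size):
        depth = min_depth + i % n_depths                       # ramp over depth bounds
        method = full if (i // n_depths) % 2 == 0 else grow    # alternate full / grow
        population.append(method(depth, rng))
    return population

population = ramped_half_and_half(24, 2, 5, random.Random(1))
print(population[0])
```

When `pop_size` is a multiple of twice the number of depth bounds, exactly half of the trees at each bound come from each method.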
Genetic Operators: Crossover • Crossover is the primary operator in GP • Crossover • Randomly select a node in the mother • Randomly select a node in the father • Swap the two nodes along with their subtrees
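The swap described above can be sketched on trees stored as nested tuples. The helper names (`paths`, `get`, `replace`, `crossover`) and the two parent trees are hypothetical, and nodes are picked uniformly at random, which is only one possible choice.

```python
import random

def paths(tree, prefix=()):
    """Yield the path (tuple of child indices) to every node in the tree."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))

def get(tree, path):
    """Return the subtree found by following path from the root."""
    for i in path:
        tree = tree[i]
    return tree

def replace(tree, path, subtree):
    """Return a copy of tree with the node at path replaced by subtree."""
    if not path:
        return subtree
    i = path[0]
    return tree[:i] + (replace(tree[i], path[1:], subtree),) + tree[i + 1:]

def crossover(mother, father, rng):
    """Swap one randomly chosen subtree of each parent; returns two children."""
    pm = rng.choice(list(paths(mother)))
    pf = rng.choice(list(paths(father)))
    return (replace(mother, pm, get(father, pf)),
            replace(father, pf, get(mother, pm)))

mother = ("+", ("*", 1, 2), ("sin", 2))
father = ("-", ("abs", -7), ("/", 13, 4))
child1, child2 = crossover(mother, father, random.Random(0))
print(child1, child2)
```

Because tuples are immutable, the parents are left untouched and can be selected again for further crossovers.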
Crossover Example • [Figure: Parent 1 and Parent 2 exchange randomly selected subtrees, producing Child 1 and Child 2]
Genetic Operations: Mutation • Mutation • Randomly select a node in the program tree • Remove that node and its subtree • Replace the node with a new subtree, using the same method used to initially instantiate the population. • Typically, mutation is applied to a small number of offspring after crossover.
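A minimal sketch of subtree mutation on nested-tuple trees. It assumes the replacement subtree is produced by a simple grow-style generator; the symbol sets, the 0.3 early-stop probability, and the helper names are illustrative choices.

```python
import random

FUNCTIONS = {"+": 2, "-": 2, "*": 2}    # name -> arity (illustrative set)
TERMINALS = ["x", 1, 2]

def grow(depth, rng):
    """Generate a random subtree no deeper than depth."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    name = rng.choice(list(FUNCTIONS))
    return (name, *(grow(depth - 1, rng) for _ in range(FUNCTIONS[name])))

def paths(tree, prefix=()):
    """Yield the path (tuple of child indices) to every node in the tree."""
    yield prefix
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))

def replace(tree, path, subtree):
    """Return a copy of tree with the node at path replaced by subtree."""
    if not path:
        return subtree
    i = path[0]
    return tree[:i] + (replace(tree[i], path[1:], subtree),) + tree[i + 1:]

def mutate(tree, rng, max_depth=2):
    """Remove one randomly chosen node with its subtree and grow a new one there."""
    path = rng.choice(list(paths(tree)))
    return replace(tree, path, grow(max_depth, rng))

parent = ("+", ("*", 1, 2), ("+", 3, 4))
child = mutate(parent, random.Random(3))
print(child)
```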
Mutation Example • The left subtree is randomly selected for mutation. • The entire subtree is replaced. • [Figure: expression tree before and after mutation]
Fitness Measures • Fitness gives “graded and continuous feedback about how well a program performs on the training set” • Standardized Fitness • Fitness scores are transformed so that 0 is the fitness of the most fit member. • Normalized Fitness • Fitness is transformed to values that always are between 0 and 1.
Sample Fitness Measures • Error Fitness • The sum of the absolute values of the differences between the computed result and the desired result: $f_p = \sum_{i=1}^{N} |p_i - o_i|$ • Where: $f_p$ is the fitness of the pth individual in the population, $o_i$ is the desired output for the ith example in the training set, $p_i$ is the output from the pth individual on the ith example in the training set, and $N$ is the number of examples in the training set
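The error fitness can be computed directly from its definition. In this sketch a program is assumed to be a plain Python callable, and `adjusted` shows one common mapping of an error into the range (0, 1], namely 1 / (1 + error); both function names are hypothetical.

```python
def error_fitness(program, examples):
    """Sum of absolute differences between computed and desired outputs."""
    return sum(abs(program(x) - desired) for x, desired in examples)

def adjusted(standardized):
    """Map a standardized fitness (0 is best) into (0, 1], where 1 is best."""
    return 1.0 / (1.0 + standardized)

examples = [(0, 0), (1, 4), (2, 30)]
perfect = lambda x: x**4 + x**3 + x**2 + x
constant_zero = lambda x: 0

print(error_fitness(perfect, examples), error_fitness(constant_zero, examples))
print(adjusted(error_fitness(perfect, examples)))
```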
Fitness Measures can be as Varied as the Applications • Examples • Number of correct solutions • Number of wins competing against other members of the population. • Number of errors navigating a maze • Time required to solve a puzzle
GP Selection • Truncation selection • Select the best k% of the population • Generally too eager • Fitness proportionate selection • Probability of selection proportionate to fitness • Tournament selection • Choose k individuals uniformly randomly • Select the best of those individuals • Eagerness tunable by k • Larger k = more eager algorithm • The most commonly used today
Summary: The Preparatory Steps for GP • Define the terminal set • Define the function set • Define the fitness function • Define parameters such as population size, maximum individual size, crossover probability, mutation probability, selection method • Define the method for terminating a run
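Tournament selection as described above fits in a few lines. The function name `tournament_select` is hypothetical, and lower fitness is assumed to be better (as with an error measure).

```python
import random

def tournament_select(population, fitnesses, k, rng):
    """Choose k individuals uniformly at random and return the best of them."""
    contenders = rng.sample(range(len(population)), k)
    winner = min(contenders, key=lambda i: fitnesses[i])   # lower is better here
    return population[winner]

population = ["a", "b", "c", "d"]
fitnesses = [3, 0, 2, 5]
rng = random.Random(0)
print(tournament_select(population, fitnesses, 2, rng))
```

With k equal to the population size the best individual always wins, and with k = 1 selection is uniformly random, which is the sense in which k tunes eagerness.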
Example: Symbolic Regression • Problem: Can GP evolve a function that fits the following data?
x    f(x)
0    0
1    4
2    30
3    120
4    340
5    780
6    1554
7    2800
8    4680
GP Symbolic Regression • Function Set: +, -, *, / • Terminal Set: X • Fitness Measure: use the absolute difference of the error. Best normalized fitness is 0. • Parameters: Population Size = 500, Max Generations = 50, Crossover = 90%, Mutation = 10%, Reproduction = 10%. Selection is by Tournament Selection (size 3). Creation is performed using RAMP_HALF_AND_HALF. Max depth of tree: 16 • Termination Condition: Program with fitness 0 found.
Results • The following zero-fitness individual was found after four generations: (add (add (mul (mul X X) (mul X X)) (mul (mul X X) (- X))) (sub X (sub (sub (sub X X) (mul X X)) (mul (add X X) (mul X X))))) • It correctly captures the function: f(x) = x^4 + x^3 + x^2 + x
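The evolved individual can be checked against the data by transcribing it into Python. This sketch assumes that `(- X)` denotes unary negation; the helper functions simply mirror the primitive names in the expression.

```python
def add(a, b): return a + b
def sub(a, b): return a - b
def mul(a, b): return a * b

def evolved(X):
    """Direct transcription of the evolved individual."""
    return add(add(mul(mul(X, X), mul(X, X)), mul(mul(X, X), -X)),
               sub(X, sub(sub(sub(X, X), mul(X, X)),
                          mul(add(X, X), mul(X, X)))))

data = {0: 0, 1: 4, 2: 30, 3: 120, 4: 340, 5: 780, 6: 1554, 7: 2800, 8: 4680}
total_error = sum(abs(evolved(x) - y) for x, y in data.items())
print(total_error)
```

Expanding the expression by hand gives (x^4 - x^3) + (x + x^2 + 2x^3), which simplifies to x^4 + x^3 + x^2 + x.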
Genetic Programming Applications • Electronic Design
Genetic Programming Applications • Antenna Design for NASA
Genetic Programming Applications • Ecological Modelling
Genetic Programming: The Human Competitive Results • Koza (2010): www.human-competitive.org • 76 instances of work in which GP solved problems with results equal to or better than the previous best results obtained by human experts (many of them were patented).
Natural Language Processing • Definition (Liddy, 2001): • Natural Language Processing is a theoretically motivated range of computational techniques for analyzing and representing texts for the purpose of achieving human-like language processing for a range of tasks or applications.
Major Tasks in Natural Language Processing • Text Summarization: Produce a readable summary of a chunk of text. • Machine Translation: Automatically translate text from one human language to another. • Text Classification: Automatically sort a set of documents into categories from a predefined set. • Part-of-speech tagging: Given a sentence, determine the part of speech for each word.
Major Tasks in Natural Language Processing • Parsing: Determine the parse tree (grammatical analysis) of a given sentence. • Question Answering: Given a human-language question, determine its answer. • Word Segmentation: Separate a chunk of continuous text into separate words. • Information Retrieval: This is concerned with storing, searching and retrieving information. • …….
Where In NLP Has GP Been Applied • Text Summarization • Text Classification • Information Retrieval • Parsing • Part-of-speech tagging
Text Summarization • Problem statement: Given a document T, create a shortened version of T by a computer program.
Two Approaches in Text Summarization • Abstraction: Create a new text that summarizes the source document. • Extraction: Select the important sentences from the original document to create the summary. • Most work in this area is based on extraction.
GP Applied to Text Summarization • Based on the extraction method. • Aim: To evolve a function that assigns each sentence a score representing its importance in the document. • The N sentences with the top scores are selected as an extract.
The Previous Work • Zhuli Xie et al, Using Gene Expression Programming to Construct Sentence Ranking Functions for Text Summarization, 2004. • Arman Kiani-B and M. R. Akbarzadeh-T, Automatic Text Summarization Using: Hybrid Fuzzy GA-GP, 2006.
Data used in the system • Data from CMPLG corpus • CMPLG corpus is composed of 183 documents from the Computation and Language (cmp-lg) collection. • The documents are all scientific papers. • 60 documents from CMPLG corpus are used • 50 for training • 10 for testing
Sentence Features • Location of the Paragraph • Location of the Sentence • Length of the Sentence • Heading Sentence • Content-word Frequencies
How Does the System Work? • Input: • The N documents in the training set, each converted into a set of sentence feature vectors • The objective summary of each of the N documents • Output: The function for scoring sentences in the documents • Each GP individual is a scoring function • This scoring function is applied to every sentence feature vector and produces a score accordingly.
How Does the System Work? • All sentences in the same training document are ranked according to their scores. • The sentences with the top scores are selected as an extract. • The similarity between the extract and the objective summary is computed. • Cosine similarity is used. • The fitness of an individual is the average similarity between the extract and the objective summary over all training documents.
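The fitness computation described on these two slides can be sketched as follows. This is an illustration under stated assumptions, not the system from the cited paper: cosine similarity is taken over raw word counts, each document is a hypothetical dict with `sentences`, `features` and `summary` keys, and the toy document has a single made-up feature per sentence.

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def extract(sentences, features, score, n):
    """Rank sentences by score and join the top n in document order."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(features[i]), reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:n]))

def fitness(score, documents, n):
    """Average similarity between extract and objective summary (higher is better)."""
    sims = [cosine(extract(d["sentences"], d["features"], score, n), d["summary"])
            for d in documents]
    return sum(sims) / len(sims)

doc = {
    "sentences": ["GP evolves programs.", "The weather is nice.",
                  "Fitness guides selection."],
    "features": [[1.0], [0.0], [0.8]],      # one made-up feature per sentence
    "summary": "GP evolves programs. Fitness guides selection.",
}
print(fitness(lambda f: f[0], [doc], n=2))
```

Here `score` plays the role of a GP individual: evolution would search over many such functions and keep those with the highest average similarity.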
GP Parameters • A variant of GP, Gene Expression Programming (GEP), is used. • Function set: +, -, *, /, power, sqrt, exp, log, min, max. • Terminal set: the 5 features and the constants 1, 2, 3, 5, 7 • Population size: 265 • Crossover probability: 0.5 • Mutation probability: 0.2 • ....
Result Comparison • The results produced by GP are compared with: • The lead-based method: • Selects the first sentences from the first five paragraphs • The randomly-selected method: • Randomly chooses five sentences from the document. • The random-lead-based method: • Randomly chooses five sentences among the first sentences from all paragraphs • The results produced by GP are better than those of the above three methods.
Some Comments on This Work • Language of documents: English • Topic of data: Scientific papers • Number of sample documents: 60, still small • GP version: GEP, a quite basic version • Comparison with other methods: Very simple
How Can We Go Further? • Summarization for other languages • Vietnamese • Other topics: News, Stories • Data sample: Needs to be increased • Use advanced techniques in GP: Techniques for improving the generalization ability of GP • Compare with more sophisticated existing methods
Information Retrieval • Definition (Christopher et al., 2009): • Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). • Google Search is a typical Information Retrieval system • In an Information Retrieval system, learning to rank is the major task.
Learn to Rank for Information Retrieval • Problem statement: • Given a set of documents D={D1, D2,…, DN}, and a query q, find a function f so that it ranks the documents in D based on their relevance to q.
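To make the problem statement concrete, the sketch below ranks documents with a given scoring function f. The hand-written `term_overlap` scorer stands in for the function that GP would evolve; it and the sample documents are purely illustrative.

```python
def rank(documents, f, query):
    """Order documents by decreasing relevance score f(query, document)."""
    return sorted(documents, key=lambda d: f(query, d), reverse=True)

def term_overlap(query, document):
    """Toy relevance score: number of document words that appear in the query."""
    query_words = set(query.lower().split())
    return sum(1 for w in document.lower().split() if w in query_words)

docs = [
    "genetic programming for ranking",
    "cooking pasta at home",
    "learning to rank with genetic programming",
]
ranked = rank(docs, term_overlap, "genetic programming rank")
print(ranked)
```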
GP Applied to Learn to Rank • Aim: To evolve a function that ranks documents for each query entered by users.
The Previous Work • Jen-Yuan Yeh et al., Learning to Rank for Information Retrieval Using Genetic Programming, 2007. • Shuaiqiang Wang et al., Learning to rank using evolutionary computation: immune programming or genetic programming?, 2009. • Feng Wang and Xinshun Xu, AdaGP-Rank: Applying boosting technique to genetic programming for learning to rank, 2010.
Data used in the system • The LETOR benchmark datasets • Released by Microsoft Research Asia for research on learning to rank for Information Retrieval • Consist of OHSUMED and TREC (TD2003 and TD2004) • TD2003 and TD2004 were used in the paper • There are 49,171 instances in TD2003 and TD2004 • Each instance is a vector of features and a number indicating the degree of relevance between the query and the document.
Data used in the system • The data is equally partitioned into 5 subsets to conduct 5-fold cross validation • For each fold • 3 subsets are used for training • 1 subset for validating • 1 subset for testing
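The 3/1/1 split per fold can be sketched as below. The strided partition and the rule that the subset after the test subset serves for validation are assumptions for illustration; the benchmark's own fold assignment may differ.

```python
def partition(instances, k=5):
    """Split instances into k subsets of (nearly) equal size."""
    return [instances[i::k] for i in range(k)]

def five_fold_splits(subsets):
    """Yield (training, validation, test) for each fold."""
    k = len(subsets)
    for fold in range(k):
        test_idx = fold
        valid_idx = (fold + 1) % k          # assumed rotation rule
        training = [x for j in range(k) if j not in (test_idx, valid_idx)
                    for x in subsets[j]]
        yield training, subsets[valid_idx], subsets[test_idx]

subsets = partition(list(range(20)))
splits = list(five_fold_splits(subsets))
for training, validation, test in splits:
    print(len(training), len(validation), len(test))
```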