Using a Genetic Algorithm for Approximate String Matching on Genetic Code Carrie Mantsch December 5, 2003
Outline • Problem Statement • Current Techniques • GA Motivation • My Algorithm • Results • Extension Possibilities
Problem Statement The problem is to search and align strands of DNA using a genetic algorithm.
Current Techniques • Approximate string matching • Usually meant for smaller strings • Many are set up for k mismatches • Example: two DNA strands of lengths 90 and 85 • Allowing for 5 gaps in the second strand gives almost 44 million possible alignments
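The figure of almost 44 million is consistent with choosing which 5 of the 90 aligned positions hold a gap once the 85-letter strand is padded to length 90. The slide does not say how the number was derived, so the counting model below is an assumption, and `count_alignments` is a hypothetical helper name.

```python
from math import comb


def count_alignments(long_len: int, short_len: int) -> int:
    """Count the ways to pad the shorter strand with gaps up to the longer length.

    Assumed model: each alignment is a choice of which positions (out of
    long_len) are gaps; the letters keep their original order.
    """
    gaps = long_len - short_len
    return comb(long_len, gaps)


print(count_alignments(90, 85))  # 43949268, i.e. almost 44 million
```

Under this model the count grows combinatorially with the number of gaps, which is why exhaustive search quickly becomes impractical.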
Current Techniques (cont.) • Needleman-Wunsch • Gap penalty -1 • Match bonus +1 • Mismatch 0 • Not practical if the match starts in the middle of the longer sequence • Counts the gaps at the beginning and end as penalties
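A minimal sketch of the Needleman-Wunsch scoring recurrence with the values from the slide (gap -1, match +1, mismatch 0). This is the textbook global-alignment form, not necessarily the exact variant used in the talk. It computes only the score, with no traceback, and the function name is hypothetical.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = 0, gap: int = -1) -> int:
    """Return the optimal global alignment score of strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # First row and column: aligning a prefix against nothing costs one
    # gap penalty per letter. This is where leading gaps get penalized.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


# "ACG" occurs exactly inside the longer string, yet the five end gaps
# pull the global score down to 3 - 5 = -2.
print(needleman_wunsch_score("TTACGTTT", "ACG"))  # -2
```

The example illustrates the limitation named on the slide: a perfect match in the middle of a longer sequence still receives a poor global score, because every flanking gap is charged.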
Current Techniques (cont.) • BLAST (Basic Local Alignment Search Tool) and FASTA • Use domain specific knowledge • http://www.ncbi.nlm.nih.gov/BLAST • http://fasta.bioch.virginia.edu
GA Motivation • Alien DNA • Junk DNA • Extendable to similar text searches without domain specific knowledge
My Algorithm • The population • Bit strings of 0’s and 1’s • 0’s are spaces, 1’s mean a letter is placed there • The number of 1’s stays constant, equal to the number of letters in the smaller search string
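The representation can be sketched as follows. Each 1 consumes the next letter of the shorter string and each 0 emits a gap, so one bit string describes one alignment. The function names, the `-` gap character, and the random initialization are assumptions for illustration; the slides do not specify them.

```python
import random


def random_individual(n_letters: int, n_gaps: int, rng: random.Random) -> list:
    """Build a bit string with n_letters 1's and n_gaps 0's in random order."""
    bits = [1] * n_letters + [0] * n_gaps
    rng.shuffle(bits)
    return bits


def decode(bits: list, short: str, gap_char: str = "-") -> str:
    """Turn a bit string into the gapped form of the shorter string."""
    if sum(bits) != len(short):
        raise ValueError("the number of 1's must equal the number of letters")
    letters = iter(short)
    # A 1 places the next letter; a 0 places a gap.
    return "".join(next(letters) if bit else gap_char for bit in bits)


print(decode([1, 0, 1, 1, 0], "ACG"))  # A-CG-
```

Because the letters are always placed in order, only the gap positions vary between individuals.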
My Algorithm (cont.) • Breeding • Rank-based selection • Crossover • The place markers common to both parents are kept the same • The rest of the place markers are split evenly between the two children
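One possible reading of the breeding step, as a sketch. It assumes linear rank weighting for selection, and it assumes both parents have the same length and the same number of 1's. The slides give neither detail, and `rank_select` and `crossover` are hypothetical names.

```python
import random


def rank_select(population: list, fitnesses: list, rng: random.Random):
    """Pick one individual with probability proportional to its fitness rank."""
    # Worst individual gets weight 1, best gets weight len(population).
    order = sorted(range(len(population)), key=lambda i: fitnesses[i])
    weights = list(range(1, len(order) + 1))
    return population[rng.choices(order, weights=weights, k=1)[0]]


def crossover(p1: list, p2: list, rng: random.Random):
    """Keep the 1's shared by both parents; split the other 1's between children."""
    if len(p1) != len(p2):
        raise ValueError("this sketch assumes parents of equal length")
    common = [i for i, (x, y) in enumerate(zip(p1, p2)) if x and y]
    differing = [i for i, (x, y) in enumerate(zip(p1, p2)) if x != y]
    rng.shuffle(differing)
    half = len(differing) // 2

    children = []
    for share in (differing[:half], differing[half:]):
        child = [0] * len(p1)
        for i in common + share:
            child[i] = 1
        children.append(child)
    return children[0], children[1]
```

When both parents carry the same number of 1's, the non-shared positions always come in an even count, so each child ends up with exactly as many 1's as its parents and the letter count of the search string is preserved.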
My Algorithm (cont.) • Mutation • If the number of gaps is less than one tenth of the small string size, add a gap • Otherwise, delete a gap
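A sketch of the mutation rule under one interpretation: a "gap" is a single 0 bit, adding a gap inserts a 0 at a random position, and deleting a gap removes a randomly chosen 0. The slides do not define these details, so they are assumptions, as is the name `mutate`.

```python
import random


def mutate(bits: list, rng: random.Random) -> list:
    """Add a gap if gaps are scarce, otherwise remove one; letters are untouched."""
    n_letters = sum(bits)
    gap_positions = [i for i, bit in enumerate(bits) if bit == 0]
    child = list(bits)

    if len(gap_positions) < n_letters / 10:
        # Fewer gaps than one tenth of the small string size: insert a 0.
        child.insert(rng.randrange(len(child) + 1), 0)
    elif gap_positions:
        # Otherwise remove one existing 0.
        del child[rng.choice(gap_positions)]
    return child
```

Under this reading the number of 1's never changes, but the length of the bit string does, so a position-by-position crossover would need some way to bring parents to a common length. The slides do not say how that case is handled.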
Results • The target match
Results (cont.) • Ran for 50 generations • Runs with different random numbers over the same number of generations gave best fitness values between about 32 and 67 (optimal fitness: 90)
Extension Possibilities • Better representation of the population • Ability to alter the fitness evaluation to be more specific to different problems • Ability to add domain-specific knowledge • Parallel searching