This project focuses on utilizing fuzzy logic and other methods for assembling nucleotide sequences into full genomes and creating characteristic genomes for efficient classification. It addresses challenges in genome sequencing, multiple sequence alignment, and metagenomics.
Utilizing Fuzzy Logic for Gene Sequence Construction from Sub Sequences and Characteristic Genome Derivation and Assembly
Contents • Team • Bioinformatics • Genome Sequencing • Research Problem • Fuzzy Logic • Ongoing Work • Future Work
Team • Advisors: • PI: Dr. Gregory Vert (Dept. of Computer Science, University of Nevada Reno) • Co-PI: Dr. Alison Murray (Desert Research Institute, Reno) • Co-PI: Dr. Monica Nicolescu (Dept. of Computer Science, University of Nevada Reno) • Student: • Sara Nasser (Dept. of Computer Science, University of Nevada Reno)
Bioinformatics- Genome Sequencing • Genome sequencing is figuring out the order of DNA nucleotides, or bases, in a genome—the order of As, Cs, Gs, and Ts that make up an organism's DNA. • Sequencing the genome is an important step towards understanding it. • The whole genome can't be sequenced all at once because available methods of DNA sequencing can only handle short stretches of DNA at a time.
Genome Sequencing • Much of the work involved in sequencing lies in putting together this giant biological jigsaw puzzle. • Various problems occur such as: • Errors in reading • Flips
Shotgun Sequencing • The "whole-genome shotgun" method involves breaking the genome up into small pieces, sequencing the pieces, and reassembling them into the full genome sequence.
Environmental Genomics • Multiple sequence alignment is an important first step in many bioinformatics applications such as structure prediction, phylogenetic analysis and detection of key functional residues. • The accuracy of these methods relies heavily on the quality of the underlying alignment. [1]
Multiple Sequence Alignment • The traditional multiple sequence alignment problem is NP-hard, which means that computing an exact solution is impractical for more than a few sequences [1]. • In order to align a large number of sequences, many different approaches have been developed.
Tools and Techniques • MUMmer • Phrap, Phred, Consed • TIGR • The Smith-Waterman Algorithm • Tree-Based Algorithms
Meta-genomics • Meta-genomics is the application of modern genomics techniques to the study of communities of microbial organisms directly in their natural environments, bypassing the need for isolation and lab cultivation of individual species. [2]
Meta-genomics • Bacteria often have minor variations in their DNA that can result in different metabolic characteristics. • These differences can make it difficult to classify bacteria taxonomically. • What is needed is a method of creating a characteristic representation (a characteristic genome) from the subsequences of DNA found in several sub-variants of the same bacterial species. • Such a genome could be used for more efficient classification at the molecular level through a process of controlled generalization.
Research Goals • Given a collection of nucleotide sequences from multiple organisms, develop techniques based on fuzzy set theory and other methods for assembling the sequences into the original full genome of each organism. • Use these techniques to develop a generalized approach for creating a characteristic genome that represents a generalization of the original organisms that donated sequence data.
The Data • SYM (Original Raw Data): • Contains 302K sequences • Average length of 450 base pairs (bp) • It was obtained from a community of bacteria • It contains an estimated 100 organisms • Suppose, for example, that 75% of the data is repeated: we would still need to reassemble a sequence of ~33 million bp
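The size estimate on this slide can be checked with a few lines of arithmetic. The sketch below uses only the slide's own figures (302K sequences, 450 bp average, a hypothetical 75% repeat share); the variable names are illustrative. The result, roughly 34 million bp, is in line with the slide's rounded figure of ~33 million bp.

```python
# Back-of-the-envelope size estimate for the SYM dataset.
NUM_SEQUENCES = 302_000   # from the slide: 302K sequences
AVG_LENGTH_BP = 450       # from the slide: average sequence length in bp
REPEATED_PERCENT = 75     # the slide's hypothetical share of repeated data

# Total number of bases across all sequences.
total_bp = NUM_SEQUENCES * AVG_LENGTH_BP

# Bases that remain once the repeated share is discounted.
unique_bp = total_bp * (100 - REPEATED_PERCENT) // 100

print(f"total bases read:   {total_bp:,}")
print(f"non-repeated bases: {unique_bp:,}")
```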
Motivation • Current tools could not solve the problem: • The dataset is complex, since the sequences are from the same species. • We are sequencing environmental genomes, not a single organism. • Few tools can sequence environmental genomes. • Algorithm: • The underlying algorithm determines the accuracy of a match. • Performance can be greatly improved. • Interfaces could be improved.
Problem • Genome assembly is an O(2^k) problem. • Using dynamic programming, the cost can be reduced. • Example, in seconds: • An assembly that takes around 1125899906842624 seconds can be reduced to 2500+ seconds!
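The two numbers in this example are consistent with an exponential cost of 2^k and a quadratic dynamic-programming cost of k^2 at k = 50. The slide does not state k or the reduced complexity, so both are inferences from its figures, used here only to check the arithmetic.

```python
# Assumed problem size: inferred from the slide's numbers, not stated there.
K = 50

# Exponential cost of the exhaustive approach, O(2^k).
exhaustive_steps = 2 ** K

# Assumed quadratic cost of a dynamic-programming table, O(k^2).
dp_steps = K ** 2

print(exhaustive_steps)  # matches the slide's 1125899906842624
print(dp_steps)          # matches the slide's 2500
```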
A Start • We divide the problem into two steps • Acquiring subsets such that each subset represents an organism • Assembling each subset into a characteristic genome sequence • These two steps are carried out by • Clustering • Assembly
Steps • [Figure: pipeline diagram. Stages: Clustering, Assembly. Data: Raw Data, Assembled Sequences, Characteristic Genome]
The Data • CAVEEG (Cleaned Dataset): • Contains 128K Sequences • Assembled + Singletons • Length ranges from 200bp-1000bp
D2 Cluster • D2 Cluster is software for clustering genome sequences • The technique is distance-based.
Clustering with D2 Cluster • Clustering was performed on the 128K-sequence CAVEEG dataset • One dataset with 100K sequences was obtained • The majority of the data falls into one cluster • This makes separating the organisms hard • The clustering/assembly failed to assemble the sequences (the number of organisms was estimated manually and compared)
Problem with D2 clustering • It does not look for contigs • Ex: The same cluster may contain: AATGCGTATTCGATGCGC CATACTTAGTCGATC – AG • When we assemble, we want overlapping sequences instead: AATGCGTATTCGATGCGC TGCGCATCGTATCG
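The desired pair on this slide overlaps end to start: the first sequence ends with TGCGC and the second begins with it. A minimal sketch of detecting such a suffix-prefix overlap and merging the two reads into a contig is given below. The function names and the `min_len` threshold are hypothetical, and the exact-match test is a simplification that ignores read errors.

```python
from typing import Optional


def suffix_prefix_overlap(left: str, right: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `left` that is a prefix of `right`.

    Only exact overlaps of at least `min_len` bases count; 0 means no overlap.
    """
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_len: int = 3) -> Optional[str]:
    """Merge two reads into one contig if they overlap, else return None."""
    size = suffix_prefix_overlap(left, right, min_len)
    if size == 0:
        return None
    # Keep `left` whole and append the part of `right` past the overlap.
    return left + right[size:]


# The two sequences from the slide's "desired" example.
read_a = "AATGCGTATTCGATGCGC"
read_b = "TGCGCATCGTATCG"

print(suffix_prefix_overlap(read_a, read_b))
print(merge_reads(read_a, read_b))
```

Trying candidate overlaps from longest to shortest means the first hit is the longest exact overlap; a practical assembler would also tolerate mismatches and check reverse complements.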
Problems • Since the data is closely related, the clustering technique assigns most sequences to the same cluster. • Existing tools are unable to assemble the data correctly. • The clustering software can only perform one round of clustering.
Ongoing Work • Genome assembly using dynamic programming • Uses Longest Common Subsequence (LCS) • LCS is commonly used (e.g., MUMmer) • We added restrictions • Enforce strict matches • Encoding of data
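For reference, the textbook dynamic-programming formulation of the longest common subsequence is sketched below. This is the standard algorithm only: it does not include the restrictions mentioned on the slide (strict matches, data encoding), whose details are not given, and the function name is illustrative.

```python
def longest_common_subsequence(a: str, b: str) -> str:
    """Return one longest common subsequence of `a` and `b` (standard DP)."""
    rows, cols = len(a) + 1, len(b) + 1
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover the subsequence.
    i, j = len(a), len(b)
    chars = []
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            chars.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(chars))


print(longest_common_subsequence("AGGTAB", "GXTXAYB"))
```

The table has (len(a) + 1) × (len(b) + 1) cells, so time and memory grow with the product of the two sequence lengths rather than exponentially.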
Then… • We added clustering. • Instead of comparing each sequence with every other sequence, we can compare it with a group. • Faster: fewer comparisons.
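One way to picture the saving is a greedy scheme in which each new sequence is compared only against one representative per group. The sketch below is an illustration under assumptions, not the method used in this project: the k-mer Jaccard similarity, the `threshold` value, and all function names are hypothetical.

```python
def kmer_set(seq: str, k: int = 3) -> set:
    """Return the set of all length-k substrings of `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_cluster(seqs, threshold: float = 0.5, k: int = 3):
    """Assign each sequence to the first group whose representative is similar.

    Returns the groups and the number of similarity comparisons performed.
    The representative of a group is simply its first member (an assumption).
    """
    representatives = []  # k-mer set of each group's first member
    clusters = []
    comparisons = 0
    for seq in seqs:
        kmers = kmer_set(seq, k)
        for idx, rep in enumerate(representatives):
            comparisons += 1
            if jaccard(kmers, rep) >= threshold:
                clusters[idx].append(seq)
                break
        else:
            # No existing group was similar enough: start a new one.
            representatives.append(kmers)
            clusters.append([seq])
    return clusters, comparisons


reads = ["AAAAACCCCC", "AAAAACCCCG", "GGGGGTTTTT", "GGGGGTTTTA"]
clusters, comparisons = greedy_cluster(reads)
all_pairs = len(reads) * (len(reads) - 1) // 2

print(clusters)
print(comparisons, "comparisons versus", all_pairs, "for all pairs")
```

With n sequences and g groups, the number of comparisons is at most n × g instead of n(n − 1)/2, which is the saving the slide refers to when g is much smaller than n.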
Clustering [3]
How much does it matter? • Obtaining an exact full-length sequence is not essential • A sequence that is very close to the original is desired
Fuzzy Logic • Fuzzy logic has been used extensively in approximate string matching using distance measures, etc. • However, very little work has been done on applying it to building genomes from subsequences of nucleotides. • The concept of similarity will be defined using fuzzy logic, which is a relatively new approach in nucleotide sequencing.
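To illustrate what a distance-based fuzzy similarity might look like, the sketch below maps the edit distance between two sequences to a membership degree in [0, 1]. It is a generic example of approximate string matching, not the similarity definition of this project (which the slides leave as future work); the normalization by the longer length and the function names are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between `a` and `b`, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def fuzzy_similarity(a: str, b: str) -> float:
    """Degree of membership in the fuzzy set "similar to each other".

    1.0 means identical; 0.0 means every position must be edited.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


print(fuzzy_similarity("AATGC", "AATGC"))
print(fuzzy_similarity("AATGC", "AATCC"))
```

A graded score like this lets near-matches contribute partially instead of being rejected outright, which fits the goal of obtaining a sequence that is very close to, rather than identical to, the original.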
Future Work • Compare the technique with existing alignment software (Phrap, MUMmer) • Improve clustering • Define similarity using fuzzy logic • Define dissimilarity • Parallelize the process
References • [1] http://bioinformatics.oxfordjournals.org/cgi/content/full/21/8/1408#FIG1 (accessed May 2006). • [2] DeLong EF (2002) Microbial population genomics and ecology. Curr Opin Microbiol 5: 520–524. • [3] http://www.togaware.com/datamining/survivor/kmeans04.png