250 likes | 377 Views
Geometric Crossovers for Supervised Motif Discovery . Rolv Seehuus NTNU. Motivation and Scope. Try out the applicability of the geometric framework, on a supervised motif discovery problem Compare its merits to a previously used operator. In practice, we test on a very easy problem
E N D
Geometric Crossovers for Supervised Motif Discovery Rolv Seehuus NTNU
Motivation and Scope • Try out the applicability of the geometric framework, on a supervised motif discovery problem • Compare its merits to a previously used operator. • In practice, we test on a very easy problem • that existing software can solve easily • Value as test case • Building block for more complex motif discovery problems, that current algorithms can not solve satisfactory
Motif Discovery • Has become a standard problem in bioinformatics • Given a set of sequences, figure out what is special with it • …by eliciting motifs in the dataset • Differing by… • Motif model • Learning algorithms • Scoring functions
The Standard Approach • Do analysis of the positive set of sequences • …background distribution… • …information content… • …statistical significance… • Report motif
Motif discovery as a classification problem • Always at least two datasets: The positive, and “the rest” • Choose a negative dataset • Report motifs best suited to discriminate • No need to learn a background model • The statistical significance of the motif can be given • Discriminative motif discovery has received increased attention lately
Classification problem • Protein sequences, from the SwissProt database • Classified according to protein family (as specified in the Prosite database) • Selected six families, that previously have been shown to be hard to classify under similar circumstances. • Some of the families can be said to have an overrepresented motif as the ones we can train on
The Potential Negative Data Set • Huge, compared to the negative • Quite common in bioinformatics, and an interesting problem to cope with in its own right • In field: • randomly generated sequences • one set of randomly selected sequences • random rearrangement of the positive sequences (data not shown) • The “best practice” was to select the samples randomly from the negative set each generation, so that their size matches the positive set.
Motif Model • Twenty amino acids • Wildcard C...C.C..C DMEGACGGSCACSTCHVIVDP Motif match, positive sequence
Operators on Motifs • Unit edit move as mutation Mut(A) = {Insert, Delete or Replace a token} • Substring Swapping Crossover (for comparison) • Two-point Geometric Crossover
Geometric Crossover • Search space have a metric • Mutation is a move in search space • Crossover yield children found on the shortest path between the parents in search space • Successfully applied to other problems
Geometric Crossover for Motifs • View motifs as sequences • Basic assumption: The edit distance is a good way to move around in motif space • A crossover based on the edit distance, should yield a good crossover for motif discovery • We (arbitrarily) choose unit costs for insertions, deletions and substitutions
Sequence Alignment • Alignment: put spaces (-) in both sequences such as they become of the same length Seq1’= agcacac-a Seq2’=a-cacacta • Score: 2 • An Optimal alignment is an alignment with minimal score • The score of the optimal alignment of two sequences equals their edit distance • There often are multiple optimal alignments
Homologous Crossover • Pick an optimal alignment for two parent sequences • Generate a crossover mask as long as the alignment • Recombine as traditional crossover • Remove dashes from offspring Child1’= BANANAS Child2’= ANANA Mask = 1101100 Seq1’= BANANA- Seq2’= -ANANAS SeqA’= BANANAS SeqB’= -ANANA-
Experiments • Two crossovers with same parameters, and mutation only • Ten fold cross validation: • Partitioned datasets in ten pieces • Trained on 9/10ths • Tested the best motif on the remaining test set • Trained on randomly selected subset of SwissProt • Tested on entire SwissProt • Fitness: Scaled Pearson correlation of confusion matrix
Cytochrome • Include the following fragment of a highly conserved motif: C…CH • Which geometric crossover find • While substring swapping finds: CH • Conservation of length keeps us in the correct ballpark • CH representa local maximum for substring swap
Ferredoxin • Contains the following motif: C..C..C...C[PH] • Which Substring Swap finds • While Geometric Crossover don’t • Conservation of length keeps us from finding the correct motif
Classification Performance Medians, of 10 experiments, for each family
Classification Performance - II • Similar for all operators • Maybe a slight advantage, for the geometric crossover if we have • A highly conserved motif exist • A “ballpark guess” on motif length • Surprisingly, mutation frequently outperforms the other operators
Concluding remarks • The geometric operator is promising - need work • It is more length preserving than substring swap • The geometric operator need a good guess on motif length • Edit move might not be optimal for motif discovery? • even though, it for some problems shows merit. • Our initial assumption imply an insertion/deletion equally often as replacement in sequence data • we are WAY off on that parameter
Future Work • Synthetic data with known parameters • Include character classes and within motif gaps in representation • Modules (composite motifs) • Expand to position weight matrixes