A program to locate mRNA sequences on the mouse genome. Sequences are stored in FASTA format and matched with a localization program such as BLAT, SSAHA, or MegaBLAST. Each match is defined by a (start, end) pair on the mRNA and a (chromosome, start, end) tuple on the genome. The slides walk through an object-oriented design for this program.
Object-oriented Design and Programming Conrad Huang PC204, Fall 2004
Requirements • I have a bunch of mRNA sequences and I want to know where they are located on the mouse genome. • I want a fast method because I plan to do this with lots of data sets. This is probably better than most requirements statements that you’re likely to get.
Specification • Genomic and mRNA sequences are stored in FASTA format • Matches between genomic and mRNA (sub)sequences are found using a localization program, e.g., BLAT, SSAHA or MegaBLAST • Each match is defined by a (start, end) pair on the mRNA and a (chromosome, start, end) 3-tuple on the genomic sequence • The genomic location of an mRNA sequence is defined by a set of matches that maximally covers the mRNA Imprecise, but workable. Needs statement of what constitutes acceptable results.
Design (OO) • Objects • FASTA file • Genomic sequence • mRNA sequence • Match • Genomic location • Operations • Read sequences from file • Get matches from localization program • Collate matches into genomic location Not uh-oh. O-O.
Relationships [Diagram: two FASTA file boxes, Genomic sequences, mRNA sequences, Matches and Genomic locations, connected by arrows] Arrow direction indicates reference or composition.
Classes • FastaFile • used for reading both genomic and mRNA sequences • Sequence • represents either genomic or mRNA sequence • Match • obtained from localization program output • GenomicLocation • either obtained from localization program output (BLAT) or composed from matches using our own algorithm (SSAHA or MegaBLAST) • Localization program output parser Some classes, like FastaFile, can serve as the implementation of more than one concept (file of genomic sequences and file of mRNA sequences).
Class Methods • FastaFile • read(filename) • parse FASTA file content into a list of Sequence instances • Sequence • None • data filled in by FastaFile.read()
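As a rough illustration of read(filename), here is a minimal sketch of the two classes. The attribute names (name, description, residues, sequences), the separate parse() helper and the demo data are assumptions made for this example; the slides deliberately leave attributes undefined.

```python
import io


class Sequence:
    """One FASTA record. Attribute names here are assumptions for illustration."""

    def __init__(self, name, description, residues):
        self.name = name
        self.description = description
        self.residues = residues

    def __len__(self):
        return len(self.residues)


def _make_sequence(header, chunks):
    # The first word of the header line is taken as the name, the rest as description.
    name, _, description = header.partition(" ")
    return Sequence(name, description, "".join(chunks))


class FastaFile:
    """Holds the Sequence instances parsed from one FASTA file."""

    def __init__(self):
        self.sequences = []

    def read(self, filename):
        # Parse the named file and remember its sequences.
        with open(filename) as handle:
            self.sequences = self.parse(handle)
        return self.sequences

    @staticmethod
    def parse(handle):
        # Works on any iterable of lines, which keeps the parser easy to test.
        sequences = []
        header = None
        chunks = []
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    sequences.append(_make_sequence(header, chunks))
                header = line[1:]
                chunks = []
            elif header is not None:
                chunks.append(line)
        if header is not None:
            sequences.append(_make_sequence(header, chunks))
        return sequences


# Made-up demo data, parsed from memory instead of from a file.
demo = io.StringIO(">mrna1 test transcript\nACGT\nTTGA\n>mrna2\nGGCC\n")
records = FastaFile.parse(demo)
print([(s.name, len(s)) for s in records])  # [('mrna1', 8), ('mrna2', 4)]
```

Splitting parse() from read() is a design choice of this sketch, not something the slides prescribe; it lets the same code serve both the genomic and the mRNA files and be tested without touching disk.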
Class Methods (cont.) • LocalizationOutputParser • localize(sequence) • run localization program and parse output • if using BLAT, identify genomic location and matches from output • if using SSAHA or MegaBLAST, get list of matches from output and compute genomic location
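One way to structure localize(sequence) is a base class that separates running the program from parsing its output. The sketch below is an assumption-laden illustration: run_program, parse_output and CannedParser are hypothetical names, and the tab-separated line format is invented for the demo. It is not the output format of BLAT, SSAHA or MegaBLAST, and no real command line is shown because the slides do not give one.

```python
class LocalizationOutputParser:
    """Base class: each subclass knows one program and its output format."""

    def localize(self, sequence):
        # Run the localization program, then turn its text output into matches.
        output = self.run_program(sequence)
        return self.parse_output(output)

    def run_program(self, sequence):
        # A real subclass would launch the external program here.
        raise NotImplementedError

    def parse_output(self, output):
        raise NotImplementedError


class CannedParser(LocalizationOutputParser):
    """Hypothetical stand-in that returns fixed text instead of running anything."""

    def __init__(self, canned_output):
        self.canned_output = canned_output

    def run_program(self, sequence):
        return self.canned_output

    def parse_output(self, output):
        # Invented format: mrna_start, mrna_end, chromosome, genomic_start, genomic_end
        matches = []
        for line in output.splitlines():
            if not line.strip() or line.startswith("#"):
                continue
            mrna_start, mrna_end, chromosome, g_start, g_end = line.split("\t")
            matches.append(
                (int(mrna_start), int(mrna_end), chromosome, int(g_start), int(g_end))
            )
        return matches


text = "# invented demo output\n0\t120\tchr7\t5000\t5120\n120\t300\tchr7\t7400\t7580\n"
matches = CannedParser(text).localize("ACGT")
print(matches)
```

With this split, BlatParser, SsahaParser and MegablastParser would differ only in their two overridden methods, and the parsing half can be tested on saved output files.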
Class Methods (cont.) • Match • None • data filled in by LocalizationOutputParser • GenomicLocation • None • data filled in by LocalizationOutputParser
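Since Match and GenomicLocation carry data but no methods of their own, they could be written as plain data holders. The following is a sketch under assumptions: the field names, the half-open coordinate convention and the covered_length() helper are not from the slides, and the coordinates are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Match:
    # (start, end) pair on the mRNA; half-open coordinates are an assumption.
    mrna_start: int
    mrna_end: int
    # (chromosome, start, end) 3-tuple on the genomic sequence.
    chromosome: str
    genomic_start: int
    genomic_end: int

    @property
    def mrna_length(self):
        return self.mrna_end - self.mrna_start


@dataclass
class GenomicLocation:
    # The set of matches that together locate one mRNA on the genome.
    matches: List[Match] = field(default_factory=list)

    def covered_length(self):
        # Assumes the matches do not overlap on the mRNA.
        return sum(m.mrna_length for m in self.matches)


location = GenomicLocation(
    [Match(0, 120, "chr7", 5000, 5120), Match(120, 300, "chr7", 7400, 7580)]
)
print(location.covered_length())  # 300
```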
Instance Attributes • Some attributes are dictated by the class • FastaFile must have a list or dictionary of sequences • These may be accessible externally • Some attributes are dictated by the operation • Reading a FASTA file might use a variable to keep track of the last line read • These are often for internal use only • Defining actual attributes in our classes is left as an exercise for the reader Yeah, I ran out of steam here, and there are already too many slides.
Module Organization • sequence.py • defines FastaFile and Sequence • location.py • defines Match and GenomicLocation • blat.py, ssaha.py, megablast.py • define BlatParser, SsahaParser and MegablastParser, respectively Keep classes that cannot stand independently in the same module.
Design (finishing touches) • Select implementation algorithms • file parsers (FASTA or localization output) are sometimes available on the Internet • recursion and result caching for generating genomic location from list of matches If this task is too big, you need to partition the problem further
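To make "recursion and result caching" concrete, here is one possible reading of the collation step. It treats "maximally covers" as choosing non-overlapping mRNA intervals with the largest total length, which is the classic weighted interval scheduling problem; that interpretation is an assumption, and the sketch ignores chromosome, strand and genomic ordering, which a real collation step would likely need to check. Matches are reduced to (start, end) pairs with half-open coordinates, and best_cover is a hypothetical name.

```python
from functools import lru_cache


def best_cover(matches):
    """Pick non-overlapping (start, end) mRNA intervals with maximal total length."""
    # Sort by end position so earlier compatible matches are easy to find.
    ordered = sorted(matches, key=lambda m: (m[1], m[0]))

    def previous_compatible(i):
        # Index of the last match that ends at or before ordered[i] starts, or -1.
        start = ordered[i][0]
        for j in range(i - 1, -1, -1):
            if ordered[j][1] <= start:
                return j
        return -1

    @lru_cache(maxsize=None)
    def solve(i):
        # Best (covered length, chosen matches) using only ordered[0..i].
        if i < 0:
            return 0, ()
        skip = solve(i - 1)
        prev_length, prev_chosen = solve(previous_compatible(i))
        length = ordered[i][1] - ordered[i][0]
        take = (prev_length + length, prev_chosen + (ordered[i],))
        return take if take[0] > skip[0] else skip

    return solve(len(ordered) - 1)


covered, chosen = best_cover([(0, 100), (50, 200), (100, 250), (240, 300)])
print(covered, chosen)  # 250 ((0, 100), (100, 250))
```

The cache means each subproblem is solved once, so the recursion stays cheap even when many matches overlap. For very long match lists the recursion depth could become a concern, in which case the same recurrence can be filled in iteratively.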
Implementation • Coding, testing and debugging • Start with the class skeletons • Write test code for each module • Test modules separately when possible • Test early and often Well, it’s about time! Project is due in 10 minutes.
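A sketch of what "start with the class skeletons" and "write test code for each module" might look like, using the standard unittest module. The cut-down Match skeleton and its coordinate check are assumptions made so the example stands alone; in the module layout above, the tests would import the real class from location.py instead.

```python
import unittest


class Match:
    """Skeleton only: just enough behaviour to be testable."""

    def __init__(self, mrna_start, mrna_end):
        if mrna_end < mrna_start:
            raise ValueError("end precedes start")
        self.mrna_start = mrna_start
        self.mrna_end = mrna_end

    def length(self):
        return self.mrna_end - self.mrna_start


class MatchTest(unittest.TestCase):
    def test_length(self):
        self.assertEqual(Match(10, 25).length(), 15)

    def test_rejects_reversed_coordinates(self):
        with self.assertRaises(ValueError):
            Match(25, 10)


# Run the tests explicitly so the example also works when pasted into a session.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(MatchTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Keeping one small test class per module makes it practical to test modules separately and to rerun the tests after every change.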
Roll Out • Release product to user • include User’s Guide • description of options • example usage • test cases The code is, of course, already completely documented.