Tools for BioInformatics Eileen Kraemer Computer Science Dept. The University of Georgia
Types of Tools • Lab samples • Production Sequencing Software • Sequence data • Databases, Database Search Tools
Production Sequencing Software • Used throughout the sequencing procedure, from preparation of the DNA through to the finishing of clones.
Example: Sanger Centre, shotgun sequencing of a typical human clone • Data collection • Transfer to UNIX • Gel image processing • Sequence pre-processing • DNA Fragment Assembly • Editing • Finishing Services • Quality Control and Assessment
Databases • Swiss-Prot • EMBL • Entrez • GDB • GenBank • GSDB • PDB • & more -- see links at: http://www.public.iastate.edu/~pedro/rt_1.html
Species-specific Databases • See http://genetics.about.com for both non-human and human genome projects • Examples: • PomBase - a compilation of data relating to the organism Schizosaccharomyces pombe • Wormpep - predicted proteins from the C. elegans genome sequencing project
Annotation Tools • Annotation of sequences with information such as homologies to known genes, possible gene locations, and gene signals such as promoters. • Example: Genotator (Nomi Harris) -- a workbench under development for automatic sequence annotation and for viewing and editing annotations. The goal is to run a series of sequence analysis tools and display the results so that the various predictions can be compared and the researcher can decide what to include.
Database Software • ACEDB is an acronym for "A Caenorhabditis elegans DataBase". It can refer to a database and data concerning the nematode C. elegans, or to the database software alone. • Other groups may adapt existing software or create their own. For example, David Hall’s workflow project at UGA for Neurospora
Types of Tools • Sequence • Function • Structure
Gene Prediction • Caution: accuracy <= ~70% • Good review: Snyder and Stormo (chapter 11 of the book Nucleic Acid and Protein Sequence Analysis: A Practical Approach, second edition, 1994)
Gene Prediction • GRAIL (Xgrail, JavaGrail, etc.) • Geneid • Netgene • GeneMark • Fexon, Hexon • GENSCAN • xpound • Genefinder (University of Washington)
GRAIL • Predicts coding regions • Uses a neural network which combines a series of coding prediction algorithms • Recognizes coding potential within a fixed-size (100-base) window; evaluates coding potential without looking for additional features • Later versions incorporate additional information • Human and other species
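The fixed-window idea can be sketched in a few lines. This is only an illustration of scoring coding potential in a sliding 100-base window; it is not GRAIL's neural network. The measure used here (the best fraction of non-stop codons over the three forward frames) and the names `stop_free_fraction` and `window_scores` are assumptions chosen for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def stop_free_fraction(window):
    """Toy coding measure (an assumption, not GRAIL's): the best fraction
    of non-stop codons over the three forward reading frames."""
    best = 0.0
    for frame in range(3):
        codons = [window[i:i + 3] for i in range(frame, len(window) - 2, 3)]
        if codons:
            non_stop = sum(c not in STOP_CODONS for c in codons)
            best = max(best, non_stop / len(codons))
    return best

def window_scores(seq, window=100, step=10, measure=stop_free_fraction):
    """Slide a fixed-size window along seq and score each position.
    Returns a list of (window start, score) pairs."""
    seq = seq.upper()
    return [(start, measure(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]
```

A real predictor would combine several such measures rather than rely on one; `measure` is a parameter so that other window statistics can be swapped in.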
GeneMark • Based on inhomogeneous Markov models • Predicts coding and non-coding regions based on statistical patterns in dinucleotide frequencies … more next week from Mark B.
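A minimal sketch of the idea behind an inhomogeneous Markov model follows. It assumes a first-order model whose transition (dinucleotide) probabilities depend on codon position, compared against a homogeneous model for non-coding sequence. The model order, the add-one pseudocount, and the function names are assumptions for illustration; the slides do not describe GeneMark's actual parameters or training.

```python
import math

BASES = "ACGT"

def train(seqs, period):
    """Estimate P(current base | previous base, phase), where phase is the
    position modulo `period`. period=3 gives a 3-periodic (inhomogeneous)
    model, period=1 a homogeneous one. Assumes uppercase A/C/G/T input and,
    for period=3, training sequences that start in frame."""
    # Start every count at 1.0 (add-one pseudocount) to avoid log(0) later.
    counts = [{a: {b: 1.0 for b in BASES} for a in BASES} for _ in range(period)]
    for s in seqs:
        for i in range(1, len(s)):
            counts[i % period][s[i - 1]][s[i]] += 1
    # Normalise each row into conditional probabilities.
    return [{a: {b: row[b] / sum(row.values()) for b in BASES}
             for a, row in phase.items()}
            for phase in counts]

def log_likelihood(seq, model):
    """Log probability of seq's transitions under the given model."""
    period = len(model)
    return sum(math.log(model[i % period][seq[i - 1]][seq[i]])
               for i in range(1, len(seq)))

def coding_score(seq, coding_model, noncoding_model):
    """Log-likelihood ratio: positive values favour the coding model."""
    return log_likelihood(seq, coding_model) - log_likelihood(seq, noncoding_model)
```

In practice the score would be computed for each reading frame and strand over a sliding window, with models trained on known genes from the organism of interest.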
Sequence Alignment • Pairwise alignments • Multiple sequence alignments
Pairwise Alignments • SIM (Protein only) - k best non-intersecting alignments (EXPASY) • ALIGN - optimal global alignment with no short-cuts (EERIE) • LALIGN - calculates the N-best local alignments (EERIE) • LFASTA - local similarity searches showing local alignments (EERIE) • BLAST 2 - local alignment using BLAST (NCBI) • LAP2 - local DNA to protein alignment with LAP2 (MTU)
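The difference between global and local alignment can be shown with the standard dynamic-programming recurrences (Needleman-Wunsch for global, Smith-Waterman for local). This sketch is not the code of any tool listed above; the scoring values (match +1, mismatch -1, linear gap -2) are assumptions for the example, and only the score is computed, without a traceback.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of a and b with a linear gap penalty.
    local=False: global alignment (Needleman-Wunsch).
    local=True:  local alignment (Smith-Waterman)."""
    # Only two rows of the DP table are kept, since no traceback is needed.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diagonal, prev[j] + gap, cur[j - 1] + gap)
            if local:
                score = max(score, 0)   # a local alignment may restart anywhere
                best = max(best, score)
            cur.append(score)
        prev = cur
    return best if local else prev[-1]
```

For example, `align_score("ACGT", "ACT")` scores the global alignment, while `align_score(x, y, local=True)` finds the best-scoring pair of substrings.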
Multiple Sequence Alignments • ClustalW 1.7 (DNA/Protein) - Global progressive (BCM) • CAP Sequence Assembly (DNA) - Contig Assembly • MAP (DNA/Protein) - Global progressive in linear space • PIMA 1.4 (Protein only) - Pattern-Induced (local) Multiple Alignment (BCM) • MSA 2.1 (Protein only) - Near-optimal sum-of-pairs global (WashU) • BLOCK MAKER (Protein only) - Finds conserved blocks in seq sets (FHCRC) • MEME 2.2 (DNA/Protein) - Multiple EM for Motif Elicitation (SDSC)
Similarity Searching • BLAST -- (BLASTP, TBLASTN, etc.) • A nucleotide or protein sequence sent to the BLAST server is compared against the databases, and a summary of matches is returned to the user. • Allows all combinations of DNA or protein query sequences with searches against DNA or protein databases:
BLAST variations • blastp compares an amino acid query sequence against a protein sequence database. • blastn compares a nucleotide query sequence against a nucleotide sequence database. • blastx compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database. • tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands). • tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
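The "six-frame conceptual translation" used by blastx, tblastn and tblastx can be illustrated as follows. This is a sketch using the standard genetic code, not BLAST's own implementation; the frame labels (+1 to +3, -1 to -3) and function names are conventions chosen for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frame_translations(dna):
    """Translate both strands in all three reading frames."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Each of the six protein strings can then be compared against a protein database (as in blastx), with `*` marking stop codons.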
Protein Structure Prediction • Ab initio -- based on energy minimization • fold recognition -- sequence -> secondary structure, then align secondary structures with corresponding secondary structures in related proteins, etc. • statistical -- based on “hidden patterns”; similar patterns -> similar structure
Protein Secondary Structure Prediction • Coils - prediction of coiled coil regions • nnPredict - uses a 2-layer neural network • PSSP / SSP - segment-oriented prediction • PSSP / NNSSP - nearest-neighbor prediction • SAPS - statistical analysis of protein sequences • Paircoil - coiled coil regions from pairwise residue correlations • Protein Hydrophilicity / Hydrophobicity • SOPM - self optimized prediction method
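As an illustration of the hydrophilicity/hydrophobicity item above, a hydropathy profile can be computed by averaging per-residue values over a sliding window. The slides do not say which scale the listed tool uses; this sketch assumes the widely used Kyte-Doolittle scale and a 9-residue window, and the function name is chosen for the example.

```python
# Kyte-Doolittle hydropathy values (assumed scale; positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(protein, window=9):
    """Mean hydropathy of each window of residues along the protein.
    Raises KeyError for characters outside the 20 standard amino acids."""
    values = [KYTE_DOOLITTLE[aa] for aa in protein.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]
```

Sustained stretches of high values suggest hydrophobic segments, such as candidate membrane-spanning regions; low values suggest exposed, hydrophilic regions.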
Protein Function Prediction • Pfam • Groups of proteins with similar function are aligned and an HMM is generated for each “cluster” • A protein of unknown function is compared against the HMMs of known protein families to predict its functional classification
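The idea of building a model per family and scoring an unknown protein against each can be sketched with a position-specific scoring matrix. This is a much-simplified stand-in for a profile HMM: it has no insert or delete states and requires ungapped, equal-length alignments. It is not how Pfam is implemented, and the family names, pseudocount and uniform background are assumptions for the example.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(aligned, pseudocount=1.0):
    """Per-column log-odds scores from an ungapped alignment of equal-length
    sequences, relative to a uniform background (an assumption)."""
    background = 1.0 / len(ALPHABET)
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({aa: math.log((counts[aa] + pseudocount) / total / background)
                        for aa in ALPHABET})
    return profile

def best_window_score(query, profile):
    """Highest score of the profile over all ungapped placements in query."""
    width = len(profile)
    scores = [sum(col[aa] for col, aa in zip(profile, query[i:i + width]))
              for i in range(len(query) - width + 1)]
    return max(scores) if scores else float("-inf")

def classify(query, profiles):
    """Name of the family whose profile scores the query highest."""
    return max(profiles, key=lambda name: best_window_score(query, profiles[name]))

# Hypothetical toy families, for illustration only.
profiles = {
    "famA": build_profile(["ACDE", "ACDE", "ACDF"]),
    "famB": build_profile(["WWYY", "WWYY", "WWYF"]),
}
```

With these toy families, `classify("GGACDEGG", profiles)` picks `famA` because the query contains a window matching that family's conserved columns.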
Pfam components • PROTEIN HMM SEARCH - Analyze a protein query sequence to find Pfam domain matches. • DNA HMM SEARCH - Analyze a DNA query sequence to find Pfam domain matches. (Uses the GeneWise server at the Sanger Centre.) • BROWSE PFAM - View Pfam annotation and alignments. • TEXT SEARCH - Query Pfam by keywords. • BROWSE SWISSPFAM - View the domain organization of any SWISSPROT/TrEMBL sequence according to Pfam.
Types of Tools: Across organisms • Phylogeny Reconstruction, comparing sequences from several organisms
Phylogeny Reconstruction • Construct evolutionary trees based on divergences that occur in related sequences • parsimony, minimum distance, etc. • parsimony -- construct tree so that number of mutation events is minimized • PHYLIP, PAUP, others, some interactive
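The parsimony criterion can be made concrete with Fitch's algorithm, which counts the minimum number of mutation events needed to explain the sequences on a given tree. This sketch only scores a fixed rooted binary tree; searching over tree topologies, as packages such as PHYLIP and PAUP do, is not shown. The nested-tuple tree representation and the function names are assumptions for the example.

```python
def fitch(tree, leaf_states):
    """Fitch's algorithm for one alignment column.
    tree: a leaf name (str) or a pair (left, right) of subtrees.
    leaf_states: dict mapping leaf name -> character at this site.
    Returns (set of candidate ancestral states, minimum number of changes)."""
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left_set, left_cost = fitch(tree[0], leaf_states)
    right_set, right_cost = fitch(tree[1], leaf_states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: one mutation event is needed at this node.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Total minimum number of changes over all sites.
    alignment: dict mapping leaf name -> aligned sequence (equal lengths)."""
    length = len(next(iter(alignment.values())))
    return sum(fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
               for i in range(length))
```

Comparing `parsimony_score` across candidate trees for the same alignment shows which topology requires the fewest mutation events.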
Visualization Tools • Database viewers • Sequence viewers • Molecular viewers
Physical Mapping Software • Used to physically locate genetic markers. • FPC - Software for FingerPrinting Contigs. • Image 3.x - Software for processing fingerprint gel images. • RHServer - Web interface that positions one or more markers on the 1998 International Gene Map (GB4). • SAM - System for Assembling Markers. SAM takes as input a set of clones and their associated markers, and outputs a partially ordered marker map. • Z-RHMAPPER - Extensions to the RHMAPPER (Whitehead) Radiation Hybrid Mapping Package.
Good Resources • Pedro’s BioMolecular Research: http://www.public.iastate.edu/~pedro/rt_1.html • BCM pages: www.hgsc.bcm.tmc.edu/SearchLauncher/index.html • Sanger Centre: www.sanger.ac.uk/Software/Sequencing/overview.shtml • Mining Co. Web Site: genetics/miningco.com • & many others