Explore sequence assembly, analysis, database searching, alignment, structure prediction, and gene finding in bioinformatics. Learn text and sequence-based database searching methods with heuristic algorithms and alignment scoring.
Basic Overview of Bioinformatics Tools and Biocomputing Applications II Dr Tan Tin Wee Director Bioinformatics Centre
Common Computational Analyses • Sequence Assembly • Simple sequence analysis • Translation and reverse Complement, ORF • Composition statistics (protein & DNA) • Molecular mass • Total charge and pI; local hydropathy • Simple determination of secondary structures • Restriction site analysis • Internal repeat analysis • Detection of active sites, functional residues, characteristic structures, substrates, and processing signals
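Several of the simple sequence analyses listed above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, the input is assumed to contain only A, C, G and T, and translation assumes the standard genetic code.

```python
# Minimal sketch: reverse complement, GC composition and single-frame translation.
# Assumes plain DNA strings over A, C, G, T and the standard genetic code.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Standard genetic code laid out in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def gc_content(dna):
    """Return the fraction of G and C bases in a non-empty DNA string."""
    dna = dna.upper()
    return (dna.count("G") + dna.count("C")) / len(dna)


def translate(dna, frame=0):
    """Translate one reading frame; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)
```

Translating all six frames amounts to calling `translate` with frames 0 to 2 on both the sequence and its reverse complement.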
Common Computational Analyses • Database sequence search • Multiple alignment • 2° and 3° Structure prediction; transmembrane helix detection • Structure modeling • Docking prediction and design • Hidden Markov model searches
Database Searching • Text-based Database Searching - using a text string to match an annotation in a sequence database record, i.e. keyword search • Sequence-based Database Searching - using a biological sequence as the query and matching all or part of it against the sequence in every database record
Text-Based Database Searching • Examples: Entrez, SRS, DBGET, AceDB - common integrated database systems • Search Concepts • Boolean Search - AND, OR, NOT • Broadening Search • Narrowing the Search • Proximity searching, soundex • Wild Card, Stemming e.g. Thala* for thalasemia, thalassemia, thalassemic • Use standard string search algorithms and Boolean operations, vocabulary matches
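The Boolean and wild-card concepts above can be illustrated with a small keyword search over annotation text. This is a sketch under stated assumptions, not how Entrez or SRS work internally: the records are made up, and `matches` and `search` are hypothetical helpers built on Python's standard `fnmatch` module.

```python
import fnmatch

# Hypothetical annotation records, for illustration only.
RECORDS = {
    "rec1": "human beta thalassemia mutation in the beta-globin gene",
    "rec2": "mouse thalassemic model of alpha-globin deficiency",
    "rec3": "human period circadian protein homolog",
}


def matches(term, text):
    """True if any whitespace-separated word matches the term ('*' is a wild card)."""
    pattern = term.lower()
    return any(fnmatch.fnmatchcase(word, pattern) for word in text.lower().split())


def search(records, all_of=(), any_of=(), none_of=()):
    """Return record ids satisfying AND (all_of), OR (any_of) and NOT (none_of)."""
    hits = []
    for rec_id, text in records.items():
        if not all(matches(term, text) for term in all_of):
            continue  # an AND term is missing
        if any_of and not any(matches(term, text) for term in any_of):
            continue  # none of the OR terms is present
        if any(matches(term, text) for term in none_of):
            continue  # a NOT term is present
        hits.append(rec_id)
    return hits
```

Adding terms to `all_of` or `none_of` narrows the search, while adding terms to `any_of` or replacing a word with a wild-card stem such as `thala*` broadens it.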
Text-based Database Searching • Example: To find the human homolog of the Drosophila per gene • Procedure • Web to Entrez • All Fields: enter "human" "per" • Hits returned, irrelevant - broaden search • "human" "period" - more hits • Check every one, find the human RIGUI gene • Hit and miss, clever guesswork; free form or controlled vocabulary (MeSH terms)? Use Boolean searches?
Sequence-based Database Searching • Homology Search • Global or Local Sequence Alignment • Needleman-Wunsch Algorithm • Smith-Waterman Algorithm • Lipman-Pearson FASTA • Altschul's BLAST • Take a query sequence and compare it pairwise with each sequence in the database
Sequence-based Database Searching • Basic Assumptions: • Sequences of homologous genes/proteins diverge over time even though structure and/or function change little • Significant sequence similarity is taken to imply potential structural/functional similarity or a common evolutionary origin • Based on a well-characterised protein, infer the function of an unknown sequence at the gene or protein sequence level.
Sequence-based Database Searching • Global Alignment - forces complete alignment of the two input sequences in the pairwise comparison • Local Alignment - looks for local stretches of similarity and tries to align the most similar segments • The algorithms used may be similar, but the output differs; statistics are needed to assess the results
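The difference between global and local alignment can be seen in a single dynamic-programming routine. The sketch below is illustrative only: it computes scores rather than the alignments themselves, and it assumes a simple match/mismatch score with a linear gap penalty instead of a substitution matrix and affine gaps. The function name and default parameters are hypothetical.

```python
def align_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Return the optimal alignment score of strings a and b.

    local=False: global (Needleman-Wunsch style), both sequences aligned end to end.
    local=True:  local (Smith-Waterman style), best-scoring pair of segments.
    """
    # First row of the score matrix: leading gaps cost nothing in local mode.
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)  # a local alignment may restart anywhere
            best = max(best, score)
            curr.append(score)
        prev = curr
    # Global score is the bottom-right cell; local score is the best cell anywhere.
    return best if local else prev[-1]
```

The only differences between the two modes are the zero floor, the zero-initialised borders and where the final score is read, which is why the algorithms look similar while their outputs differ.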
Sequence-based Database Searching • Alignment Scoring • Substitution score and substitution matrix - PAM, BLOSUM • Affine gap costs/gap penalty and gap scores • Optimal alignments, dynamic programming - Needleman-Wunsch algorithm, Smith-Waterman algorithm (SSEARCH) • Additional heuristics to speed up the search - FASTA, BLAST
Some definitions • Affine gap costs - scoring system for gaps within alignments which charges a penalty for opening a gap and an additional per-residue penalty proportional to the size of the gap • Alignment score - numerical value indicating the overall quality of an alignment; the higher the score, the better the alignment. • Algorithm - fixed procedure embodied in a computer program • Heuristics - a computer science term referring to guesses made by the program to approximate results, usually based on arbitrary or predefined rules. • Gapped Alignment - alignment of sequences where gaps are permitted
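The affine gap cost and alignment score defined above can be made concrete with a short sketch. The parameter values and function names here are assumptions for illustration, and the convention used (opening penalty plus extension penalty times gap length) is only one of several in use; some programs charge the extension penalty for length minus one instead.

```python
def affine_gap_cost(length, gap_open=10, gap_extend=1):
    """Cost of one gap: a fixed opening penalty plus a per-residue penalty."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * length


def score_alignment(row_a, row_b, match=2, mismatch=-1, gap_open=10, gap_extend=1):
    """Score two equal-length aligned rows in which '-' marks a gap position."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    run = 0          # length of the gap currently being extended
    run_row = None   # which row that gap is in ("a", "b" or None)
    for x, y in zip(row_a, row_b):
        gap_row = "a" if x == "-" else "b" if y == "-" else None
        if gap_row is None or gap_row != run_row:
            score -= affine_gap_cost(run, gap_open, gap_extend)  # close previous gap
            run = 0
        if gap_row is None:
            score += match if x == y else mismatch
        else:
            run += 1
        run_row = gap_row
    score -= affine_gap_cost(run, gap_open, gap_extend)  # gap reaching the end
    return score
```

With these example values one gap of three residues costs 13, whereas three separate one-residue gaps cost 33, so the scheme favours a few long gaps over many short ones.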
Computational Genefinding • Major challenge in genome projects • Given a DNA sequence, where does a gene begin and stop? - ORF • Where are the exons and introns? • Where are the transcription elements? • Gene structure and other regulatory elements?
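The simplest answer to "where does a gene begin and stop?" is an open reading frame (ORF) scan. The following is a minimal sketch with hypothetical names: it assumes the standard start codon ATG and stop codons TAA, TAG and TGA, searches the forward strand only, and ignores introns, so it is at best a starting point for intron-free sequences.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return sorted (start, end) index pairs of ATG...stop ORFs, forward strand only.

    'end' is exclusive and includes the stop codon. min_codons counts the
    codons before the stop codon, including the ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)
```

Scanning the reverse complement of the sequence with the same function would cover the other strand.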
Genomic Elements • Intron-exon splice sites • Start-stop codons • Branch points • Promoters and terminators of transcription • Polyadenylation sites • Ribosome binding sites • Topoisomerase II binding sites • Topoisomerase I cleavage sites • Transcription factor binding sites
Detecting Genomic Elements • Local sites and motifs/patterns for such elements - signals and signal sensors • Extended variable-length regions, e.g. exons and introns - contents and content sensors • Linguistic technique - gene structure described in a formal grammar - GeneLang genefinding program
Signal sensors • Simple consensus sequence - use of pattern-matching algorithms • Weight matrices - assign a weighted score to each position of a candidate site, and the position scores are summed to give the sensor score • Use of Artificial Neural Networks (ANN)
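A weight-matrix signal sensor can be sketched as a sliding window that sums one score per position. The matrix below is hypothetical (integer scores favouring the motif TATA, chosen only for illustration, not derived from real data), as are the function names, and the input is assumed to contain only A, C, G and T.

```python
# Hypothetical 4-position weight matrix: one base -> score table per position.
WEIGHT_MATRIX = [
    {"A": -2, "C": -2, "G": -2, "T": 3},
    {"A": 3, "C": -2, "G": -2, "T": -2},
    {"A": -2, "C": -2, "G": -2, "T": 3},
    {"A": 3, "C": -2, "G": -2, "T": -2},
]


def score_window(window, matrix):
    """Sum the per-position scores of a window as wide as the matrix."""
    return sum(column[base] for base, column in zip(window, matrix))


def scan(dna, matrix, threshold):
    """Return (position, score) for every window scoring at least the threshold."""
    dna = dna.upper()
    width = len(matrix)
    hits = []
    for i in range(len(dna) - width + 1):
        score = score_window(dna[i:i + width], matrix)
        if score >= threshold:
            hits.append((i, score))
    return hits
```

Unlike a strict consensus match, lowering the threshold lets the sensor accept sites that deviate from the preferred base at some positions.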
Content Sensors • Long ORF for bacteria • Statistical models, e.g. Markov models - GeneMark: statistical models of nucleotide frequencies and dependencies in codon structure • Neural nets, e.g. Grail: exon detection by neural network combined with signal sensors for exon-intron splice sites
Some Definitions • Artificial Neural Nets - statistical pattern recognition method - a type of nonlinear regression • Markov Models - statistical models for sequences in which the probability of each residue depends on the residues preceding it. • Dynamic Programming - type of algorithm widely used for constructing sequence alignments and for evaluating all possible candidate gene structures
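The Markov model definition above can be illustrated with a first-order chain, in which each base depends only on the one before it. This is a simplified sketch, not the model used by GeneMark or any other program: the function names are hypothetical, pseudocounts are an assumption added to avoid zero probabilities, and the probability of the first base is ignored.

```python
import math

ALPHABET = "ACGT"


def train_markov(sequences, pseudocount=1.0):
    """Estimate first-order transition probabilities P(next | previous)."""
    counts = {p: {n: pseudocount for n in ALPHABET} for p in ALPHABET}
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    # Normalise each row so that the probabilities leaving a base sum to 1.
    return {
        prev: {nxt: c / sum(row.values()) for nxt, c in row.items()}
        for prev, row in counts.items()
    }


def log_likelihood(seq, model):
    """Log probability of the transitions in seq under the model."""
    return sum(math.log(model[prev][nxt]) for prev, nxt in zip(seq, seq[1:]))
```

A content sensor built this way would train one model on known coding sequence and another on non-coding sequence, then compare the two log-likelihoods for a candidate region.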
Other Genefinding methods • Use of dynamic programming • Linguistic rules for functional features • Parameters of a Markov process on hidden variables - hidden Markov models (HMM) • HMM genefinders - EcoParse, Xpound, GeneMark HMM, Veil, HMMgene, GenScan