The Poor Beginners’ Guide to Bioinformatics


What we have – and don’t have...
• a computer connected to the Internet (incl. Web browser)
• a text editor (Notepad or better)
• public databases of genomic sequences
• public databases of cDNA + EST
• public databases of protein sequences, structures and motifs
• money for specialised software packages (the one thing we don’t have)
• public servers capable of (almost) anything we wish to do

Dealing with a sequence: model tasks
• basic (DNA) sequence manipulation: restriction analysis, translation…
• sequence similarity and pattern/motif searches
• gene building: modelling exon-intron structures
• protein domain searches, structure analysis
• construction and interpretation of sequence alignments
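As an illustration of the first task, here is a minimal sketch of a restriction site search and a translation using nothing but a text editor and Python. The function names `find_sites` and `translate` and the test sequence are made up for this example; the codon table is the standard genetic code, and only the given strand is searched (which is enough for palindromic sites such as EcoRI's GAATTC).

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def find_sites(dna, site):
    """Return 1-based start positions of a recognition site on the given strand."""
    dna, site = dna.upper(), site.upper()
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start + 1)
        start = dna.find(site, start + 1)  # allow overlapping hits
    return positions


def translate(dna, frame=0):
    """Translate DNA in the given frame (0, 1 or 2); unknown codons become 'X'."""
    dna = dna.upper()
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


# Hypothetical test sequence containing one EcoRI site.
example = "ATGGAATTCTGGTAA"
print(find_sites(example, "GAATTC"))
print(translate(example))
```

Stop codons are shown as `*`; translating in all three frames of both strands would need a reverse-complement step that is left out here.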

Notes on basic sequence handling
• Make sure you have the correct format.
• FASTA format is (almost) always correct.
  >sequencename
  thisisasequenceinfastaformat
• If not, you can always use raw data.
• If things don’t work, check for gaps in sequence, empty lines, and file extension.
• BEWARE OF MICROSOFT!
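The checks listed above can be automated before pasting a sequence into a server. The sketch below is only one way to do it: `check_fasta` is a hypothetical helper, and treating carriage returns as a problem is an assumption about what the Microsoft warning refers to (word-processor formatting and Windows line endings).

```python
def check_fasta(text):
    """Return a list of human-readable problems found in FASTA-formatted text."""
    problems = []
    if "\r" in text:
        # Assumption: some servers choke on Windows-style line endings.
        problems.append("carriage returns found (non-Unix line endings)")
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    while lines and lines[-1] == "":
        lines.pop()  # a final newline is fine
    if not lines or not lines[0].startswith(">"):
        problems.append("first line is not a '>' header")
    for number, line in enumerate(lines, start=1):
        if line.startswith(">"):
            continue
        if line.strip() == "":
            problems.append(f"line {number}: empty line")
        elif not line.isalpha():
            problems.append(
                f"line {number}: non-letter characters (spaces, digits, gap symbols)"
            )
    return problems


good = ">sequencename\nthisisasequenceinfastaformat\n"
bad = ">sequencename\nthisisa sequence\n\ninfastaformat\n"
print(check_fasta(good))
print(check_fasta(bad))
```

An empty list means none of these particular problems were found; it does not prove that a given server will accept the file.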

Model tasks continued …
• basic (DNA) sequence manipulation: restriction analysis, translation…
• sequence similarity and pattern/motif searches
• gene building: modelling exon-intron structures
• protein domain searches, structure analysis
• construction and interpretation of sequence alignments

Defining a gene family…
[Figure: domain diagram with labels FH3?, FH1, FH2]
• By overall domain structure
• By domain sequence
• Based on a peptide motif: L-X-X-G-N-X-[ML]-N
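A motif written this way maps almost directly onto a regular expression: X means any residue and [ML] means M or L. The sketch below handles only these simple elements, not the full PROSITE syntax (no repeat counts or exclusions); `prosite_to_regex`, `find_motif` and the test protein are invented for illustration.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple dash-separated motif (residues, X, [classes]) to a regex."""
    parts = []
    for element in pattern.split("-"):
        if element.upper() == "X":
            parts.append(".")      # X = any residue
        else:
            parts.append(element)  # a single residue or a class such as [ML]
    return "".join(parts)


def find_motif(pattern, protein):
    """Return (1-based position, matched peptide) for every hit, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(protein.upper())]


motif = "L-X-X-G-N-X-[ML]-N"
protein = "MKLAAGNSMNPQ"  # hypothetical sequence, not a real protein
print(prosite_to_regex(motif))
print(find_motif(motif, protein))
```

The lookahead in `find_motif` is there so that overlapping occurrences are not skipped.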

Sequence comparison-based searches
• Entrez “related sequences”
  - easy identification of “false starts”
  - no organism selection
• BLAST/FASTA
  - all DNA/protein combinations
  - taxonomy selection possible
  - statistical data provided
  - domain structure comparison available
  - divergent motifs may be missed
Two methods are better than one.

Notes on all sequence comparisons, searches, alignments…
• Start with defaults (the authors know what they are doing)…
• … BUT don’t be afraid to vary the parameters
• Choose a reasonable scoring matrix:
  - Distant sequences: low BLOSUM, high PAM
  - Closely related sequences: low PAM, high BLOSUM
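The scoring matrix rule of thumb can be written down as a small lookup. This is only a sketch: `suggest_matrices` is a hypothetical helper, and the specific matrices (BLOSUM45/PAM250 and BLOSUM80/PAM30) are picked here as typical examples of "low" and "high" numbers, not as recommendations from these slides.

```python
# Example matrices for each situation, as (BLOSUM, PAM) pairs.
# The numbers run in opposite directions: BLOSUM counts identity, PAM counts divergence.
MATRIX_SUGGESTIONS = {
    "distant": ("BLOSUM45", "PAM250"),  # low BLOSUM, high PAM
    "close": ("BLOSUM80", "PAM30"),     # high BLOSUM, low PAM
}


def suggest_matrices(relatedness):
    """Return example (BLOSUM, PAM) matrices for 'distant' or 'close' sequences."""
    try:
        return MATRIX_SUGGESTIONS[relatedness]
    except KeyError:
        raise ValueError("relatedness must be 'distant' or 'close'") from None


print(suggest_matrices("distant"))
print(suggest_matrices("close"))
```

Anything in between is usually left at the server's default, in line with the "start with defaults" advice.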

Motif-based searches
• sensitive
• no statistics
• only protein databases can be searched
• TAIR PatMatch
  - Arabidopsis-specific
  - problematic user interface
• ISREC - INSECTS
  - admirable technology
  - access to SwissProt and TrEMBL
  - no organism selection

Some genes are more alike than others…
• A number of splicing prediction servers are available
• Agreement between different methods is a good sign but no absolute measure
• Always align ESTs if possible
• Beware of non-conventional intron boundaries (GC-AG instead of GT-AG)
• Plant data for predicting transcription start sites and transcription factor binding sites are limited
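Once an exon-intron model exists, the intron boundaries mentioned above are easy to check by hand or with a few lines of code. The sketch below assumes 1-based, inclusive intron coordinates on the given strand; `classify_intron` and the toy genomic sequence are hypothetical.

```python
def classify_intron(genomic, start, end):
    """Classify an intron by its first and last two bases.

    start and end are 1-based, inclusive coordinates of the intron.
    """
    intron = genomic[start - 1:end].upper()
    if len(intron) < 4:
        raise ValueError("intron too short to have both boundaries")
    donor, acceptor = intron[:2], intron[-2:]
    if (donor, acceptor) == ("GT", "AG"):
        return "canonical GT-AG"
    if (donor, acceptor) == ("GC", "AG"):
        return "non-conventional GC-AG"
    return f"unusual {donor}-{acceptor} (check the model)"


# Toy example: 6 bases of exon, a 12-base intron, 6 bases of exon.
genomic = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
print(classify_intron(genomic, 7, 18))
```

An "unusual" result more often points to a misplaced boundary than to a genuinely exotic intron, so it is worth re-checking against an EST alignment.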

Searching for known domains/motifs
• Searching for PROSITE patterns – allowing ambiguities
• PROSITE and Pfam profile searches
• SMART, CDsearch (domains and more)

Predicting protein localisation
• transmembrane segments prediction
• predicting signal peptides/anchors
• 2 methods available
• possibility to predict organelle localisation

Alignment: “manual” or automated?
• “Manual”
  - locally installed, free, for Mac and PC
  - interactive domain definition
  - statistical data provided
  - may produce false-positive blocks (read the on-line manual!)
• Automated
  - “objective” results
  - a number of servers available
  - recommended for well-conserved proteins
  - empirical parameters (e.g. gap penalties)
  - bad for divergent sequences

Phylogenetic analyses
• Two methods are better than one.
• Your phylogeny cannot be better than your alignment.
• Gaps are not data.
• Always do bootstrapping (100-500 cycles)
• Certain questions cannot be answered from an unrooted tree.
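To show what a bootstrap cycle actually does, here is a minimal sketch of the resampling step: alignment columns are drawn with replacement to build pseudo-alignments of the original length. Building a tree from each replicate and counting how often each branch recurs is left to the phylogeny program. The function names and the toy alignment are made up for this example.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement; returns a new name -> sequence dict."""
    length = len(next(iter(alignment.values())))
    if any(len(seq) != length for seq in alignment.values()):
        raise ValueError("all aligned sequences must have the same length")
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(seq[c] for c in columns)
        for name, seq in alignment.items()
    }


def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Generate n_replicates pseudo-alignments (100-500 is the range given above)."""
    rng = random.Random(seed)  # fixed seed so the run can be repeated
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


# Toy alignment of three hypothetical sequences.
toy = {"seqA": "MKLAAGNS", "seqB": "MKLSAGNT", "seqC": "MRLA-GNS"}
replicates = bootstrap_replicates(toy, n_replicates=100)
print(len(replicates), replicates[0])
```

Because whole columns are resampled, each replicate keeps the sequences aligned to one another; only the weight given to each position changes.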

Points to take off...
• go to the Bioinformatics page: http://www2.rhul.ac.uk/~ujba110/Bioinfo.htm
• select your exercise (A, B, C, D, E)
• … and enjoy it!
If you mean it seriously:
• create your own bookmarks (seed provided on the course web page)
