Bioinformatics Overview School of B&I TCD May 2010
Who, me? • Andrew Lloyd • atlloyd@tcd.ie • 087-225-9850, 053-9255717, 01-896-2450 • Director INCBI 1993-2000 • Population genetics, evolution • Whole genome analysis • Immunology, chickens, FIRM
Definition/scope • Storage, retrieval and analysis of biological (sequence) information. • Insert better definition here • Case can be made for microarray analysis • NOT • ecoinformatics (ecology) • Image analysis • Bar-coding hospital sheets
Philosophy “Nothing that is worth knowing can be taught” Oscar Wilde
Getting bioinformation • Type it in: A,T,C,C,G,T,C,A (1991) • Access databases • Literature (Pubmed) • Medical (OMIM) • DNA sequence (EMBL/GenBank) • Protein sequence (UniProt, SwissProt, PIR) • 3-D structure (PDB)
Annotation • In any DB, half is data and half context. • Gene ontology (language) • Parsing sequence (ORF, RBS, intron, α-helix) • Recognising similar sequences (evolution!) • Complementary info: DB cross-referencing • (DNA -> Protein -> 3D structure -> motifs)
Secondary databases • Protein motifs, domains, families • RNA structures (16S ribosomal RNA…) • Taxonomy/classification • Metabolic pathways (KEGG) • Enzymes (Brenda, TCD, Ireland) • SNPs: mutations and variants • Disease DBs (OMIM) • Immuno, epitope DBs
Complete genomes • Ensembl (complex, basically vertebrate) • Uniform look-and-feel; cross-refs • UCSC GoldenPath browser • Plants • Bacterial genomes • Including mitochondrial, chloroplast • Eubacteria vs Archaea vs Eukaryotes
Annotated/known genes • What does my gene do? • Blast (fasta) against the DB • SRS/Entrez to access databases • Neighboring (similar things in same DB) • DB cross-references • full picture of attributes • What biochemical pathway?
The territory [diagram of linked databases] • Full-text journals • PubMed • OMIM • GenBank/EMBL DNA sequence • UniProt protein sequence • Maps & genomes • Prosite, Pfam, PSSM • PDB 3-D structure • Taxonomy
Databases • BIG • EMBL/GenBank: 200 Gbp, 100m entries, 2500 complete genomes, 200K species • Encycl. Britannica: 180m letters, 40m words • EMBL ≈ 1 km of Britannica volumes • Doubling every 14-18 months • Human genome is X bp?
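The growth figures on this slide can be turned into simple arithmetic. The sketch below assumes steady exponential growth with a fixed doubling time, which is only an approximation; the function name `projected_size` is illustrative, and the 200 Gbp and 180m-letter figures are taken from the slide.

```python
def projected_size(start_gbp, months, doubling_months):
    """Size after `months` of steady growth that doubles every `doubling_months`."""
    return start_gbp * 2 ** (months / doubling_months)

EMBL_GBP = 200                 # slide figure: 200 Gbp in EMBL/GenBank
BRITANNICA_LETTERS = 180e6     # slide figure: 180m letters in Britannica

# How many Britannicas' worth of letters does the database hold?
britannicas = EMBL_GBP * 1e9 / BRITANNICA_LETTERS

# Five-year projection at the fast and slow ends of the quoted doubling range.
for doubling in (14, 18):
    size = projected_size(EMBL_GBP, 60, doubling)
    print(f"doubling every {doubling} months: about {size:.0f} Gbp after 5 years")
```

On these numbers the database holds roughly 1,100 times as many letters as the encyclopaedia.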
Intrinsic vs Context: Internal • DNA, protein sequence • DNA: purine/pyrimidine • AAs: small, hydrophobic, aromatic, polar • Variants: SNPs, indels, alternative splicing • Secondary structure • DNA: stem/loops • Protein: helix, sheet, turn, loop
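The purine/pyrimidine split matters when classifying single-base variants. As a minimal sketch, a substitution within one class (purine to purine, or pyrimidine to pyrimidine) is conventionally called a transition, and one that crosses classes a transversion. Those two terms and the function names are additions here, not from the slide.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def base_class(base):
    """Return 'purine' or 'pyrimidine' for a DNA base."""
    base = base.upper()
    if base in PURINES:
        return "purine"
    if base in PYRIMIDINES:
        return "pyrimidine"
    raise ValueError(f"not a DNA base: {base!r}")

def substitution_type(ref, alt):
    """Classify a single-base change by the chemical class of each base."""
    if ref.upper() == alt.upper():
        return "identical"
    same_class = base_class(ref) == base_class(alt)
    return "transition" if same_class else "transversion"

print(substitution_type("A", "G"))  # purine to purine
print(substitution_type("A", "C"))  # purine to pyrimidine
```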
Intrinsic vs Context: External (context for your molecule) • In other species (homologs, phylogenetic trees) • In which cell • In which cellular location (GO) • Molecular complex (dimers) • Which pathway (KEGG) • Where in genome (neighbors, synteny)
New Unknown Gene • Blast homology searching • Genomic location/neighboring genes • Where is it expressed? • How regulated (control sequences) • Intron/exon structure • Domain structure • Restriction sites etc. • Primer design
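Two of the items above, restriction sites and primer design, reduce to short string operations. The sketch below is illustrative only: it searches for the EcoRI recognition site GAATTC by plain string matching, and estimates a primer melting temperature with the Wallace rule of thumb (2 °C per A/T plus 4 °C per G/C), which is only a rough guide for short oligos. The function names and the example sequence are made up for the illustration.

```python
def find_sites(seq, site="GAATTC"):
    """Return 0-based start positions of every occurrence of `site` in `seq`."""
    seq = seq.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions

def wallace_tm(primer):
    """Rough melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc

print(find_sites("AAGAATTCGGGAATTCTT"))  # [2, 10]
print(wallace_tm("ATGCATGC"))            # 24
```

Real primer-design tools also check hairpins, primer dimers and salt conditions, none of which is modelled here.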
DNA/gene structure • Four bases: A, T, C, G (U replaces T in RNA) • 2 pyrimidines, 2 purines • LOTS of them: how many? • Open reading frame • 5’ signals, 3’ signals • Introns/exons • Neighbours (operons)
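Base composition and strand direction are the simplest things to compute from a raw sequence. This is a minimal sketch with illustrative function names; it assumes an unambiguous A/C/G/T sequence with no N or other IUPAC codes.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def to_rna(seq):
    """Transcribe the coding strand: T becomes U."""
    return seq.upper().replace("T", "U")

print(reverse_complement("ATGC"))  # GCAT
print(gc_content("ATGC"))          # 0.5
print(to_rna("ATGC"))              # AUGC
```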
Two sequences • Alignment • Local • Global • Dotplot • Threading
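The difference between global and local alignment is easiest to see in the dynamic-programming scores. The sketch below computes only the best score, not the alignment itself. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration. With `local=False` it follows the Needleman-Wunsch recurrence; with `local=True` it follows Smith-Waterman, where scores never drop below zero and the best cell anywhere in the table wins.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Best alignment score of a and b (global by default, local if requested)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: leading gaps are penalised along the first row and column.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    return best if local else score[-1][-1]

print(align_score("ACGT", "AGT"))                                  # 2
print(align_score("TTTGATTACATTT", "CCCGATTACACCC", local=True))   # 7
```

In the second call the shared core GATTACA scores 7 locally even though the flanking regions do not match at all.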
One seq vs many • Homology search vs database • Special case of 2-seq alignment • Blast vs fasta • Limit by species/taxon • Substitution matrices • Low complexity masking
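Low-complexity masking hides repetitive stretches that would otherwise produce spurious database hits. Real search tools use dedicated filters; the sketch below is only an entropy-based stand-in to show the idea. The window size, the threshold and the function names are all assumptions.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy in bits per symbol of the letters in `window`."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())

def mask_low_complexity(seq, window=12, threshold=1.0, mask_char="N"):
    """Replace every window whose entropy is below `threshold` with `mask_char`."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)

print(mask_low_complexity("AAAAAAAAAAAA"))  # fully masked
print(mask_low_complexity("ACGTACGTACGT"))  # left unchanged
```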
Multiple sequence alignment • MSA • Progressive alignment • ClustalW or (better) T-Coffee
Phylogenetic trees • Computationally intensive • Distance matrix methods • Neighbor-joining (NJ) • UPGMA • Minimum evolution • Maximum parsimony • Maximum likelihood • Bayesian methods
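Of the methods listed, UPGMA is the simplest to sketch: repeatedly merge the two closest clusters and average their distances to everything else, weighted by cluster size. The example below assumes pre-aligned sequences of equal length and uses the plain proportion of differing sites (p-distance) as the distance; both are simplifying assumptions. The sequences are toy data, not real genes.

```python
from itertools import combinations

def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def upgma(seqs):
    """Build a tree (nested tuples of names) from a dict of aligned sequences."""
    names = list(seqs)
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (tree, size)
    dist = {(i, j): p_distance(seqs[names[i]], seqs[names[j]])
            for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)          # closest pair of clusters
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        for k in clusters:
            d_ik = dist.pop((min(i, k), max(i, k)))
            d_jk = dist.pop((min(j, k), max(j, k)))
            # Size-weighted average distance from the merged cluster to k.
            dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        del dist[(i, j)]
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    tree, _size = next(iter(clusters.values()))
    return tree

toy = {"human": "AAAAAAAA", "chimp": "AAAAAAAT", "mouse": "AATTTTTT"}
print(upgma(toy))  # ('mouse', ('human', 'chimp'))
```

UPGMA assumes a constant rate of change on all branches, which is one reason the other methods on the slide are usually preferred for real data.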
Genefinding • Special case of DNA analysis • How to annotate a genome • Bacterial • Find open reading frames (ORFs) • With start/stop codons • With promoter, RBS, CAAT, TATA • Eukaryotic • As above PLUS • Introns/exons • Alternative splicing
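The bacterial case, finding open reading frames between start and stop codons, can be sketched in a few lines. This illustration scans only the three forward frames, takes the first ATG in each run as the start, and ignores promoters, RBS and the reverse strand; `min_codons` and the function name are assumptions. A real gene finder would do considerably more.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end) 0-based half-open coordinates of ATG...stop ORFs."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon in this run
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # include the stop codon
                start = None
    return sorted(orfs)

dna = "CCATGAAATGGTAACC"
for begin, end in find_orfs(dna):
    print(begin, end, dna[begin:end])  # 2 14 ATGAAATGGTAA
```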
Typical mammalian gene structure [diagram] • DNA, 5’ to 3’: control region, Start (ATG), Exon 1, Exon 2, Exon 3, Exon 4, Stop • Introns (gt… …ag) lie between the exons • Introns are “spliced out” of the RNA and discarded • Stop: TAG, TGA, TAA • ATGCCCAGGAGATTTGGA . . . → MetProArgArgPheGly . . . (protein) • miRNAs?
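The DNA-to-protein example on this slide can be checked with a short translation routine. The sketch below builds the standard genetic code from its usual compact string form (bases in T, C, A, G order) and translates the slide's sequence; the function names are illustrative.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}

def translate(dna):
    """Translate codon by codon from position 0, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino = CODON_TABLE[dna[i:i + 3]]
        if amino == "*":          # TAA, TAG or TGA
            break
        protein.append(amino)
    return "".join(protein)

def three_letter(protein):
    """Spell a one-letter protein string in three-letter codes."""
    return "".join(THREE_LETTER[aa] for aa in protein)

peptide = translate("ATGCCCAGGAGATTTGGA")
print(peptide)                # MPRRFG
print(three_letter(peptide))  # MetProArgArgPheGly
```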
Protein substructure • DNA makes protein and protein (enzymes) make everything else. • 20 Amino acids • Amino acid properties • Motifs • Domains • Biological units
Protein 3-D structure • Relationship between sequence & structure • Secondary structure • Alpha helix • Beta sheet • Coil • Turn • Threading sequence to homologous structure
Gene Expression • EST • SAGE • Microarray • Clustering of co-expressed genes
Genomics • Complete DNA seq for a species • Gene order • Gene clusters/operons • Missing operons • Gene duplication • Whole genome duplication (WGD)
SNPs • Key issue in genetics is that two organisms are both the same and different: • Humans vs chimps vs mouse • Parent vs offspring vs co-national vs human • Single nucleotide polymorphisms • Variation between individuals • Pharmacogenetics • Personal tailored medicine
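Spotting single nucleotide differences between two individuals comes down to comparing aligned sequences position by position. This is a minimal sketch that assumes the two sequences are already aligned to the same length and that `-` marks a gap; the function name and the toy sequences are illustrative.

```python
def find_snps(ref, alt):
    """Return (position, ref_base, alt_base) for each single-base difference.

    Positions are 1-based. Columns containing a gap ('-') are skipped,
    since those are indels rather than SNPs.
    """
    if len(ref) != len(alt):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i + 1, r, a)
        for i, (r, a) in enumerate(zip(ref.upper(), alt.upper()))
        if r != a and "-" not in (r, a)
    ]

print(find_snps("ACGTACGT", "ACGAACGT"))  # [(4, 'T', 'A')]
```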
Summary/take home • Course designed to give you access to databases, software tools • …and ways of thinking about data