Alastair Kerr, Ph.D. WTCCB Bioinformatics Core An introduction to DNA and Protein Sequence Databases
Questions to address • What are the main sequence databases? • Which one to use for: • Looking up a gene name/identifier from a paper • Identifiers • What should I use and why? • Coordinate based systems • Annotation • Protein domains • Gene Ontology
Database Varieties • Sequence Warehouses • “everything under one roof” • Genome Databases • Containing single genome dataset(s) • Reference Sets • Often human curated, the 'standard' for a particular gene or protein from which variants are defined • Specialist • Short reads from next generation sequencing (Short read archive) • [EST] Expressed sequence tags and [GSS] Genome survey sequence
Sharing primary data • NCBI GenBank • EMBL • DDBJ
NCBI • Warehouse • GenBank <live demo> • NR dataset: NR = non-redundant (but it is not) • Reference Dataset • RefSeq • Genome Datasets • NCBI Genomes
EMBL • Warehouse • EMBL • Historically • Protein set was called translated EMBL (TrEMBL) • Gold standard reference set was called SwissProt • Reference set = UniProt • UniProtKB/Swiss-Prot • Manually annotated and reviewed • UniProtKB/TrEMBL • Automatically annotated and not reviewed • Genome database • Ensembl <live demo>
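The reviewed/unreviewed split is visible in UniProtKB FASTA headers, where the prefix `sp` marks a Swiss-Prot entry and `tr` marks a TrEMBL entry. The sketch below is a minimal illustration only: the function name `parse_uniprot_header` is made up for this example, and the description text and the `tr` entry in the usage lines are placeholders, not real records.

```python
def parse_uniprot_header(header: str) -> dict:
    """Split a UniProtKB FASTA header into database, accession and entry name."""
    # Take the first space-delimited token, e.g. "sp|P08319|ADH4_HUMAN".
    token = header.lstrip(">").partition(" ")[0]
    fields = token.split("|")
    if len(fields) != 3 or fields[0] not in ("sp", "tr"):
        raise ValueError(f"not a UniProtKB FASTA header: {header!r}")
    db, accession, entry_name = fields
    return {
        "reviewed": db == "sp",  # sp = Swiss-Prot (reviewed), tr = TrEMBL (unreviewed)
        "accession": accession,
        "entry_name": entry_name,
    }


# Placeholder description text; only the identifier part matters here.
print(parse_uniprot_header(">sp|P08319|ADH4_HUMAN placeholder description"))
# Hypothetical unreviewed entry, invented for illustration.
print(parse_uniprot_header(">tr|X0X0X0|X0X0X0_HUMAN placeholder description"))
```

Filtering on the `reviewed` flag is one simple way to restrict a mixed download to the manually curated set.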
Live Demo • Search GenBank for human adh4 • How many are there? • How many should there be? • Why are some different from those found in UniProt? • Are there better databases to use? • Which identifier should you use in your lab book?
We should now be able to answer these: • What are the main sequence databases? • Which one to use for: • Looking up a gene identifier from a paper • Searching for a gene name • Searching for orthologous genes from another species
Identifiers • Or what to write in your lab book
How to identify a feature • Gene/protein name • Common name • Standardised Name • Database identifier • Unique for each database • Some have revision numbers • Position in genome • Dependent on Genome build • Position in a Gene/Protein • Protein Domains
Consortia identifiers • Most key species have a consortium / group / community that provides the key identifiers in the field • Humans • Was HUGO (HUman Genome Organisation) • Now the HGNC (HUGO Gene Nomenclature Committee)
Database Identifiers • Every database has its own system for identifying genes and proteins • Example: Human ADH4 • Ensembl • ENSG00000198099 ENST00000423445 ENSP00000397939 • SwissProt • ADH4_HUMAN P08319 • RefSeq • NM_000670.3 NP_000661.2 • GenBank • gi|71565152|ref|NP_000661.2|
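Because each database uses its own identifier format, the source of an identifier copied from a paper can often be guessed from its shape. The sketch below tries this for the ADH4 examples on this slide. The patterns are simplified assumptions written for illustration (only a few RefSeq prefixes are covered, and the legacy `gi|...` header is not handled); the UniProtKB accession pattern is intended to follow the published accession format, but treat all of them as a starting point rather than a validator.

```python
import re

# Ordered list: the first pattern that matches the whole string wins.
PATTERNS = [
    ("Ensembl gene", re.compile(r"ENS[A-Z]*G\d{11}(\.\d+)?")),
    ("Ensembl transcript", re.compile(r"ENS[A-Z]*T\d{11}(\.\d+)?")),
    ("Ensembl protein", re.compile(r"ENS[A-Z]*P\d{11}(\.\d+)?")),
    ("RefSeq mRNA", re.compile(r"NM_\d+(\.\d+)?")),
    ("RefSeq protein", re.compile(r"NP_\d+(\.\d+)?")),
    ("UniProtKB accession", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}")),
    ("UniProtKB entry name", re.compile(r"[A-Z0-9]{1,10}_[A-Z0-9]{1,5}")),
]


def classify(identifier: str) -> str:
    """Return a best-guess label for the database an identifier comes from."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(identifier):
            return label
    return "unknown"


for example in ["ENSG00000198099", "ENST00000423445", "ENSP00000397939",
                "ADH4_HUMAN", "P08319", "NM_000670.3", "NP_000661.2"]:
    print(example, "->", classify(example))
```

The order of the list matters: the more specific RefSeq and Ensembl patterns are checked before the looser entry-name pattern so that they are not misclassified.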
Keeping Track of Changes • Gene models can change • Will the ID you used yesterday still get the same sequence today? • Or: How do you get the latest version of a sequence?
Keeping Track of Changes • GenBank: GI ("GenInfo Identifier") number • A new GI number is assigned each time the sequence changes; the old one is often removed once it is superseded • SwissProt: Accession and ID • The accession (P08319) is the stable identifier; the ID, or entry name (ADH4_HUMAN), can change • RefSeq and Ensembl • Revision-based IDs • NM_000670.3 ENSG00000198099.1 • Format: XXX.number • XXX alone always retrieves the latest version • XXX.number retrieves that specific version
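The XXX.number convention is easy to handle in code: split the identifier at its last dot, keep the stable part for lookups, and record the version in the lab book. A minimal sketch follows; the helper names `split_version` and `same_record` are invented for this example.

```python
from typing import Optional, Tuple


def split_version(identifier: str) -> Tuple[str, Optional[int]]:
    """Split 'XXX.number' into (XXX, number); the version is None if absent."""
    base, dot, version = identifier.rpartition(".")
    if base and dot and version.isdigit():
        return base, int(version)
    return identifier, None


def same_record(id_a: str, id_b: str) -> bool:
    """True if two identifiers refer to the same record, ignoring the version."""
    return split_version(id_a)[0] == split_version(id_b)[0]


print(split_version("NM_000670.3"))        # versioned RefSeq ID from the slide
print(split_version("ENSG00000198099"))    # unversioned Ensembl ID
print(same_record("NM_000670.3", "NM_000670"))
```

Storing the full versioned form and comparing on the unversioned form keeps results reproducible while still allowing matches across releases.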
Demo: Ensembl • Defining: Chromosome coordinates
Chromosome Positions • Features identified by chromosome and position • File formats: BED, WIG, GFF, ... • All major genome databases store features as coordinates • Ubiquitous in deep sequencing studies • Note: coordinates change depending on the assembly • Always note the build number of the genome assembly if you are using coordinates
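One practical trap with coordinate files is that the formats count differently: BED intervals are 0-based and half-open, while GFF and most genome browser displays are 1-based and inclusive. The sketch below converts a BED line to a 1-based region string and prefixes it with the assembly name, as recommended above. The coordinates and feature name are made up, the assembly label is only an illustrative string, and the `assembly:chrom:start-end` layout is an ad hoc note format chosen for this example, not a standard.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def parse_bed_line(line: str) -> Interval:
    """Read the first three BED columns and convert to 1-based inclusive."""
    chrom, start, end = line.split()[:3]
    # BED start is 0-based, so add 1; BED end is exclusive, so it is
    # already equal to the 1-based inclusive end.
    return Interval(chrom, int(start) + 1, int(end))


def format_region(interval: Interval, assembly: str) -> str:
    """Build a lab-book style note that always carries the assembly name."""
    return f"{assembly}:{interval.chrom}:{interval.start}-{interval.end}"


# Made-up coordinates and feature name, for illustration only.
region = parse_bed_line("chr4\t1000\t2000\tfeature_x")
print(format_region(region, "GRCh37"))
```

Keeping the assembly name attached to every coordinate makes it obvious when a position needs converting before it can be compared with data from another build.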
Coordinates • New concept of PATCH • This is an assembly update without changing the primary sequence • However, additional 'improved' contigs map to the reference • These will be in the next assembly: you may wish to use them • Genome assembly names can differ by institution but are the same underlying sequence: • GenBank/UCSC • DEMO liftOver
Protein Domains • InterPro • Site that stores information on known protein domains from different projects • Covered by InterPro • Similarities between proteins • Conserved region in an alignment • Conserved protein folds • Not Covered by InterPro • Predicted features on primary protein sequence • Trans-membrane regions • Low complexity regions • Phosphorylation sites
Domain Complexity • Many different types of domains x Many different projects identifying them = Vast amounts of domain-based data
Old way of interacting with a database • Request information • Retrieve information • From a single source
DAS clients • Different types of software can have a DAS client built in • Genome Browsers: Ensembl, IGB, IGV, ... • Multiple Alignment editors: Jalview, STRAP • 3D Structures: Spice • 3D electron microscopy data: PeppeR • Demo
Annotation • Problem: Many ways to name a gene • Reductase = oxidase = dehydrogenase • Gene Ontology Consortium [GO] • GO terms standardise naming • Note that errors may still occur in the assignment of terms • Found in RefSeq, UniProt and most genome databases • GO browsers e.g. AmiGO
Gene Ontology • all [535063 gene products] • GO:0008150 : biological_process • [404412 gene products] • GO:0005575 : cellular_component • [372379 gene products] • GO:0003674 : molecular_function • [436597 gene products]
Evidence Codes • Experimental • EXP: Inferred from Experiment • IDA: Inferred from Direct Assay • IPI: Inferred from Physical Interaction • IMP: Inferred from Mutant Phenotype • IGI: Inferred from Genetic Interaction • IEP: Inferred from Expression Pattern • Computational • ISS: Inferred from Sequence or Structural Similarity • ISO: Inferred from Sequence Orthology • ISA: Inferred from Sequence Alignment • ISM: Inferred from Sequence Model • IGC: Inferred from Genomic Context • RCA: Inferred from Reviewed Computational Analysis • Author Statement • TAS: Traceable Author Statement • NAS: Non-traceable Author Statement • Curator Statement • IC: Inferred by Curator • ND: No biological Data available • Automatically assigned • IEA: Inferred from Electronic Annotation
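Evidence codes make it possible to filter annotations by how they were derived. The sketch below groups the codes listed on this slide and keeps only experimentally supported annotations. The annotation records are invented for illustration: the gene names are placeholders and the pairing of genes, GO terms and evidence codes does not come from any real dataset.

```python
# Grouping taken from the evidence code categories listed above.
EVIDENCE_GROUPS = {
    "experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "computational": {"ISS", "ISO", "ISA", "ISM", "IGC", "RCA"},
    "author statement": {"TAS", "NAS"},
    "curator statement": {"IC", "ND"},
    "automatic": {"IEA"},
}


def evidence_group(code: str) -> str:
    """Return the category of a GO evidence code, or 'unknown'."""
    for group, codes in EVIDENCE_GROUPS.items():
        if code.upper() in codes:
            return group
    return "unknown"


def experimental_only(annotations):
    """Keep (gene, go_id, evidence_code) records with experimental support."""
    return [a for a in annotations if evidence_group(a[2]) == "experimental"]


# Invented records: placeholder gene names paired with the GO root terms.
annotations = [
    ("geneA", "GO:0003674", "IDA"),
    ("geneB", "GO:0008150", "IEA"),
    ("geneC", "GO:0005575", "ISS"),
]
print(experimental_only(annotations))
```

Automatically assigned (IEA) annotations are often a large share of what a database returns, so a filter of this kind is a quick way to see how much of an annotation set rests on direct experimental evidence.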
Best annotation? • Use DAS clients to get more information on genomic, gene or protein features • Protein domains are especially useful • The Gene Ontology is useful for general classification • BUT be aware of where the annotation was derived from