
Terminology

An Introduction to Bioinformatics. CSE, Marmara University, mimoza.marmara.edu.tr/~m.sakalli/cse546, Oct/12/09. Source: http://bio.fsu.edu/~stevet/BSC5936/BioDataBases.ppt


  1. An Introduction to Bioinformatics. CSE, Marmara University, mimoza.marmara.edu.tr/~m.sakalli/cse546, Oct/12/09. Source: http://bio.fsu.edu/~stevet/BSC5936/BioDataBases.ppt

  2. Terminology • Bioinformatics: the use of computational techniques to access, analyze, and interpret biological information, including tool building. Biocomputing and computational biology are used as synonyms. • Sequence analysis is the study of molecular sequence data. • Genomics analyzes the context of genes or complete genomes. • Proteomics is the subdivision of genomics concerned with analyzing the protein complement, i.e. the proteome. • The Human Genome Project and numerous other genome projects keep the data coming at alarming rates. • Homo sapiens has about 3.2 billion base pairs. Early estimates of the number of genes were in the 100,000 range, but the count turns out to be only about twice that of a fruit fly: between 25,000 and 35,000. • The protein-coding region of the genome is only about 1% or so; much of the remainder is 'jumping' or 'selfish' DNA, some of which may be involved in regulation and control.

  3. Three major database groups, each with its own specific format. They are mirrored among each other and share accession codes, but NOT identifier names: • 1) National Center for Biotechnology Information (NCBI), part of the National Library of Medicine (NLM) at the NIH (GenBank and GenPept). • http://www.ncbi.nlm.nih.gov/ • http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html • Georgetown University's National Biomedical Research Foundation (NBRF) Protein Identification Resource (PIR), and the Naval Research Lab's NRL_3D database of sequences with known three-dimensional structure. • http://www-nbrf.georgetown.edu/ • http://www-nbrf.georgetown.edu/pirwww/dbinfo/nrl3d.html • 2) European Molecular Biology Laboratory (EMBL) • http://www.ebi.ac.uk/embl/index.html, http://www.embl-heidelberg.de/ • European Bioinformatics Institute (EBI) • http://www.ebi.ac.uk/ • Swiss Institute of Bioinformatics (SIB), Expert Protein Analysis System (ExPASy) • http://www.expasy.ch/, http://www.expasy.org/links.html • Nucleotide sequence database and amino acid sequence databases • http://expasy.cbr.nrc.ca/sprot/ • 3) The National Institute of Genetics, DNA Data Bank of Japan (DDBJ) • http://www.ddbj.nig.ac.jp/

  4. Atlas of Protein Sequence and Structure: the first well-recognized protein sequence database, compiled in the mid-sixties by Dr. Margaret Dayhoff. • EMBL began in 1980, GenBank in 1982, and DDBJ in 1984. All are attempts at establishing an organized, reliable, comprehensive, and openly available library of genetic sequences. • Each program needs to recognize particular aspects of the sequence files, and making programs flexible enough to cope is a headache. NCBI's ASN.1 format and its Entrez interface attempt to reduce these problems. • Unfortunately, sequence formats have no standards process comparable to IEEE working groups or the Internet Engineering Task Force's RFCs, so format issues are the most confusing and troubling aspect of working with primary sequence data. • Sequence database installations are commonly a complex ASCII/binary mix, neither relational nor object-oriented, and often proprietary. • They contain several very long text files, each holding different types of information, all related to particular sequences. • Software is usually required to interact with these databases, for example Don Gilbert's ReadSeq, a reformatting program for DNA and protein sequences that accepts single or multiple inputs in 18 different formats and converts them to a specified format.
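To give a feel for what a reformatting step involves, here is a minimal Python sketch that turns GenBank-style numbered sequence lines into FASTA text. It is not ReadSeq's code and covers only one conversion; the function name `origin_to_fasta`, its parameters, and the line width of 50 are assumptions made for illustration.

```python
import re


def origin_to_fasta(name, origin_lines, width=50):
    """Convert GenBank-style numbered sequence lines to a FASTA string.

    `name` becomes the FASTA header; `origin_lines` are lines such as
    '        1 acgggtttgc cgccagaaca'. Hypothetical helper, for illustration.
    """
    # Strip position numbers and spaces, keeping only the sequence letters.
    seq = "".join(re.sub(r"[^A-Za-z]", "", line) for line in origin_lines)
    # Re-wrap the sequence at a fixed width, as FASTA files usually are.
    body = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(body) + "\n"


if __name__ == "__main__":
    lines = [
        "        1 acgggtttgc cgccagaaca",
        "       21 caggtgtcgt",
    ]
    print(origin_to_fasta("example", lines, width=20), end="")
```

A real converter also has to carry over the identifier, description, and annotation fields, which is where most of the format trouble described above comes from.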

  5. http://www.molecularevolution.org/ • AWTY (Are We There Yet?) is a system for graphically exploring convergence of Markov Chain Monte Carlo (MCMC) chains in Bayesian phylogenetic inference (Nylander et al. 2008). • FigTree is used to view phylogenetic trees graphically. • Clustal W (Thompson et al. 1994) performs global multiple sequence alignment. It aligns DNA or amino acid sequences using a progressive alignment algorithm with affine gap penalties and a guide tree based on sequence similarity. The affine gap cost model penalizes insertions and deletions using a linear function in which one term is length independent and the other is length dependent: Gap penalty = Gapopen + Len * Gapextend. Reviews comparing multiple alignment algorithms include Hickson et al. 2000, Thompson et al. 1999, and McClure et al. 1994. Morrison and Ellis (1997) discuss the effects of nucleotide sequence alignment on the estimation of phylogenetic hypotheses. The current version is Clustal W2 (Larkin et al. 2007). The program is also available with a graphical user interface, Clustal X. • BEAST (Bayesian Evolutionary Analysis Sampling Trees), with its companion BEAUti, is for evolutionary inference from molecular sequences; it is by Andrew Rambaut and Alexei Drummond (Drummond et al. 2002; 2005; 2006). • FASTA compares pairs of protein or DNA sequences, and also compares a single protein or DNA sequence to a database or library. It is fast and is available locally or as a remote service. • GARLI (Genetic Algorithm for Rapid Likelihood Inference) performs phylogenetic searches on aligned nucleotide datasets using the maximum likelihood criterion. • MAFFT uses the fast Fourier transform (FFT) to optimize protein alignments based on physical properties of the amino acids (Katoh et al., 2002; 2005). The program uses progressive alignment followed by refinement, also known as iterative alignment.
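The affine gap cost quoted on this slide, Gap penalty = Gapopen + Len * Gapextend, can be written out directly. The sketch below is only an illustration of that formula: the function name and the numeric values (10.0 to open, 0.5 to extend) are assumptions, not the defaults of Clustal W or any other program.

```python
def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Return gap_open + length * gap_extend for a gap of `length` residues.

    The default costs are illustrative values, not those of a real aligner.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0.0  # no gap, so no penalty
    # Length-independent term plus length-dependent term.
    return gap_open + length * gap_extend


if __name__ == "__main__":
    # Six gapped positions as one gap versus three separate gaps of two.
    print(affine_gap_penalty(6))      # one opening cost
    print(3 * affine_gap_penalty(2))  # three opening costs
```

Because the opening cost is paid once per gap, a single long gap scores better than several short gaps covering the same number of positions, which is the behaviour the affine model is meant to produce.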

  6. All sequence databases contain (in their own format): • Name (Genetic identifiers): LOCUS, ENTRY, ID • Definition: A brief, one-line, textual sequence description. • Accession Number: A constant data identifier. • Source and classification (taxonomy) information. • Complete literature references. • Comments and keywords. • The all important FEATURE table! • A summary or checksum line. • The sequence itself.
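The shared content listed on this slide can be pictured as one record type that every format fills in its own way. The Python sketch below is an assumption-laden illustration: the class name `SequenceRecord` and its field names are invented here and do not belong to any database or library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SequenceRecord:
    """Hypothetical container for the fields common to sequence databases."""
    name: str                      # LOCUS / ENTRY / ID
    definition: str                # one-line description
    accession: str                 # constant data identifier
    source: str = ""               # organism
    taxonomy: List[str] = field(default_factory=list)
    references: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)
    features: List[str] = field(default_factory=list)  # FEATURE table rows
    sequence: str = ""

    def length(self) -> int:
        """Number of residues in the stored sequence."""
        return len(self.sequence)


if __name__ == "__main__":
    rec = SequenceRecord(
        name="HSEF1AR",
        definition="Human mRNA for elongation factor 1 alpha subunit (EF-1 alpha).",
        accession="X03558",
        source="Homo sapiens",
        sequence="acgggtttgc",  # first ten bases only, for brevity
    )
    print(rec.name, rec.accession, rec.length())
```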

  7. GenBank and GenPept format
     LOCUS       HSEF1AR 1506 bp mRNA linear PRI 12-SEP-1993
     DEFINITION  Human mRNA for elongation factor 1 alpha subunit (EF-1 alpha).
     ACCESSION   X03558
     VERSION     X03558.1 GI:31097
     KEYWORDS    elongation factor; elongation factor 1.
     SOURCE      human.
       ORGANISM  Homo sapiens
                 Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
                 Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
     REFERENCE   1 (bases 1 to 1506)
       AUTHORS   Brands,J.H., Maassen,J.A., van Hemert,F.J., Amons,R. and Moller,W.
       TITLE     The primary structure of the alpha subunit of human elongation……
       JOURNAL   Eur. J. Biochem. 155 (1), 167-171 (1986)
       MEDLINE   86136120
     FEATURES    Location/Qualifiers
       source    1..1506
                 /organism="Homo sapiens"
                 /db_xref="taxon:9606"
       CDS       54..1442
                 /note="EF-1 alpha (aa 1-463)"
                 /codon_start=1
                 /protein_id="CAA27245.1"
                 /db_xref="GI:31098"
                 /db_xref="SWISS-PROT:P04720"
                 /translation="MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEK
                 EAAEMGKGSFKYAWVLDKLKAERERGITIDISLWKFETSKYYVTIIDAPGHRDFIKNM
                 ……VTKSAQKAQKAK"
     BASE COUNT  412 a 337 c 387 g 370 t
     ORIGIN
            1 acgggtttgc cgccagaaca caggtgtcgt gaaaactacc cctaaaagcc aaaatgggaa
           61 aggaaaagac tcatatcaac attgtcgtca ttggacacgt agattcgggc aagtccacca……….
         1501 aactgt
     //
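As a small illustration of reading this format, the Python sketch below pulls the fields out of the LOCUS line and checks that the BASE COUNT totals match the stated length. The function names are hypothetical, and the field positions are assumed from this one example record rather than taken from the GenBank specification.

```python
def parse_locus(line):
    """Split a LOCUS line laid out like the example record into named fields."""
    parts = line.split()
    # parts[0] is the keyword 'LOCUS' and parts[3] is the unit 'bp'.
    return {
        "name": parts[1],
        "length": int(parts[2]),
        "molecule": parts[4],
        "topology": parts[5],
        "division": parts[6],
        "date": parts[7],
    }


def parse_base_count(line):
    """Turn 'BASE COUNT 412 a 337 c ...' into {'a': 412, 'c': 337, ...}."""
    tokens = line.split()[2:]  # drop 'BASE' and 'COUNT'
    return {base: int(n) for n, base in zip(tokens[0::2], tokens[1::2])}


if __name__ == "__main__":
    locus = parse_locus("LOCUS HSEF1AR 1506 bp mRNA linear PRI 12-SEP-1993")
    counts = parse_base_count("BASE COUNT 412 a 337 c 387 g 370 t")
    # The four base counts should add up to the length on the LOCUS line.
    print(locus["length"], sum(counts.values()))
```

For the record shown, 412 + 337 + 387 + 370 = 1506, which agrees with the 1506 bp on the LOCUS line.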

  8. EMBL and SWISS-PROT
     ID   EF11_HUMAN STANDARD; PRT; 462 AA.
     AC   P04720; P04719;
     DT   13-AUG-1987 (Rel. 05, Created)……
     DE   Elongation factor 1-alpha 1 (EF-1-alpha-1) (Elongation factor 1 A-1)
     DE   (eEF1A-1) (Elongation factor Tu) (EF-Tu).
     GN   EEF1A1 OR EEF1A OR EF1A.
     OS   Homo sapiens (Human),
     OS   Bos taurus (Bovine), and
     OS   Oryctolagus cuniculus (Rabbit).
     OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
     OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
     OX   NCBI_TaxID=9606, 9913, 9986;
     RN   [1]
     RP   SEQUENCE FROM N.A.
     RC   SPECIES=Human;
     RX   MEDLINE=86136120; PubMed=3512269;
     RA   Brands J.H.G.M., Maassen J.A., van Hemert F.J., Amons R., Moeller W.;
     RT   "The primary structure of the alpha subunit of human elongation …. -binding sites.";
     RL   Eur. J. Biochem. 155:167-171(1986).……
     CC   -!- FUNCTION: THIS PROTEIN PROMOTES THE GTP-DEPENDENT BINDING OF
     CC       AMINOACYL-TRNA TO THE A-SITE OF RIBOSOMES DURING PROTEIN
     CC       BIOSYNTHESIS.
     CC   -!- SUBCELLULAR LOCATION: Cytoplasmic.
     CC   -!- TISSUE SPECIFICITY: BRAIN, PLACENTA, LUNG, LIVER, KIDNEY,
     CC       PANCREAS BUT BARELY DETECTABLE IN HEART AND SKELETAL MUSCLE.
     CC   -!- SIMILARITY: BELONGS TO THE GTP-BINDING ELONGATION FACTOR FAMILY.
     CC       EF-TU/EF-1A SUBFAMILY……
     DR   EMBL; X03558; CAA27245.1; -……
     DR   PIR; S18054; EFRB1……
     DR   HSSP; Q01698; 1TUI……
     DR   InterPro; IPR004160; GTP_EFTU_D3.
     DR   Pfam; PF00009; GTP_EFTU; 1……
     DR   PROSITE; PS00301; EFACTOR_GTP; 1.
     KW   Elongation factor; Protein biosynthesis; GTP-binding; Methylation;
     KW   Multigene family.
     FT   NP_BIND 14 21 GTP (BY SIMILARITY).
     FT   NP_BIND 91 95 GTP (BY SIMILARITY).
     FT   NP_BIND 153 156 GTP (BY SIMILARITY).
     FT   MOD_RES 36 36 METHYLATION (TRI-).
     FT   MOD_RES 55 55 METHYLATION (DI-).
     FT   MOD_RES 79 79 METHYLATION (TRI-).
     FT   MOD_RES 165 165 METHYLATION (DI-).
     FT   MOD_RES 318 318 METHYLATION (TRI-).
     FT   BINDING 301 301 ETHANOLAMINE-PHOSPHOGLYCEROL.
     FT   BINDING 374 374 ETHANOLAMINE-PHOSPHOGLYCEROL.
     FT   CONFLICT 83 83 S -> A (IN REF. 2).
     FT   CONFLICT 232 232 L -> V (IN REF. 3).
     SQ   SEQUENCE 462 AA; 50141 MW; D465615545AF686A CRC64;
          MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKEAAEMG KGSFKYAWVL
          DKLKAERERG …… VTKSAQKAQK AK
     //
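Every line of an EMBL or SWISS-PROT entry begins with a two-letter line code (ID, AC, DE, and so on), which makes a first-pass reader easy to sketch. The Python below is a minimal illustration under that assumption: `group_by_line_code` and `accessions` are hypothetical helper names, and continuation or sequence lines that start with whitespace are simply skipped.

```python
from collections import defaultdict


def group_by_line_code(lines):
    """Group EMBL/SWISS-PROT style lines by their two-letter line code."""
    fields = defaultdict(list)
    for line in lines:
        if line.startswith("//"):  # end-of-record marker
            break
        code, _, rest = line.partition(" ")
        # Lines that do not start with a two-letter code are ignored here.
        if len(code) == 2 and rest:
            fields[code].append(rest.strip())
    return dict(fields)


def accessions(fields):
    """Return the accession numbers from all AC lines as a flat list."""
    found = []
    for ac_line in fields.get("AC", []):
        found.extend(a.strip() for a in ac_line.split(";") if a.strip())
    return found


if __name__ == "__main__":
    entry = [
        "ID   EF11_HUMAN STANDARD; PRT; 462 AA.",
        "AC   P04720; P04719;",
        "DE   Elongation factor 1-alpha 1 (EF-1-alpha-1) (Elongation factor 1 A-1)",
        "DE   (eEF1A-1) (Elongation factor Tu) (EF-Tu).",
        "//",
    ]
    grouped = group_by_line_code(entry)
    print(accessions(grouped), len(grouped["DE"]))
```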

  9. PIR/NBRF format
     ENTRY EFHU1 #type complete iProClass View of EFHU1
     TITLE translation elongation factor eEF-1 alpha-1 chain - human
     ALTERNATE_NAMES translation elongation factor Tu
     ORGANISM #formal_name Homo sapiens #common_name man
        #cross-references taxon:9606
     DATE 30-Jun-1988 #sequence_revision 05-Apr-1995 #text_change…..
     ACCESSIONS B24977; A25409; A29946; A32863; I37339
     REFERENCE A93610
        #authors Rao, T.R.; Slobin, L.I.
        #journal Nucleic Acids Res. (1986) 14:2409
        #title Structure of the amino-terminal end of mammalian elongation…
        #accession B24977
        ##molecule_type mRNA
        ##residues 1-82,'A',84-94 ##label RAO
        ##cross-references EMBL:X03689; NID:g31109; PIDN:CAA27325.1;
           PID:g31110…….
     GENETICS
        #gene GDB:EEF1A1; EEF1A; EF1A
        ##cross-references GDB:118791; OMIM:130590
        #map_position 6q14-6q14
        #introns 48/3; 108/3; 207/3; 258/1; 343/3; 422/1
     CLASSIFICATION SF003007
        #superfamily translation elongation factor Tu; translation elongation
           factor Tu homology
     KEYWORDS GTP binding; methylated amino acid; nucleotide binding;
        P-loop; phosphoprotein; protein biosynthesis; RNA binding
     FEATURE
        1-223 #domain eEF-1 alpha domain I, GTP-binding #status
           predicted #label EF1\
        8-156 #domain translation elongation factor Tu homology
           #label ETU\
        14-21 #region nucleotide-binding motif A (P-loop)\
        153-156 #region GTP-binding NKXD motif\
        245-330 #domain eEF-1 alpha domain II, tRNA-binding
           #status predicted #label EF2\
        332-462 #domain eEF-1 alpha domain III, tRNA-binding
           #status predicted #label EF3\
        36,55,79,165,318 #modified_site N6,N6,N6-trimethyllysine (Lys)
           #status predicted\
        301,374 #binding_site glycerylphosphorylethanolamine
           (Glu) (covalent) #status predicted
     SUMMARY #length 462 #molecular_weight 50141
     SEQUENCE
                 5         10        15        20        25        30
         1 M G K E K T H I N I V V I G H V D S G K S T T T G H L I Y K
        31 C G G I D K R T I E K F E K E A A E M G K G S F K Y A W V L
        61 D K L K A E R E R …... Q K A Q K A K

  10. Examples of databases with specialized types of sequences • Almost all the links: the Ensembl human genome project at http://www.ensembl.org/ • Patterns, motifs, and profiles: REBASE, EPD, PROSITE. • Aligned multiple sequence entries: RDP and ALN. • Functionally, structurally, or phylogenetically ordered: iProClass and the HOVERGEN vertebrate gene database. • HIV Database and the Giardia lamblia Genome Project. • 3D structure: atomic coordinate data is necessary to define the tertiary shape of a particular biological molecule. Protein Data Bank and Rutgers Nucleic Acid Database. • MolBio: molecular visualization with special software. • Genomic linkage mapping databases for H. sapiens, Mus, Drosophila, C. elegans, Saccharomyces, Arabidopsis, E. coli. • OMIM: Online Mendelian Inheritance in Man. • Phylogenetic tree databases, e.g. the Tree of Life. • Metabolic pathway databases, e.g. WIT (What Is There) and Japan's GenomeNet KEGG (the Kyoto Encyclopedia of Genes and Genomes). • Check the links given below.

  11. There's a bewildering assortment of different databases and ways to access and manipulate the information within them. The key is to learn how to use that information in the most efficient manner. • For example: given a novel genome sequence, find all genes and p-genes. • I want to design "sequence capture" probes for the exons of 40 genes that cause RP. • Obtain the exonic sequence, with at least 100 nt of flanking sequence, and 1000 nt of the promoter upstream of the transcription start. • I propose a new way to find disease-causing mutations in humans. I want to look only in genes that have regions that 1) are highly conserved across species, 2) have known functional protein domains (e.g. transmembrane domains), and 3) have mRNA secondary structure. Is this a good idea? • Charles Darwin's The Origin of Species (1859) • Basic Mendelian genetics • Mendel's laws • independent assortment • independent segregation • mitosis and meiosis • dominant/recessive and pedigrees (the graphs of phenotype) • alleles • Basic molecular genetics • DNA • RNA • proteins • Central Dogma • genes and gene structure • cells and chromosomes • Principles of Genetics, Tamarin

  12. Pearson FastA format and GCG single sequence format
     >EFHU1 PIR1 release 71.01
     MGKEKTHINIVVIGHVDSGKSTTTGHLIYKCGGIDKRTIEKFEKEAAEMG
     KGSFKYAWVLDKLKAERERGITIDISLWKFETSKYYVTIIDAPGHRDFIK
     NMITGTSQADCAVLIVAAGVGEFEAGISKNGQTREHALLAYTLGVKQLIV
     GVNKMDSTEPPYSQKRYEEIVKEVSTYIKKIGYNPDTVAFVPISGWNGDN
     MLEPSANMPWFKGWKVTRKDGNASGTTLLEALDCILPPTRPTDKPLRLPL
     QDVYKIGGIGTVPVGRVETGVLKPGMVVTFAPVNVTTEVKSVEMHHEALS
     EALPGDNVGFNVKNVSVKDVRRGNVAGDSKNDPPMEAAGFTAQVIILNHP
     GQISAGYAPVLDCHTAHIACKFAELKEKIDRRSGKKLEDGPKFLKSGDAA
     IVDMVPGKPMCVESFSDYPPLGRFAVRDMRQTVAVGVIKAVDKKAAGAGK
     VTKSAQKAQKAK

     !!AA_SEQUENCE 1.0
     P1;EFHU1 - translation elongation factor eEF-1 alpha-1 chain - human
     N;Alternate names: translation elongation factor Tu……
     F;1-223/Domain: eEF-1 alpha domain I, GTP-binding #status predicted <EF1>
     F;8-156/Domain: translation elongation factor Tu homology <ETU>
     F;14-21/Region: nucleotide-binding motif A (P-loop)
     F;153-156/Region: GTP-binding NKXD motif
     F;245-330/Domain: eEF-1 alpha domain II, tRNA-binding #status predicted <EF2>
     F;332-462/Domain: eEF-1 alpha domain III, tRNA-binding #status predicted <EF3>
     F;36,55,79,165,318/Modified site: N6,N6,N6-trimethyllysine (Lys) #status predicted
     F;301,374/Binding site: glycerylphosphorylethanolamine (Glu) (covalent) #status predicted
     EFHU1 Length: 462 January 14, 2002 19:49 Type: P Check: 5308 ..
       1 MGKEKTHINI VVIGHVDSGK STTTGHLIYK CGGIDKRTIE KFEKE……
     351 GQISAGYAPV LDCHTAHIAC KFAELKEKID RRSGKKLEDG PKFLKSGDAA
     401 IVDMVPGKPM CVESFSDYPP LGRFAVRDMR QTVAVGVIKA VDKKAAGAGK
     451 VTKSAQKAQK AK
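The FastA layout is simple enough that a reader fits in a few lines: a line starting with ">" opens a record, and every following line up to the next ">" is sequence. The Python sketch below illustrates that rule only; the function name `read_fasta` is an assumption, and the sample input is shortened rather than the full 462-residue entry.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    sample = ">EFHU1 PIR1 release 71.01\nMGKEKTHINI\nVVIGHVDSGK\n"
    for name, seq in read_fasta(sample):
        print(name, len(seq))
```

Joining the wrapped lines back into one string is what lets the same sequence be re-wrapped or rewritten in another format.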
