BioInformatics Consultation Practice 1 Gábor Pauler, Ph.D. Tax.reg.no: 63673852-3-22

BioInformatics Consultation Practice 1 Gábor Pauler, Ph.D. Tax.reg.no: 63673852-3-22 Bank account: 50400113-11065546 Location: 1st Széchenyi str., 7666 Pogány, Hungary Tel: +36-309-015-488 E-mail: pauler@t-online.hu

Content of the Practice
• Sequence Databases
  • Basic terms and types
  • File formats
    • Nucleotide / amino acid code tables
    • FASTA
    • EMBL
      • Content
      • Feature table
      • Sequence description
  • Relational databases
• Text-based search
• Sequence searches
• Home Assignment 1: Sequence search
• References

Sequence Databases: Basic terms and types 1
WARNING: In our discussion we assume that the basic terms of molecular genetics are known. If not, please check out: BioinfoNotes
• Definition:
  • They provide a complex description of nucleotide / protein sequences, with all auxiliary information attached
• Data sources:
  • Primary (Elsődleges):
    • RNA/DNA/protein sequence samples sequenced and submitted by researchers
      • They contain 1-2 genes on average (1.5-3 kbases)
    • Genome projects (Genomprojektek): human, mouse, etc.:
      • They prepare a complete gene map with the Chromosome walking (Kromoszómán lépkedés) method
      • They try to identify the following types of „navigation signals” in the genome:
        • Expressed Sequence Tags, EST (Expresszálódó szekvencia jelek)
        • Binding sequences of protein factors controlling gene expression
        • Genomic Survey Sequences, GSS (Genomikus rész-szekvenciák)
        • Mobile genetic elements causing mutation:
          • Transposable DNA, P-elements, etc.
        • Moreover, all of their mutated forms
    • Environmental DNA samples
      • DNA fragments collected from ocean water: they allow research on the genetic material of microorganisms that cannot be cultured in the laboratory (e.g. sulphur-consuming bacteria at volcanic vents on the sea bottom)
  • Secondary (Másodlagos):
    • Protein sequences translated from cDNA
    • Nucleotide sequences back-translated from the amino acid sequence of a protein, using the most frequent codon usage (Kodon használat) of amino acids in the given organism
• Distribution of data sources by species: [figure]

Sequence Databases: Basic terms and types 2
• Sequence databases by content checking:
  • Redundant (Redundáns)
    • It can contain repeated sequences, as submissions are not checked
    • Therefore, it can grow faster
  • Non-Redundant, NR (Nem redundáns)
    • A sequence matching test at submission ensures that a sequence is stored only once, but therefore it grows more slowly
• Types of sequence databases and their history:
  • Major nucleotide databases:
    • 1980 EBI Heidelberg - Hinxton, UK: EMBL (http://www.ebi.ac.uk/embl/)
    • 1982 NCBI Bethesda, USA: GenBank (http://www.ncbi.nlm.nih.gov/genbank/)
    • 1980s CIB Japan: DDBJ (http://www.ddbj.nig.ac.jp/searches-e.html)
    • They are cross-synchronized
    • Their size doubles approximately every two years (Dec 2009: 173M records containing 271 Gbases)
    • Their Primary key (Elsődleges kulcs) is the Accession Number, AC (Elérési szám): a unique, compulsory ID by which sequences can be referenced
  • Major protein databases:
    • 1986 SIB Switzerland: Swiss-PROT (http://www.ebi.ac.uk/uniprot/)
      • It is strictly supervised by curators; sequence submission is not automatic
      • So the original database was relatively small (600K basic sequences) but quite reliable
      • It contains primary sequences and also cDNA-translated ones
      • It usually also contains the following auxiliary data:
        • Function of the protein
        • Secondary and tertiary structure
        • Remarkable protein domains (Domének)
        • Homologous (Homológ) parts: they typically share the same function
    • TrEMBL (http://www.ebi.ac.uk/uniprot/): Proteins translated automatically from cDNA stored in EMBL. It is not supervised, so it is less reliable and often lacks auxiliary explanation, but it grows faster than Swiss-PROT
    • UniProt (http://www.ebi.ac.uk/uniprot/): Unification of Swiss-PROT and TrEMBL

Sequence Databases: File formats: Nucleotide / amino acid code tables
• In all formats, nucleotide (Nukleotid) sequences are described by a series of 1-char codes:
  • There are 2 standards, GCG (more frequent) and Sanger (almost extinct): they denote the 4 basic nucleotides identically (A, G, C, T/U) but differ in how they denote uncertainly sequenced nucleotide positions (Pozíció)
• Amino acid (Aminosav) sequences can be described by a series of 1-char or 3-char codes:
  • 1-char is used more frequently, as it is shorter
  • 3-char is longer but has the same length as the coding nucleotide sequence, as a Triplet (Triplet) of 3 nucleotides, called a Codon (Kodon), codes 1 amino acid
• Codons can also be described in masked format, using GCG-coded wildcards
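The masked codon notation can be illustrated with a minimal Python sketch. It assumes that the wildcard letters follow the IUPAC-style ambiguity codes (for example N = any base, R = A or G); the slide's own code table is not reproduced here, so the `AMBIGUITY` table and both function names are illustrative assumptions, not part of the original material.

```python
import re

# Assumed IUPAC-style ambiguity letters: each code stands for a set of bases.
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def masked_codon_regex(masked):
    """Turn a masked codon such as 'GCN' into a compiled regular expression."""
    classes = ["[" + AMBIGUITY[ch] + "]" for ch in masked.upper()]
    return re.compile("".join(classes))

def codon_matches(masked, codon):
    """True if the concrete codon is covered by the masked codon (U is read as T)."""
    concrete = codon.upper().replace("U", "T")
    return masked_codon_regex(masked).fullmatch(concrete) is not None

# 'GCN' covers the four alanine codons of the standard code, 'AAR' the two lysine codons.
print(codon_matches("GCN", "GCA"))
print(codon_matches("AAR", "AAC"))
```

Reading U as T lets the same masked codon be tested against both DNA and RNA spellings of a triplet.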

Sequence Databases: File formats: FASTA, EMBL, GenBank, Relational
• FASTA format: a simple text file format describing nucleotide/protein sequences:
  • It starts with a „>” character and the description of the sequence
  • On the next line the sequence starts in 1-char codes, and it continues without numbers, spaces or any other annotation until the next „>” line or the end of the file (many tools wrap the sequence into lines of fixed width)
  • It does not contain any auxiliary info
  • It is a common interchange format among the more complex formats and is used for I/O of bioinfo software
  • Example:
      >SequenceName,SeqID,CommentText
      ggatccagtg…
• Formats of complex sequence databases (EMBL, GenBank):
  • Historically, these contained multi-line text files as „records”, where:
    • The number and length of lines are not fixed (a line can be ≤80 characters),
    • The function of a line is denoted by 2 prefix characters, e.g.: ID-internal ID, AC-accession#
    • They give a complex description, not just the sequence, with much auxiliary info
    • They can be read and edited with any word processor
    • However, in their original format they reflect the technical level of mid-1980s mainframe systems and DID NOT FORM A DATABASE in modern terms:
      • As a record can have a variable number of lines, and the lines have variable length, the records can be searched only sequentially (Szekvenciálisan): you have to read through the whole, lengthy record to capture any detail. Very slow!
      • The description of auxiliary data was sometimes not fully standardized
  • Example:
      ID ATCSCH42 standard;
      AC X51799;
      FT Exon 2770..2899;
      FT Intron 2900..2980;
      SQ Sequence 6801 BP;
      ggatccagtg…
• Modern databases: Therefore, modern sequence database servers use these text file records ONLY AT I/O, and data is stored in a Relational database (Relációs adatbázis):
  • The non-fixed-length data structure of the records (e.g. one sequence can have many genes, which can have many introns/exons: internal 1:many relationships)
  • is broken up into several database tables (Adatbázis tábla), each consisting of a fixed number of data fields (Mező) of given types (Text, Number, Date, etc.) and an unlimited number of records (Rekord), which can be read, written and searched very fast (0.5-1M recs/sec)
  • Tables are connected by 1:many relations (Reláció): foreign key (Idegen kulcs) fields reference the primary key field of another table, which describes the original non-fixed-length data structure
  • Tables and fields of the example design:
      Sequence: SeqID, Accession, Name, Author, Title, Sequence, Length, Submitted, Organism
      Gene: GeneID, Name, Position, Length, Discovered, Organism
      GeneLocation: LocationID, Comment, PositionFrom, PositionTo, SeqID, GeneID
      IntrExon: IntrExonID, PositionFrom, PositionTo, LocationID
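How such a record breaks up into related tables can be sketched with Python's built-in `sqlite3` module. This is only an illustration under assumptions: the tables are trimmed to a few of the fields named on the slide, the `Kind` column of `IntrExon` is a hypothetical addition to tell exons from introns, and the helper names `build_demo_db` and `parts_of` are made up for this example. The coordinates come from the X51799 record quoted in these slides.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with a trimmed version of the slide's tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
    conn.executescript("""
    CREATE TABLE Sequence (SeqID INTEGER PRIMARY KEY, Accession TEXT UNIQUE NOT NULL, Length INTEGER);
    CREATE TABLE Gene (GeneID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE GeneLocation (
        LocationID INTEGER PRIMARY KEY, PositionFrom INTEGER, PositionTo INTEGER,
        SeqID INTEGER NOT NULL REFERENCES Sequence(SeqID),
        GeneID INTEGER NOT NULL REFERENCES Gene(GeneID));
    CREATE TABLE IntrExon (
        IntrExonID INTEGER PRIMARY KEY, Kind TEXT,  -- Kind is a hypothetical extra column
        PositionFrom INTEGER, PositionTo INTEGER,
        LocationID INTEGER NOT NULL REFERENCES GeneLocation(LocationID));
    """)
    conn.execute("INSERT INTO Sequence VALUES (1, 'X51799', 6801)")
    conn.execute("INSERT INTO Gene VALUES (1, 'cs/ch-42')")
    conn.execute("INSERT INTO GeneLocation VALUES (1, 2770, 4382, 1, 1)")
    conn.executemany("INSERT INTO IntrExon VALUES (?, ?, ?, ?, ?)", [
        (1, "exon", 2770, 2899, 1),
        (2, "intron", 2900, 2980, 1),
        (3, "exon", 2981, 3095, 1),
    ])
    return conn

def parts_of(conn, accession):
    """Follow the 1:many relations from a sequence down to its exons and introns."""
    rows = conn.execute("""
        SELECT ie.Kind, ie.PositionFrom, ie.PositionTo
        FROM Sequence s
        JOIN GeneLocation gl ON gl.SeqID = s.SeqID
        JOIN IntrExon ie ON ie.LocationID = gl.LocationID
        WHERE s.Accession = ?
        ORDER BY ie.PositionFrom""", (accession,))
    return rows.fetchall()

print(parts_of(build_demo_db(), "X51799"))
```

The query reaches the exon and intron rows through the foreign keys alone, so no record has to be read line by line as in the text file format.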

Sequence Databases: File formats: EMBL content 1
First, let's see the text file record of EMBL (GenBank differs slightly but follows the same logic):
    ID ATCSCH42 standard; DNA; PLN; 6801 BP.
    XX
    AC X51799;
    XX
    SV X51799.1
    XX
    DT 16-MAR-1990 (Rel. 23, Created)
    DT 11-MAR-1999 (Rel. 59, Last updated, Version 3)
    XX
    DE Arabidopsis thaliana cs/ch-42 gene for a chloroplast
    XX
    KW chlorata locus; chloroplast protein; unidentified reading
    XX
    OS Arabidopsis thaliana (thale cress)
    OC Eukaryota; Viridiplantae; Streptophyta; Embryophyta;
    OC euphyllophytes; Spermatophyta; Magnoliophyta;
    OC core eudicots; Rosidae; eurosids II; Brassicales;
    OC Arabidopsis.
    XX
    RN [1]
    RP 1-6801
    RA Mayerhofer R.;
    RT ;
    RL Submitted (06-FEB-1990) to the EMBL/GenBank
    RL Mayerhofer R., MPI fuer Zuechtungsforschung,
    RL Koeln 30, F R G.
    XX
    RN [2]
    RP 1-6801
    RA Koncz C., Mayerhofer R., Koncz-Kalman Z., Nawrath C.,.
    RT Isolation of a gene encoding a novel chloroplast
    RT in Arabidopsis thaliana;
    RL EMBO J. 9:1337-1347(1990).
    XX
    DR AGIS; X51799; 17-SEP-1999.
    DR MENDEL; 12580; Arath;1780;12580.
    DR SWISS-PROT; P16127; CHLI_ARATH.
    DR SWISS-PROT; P16128; YCCH_ARATH.
• Annotations part: Record identification info:
  • ID Primary key (Elsődleges kulcs) in the given database
  • AC Accession number (Elérési szám): an ID synchronized across several databases (EMBL, GenBank, DDBJ); one sequence can have more than one AC
  • SV Sequence version (Szekvencia-változat)
  • DT Date (Dátum) of creation / last modification
  • DE Description (A szekvencia rövid leírása)
  • KW Keyword (Kulcsszó), there can be several
  • OS Organism species (Szekvenc.forrás-faj)
  • OC Organism classification (Szekvencia forrás-taxonómiai besorolás)
  • OG Organelle (Forrás-szervezet)
• References part: This describes in which publications the sequence was published (it can also contain hyperlinks):
  • RN Reference number (Száma),
  • RC Reference comment (Megjegyzés),
  • RP Reference positions (Oldalszám),
  • RX Reference cross-reference (Eredeti v. kereszthivatkozás),
  • RA Reference authors (Szerzők),
  • RT Reference title (Cím),
  • RL Reference location (Folyóirat),
  • DR Database cross-reference (Adatbázis),
  • CC Comments (Általános megjegyzés).
• Empty row for spacing the file: XX
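Because the function of every line is given by its 2-character prefix, the annotation part can be read with a few lines of Python. The sketch below is a simplified reader, not a full EMBL parser: it assumes that the line code sits in the first two columns and that everything after it is the value, and the function names are illustrative only.

```python
from collections import defaultdict

SAMPLE = """\
ID   ATCSCH42 standard; DNA; PLN; 6801 BP.
XX
AC   X51799;
XX
SV   X51799.1
XX
DT   16-MAR-1990 (Rel. 23, Created)
DT   11-MAR-1999 (Rel. 59, Last updated, Version 3)
XX
OS   Arabidopsis thaliana (thale cress)
"""

def parse_annotations(text):
    """Group the lines of one record by their 2-character line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, value = line[:2], line[2:].strip()
        if code == "XX" or not value:  # XX rows only space the file
            continue
        fields[code].append(value)
    return dict(fields)

def accession_numbers(fields):
    """Split the AC line(s) into individual accession numbers."""
    return [ac.strip() for line in fields.get("AC", []) for ac in line.split(";") if ac.strip()]

record = parse_annotations(SAMPLE)
print(accession_numbers(record))
print(record["DT"])
```

Codes that may repeat (DT, OC, RL and so on) simply accumulate in a list, which mirrors the variable number of lines per record.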

Sequence Databases: File formats: EMBL content: Feature table
    FH Key Location/Qualifiers
    FH
    FT source 1..6801
    FT   /chromosome=4
    FT   /db_xref=taxon:3702
    FT   /organism=Arabidopsis thaliana
    FT   /strain=columbia
    FT   /map=39.4
    FT CDS complement(<1..872)
    FT   /db_xref=MENDEL:12580
    FT   /db_xref=SWISS-PROT:P16128
    FT   /note=ORF (291 AA)
    FT   /protein_id=CAA36096.1
    FT   /translation=MLCFSASRLDDFDLGSSPPKK
    FT   DLDFGLDLPITRQVPSKANTDVQAKASAEK
    FT   FEAVESPQGSRKKASQTHTMCVQPQSVD
    FT   SEIAHIAVNRETSPDIHELCRSGTKEDCPID
    FT   HLCSDKIEHQQEEMGTDTQAEIQDNTKGA
    FT   DLSEKLPLDP
    FT precursor_RNA 2770..4382
    FT   /note=primary transcript
    FT mRNA join(2770..2899,2981..3095,3205..4382)
    FT   /note=exon 1
    FT CDS join(2796..2899,2981..3095,3205..4260)
    FT   /db_xref=SWISS-PROT:P16127
    FT   /note=chloroplast protein
    FT   /protein_id=CAB38561.1
    FT   /translation=MASLLGTSSSAIWASPSLSSPS
    FT   IQIRPKKNRSRYHVSVMNVATEINSTEQVV
    FT   LNVIDPKIGGVMIMGDRGTGKSTTVRSLVD
    FT   RVEKGEQVPVIATKINMVDLPLGATEDRVC
    FT   YVDEVNLLDDHLVDVLLDSAASGWNTVER
    FT   RFGMHAQVGTVRDADLRVKIVEERARFDS
    FT   QIDRELKVKISRVCSELNVDGLRGDIVTNRA
    FT   RLRKDPLESIDSGVLVSEKFAEIFS
    FT exon 2770..2899
    FT   /number=1
    FT intron 2900..2980
    FT   /number=1
    FT exon 2981..3095
    FT   /number=2
    FT intron 3096..3204
    FT   /number=2
    FT exon 3205..4382
    FT   /number=3
    FT polyA_signal 4378..4382
• It contains the analysis of special features in the sequence:
  • FH feature table header (fejlécsor)
  • FT feature table data (adatsor)
    • FT source rows: describe the base coordinates of a feature in a hierarchy of Organism > Chromosome > Structural part, in (StartCoord1..EndCoord1, StartCoord2..EndCoord2) format
    • FT CDS rows (Master): describe the protein product translated from the feature:
      • In which protein database record it can be found
      • The 1-char sequence of amino acids in the protein, in the direction from the N-terminal end to the COOH-terminal end (this complies with the 5’-3’ direction of the coding DNA strand (DNS szál))
    • FT precursor_RNA: describes the start/end coordinates of the immature mRNA (mRNS)
    • FT mRNA: describes the start/end coordinates of the mature mRNA after splicing
    • FT CDS rows (Additional): describe the protein products of alternative splicing, in the same format as above
    • FT exon: describes the exon’s consecutive number in splicing and its start/stop coordinates
    • FT intron: describes the intron’s consecutive number in splicing and its start/stop coordinates
    • FT FeatureName: start/stop coordinates of any other special feature
      • Promoter: start/stop coordinates of the promoter parts of genes
      • Domain: Domains (Domének) in protein products
      • Mutation: altered sequences of mutant versions
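The coordinate notation of the FT rows can be turned into numbers with a short regular expression. This Python sketch handles only the forms that appear in the record above (`a..b`, `join(...)`, `complement(...)` and the `<` / `>` partial markers); the function names are illustrative assumptions, and real feature locations can be more complex than this.

```python
import re

def parse_location(location):
    """Return the (start, end) pairs found in a feature location string."""
    return [(int(a), int(b)) for a, b in re.findall(r"<?(\d+)\.\.>?(\d+)", location)]

def total_length(location):
    """Sum the lengths of all spans; both end coordinates are inclusive."""
    return sum(end - start + 1 for start, end in parse_location(location))

def is_complement(location):
    """True if the feature lies on the complementary strand."""
    return location.startswith("complement(")

cds = "join(2796..2899,2981..3095,3205..4260)"
print(parse_location(cds))
print(total_length(cds))
```

For the second CDS of the record the three spans add up to 104 + 115 + 1056 = 1275 bases, a multiple of 3, as expected for a coding region read in codons.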

Sequence Databases: File formats: EMBL content: Sequence description
    SQ Sequence 6801 BP; 2093 A; 1242 C; 1374 G; 2092 T; 0 other;
    ggatccagtg gtagcttttc actcaaatct tgtaccttgg cagtttggct tgtacgagtg 60
    cctggtgata ttttgcctga gagggttgtt agagaatgtc cagcatctga gttatacagt 120
    gctcctttag tgttatcctg tatttctgcc tgagtgtctg tacccatttc ttcctgttga 180
    tgttctatct tgtctgaaca taaatgagat gagatgcttg gtgaagtctg
• SQ: Sequence header (Szekv. fejléc):
  • Frequency of each nucleotide / amino acid in the nucleotide / protein sequence
• 60, 120, 180…: Sequence rows:
  • 6×10 bases / amino acids per row, in blocks separated by space characters; the number at the end of a row is the position of its last residue
• Variations in storing different types of sequences:
  • DNA: Chromosomal DNA: It is described by the upper strand (Felső szál) in 5’-3’ direction with 1-char GCG codes (if all genes are coded on the upper strand, which is the most frequent case)
  • mRNA: Messenger RNA: almost the same as above, except that thymine (T) is replaced with uracil (U)
  • cDNA: Complementary DNA made from mature mRNA: This is also stored as an RNA sequence with (G, C, U, A) codes
  • tRNA: Transfer RNA: It is stored as the non-modified colinear sequence of the DNA
  • Protein: it is described with 1-char amino acid codes in N-terminal..COOH-terminal direction.
    • Don’t forget that the meaning of codons at translation shows slight variations in different organisms, and even in their mitochondrial DNA (Mitokondriális DNS)
      • Differing from the Standard Codon Table
• Normally, it is enough for a biologist to know the content of the text file records, and he or she will never see how they are really stored
• However, anybody who works seriously with bioinformatics software should know:
  • How data is stored in a modern relational database behind the software,
  • How the structure of a relational database can be described,
  • How to import/export data directly to a relational database if necessary
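The layout of the SQ header and the sequence rows can be checked with a small Python sketch. It uses only the header and the first three complete rows quoted above; the function names are illustrative, and the exact column spacing of a real file is assumed not to matter because all whitespace and digits are stripped.

```python
import re

HEADER = "SQ   Sequence 6801 BP; 2093 A; 1242 C; 1374 G; 2092 T; 0 other;"
ROWS = """\
     ggatccagtg gtagcttttc actcaaatct tgtaccttgg cagtttggct tgtacgagtg        60
     cctggtgata ttttgcctga gagggttgtt agagaatgtc cagcatctga gttatacagt       120
     gctcctttag tgttatcctg tatttctgcc tgagtgtctg tacccatttc ttcctgttga       180
"""

def parse_sq_header(header):
    """Read the total length and the base frequencies from the SQ line."""
    return {label: int(n) for n, label in re.findall(r"(\d+) (BP|A|C|G|T|other);", header)}

def join_rows(rows):
    """Drop the spacing and the position counters, keeping only the residues."""
    return re.sub(r"[\s\d]", "", rows)

def to_mrna(dna):
    """Rewrite a DNA string in mRNA letters by replacing thymine with uracil."""
    return dna.replace("t", "u").replace("T", "U")

counts = parse_sq_header(HEADER)
seq = join_rows(ROWS)
print(counts["BP"], sum(v for k, v in counts.items() if k != "BP"))
print(len(seq), to_mrna(seq[:10]))
```

In this record the four base counts add up to the stated 6801 BP, which is a cheap consistency check to run after an import.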

Sequence databases: Relational databases: Entity Relationship Diagrams
Entity relationship diagram (Egyedkapcsolati diagram) (ERD): it is used to represent the structure of a Relational Database System (RDS)
• Tables are rounded-corner boxes with the EntityName at the top. A blue background denotes codetable/master entities with minimal data change in time; yellow denotes relational/transaction entities with rapid, irrevocable data changes in time
• Fields are listed with their data type icons (Text, Integer, Fraction, Binary, Date, Time, Image, Sound, Movie) and names: italic means optional, normal means required, bold means auto-filled attribute
• Data fields are purple, primary keys are orange, foreign keys are olive, and auto-filled system logging attributes are black, each with its own prompt icon
• 1:many relations are denoted by a connector between the primary and foreign keys:
  • The independent side of the relation is denoted on the ERD with a dashed line, the dependent side with a solid line
  • Its referential integrity check, its unswitchability and disabled cascade delete each have their own symbol
• Let's see a simple example of invoicing:
    Legend boxes:
      EntityName: EntityNameID, Text, Integer, Fraction, Binary, Date, Time, Image, Sound, Movie, ReqForeignKey, OptForeignKey, Modifier, Modified, Status
      MasterEntity: MasterID, MasterName
    Transaction entities:
      Invoice: InvoiceID, InvoiceNum, ItemCount, NetTotalVal, GrossTotal, VATTotal, Paid, IssueDate, IssueTime, SellerID, BuyerID, SalesPersID, Modifier, Modified, Status
      Item: ItemID, Quantity, NetVal, GrossVal, InvoiceID, BarCode, Modifier, Modified, Status
      Address: AddressID, Door, Floor, Building, HouseNum, Street, StreetType, LinePhone, Fax, Zip, Modifier, Modified, Status
      PersProdSales: PersProdSalID, SumOfSales, SalesPersID, BarCode
    Master entities:
      Product: BarCode, Description, UnitPrice, VATCode, ITJCode, MeasUnit
      Seller: SellerID, SellerName, LegalFormat, SellerTaxReg, CellPhone, E-mail, URL, AddressID
      Buyer: BuyerID, FirstName, LastName, CellPhone, E-mail, AddressID
      SalesPers: SalesPersID, FirstName, LastName, CellPhone, E-mail, AddressID
    Code tables:
      StreetType: StreetType, TypeName
      VAT: VATCode, VATPercent
      MeasUnit: MeasUnit, UnitName
      Country: Country, CntName
      Zip: Zip, City, Country
      LegalFormat: LegalFormat, FormatName
      ITJ: ITJCode, Description, VATCode
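What a referential integrity check and a disabled cascade delete mean in practice can be sketched on the Invoice-Item relation of the invoicing example. The Python sketch below uses the built-in `sqlite3` module with the tables trimmed to a few fields; the `ON DELETE RESTRICT` clause is this example's assumed way of expressing "cascade delete disabled", and the helper `try_sql` is a made-up name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when this is on
conn.executescript("""
CREATE TABLE Invoice (InvoiceID INTEGER PRIMARY KEY, InvoiceNum TEXT NOT NULL);
CREATE TABLE Item (
    ItemID INTEGER PRIMARY KEY,
    Quantity INTEGER NOT NULL,
    -- foreign key on the dependent side of the 1:many relation, no cascade delete
    InvoiceID INTEGER NOT NULL REFERENCES Invoice(InvoiceID) ON DELETE RESTRICT);
INSERT INTO Invoice VALUES (1, 'INV-001');
INSERT INTO Item VALUES (1, 3, 1);
INSERT INTO Item VALUES (2, 5, 1);
""")

def try_sql(conn, sql):
    """Run one statement and report whether the integrity rules accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"

print(try_sql(conn, "INSERT INTO Item VALUES (3, 1, 99)"))       # no Invoice 99 exists
print(try_sql(conn, "DELETE FROM Invoice WHERE InvoiceID = 1"))  # Items still reference it
print(try_sql(conn, "INSERT INTO Item VALUES (3, 1, 1)"))        # valid reference
```

The first two statements break the relation and are refused by the database itself, so the rule holds no matter which program writes the data.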

Sequence databases: Relational databases: Examples
• There are alternative ERD symbol sets as well (e.g. one where an arrow denotes the many:1 relationship: foreign key → primary key), but it is easy to switch between them once the relational logic is understood
• We include here some examples of ERD database designs of current bioinfo software to show their complexity (see References for further details)

Text-based search
We use this if we do not know the accession number (AC) of the sequence, but we know some auxiliary information (organism, author, journal, etc.):
• Entrez (http://www.ncbi.nlm.nih.gov/sites/gquery): this is the multi-search engine of NCBI (National Center for Biotechnology Information). At the Main screen:
  • We can launch a general search (slower), or
  • Select the specific database to search (faster):
    • PubMed: search among publication abstracts and references. Its user interface after clicking Advanced search:
      • Search box: keyword search with AND, OR, NOT logic operators and * wildcards
      • Search builder: an easy-to-use graphic interface to build more complex search terms by pressing the Add to search button:
        • Searches can be more focused if we give the database field (ID..CC) in which the keywords should be searched
        • We can do the same in the search box by putting the field name in brackets: [Title]
      • Search history: the list of our recent searches. Items can be combined with logic operators in the format #1 AND #2 to build even more complex searches
    • PubMed Central: here we can search free full-text publications with a similar user interface
• Other databases recommended for full-text search:
  • UniProt: for proteins: http://www.ebi.ac.uk/uniprot/
  • PIR: mainly for proteins: http://pir.georgetown.edu/
  • SRS: EBI multi-search interface: http://srs.ebi.ac.uk
  • DB-GET: Japanese multi-search: http://www.genome.jp/dbget/
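The field-restricted search terms described above are plain strings, so they can also be assembled in a script. The Python sketch below only builds the term and a request URL; nothing is sent over the network. The NCBI E-utilities `esearch.fcgi` address with its `db` and `term` parameters is not mentioned in these slides and is used here as an assumption about the programmatic interface; the function names are illustrative.

```python
from urllib.parse import urlencode

# Assumed endpoint of the NCBI E-utilities search service (not part of the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def field_term(keyword, field=None):
    """Attach a field name in brackets, as typed into the search box: keyword[Field]."""
    return f"{keyword}[{field}]" if field else keyword

def combine(terms, operator="AND"):
    """Join terms with one of the logic operators AND, OR, NOT."""
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError("operator must be AND, OR or NOT")
    return f" {operator} ".join(terms)

def esearch_url(term, db="pubmed"):
    """Build (but do not send) a search request URL for the given term."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})

term = combine([field_term("Arabidopsis", "Title"), field_term("chloroplast", "Title")])
print(term)
print(esearch_url(term))
```

Keeping the term construction separate from the URL makes it easy to paste the same term into the Search box by hand.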

Sequence searches
We use this if we know the accession number (AC)
• Unique (egyedi): if we are looking for a single sequence
  • Entrez (http://www.ncbi.nlm.nih.gov/sites/gquery):
    • Main screen:
      • Nucleotide search: we get a search builder similar to the one used for text-based search
      • Results are retrieved in FASTA, EMBL or GenBank format
  • Other sites for sequence search:
    • GenBank: http://www.ncbi.nlm.nih.gov/genbank/
    • UniProt: http://www.ebi.ac.uk/uniprot/
    • EMBL: http://www.ebi.ac.uk/embl/
    • OMIM: http://www.ncbi.nlm.nih.gov/omim
    • RefSeq: http://www.ncbi.nlm.nih.gov/refseq/
    • PDB: http://www.pdb.org/pdb/home/home.do
    • Pfam: http://pfam.sanger.ac.uk/
    • SCF: http://www.sciencechatforum.com
    • ClustalW: http://www.ebi.ac.uk/Tools/clustalw2/index.html
    • BLAST: http://blast.ncbi.nlm.nih.gov/Blast.cgi
• Batch (Kötegelt): if we are looking for multiple sequences:
  • Entrez (http://www.ncbi.nlm.nih.gov/sites/batchentrez)
    • Database: select the sequence database
    • File (Tallózás): name and path of a text file containing the ACs of the requested sequences (only 1 AC per line!)
    • Retrieve: shows the results in FASTA format, concatenated one after the other
  • Example input file:
      X000328
      X000329
      X000330
  • Example result:
      >X000328
      ATTGCGCTATCGTATAGCAT
      >X000329
      ATTGCGCTATCGTATAGCAT
      >X000330
      ATTGCGCTATCGTATAGCAT
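Both ends of the batch retrieval are simple text, which a short Python sketch can prepare and read back. It assumes nothing beyond the formats shown in the example: one AC per line in the input, and concatenated FASTA records in the result. The function names are illustrative.

```python
def accession_list_text(accessions):
    """Format ACs for the upload file: one per line, blanks and repeats dropped."""
    seen = []
    for ac in accessions:
        ac = ac.strip()
        if ac and ac not in seen:
            seen.append(ac)
    return "\n".join(seen) + "\n"

def parse_fasta(text):
    """Split concatenated FASTA text into a {description: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequences may be wrapped over several lines
    return records

RESULT = """\
>X000328
ATTGCGCTATCGTATAGCAT
>X000329
ATTGCGCTATCGTATAGCAT
>X000330
ATTGCGCTATCGTATAGCAT
"""

print(accession_list_text(["X000328", "X000329", "X000330"]), end="")
print(parse_fasta(RESULT))
```

The parser starts a new record at every „>” line, so the same function works for a single sequence and for a whole batch.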

Home Assignment 1: Sequence search
• A, Search for publications and sequences related to Yeast (Élesztő) mitochondrial DNA! (2.5pts)
• B, Download the following sequence in EMBL format: X51799. What is the total length of its introns? (2.5pts)

References
• Text search:
  • NCBI Entrez: http://www.ncbi.nlm.nih.gov/sites/gquery
  • UniProt: http://www.ebi.ac.uk/uniprot/
  • PIR: http://pir.georgetown.edu/
  • SRS: http://srs.ebi.ac.uk
  • DB-GET: http://www.genome.jp/dbget/
• Sequence search:
  • NCBI Entrez: http://www.ncbi.nlm.nih.gov/sites/gquery
  • NCBI Entrez batch: http://www.ncbi.nlm.nih.gov/sites/batchentrez
  • GenBank: http://www.ncbi.nlm.nih.gov/genbank/
  • UniProt: http://www.ebi.ac.uk/uniprot/
  • EMBL: http://www.ebi.ac.uk/embl/
  • OMIM: http://www.ncbi.nlm.nih.gov/omim
  • RefSeq: http://www.ncbi.nlm.nih.gov/refseq/
  • PDB: http://www.pdb.org/pdb/home/home.do
  • Pfam: http://pfam.sanger.ac.uk/
  • SCF: http://www.sciencechatforum.com
  • ClustalW: http://www.ebi.ac.uk/Tools/clustalw2/index.html
  • BLAST: http://blast.ncbi.nlm.nih.gov/Blast.cgi
• Examples of Entity Relationship Diagrams of bioinfo databases:
  • NCBI Lipid Ontology: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1669719/
  • DNA Alignment editor: http://www.biomedcentral.com/1471-2105/9/154
  • Biomed Data Warehouse: http://www.biomedcentral.com/1471-2105/7/170
