LSM2241

LSM2241 P1 & P2 – Extra Discussion Questions

Features of major databases (PubMed and NCBI Protein Db)

Anatomy of PubMed Db

Epub ahead of print and journal impact factor
• How to get the impact factor of any journal:
  - Direct source: the Web of Science database (free for NUS students)
  - Indirect sources, e.g. blogs and other websites (do a Google search)

Anatomy of a PubMed record (extra information compared to slide 3)

Demo on downloading articles (see AccessingOnlineJournalArticles.ppt for details)

Anatomy of a Protein Db

Accession numbers and GenInfo Identifiers
Example record: NM_000546.3, GI 120407067
• GI (or GenInfo Identifier): 120407067
• Accession: NM_000546
• Version: NM_000546.3
Popular data sources:
• dbj – DDBJ (DNA Data Bank of Japan database)
• emb – The European Molecular Biology Laboratory (EMBL) database
• prf – Protein Research Foundation database
• sp – SwissProt
• gb – GenBank
• pir – Protein Information Resource

Why do we need an accession number and a GI for one record?
• 1) What is the difference between an accession and a GI?
• 2) Why do we need these two when both seem to be accession numbers?

Why do we need accession numbers and GIs?
The accession NM_000546 stays the same across sequence updates, while the version and the GI change with each update:
Sequence       Version        GI
Sequence_v1    NM_000546.1    4507636
Sequence_v2    NM_000546.2    8400737
Sequence_v3    NM_000546.3    120407067
Q1) Which revision will NCBI show if you search by the accession only, without the version number?

Accession numbers
• The unique identifier for a sequence record.
• An accession number applies to the complete record.
• Accession numbers do not change, even if information in the record is changed at the author's request.
• Sometimes, however, an original accession number might become secondary to a newer accession number, if the authors make a new submission that combines previous sequences, or if for some reason a new submission supersedes an earlier record.

GenInfo Identifiers
• GenInfo Identifier (GI): a sequence identification number.
• If a sequence changes in any way, a new GI number will be assigned.
• A separate GI number is also assigned to each protein translation within a nucleotide sequence record.
• A new GI is assigned if the protein translation changes in any way.
• GI sequence identifiers run parallel to the new accession.version system of sequence identifiers.

Version
• A nucleotide sequence identification number that represents a single, specific sequence in the GenBank database.
• If there is any change to the sequence data (even a single base), the version number will be increased, e.g., U12345.1 → U12345.2, but the accession portion will remain stable.
• The accession.version system of sequence identifiers runs parallel to the GI number system, i.e., when any change is made to a sequence, it receives a new GI number AND an increase to its version number.
• A Sequence Revision History tool (http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi) is available to track the various GI numbers, version numbers, and update dates for sequences that appeared in a specific GenBank record.
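As a rough illustration of the accession.version scheme, the identifier can be split at the dot into its stable accession part and its version number. This is only a sketch in Python; the function name `split_accession_version` is hypothetical and is not part of any NCBI tool.

```python
def split_accession_version(identifier):
    """Split an identifier such as 'NM_000546.3' into (accession, version).

    Returns (identifier, None) when there is no numeric version suffix.
    """
    accession, dot, version = identifier.partition(".")
    if dot and version.isdigit():
        return accession, int(version)
    return identifier, None


print(split_accession_version("NM_000546.3"))  # ('NM_000546', 3)
print(split_accession_version("NM_000546"))    # ('NM_000546', None)
print(split_accession_version("U12345.2"))     # ('U12345', 2)
```

Two identifiers with the same accession but different versions refer to different revisions of the same record.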

Anatomy of a Protein Db record

Fasta Sequence

Fasta Format
• Text-based format for representing nucleic acid sequences or peptide sequences (single-letter codes).
• Sequences in this format are easy to manipulate, parse, and pass to programs.
Example (each record is one description line/row followed by one or more sequence data lines):
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

Fasta Format (cont.)
• Begins with a single-line description, followed by lines of sequence data.
• Description line
  - Distinguished from the sequence data by a greater-than (">") symbol.
  - The word following the ">" symbol in the same row is the identifier of the sequence.
  - There should be no space between the ">" and the first letter of the identifier.
  - Keep the identifier short and clear; some old programs accept identifiers of at most 10 characters. For example: >gi|5524211|Human or >HumanP53
• Sequence line(s)
  - Ensure that the sequence data starts in the row following the description row (be careful of the word wrap feature).
  - The sequence ends if another line starting with a ">" appears; this indicates the start of another sequence.
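The rules above can be sketched as a small parser. This is a minimal Python illustration, not a replacement for an established library; the function name `parse_fasta` is hypothetical, and the example sequences are shortened from the slide.

```python
def parse_fasta(text):
    """Return a list of (identifier, sequence) tuples from FASTA text."""
    records = []
    identifier = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if identifier is not None:
                records.append((identifier, "".join(chunks)))
            # The identifier is the first word after the ">" symbol.
            words = line[1:].split()
            identifier = words[0] if words else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)  # sequence data may span several lines
    if identifier is not None:
        records.append((identifier, "".join(chunks)))
    return records


example = """>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
"""

for name, sequence in parse_fasta(example):
    print(name, len(sequence))
```

The sequence lines of each record are joined into one string, so the line wrapping in the file does not affect the parsed sequence.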

Amino acids

IUPAC One Letter Amino Acid Code
Letter  Amino acid                               Mnemonic
A       Alanine
B       Asx (Aspartic acid or Asparagine)
C       Cysteine
D       Aspartic acid                            Aspar(D)ic acid
E       Glutamic acid
F       Phenylalanine                            (F)enylalanine
G       Glycine
H       Histidine
I       Isoleucine
J       (not listed)
K       Lysine
L       Leucine
M       Methionine
N       Asparagine                               Asparagi(N)e
O       Pyrrolysine (Pyl), the 22nd amino acid   Pyrr(O)lysine
P       Proline
Q       Glutamine                                (Q)lutamine
R       Arginine                                 (R)ginine
S       Serine
T       Threonine
U       Selenocysteine (Sec), the 21st amino acid
V       Valine
W       Tryptophan                               T(W)ptophan
X       (not listed)
Y       Tyrosine                                 T(Y)rosine
Z       Glx (Glutamic acid or Glutamine)
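One practical use of the code table is to check a sequence for unexpected characters before passing it to a program. The sketch below is an illustration only: the dictionary contains just the letters named on the slide (so J and X are treated as unrecognised here), and the names `ONE_LETTER_CODES` and `unknown_letters` are hypothetical.

```python
# One-letter codes as named on the slide; J and X are not listed there.
ONE_LETTER_CODES = {
    "A": "Alanine", "B": "Asx (Aspartic acid or Asparagine)",
    "C": "Cysteine", "D": "Aspartic acid", "E": "Glutamic acid",
    "F": "Phenylalanine", "G": "Glycine", "H": "Histidine",
    "I": "Isoleucine", "K": "Lysine", "L": "Leucine",
    "M": "Methionine", "N": "Asparagine", "O": "Pyrrolysine",
    "P": "Proline", "Q": "Glutamine", "R": "Arginine",
    "S": "Serine", "T": "Threonine", "U": "Selenocysteine",
    "V": "Valine", "W": "Tryptophan", "Y": "Tyrosine",
    "Z": "Glx (Glutamic acid or Glutamine)",
}


def unknown_letters(sequence):
    """Return the sorted characters of a sequence that are not in the table."""
    return sorted(set(sequence.upper()) - set(ONE_LETTER_CODES))


print(unknown_letters("MTEITAAMVK"))  # []
print(unknown_letters("MTJX"))        # ['J', 'X']
```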

Note

Advice
• We highly recommend that you memorize the amino acid codes and their structures (covered in lectures on 3D structures).
• Memorizing the codes, and in particular the structures, will be very useful for this module and other modules, especially for research purposes.
• It is not compulsory that you memorize these for this module.

Features of major databases (Gene Db)

Anatomy of Gene Db

Anatomy of a Gene Db record

A section of a Gene Db record: Reference Sequences
• mRNA accession number
• Protein accession number

Questions

A) Problem Scenario
Mr. Tan Yong Liang, Benjamin has just joined Prof. Tan Tin Wee’s lab to do his PhD. He is to continue the project that was done by Dr. Asif M. Khan, who has just graduated from Prof. Tan’s lab with a PhD. To better understand the project that Dr. Khan did, Prof. Tan asked Benjamin to read all the papers that Dr. Khan published. Benjamin, being new to bioinformatics, needs your help in finding the papers. Can you help him answer the following questions?

A) Questions
Q1. Which database(s) should he search?
Q2. Help him formulate his search query based on the following available information:
• Corresponding authors: Vladimir Brusic, Thomas J August, Tan Tin Wee
• In one of the papers, Dr. Asif M. Khan’s name was incomplete: Asif Khan
• Prof. August has a paper with Rosati M, which is also co-authored by someone with the same incomplete abbreviation as Dr. Khan
Q3. On the results page, you will see two tabs, “All” and “Review”. What is the difference between them?
Q4. Is PubMed comprehensive?

B) Questions
Given: "cancer" returns 15 records in total, "p53" returns 12 records in total, and 5 records contain both terms.
• p53 AND cancer: returns how many records?
• p53 OR cancer: returns how many records?
• p53 NOT cancer: returns how many records?
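The three Boolean operators correspond to set intersection, union, and difference. The sketch below models the counts given in the question with Python sets; the record IDs are made up purely so that the two sets overlap in 5 records, and they do not correspond to real PubMed records.

```python
# Hypothetical record IDs chosen only to match the counts in the question.
cancer = set(range(1, 16))   # 15 records: IDs 1..15
p53 = set(range(11, 23))     # 12 records: IDs 11..22 (IDs 11..15 are shared)

both = p53 & cancer          # p53 AND cancer: records containing both terms
either = p53 | cancer        # p53 OR cancer: records containing at least one term
p53_only = p53 - cancer      # p53 NOT cancer: p53 records without "cancer"

print("AND:", len(both))
print("OR: ", len(either))
print("NOT:", len(p53_only))
```

Note that OR does not simply add the two totals, because the shared records would otherwise be counted twice.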

C) Questions
Q1) When you perform a search for P53 in the protein database, you observe 4 tabs on top, namely All, Bacteria, RefSeq and Related Structures. What do you think is the difference between the “RefSeq” and “All” tabs?

D) Questions
Q1) Using the skills you have learned and the databases that have been introduced to you, can you find out where in the p53 protein the Nuclear Localization Signal is located, i.e., what is the sequence range?
Q2) Does the entry (P04637) belong to the RefSeq database? (Hint: analyze the alphanumeric identifiers of the entry)
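Following the hint in Q2, the shape of an identifier can be checked with a simple pattern. The sketch below assumes that RefSeq accessions consist of a two-letter prefix, an underscore, an alphanumeric body, and an optional version (as in NM_000546.3). It is a simplified illustration, not an official validator, and the names `REFSEQ_PATTERN` and `looks_like_refseq` are hypothetical.

```python
import re

# Assumption: two capital letters, an underscore, letters/digits, optional ".version".
REFSEQ_PATTERN = re.compile(r"^[A-Z]{2}_[A-Z0-9]+(\.\d+)?$")


def looks_like_refseq(identifier):
    """Return True if the identifier has the RefSeq-style shape assumed above."""
    return bool(REFSEQ_PATTERN.match(identifier))


for candidate in ["NM_000546.3", "NM_000546", "P04637"]:
    print(candidate, looks_like_refseq(candidate))
```

A pattern match only shows that an identifier is written in this style; whether a record actually exists still has to be confirmed in the database itself.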

Summary of items covered today
• Intro to Practicals – logistics
• Search strategies exercise and discussion
• Explored basic bioinformatics resources – exercise and discussion
• Tips/Tricks to improve productivity
  - “Libproxy1” suffix shortcut
  - WizFolio
