Tools in bioinformatics Fall 2009-10
Overview
Goals:
• To provide students with practical knowledge of bioinformatics tools and their application in research
Prerequisites:
• The course “Introduction to bioinformatics”
• Familiarity with topics in molecular biology (cell biology, biochemistry, and genetics)
• Basic familiarity with computers & internet
Administration Course website: http://ibis.tau.ac.il/intro_bioinfo/tools.html
Administration Classes: A class will be given every two weeks. There are three class groups: Sunday 16:00-18:00, Monday 12:00-14:00, Monday 14:00-16:00. Location: Computer classroom, Sherman 03
Administration Teachers: • Nimrod Rubinstein rubi@post.tau.ac.il (Sundays) • Daiana Alaluf daianaal@post.tau.ac.il (Mondays I) • Osnat Penn penn@post.tau.ac.il (Mondays II) • Reception hours: Email your instructor with any question at any time, or set up an appointment (Britania 405, 6409245)
Requirements
• Assignments – 50% of final grade (compulsory)
• Assignments include classwork and homework:
  • Classwork is planned to be completed during the lesson and handed in at the end of it. It will be checked but not graded.
  • Homework should be handed in at the following lesson (two weeks after it is handed out). It will be checked and graded.
• Final project – 50% of final grade
When emailing your instructor (a question, your assignment, or anything else), please state in the “Subject” field: “Tools in Bioinfo”, IDs, CW/HW number (if relevant)
What’s in a database? • Sequences – genes, proteins, etc. • Full genomes • Expression data • Structures • Annotation – information about genes/proteins: function, cellular location, chromosomal location, introns/exons, phenotypes, diseases • Publications
NCBI and Entrez • One of the largest and most comprehensive databases, belonging to the NIH (National Institutes of Health, the primary federal agency for conducting and supporting medical research in the USA) • Entrez is the search engine of NCBI • Search for: genes, proteins, genomes, structures, diseases, publications, and more http://www.ncbi.nlm.nih.gov
PubMed: NCBI’s database of biomedical articles Yang X, Kurteva S, Ren X, Lee S, Sodroski J. “Subunit stoichiometry of human immunodeficiency virus type 1 envelope glycoprotein trimers during virus entry into host cells”, J Virol. 2006 May;80(9):4388-95.
Use fields! Yang[AU] AND glycoprotein[TI] AND 2006[DP] AND J virol[TA] For the full list of field tags: go to help -> Search Field Descriptions and Tags
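The fielded query above is just search terms with a bracketed tag appended, joined by AND. A minimal sketch of assembling such a string in Python follows; the helper name `pubmed_query` is hypothetical (not part of any NCBI library), and the code only builds the text to paste into the PubMed search box.

```python
def pubmed_query(terms):
    """Join (value, tag) pairs into a fielded query string using AND.

    `terms` is a list of (search value, field tag) pairs, e.g. ("Yang", "AU").
    This helper is illustrative only; it does not contact PubMed.
    """
    return " AND ".join(f"{value}[{tag}]" for value, tag in terms)


# The query from the slide: author, title word, publication date, journal.
query = pubmed_query([
    ("Yang", "AU"),
    ("glycoprotein", "TI"),
    ("2006", "DP"),
    ("J virol", "TA"),
])
print(query)
```

Keeping the terms as data rather than one long string makes it easy to drop or swap a single field when a search returns too few or too many hits.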
Example • Retrieve all publications in which the first author is Davidovich C and the last author is Yonath A
Using limits: Retrieve the publications of Yonath A in the journals Nature and Proc Natl Acad Sci U S A, from the last 5 years
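Both exercises can also be written as fielded queries instead of using the Limits page. The sketch below assumes the PubMed tags `[1AU]` (first author), `[LASTAU]` (last author), `[TA]` (journal) and `[DP]` (publication date, with `start:end` for a range). The year range 2005:2009 is an assumed reading of "the last 5 years" for a Fall 2009 course, and the helper names are hypothetical.

```python
def tagged(value, tag):
    """Return one fielded term, e.g. 'Yonath A[AU]'."""
    return f"{value}[{tag}]"


def any_of(values, tag):
    """Return a parenthesised OR-group over several values of one field."""
    return "(" + " OR ".join(tagged(v, tag) for v in values) + ")"


# Example: first author Davidovich C, last author Yonath A.
first_last_query = " AND ".join([
    tagged("Davidovich C", "1AU"),
    tagged("Yonath A", "LASTAU"),
])

# Limits: Yonath A in either of two journals, within an assumed 5-year window.
limits_query = " AND ".join([
    tagged("Yonath A", "AU"),
    any_of(["Nature", "Proc Natl Acad Sci U S A"], "TA"),
    tagged("2005:2009", "DP"),
])

print(first_last_query)
print(limits_query)
```

The parentheses around the journal group matter: without them the OR would combine with the neighbouring AND terms in an unintended way.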
Google scholar http://scholar.google.com/
GenBank: NCBI’s gene & protein database • GenBank is an annotated collection of all publicly available DNA sequences (and their amino-acid translations) • Holds ~106.5 billion bases in ~108.5 million sequence records (Oct. 2009)
Searching NCBI for the protein human CD4 Search demonstration
Using field descriptions, qualifiers, and boolean operators • Cd4[GENE] AND human[ORGN], or: Cd4[gene name] AND human[organism] • List of field codes: http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers • Boolean operators: AND, OR, NOT. Note: do not use the field Protein name [PROT], only GENE!
RefSeq • Subcollection of NCBI databases with only non-redundant, highly annotated entries (genomic DNA, transcript (RNA), and protein products)
Fasta format
Header (ID/accession, description):
>gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
Sequence:
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWWQAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLALEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWVLNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCVRCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI
Save accession numbers for future use (makes searching quicker): RefSeq accession number: NP_000607.1
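To make the parts of the header concrete, here is a minimal sketch that splits the CD4 header shown above into its identifiers, description, and organism. It assumes the pipe-delimited layout seen in this one example (tag, value, tag, value, then free text ending in a bracketed organism). The function name `parse_ncbi_header` is hypothetical, and headers from other sources may not follow this layout.

```python
import re

HEADER = ">gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]"


def parse_ncbi_header(header):
    """Split a header like the one on the slide into labelled parts."""
    line = header.lstrip(">").strip()
    # Everything up to the first space is the identifier block.
    id_block, _, description = line.partition(" ")
    fields = id_block.strip("|").split("|")
    # Fields alternate between a database tag and its value.
    identifiers = dict(zip(fields[0::2], fields[1::2]))
    # A trailing [...] holds the organism name, if present.
    match = re.search(r"\[([^\]]+)\]\s*$", description)
    organism = match.group(1) if match else None
    if match:
        description = description[:match.start()].strip()
    return {
        "identifiers": identifiers,
        "description": description,
        "organism": organism,
    }


parsed = parse_ncbi_header(HEADER)
print(parsed["identifiers"]["ref"])  # the RefSeq accession to save
print(parsed["description"])
print(parsed["organism"])
```

The `ref` entry is the RefSeq accession that the slide recommends saving for later searches.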
Downloading
Swissprot • A protein sequence database which strives to provide a high level of annotation regarding: the function of a protein, domain structure, post-translational modifications, variants • One entry for each protein http://www.expasy.ch/sprot
GenBank vs. Swissprot [Screenshots: Swiss-Prot results; GenBank results]
PDB: Protein Data Bank • Main database of 3D structures of macromolecules • Includes ~61,000 entries (proteins, nucleic acids, complex assemblies) • Is highly redundant http://www.rcsb.org
Human CD4 in complex with HIV gp120 (PDB ID 1G9M) [Figure labels: gp120, CD4]
GeneCards • All-in-one database of human genes (a project by the Weizmann Institute) • Attempts to integrate as many databases and publications as possible, along with all available knowledge http://www.genecards.org
Organism specific databases • Model organisms have independent databases: HIV database http://hiv-web.lanl.gov/content/index
Summary
• General and comprehensive databases:
  • NCBI, EMBL
• Genome specific databases (to be discussed):
  • UCSC, ENSEMBL
• Highly annotated databases:
  • Human genes:
    • Genecards
  • Proteins:
    • Swissprot, RefSeq
  • Structures:
    • PDB
As important: • Google (or any search engine)
And always remember: • RT(F)M -Read the manual!!! (/help/FAQ)
Gene Ontology
• Strives to provide consistent descriptions of gene products obtained from different databases
• GO annotations include three hierarchical ontologies of gene products:
  • cellular component(s) – the environment in which the gene product functions
  • biological process(es) – the biological program/pathway in which the gene product is involved
  • molecular function(s) – the elemental activities of the gene product
• E.g., cytochrome c:
  • cellular components: mitochondrial matrix and mitochondrial inner membrane
  • biological processes: oxidative phosphorylation and induction of cell death
  • molecular functions: oxidoreductase activity
Enrichment analysis [Figure: a reference set of N genes in total, K of which have function f, and a query set of n genes in total, k of which have function f] Is k/n significantly greater than K/N?
Statistical significance testing Problem formulation: In a group of N genes there are K “special” ones. If we sample n genes out of N (without replacement) and find k “special” ones, would that be considered a random outcome? Mathematically, we use the hypergeometric distribution to compute the probability of obtaining k or more “special” ones in a sample of n.
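The tail probability described here can be computed directly from binomial coefficients. The sketch below uses only the Python standard library; the function name `hypergeom_pvalue` is hypothetical, and the numbers in the usage line are made-up toy values, not data from the course or from any study.

```python
from math import comb


def hypergeom_pvalue(N, K, n, k):
    """Probability of drawing k or more "special" genes.

    N: total genes in the reference set
    K: "special" genes in the reference set
    n: genes sampled without replacement (the query set)
    k: "special" genes observed in the sample

    P(X >= k) = sum over i = k..min(n, K) of
                C(K, i) * C(N - K, n - i) / C(N, n)
    """
    upper = min(n, K)
    total = comb(N, n)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / total


# Toy numbers (assumed for illustration): 1000 genes, 50 with function f,
# and a query set of 100 genes of which 12 have function f.
p_value = hypergeom_pvalue(N=1000, K=50, n=100, k=12)
print(p_value)
```

A small value suggests that k/n exceeds K/N by more than random sampling would explain. When many functions are tested at once, the resulting p-values would normally also need a multiple-testing correction, which this sketch does not include.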
Materials & Methods 21,121 siRNA knockdown assays, literally covering the entire coding-sequence part of the genome
Results 273 HIV-dependency factors (HDFs) were discovered [Figure: Biological processes]