Protein Prediction II Exercise
Exercise – Project Layout • General remarks – recap: Report 60 pts, Exam 40 pts, weekly presentations by each group, one bad presentation allowed, groups of 3-4 students • Contact & Questions: pp2ex@rostlab.org only! • The exercise is taken from the CAFA competition • Prediction of HPO terms • HPO: Human Phenotype Ontology
Terms – Definitions and Explanations • Amino acids (aa): Building blocks of proteins; 20 different aa are found in proteins • Protein sequence: String of characters representing a sequence of amino acids (a string over a 20-letter alphabet) • The protein sequence defines the protein structure and the protein function (within some limits) • Protein sequences are stored in large, publicly available repositories • One of the best-known repositories is UniProt (http://www.uniprot.org/) and its section Swiss-Prot • Besides the sequence, these databases also hold additional information about the protein
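To make the "string over a 20-letter alphabet" definition concrete, here is a minimal Python sketch that checks whether a string uses only the 20 standard one-letter amino acid codes. The function names are made up for illustration and are not part of the exercise files. Real database entries may also contain additional codes such as X, B, Z, U or O, so treating those as invalid is a simplifying assumption.

```python
# The 20 standard amino acids in one-letter code.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def invalid_residues(sequence: str) -> set[str]:
    """Return the characters of `sequence` that are not standard amino acids."""
    return set(sequence.upper()) - STANDARD_AMINO_ACIDS


def is_valid_protein_sequence(sequence: str) -> bool:
    """True if `sequence` is non-empty and uses only the 20 standard letters."""
    return len(sequence) > 0 and not invalid_residues(sequence)
```

Calling `invalid_residues` before any further processing is a cheap way to spot entries that need special handling, for example sequences containing the ambiguity code X.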
Ontology (in information science) • Ontology: An ontology represents knowledge as a set of concepts within a domain, using a shared vocabulary to denote types, properties and interrelationships of those concepts • Human Phenotype Ontology (HPO): Set of concepts describing human appearance (shape, health, and so forth) • HPO concepts are hierarchically ordered, i.e. there is an “is-a” relationship between them • They are arranged in a tree-like fashion
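The "is-a" hierarchy can be sketched as a mapping from each term to its direct parents. The toy ontology below uses invented term names, not real HPO identifiers, and allows a term to have more than one parent (an assumption that goes slightly beyond a strict tree). It shows how all more general terms of a given term can be collected by walking upward.

```python
def ancestors(term: str, parents: dict[str, list[str]]) -> set[str]:
    """Collect every term reachable from `term` by following is-a links upward."""
    seen: set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Toy ontology with hypothetical term names: child -> direct parents.
TOY_PARENTS = {
    "abnormal_heart_valve": ["abnormal_heart"],
    "abnormal_heart": ["abnormal_cardiovascular_system"],
    "abnormal_cardiovascular_system": ["phenotypic_abnormality"],
    "phenotypic_abnormality": [],
}
```

Because of the is-a relationship, a protein annotated with a specific term is implicitly associated with all of that term's ancestors, which is worth keeping in mind when comparing predicted and true term sets.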
Our competition • Proteins are annotated (described) with experimentally determined information • As time goes by: Proteins are associated with information about experimentally confirmed effects on the human phenotype • The associated terms are taken from the Human Phenotype Ontology • Experimental determination is slow and expensive • => we try to predict the associated HPO terms for proteins that are not yet annotated
More formal steps • Find a function that assigns a set of HPO terms T to a sequence s so that the number of false assignments is minimal and the number of true assignments is maximal • Remember: The true evaluation is done after submission, when sequences that are not yet annotated receive experimentally determined annotations
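One common way to quantify "few false assignments, many true assignments" for a single sequence is precision, recall and their harmonic mean (F1) over the predicted and true term sets. The sketch below is an illustration under that assumption; it is not necessarily the scoring used by the competition, and the function name is hypothetical.

```python
def precision_recall_f1(predicted: set[str], true: set[str]) -> tuple[float, float, float]:
    """Score a predicted term set against the experimentally confirmed term set."""
    true_positives = len(predicted & true)
    # Precision: fraction of predicted terms that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: fraction of true terms that were predicted.
    recall = true_positives / len(true) if true else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For example, predicting three terms of which two are among four true terms gives a precision of 2/3 and a recall of 1/2. Predicting more terms tends to raise recall at the cost of precision, which is why both are reported together.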
Tasks • Download the files from www.rostlab.org/~richter/pp2_files.tgz • Get familiar with the provided files • Especially the column names (look them up at UniProt and HPO) • Read: http://biofunctionprediction.org/sites/default/files/IntroductionCAFA_pedja.pdf
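A quick way to get a first look at the provided files is to print the first lines of every file in the archive, since header lines with column names usually sit at the top. The sketch below assumes the archive has already been downloaded to a local path and that its members are plain text; the function name and both assumptions are illustrative, since the actual contents of the archive are not described here.

```python
import tarfile


def peek_archive(archive_path: str, n_lines: int = 2) -> dict[str, list[str]]:
    """Map each regular file in a gzipped tar archive to its first `n_lines` lines."""
    preview: dict[str, list[str]] = {}
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories and links
            handle = archive.extractfile(member)
            lines = []
            for _ in range(n_lines):
                raw = handle.readline()
                if not raw:
                    break  # file has fewer than n_lines lines
                # Assumption: files are text; undecodable bytes are replaced.
                lines.append(raw.decode("utf-8", errors="replace").rstrip("\r\n"))
            preview[member.name] = lines
    return preview
```

Calling `peek_archive("pp2_files.tgz")` on the downloaded archive and printing the result gives an overview of the file names and their first lines without unpacking anything to disk.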