Automatic and Reliable Functional Annotation of Proteins
The Target Database • Your data • Uncharacterized • Any kind of data: protein sequences, gene sequences, etc. • Our target: TrEMBL [Diagram: Target]
The External Database • Collection of conditions: sequence patterns, profiles, HMMs, E.C. numbers, protein clusters • Examples: PROSITE, Pfam [Diagram: XDB, Target]
Direct Transfer • Search the target • Transfer annotation to the target database • Example: look up the E.C. number and add the recommended enzyme name [Diagram: XDB, Target]
Multiple Sources • Usually more than one external database is used • Combine the different results [Diagram: XDB, Target]
Conflicts • Contradiction • Inconsistencies • Synonyms • Redundancy
Translation • Use a translator to map the XDB language to the target language [Diagram: XDB, Target]
Translation Examples
• ENZYME:  CA L-ALANINE=D-ALANINE
  TrEMBL:  CC -!- CATALYTIC ACTIVITY: L-ALANINE=
           CC D-ALANINE.
• PROSITE: /SITE=3,heme_iron
  TrEMBL:  FT METAL IRON
• Pfam:    FT DOMAIN zf_C3HC4
  TrEMBL:  FT ZN_FING C3HC4-TYPE
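The translation step can be pictured as a lookup from XDB vocabulary to TrEMBL vocabulary. The following is a minimal sketch, not the authors' translator: the table `%feature_map` and the routines `translate_feature` and `translate_ca` are hypothetical names, the table holds only the examples from this slide, and line codes are written with single spaces as they appear here.

```perl
use strict;
use warnings;

# Hypothetical lookup table covering only the two feature examples above.
# Keys are "<XDB>:<XDB term>", values are the TrEMBL-side lines.
my %feature_map = (
    'PROSITE:heme_iron' => 'FT METAL IRON',
    'Pfam:zf_C3HC4'     => 'FT ZN_FING C3HC4-TYPE',
);

# Translate one XDB term. Returns undef when no mapping is known,
# so unknown terms are dropped rather than guessed.
sub translate_feature {
    my ($xdb, $term) = @_;
    return $feature_map{"${xdb}:${term}"};
}

# Translate an ENZYME "CA" line into a TrEMBL catalytic-activity comment.
# The result is kept on one line; wrapping over several CC lines is omitted.
sub translate_ca {
    my ($line) = @_;
    return unless $line =~ /^CA\s+(.+)$/;
    my $reaction = $1;
    $reaction .= '.' unless $reaction =~ /\.$/;   # TrEMBL comment ends with a period
    return "CC -!- CATALYTIC ACTIVITY: $reaction";
}

my $feature = translate_feature('Pfam', 'zf_C3HC4');
my $comment = translate_ca('CA L-ALANINE=D-ALANINE');
print "$feature\n$comment\n";
```

Keeping the mappings in a table rather than in code means a new synonym only needs a new table row.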
Automatic Translation • Introduce a standard (reference) database • It must be highly reliable and well curated • Example: SWISS-PROT [Diagram: XDB, Standard, Target]
Extract Reference Entries • Use the XDB to extract entries from the standard database • Example: Pfam PF00509 (Hemagglutinin): HEMA_IAVI7/P03435, HEMA_IANT6/P03436, HEMA_IAAIC/P03437, HEMA_IAX31/P03438, HEMA_IAME2/P03439, HEMA_IAEN7/P03440, HEMA_IABAN/P03441, HEMA_IADU3/P03442, HEMA_IADA1/P03443, HEMA_IADMA/P03444, HEMA_IADM1/P03445, HEMA_IADA2/P03446, HEMA_IASH5/P03447 [Diagram: Pfam, SWISS-PROT, TrEMBL]
Extract Common Annotation
132 entries read
131 ID HEMA_XXXXX
125 DE HEMAGGLUTININ PRECURSOR.
  6 DE HEMAGGLUTININ.
131 GN HA
130 CC -!- FUNCTION: HEMAGGLUTININ IS RESPONSIBLE FOR ATTACHING THE
130 CC VIRUS TO CELL RECEPTORS AND FOR INITIATING INFECTION.
125 CC -!- SUBUNIT: HOMOTRIMER. EACH OF THE MONOMER IS FORMED BY TWO
125 CC CHAINS (HA1 AND HA2) LINKED BY A DISULFIDE BOND.
 75 DR HSSP; P03437; 1HGD.
 31 DR HSSP; P03437; 1DLH.
131 KW HEMAGGLUTININ; GLYCOPROTEIN; ENVELOPE PROTEIN
102 KW SIGNAL
  1 KW COAT PROTEIN; POLYPROTEIN; 3D-STRUCTURE
130 FT CHAIN HA1 CHAIN.
107 FT CHAIN HA2 CHAIN.
102 FT SIGNAL
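The counts above suggest a simple procedure: count in how many reference entries each annotation line occurs and keep the lines that occur in nearly all of them. The sketch below illustrates this under stated assumptions: the routine name `common_annotation`, the fractional threshold, and the three miniature entries are all made up for illustration and are not taken from the talk.

```perl
use strict;
use warnings;

# For each annotation line, count the entries that contain it, then keep
# the lines found in at least $min_fraction of all entries.
sub common_annotation {
    my ($entries, $min_fraction) = @_;
    my %count;
    for my $entry (@$entries) {
        my %seen;                                   # count a line once per entry
        for my $line (split /\n/, $entry) {
            $count{$line}++ unless $seen{$line}++;
        }
    }
    my $total = @$entries;
    return grep { $count{$_} / $total >= $min_fraction } sort keys %count;
}

# Made-up miniature entries, one string per entry (not real SWISS-PROT data).
my @entries = (
    "DE HEMAGGLUTININ PRECURSOR.\nGN HA\nKW SIGNAL",
    "DE HEMAGGLUTININ PRECURSOR.\nGN HA",
    "DE HEMAGGLUTININ.\nGN HA\nKW SIGNAL",
);

# With a threshold of 0.9 only "GN HA" (3 of 3 entries) survives.
print "$_\n" for common_annotation(\@entries, 0.9);
```

A high threshold keeps only annotation shared by almost every reference entry, which trades coverage for reliability.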
Store Common Annotation • Store the pattern used and the extracted common annotation in a separate database [Diagram: XDB, Standard, Target, Common]
Add Annotation to Target • Extract entries from the target • Add the common annotation to the entries [Diagram: XDB, Standard, Target, Common]
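Adding the stored annotation to a target entry could look roughly like this. It is a sketch only: `add_annotation` is a hypothetical name, the entry is a made-up example, and the routine simply appends missing lines before a closing "//" line without restoring the usual ordering of line types.

```perl
use strict;
use warnings;

# Append annotation lines to a flat-file entry, skipping lines that are
# already present. New lines go before the "//" terminator if there is one.
sub add_annotation {
    my ($entry, @annotation) = @_;
    my @lines   = split /\n/, $entry;
    my %present = map { $_ => 1 } @lines;
    my @new     = grep { !$present{$_}++ } @annotation;
    my @tail    = (@lines && $lines[-1] eq '//') ? (pop @lines) : ();
    return join("\n", @lines, @new, @tail) . "\n";
}

# Made-up target entry; "GN HA" is already present and is not duplicated.
my $annotated = add_annotation("ID TEST_ENTRY\nGN HA\n//\n", 'GN HA', 'KW SIGNAL');
print $annotated;
```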
Modelling of the Rules • Definition of condition types • Definition of action types • Encoding the logic • Storage and retrieval of the rules • Version control • Monitoring the results
Formal Language for the Rules
#Comment
#RULE RU000001
#DATE 1997-04-23
?Condition
?PSAC PS00057
?SPOC PLANTA
!Action
!SPDE L-LACTATE DEHYDROGENASE
!ECNO 1.1.1.27
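A rule in this notation can be read line by line using its first character. The parser below is a hedged sketch: `parse_rule` and the layout of the returned hash are assumptions, and bare lines without a value (such as "#Comment", "?Condition" and "!Action") are treated as labels and skipped.

```perl
use strict;
use warnings;

# Parse one rule: '#' lines carry metadata, '?' lines are conditions,
# '!' lines are actions. Lines without a value are ignored.
sub parse_rule {
    my ($text) = @_;
    my %rule = (meta => {}, conditions => [], actions => []);
    for my $line (split /\n/, $text) {
        next unless $line =~ /^([#?!])(\S+)\s+(.+)$/;
        my ($prefix, $type, $value) = ($1, $2, $3);
        if    ($prefix eq '#') { $rule{meta}{$type} = $value }
        elsif ($prefix eq '?') { push @{ $rule{conditions} }, [$type, $value] }
        else                   { push @{ $rule{actions} },    [$type, $value] }
    }
    return \%rule;
}

my $rule_text = <<'END_RULE';
#RULE RU000001
#DATE 1997-04-23
?PSAC PS00057
?SPOC PLANTA
!SPDE L-LACTATE DEHYDROGENASE
!ECNO 1.1.1.27
END_RULE

my $rule = parse_rule($rule_text);
printf "%s: %d conditions, %d actions\n",
    $rule->{meta}{RULE}, scalar @{ $rule->{conditions} }, scalar @{ $rule->{actions} };
```

Each condition or action is kept as a pair of type code and argument, so the type code can later select the routine that implements it.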
Implementation of Condition Types • Every condition type must be implemented • Example: Perl routine for '?PSAC': does the protein have a link to a given PROSITE entry?
    # True if the entry held in $_ has a DR cross-reference to PROSITE entry $ac.
    sub condition_PSAC {
        my $ac = shift;
        return /^DR\s+PROSITE; \Q$ac\E/m;
    }
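Implementation of Action Types • Every action type must be implemented • Example: add the enzyme code to the entry.
    # Append the EC number to the first DE (description) line of the entry held in $_.
    sub action_ECNO {
        my $ecno = shift;
        s/^(DE.*)$/$1 (EC $ecno)/m;
    }
  or, as an SQL statement:
    insert into Trembl2Enzyme values (acc, ecno);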
Encoding the Logic • Any logical expression such as a AND (b OR c) BUT NOT d can be written without brackets as a AND b AND NOT d OR a AND c AND NOT d • Rules can be identified by their conditions: "a&b&-d|a&c&-d"
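Because the bracket-free form is a list of alternatives, each a list of plain or negated conditions, it can be evaluated with two nested loops. This sketch assumes the key syntax reads as "|" for OR, "&" for AND and a leading "-" for NOT; `rule_matches` is a hypothetical name.

```perl
use strict;
use warnings;

# Evaluate a bracket-free key such as "a&b&-d|a&c&-d" against a hash
# of the conditions that hold for one entry.
sub rule_matches {
    my ($key, $true) = @_;
    ALTERNATIVE: for my $alternative (split /\|/, $key) {
        for my $term (split /&/, $alternative) {
            my ($minus, $name) = $term =~ /^(-?)(.*)$/;
            my $holds  = $true->{$name} ? 1 : 0;
            my $wanted = $minus ? 0 : 1;           # negated terms must NOT hold
            next ALTERNATIVE if $holds != $wanted;
        }
        return 1;                                  # every term of this alternative passed
    }
    return 0;                                      # no alternative passed
}

my $key     = 'a&b&-d|a&c&-d';
my $verdict = rule_matches($key, { a => 1, c => 1 }) ? 'match' : 'no match';
print "$verdict\n";
```

With a, c true the second alternative holds; adding d would block both alternatives.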
Automatic Annotation of TrEMBL • Extract conditions from the XDBs • Group SWISS-PROT by conditions • Extract common annotation • Group TrEMBL by conditions • Add common annotation to TrEMBL [Diagram: ENZYME, Pfam, PROSITE, SWISS-PROT, TrEMBL, RuleBase]
Results: RuleBase • Source: PROSITE patterns • 262 rules • 597 conditions • 1099 actions • Result: • 2951 of 29330 new TrEMBL 5 entries • 1443 of 15078 new TrEMBL 6 entries • 9658 of 106330 existing TrEMBL 5 entries • 3254 of 140635 existing TrEMBL 6 entries
Discussion • Stable and reliable: successfully added 68000 lines to TrEMBL • Thresholds are set carefully, hence the low coverage • A restricted language works better than free text • Feedback loop between SWISS-PROT and TrEMBL • Rules may be implemented in a set-oriented language • Position-specific annotation may be improved by alignments • Independent of hierarchy • Based on multiple entries
Where to get TrEMBL • ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/
Credits • SWISS-PROT at EBI: Rolf Apweiler, Sergio Contrino, Wolfgang Fleischmann, Henning Hermjakob, Viv Junker, Fiona Lang, Claire O'Donovan, Michele Magrane, Maria Jesus Martin, Nicoletta Mitaritonna, Steffen Moeller, Stephanie Kappus • Collaborators: Amos Bairoch, Alain Gateau, Jean-Jacques Codani, Keith Tipton, MGD, Flybase, Pfam, network of > 200 external experts