Part I: Identifying sequences with …. Speaker : S. Gaj. Date 11-01-2005. Annotation. Annotation Best possible description available for a given sequence at the current time. How to annotate? Combining Alignment Tools Databases Datamining (scripts).
Part I: Identifying sequences with …
Speaker: S. Gaj
Date: 11-01-2005

Background: Annotation
Annotation
• Best possible description available for a given sequence at the current time.
How to annotate?
• Combining
  • Alignment tools
  • Databases
  • Datamining (scripts)

Background: Introduction
Global alignment
• Optimal alignment between two sequences containing as many characters of the query as possible.
  Ex: predicting evolutionary relationships between genes, …
Local alignment
• Optimal alignment between two sequences identifying identical area(s).
  Ex: identifying key molecular structures (S-bonds, α-helices, …)

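The difference between the two alignment types can be sketched with the classic dynamic-programming recurrences. This is a minimal illustration, assuming Needleman-Wunsch for the global case, Smith-Waterman for the local case, and an arbitrary scoring scheme (match +1, mismatch -1, linear gap -1); none of these choices come from the slides.

```python
def alignment_scores(a, b, match=1, mismatch=-1, gap=-1):
    """Return (global_score, local_score) for sequences a and b.

    Global (Needleman-Wunsch): every character of both sequences is aligned.
    Local (Smith-Waterman): only the best-scoring sub-region counts.
    The scoring values are illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1
    glob = [[0] * cols for _ in range(rows)]
    loc = [[0] * cols for _ in range(rows)]

    # Global alignment pays for leading gaps; local alignment does not.
    for i in range(1, rows):
        glob[i][0] = i * gap
    for j in range(1, cols):
        glob[0][j] = j * gap

    best_local = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if a[i - 1] == b[j - 1] else mismatch
            glob[i][j] = max(glob[i - 1][j - 1] + s,
                             glob[i - 1][j] + gap,
                             glob[i][j - 1] + gap)
            # The extra 0 lets a local alignment restart anywhere.
            loc[i][j] = max(0,
                            loc[i - 1][j - 1] + s,
                            loc[i - 1][j] + gap,
                            loc[i][j - 1] + gap)
            best_local = max(best_local, loc[i][j])
    return glob[-1][-1], best_local


# A short query that occurs exactly inside a longer sequence.
global_score, local_score = alignment_scores("GATACC", "AAAAGATACCTTTT")
print(global_score, local_score)
```

The short query is found intact by the local alignment, while the global alignment is penalised for the eight flanking characters it has to cover with gaps.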
BLAST: Introduction
Basic Local Alignment Search Tool
• Aligning an unknown sequence (query) against all sequences present in a chosen database, based on a score value.
• Aim: obtaining structural or functional information on the unknown sequence.

BLAST: Programs
• Different BLAST programs available
• Usable criteria:
  • E-value, Gap Opening Penalty (GOP), Gap Extension Penalty (GEP), …
• Terms
  • Query: sequence which will be aligned
  • Subject: sequence present in database
  • Hit: alignment result

BLAST: Common BLAST problems
• BlastN: clone seq against mRNA
• Sequencing error

  A T C G A T A C G C C A G G - A T A C C
  | | | | | | | | | | | | | |   | | | | |
  A T C G A T A C G C C A G G G A T A C C

• Solution: low penalty for GOP and GEP = 1

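The effect of the gap penalties on the alignment above can be sketched by scoring it twice. This is an illustration only: it assumes the common affine convention in which a gap of length k costs GOP + k × GEP, and the match, mismatch and "strict" penalty values are arbitrary examples, not settings taken from the slides.

```python
def score_alignment(top, bottom, match=1, mismatch=-3, gop=5, gep=2):
    """Score an already-aligned pair of sequences ('-' marks a gap).

    Assumed convention: a gap run of length k costs gop + k * gep.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    open_gap = None  # row holding the current gap run: "top", "bottom" or None
    for x, y in zip(top, bottom):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x == "-" or y == "-":
            row = "top" if x == "-" else "bottom"
            if row != open_gap:      # a new gap run starts here
                score -= gop
                open_gap = row
            score -= gep             # every gap position pays the extension
        else:
            score += match if x == y else mismatch
            open_gap = None
    return score


# The two rows of the alignment shown on the slide.
TOP = "ATCGATACGCCAGG-ATACC"
BOTTOM = "ATCGATACGCCAGGGATACC"

strict = score_alignment(TOP, BOTTOM, gop=5, gep=2)   # illustrative strict penalties
lenient = score_alignment(TOP, BOTTOM, gop=1, gep=1)  # low penalties
print(strict, lenient)
```

With low penalties, a single missing base caused by a sequencing error costs little compared with the 19 matching positions, so the hit keeps a high score.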
BLAST: Translation Problems
• 6-frame translation

>embl|J03801|HSLSZ Human lysozyme mRNA, complete cds with an Alu repeat in the 3' flank.

+3
+2   *  H  S  D  L  A  V  N  M  K  A  L  I  V  L  G
+1  L  A  L  *  P  S  S  Q  H  E  G  S  H  C  S  G  A
    ctagcactctgacctagcagtcaacatgaaggctctcattgttctggggct...
-1
-2
-3

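A 6-frame translation can be sketched in a few lines. The example below assumes the standard genetic code and translates the sequence fragment shown on the slide; the function names are illustrative, and incomplete or ambiguous codons are simply written as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate a DNA string codon by codon; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return the translations of frames +1..+3 and -1..-3."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


# Sequence fragment from the slide (J03801, human lysozyme mRNA).
fragment = "ctagcactctgacctagcagtcaacatgaaggctctcattgttctggggct"
frames = six_frame_translation(fragment)
for name in ("+1", "+2", "+3", "-1", "-2", "-3"):
    print(name, frames[name])
```

Frames +1 and +2 reproduce the two amino-acid rows shown on the slide; only frame +2 contains the start of the coding sequence (M K A L I V L G).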
BLAST: Common BLAST problems
[Figure: Gene X (introns, exons), full mRNA, splicing, mRNA, translation]

BLAST: Common BLAST problems
[Figure: mRNA with coding and non-coding regions; clones derived from mRNA]
• BlastX against protein sequence
• 3 possible hit situations:
  • Yields no protein hit
  • Aligns with protein in 1 of the 6 frames
  • Part perfect alignment or …

Databases: Introduction
Primary database:
• DNA sequence (EMBL, GenBank, …)
• Amino acid sequence (SwissProt, PIR, …)
• Protein structure (PDB, …)
Secondary database:
• Derived from primary DB
• DNA sequence (UniGene, RefSeq, …)
• Combination of all (LocusLink, ENSEMBL, …)
Structure:
• Flat file databases

Databases: Primary Databases
EMBL:
• DNA sequence
• Human: 4.126.190.851 nucleotides in 292.205 entries
• Clones, mRNA, (Riken) cDNA, …
• New sequences can be submitted by anyone.
• No curation check before acceptance.

Databases: Primary Databases
SwissProt:
• Amino acid sequence
• Human:
• Contains protein information
• SwissProt (EU), PIR (USA)
• Crosslinks to most informative DBs (PDB, OMIM)
• Part of UniProt consortium.
• Each addition needs validation by appointed curators.
• Highly curated

Databases: Secondary Databases
TrEMBL:
• Translated EMBL
• Hypothetical proteins
• After careful assessment: SpTrEMBL → SwissProt

Databases: Secondary Databases
UniGene:
• Automated clustering of sequences with high similarity
• Derived from GenBank / EMBL
• 1 consensus sequence
• Species-specific

Databases: Secondary Databases
LocusLink:
• Curated sequences
• Descriptive information about genetic loci
RefSeq:
• Non-redundant set of sequences.
• Genomic DNA, mRNA, protein
• Stable reference for gene identification and characterization.
• High curation

Databases: Database Quality?
[Figure: DNA, mRNA, Protein. EMBL: Submitter, Database Manager. SwissProt: Submitter, Curators, Database Manager.]

Databases: How to Annotate?
• BlastN against random nucleotide DB
• ESTs
• BlastN against structured nucleotide DB (UniGene, RefSeq)
• mRNA hits
• Sometimes not annotated at all
• Best information

Annotation: What do we have?
• Probe sequence
• Alignment tools (e.g. BLAST)
• Databases
!?! What to choose ?!?

Annotation: Possibilities?
1. Do it like everyone else does.
2. Make use of the curation properties of certain databases.
Goal: annotate as many genes with as much information as possible (e.g. SwissProt ID)

Annotation Techniques: 1st Approach - General
• “Done by most array manufacturers”
• Step-by-step approach:
  • BLAST sequences against a nucleotide database (preferably UniGene)
  • Extract high-quality (HQ) hits (>95%)
  • For each HQ hit, search crosslinks.
  • Find a well-described (SwissProt) ID for each sequence.

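The "extract HQ hits" step can be sketched as a filter over tabular BLAST output. This sketch assumes that ">95%" refers to percent identity and that the hits are in the 12-column tab-separated layout (query, subject, percent identity, …); the reporter and cluster identifiers below are hypothetical.

```python
def extract_hq_hits(tabular_lines, min_identity=95.0):
    """Keep, per query, the best hit whose percent identity exceeds the cutoff.

    Assumes tab-separated BLAST output with the columns
    query id, subject id, percent identity, ... (12-column tabular layout).
    """
    best = {}
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks and comment lines
            continue
        fields = line.split("\t")
        query, subject, identity = fields[0], fields[1], float(fields[2])
        if identity > min_identity:
            if query not in best or identity > best[query][1]:
                best[query] = (subject, identity)
    return best


# Hypothetical example hits; the IDs are made up for illustration.
example_lines = [
    "probe_1\tMm.1001\t99.1\t450\t4\t0\t1\t450\t10\t459\t0.0\t850",
    "probe_1\tMm.2002\t96.0\t300\t12\t0\t1\t300\t5\t304\t1e-150\t520",
    "probe_2\tMm.3003\t95.0\t200\t10\t0\t1\t200\t1\t200\t1e-90\t330",
    "probe_3\tMm.4004\t97.3\t380\t10\t1\t1\t380\t20\t398\t0.0\t640",
]
hq_hits = extract_hq_hits(example_lines)
print(hq_hits)
```

Each surviving subject ID would then be followed through the database crosslinks to look for a SwissProt ID; that lookup depends on the databases used and is not shown here.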
Annotation Techniques: 1st Approach - Concept

Annotation Techniques: 2nd Approach - General
• “Make use of present database curation”
• Other way around:
  • Use SwissProt to clean out EMBL
  • Result: “cleaned” EMBL (cEMBL) database with direct SP crosslinks
  • BLAST against cEMBL
  • Extract high-quality alignment hits (>95%)
  • Convert EMBL ID to SP ID

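The cleaning and ID-conversion steps can be sketched as follows. The sketch assumes the SwissProt flat-file layout, in which an `AC` line lists the accession numbers of an entry and `DR   EMBL; …` lines list the EMBL entries it is cross-referenced to. The sample records, sequences and all identifiers are hypothetical, and the scripts actually used for the talk are not shown in the slides.

```python
def embl_to_swissprot(flat_file_text):
    """Map EMBL accessions to the primary SwissProt accession citing them."""
    mapping = {}
    primary_ac = None
    for line in flat_file_text.splitlines():
        if line.startswith("AC") and primary_ac is None:
            # The first accession on the first AC line is the primary one.
            primary_ac = line[2:].split(";")[0].strip()
        elif line.startswith("DR") and primary_ac is not None:
            parts = [p.strip() for p in line[2:].split(";")]
            if parts[0] == "EMBL" and len(parts) > 1:
                mapping.setdefault(parts[1], primary_ac)
        elif line.startswith("//"):            # end of the current entry
            primary_ac = None
    return mapping


def clean_embl(embl_sequences, mapping):
    """Keep only the EMBL entries that SwissProt cross-references (cEMBL)."""
    return {acc: seq for acc, seq in embl_sequences.items() if acc in mapping}


# Hypothetical SwissProt-style records; all identifiers are made up.
SAMPLE = """\
ID   TEST1_MOUSE
AC   SP_FAKE_1;
DR   EMBL; EMBL_FAKE_1; PID_FAKE_1; -; mRNA.
DR   PDB; FAKE; X-ray.
//
ID   TEST2_MOUSE
AC   SP_FAKE_2; SP_FAKE_2B;
DR   EMBL; EMBL_FAKE_2; PID_FAKE_2; -; mRNA.
DR   EMBL; EMBL_FAKE_3; PID_FAKE_3; -; Genomic_DNA.
//
"""

sp_by_embl = embl_to_swissprot(SAMPLE)
embl = {"EMBL_FAKE_1": "ACGT", "EMBL_FAKE_3": "GGCC", "EMBL_FAKE_9": "TTAA"}
cembl = clean_embl(embl, sp_by_embl)

# After BLASTing against cEMBL, an HQ hit's EMBL ID converts directly to an SP ID.
print(sp_by_embl["EMBL_FAKE_3"])
```

Because every entry left in cEMBL already carries a SwissProt crosslink, the final conversion is a single dictionary lookup instead of a chain of crosslink searches.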
Annotation Techniques: 2nd Approach - Concept

Results: Annotating Incyte Reporters
Total: 13.497
• cEMBL approach: 2.898 (21,47%) SP-IDs
• DM approach: 10.013 (74,18%) UG-IDs, in which
  • M = 4.723 (34,9%) SP-IDs
  • MR = 5.147 (38,1%) SP-IDs
  • MRH = 6.641 (49,2%) SP-IDs

Results: Annotating Incyte Reporters
All reporters present on “Incyte Mouse UniGene 1” converted
Total: 9.596 reporters
• Old annotation: 9.370 (97,6%) UG-IDs, in which
  • Non-existing UG-IDs = 5.713 (59,5%)
  • M = 1.939 (20,2%) SP-IDs
  • MR = 2.096 (21,8%) SP-IDs
  • MRH = 2.582 (26,9%) SP-IDs
• Datamining approach: 8.532 (88,9%) UG-IDs, in which
  • M = 4.145 (43,2%) SP-IDs
  • MR = 4.499 (38,1%) SP-IDs
  • MRH = 5.576 (60,1%) SP-IDs
• Custom EMBL approach: 2.898 (30,2%) SP-IDs

Results: Annotating Incyte Reporters
Combined methods, “Incyte Mouse UniGene 1” reporters
Total: 9.596 reporters
• No annotation: 1.062 (11%) reporters
• Annotated with SP-ID: 5.895 (61,3%) reporters, of which
  • 2.184 (22,7%) identical SP-IDs
  • 532 (5%) reporters with improved SP-IDs by EMBL-method
  • 174 (1,8%) reporters with different mouse SP-IDs
  • 5 reporters found only by EMBL-method

Conclusions
• Annotation is much needed
• Array sequences can point to different genes
• Direct translation into protein is not the best option:
  • Sequencing errors
  • Addition or deletion of nucleotides
  • 6-frame window
• Public nucleotide databases are redundant:
  • Sequencing errors
  • Differences in sequence length
  • Attachment of vector sequence

Questions? End