Part I: Identifying sequences with …

Part I: Identifying sequences with …. Speaker : S. Gaj. Date 11-01-2005. Annotation. Annotation Best possible description available for a given sequence at the current time. How to annotate? Combining Alignment Tools Databases Datamining (scripts).

Presentation Transcript


  1. Part I: Identifying sequences with … Speaker: S. Gaj. Date: 11-01-2005

  2. Annotation • Best possible description available for a given sequence at the current time. How to annotate? • By combining: alignment tools, databases, and data mining (scripts). (Section: Background)

  3. Microarrays

  4. Introduction. Global alignment: • Optimal alignment between two sequences that covers as many characters of the query as possible. Ex.: predicting evolutionary relationships between genes, … Local alignment: • Optimal alignment between two sequences that identifies identical or highly similar region(s). Ex.: identifying key molecular structures (S-bonds, α-helices, …)
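The difference between the two alignment types can be illustrated with a minimal sketch. The slide names no algorithm; the code below assumes the textbook dynamic-programming methods (Needleman-Wunsch for global, Smith-Waterman for local) with a simple linear gap penalty, and the scoring values are illustrative assumptions. BLAST itself is a faster heuristic and does not run this full computation.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman score: best-scoring pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, so an alignment may start anywhere.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


query, subject = "ACGT", "TTACGTTT"
print(global_score(query, subject), local_score(query, subject))
```

With these toy sequences the local score rewards the shared ACGT region only, while the global score also pays for the unmatched flanks of the longer sequence.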

  5. Introduction: Basic Local Alignment Search Tool (BLAST) • Aligns an unknown sequence (the query) against all sequences present in a chosen database and scores each alignment. • Aim: obtaining structural or functional information about the unknown sequence. (Section: BLAST)

  6. Programs • Different BLAST programs are available. • Usable criteria: E-value, Gap Opening Penalty (GOP), Gap Extension Penalty (GEP), … • Terms: Query = the sequence that will be aligned; Subject = a sequence present in the database; Hit = an alignment result.

  7. Common BLAST problems • BlastN: clone sequence vs. mRNA, where a sequencing error shows up as a gap in the alignment:
      A T C G A T A C G C C A G G - A T A C C
      | | | | | | | | | | | | | |   | | | | |
      A T C G A T A C G C C A G G G A T A C C
      • Solution: Low penalty for GOP and GEP = 1
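The effect of the gap penalties on the alignment from this slide can be sketched as follows. The function name, the match/mismatch values, and the convention that a gap run of length k costs GOP + GEP * k are assumptions made for illustration; BLAST programs may define the two penalties differently.

```python
def score_alignment(top, bottom, match=1, mismatch=-1, gop=1, gep=1):
    """Return (identity, score) for two equal-length gapped strings.

    Assumed convention: a gap run of length k costs gop + gep * k.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    identical = 0
    score = 0
    in_gap = False
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            # Opening a gap costs gop once; every gap column costs gep.
            score -= gep + (0 if in_gap else gop)
            in_gap = True
        else:
            in_gap = False
            if x == y:
                identical += 1
                score += match
            else:
                score += mismatch
    return identical / len(top), score


top = "ATCGATACGCCAGG-ATACC"
bottom = "ATCGATACGCCAGGGATACC"
identity, low_penalty = score_alignment(top, bottom, gop=1, gep=1)
_, high_penalty = score_alignment(top, bottom, gop=5, gep=2)
print(identity, low_penalty, high_penalty)
```

With low penalties a single-base sequencing error costs the alignment little, so an otherwise perfect hit keeps a high score.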

  8. Translation Problems • 6-Frame translation
      >embl|J03801|HSLSZ Human lysozyme mRNA, complete cds with an Alu repeat in the 3' flank.
      +1  L  A  L  *  P  S  S  Q  H  E  G  S  H  C  S  G  A
          ctagcactctgacctagcagtcaacatgaaggctctcattgttctggggct...

  9. Translation Problems • 6-Frame translation
      >embl|J03801|HSLSZ Human lysozyme mRNA, complete cds with an Alu repeat in the 3' flank.
      +3
      +2   *  H  S  D  L  A  V  N  M  K  A  L  I  V  L  G
      +1  L  A  L  *  P  S  S  Q  H  E  G  S  H  C  S  G  A
          ctagcactctgacctagcagtcaacatgaaggctctcattgttctggggct...
      -1
      -2
      -3
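A minimal sketch of a 6-frame translation, using the standard genetic code and the sequence fragment from the slide. The function names are illustrative, and the numbering of the reverse frames (frame -1 starting at the last base of the forward strand) is an assumption, since tools differ on that convention.

```python
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, codons enumerated in T, C, A, G order.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the three forward and three reverse-complement frames."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


seq = "ctagcactctgacctagcagtcaacatgaaggctctcattgttctggggct"
frames = six_frames(seq)
for name in ("+1", "+2", "+3", "-1", "-2", "-3"):
    print(name, frames[name])
```

Frames +1 and +2 reproduce the two amino-acid strings shown on the slide. A single inserted or deleted nucleotide moves the correct reading from one frame to another, which is why a raw translation of a clone sequence is fragile.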

  10. Common BLAST problems [Figure labels: Gene X with introns and exons, full mRNA, splicing, mRNA, translation]

  11. Common BLAST problems • Clones derived from mRNA • BlastX against protein sequence: 3 possible hit situations [Figure: mRNA with coding region and non-coding region]

  12. Common BLAST problems [Figure: mRNA with coding region and non-coding region] → Yields no protein hit → Aligns with protein in 1 of the 6 frames → Part perfect alignment or

  13. Part II: Databases and annotation

  14. Introduction. Primary database: • DNA sequence (EMBL, GenBank, …) • Amino acid sequence (SwissProt, PIR, …) • Protein structure (PDB, …) Secondary database: • Derived from primary DB • DNA sequence (UniGene, RefSeq, …) • Combination of all (LocusLink, ENSEMBL, …) Structure: • Flat file databases

  15. Primary Databases. EMBL: • DNA sequence • Human: 4,126,190,851 nucleotides in 292,205 entries • Clones, mRNA, (Riken) cDNA, … • New sequences can be submitted by anyone. • No curation check before admission.

  16. Primary Databases. SwissProt: • Amino acid sequence • Human: • Contains protein information • SwissProt (EU) ↔ PIR (USA) • Crosslinks to most informative DB (PDB, OMIM) • Part of UniProt consortium. • Each addition needs validation by appointed curators. • Highly curated

  17. Secondary Databases. TrEMBL: • Translated EMBL • Hypothetical proteins • After careful assessment → SpTrEMBL → SwissProt

  18. Secondary Databases. UniGene: • Automated clustering of sequences with high similarity • Derived from GenBank / EMBL • 1 consensus sequence • Species-specific

  19. Secondary Databases. LocusLink: • Curated sequences • Descriptive information about genetic loci. RefSeq: • Non-redundant set of sequences. • Genomic DNA, mRNA, protein • Stable reference for gene identification and characterization. • Highly curated

  20. Database Quality? [Figure: DNA, mRNA, Protein. EMBL: Submitter → Database Manager. SwissProt: Submitter → Curators → Database Manager.]

  21. How to Annotate? • BlastN against random nucleotide DB • ESTs • BlastN against structured nucleotide DB (UniGene, RefSeq) • mRNA hits • Sometimes not annotated at all • Best information

  22. Microarrays

  23. Part III: Annotation Techniques

  24. What do we have? • Probe sequence • Alignment Tools (e.g. BLAST) • Databases !?! What to choose ?!? Annotation

  25. Possibilities? 1. Do it like everyone else does. 2. Make use of the curation of certain databases. Goal: annotate as many genes with as much information as possible (e.g. a SwissProt ID).

  26. 1st Approach - General • “Done by most array manufacturers” • Step-by-step approach: 1. BLAST the sequences against a nucleotide database (preferably UniGene). 2. Extract high-quality (HQ) hits (>95%). 3. For each HQ hit, search for crosslinks. 4. Find a well-described (SwissProt) ID for each sequence.
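The step-by-step approach above can be sketched as a small filter-and-lookup script. Several things here are assumptions rather than the speaker's actual pipeline: the hits are taken to be in the 12-column tabular layout produced by BLAST+ with `-outfmt 6`, the ">95%" cutoff is read as percent identity, the best hit is chosen by bit score, and all reporter, cluster, and protein IDs are made up.

```python
import csv
import io


def best_hq_hits(tabular_text, min_identity=95.0):
    """Keep, per query, the highest-bitscore hit above the identity cutoff."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        query, subject = row[0], row[1]
        identity, bitscore = float(row[2]), float(row[11])
        if identity <= min_identity:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {q: s for q, (s, _) in best.items()}


def annotate(hits, crosslinks):
    """Follow the crosslink of each hit; None marks a reporter without an ID."""
    return {q: crosslinks.get(s) for q, s in hits.items()}


# Hypothetical data: columns are qseqid, sseqid, pident, length, mismatch,
# gapopen, qstart, qend, sstart, send, evalue, bitscore.
BLAST_OUTPUT = "\n".join([
    "rep1\tUG_A\t99.1\t450\t4\t0\t1\t450\t10\t459\t1e-120\t800",
    "rep1\tUG_B\t96.0\t300\t12\t0\t1\t300\t5\t304\t1e-80\t500",
    "rep2\tUG_C\t88.0\t200\t24\t0\t1\t200\t1\t200\t1e-30\t150",
    "rep3\tUG_D\t97.5\t400\t10\t0\t1\t400\t1\t400\t1e-110\t700",
])
CROSSLINKS = {"UG_A": "SP_1", "UG_B": "SP_2"}  # hypothetical UniGene -> SwissProt

hits = best_hq_hits(BLAST_OUTPUT)
annotation = annotate(hits, CROSSLINKS)
print(annotation)
```

In this toy run, rep2 is dropped because its only hit falls below the cutoff, and rep3 keeps a high-quality hit that has no crosslink to a protein ID.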

  27. 1st Approach - Concept

  28. 2nd Approach - General • “Make use of present database curation” • The other way around: 1. Use SwissProt to clean out EMBL. Result: a “cleaned” EMBL database (cEMBL) with direct SP crosslinks. 2. BLAST against cEMBL. 3. Extract high-quality alignment hits (>95%). 4. Convert EMBL ID to SP ID.
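One possible reading of “use SwissProt to clean out EMBL” is to keep only those EMBL entries that a SwissProt entry cross-references. The sketch below assumes that reading. The toy records are modelled on the two-letter line codes of SwissProt flat files (`AC` for accessions, `DR` for database cross-references, `//` as record terminator), but their exact field layout, all accessions, and the function names are made up for illustration.

```python
def embl_to_swissprot(flat_file_text):
    """Map EMBL accessions to the SwissProt accession that cites them."""
    mapping = {}
    accession = None
    for line in flat_file_text.splitlines():
        if line.startswith("AC") and accession is None:
            # Use the first (primary) accession of the record.
            accession = line[2:].split(";")[0].strip()
        elif line.startswith("DR") and accession is not None:
            fields = [f.strip() for f in line[2:].split(";")]
            if fields[0] == "EMBL" and len(fields) > 1:
                mapping.setdefault(fields[1], accession)
        elif line.startswith("//"):
            accession = None
    return mapping


def clean_embl(embl_records, mapping):
    """Keep only EMBL sequences that a SwissProt entry points to."""
    return {acc: seq for acc, seq in embl_records.items() if acc in mapping}


# Hypothetical flat-file fragment and EMBL sequences.
SWISSPROT_TEXT = "\n".join([
    "ID   TOY1_MOUSE",
    "AC   SP_1;",
    "DR   EMBL; EM_1; -; -; mRNA.",
    "DR   PDB; 0XXX; X-ray.",
    "//",
    "ID   TOY2_MOUSE",
    "AC   SP_2; SP_9;",
    "DR   EMBL; EM_2; -; -; mRNA.",
    "DR   EMBL; EM_3; -; -; Genomic_DNA.",
    "//",
])
EMBL_RECORDS = {"EM_1": "acgt", "EM_2": "ttga", "EM_4": "ccgg"}

mapping = embl_to_swissprot(SWISSPROT_TEXT)
cembl = clean_embl(EMBL_RECORDS, mapping)
print(sorted(cembl), mapping["EM_2"])
```

Once the BLAST step against the cleaned set returns an EMBL accession, the final conversion is a single dictionary lookup in `mapping`.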

  29. 2nd Approach - Concept

  30. Annotating Incyte Reporters • Total: 13,497 • cEMBL approach: 2,898 (21.47%) SP-IDs • DM approach: 10,013 (74.18%) UG-IDs, in which M = 4,723 (34.9%) SP-IDs; MR = 5,147 (38.1%) SP-IDs; MRH = 6,641 (49.2%) SP-IDs (Section: Results)

  31. Annotating Incyte Reporters • All reporters present on “Incyte Mouse UniGene 1” converted • Total: 9,596 reporters • Old annotation: 9,370 (97.6%) UG-IDs, in which non-existing UG-IDs = 5,713 (59.5%); M = 1,939 (20.2%) SP-IDs; MR = 2,096 (21.8%) SP-IDs; MRH = 2,582 (26.9%) SP-IDs • Datamining approach: 8,532 (88.9%) UG-IDs, in which M = 4,145 (43.2%) SP-IDs; MR = 4,499 (46.9%) SP-IDs; MRH = 5,576 (60.1%) SP-IDs • Custom EMBL approach: 2,898 (30.2%) SP-IDs

  32. Annotating Incyte Reporters • Combined methods, “Incyte Mouse UniGene 1” reporters • Total: 9,596 reporters • No annotation: 1,062 (11%) reporters • Annotated with SP-ID: 5,895 (61.3%) reporters, of which 2,184 (22.7%) identical SP-IDs; 532 (5%) reporters with improved SP-IDs by EMBL method; 174 (1.8%) reporters with different mouse SP-IDs; 5 reporters found only by EMBL method
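How two annotation runs might be merged and compared can be sketched as below. The slide does not say how the methods were combined, so the category names, the rule of preferring the EMBL-derived ID when both methods give one, and the reporter and protein IDs are all assumptions for illustration.

```python
from collections import Counter


def combine(datamining, embl):
    """Merge two reporter -> SP-ID maps and report how they relate."""
    merged = {}
    status = {}
    for reporter in sorted(set(datamining) | set(embl)):
        dm_id, embl_id = datamining.get(reporter), embl.get(reporter)
        if dm_id and embl_id:
            status[reporter] = "identical" if dm_id == embl_id else "different"
        elif embl_id:
            status[reporter] = "embl_only"
        elif dm_id:
            status[reporter] = "datamining_only"
        else:
            status[reporter] = "none"
        # Assumed preference: the EMBL-derived ID wins when both exist.
        merged[reporter] = embl_id or dm_id
    return merged, status


# Hypothetical annotations from the two approaches.
DATAMINING = {"r1": "SP_1", "r2": "SP_2", "r3": None, "r4": "SP_4", "r5": None}
EMBL = {"r1": "SP_1", "r2": "SP_7", "r3": "SP_3"}

merged, status = combine(DATAMINING, EMBL)
counts = Counter(status.values())
for category, count in sorted(counts.items()):
    print(f"{category}: {count} ({100 * count / len(status):.1f}%)")
```

Counting the categories this way gives the kind of breakdown reported on the slide: identical IDs, differing IDs, reporters found by one method only, and reporters without annotation.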

  33. Conclusions • Annotation is much needed. • Array sequences can point to different genes. • Direct translation into protein is not the best option: sequencing errors; addition or deletion of nucleotides; 6-frame window. • Public nucleotide databases are redundant: sequencing errors; differences in sequence length; attachment of vector sequence.

  34. Questions? End
