Mining hidden information from your 454 data using modular and database oriented methods. Joachim De Schrijver. Overview. Short introduction on 454 sequencing Variant Identification pipeline Possibilities of a DB oriented pipeline Examples Coverage Improving PCR Fast Q assessment
Mining hidden information from your 454 data using modular and database oriented methods
Joachim De Schrijver
Overview
• Short introduction on 454 sequencing
• Variant Identification pipeline
• Possibilities of a DB oriented pipeline
• Examples
    • Coverage
    • Improving PCR
    • Fast Q assessment
    • Homopolymers
Introduction (i)
• Roche/454 GS-FLX sequencing:
    • Pyrosequencing
    • ± 400,000 reads/run
    • Average length: 200-250 bp
• Applications:
    • Resequencing: Variant identification
    • De novo (genome) sequencing: Assembly of new regions, plasmids or entire genomes
• Standard software:
    • Variants: Amplicon Variant Analyzer (AVA)
    • Assembly: Standard 454 assembler
Introduction (ii)
• Standard software
    + Easy to use
    + Reproducible results on similar datasets
    + GUI (graphical user interface)
    - No answer for ‘non-standard’ questions
        • Methylation experiments
        • Different types of experiments grouped together
        • …
    - What about ‘hidden’ information?
        • Homopolymer error rates
        • Quality score ~ length of sequenced read
        • ‘Multirun’ information
        • …
Variant Identification Pipeline (i)
• Modular and database oriented pipeline
• Modular:
    • Efficient planning
    • Scalable
• Database (DB):
    • No loss of data
    • Grouping several runs together
Variant Identification Pipeline (ii)
• Basic idea: Data is processed once and stored in the DB. Results (reports) are calculated ‘on the fly’ from the DB data.
    • Fast & efficient
    • Calculations only happen once
    • Everybody can access the database without risk of data modification
    • Reporting is independent of the data processing
• Paper: De Schrijver et al. 2009. Analysing 454 sequences with a modular and database oriented Variant Identification Pipeline
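The "process once, report on the fly" idea can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table name `mapped_reads`, its columns, and the toy rows are assumptions made for this example, not the actual VIP schema or data.

```python
import sqlite3

# Hypothetical schema; the real VIP tables are not described in the slides.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mapped_reads ("
    " read_id TEXT PRIMARY KEY, run TEXT, lane INTEGER, mid TEXT,"
    " amplicon TEXT, read_length INTEGER)"
)

# Processing step: happens once per sequencing run and only writes to the DB.
processed = [
    ("r1", "run1", 1, "MID1", "ampA", 240),
    ("r2", "run1", 1, "MID2", "ampA", 231),
    ("r3", "run1", 2, "MID1", "ampB", 198),
    ("r4", "run2", 1, "MID1", "ampA", 245),
]
conn.executemany("INSERT INTO mapped_reads VALUES (?, ?, ?, ?, ?, ?)", processed)
conn.commit()


def reads_per_run(connection):
    """Reporting step: a SELECT evaluated on demand; stored data is not modified."""
    rows = connection.execute(
        "SELECT run, COUNT(*) FROM mapped_reads GROUP BY run ORDER BY run"
    )
    return dict(rows.fetchall())


report = reads_per_run(conn)
print(report)
```

Because the report is just a query, several runs stored in the same table are grouped together without reprocessing any of them.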
Possibilities of a DB oriented pipeline
• VIP originally developed for variant identification
• Now being used in:
    • Amplicon resequencing
    • De novo shotgun
    • Methylation
    • ~ Solexa experiments
• ‘Hidden’ data can be extracted using intelligent querying strategies
• Results per lane/Multiplex MID/run…
Example: Detailed coverage
• Coverage can be calculated per
    • Lane
    • MID
    • Amplicon
    • Base position
• Assessment of errors (PCR dropouts vs. human errors)
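As a rough sketch of how such coverage breakdowns could be queried, the example below stores one row per mapped read and counts rows per lane, MID, amplicon, or base position. The `alignments` table, its columns (1-based inclusive `start_pos`/`end_pos`), and the toy rows are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per mapped read, positions are 1-based inclusive.
conn.execute(
    "CREATE TABLE alignments ("
    " lane INTEGER, mid TEXT, amplicon TEXT,"
    " start_pos INTEGER, end_pos INTEGER)"
)
conn.executemany(
    "INSERT INTO alignments VALUES (?, ?, ?, ?, ?)",
    [
        (1, "MID1", "ampA", 1, 200),
        (1, "MID1", "ampA", 1, 150),
        (1, "MID2", "ampA", 50, 200),
        (2, "MID1", "ampB", 1, 180),
    ],
)


def coverage_by(connection, column):
    """Number of mapped reads per lane, MID or amplicon."""
    if column not in {"lane", "mid", "amplicon"}:
        raise ValueError(f"unsupported grouping column: {column}")
    # The column name is whitelisted above, so interpolating it is safe here.
    query = f"SELECT {column}, COUNT(*) FROM alignments GROUP BY {column}"
    return dict(connection.execute(query).fetchall())


def coverage_at(connection, amplicon, position):
    """Number of reads of one amplicon that span a given base position."""
    row = connection.execute(
        "SELECT COUNT(*) FROM alignments"
        " WHERE amplicon = ? AND start_pos <= ? AND end_pos >= ?",
        (amplicon, position, position),
    ).fetchone()
    return row[0]


print(coverage_by(conn, "lane"))
print(coverage_at(conn, "ampA", 100))
```

An amplicon with zero reads in one MID but normal coverage in the others would stand out in the per-MID breakdown, which is the kind of signal the slide refers to when separating PCR dropouts from handling errors.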
Example: Improving PCR
• Amplicon resequencing experiment
    • Goal: Variant identification
• Length distributions
    • Mapped
    • Unmapped
    • ‘Short’ mapped
• Additional length separation + improved PCR
    • Result: Improved efficiency
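A length distribution split into mapped, unmapped and ‘short’ mapped reads could be derived with a single grouped query, as sketched below. The `reads` table, the 100 bp cut-off for a ‘short’ read, and the 50 bp bin width are all assumptions for this example; the slides do not state the thresholds that were actually used.

```python
import sqlite3

SHORT_THRESHOLD = 100  # hypothetical cut-off (bp) for a 'short' mapped read
BIN_WIDTH = 50         # hypothetical histogram bin width (bp)

conn = sqlite3.connect(":memory:")
# Hypothetical table: mapped is 1 for mapped reads and 0 for unmapped reads.
conn.execute("CREATE TABLE reads (read_length INTEGER, mapped INTEGER)")
conn.executemany(
    "INSERT INTO reads VALUES (?, ?)",
    [(240, 1), (235, 1), (60, 1), (75, 1), (40, 0), (220, 0)],
)


def length_distribution(connection):
    """Count reads per (category, length bin); bins are labelled by their start."""
    rows = connection.execute(
        "SELECT CASE"
        "         WHEN mapped = 0 THEN 'unmapped'"
        "         WHEN read_length < ? THEN 'short mapped'"
        "         ELSE 'mapped'"
        "       END AS category,"
        "       (read_length / ?) * ? AS bin_start,"  # integer division in SQLite
        "       COUNT(*)"
        " FROM reads GROUP BY category, bin_start",
        (SHORT_THRESHOLD, BIN_WIDTH, BIN_WIDTH),
    )
    return {(category, bin_start): n for category, bin_start, n in rows}


distribution = length_distribution(conn)
for key in sorted(distribution):
    print(key, distribution[key])
```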
Example: Homopolymers
• Can the length of a homopolymer be assessed using the Q score?
• Yes, when the homopolymer length is < 6 bp
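To explore a relationship like this, one first needs the Q scores of each homopolymer run. The sketch below extracts runs from a read and averages the Q score of the last base of each run per run length. Using the last base as the summary is an assumption made here, since the slides do not say which statistic was used, and the reads are toy values, not measurements.

```python
from collections import defaultdict
from itertools import groupby
from statistics import mean


def homopolymer_runs(sequence, qualities):
    """Yield (base, run_length, run_qualities) for each run of identical bases."""
    index = 0
    for base, group in groupby(sequence):
        length = len(list(group))
        yield base, length, qualities[index:index + length]
        index += length


def mean_last_q_by_length(reads):
    """Mean Q score of the last base of each run, keyed by run length."""
    by_length = defaultdict(list)
    for sequence, qualities in reads:
        for _base, length, run_q in homopolymer_runs(sequence, qualities):
            by_length[length].append(run_q[-1])
    return {length: mean(values) for length, values in sorted(by_length.items())}


# Toy reads (sequence, per-base Q scores); purely illustrative values.
toy_reads = [
    ("ACCGTTT", [40, 38, 30, 39, 36, 28, 20]),
    ("GGGAAC", [37, 29, 21, 35, 27, 40]),
]
summary = mean_last_q_by_length(toy_reads)
print(summary)
```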
Example: Q assessment
• Fast assessment of the quality of a run
• [Figure: two panels, ‘Lab work OK’ and ‘Errors in lab work’]
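One simple way to get a quick quality profile of a run is the mean Q score at each read position, which also reflects the "quality score ~ read length" relationship mentioned earlier. The slides do not specify how the assessment was computed, so the function below is only a sketch under that assumption, applied to toy values.

```python
from statistics import mean


def mean_q_per_position(quality_lists):
    """Mean Q score at each read position, over all reads that reach it."""
    if not quality_lists:
        return []
    longest = max(len(q) for q in quality_lists)
    profile = []
    for position in range(longest):
        scores = [q[position] for q in quality_lists if len(q) > position]
        profile.append(mean(scores))
    return profile


# Toy per-read Q scores; reads may differ in length.
toy_run = [
    [40, 38, 30],
    [36, 34],
    [38, 30, 20, 10],
]
profile = mean_q_per_position(toy_run)
print(profile)
```

Profiles computed this way for several runs can be compared side by side, so a run whose curve drops much earlier than the others is easy to spot.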
Acknowledgements
• Biobix – UGent
    • Wim Van Criekinge
    • Tim De Meyer
    • Geert Trooskens
    • Tom Vandekerkhove
    • Leander Van Neste
    • Gerben Mensschaert
• CMG – UZ Gent
    • Jo Vandesompele
    • Jan Hellemans
    • Filip Pattyn
    • Steve Lefever
    • Kim Deleeneer
    • Jean-Pierre Renard
• NXT-GNT
    • Paul Coucke
    • Sofie Bekaert
    • Filip Van Nieuwerburgh
    • Dieter Deforce
    • Wim Van Criekinge
    • Jo Vandesompele
Questions ? Joachim.deschrijver@ugent.be