Wellcome Trust Medical Photographic Library. International Tomato Finishing Workshop. Wellcome Trust Sanger Institute April 2007. Overview. Tomato Genome Finishing Standards Document on SGN. WTSI Finishing Strategy. WTSI Finishing Pipeline.
Wellcome Trust Medical Photographic Library International Tomato Finishing Workshop Wellcome Trust Sanger Institute April 2007
Overview Tomato Genome Finishing Standards Document on SGN WTSI Finishing Strategy WTSI Finishing Pipeline Contiguous Finished Sequence to HTGS Phase 3 All bases above phred 30 Base Error rate <1:10,000 Discussion on Day 2
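The two quality figures on this slide can be related through the standard phred definition, Q = -10 log10(P). The sketch below is only an illustration of that arithmetic; the function names are hypothetical and not part of any WTSI tool.

```python
import math

def phred_to_error_probability(q):
    """Standard phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def error_probability_to_phred(p):
    """Inverse of the definition above."""
    return -10 * math.log10(p)

# Per-base threshold from the slide: phred 30 is about 1 error in 1,000 for that base.
print(phred_to_error_probability(30))

# The overall target of fewer than 1 error in 10,000 bases corresponds to phred 40.
print(error_probability_to_phred(1 / 10_000))
```

Read this way, phred 30 is a floor for each individual base, while the 1:10,000 error rate is a separate requirement on the finished sequence as a whole.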
WTSI Finishing Pipeline Established Clone Pipeline Auto-Prefinishing Shotgun Sequencing Manual Finishing QC Final EMBL Submission HTGS 3 Finishing Software
Finishing Software
• Main software tools used in the WTSI finishing process:
• Sequence data viewed in Gap4 databases (Staden)
• (assemblies created using phrap)
• Read pair viewer – Orchid (Flowers)
• Restriction Digest Viewer – Confirm (Attwood – WTSI)
• Sequence plot viewer – Dotter (Sonnhammer)
• (URLs on Handout)
• Used throughout the finishing process and for final confirmation of assembly (and QC)
WTSI Finishing Strategy BAC Confirmation Identify Region to be Finished Contig Order and Orientation Assessment of Gap Sizes and Type Selection of Finishing Reactions Improvement of Low Quality Sequence Confirmation of Contiguous Assembly
Finishing Strategy – Getting Started BAC confirmation Confirming BAC Clone Ends Checking for overlapping BACs Identifying Region to be Finished Prevents overlaps being finished twice Confirmation of clone placement on map Resources available for BAC confirmation
Identifying the region to be finished in your BAC: are overlapping BACs available?
No overlaps available: Confirm BAC ends (BES). Finish whole BAC insert.
Overlapping BAC available: Confirm ends of your BAC and of the overlapping BAC (BES, or sequence if available). Confirm status of the overlap (shotgun, in finishing, finished). Confirm region to be finished (who is finishing the overlap). If the overlap is not being finished by someone else, finish the whole BAC insert. Finish 2Kb into the overlap if it is already finished or being finished by someone else.
Resources available for BAC confirmation
• SOL Genomics Network (SGN)
• BES
• Marker verification
• Blast
  • repeats, unigenes, ESTs, markers, overlaps
Aligning BES to BAC sequence data BACs are clipped to the cloning vector cutsite, dependent on the library used: SL_MboI Library = GATC; LE_HBa Library = AAGCTT
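As a rough illustration of working with these cut sites, the sketch below lists where a library's recognition sequence occurs in a sequence and checks whether a BAC end sequence begins at one. The dictionary, the function names and the exact-match approach are assumptions made for the example; in practice the alignment would be done with BLAST or within Gap4.

```python
# Recognition sequences as given on the slide. Both are palindromic, so
# searching one strand is enough. Names here are illustrative only.
LIBRARY_CUTSITES = {
    "SL_MboI": "GATC",
    "LE_HBa": "AAGCTT",
}

def cutsite_positions(sequence, library):
    """Return 0-based start positions of the library cut site in the sequence."""
    site = LIBRARY_CUTSITES[library]
    seq = sequence.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions

def starts_at_cutsite(bes, library):
    """True if a BAC end sequence begins with the expected cut site."""
    return bes.upper().startswith(LIBRARY_CUTSITES[library])

example = "TTGATCAAGCTTGGGATC"  # made-up sequence, not real tomato data
print(cutsite_positions(example, "SL_MboI"))
print(cutsite_positions(example, "LE_HBa"))
```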
Sequence Resources Searching for Finished Overlaps BLAST Match to self (bTH198L24) Match to left overlap (bTH119A16) Match to right overlap (bTH27G19)
Finishing 2Kb into existing finished overlapping BACs Overlapping sequence ends at 48435 Finished region will begin at 46435 to give a 2Kb overlap
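The coordinate arithmetic on this slide can be written out as a small helper. This is a sketch only: the function name is hypothetical, and it assumes the finished overlap covers the left end of the BAC, as in the example figures.

```python
OVERLAP_BP = 2000  # "2Kb" taken as 2,000 bases

def finished_region_start(overlap_end, overlap_bp=OVERLAP_BP):
    """Coordinate at which finishing should begin so that overlap_bp of the
    neighbouring finished sequence is still covered."""
    return max(1, overlap_end - overlap_bp)

print(finished_region_start(48435))  # 48435 - 2000 = 46435, as on the slide
```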
Making use of available overlaps Finished consensus spans 2 gaps reducing number of contigs to finish
Sequence Verification: Searching for Expected Markers Marker size would be 1284bp Matches predicted product size for S. lycopersicum on SGN
Finishing Strategy BAC Confirmation Identify Region to be Finished Contig Order and Orientation Use read pair information to order and orientate contigs Plasmid inserts typically 4-6Kb Both strands sequenced to give read pairs
Left Clone End Contig Order and Orientation Shotgun Sequencing Double Stranded Sequencing Vector (pUC) Forward Strand Reverse Strand Inserted sequence (BAC) Assembled Sequence Contigs Forward and Reverse Read pairs 4-6Kb apart in assembly Sequence Gap Look at read pair information across gaps to order contigs Right Clone End Good read pair link across gap
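The read pair logic in this diagram can be sketched in a few lines. The code below is a simplified illustration, not how Gap4 or Orchid work internally: the `Read` record, the function names, and the use of read start positions to approximate insert size are all assumptions for the example. Only the 4-6Kb insert range comes from the slide.

```python
from collections import Counter
from typing import NamedTuple

class Read(NamedTuple):
    template: str   # subclone name shared by the forward and reverse read
    contig: str     # contig the read was assembled into
    position: int   # leftmost coordinate of the read in that contig
    strand: str     # "+" or "-"

MIN_INSERT, MAX_INSERT = 4000, 6000  # plasmid inserts typically 4-6Kb

def pair_reads(reads):
    """Group reads by template and keep templates with exactly two reads."""
    by_template = {}
    for read in reads:
        by_template.setdefault(read.template, []).append(read)
    return [tuple(pair) for pair in by_template.values() if len(pair) == 2]

def contig_links(reads):
    """Count read pairs whose two reads lie in different contigs.
    Several links between the same two contigs suggest they are neighbours."""
    links = Counter()
    for a, b in pair_reads(reads):
        if a.contig != b.contig:
            links[tuple(sorted((a.contig, b.contig)))] += 1
    return links

def suspicious_pairs(reads):
    """Templates whose reads sit in one contig but do not face each other
    at roughly the expected insert size (a possible assembly problem)."""
    flagged = []
    for a, b in pair_reads(reads):
        if a.contig != b.contig:
            continue
        left, right = sorted((a, b), key=lambda r: r.position)
        facing = left.strand == "+" and right.strand == "-"
        distance = right.position - left.position
        if not facing or not MIN_INSERT <= distance <= MAX_INSERT:
            flagged.append(left.template)
    return flagged
```

Pairs that link two contigs give the order and orientation across a gap; pairs flagged within a contig are worth inspecting for mis-assembly.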
Contig Order and Orientation Assembled sequence contigs Sequence Gap Right Clone End Left Clone End Using Read Pair information to find Assembly Problems Right Clone End Left Clone End
Orchid: Read pair Visualisation Tool Contiguous sequence with good read pair coverage
Finishing Strategy BAC Confirmation Identify Region to be Finished Contig Order and Orientation Assessment of Gap Sizes and Type Restriction Digest Data Assess Sequence Use all available information Type of gap and size determines finishing approach
Restriction Digest Data
• Used in confirmation of finished contiguous assembly
• Also used throughout finishing process
• Sizing of gaps within BACs
  • use appropriate finishing strategy
• Identifying assembly problems
  • caused by repeats
• Sizing of repeats
  • confirming size of assembly of tandem repeats
  • sizing force joins made in repeats for tagging purposes
Restriction Digests
• Minimum of three restriction enzymes used to confirm the assembly
• Selection depends on organism and the nature of the sequence
• S. lycopersicum BACs are digested with
  • BamHI
  • EcoRI
  • HindIII
• Comparison of real and virtual digest of entire BAC sequence
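A virtual digest for the three enzymes named above can be sketched as follows. This is a minimal illustration rather than the Gap4 or Confirm implementation: it assumes a linear insert sequence with no vector, and the names `ENZYMES` and `virtual_digest` are made up for the example. The recognition sites and cut positions (G^GATCC, G^AATTC, A^AGCTT) are the standard ones for these enzymes.

```python
# Recognition sequence and cut offset within it (cut after the first base).
# All three sites are palindromic, so one strand is enough to search.
ENZYMES = {
    "BamHI": ("GGATCC", 1),
    "EcoRI": ("GAATTC", 1),
    "HindIII": ("AAGCTT", 1),
}

def virtual_digest(sequence, enzyme):
    """Return fragment lengths (largest first) from cutting a linear sequence."""
    site, offset = ENZYMES[enzyme]
    seq = sequence.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + offset)
        start = seq.find(site, start + 1)
    edges = [0] + cuts + [len(seq)]
    return sorted((b - a for a, b in zip(edges, edges[1:])), reverse=True)

# Made-up sequence with two BamHI sites and no EcoRI site.
toy = "A" * 10 + "GGATCC" + "T" * 20 + "GGATCC" + "C" * 5
print(virtual_digest(toy, "BamHI"))
print(virtual_digest(toy, "EcoRI"))
```

A real BAC digest also contains vector fragments, so the vector sequence would need to be included before comparing against a gel.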
Compare fragment lengths from virtual digest in gap4 to actual fragment sizes on the gel produced in the lab Gap4 - Restriction Digest Viewer
Using Restriction Digest Data to check for Assembly Problems
• Identifying assembly problems from digests
  • mis-assemblies caused by repeats
  • direct repeats
  • Inverted repeats
• All digests showing similar amount of missing data or extra data at a particular position
  • Possible repeat with incorrect copy number represented
• Certain digests show too much data, others have missing cutsites or data missing
  • Possible inverted repeat in wrong orientation
  • Possible E. coli transposon insertion
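The rules of thumb on this slide can be turned into a rough check. The sketch below is an assumption-laden illustration: the greedy size matching, the 5% sizing tolerance, the 500bp threshold and the function names are all invented for the example, and it compares whole digests rather than positions along the BAC.

```python
def unmatched_fragments(real, virtual, rel_tol=0.05):
    """Greedily pair gel and virtual fragments of similar size.
    Returns (fragments seen only on the gel, fragments seen only virtually)."""
    remaining = sorted(virtual)
    gel_only = []
    for size in sorted(real):
        match = next(
            (v for v in remaining if abs(v - size) <= rel_tol * max(v, size)),
            None,
        )
        if match is None:
            gel_only.append(size)
        else:
            remaining.remove(match)
    return gel_only, remaining

def interpret_digests(digests, rel_tol=0.05, size_tol=500):
    """digests maps enzyme name -> (real fragment sizes, virtual fragment sizes)."""
    nets = []
    mismatch_seen = False
    for real, virtual in digests.values():
        gel_only, virtual_only = unmatched_fragments(real, virtual, rel_tol)
        mismatch_seen = mismatch_seen or bool(gel_only or virtual_only)
        # Positive net: the gel shows more DNA than the assembly accounts for.
        nets.append(sum(gel_only) - sum(virtual_only))
    if not mismatch_seen:
        return "consistent"
    # Every enzyme is off by a similar, non-trivial amount.
    if min(abs(n) for n in nets) > size_tol and max(nets) - min(nets) <= size_tol:
        return "possible repeat copy number error"
    return "possible inverted repeat or transposon insertion"
```

For example, if every enzyme shows about 3Kb more on the gel than in the virtual digest, the function reports a possible copy number error; a mixed pattern across enzymes falls through to the last case.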
Assessment of Sequence - Dotter
• Sequence plot of BAC used throughout finishing process
• Check for repeat sequences at gaps
• Highlight any potential areas of mis-assembly
• Also used to confirm sequence overlaps
• Confirm unique sequence
• Not false repeat matches
• Used as final assembly check
• Repeats
• Cross reference sizes with restriction digests
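The idea behind a self dot plot can be illustrated with exact k-mer matches. This is a much cruder technique than Dotter, which scores windows rather than requiring exact words; the function names and the default word length are assumptions for the sketch.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def kmer_index(seq, k):
    """Map each k-mer to the list of positions where it starts."""
    index = {}
    for i in range(len(seq) - k + 1):
        index.setdefault(seq[i:i + k], []).append(i)
    return index

def self_dotplot(sequence, k=12):
    """Return (direct, inverted) lists of (i, j) dots with i < j.
    Direct dots: the same k-mer starts at i and at j.
    Inverted dots: the k-mer at j is the reverse complement of the one at i."""
    seq = sequence.upper()
    index = kmer_index(seq, k)
    direct, inverted = [], []
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        direct.extend((i, j) for j in index.get(word, []) if j > i)
        inverted.extend(
            (i, j) for j in index.get(reverse_complement(word), []) if j > i
        )
    return direct, inverted
```

Runs of direct dots parallel to the main diagonal indicate a direct repeat, while runs of inverted dots running the other way indicate an inverted repeat; the offsets between i and j can then be cross-referenced with restriction digest sizes.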
Sequence Plot – Assembly Check Repeat Examples Inverted Repeat Direct Repeat
WTSI Finishing Strategy BAC Confirmation Identify Region to be Finished Contig Order and Orientation Assessment of Gap Sizes and Type Selection of Finishing Reactions Improvement of Low Quality Sequence
Options for Gap Closure and Improving Sequence Quality
Depending on length of region or gap and associated sequence (repeat, structural problems):
• Resequencing of subclones across region if appropriate read length, using alternative chemistries if possible
• Sequence any unpaired reads which may fall in low quality region or in gap
• Primer walking on subclones across region or gap
• Direct clone walks
• PCR
• SIL or TIL
• Manual Editing
• Comment Tag for EMBL submission
Gap Closure in BACs – Gap Types Un-spanned Gap Spanned Gap Re-sequencing (read pairs) Oligo walks Direct clone walks PCR Small Insert Libraries, Transposon Libraries Restriction Fragment Library Repeats Alternative Library Sizes
Primer Walking into Spanned Gaps Assembled Sequence Contigs Primer 1 Primer 2 Good read pair links across gap Original shotgun templates Primer extended template Gap Closed
Primer Walking into Spanned Gaps Assembled Sequence Contigs Primer 1 Primer 2 Primer 3 Primer 4
Small Insert Library (SIL) Assembled Sequence Contigs Spanning Shotgun Template 4-6Kb insert SIL templates average 300-500bp insert Spanning subclone is shattered into smaller fragments to create a SIL. Smaller insert sizes can break up structural problems.
Transposon Insertion Library (TIL) Double Stranded Sequencing Vector (pUC) Normal sequencing from either end of insert Read pairs ~4-6Kb apart Inserted sequence (BAC) Transposon randomly inserts across entire plasmid Sequence outwards from transposon insertion site
Transposon Insertion Library (TIL) Double Stranded Sequencing Vector (pUC) Inserted sequence (BAC) Transposon randomly inserts across entire plasmid Sequence outwards from transposon insertion site TIL read pairs overlap by 9bp duplication site
Unspanned Gaps and gaps unresolved by walking on spanning subclones Assembled Sequence Contigs Resequence any unpaired reads that face into gap Partner may fall in gap, reducing gap size or may fall within other contig and span the gap.
Unspanned Gaps and gaps unresolved by walking on spanning subclones Assembled Sequence Contigs Primer 1 Primer 2 Primer Sequence needs to be unique Sequence search facility in Gap4 No unpaired reads. Design oligo primers from each contig end to read into gap. Use for walking directly on BAC (clone/stock) DNA and PCR Try to find unique sequence within BAC for oligo selection
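The uniqueness requirement for a primer can be checked with a simple search of both strands. This sketch uses exact matching only and is not the Gap4 sequence search; the function names are hypothetical, and near matches that could still mis-prime are not detected.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def count_occurrences(text, word):
    """Count occurrences of word in text, including overlapping ones."""
    count = 0
    start = text.find(word)
    while start != -1:
        count += 1
        start = text.find(word, start + 1)
    return count

def primer_is_unique(primer, bac_consensus):
    """True if the primer has exactly one exact match across both strands."""
    primer = primer.upper()
    seq = bac_consensus.upper()
    hits = count_occurrences(seq, primer)
    rc = reverse_complement(primer)
    if rc != primer:  # avoid double counting palindromic primers
        hits += count_occurrences(seq, rc)
    return hits == 1

# Made-up consensus in which GACCTAGG occurs twice.
consensus = "TTGACCTAGGCATTCAGGATCCAATTGACCTAGG"
print(primer_is_unique("GACCTAGG", consensus))
print(primer_is_unique("CATTCAGG", consensus))
```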
Direct Clone Walks Assembled Sequence Contigs Primer 1 Primer 2 Primer 3 Primer 4 Depending on gap size (from restriction digest data) the direct clone walks may close the gap. Alternatively they may extend into the gap, allowing further primers to be designed on the newly recovered sequence.
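The gap size mentioned here can be estimated from a restriction fragment that spans the gap: the fragment size on the gel, minus the sequence already assembled between each flanking cut site and its contig end. The sketch below shows that arithmetic under stated assumptions; the function names and the 600bp usable read length are illustrative, not WTSI values.

```python
import math

def estimate_gap_size(fragment_size_on_gel, left_flank, right_flank):
    """Approximate gap size in bp.
    left_flank:  bases from the last cut site in the left contig to its end.
    right_flank: bases from the start of the right contig to its first cut site."""
    return max(0, fragment_size_on_gel - left_flank - right_flank)

def walk_rounds_needed(gap_size, usable_read_length=600):
    """Rough number of rounds of walking from both contig ends at once.
    Ignores primer set-back and the overlap needed to join reads."""
    return math.ceil(gap_size / (2 * usable_read_length))

gap = estimate_gap_size(9000, left_flank=3200, right_flank=4100)
print(gap)                      # 9000 - 3200 - 4100 = 1700
print(walk_rounds_needed(gap))  # 1700 bp needs more than one round of 2 x 600 bp
```

Gel sizing is approximate, so the estimate is best treated as a guide to whether one round of walks is likely to close the gap or whether further primers will be needed.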
PCR Assembled Sequence Contigs Primer 1 Primer 2 The same principle applies to PCR. Design unique primers from each contig end to obtain a product that can be sequenced and extended with further primer walking. If confirmed to span the gap a PCR product may be shattered into a SIL but may skip out repetitive sequence.
SIL from Restriction Fragment Cutsite 1 Cutsite 2 Assembled Sequence Contigs Shatter this Fragment of Digested BAC DNA Sequence gap known to be within this fragment Alternatively, a restriction fragment known to contain the missing data can be isolated from the digest gel and made into a SIL. The fragment of interest must be distinct from other fragments on the gel and be a suitable size.
Gaps and Assembly Problems Caused by Repeats
Varying complexity of repeats depending on:
• Repeat Unit
• Size of Unit
• Copy Number
• Direct or Inverted Copies
• How Conserved?
Lower Complexity e.g. Di-nucleotide Runs
Higher Complexity e.g. LTRs
Importance of visualising repeat sequence to assess repeat type
Alter phrap parameters for more stringent assembly
Alternative library sizes if necessary
Discussion point for Tuesday
Improving Sequence Quality - Summary
Depending on length of poor quality region and associated sequence (repeat, structural problems):
• Resequencing of subclones across region if appropriate read length, using alternative chemistries if possible
• Sequence any unpaired reads which may fall in region
• Primer walking on subclones across region
• Direct clone walks
• PCR
• SIL or TIL
• Manual Editing
• Comment Tag for EMBL Submission
WTSI Finishing Strategy BAC Confirmation Identify Region to be Finished Contig Order and Orientation Assessment of Gap Sizes and Type Selection of Finishing Reactions Improvement of Low Quality Sequence Confirmation of Contiguous Assembly
Confirmation of contiguous sequence Contiguous Sequence Generated No Quality Issues Remain All Assembly checks completed Read pair coverage Dotplot Restriction Digests Identify any regions to be tagged QC check Final Submission of Finished Sequence to EMBL as HTGS Phase 3