
The RIKEN CAGE data for the ENCODE U54 Project

The RIKEN CAGE data for the ENCODE U54 Project. Carsten O. Daub Genome Exploration Research Group Genomic Sciences Center RIKEN. Acknowledgements. Piero Carninci Yoshihide Hayashizaki The CAGE library team Mitsuyoshi Murata Sachiko Ishikawa Hiromi Nishiyori The Bioinformatics team

Presentation Transcript


  1. The RIKEN CAGE data for the ENCODE U54 Project Carsten O. Daub Genome Exploration Research Group Genomic Sciences Center RIKEN

  2. Acknowledgements • Piero Carninci • Yoshihide Hayashizaki • The CAGE library team • Mitsuyoshi Murata • Sachiko Ishikawa • Hiromi Nishiyori • The Bioinformatics team • Erik Arner • Timo Lassmann • Akira Hasegawa • Shiro Fukuda

  3. The Bioinformatics Team June 2007

  4. Sample • Prostate Tissue (total cell RNA) • CAGE tag poly A- (1 map) • by sequencing – RIKEN • 27 nt CAGE • Random primed

  5. Outline • Solexa and SOLiD sequencing of the same RNA library • QC • Mapping • Which tags can be used?

  6. CAGE - General comments • Previous rich CAGE experience with • RISA II CAGE - FANTOM 3 • 454 CAGE - FANTOM 4 • 20 - 22 nt CAGE tags • Not so rich CAGE experience with • Solexa CAGE • SOLiD CAGE • 27 nt CAGE tags • High expectations though :-)

  7. Solexa Run - Raw Data • We obtained 19 Million raw CAGE tags • From 7 lanes • 1 lane control • Sequencing finished Feb., 07th • Technical problem with dye intensities • Issue with reagents • Noticeable for read quality towards end of read

  8. SOLiD Run - Raw Data • We got 112 million raw tags • From one run • According to SOLiD sequencer specifications • 3 GigaBases per run • 30 bp / read • 100 million reads per run • Sequencing finished Feb., 29th • First impression is that run went well

  9. Quality Control • Solexa and SOLiD provide Phred-like quality scores • The sequence quality drops towards the end of the reads • For Solexa this is known • For SOLiD as well, probably due to the CAGE library format • We can use the ability to map a CAGE tag and the mapping location as a measure of quality

  10. Quality Control • Solexa • Internal Phred-like scores currently not used • Evaluating now how they can be used • SOLiD • Recommendation by AB - can be improved for CAGE • Still evaluating how to use the internal Phred-like scores

  11. Mapping - Software • We use an in-house developed mapping program • No heuristics • Perfect mapping • Up to 3 mismatches • Up to 1 indel (454 homo-polymer issue) • Very fast, suffix array based • For reads up to 50 bp • Can handle • 454 • Solexa • SOLiD (color space)
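The slide mentions SOLiD color space without further detail. As a hedged illustration only, the sketch below shows the standard two-base encoding, in which each adjacent pair of bases becomes one of four colors. The function name `to_color_space` is hypothetical, and this is not the in-house mapping program.

```python
# Illustrative sketch of standard SOLiD two-base (color space) encoding.
# Assumption: this is NOT the in-house mapper; names are hypothetical.
_BASE_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_color_space(seq: str) -> str:
    """Encode a nucleotide sequence as its first base followed by colors 0-3."""
    seq = seq.upper()
    if not seq:
        return ""
    # The color of a base pair is the XOR of the two 2-bit base codes,
    # e.g. AA/CC/GG/TT -> 0, AC/CA/GT/TG -> 1, AG/GA/CT/TC -> 2, AT/TA/CG/GC -> 3.
    colors = [str(_BASE_BITS[a] ^ _BASE_BITS[b]) for a, b in zip(seq, seq[1:])]
    return seq[0] + "".join(colors)
```

Because every color depends on two neighboring bases, a mapper that handles SOLiD reads typically compares reads and reference in color space rather than decoding the reads first.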

  12. Mapping - Previous Experience • We have rich experience with 454 CAGE • 454 mapping rates above 90% • Little experience with Solexa and SOLiD CAGE

  13. Mapping - Solexa • We can map 6 million tags out of 20 million now (30%) • For full length 27 nt tags • Now working on refined strategy • Mapping with up to 2 mismatches, no indel • Chop off last base • Reiterate mapping with up to 2 mismatches • … • Until it maps • Previously published strategy
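The refined strategy on this slide (map, chop off the last base, map again, until it maps) can be sketched as a simple loop. This is a minimal sketch under stated assumptions: `hamming_hits` is a toy brute-force stand-in for the real suffix-array mapper, and the `min_length=20` stopping point is an assumption taken from the shortest length in the table on the next slide, not a documented parameter.

```python
from typing import List, Optional, Tuple


def hamming_hits(tag: str, reference: str, max_mismatches: int) -> List[int]:
    """Toy stand-in for a real mapper: brute-force scan, mismatches only, no indels."""
    hits = []
    for pos in range(len(reference) - len(tag) + 1):
        window = reference[pos:pos + len(tag)]
        mismatches = sum(1 for a, b in zip(tag, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append(pos)
    return hits


def map_with_trimming(
    tag: str,
    reference: str,
    max_mismatches: int = 2,
    min_length: int = 20,  # assumed lower bound, not from the slides
) -> Optional[Tuple[int, List[int]]]:
    """Return (mapped_length, hit_positions), or None if the tag never maps."""
    current = tag
    while len(current) >= min_length:
        hits = hamming_hits(current, reference, max_mismatches)
        if hits:
            return len(current), hits
        current = current[:-1]  # chop off the last base and retry
    return None
```

Recording the length at which a tag first maps is useful later, since shorter tags are more likely to map to several locations.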

  14. Mapping - Solexa
      length     tags     map   multim   multim>100   unmapped
      27       10,000   4,579      543          170      4,708
      26       10,000   4,879      596          179      4,346
      25       10,000   4,980      635          199      4,186
      24       10,000   5,073      707          213      4,007
      23       10,000   5,224      827          230      3,719
      22       10,000   5,476    1,151          253      3,120
      21       10,000   5,669    1,844          280      2,207
      20       10,000   5,371    3,221          307      1,101
      Mapping location is important!
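To make the trend in the table easier to read, the counts can be turned into fractions per tag length. The sketch below uses the numbers from the slide; treating the four count columns as disjoint categories is an assumption, although each row does sum to the 10,000 tags listed.

```python
# Rows copied from the slide: (tag length, map, multim, multim>100, unmapped).
ROWS = [
    (27, 4579, 543, 170, 4708),
    (26, 4879, 596, 179, 4346),
    (25, 4980, 635, 199, 4186),
    (24, 5073, 707, 213, 4007),
    (23, 5224, 827, 230, 3719),
    (22, 5476, 1151, 253, 3120),
    (21, 5669, 1844, 280, 2207),
    (20, 5371, 3221, 307, 1101),
]


def mapping_fractions(rows):
    """Return {length: {category: fraction of tags}} for each table row."""
    result = {}
    for length, mapped, multi, multi_100, unmapped in rows:
        total = mapped + multi + multi_100 + unmapped
        result[length] = {
            "map": mapped / total,
            "multim": multi / total,
            "multim>100": multi_100 / total,
            "unmapped": unmapped / total,
        }
    return result


fractions = mapping_fractions(ROWS)
# Tag length with the highest count in the "map" column.
best_length = max(ROWS, key=lambda row: row[1])[0]
```

In these numbers the unmapped share falls from about 47% at 27 nt to about 11% at 20 nt, while the "multim" count rises from 543 to 3,221, which is why the mapping location matters when shorter tags are accepted.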

  15. Mapping - Solexa to TSS

  16. Mapping - SOLiD • Huge dataset, not completely mapped yet • Now test mapping on subsets • Set 1: • randomly selected tags • 109,046 raw tags • Set 2: • quality filtered set • AB suggestion: exclude tags with more than 6 QVs < 10 • 22,114,693 (18% of all tags) in total • 100,000 tags randomly selected as subset
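The quality filter for Set 2 ("exclude tags with more than 6 QVs < 10") can be written as a one-line predicate. This is a minimal sketch assuming each tag's quality values are already available as a list of integers; the function name and parameter names are hypothetical.

```python
from typing import Sequence


def passes_qv_filter(qvs: Sequence[int], min_qv: int = 10, max_low: int = 6) -> bool:
    """Keep a tag unless more than `max_low` of its quality values are below `min_qv`."""
    low_count = sum(1 for q in qvs if q < min_qv)
    return low_count <= max_low


# Example: a 30-position read with exactly 6 low values is kept, 7 is rejected.
kept = passes_qv_filter([5] * 6 + [30] * 24)
rejected = passes_qv_filter([5] * 7 + [30] * 23)
```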

  17. Mapping - SOLiD

  18. Mapping - SOLiD to Exons

  19. Multimapping • Newly developed method • Multimapping CAGE tags are assigned to • Several positions • With different probabilities • Depending on the genomic context • Genomics, 2008 Mar;91(3):281-8
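The slide does not spell out how the probabilities are computed. As a hedged illustration of the general idea only, the sketch below weights each candidate position by the number of uniquely mapped tags observed near it. The weighting rule, the `pseudocount`, and all names are assumptions for illustration and should not be read as the published method.

```python
from typing import Dict, Iterable


def multimap_weights(
    candidates: Iterable[str],
    unique_counts: Dict[str, int],
    pseudocount: float = 1.0,  # assumed smoothing so no candidate gets weight 0
) -> Dict[str, float]:
    """Assign one multimapping tag to its candidate positions with probabilities.

    `unique_counts` maps a position key to the number of uniquely mapped
    tags in its neighborhood (the "genomic context" in this toy model).
    """
    scores = {pos: unique_counts.get(pos, 0) + pseudocount for pos in candidates}
    total = sum(scores.values())
    return {pos: score / total for pos, score in scores.items()}


# Example: two candidate loci, only one of which has unique-tag support.
weights = multimap_weights(["chr1:100", "chr5:200"], {"chr1:100": 8})
```

In this toy example the supported locus receives most of the weight, while the unsupported one keeps a small non-zero probability.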

  20. Mapping - Data Format • Mapping file in BLAST8 or BLAST9 format • Without multimapping ‘rescue’ • RIKEN format includes • At which step a tag mapped • Probabilities for multimapping tags • Excluding tags mapping to ribosomal genome
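For readers unfamiliar with the BLAST8/BLAST9 formats named on this slide, here is a minimal parsing sketch. It assumes the standard 12-column tab-separated BLAST tabular layout, with BLAST9 differing only by `#` comment lines. Whether the RIKEN mapping files follow exactly this layout is an assumption, and the class and function names are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class BlastHit(NamedTuple):
    query_id: str
    subject_id: str
    percent_identity: float
    alignment_length: int
    mismatches: int
    gap_openings: int
    query_start: int
    query_end: int
    subject_start: int
    subject_end: int
    e_value: float
    bit_score: float


def parse_blast_tabular(lines: Iterable[str]) -> Iterator[BlastHit]:
    """Yield one BlastHit per data line; skip blank lines and '#' comments."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 12:
            raise ValueError(f"expected 12 columns, got {len(fields)}: {line!r}")
        yield BlastHit(
            fields[0], fields[1], float(fields[2]),
            int(fields[3]), int(fields[4]), int(fields[5]),
            int(fields[6]), int(fields[7]), int(fields[8]), int(fields[9]),
            float(fields[10]), float(fields[11]),
        )
```

Passing an open file handle, or any iterable of lines, to `parse_blast_tabular` gives one record per mapped position, so a multimapping tag appears as several records with the same `query_id`.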

  21. Summary • Solexa run had problems with dye intensities • SOLiD run appears to be good • Mappings are still under development • First mappings already make sense • 6 million Solexa + 6 million SOLiD • Will be improved! • Mapping seems a reasonable criterion to decide which CAGE tags to use • More evaluation needed • In year 1

  22. Open Questions • Provide data or analysis? • Data format • What to provide • How raw is raw? • How to provide? • Genome version, other details
