The RIKEN CAGE data for the ENCODE U54 Project

The RIKEN CAGE data for the ENCODE U54 Project • Carsten O. Daub • Genome Exploration Research Group • Genomic Sciences Center • RIKEN

Acknowledgements • Piero Carninci • Yoshihide Hayashizaki • The CAGE library team • Mitsuyoshi Murata • Sachiko Ishikawa • Hiromi Nishiyori • The Bioinformatics team • Erik Arner • Timo Lassmann • Akira Hasegawa • Shiro Fukuda

The Bioinformatics Team June 2007

Sample • Prostate Tissue (total cell RNA) • CAGE tag poly A- (1 map) • by sequencing – RIKEN • 27 nt CAGE • Random primed

Outline • Solexa and SOLiD sequencing of the same RNA library • QC • Mapping • Which tags can be used?

CAGE - General comments • Previous rich CAGE experience with • RISA II CAGE - FANTOM 3 • 454 CAGE - FANTOM 4 • 20 - 22 nt CAGE tags • Not so rich CAGE experience with • Solexa CAGE • SOLiD CAGE • 27 nt CAGE tags • High expectations though :-)

Solexa Run - Raw Data • We obtained 19 million raw CAGE tags • From 7 lanes • 1 lane control • Sequencing finished Feb. 7th • Technical problem with dye intensities • Issue with reagents • Noticeable in read quality towards end of read

SOLiD Run - Raw Data • We got 112 million raw tags • From one run • According to SOLiD sequencer specifications • 3 GigaBases per run • 30 bp / read • 100 million reads per run • Sequencing finished Feb., 29th • First impression is that run went well

Quality Control • Solexa and SOLiD provide phred like quality scores • The sequence quality drops towards the end of the reads • for Solexa this is known • SOLiD as well, probably due to CAGE library format • We can use the ability to map a CAGE tag and the mapping location as a measure of quality

Quality Control • Solexa • Internal phred like scores currently not used • Evaluating now how they can be used • SOLiD • Recommendation by AB - can be improved for CAGE • Still evaluating how to use the internal phred like scores

Mapping - Software • We use an in-house developed mapping program • No heuristics • Perfect mapping • Up to 3 mismatches • Up to 1 indel (454 homo-polymer issue) • Very fast, suffix array based • For reads up to 50 bp • Can handle • 454 • Solexa • SOLiD (color space)
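To illustrate what "suffix array based" lookup means, here is a minimal sketch of exact matching against a toy reference. It is not the in-house program: the function names are hypothetical, the construction is deliberately naive, and mismatches, indels, and color space are not handled.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order.

    Naive construction (sorts full suffixes); acceptable for a toy reference only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_exact(text, suffix_array, pattern):
    """Return all positions where pattern occurs exactly, via binary search."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[start:lo])


reference = "ACGTACGTTACG"          # toy sequence, not real genome data
sa = build_suffix_array(reference)
print(find_exact(reference, sa, "ACG"))
```

Each lookup costs a logarithmic number of string comparisons in the reference length, which is why an index of this kind can be fast without heuristics.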

Mapping - Previous Experience • We have rich experience with 454 CAGE • 454 mapping rates above 90% • Little experience with Solexa and SOLiD CAGE

Mapping - Solexa • We can currently map 6 million tags out of 20 million (30%) • For full length 27 nt tags • Now working on refined strategy • Mapping with up to 2 mismatches, no indel • Chop off last base • Reiterate mapping with up to 2 mismatches • … • Until it maps • Previously published strategy
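The refined strategy (map, chop off the last base, map again) can be sketched as a loop. This is an illustration only: `hamming_hits` is a brute-force stand-in for the real mapper, and the `min_length=20` stopping point is an assumption, not a value stated on the slide.

```python
def hamming_hits(reference, tag, max_mismatches=2):
    """Brute-force stand-in for a mapper: positions matching with few mismatches, no indels."""
    hits = []
    for pos in range(len(reference) - len(tag) + 1):
        mismatches = 0
        for ref_base, tag_base in zip(reference[pos:pos + len(tag)], tag):
            if ref_base != tag_base:
                mismatches += 1
                if mismatches > max_mismatches:
                    break
        else:
            # Loop finished without exceeding the mismatch limit.
            hits.append(pos)
    return hits


def map_with_trimming(reference, tag, min_length=20, max_mismatches=2):
    """Trim one base from the 3' end until the tag maps or gets too short.

    Returns (mapped_length, hit_positions), or None if nothing maps.
    min_length is a hypothetical cut-off chosen for this sketch.
    """
    current = tag
    while len(current) >= min_length:
        hits = hamming_hits(reference, current, max_mismatches)
        if hits:
            return len(current), hits
        current = current[:-1]  # chop off the last base and retry
    return None


# Toy reference; the tag has four unreadable bases ("N") at its 3' end.
reference = "ACGTTGCAAGCTTAGGCTAACGGATCCGTATGCAATCGGA"
tag = reference[3:26] + "NNNN"
print(map_with_trimming(reference, tag))
```

Recording the length at which a tag finally maps corresponds to the "at what step is a tag mapping" information mentioned for the RIKEN data format.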

Mapping - Solexa
length  tags    map    multim  multim>100  unmapped
27      10,000  4,579  543     170         4,708
26      10,000  4,879  596     179         4,346
25      10,000  4,980  635     199         4,186
24      10,000  5,073  707     213         4,007
23      10,000  5,224  827     230         3,719
22      10,000  5,476  1,151   253         3,120
21      10,000  5,669  1,844   280         2,207
20      10,000  5,371  3,221   307         1,101
Mapping location is important!
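The trade-off in these counts is easier to see as percentages. The sketch below uses only the numbers from the slide; treating the "map" column as the tags that map without falling into either multimapping column is an interpretation, not something the slide states.

```python
# (length, tags, map, multim, multim>100, unmapped), copied from the slide
ROWS = [
    (27, 10000, 4579, 543, 170, 4708),
    (26, 10000, 4879, 596, 179, 4346),
    (25, 10000, 4980, 635, 199, 4186),
    (24, 10000, 5073, 707, 213, 4007),
    (23, 10000, 5224, 827, 230, 3719),
    (22, 10000, 5476, 1151, 253, 3120),
    (21, 10000, 5669, 1844, 280, 2207),
    (20, 10000, 5371, 3221, 307, 1101),
]


def summarize(rows):
    """Convert raw counts to percentages of the tags tested at each length."""
    summary = {}
    for length, tags, mapped, multim, multim_100, unmapped in rows:
        summary[length] = {
            "map_pct": 100 * mapped / tags,
            "multim_pct": 100 * (multim + multim_100) / tags,
            "unmapped_pct": 100 * unmapped / tags,
        }
    return summary


summary = summarize(ROWS)
best_length = max(summary, key=lambda n: summary[n]["map_pct"])

for length in sorted(summary, reverse=True):
    s = summary[length]
    print(f"{length} nt: map {s['map_pct']:.1f}%, "
          f"multim {s['multim_pct']:.1f}%, unmapped {s['unmapped_pct']:.1f}%")
print("length with the highest 'map' count:", best_length)
```

Shorter tags are unmapped less often, but the multimapping share grows quickly below 22 nt, which is consistent with the slide's point that the mapping location matters and not only whether a tag maps.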

Mapping - Solexa to TSS

Mapping - SOLiD • Huge dataset, not completely mapped yet • Now test mapping on subsets • Set 1: • randomly selected tags • 109,046 raw tags • Set 2: • quality filtered set • AB suggestion: exclude tags with more than 6 QVs < 10 • 22,114,693 (18% of all tags) in total • 100,000 tags randomly selected as subset
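The quality filter for Set 2 (exclude tags with more than 6 QVs below 10) is simple to express in code. In this sketch the per-base quality values are assumed to be available as a list of integers per tag; the function and parameter names are hypothetical.

```python
def passes_qv_filter(qvs, qv_threshold=10, max_low_quality=6):
    """Keep a tag unless more than max_low_quality of its QVs fall below qv_threshold."""
    low_quality = sum(1 for qv in qvs if qv < qv_threshold)
    return low_quality <= max_low_quality


def filter_tags(tags_with_qvs):
    """Yield (tag_id, qvs) pairs that pass the filter.

    tags_with_qvs is any iterable of (tag_id, list_of_qvs); this input
    layout is an assumption made for the example.
    """
    for tag_id, qvs in tags_with_qvs:
        if passes_qv_filter(qvs):
            yield tag_id, qvs


example = [
    ("tag_a", [30] * 24 + [5] * 6),   # exactly 6 low QVs: kept
    ("tag_b", [30] * 23 + [5] * 7),   # 7 low QVs: excluded
]
kept = [tag_id for tag_id, _ in filter_tags(example)]
print(kept)
```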

Mapping - SOLiD

Mapping - SOLiD to Exons

Multimapping • Newly developed method • Multimapping CAGE tags are assigned to • Several positions • With different probabilities • Depending on the genomic context • Genomics, 2008 Mar;91(3):281-8
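The idea of assigning one tag to several positions with different probabilities can be illustrated with a simple proportional weighting. This is not the published algorithm: it assumes that "genomic context" is approximated by the number of uniquely mapped tags near each candidate position, and the window size and pseudocount are arbitrary values chosen for the example.

```python
def rescue_weights(candidates, unique_counts, window=100, pseudocount=1.0):
    """Distribute one multimapping tag over its candidate positions.

    candidates:    list of (chromosome, position) hits for the tag
    unique_counts: dict mapping (chromosome, position) -> uniquely mapped tag count
    Returns one probability per candidate, summing to 1.
    """
    support = []
    for chrom, pos in candidates:
        nearby = sum(
            count
            for (c, p), count in unique_counts.items()
            if c == chrom and abs(p - pos) <= window
        )
        # The pseudocount keeps positions without nearby unique tags above zero.
        support.append(nearby + pseudocount)
    total = sum(support)
    return [s / total for s in support]


# Toy data: the chr1 candidate sits next to 9 uniquely mapped tags, chr2 has none nearby.
candidates = [("chr1", 1000), ("chr2", 5000)]
unique_counts = {("chr1", 990): 8, ("chr1", 1050): 1, ("chr2", 9000): 5}
weights = rescue_weights(candidates, unique_counts)
print(weights)
```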

Mapping - Data Format • Mapping file in BLAST8 or BLAST9 format • Without multimapping ‘rescue’ • RIKEN format includes • At what step is a tag mapping • Probabilities for multimapping tags • Excluding tags mapping to ribosomal genome
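For reference, BLAST8 is the 12-column tab-separated BLAST output, and BLAST9 is the same layout with "#" comment lines. A minimal reader might look like the sketch below; the field names are chosen for this example, the sample record is invented, and the additional RIKEN-specific columns are not covered.

```python
from collections import namedtuple

# Standard column order of BLAST tabular output; names are this sketch's own.
Blast8Hit = namedtuple(
    "Blast8Hit",
    "query subject pct_identity aln_length mismatches gap_opens "
    "q_start q_end s_start s_end evalue bit_score",
)


def parse_blast8(lines):
    """Yield one Blast8Hit per data line, skipping blanks and BLAST9 comments."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        yield Blast8Hit(
            f[0], f[1], float(f[2]), int(f[3]), int(f[4]), int(f[5]),
            int(f[6]), int(f[7]), int(f[8]), int(f[9]), float(f[10]), float(f[11]),
        )


sample = [
    "# Fields: Query id, Subject id, % identity, ...",
    "tag1\tchr1\t100.00\t27\t0\t0\t1\t27\t1000\t1026\t1e-05\t54.0",
]
hits = list(parse_blast8(sample))
print(hits[0].subject, hits[0].s_start, hits[0].mismatches)
```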

Summary • Solexa run had problems with dye intensities • SOLiD run appears to be good • Mappings are still under development • First mappings already make sense • 6 million Solexa + 6 million SOLiD • Will be improved! • Mapping seems a reasonable criterion to decide which CAGE tags to use • More evaluation needed • In year 1

Open Questions • Provide data or analysis? • Data format • What to provide • How raw is raw? • How to provide? • Genome version, other details
