Extracting genetic variation from human genome sequences Stephen Sherry, PhD. pipeline for 1000 genomes cSRA deployment software to support use of NGS data post NGS analysis data sets. 1000 genomes project: goals.
Extracting genetic variation from human genome sequences Stephen Sherry, PhD • pipeline for 1000 genomes • cSRA deployment • software to support use of NGS data • post NGS analysis data sets
1000 genomes project: goals • A public database of essentially all SNPs and detectable CNVs with allele frequency >1% in each of multiple human population samples • N=2,600 = 100 each from 26 populations • Pioneer and evaluate methods for: • Generating data from next-generation sequencing platforms • Exchanging and combining data and analytical methods • Discovering and genotyping SNPs and CNVs from next-gen data • Imputation with and from next-generation sequencing data
1000 Genomes Project Sampling Sites Finland United Kingdom Beijing, China Italy Xishuangbanna, China Utah, U.S. Southwest U.S. Japan Mississippi, U.S. Pakistan Spain Puerto Rico Shenzhen, China California, U.S. Gambia India Vietnam Barbados Nigeria Colombia Kenya Ghana Peru Malawi
Primary project data formats
• FASTQ: sequences with base qualities
    @IL11_193:4:1:878:501
    TATTTTGACTTTGAGCGTATCGAGGCTCTTTAACCTGAACGTCAGAAGCAGCCTTATGGCCGTCAACATACC
    +
    IIIIIIIIIIIIIIIIIIIIIIIIIIIIII1IDII<IIIIIIIIIIIIIIIIIIIIIIIIII(I&/97.,8&
• SAM/BAM: multiple sequence alignments
    @HD VN:1.0
    @SQ SN:chr20 LN:62435964
    @RG ID:L1 PU:SC_1_10 LB:SC_1 SM:NA12891
    @RG ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891
    read_28833_29006_6945 99 chr20 28833 20 10M1D25M = 28993 195 \
      AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG <<<<<<<<<<<<<<<<<<<<<:<9/,&,22;;<<< \
      NM:i:1 RG:Z:L1
    read_28701_28881_323b 147 chr20 28834 30 35M = 28701 -168 \
      ACCTATATCTTGGCCTTGGCCGATGCGGCCTTGCA <<<<<;<<<<7;:<<<6;<<<<<<<<<<<<7<<<< \
      MF:i:18 RG:Z:L2
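The FLAG column in the two alignment records above (99 and 147) packs several yes/no properties into one integer, and the FASTQ quality line encodes one score per base as a character. The following is a minimal Python sketch of how both can be decoded. The function names are hypothetical, the bit meanings follow the SAM specification, and the Phred+33 offset is an assumption: the slide does not say which quality encoding the example read uses.

```python
# Bit meanings of the SAM FLAG field, as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value (hypothetical helper)."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def fastq_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string to integer scores.

    The default offset of 33 (Phred+33) is an assumption; older Illumina
    pipelines used an offset of 64.
    """
    return [ord(ch) - offset for ch in quality_string]


if __name__ == "__main__":
    print(decode_flag(99))
    print(decode_flag(147))
    print(fastq_qualities("I1(I&"))
```

Read this way, flag 99 describes the first read of a properly paired fragment whose mate is on the reverse strand, and flag 147 describes the reverse-strand second read of such a pair.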
Primary project data formats
• VCF: variants with genomic location & genotypes
    ##fileformat=VCFv4.0
    ##fileDate=20100721
    ##source=VCFtools
    ##reference=NCBI36 (preferred use is assembly accession.version)
    ##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
    ##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
    ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
    ##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
    ##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
    ##ALT=<ID=DEL,Description="Deletion">
    ##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
    ##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant">
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE1 SAMPLE2
    1 1 . ACG A,AT 40 PASS . GT:DP 1/1:13 2/2:29
    1 2 . C T,CT . PASS H2;AA=T GT 0|1 2/2
    1 5 rs12 A G 67 PASS . GT:DP 1|0:16 2/2:20
    X 100 . T <DEL> . PASS SVTYPE=DEL;END=300 GT:GQ:DP 1:12:15 0/0:20:13
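In a VCF data line, each sample column is read using the keys in the FORMAT column, and the GT value indexes into the allele list (0 is REF, 1 and up are the ALT alleles in order). As a rough illustration, the Python sketch below parses the first data line of the example. `parse_vcf_record` is a hypothetical helper, not part of VCFtools or any library, and it splits on any whitespace because the slide text is space-separated, whereas real VCF files are tab-delimited.

```python
def parse_vcf_record(line, sample_names):
    """Parse one VCF data line into alleles and per-sample genotype calls.

    Hypothetical helper for illustration only. It assumes the line has a
    FORMAT column containing GT and one column per name in sample_names.
    """
    fields = line.split()
    chrom, pos, variant_id, ref, alt = fields[:5]
    alleles = [ref] + alt.split(",")      # index 0 = REF, 1.. = ALT alleles
    format_keys = fields[8].split(":")    # e.g. ["GT", "DP"]

    calls = {}
    for name, raw in zip(sample_names, fields[9:]):
        values = dict(zip(format_keys, raw.split(":")))
        genotype = values["GT"]
        phased = "|" in genotype          # "|" = phased, "/" = unphased
        indexes = genotype.replace("|", "/").split("/")
        called = [None if i == "." else alleles[int(i)] for i in indexes]
        calls[name] = {"alleles": called, "phased": phased, "fields": values}

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "alleles": alleles,
        "calls": calls,
    }


if __name__ == "__main__":
    record = parse_vcf_record(
        "1 1 . ACG A,AT 40 PASS . GT:DP 1/1:13 2/2:29",
        ["SAMPLE1", "SAMPLE2"],
    )
    for sample, call in record["calls"].items():
        print(sample, call["alleles"], call["fields"]["DP"])
```

For that line, SAMPLE1 (1/1) is homozygous for the first ALT allele `A`, and SAMPLE2 (2/2) is homozygous for the second ALT allele `AT`.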
[Figure labels: BAM, FASTQ, VCF, VCF]
Results & Benefits
• Whole genomes and exomes can be efficiently stored in 1/3 to 1/10 of the space of a BAM file.
• cSRA lossless compression achieved a ~3x reduction in size (bits per base) compared to the original BAM files.
• Near-lossless compression (4- and 8-level quantization of base qualities) further reduced storage to <2 bits per base.
• cSRA can be quickly ‘sliced’ to extract genomic intervals of interest in BAM, SAM, or FASTQ format.
• The format stores original base qualities (OQ) or recalibrated quality scores (RQ) and can produce recalibrated quality scores (RQ) during extraction.
• Conversion is rapid: BAM-to-cSRA encoding runs at 15-20 GB per hour per 2 CPU cores.
• Processing requires significant RAM to match up paired read names during mate pair reconstruction; memory requirements are typically 1/3 of the size of the BAM input file.
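To give a feel for what 4- and 8-level quantization of base qualities means, here is a small Python sketch that maps each quality score to one of a fixed number of bins. The equal-width binning and the 0-40 score range are assumptions made for illustration; the slides do not describe the bin boundaries cSRA actually uses, and the function names are hypothetical.

```python
def quantize_qualities(qualities, levels, max_quality=40):
    """Map integer quality scores to bin indexes in the range 0..levels-1.

    Assumption: equal-width bins over 0..max_quality. This is only an
    illustration, not the binning scheme used by cSRA.
    """
    if levels < 1:
        raise ValueError("levels must be at least 1")
    bins = []
    for q in qualities:
        q = min(max(q, 0), max_quality)              # clamp to the valid range
        bins.append(q * levels // (max_quality + 1))
    return bins


def fixed_width_bits(levels):
    """Bits needed to store one bin index without any entropy coding."""
    return (levels - 1).bit_length()


if __name__ == "__main__":
    scores = [40, 40, 16, 40, 35, 27, 7, 5]
    print(quantize_qualities(scores, levels=4))
    print(quantize_qualities(scores, levels=8))
    print(fixed_width_bits(4), fixed_width_bits(8))
```

With fewer distinct values the quality stream becomes more repetitive, which is what lets a downstream compressor use fewer bits per base. The fixed-width figure printed here is only an upper bound before any such compression.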
Sources of problematic alignments: • low complexity regions • imperfect aligner technology • lack of essential quality control • Errors corrected by cSRA • incorrect mate flags • inconsistent quality flags • errors in CIGAR strings • multiple placements • Errors impact variant detection and may introduce false positives into the final variant call set.
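One of the error classes listed above, errors in CIGAR strings, can be illustrated with a simple consistency check: the number of read bases a CIGAR string accounts for should equal the length of the stored sequence. The Python sketch below shows that one check only. It is not the validation cSRA performs, the function names are hypothetical, and the operator classes follow the SAM specification.

```python
import re

# One CIGAR operation: a length followed by an operator code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Per the SAM specification: which operators consume read bases
# and which consume reference positions.
QUERY_CONSUMING = set("MIS=X")
REFERENCE_CONSUMING = set("MDN=X")


def cigar_lengths(cigar):
    """Return (read_length, reference_span) implied by a CIGAR string."""
    ops = CIGAR_OP.findall(cigar)
    # Reject strings with characters the pattern could not account for.
    if "".join(count + op for count, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    read_length = sum(int(n) for n, op in ops if op in QUERY_CONSUMING)
    reference_span = sum(int(n) for n, op in ops if op in REFERENCE_CONSUMING)
    return read_length, reference_span


def cigar_matches_sequence(cigar, sequence):
    """True when the CIGAR accounts for exactly len(sequence) read bases."""
    read_length, _ = cigar_lengths(cigar)
    return read_length == len(sequence)


if __name__ == "__main__":
    seq = "AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG"
    print(cigar_lengths("10M1D25M"))
    print(cigar_matches_sequence("10M1D25M", seq))
    print(cigar_matches_sequence("36M", seq))
```

For a 35-base read, `10M1D25M` is consistent (35 read bases spanning 36 reference positions because of the one-base deletion), while `36M` would be flagged.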
Variation detection: comparison of unfiltered call sets produced by variation detection pipeline using submitted and archive-restored BAMs • Comparisons will include measures of false negatives — the potential variations that would be ‘lost’ by archive treatment • Lists of variants that truncate proteins should be evaluated. These include nonsense (stop gain), loss of transcription start site, and splice site donor and acceptor positions. • SV Performance with and without map quality (CREST) Empirical testing & validation possible • Individual genotyping accuracy • Accuracy of remapping to new assemblies. • Pipeline consequences for dropping secondary base calls
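The comparison described above, between call sets made from submitted BAMs and from archive-restored BAMs, comes down to set arithmetic once each variant is reduced to a comparable key. A minimal Python sketch follows. The key layout `(chrom, pos, ref, alt)`, the function name, and the example variants are all assumptions for illustration, not the project's pipeline or data.

```python
def compare_call_sets(submitted, restored):
    """Compare two variant call sets keyed by (chrom, pos, ref, alt).

    "lost" holds variants called from the submitted BAM but missing after
    archive treatment (the potential false negatives); "gained" holds
    variants that appear only after archive treatment.
    """
    submitted, restored = set(submitted), set(restored)
    lost = submitted - restored
    gained = restored - submitted
    shared = submitted & restored
    lost_fraction = len(lost) / len(submitted) if submitted else 0.0
    return {
        "shared": shared,
        "lost": lost,
        "gained": gained,
        "lost_fraction": lost_fraction,
    }


if __name__ == "__main__":
    # Made-up variants, for illustration only.
    from_submitted = [("20", 1000, "A", "G"), ("20", 2000, "C", "T"), ("20", 3000, "G", "A")]
    from_restored = [("20", 1000, "A", "G"), ("20", 3000, "G", "A"), ("20", 4000, "T", "C")]
    result = compare_call_sets(from_submitted, from_restored)
    print(sorted(result["lost"]))
    print(sorted(result["gained"]))
    print(round(result["lost_fraction"], 3))
```

The same comparison can be restricted to a subset of interest, such as protein-truncating variants, by filtering both inputs before calling the function.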
Stationary night blindness due to premature termination in TRPM1 Nonsense CA in TRPM1 Data for NA11918 placed by two different aligners (mosaik & bwa) All individual genotypes For rs3784589 Deanna Church & Eugene Yaschenko
Developing characterized gDNA reference material for NGS • NCBI contributions: • analyze sequence data and variant calls in target gene regions • create consensus VCFs for NA12878 and NA19240 • host a genome-specific browser for published sequences and genotype calls • NIST will use this information to further develop standard reference materials for NGS Technology-specific genotypes from publications and Groups collaborating in the GET-RM project.
A genotype dataset is a very large matrix with orthogonal access patterns
• Problem:
  • List all genotypes for a given variation
  • List all variations for a subject
• Solution: Divide the data into chunks
• vcf, asn.1, json, xml
• SciDB Cluster (array-based storage)
• 1000G November 2010 release (pilot)
  • 18 GB compressed VCF
  • 38.8M SNPs
• 1000G May 2011 release (phase 1)
  • 164.4 GB compressed VCF
  • 38.2M SNPs
  • 3.9M Short Indels
  • 14K Deletions
• querygt.cgi
Douglas Slotta
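The chunking idea can be sketched in a few lines: if the variation-by-subject matrix is cut into rectangular tiles, a query along either axis only needs the tiles in one row or one column of the tile grid. The Python class below is a toy in-memory illustration under that assumption. It is not the SciDB API or the querygt.cgi implementation, and all names in it are hypothetical.

```python
from collections import defaultdict


class ChunkedGenotypeMatrix:
    """Toy genotype matrix stored as rectangular chunks (illustration only)."""

    def __init__(self, chunk_rows, chunk_cols):
        self.chunk_rows = chunk_rows    # variations per chunk
        self.chunk_cols = chunk_cols    # subjects per chunk
        # (row_chunk, col_chunk) -> {(variation, subject): genotype}
        self.chunks = defaultdict(dict)

    def put(self, variation, subject, genotype):
        key = (variation // self.chunk_rows, subject // self.chunk_cols)
        self.chunks[key][(variation, subject)] = genotype

    def genotypes_for_variation(self, variation):
        """Row query: visits only chunks in one row of the chunk grid."""
        row_chunk = variation // self.chunk_rows
        result = {}
        for (rc, _), cells in self.chunks.items():
            if rc != row_chunk:
                continue
            for (v, s), genotype in cells.items():
                if v == variation:
                    result[s] = genotype
        return result

    def variations_for_subject(self, subject):
        """Column query: visits only chunks in one column of the chunk grid."""
        col_chunk = subject // self.chunk_cols
        result = {}
        for (_, cc), cells in self.chunks.items():
            if cc != col_chunk:
                continue
            for (v, s), genotype in cells.items():
                if s == subject:
                    result[v] = genotype
        return result


if __name__ == "__main__":
    matrix = ChunkedGenotypeMatrix(chunk_rows=2, chunk_cols=2)
    matrix.put(0, 0, "0/0")
    matrix.put(0, 3, "0/1")
    matrix.put(1, 0, "1/1")
    matrix.put(5, 3, "0|1")
    print(matrix.genotypes_for_variation(0))
    print(matrix.variations_for_subject(3))
```

Neither query has to read the whole matrix, which is the point of chunking. A row-major or column-major layout would make one of the two access patterns cheap and the other a full scan.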
ClinVar: organizing allele significance relative to disorders e.g. Severe Combined Immunodeficiency Disease Variants co-observed in affected patients Allele focus of the report Semantic properties of the disorder or phenotype Donna Maglott and Wendy Rubinstein
The translational research process has archives at each stage • Stages: Genome, Biology, Medicine • Archives: PheGenI, OMIM, 1000 Genomes, Genetic Test Registry, NHGRI GWAS, Gene, SRA, dbVar, ClinVar, dbGaP, dbSNP, PharmGKB, RefSeq Gene
THANK YOU
• 1000 Genomes Roadmap: Chunlin Xiao
• Genotype archive: Chunlei Liu, Douglas Slotta
• Variation Pipeline: Chunlin Xiao, Anatoly Mnev, Gonçalo Abecasis (U Mich), Tom Blackwell (U Mich), Gabor Marth (Boston College), Alistair Ward (Boston College)
• 1000 Genomes Browser: Victor Ananiev, Deanna Church, Cliff Clausen, Rob Cohen, Peter Meric. The sequence viewer team! The SRA team!
• cSRA deployment: Chunlin Xiao, Michael Kimelman, Eugene Yaschenko. The VDB team!
• Systems: Chris Cope, Don Preuss
• GetRM Browser: Richa Agarwala, Deanna Church, Donna Maglott, Chris O’Sullivan, Chunlin Xiao, Eugene Yaschenko. The dbSNP team! The Clinical Variation team!