1000 Genomes Project Haplotype Integration
Androniki Menelaou, University Medical Center Utrecht
Phase 1 integrated haplotypes
• Haplotypes from 1,092 samples.
• The official release can be found here: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/
• It includes:
  • 38 million single nucleotide polymorphisms
  • 1.4 million short insertions and deletions
Phase 1 integrated haplotypes
• Information on the samples: http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/integrated_call_samples.20101123.ped
• Genome build: Build37
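As a rough illustration of how the sample information file might be used to pick a subset of samples, here is a minimal Python sketch. It assumes the file is tab-delimited with a header row containing columns named "Individual ID" and "Population"; those column names, the sample IDs and the helper name `samples_in_populations` are assumptions for illustration and should be checked against the actual file.

```python
import csv
import io
from typing import List, Set


def samples_in_populations(ped_text: str, populations: Set[str],
                           id_column: str = "Individual ID",
                           pop_column: str = "Population") -> List[str]:
    """Return the sample IDs whose population code is in `populations`.

    Assumes a tab-delimited table with a header row; the default column
    names are an assumption, not taken from the slides.
    """
    reader = csv.DictReader(io.StringIO(ped_text), delimiter="\t")
    return [row[id_column] for row in reader if row[pop_column] in populations]


# Toy stand-in for the sample file (hypothetical sample IDs).
toy_ped = (
    "Family ID\tIndividual ID\tPopulation\n"
    "FAM1\tSAMPLE_A\tCEU\n"
    "FAM2\tSAMPLE_B\tYRI\n"
    "FAM3\tSAMPLE_C\tGBR\n"
)

european = samples_in_populations(toy_ped, {"CEU", "GBR"})
print("\n".join(european))  # one ID per line
```

A list written one ID per line is the form that a sample filter such as the VCFtools `--keep` option expects.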
Use of phased haplotypes
• Inference of human demographic history
• Inference of points of recombination
• Understanding the interplay of genetic variation and disease
• Imputation of untyped genetic variation
Human disease genetics
• Genome-wide SNP microarray genotypes: SNPs (usually > 500,000 genome-wide) typed in cases and controls (usually > 1000).
• Haplotypes are estimated using statistical methods.
• Reference haplotypes via sequencing studies: 1000 Genomes Project (~2,200 haplotypes).
• Imputation of unobserved alleles via matching of shared haplotypes.
• GWAS of imputed genotypes:
  • Increased power
  • Better resolution
  • Facilitates meta-analysis
[Figures: genotype and haplotype matrices for the reference panel and for cases and controls. Credit: Jonathan Marchini]
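The idea of imputing unobserved alleles by matching shared haplotypes can be sketched with a toy example. The Python below is a deliberately simplified illustration, not how BEAGLE, IMPUTE2 or MINIMAC work: it copies missing alleles from the single reference haplotype with the fewest mismatches at typed positions, whereas real tools use probabilistic models. The function name and the toy sequences are made up for this sketch.

```python
from typing import List, Optional


def impute_by_best_match(study: List[Optional[str]], reference: List[str]) -> str:
    """Fill untyped positions (None) from the closest reference haplotype.

    "Closest" means fewest mismatches at the positions typed in the study
    haplotype. This is a toy nearest-haplotype rule, not a real method.
    """
    def mismatches(ref_hap: str) -> int:
        return sum(1 for s, r in zip(study, ref_hap) if s is not None and s != r)

    best = min(reference, key=mismatches)
    return "".join(s if s is not None else r for s, r in zip(study, best))


# Toy reference panel (three haplotypes over ten sites).
reference_panel = ["agagttgagg", "atgagacgag", "tgcgacggtg"]

# Study haplotype typed only at every other site, as on a sparse microarray.
study_haplotype = ["a", None, "a", None, "t", None, "g", None, "g", None]

imputed = impute_by_best_match(study_haplotype, reference_panel)
print(imputed)
```

Here the typed alleles agree exactly with the first reference haplotype, so the untyped sites are filled from it.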
Imputation using the 1000 Genomes data
• Samples are genotyped on a microarray (e.g. Affy500k, Illumina1M, etc.)
• Quality control
• Choose an imputation algorithm:
  • BEAGLE (http://faculty.washington.edu/browning/beagle/beagle.html)
  • IMPUTE2 (http://mathgen.stats.ox.ac.uk/impute/impute_v2.html)
  • MINIMAC (http://genome.sph.umich.edu/wiki/Minimac)
Imputation using the 1000 Genomes data
• NOTE: the 1000 Genomes haplotypes have already been converted to the format each imputation program requires (check the programs' websites).
• Impute samples using the 1000 Genomes haplotypes as the reference panel.
Imputation using the 1000 Genomes data

    ./impute2 \
      -m ./Example/example.chr22.map \
      -h ./Example/example.chr22.1kG.haps \
      -l ./Example/example.chr22.1kG.legend \
      -g ./Example/example.chr22.study.gens \
      -strand_g ./Example/example.chr22.study.strand \
      -int 20.4e6 20.5e6 \
      -Ne 20000 \
      -o ./Example/example.chr22.one.phased.impute2
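The command above imputes a single interval given by `-int`. To cover a longer region, one possible approach is to generate one command per non-overlapping interval. The Python sketch below only builds and prints the argument lists and does not run anything. It reuses the flags and file names from the example, while the helper name, the chunk size, the region bounds and the output naming scheme are assumptions for illustration.

```python
from typing import List


def impute2_chunk_commands(region_start: int, region_end: int,
                           chunk_size: int, prefix: str) -> List[List[str]]:
    """Build one impute2 argument list per interval of `chunk_size` bases.

    Intervals are inclusive and non-overlapping: [start, start + chunk_size - 1].
    Flags mirror the example command; the output name is a hypothetical scheme.
    """
    commands = []
    for start in range(region_start, region_end + 1, chunk_size):
        end = min(start + chunk_size - 1, region_end)
        commands.append([
            "./impute2",
            "-m", f"{prefix}.map",
            "-h", f"{prefix}.1kG.haps",
            "-l", f"{prefix}.1kG.legend",
            "-g", f"{prefix}.study.gens",
            "-strand_g", f"{prefix}.study.strand",
            "-int", str(start), str(end),
            "-Ne", "20000",
            "-o", f"{prefix}.{start}_{end}.impute2",
        ])
    return commands


# Illustrative region and chunk size (both arbitrary choices for the sketch).
cmds = impute2_chunk_commands(20_000_001, 21_000_000, 500_000,
                              "./Example/example.chr22")
for cmd in cmds:
    print(" ".join(cmd))
```

Each list can be passed to a job scheduler or to `subprocess.run` once the paths have been checked.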
Extracting information from the haplotypes
• Look up the allele frequency of a variant in the 1000 Genomes data
• Focus on a specific set of samples (e.g. only the European samples)
• Filter some positions
Extracting information from the haplotypes
VCFtools: http://vcftools.sourceforge.net/index.html
• A program package designed for working with VCF files
• It can validate, merge and compare VCF files, and calculate some basic population genetic statistics.
Extracting information from the haplotypes
• E.g.

    ./vcftools \
      --gzvcf ALL.chr1.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz \
      --freq \
      --out chr1
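To make clear what an allele frequency restricted to a set of samples means, here is a small Python sketch that counts alleles directly from VCF text. It is an illustration of the calculation, not a replacement for VCFtools: it assumes bi-allelic sites (every non-reference allele is counted as ALT), and the function name, sample names and toy records are invented for the example.

```python
from typing import Dict, Iterable, List, Optional, Set


def alt_allele_frequencies(vcf_lines: Iterable[str],
                           keep: Optional[Set[str]] = None) -> Dict[str, float]:
    """Return {"chrom:pos": ALT allele frequency} over the kept samples.

    `keep` is an optional set of sample names; None means all samples.
    Missing alleles (".") are skipped; any non-"0" allele counts as ALT.
    """
    freqs: Dict[str, float] = {}
    columns: List[int] = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Sample columns start after the nine fixed VCF columns.
            columns = [i for i, name in enumerate(fields)
                       if i >= 9 and (keep is None or name in keep)]
            continue
        gt_index = fields[8].split(":").index("GT")
        alt_count = 0
        total = 0
        for i in columns:
            genotype = fields[i].split(":")[gt_index]
            for allele in genotype.replace("/", "|").split("|"):
                if allele == ".":
                    continue
                total += 1
                if allele != "0":
                    alt_count += 1
        if total:
            freqs[f"{fields[0]}:{fields[1]}"] = alt_count / total
    return freqs


# Toy phased VCF with three hypothetical samples.
toy_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3",
    "1\t100\trs1\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1\t0|0",
    "1\t200\trs2\tC\tT\t.\tPASS\t.\tGT\t0|0\t0|1\t1|1",
]

all_freqs = alt_allele_frequencies(toy_vcf)
subset_freqs = alt_allele_frequencies(toy_vcf, keep={"S1", "S2"})
print(all_freqs)
print(subset_freqs)
```

The two results differ because the frequency depends on which samples contribute haplotypes, which is why restricting to one population group changes the answer.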
From Phase 1 to Phase 3
• ~2,500 samples
• Site detection: multiple methods are employed
• Types of variants: additional types are to be included (e.g. STRs, multi-allelic variants)
• Integrated haplotypes: haplotypes are to include SNPs, indels, complex variants and SVs
SNPs, indels, MNPs and multi-allelic variants
• Calling approaches: local assembly, alignment based, global assembly
• Callers: Freebayes, Haplotype Caller, Platypus, SNPTools, Unified Genotyper, samtools, RTG snp, GotCloud, SGA / DINDEL, Cortex
Summary (autosomes only, unfiltered set)
Adrian Tan, Hyun Min Kang, Goncalo Abecasis
Structural Variants
• Variant classes:
  • deletions (26k, length: 204 bp to 100 kb)
  • bi-allelic tandem and dispersed duplications
  • multi-allelic CNVs
  • balanced inversions
  • mobile element insertions
  • nuclear mitochondrial insertions
Jan Korbel
STRs
• Two methods are used for STR detection:
  • lobSTR
  • RepeatSeq
• ~1.5 million STRs detected
Gareth Highnam, Thomas Willems, David Mittelman, Yaniv Erlich
Phase 3 reference panel
• The pipeline for constructing the reference panel combines the microarray and sequencing data of the samples in the project.
• Genotype calling and phasing software used: SHAPEIT2 and MVNcall
Step 1: Create scaffold (figure labels: individual, microarray SNPs)
Step 2: Phase bi-allelic sites (figure labels: individual; bi-allelic SNPs, indels, structural variants)
Step 3: Phase multi-allelic variants (figure labels: individual; multi-allelic and other complex variants)
Downstream imputation experiment
Performance of haplotype sets for imputation
• Haplotype sets compared: July 2012 release (phase 1), new pipeline (phase 1), new pipeline (phase 3)
• Genotypes at chip SNPs are used to impute genotypes at SNPs not on the chip
• True genotypes: Complete Genomics high-coverage genotypes
• Compare R2 between imputed and true genotypes
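The comparison metric can be illustrated with a short Python sketch that computes the squared Pearson correlation (R2) between imputed allele dosages and true genotypes at one SNP. The dosage values below are invented, and the slides do not say exactly how R2 was computed or aggregated across SNPs, so this is only one plausible reading of the metric.

```python
from typing import Sequence


def imputation_r2(imputed: Sequence[float], truth: Sequence[float]) -> float:
    """Squared Pearson correlation between imputed dosages and true genotypes.

    Genotypes are coded as ALT allele counts (0, 1, 2); imputed values may be
    fractional dosages. Returns NaN if either vector has no variance.
    """
    n = len(imputed)
    if n == 0 or n != len(truth):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_i = sum(imputed) / n
    mean_t = sum(truth) / n
    cov = sum((x - mean_i) * (y - mean_t) for x, y in zip(imputed, truth))
    var_i = sum((x - mean_i) ** 2 for x in imputed)
    var_t = sum((y - mean_t) ** 2 for y in truth)
    if var_i == 0 or var_t == 0:
        return float("nan")
    return (cov * cov) / (var_i * var_t)


# Made-up example: five individuals at one SNP not on the chip.
true_genotypes = [0, 1, 2, 1, 0]
imputed_dosages = [0.1, 0.9, 1.8, 1.2, 0.0]

r2 = imputation_r2(imputed_dosages, true_genotypes)
print(round(r2, 3))
```

Values close to 1 indicate that the imputed dosages track the true genotypes well; in such an experiment the statistic would typically be summarised across many SNPs.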
Downstream imputation accuracy
[Plot comparing three haplotype sets: new pipeline phase 3, new pipeline phase 1, July 2012 phase 1]
Olivier Delaneau, Jonathan Marchini
Phase 3 reference panel
• Includes more samples from diverse populations
• The number of SNPs will increase (~75 million)
• Inclusion of different types of variants
• Higher haplotype accuracy from improved methods, which should lead to higher downstream imputation accuracy
• Timeline: Summer 2014