Omixon Workshops: Considerations for Analyzing Targeted NGS Data – HLA. Tim Hague, CEO
Introduction • Human leukocyte antigen (HLA) is the major histocompatibility complex (MHC) in humans • Group of genes ('superregion') on chromosome 6 • Essentially encodes cell-surface antigen-presenting proteins
Functions • HLA genes have functions in: • combating infectious diseases • graft/transplant rejection • autoimmunity • cancer
Alleles • Large number of alleles (and proteins) • Many alleles are already known • The number of known alleles is increasing
HLA Polymorphism
• HLA Class I
  Gene      A     B     C
  Alleles   2013  2605  1551
  Proteins  1448  1988  1119
• HLA Class II
  Gene      DRA  DRB*  DQA1  DQB1  DPA1  DPB1
  Alleles   7    1260  47    176   34    155
  Proteins  2    901   29    126   17    134
• HLA Class II - DRB Alleles
  Gene      DRB1  DRB3  DRB4  DRB5
  Alleles   1159  58    15    20
  Proteins  860   46    8     17
Analysis Challenges • HLA genes have specific analysis challenges regardless of the sequencing technology • HLA is the most polymorphic region of the human genome, and is difficult to analyze with any technique (including NGS) • Many repeated structures and pseudogenes • Some of the HLA genes have complex genetics • Difficult to find the appropriate reference genome • Phasing the heterozygous positions separated by more than one read length is problematic
High Polymorphism • High rate of polymorphism – up to 100 times the average human mutation rate • The HLA-DRB1 and HLA-B loci have the highest sequence variation rate within the human genome • High degree of heterozygosity – homozygotes are the exception in this region
Duplications • High level of segmental duplications • Lots of similar genes and lots of very similar pseudogenes • Duplicated segments can be more similar to each other within an individual than they are to the corresponding segments of the reference genome
Complex Genetics • Particularly HLA-DRB* • The DR β-chain is encoded by 4 loci; however, no more than 3 functional loci are present in a single individual, and no more than 2 per chromosome
Mitigating Factors • It's not all bad news: • Many HLA alleles are already well known – both in terms of sequence and frequencies within the population • The HLA region is fairly small, so there is a high degree of linkage disequilibrium, and therefore lots of known haplotypes
Traditional Typing • SSO (sequence-specific oligonucleotide probes) – low resolution, high throughput, cheap • SSP (sequence-specific primers) – very fast results, low resolution • SBT (sequence-based typing) – high resolution, usually done by Sanger sequencing
NGS Typing • High resolution, an alternative to Sanger-based SBT • Why is it needed?
Sanger and HLA • Sanger data is still the gold standard in the genomic sequencing industry, even though it is very expensive compared to NGS • Error rate of 1 in 1,000 bases; if both forward and reverse typing are done, the error rate drops to 1 in 1,000,000
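The drop from 1 in 1,000 to 1 in 1,000,000 is what one would expect if the forward and reverse reads err independently of each other. That independence is an assumption made here for illustration; the slide does not state how the figure was derived. A minimal sketch of the arithmetic (the function name `combined_error_rate` is hypothetical):

```python
from fractions import Fraction


def combined_error_rate(per_read_error: Fraction, n_reads: int = 2) -> Fraction:
    """Probability that n independent reads are all wrong at the same base.

    Assumes errors in the individual reads are statistically independent.
    """
    return per_read_error ** n_reads


single = Fraction(1, 1000)          # one sequencing direction
both = combined_error_rate(single)  # forward and reverse together

print(f"single direction:  1 in {single.denominator}")
print(f"forward + reverse: 1 in {both.denominator}")
```

Exact fractions are used instead of floats so that the "1 in N" figure can be read straight from the denominator.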
Phase Resolution • 2x chromosome 6 • Many loci, many alleles • Lots of heterozygosity
Allele Phasing Problem • [Figure: a reference sequence and a consensus sequence with two heterozygous positions, T/A and G/T; it is unclear which bases belong together on Allele 1 and which on Allele 2]
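The ambiguity can be made concrete by enumerating every way the unphased heterozygous positions could be split across two alleles. The sketch below is illustrative only: the function name and input format are hypothetical, and the returned strings contain just the heterozygous positions, not the full sequence.

```python
from itertools import product


def possible_phasings(het_sites):
    """List the distinct unordered allele pairs consistent with unphased sites.

    het_sites is a list of (base_a, base_b) tuples, one per heterozygous
    position, in sequence order.
    """
    seen = set()
    phasings = []
    for choice in product((0, 1), repeat=len(het_sites)):
        allele1 = "".join(site[c] for site, c in zip(het_sites, choice))
        allele2 = "".join(site[1 - c] for site, c in zip(het_sites, choice))
        pair = tuple(sorted((allele1, allele2)))  # allele order is irrelevant
        if pair not in seen:
            seen.add(pair)
            phasings.append(pair)
    return phasings


# The two heterozygous positions from the slide: T/A and G/T.
for allele1, allele2 in possible_phasings([("T", "A"), ("G", "T")]):
    print(allele1, "+", allele2)
```

With n heterozygous positions there are 2^(n-1) candidate pairs, so two positions already give two indistinguishable answers and each additional position doubles the count.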
The Problem with Sanger • There is only one signal • High degree of heterozygosity = high degree of ambiguity • Requires statistical techniques based on known allele frequencies, plus manual intervention by trained operators • Ambiguity can only be resolved statistically, which can lead to wrong assignment for rare types
HLA Typing by Sanger Method • Consensus sequence: GGACSGGRASACACGGAAWGTGAAGGCCCACTCACAGACTSACCGAGYGRACCTGGGGACCCTGCGCGGCTACTACAACCAGAGCGAGGMCGGT • [Figure: number of potential alleles]
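The letters other than A, C, G and T in the consensus are IUPAC ambiguity codes, each marking a position where the single Sanger signal shows two bases. A rough sketch of how quickly the phase ambiguity grows is given below. It counts only the theoretical upper bound of allele pairs, not the number of database alleles shown on the slide, and the names `IUPAC_TWO_BASE` and `ambiguity_summary` are hypothetical.

```python
# Standard IUPAC codes that stand for exactly two bases.
IUPAC_TWO_BASE = {
    "R": "AG", "Y": "CT", "S": "CG",
    "W": "AT", "K": "GT", "M": "AC",
}

SLIDE_CONSENSUS = (
    "GGACSGGRASACACGGAAWGTGAAGGCCCACTCACAGACTSACCGAGYGRACCTGGGGACC"
    "CTGCGCGGCTACTACAACCAGAGCGAGGMCGGT"
)


def ambiguity_summary(consensus):
    """Return (heterozygous positions, upper bound on unordered allele pairs)."""
    n_het = sum(1 for base in consensus if base in IUPAC_TWO_BASE)
    n_pairs = 2 ** (n_het - 1) if n_het else 1
    return n_het, n_pairs


n_het, n_pairs = ambiguity_summary(SLIDE_CONSENSUS)
print(f"{n_het} heterozygous positions, up to {n_pairs} allele pairs")
```

In practice most of these pairs do not correspond to known alleles, which is why the statistical narrowing described on the previous slide is possible at all.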
NGS Advantages • Can reduce ambiguity • Phase resolution - two signals, but lots of short reads • Cheaper and faster than Sanger • Less manual intervention required
NGS Approaches • HLA*IMP – chip-based imputation engine • Reference-based alignment, followed by an HLA call based on the variants detected during alignment • Search against a database of known alleles
NGS Reference-Based • Fraught with difficulties • Very hard to align reads to this region • The variant/HLA call is only as good as the alignment • No coverage = no call • Has been attempted by Broad Institute (HLA Caller) and Roche
Alignment Efforts RainDanceprovide a targeted HLA amplification kit call HLAseq Target: the whole MHC superregion (except for some tandem repeat regions) Goal: align this data, before doing variant/HLA call
Diverse Variant “Density” in the MHC Superregion • Based on a single sample
Default BWA Alignment • No coverage at an exon of HLA-DMB
Default BWA Alignment • Low coverage and orphaned reads at an HLA-DRB1 exon
BWA vs More Permissive Alignment • Higher Coverage = Higher Noise
Default BWA Alignment • Large targeted region without usable coverage
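Gaps like the ones on these slides can be located programmatically once per-base depth is available. The sketch below assumes a plain list of depths, one per target position. How that list is produced (for example, from an aligner's output) is outside its scope, and the function name and threshold are hypothetical.

```python
def low_coverage_regions(depths, min_depth=1):
    """Return half-open (start, end) intervals where depth < min_depth."""
    regions = []
    start = None
    for pos, depth in enumerate(depths):
        if depth < min_depth:
            if start is None:
                start = pos              # a new low-coverage run begins
        elif start is not None:
            regions.append((start, pos))  # the run ended just before pos
            start = None
    if start is not None:                 # run extends to the end of the target
        regions.append((start, len(depths)))
    return regions


# Toy depth profile, invented for illustration.
depths = [12, 9, 0, 0, 0, 4, 7, 0, 15]
print(low_coverage_regions(depths))               # positions with no reads
print(low_coverage_regions(depths, min_depth=5))  # positions below 5x
```

Raising `min_depth` turns "no coverage" into "no usable coverage", which is the stricter condition that matters for making a call.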
NGS Reference-Based • Not providing enough coverage everywhere • What about de novo?
De Novo Assembly (MIRA) • 287 contigs (longest contig: 2199 bp) • Mean contig size: 268 bp • Median contig size: 209 bp • Total consensus: 77084 bp • RainDance target: ~ 3800000 bp
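To put these numbers in proportion, the total consensus can be compared against the size of the target. The sketch below uses the two figures from the slide for that ratio. The contig length list passed to `contig_summary` is invented for illustration, since the individual contig lengths are not given, and both function names are hypothetical.

```python
from statistics import mean, median


def target_fraction(total_consensus_bp, target_bp):
    """Fraction of the target that the assembled consensus could cover at most."""
    return total_consensus_bp / target_bp


def contig_summary(lengths):
    """Basic assembly statistics of the kind reported on the slide."""
    return {
        "count": len(lengths),
        "longest": max(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "total": sum(lengths),
    }


# Figures taken from the slide: 77084 bp consensus, ~3800000 bp target.
fraction = target_fraction(77_084, 3_800_000)
print(f"consensus covers at most {fraction:.1%} of the target")

# Hypothetical contig lengths, only to show the summary function.
print(contig_summary([150, 209, 300, 2199]))
```

Even under the generous assumption that no contigs overlap, the assembly accounts for only about 2% of the targeted region.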
NGS De Novo Alignment • Not enough contigs produced, not enough coverage of the target region • What about a hybrid approach?
De Novo Assembly with “Backbone” • First, alignment to backbone, then de novo assembly • Backbone: 2220 contigs from HG19 chr 6 (sum: 3554852 bp) → almost the whole RainDance target • Results: • Max reads / backbone contig: 197 • Max coverage: 71
NGS Typing - Alignment Based • We tried: • Burrows-Wheeler Aligner (BWA) • A more sensitive seed-and-extend aligner • De novo assembler • 'Hybrid' de novo assembler • The variant/HLA call is only as good as the alignment • The alignments were not good enough
NGS Database Based • Search against a 'database' of known alleles • Such as the IMGT/HLA database, available from the EBI web site • Stanford, Connexio, JSI Medical, BC Cancer Agency and Omixon have all tried this approach
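The idea of searching reads against known alleles can be illustrated with a deliberately naive sketch: count, for each allele in a small in-memory database, how many reads occur in it exactly. This is not the method of any of the groups named above. The allele names and sequences are invented, and real tools also have to handle sequencing errors, reverse-complement reads and heterozygous pairs.

```python
def rank_alleles(reads, allele_db):
    """Rank alleles by the number of reads found in them as exact substrings.

    allele_db maps an allele name to its sequence. Returns a list of
    (name, supporting_read_count) tuples, best-supported first.
    """
    scores = {name: 0 for name in allele_db}
    for read in reads:
        for name, sequence in allele_db.items():
            if read in sequence:
                scores[name] += 1
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy database: two alleles differing at a single base.
allele_db = {
    "allele_X": "ACGTTGCAAGGCTTAACCGGTA",
    "allele_Y": "ACGTTGCATGGCTTAACCGGTA",
}
reads = ["TTGCAAGGC", "CTTAACCGG", "GCAAGGCTT", "TTGCATGGC"]

for name, support in rank_alleles(reads, allele_db):
    print(name, support)
```

Reads that match both alleles carry no discriminating information, so the ranking is driven entirely by reads spanning a position where the alleles differ.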
DB Based Approach • Advantages: • Fewer mapping headaches • Unambiguous results • Potential to be fast • Difficulties: • Novel allele detection • Homozygous alleles
Conclusions • DB based approach to HLA typing is new but very promising • NGS approaches can resolve much of the ambiguity of Sanger SBT • DB based approach can also overcome the limitations of NGS reference-based alignment