
Omixon Workshops




Presentation Transcript


  1. Omixon Workshops: Considerations for Analyzing Targeted NGS Data – HLA. Tim Hague, CEO

  2. Introduction • Human leukocyte antigen (HLA) is the major histocompatibility complex (MHC) in humans • Group of genes ('superregion') on chromosome 6 • Essentially encodes cell-surface antigen-presenting proteins

  3. Functions • HLA genes have functions in: • combating infectious diseases • graft/transplant rejection • autoimmunity • cancer

  4. Alleles • Large number of alleles (and proteins) • Many alleles are already known • The number of known alleles is increasing

  5. HLA Polymorphism

     HLA Class I
     Gene       A      B      C
     Alleles    2013   2605   1551
     Proteins   1448   1988   1119

     HLA Class II
     Gene       DRA   DRB*   DQA1   DQB1   DPA1   DPB1
     Alleles    7     1260   47     176    34     155
     Proteins   2     901    29     126    17     134

     HLA Class II - DRB Alleles
     Gene       DRB1   DRB3   DRB4   DRB5
     Alleles    1159   58     15     20
     Proteins   860    46     8      17

  6. Analysis Challenges • HLA genes have specific analysis challenges regardless of the sequencing technology • HLA is the most polymorphic region of the human genome, and is difficult to analyze with any technique (including NGS) • Many repeated structures and pseudogenes • Some of the HLA genes have complex genetics • Difficult to find the appropriate reference genome • Phasing the heterozygous positions separated by more than one read length is problematic

  7. High Polymorphism • High rate of polymorphism – up to 100 times the average human mutation rate • The HLA-DRB1 and HLA-B loci have the highest sequence variation rate within the human genome • High degree of heterozygosity – homozygotes are the exception in this region

  8. Classical Alignment to a Reference

  9. Duplications • High level of segmental duplications • Lots of similar genes and lots of very similar pseudogenes • Duplicated segments can be more similar to each other within an individual than they are to the corresponding segments of the reference genome

  10. Homology, Pseudogenes & Repeats

  11. Complex Genetics • Particularly HLA-DRB* • The DR β-chain is encoded by 4 loci; however, no more than 3 functional loci are present in a single individual, and no more than 2 per chromosome.

  12. HG19 Haplotypes at the HLA Region

  13. Mitigating Factors • It's not all bad news: • Many HLA alleles are already well known, both in terms of sequence and frequencies within the population • The HLA region is fairly small, so there is a high degree of linkage disequilibrium, and therefore lots of known haplotypes

  14. Traditional Typing • SSO – low resolution, high throughput, cheap • SSP – very fast results, low resolution • SBT – sequence-based typing, high resolution, usually done by Sanger sequencing

  15. NGS Typing • High resolution, an alternative to Sanger-based SBT • Why is it needed?

  16. Sanger and HLA • Sanger data is still the gold standard in the genomic sequencing industry, even though it is very expensive compared to NGS • Error rate of about 1 in 1,000 bases; if both forward and reverse typing are done, the error rate drops to about 1 in 1,000,000
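
The drop from 1 in 1,000 to 1 in 1,000,000 is what you would expect if an error survives only when the forward and reverse reads are both wrong at the same base. A minimal sketch of that arithmetic, assuming the two errors are independent (the slides do not state this assumption, and the function name is purely illustrative):

```python
def undetected_error_rate(p_forward: float, p_reverse: float) -> float:
    """Chance that both strands are wrong at the same base.

    Assumes forward and reverse errors are independent, so the
    probabilities multiply. This is an illustrative assumption.
    """
    return p_forward * p_reverse


# Per-strand error rate of 1 in 1,000, as quoted on the slide.
rate = undetected_error_rate(1e-3, 1e-3)
print(f"about 1 in {round(1 / rate):,} bases")
```

If errors on the two strands are correlated (for example, a systematic miscall in a difficult sequence context), the real rate would be higher than this product.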

  17. Phase Resolution • 2x chromosome 6 • Many loci, many alleles • Lots of heterozygosity

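  18. Allele Phasing Problem [Figure: a consensus sequence aligned to the reference sequence with two heterozygous positions, T/A and G/T. Two alternative assignments of these bases to Allele 1 and Allele 2 are shown, with no way to tell from the consensus alone which one is correct.]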
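
The phasing problem on this slide can be sketched in a few lines: given only the unphased bases at each heterozygous position, enumerate every pair of haplotypes that would produce the same consensus. The function name and input format are assumptions made for illustration.

```python
from itertools import product


def candidate_allele_pairs(het_sites):
    """List every unordered haplotype pair consistent with unphased data.

    het_sites: list of 2-tuples, the two bases seen at each heterozygous
    position, in sequence order. Only the heterozygous positions are
    modelled; homozygous bases between them are omitted.
    """
    pairs = set()
    for choice in product((0, 1), repeat=len(het_sites)):
        hap1 = "".join(site[c] for site, c in zip(het_sites, choice))
        hap2 = "".join(site[1 - c] for site, c in zip(het_sites, choice))
        # Sort so that (hap1, hap2) and (hap2, hap1) count as the same pair.
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)


# The two heterozygous positions from the slide: T/A and G/T.
print(candidate_allele_pairs([("T", "A"), ("G", "T")]))
```

With k heterozygous positions there are 2^(k-1) such pairs, so the ambiguity doubles with every additional heterozygous site.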

  19. The Problem with Sanger • There is only one signal • High degree of heterozygosity = high degree of ambiguity • Requires statistical techniques based on known allele frequencies, plus manual intervention by trained operators • Ambiguity can only be resolved statistically, which can lead to wrong assignment for rare types

  20. The Problem with Sanger

  21. HLA Typing by Sanger Method GGACSGGRASACACGGAAWGTGAAGGCCCACTCACAGACTSACCGAGYGRACCTGGGGACCCTGCGCGGCTACTACAACCAGAGCGAGGMCGGT Number of Potential Alleles
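
The letters S, R, W, Y and M in the sequence above are standard IUPAC ambiguity codes, each standing for two possible bases at a heterozygous position. The sketch below counts them and derives how many allele pairs could in principle produce this single mixed signal. It counts base combinations only and does not consult any allele database, so it is not the figure that the original slide displayed; the helper names are illustrative.

```python
# Two-base IUPAC ambiguity codes (standard nomenclature).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
}

SLIDE_SEQ = (
    "GGACSGGRASACACGGAAWGTGAAGGCCCACTCACAGACTSACCGAGYGR"
    "ACCTGGGGACCCTGCGCGGCTACTACAACCAGAGCGAGGMCGGT"
)


def ambiguous_positions(seq):
    """Return (index, code, possible bases) for every ambiguity code."""
    return [(i, b, IUPAC[b]) for i, b in enumerate(seq) if len(IUPAC[b]) > 1]


def count_allele_pairs(seq):
    """Unordered allele pairs consistent with the unphased sequence."""
    k = len(ambiguous_positions(seq))
    return 1 if k == 0 else 2 ** (k - 1)


print(len(ambiguous_positions(SLIDE_SEQ)), "ambiguous positions")
print(count_allele_pairs(SLIDE_SEQ), "possible allele pairs")
```

In practice most of these combinations do not correspond to known alleles, which is why the typing software narrows the list using a database and population frequencies.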

  22. HLA Typing by Sanger Method

  23. HLA Typing by Sanger Method

  24. NGS Advantages • Can reduce ambiguity • Phase resolution - two signals, but lots of short reads • Cheaper and faster than Sanger • Less manual intervention required
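
The "two signals" point can be illustrated as follows: each NGS read comes from a single chromosome, so a read that covers two heterozygous positions shows which bases travel together. This is a minimal sketch under simplifying assumptions (error-free reads, gapless alignment, hypothetical function name and input format).

```python
from collections import Counter


def linked_bases(reads, pos_a, pos_b):
    """Count base combinations seen together on single reads.

    reads: list of (start, sequence) tuples in reference coordinates.
    Only reads that span both positions carry phase information.
    """
    links = Counter()
    for start, seq in reads:
        end = start + len(seq)
        if start <= pos_a < end and start <= pos_b < end:
            links[(seq[pos_a - start], seq[pos_b - start])] += 1
    return links


# Toy reads over a 4-base region with heterozygous positions 0 and 3.
reads = [(0, "TCCG"), (0, "ACCT"), (1, "CCG"), (0, "TCCG")]
print(linked_bases(reads, 0, 3))
```

The read starting at position 1 does not reach position 0 and contributes nothing, which is the same limitation noted earlier for heterozygous positions separated by more than one read length.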

  25. NGS Data - Unphased

  26. NGS Data - Phased

  27. NGS Approaches • HLA*IMP – chip-based imputation engine • Reference-based alignment, followed by an HLA call based on the variants detected during alignment • Search against database of known alleles

  28. NGS Reference-Based • Fraught with difficulties • Very hard to align reads to this region • The variant/HLA call is only as good as the alignment • No coverage = no call • Has been attempted by the Broad Institute (HLA Caller) and Roche

  29. Alignment Efforts • RainDance provide a targeted HLA amplification kit called HLAseq • Target: the whole MHC superregion (except for some tandem repeat regions) • Goal: align this data before doing the variant/HLA call

  30. Diverse Variant “Density” in the MHC Superregion Based on a single sample

  31. Default BWA Alignment • No coverage at an exon of HLA-DMB

  32. Default BWA Alignment • Low coverage and orphaned reads at an HLA-DRB1 exon

  33. BWA vs More Permissive Alignment • Higher Coverage = Higher Noise

  34. Default BWA Alignment • Large targeted region without usable coverage

  35. NGS Reference-Based • Not providing enough coverage everywhere • What about de novo?

  36. De Novo Assembly (MIRA) • 287 contigs (longest contig: 2199 bp) • Mean contig size: 268 bp • Median contig size: 209 bp • Total consensus: 77084 bp • RainDance target: ~ 3800000 bp

  37. De Novo Assembly (MIRA)

  38. NGS De Novo Alignment • Not enough contigs produced, not enough coverage of the target region • What about a hybrid approach?

  39. De Novo Assembly with “Backbone” • First, alignment to backbone, then de novo assembly • Backbone: 2220 contigs from HG19 chr 6 (sum: 3554852 bp) → almost whole RainDance target • Results: • Max reads / backbone contig: 197 • Max coverage: 71
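
The figures quoted for the MIRA assembly and for the backbone can be put on the same scale by expressing each as a fraction of the RainDance target. A small sketch using only the numbers from the slides; note that the target size is given as approximate, so the percentages are approximate too.

```python
TARGET_BP = 3_800_000            # RainDance target, approximate
DE_NOVO_CONSENSUS_BP = 77_084    # total MIRA consensus
DE_NOVO_CONTIGS = 287
BACKBONE_BP = 3_554_852          # sum of the 2220 backbone contigs


def percent_of_target(bp, target_bp=TARGET_BP):
    """Size expressed as a percentage of the targeted region."""
    return 100.0 * bp / target_bp


de_novo_pct = percent_of_target(DE_NOVO_CONSENSUS_BP)
backbone_pct = percent_of_target(BACKBONE_BP)
mean_contig_bp = DE_NOVO_CONSENSUS_BP / DE_NOVO_CONTIGS

print(f"de novo consensus: {de_novo_pct:.1f}% of target")
print(f"backbone:          {backbone_pct:.1f}% of target")
print(f"mean contig size:  {mean_contig_bp:.1f} bp")
```

The pure de novo consensus spans only a few percent of the target, while the backbone spans most of it. The computed mean contig size is consistent with the 268 bp reported for the MIRA assembly.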

  40. De Novo Assembly with “Backbone”

  41. NGS Typing - Alignment Based • We tried: • Burrows-Wheeler aligner • More sensitive, seed-and-extend aligner • De novo aligner • 'Hybrid' de novo aligner • The variant/HLA call is only as good as the alignment • The alignments were not good enough

  42. NGS Database Based • Search against 'database' of known alleles • Such as the IMGT/HLA database, available from the EBI web site • Stanford, Connexio, JSI Medical, BC Cancer Agency and Omixon have all tried this approach
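
The idea of searching reads against a database of known alleles, rather than aligning to one reference, can be sketched with a toy example. Everything here is invented for illustration: the allele names and sequences are not real IMGT/HLA entries, and exact substring matching stands in for the mismatch-tolerant alignment a real tool would need. It is not a description of any of the tools named above.

```python
from itertools import combinations_with_replacement

# Hypothetical toy "database"; names and sequences are made up.
TOY_DB = {
    "toy*01": "ACGTACGTTAGC",
    "toy*02": "ACGTTCGTTAGC",
    "toy*03": "ACGAACGTTGGC",
}


def reads_explained(reads, sequences):
    """Count reads found as an exact substring of at least one sequence."""
    return sum(any(read in seq for seq in sequences) for read in reads)


def rank_allele_pairs(reads, db):
    """Score every unordered allele pair, homozygous pairs included."""
    scored = [
        ((a, b), reads_explained(reads, (db[a], db[b])))
        for a, b in combinations_with_replacement(sorted(db), 2)
    ]
    # Best score first; ties broken alphabetically for stable output.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


READS = ["GTACG", "TTAGC", "CGAAC", "TTGGC"]
RANKED = rank_allele_pairs(READS, TOY_DB)
for pair, score in RANKED:
    print(pair, score)
```

Two limitations are visible even in this toy version: an allele missing from the database can never be reported, and a homozygous pair cannot score higher than a heterozygous pair that contains the same allele, so extra evidence is needed to tell them apart.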

  43. IMGT/HLA Database

  44. DB Based Approach • Advantages: • Fewer mapping headaches • Unambiguous results • Potential to be fast • Difficulties: • Novel allele detection • Homozygous alleles

  45. HLA Genotyping with NGS – R.454 Reads

  46. HLA Genotyping with NGS – Illumina Reads

  47. Results with Exome Data

  48. Exon Level Detail

  49. Detailed Results - Short Read Pileup

  50. Conclusions • DB based approach to HLA typing is new but very promising • NGS approaches can resolve much of the ambiguity of Sanger SBT • DB based approach can also overcome the limitations of NGS reference-based alignment
