Leveraging Haplotype-Based Genomic Selection for Precision Plant Breeding

Seminar-IV Maruthi Prasad B P III PhD PAMB 1066 Dept. of GPB, UASB

POPULATION INCREASE!!!! BIOTIC AND ABIOTIC STRESSES!!!! CLIMATE CHANGE!!!!

Genetic Variations Trait Improvement Environmental Resilience Efficient Breeding

• Conventional breeding has achieved great success in developing high-yielding crop varieties
• It is important to accelerate the pace of crop improvement programmes, especially for complex traits such as yield under stress conditions (Varshney et al., 2005)

SNP-based genomic selection (figure labels: crop germplasm, genotyping, phenotyping, SNP analysis, SNP-based GS, improved variety)

Single Nucleotide Polymorphism
• A single nucleotide polymorphism (SNP, pronounced "snip") is a genetic variation in which a single nucleotide (A, T, C, or G) is altered and the change is maintained through heredity
• SNP: single DNA base variation found in >1% of the population
• Mutation: single DNA base variation found in <1% of the population
• Figure example (CTTAGCTT vs. CTTAGTTT, C to T at the sixth base): a variant carried by 6% of individuals (94% carry the common base) is a SNP; a variant carried by 0.1% (99.9% carry the common base) is a mutation
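The 1% rule above can be written as a one-line check. This is only a minimal sketch: the function name `classify_variant` is hypothetical, and treating a frequency of exactly 1% as a mutation is an assumption, since the slide defines only ">1%" and "<1%".

```python
def classify_variant(variant_count, total_alleles, threshold=0.01):
    """Label a single-base variant by its frequency in the sampled population.

    Assumption: a frequency exactly equal to the threshold is called a mutation.
    """
    if total_alleles <= 0:
        raise ValueError("total_alleles must be positive")
    freq = variant_count / total_alleles
    return "SNP" if freq > threshold else "mutation"


# The two cases from the slide figure: 6% and 0.1% variant frequency.
print(classify_variant(6, 100))
print(classify_variant(1, 1000))
```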

Mutations and SNPs (figure: timeline from a common ancestor to the present; the observed genetic variations comprise mutations and SNPs)

Single Nucleotide Polymorphism
• SNPs are found in coding and (mostly) noncoding regions
• They occur at a very high frequency: about 1 in 1000 bases to 1 in 100 to 300 bases
• SNP genotyping is easily automated
• A SNP is usually assumed to be a binary (bi-allelic) variable, because the probability of a repeat mutation at the same SNP locus is quite small
• SNPs close to a particular gene can act as markers for that gene
• SNPs have become the preferred markers for association studies because of their high abundance and the availability of high-throughput SNP genotyping technologies

Limitations of using SNPs directly in GS
• Reduced power of detection: the bi-allelic nature and low polymorphism information content (PIC) of SNPs reduce their power to detect marker-trait associations (MTAs) accurately
• Undetected epistatic interaction: SNPs typically focus on single locus-trait associations and neglect interactions between different genetic loci (epistasis), although epistatic interactions play a crucial role in complex trait inheritance
• High chances of false positives and negatives
• Low accuracy in GEBV prediction: inaccurate marker-trait associations and limited power lead to low accuracy in predicting GEBVs for genotype selection
• Missing desirable rare alleles: this results in missed associations and reduced efficiency in selecting individuals with desired traits
• Together, these limitations lead to slower genetic gain and inefficient breeding programs
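The "low PIC value" point can be made concrete with a small calculation. The slide does not give a formula; the sketch below assumes the commonly used definition PIC = 1 − Σ p_i² − Σ_{i<j} 2 p_i² p_j², and the function name `pic` is hypothetical. Under that assumption a bi-allelic SNP cannot exceed 0.375, whereas a multi-allelic haplotype block can go higher.

```python
from itertools import combinations


def pic(allele_freqs):
    """Polymorphism information content for one marker (assumed standard formula)."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    homozygosity = sum(p * p for p in allele_freqs)
    cross_term = sum(2 * (p * p) * (q * q) for p, q in combinations(allele_freqs, 2))
    return 1.0 - homozygosity - cross_term


# A bi-allelic SNP at its most informative (two alleles at 50% each).
print(pic([0.5, 0.5]))
# A haplotype block with four equally frequent haplotype alleles.
print(pic([0.25, 0.25, 0.25, 0.25]))
```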


Outline of Presentation
01 Introduction
• Genomic selection using haplotypes
02 Haplotype-based GS
03 Haplotype and tSNP selection methods
04 Advantages of haplotype-based GS
05 Case studies
06 Conclusion

Genomic selection?
• Concept introduced by Haley and Visscher at the 6th World Congress on Genetics Applied to Livestock Production, Armidale, Australia, in 1998
• Term GS: Meuwissen et al., 2001
• A specialized form of MAS
• A predictive rather than a design approach, likely to be effective for genetic improvement of traits controlled by a large number of small-effect QTLs

Process of Genomic Selection (figure, Mahantesh et al., 2022)
• Training population: genotyping and phenotyping, then training of the genomic selection model with cross validation
• Breeding population: genotyping, GEBV estimation, then selection of individuals

Steps involved in Haplotype based Genomic Selection
1. Development of training population & Phenotyping
2. Genotyping Training and Breeding population
3. Haplotype block construction and Tag SNP selection
4. Statistical model identification
5. Cross validation
6. Estimation of GEBVs
7. Selection of individuals
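Training a model, estimating GEBVs and selecting individuals can be sketched on toy data. The slides do not name a statistical model, so ridge regression on marker dosages is swapped in here purely as an illustrative assumption; the function names, the toy matrices and the penalty `lam` are all hypothetical. Columns could equally be SNP dosages or haplotype-allele dosages.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def train_ridge(Z, y, lam=1.0):
    """Estimate marker effects from (Zc'Zc + lam*I) beta = Zc'(y - mean(y)); lam must be > 0."""
    n, p = len(Z), len(Z[0])
    mu = sum(y) / n
    means = [sum(row[j] for row in Z) / n for j in range(p)]
    Zc = [[row[j] - means[j] for j in range(p)] for row in Z]  # centred markers
    A = [[sum(r[j] * r[k] for r in Zc) + (lam if j == k else 0.0)
          for k in range(p)] for j in range(p)]
    rhs = [sum(Zc[i][j] * (y[i] - mu) for i in range(n)) for j in range(p)]
    return means, solve(A, rhs)


def predict_gebv(model, Z_new):
    """GEBV as the sum of estimated marker effects (deviation from the training mean)."""
    means, beta = model
    return [sum((row[j] - means[j]) * beta[j] for j in range(len(beta)))
            for row in Z_new]


# Toy training population: 6 phenotyped and genotyped lines, 3 markers (dosage 0/1/2).
Z_train = [[0, 1, 2], [1, 0, 2], [2, 1, 0], [0, 2, 1], [1, 1, 1], [2, 0, 0]]
y_train = [8.5, 9.0, 12.5, 10.0, 10.5, 12.0]
# Toy breeding population: genotyped only.
Z_breed = [[2, 2, 0], [0, 0, 2], [1, 1, 1]]

model = train_ridge(Z_train, y_train, lam=1.0)
gebv = predict_gebv(model, Z_breed)
ranking = sorted(range(len(gebv)), key=lambda i: gebv[i], reverse=True)
print(ranking)  # candidate indices ordered from highest to lowest GEBV
```

Cross validation (step 5) is left out of this sketch; in practice the training population would be split repeatedly to check prediction accuracy before the model is applied to the breeding population.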

Development of training population: germplasm set

Development of training population: designed populations (DH, RILs, MAGIC, NAM)

Genotyping Training and Breeding population

Genotyping Training and Breeding population
Marker types, depending on detection method and throughput:
• Low-throughput, hybridization-based markers: RFLPs
• Medium-throughput, PCR-based markers: RAPD, AFLP, SSRs
• High-throughput (HTP), sequence-based markers: SNPs
• Dominant markers give lower accuracy of GEBV prediction than co-dominant markers
• Dense marker coverage is needed to maximize the number of QTL captured by the markers
• SNPs are one such marker system, with high throughput and genome-wide distribution

Genotyping Training and Breeding population
• Third generation sequencing: alleviating the bottlenecks in haplotype identification (figure: NGS technology vs. TGS technology)

Genotypes
(Figure: haplotype data vs. genotype data at SNP1 and SNP2)
• Haplotype data are not easy to obtain because an individual's genome is diploid
• To obtain haplotype data, the two haplotypes have to be separated first
• In large sequencing projects, genotypes rather than haplotypes are collected because of cost considerations

Problems of Genotypes
(Figure: the same genotype data at SNP1 and SNP2 fit two alternative haplotype pairs; we don't know which haplotype pair is real)
• Genotypes only tell us the alleles at each SNP locus
• But we don't know the connection between alleles at different SNP loci
• There could be several possible haplotypes for the same genotype

Haplotype
(Figure: conserved haplotypes in regions of low recombination, separated by a region of high recombination)
• A haplotype is a group of genes in an organism that are inherited together from a single parent in a defined order (Bevan et al., 2017)
• These variants tend to be inherited together, often because they lie very close together in the same chromosome region and are therefore less likely to be separated by crossing over (Snowdon et al., 2015)

Haplotypes
(Figure: three haplotypes read at SNP1, SNP2 and SNP3: -ACTTAGCTT- gives C A T, -ACTTTGCTC- gives C T C, and -AATTTGCTC- gives A T C)
• In terms of SNPs: "Two or more SNP alleles that tend to be inherited as a unit" (Bernardo, 2010)
• A haplotype stands for a set of linked SNPs on the same chromosome that are not easily separable by recombination
• Within each block, recombination is rare due to tight linkage

HapMap
(Figure: haplotype patterns P1 to P4 across SNP loci S1 to S12 on a chromosome; haplotype blocks are separated by recombination hotspots, and major and minor alleles are marked)
• The HapMap is a map of the haplotype blocks and the specific SNPs that identify the haplotypes
• The haplotype map, or "HapMap", acts as a tool to find genes and genetic variations that affect trait expression

Haplotype blocking Methods (Weber et al., 2023)
1. LD threshold
2. Fixed windows of adjacent markers
3. Fixed windows of adjacent base pairs
4. HaploBlocker
5. Confidence interval test or Gabriel algorithm (GAI)
6. Four gamete test (GAM)
7. Solid spine of linkage disequilibrium (SPI)

1. LD threshold
• Haplotype blocks are built by identifying pairs of neighboring markers that exhibit a level of linkage disequilibrium (LD) above a specified threshold (0.01 to 1), e.g., LD threshold = 0.8
• A marker pair that fails to cross the threshold is not added to the block
• A tolerance parameter of 1 is used: one marker that does not fulfil the LD threshold is accepted if the next flanking marker fulfils the LD criterion
(Figure: markers A to F; labels: highest LD, next marker pair with LD above threshold)
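A minimal sketch of LD-threshold blocking, under several assumptions: LD is measured as r² on phased 0/1 haplotypes, blocks are grown in a simple left-to-right scan (the slide figure appears to start from the highest-LD pair instead), and the tolerance counter is reset after every accepted marker. The names `r_squared` and `ld_blocks` are hypothetical.

```python
def r_squared(a, b):
    """r^2 between two bi-allelic SNPs scored 0/1 on the same phased haplotypes."""
    n = len(a)
    pa, pb = sum(a) / n, sum(b) / n
    pab = sum(1 for x, y in zip(a, b) if x == 1 and y == 1) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    d = pab - pa * pb
    return d * d / denom


def ld_blocks(snps, threshold=0.8, tolerance=1):
    """Group consecutive SNP indices into blocks by LD with the last accepted marker."""
    blocks, i, n = [], 0, len(snps)
    while i < n:
        block, last, skipped = [i], i, 0
        j = i + 1
        while j < n:
            if r_squared(snps[last], snps[j]) >= threshold:
                block.extend(range(last + 1, j + 1))  # also takes in tolerated markers
                last, skipped = j, 0
            else:
                skipped += 1
                if skipped > tolerance:
                    break
            j += 1
        blocks.append(block)
        i = last + 1
    return blocks


# Six toy SNPs scored on eight phased haplotypes (hypothetical data).
s0 = [0, 0, 0, 0, 1, 1, 1, 1]
s1 = [0, 0, 0, 0, 1, 1, 1, 1]
s2 = [0, 1, 0, 1, 0, 1, 0, 1]
s3 = [0, 0, 0, 0, 1, 1, 1, 1]
s4 = [0, 1, 0, 1, 0, 1, 0, 1]
s5 = [0, 0, 1, 1, 0, 0, 1, 1]
print(ld_blocks([s0, s1, s2, s3, s4, s5], threshold=0.8, tolerance=1))
```

With a tolerance of 1, the third SNP is absorbed into the first block because the SNP after it is again in strong LD; with a tolerance of 0 it would end the block.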

2. Fixed windows of adjacent markers
• In each chromosome, haplotype blocks of m neighboring markers are constructed until all markers on the chromosome have been assigned to blocks (e.g., m = 5 gives consecutive blocks of 5 SNPs)
• In the most extreme case, a single haplotype block contains all markers of the chromosome
• Blocks of such large size are useful where entire chromosomes or large segments play an important role in traits, and in scenarios related to introgression breeding, where recombination is limited

3. Fixed windows of adjacent base pairs • In each chromosome, haplotype blocks of m consecutive base pairs will be constructed until the whole chromosome is partitioned into blocks • Similar to the adjacent markers approach
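Both fixed-window approaches reduce to simple partitioning. The sketch below is illustrative only: the function names and the toy positions are hypothetical, and allowing a shorter final marker window (and skipping base-pair windows that contain no marker) are assumptions not stated on the slides.

```python
def fixed_marker_windows(marker_indices, m):
    """Blocks of m consecutive markers; the last block may be shorter (assumption)."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return [marker_indices[i:i + m] for i in range(0, len(marker_indices), m)]


def fixed_bp_windows(positions, window_bp):
    """Blocks of markers falling in the same window of window_bp consecutive base pairs."""
    if window_bp < 1:
        raise ValueError("window_bp must be at least 1")
    blocks = {}
    for idx, pos in enumerate(positions):
        blocks.setdefault(pos // window_bp, []).append(idx)
    return [blocks[k] for k in sorted(blocks)]  # empty windows yield no block


# Twelve markers split into windows of m = 5 markers.
print(fixed_marker_windows(list(range(12)), 5))
# Six markers at hypothetical base-pair positions, 1000 bp windows.
print(fixed_bp_windows([120, 950, 1800, 2100, 2500, 7400], 1000))
```

The marker-based version gives blocks with equal marker counts but unequal physical length; the base-pair version gives the opposite.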

4. HaploBlocker
• It relies on linkage instead of linkage disequilibrium to construct haplotype blocks
• Blocks are defined as consecutive sequences of genetic markers with a predefined minimum frequency
• R package "HaploBlocker" (Pook et al., 2019)

5. Confidence interval test or Gabriel algorithm (GAI)
• The reason for allowing <5% of weak LD within a haplotype block is that forces such as recurrent mutation, gene conversion, or errors in genome assembly or genotyping, in addition to recombination events, can disrupt LD (Saad et al., 2018)

6. Four gamete test (GAM)
• Haplotype block partitioning method that assumes recombination events are not allowed within a block
• Three-gamete condition: three gametes = no recombination = haplotype block
• Four-gamete condition: four gametes = a recombination event has occurred = no blocking
• A rare gamete must have a frequency > 0.01 to count as a recombination event
• Recombination events are only accepted between blocks
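The four gamete rule can be sketched for phased 0/1 haplotypes as follows. This is a simplified illustration, not the reference implementation: the function names are hypothetical, and closing a block as soon as a new SNP shows four gametes with any SNP already in the block is an assumed reading of "no recombination within a block".

```python
from collections import Counter


def gametes_observed(a, b, min_freq=0.01):
    """Two-locus gametes whose frequency exceeds min_freq."""
    n = len(a)
    counts = Counter(zip(a, b))
    return {g for g, c in counts.items() if c / n > min_freq}


def four_gamete_blocks(snps, min_freq=0.01):
    """Extend a block until a new SNP shows all four gametes with a SNP in the block."""
    if not snps:
        return []
    blocks, current = [], [0]
    for j in range(1, len(snps)):
        recombined = any(
            len(gametes_observed(snps[i], snps[j], min_freq)) == 4 for i in current
        )
        if recombined:
            blocks.append(current)
            current = [j]
        else:
            current.append(j)
    blocks.append(current)
    return blocks


# Three toy SNPs on six phased haplotypes (hypothetical data).
a = [0, 0, 1, 1, 0, 1]
b = [0, 0, 1, 1, 0, 1]
c = [0, 1, 0, 1, 0, 1]
print(four_gamete_blocks([a, b, c]))
```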

7. Solid spine of linkage disequilibrium (SPI)
• Strong LD is observed between the first SNP and the last SNP of the block, and with all the intermediate SNPs
• This is the scenario where a SNP marker exhibits strong and consistent associations with the surrounding SNPs, indicating the presence of a stable haplotype block
• The solid spine is a line of strong LD (>0.8) that runs from one marker to the next along the legs of the LD triangle and defines a particular haplotype

Haplotype Inference/Phasing: diploids vs. haploids (figure: chromosomes Chr1 and Chr2 in a diploid cell and in a haploid cell)

Haplotype Inference/Phasing: homozygous vs. heterozygous (figure: Chr1 and Chr2 in the heterozygous and the homozygous state)

Haplotype Inference/Phasing: Problem of Phase
• Observed genotype: SNP1 G/T, SNP2 A/C
• Possible haplotype pairs: (GA, TC) or (GC, TA)
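The phase problem can be reproduced by listing every haplotype pair consistent with an unphased genotype. A minimal sketch; the function name `possible_phasings` and the input format (one unordered allele pair per SNP) are assumptions made for illustration.

```python
from itertools import product


def possible_phasings(genotype):
    """All distinct haplotype pairs compatible with an unphased genotype.

    genotype: list of (allele, allele) tuples, one per SNP.
    """
    het = [i for i, (a, b) in enumerate(genotype) if a != b]
    pairs = set()
    # The first heterozygous site is held fixed so mirror-image pairs are not repeated.
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        flip_at = dict(zip(het[1:], flips))
        h1, h2 = "", ""
        for i, (a, b) in enumerate(genotype):
            if flip_at.get(i, False):
                a, b = b, a
            h1 += a
            h2 += b
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


# The example from the slide: SNP1 is G/T and SNP2 is A/C.
print(possible_phasings([("G", "T"), ("A", "C")]))
```

With k heterozygous SNPs there are 2^(k-1) candidate pairs, which is why phase cannot simply be read off from genotype data.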

Haplotype Inference/Phasing
• Not all allele combinations will be present in individuals
• How to resolve this problem of phase?

Haplotype Inference/Phasing
• The problem of inferring the haplotypes from a set of genotypes is called haplotype inference
• Most combinatorial methods use the maximum parsimony model to solve this problem
• This model assumes that the number of distinct haplotypes present in a natural population is small
• The solution is a minimum set of haplotypes that can explain the given genotypes

Maximum Parsimony
• Find a minimum set of haplotypes to explain the given genotypes
(Figure: two genotypes, G1 and G2, at SNP1 and SNP2, with alternative haplotype sets drawn from h1 to h4 that explain them)
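For very small inputs, the maximum parsimony idea can be illustrated by brute force. This is only a sketch: the 0/1/2 genotype coding ("2" = heterozygous), the function names and the toy genotypes are assumptions, and the exhaustive search grows exponentially, so real phasing tools use far more efficient methods.

```python
from itertools import product


def compatible_pairs(genotype):
    """Haplotype pairs that explain one genotype string over '0', '1', '2'."""
    het = [i for i, g in enumerate(genotype) if g == "2"]
    pairs = set()
    for bits in product("01", repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for i, bit in zip(het, bits):
            h1[i] = bit
            h2[i] = "1" if bit == "0" else "0"
        pairs.add(tuple(sorted(("".join(h1), "".join(h2)))))
    return sorted(pairs)


def max_parsimony(genotypes):
    """Try every combination of explanations and keep the one using fewest haplotypes."""
    best_haps, best_choice = None, None
    for choice in product(*(compatible_pairs(g) for g in genotypes)):
        used = {h for pair in choice for h in pair}
        if best_haps is None or len(used) < len(best_haps):
            best_haps, best_choice = used, choice
    return sorted(best_haps), list(best_choice)


# Three toy genotypes at two SNPs; the first one is ambiguous on its own.
haplotypes, phasing = max_parsimony(["22", "02", "20"])
print(haplotypes)
print(phasing)
```

Here the ambiguous genotype "22" is resolved as 01/10 rather than 00/11, because that choice reuses haplotypes already needed for the other two individuals.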

Different phasing methods for haplotype construction/reconstruction
• Reference-based phasing
• De novo genome assembly (such as diploid and polyploid assembly)
• Strain-resolved metagenome assembly (de novo re-assembly, single nucleotide variant-based assembly, read and contig binning)

Problems of Genotyping all SNPs of Haplotype for Association Studies
• There are millions of SNPs in a plant genome
• The number of SNPs is still too large to be used for genomic selection
• To reduce the SNP genotyping cost, we wish to use as few SNPs as possible for association studies
• Tag SNPs are a small subset of SNPs that is sufficient for performing association studies without losing the power of using all SNPs
• Pre-selection of SNPs via LD-based haplotype-tagging could play a vital role in optimizing genomic selection and reducing genotyping costs
• LD-based haplotyping with subsequent tag SNP selection improved genomic prediction accuracy by up to 0.07 and 0.092 for Fusarium head blight resistance and spike width, respectively, across six different models

Examples of Tag SNPs
(Figure: haplotype patterns P1 to P4 across SNP loci S1 to S12, with major and minor alleles marked, and an unknown haplotype sample)
• Suppose we wish to distinguish an unknown haplotype sample
• We can genotype all SNPs to identify the haplotype sample

Examples of Tag SNPs
(Figure: haplotype patterns P1 to P4 across SNP loci S1 to S12, and the same patterns reduced to S3, S4 and S5)
• In fact, it is not necessary to genotype all SNPs
• SNPs S3, S4, and S5 can form a set of tag SNPs

Examples of Wrong Tag SNPs
(Figure: haplotype patterns P1 to P4 across SNP loci S1 to S12, and the same patterns reduced to S1, S2 and S3)
• SNPs S1, S2, and S3 cannot form a set of tag SNPs because P1 and P4 will be ambiguous

Examples of Tag SNPs
(Figure: haplotype patterns P1 to P4 across SNP loci S1 to S12, and the same patterns reduced to S1 and S12)
• SNPs S1 and S12 can form a set of tag SNPs
• This set of SNPs is the minimum solution in this example

Steps for ‘tag SNP’ selection (Halldorsson et al., 2004)
(1) Determining predictive neighborhoods
(2) Minimizing the number of tagging SNPs
(3) Tagging quality assessment

Haplotype Blocks and Tag SNPs
• Recent studies have shown that the chromosome can be partitioned into haplotype blocks interspersed by recombination hotspots
• Within a haplotype block, little or no recombination has occurred
• The SNPs within a haplotype block tend to be inherited together
• Within a haplotype block, a small subset of SNPs (called tag SNPs) is sufficient to distinguish each pair of haplotype patterns in the block
• We only need to genotype tag SNPs instead of all SNPs within a haplotype block

Problem Formulation
• The relation between SNPs and haplotype patterns can be formulated as a bipartite graph, with SNPs S1 to S4 on one side and pairs of patterns on the other
• With four patterns P1 to P4 there are six pairs of patterns: (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)
• S1 can distinguish (P1, P3), (P1, P4), (P2, P3), and (P2, P4)
• S2 can distinguish (P1, P4), (P2, P4), and (P3, P4)

Observation
• The SNPs can form a set of tag SNPs if each pair of patterns, (1,2), (1,3), (1,4), (2,3), (2,4), (3,4), is connected by at least one edge
• e.g., S1 and S3 can form a set of tag SNPs
• e.g., S1 and S2 cannot be tag SNPs
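The "every pair of patterns must be connected by at least one edge" condition is a set-cover check, sketched here by exhaustive search. The allele vectors for S1 and S2 are chosen to match the pairs listed on the slides; the vector for S3 is a hypothetical assumption (the slides do not show it), S4 is omitted, and the function names are illustrative.

```python
from itertools import combinations


def distinguishes(alleles):
    """Pattern pairs (0-based indices) that one SNP separates."""
    return {(i, j) for i, j in combinations(range(len(alleles)), 2)
            if alleles[i] != alleles[j]}


def is_tag_set(snps, chosen):
    """True if every pair of haplotype patterns is separated by some chosen SNP."""
    n_patterns = len(next(iter(snps.values())))
    all_pairs = set(combinations(range(n_patterns), 2))
    covered = set().union(*(distinguishes(snps[s]) for s in chosen))
    return covered == all_pairs


def minimum_tag_snps(snps):
    """Smallest SNP subset that is a tag set, by brute force over subset sizes."""
    names = sorted(snps)
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            if is_tag_set(snps, combo):
                return list(combo)
    return None


# Alleles (0 = major, 1 = minor) of each SNP in patterns P1..P4.
snps = {
    "S1": [0, 0, 1, 1],  # separates (P1,P3), (P1,P4), (P2,P3), (P2,P4)
    "S2": [0, 0, 0, 1],  # separates (P1,P4), (P2,P4), (P3,P4)
    "S3": [0, 1, 0, 1],  # assumed pattern, not taken from the slides
}
print(is_tag_set(snps, ["S1", "S3"]))
print(is_tag_set(snps, ["S1", "S2"]))
print(minimum_tag_snps(snps))
```

Exhaustive search is only practical for a handful of SNPs per block; it is used here because it makes the minimum-set idea easy to follow.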

Methods to select tSNPs
• Based on Principal Component Analysis (PCA) to reduce the dimensions of complete sets of SNPs
• Step 1: compute the covariance matrix of the SNPs
• Step 2: perform principal components analysis
• Step 3: SNPs that contribute most to the eigenvectors associated with the largest eigenvalues are considered more influential
• Step 4: the selected SNPs are added to the set of tagging SNPs
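The PCA-based flow can be sketched in plain Python under strong simplifications: only the first principal component is used, it is found by power iteration, and SNPs are ranked by the absolute size of their loading. The function names, the toy dosage matrix and the fixed iteration count are all assumptions for illustration.

```python
def covariance_matrix(X):
    """Sample covariance between SNP columns of X (rows = individuals)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[sum((row[j] - means[j]) * (row[k] - means[k]) for row in X) / (n - 1)
             for k in range(p)] for j in range(p)]


def leading_eigenvector(C, iters=500):
    """Eigenvector of the largest eigenvalue of C, by power iteration."""
    p = len(C)
    v = [1.0 / (i + 1) for i in range(p)]  # arbitrary deterministic start vector
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            return v
        v = [x / norm for x in w]
    return v


def pca_tag_snps(X, k):
    """Indices of the k SNPs with the largest absolute loading on the first component."""
    loadings = leading_eigenvector(covariance_matrix(X))
    ranked = sorted(range(len(loadings)), key=lambda j: abs(loadings[j]), reverse=True)
    return sorted(ranked[:k])


# Toy dosage matrix: 6 individuals x 4 SNPs (the last SNP is monomorphic).
X = [
    [0, 0, 1, 2],
    [2, 2, 1, 2],
    [0, 0, 1, 2],
    [2, 2, 1, 2],
    [0, 0, 0, 2],
    [2, 2, 1, 2],
]
print(pca_tag_snps(X, 2))
```

Ranking on a single component tends to pick SNPs that are correlated with each other; a fuller method would use several components and guard against redundant picks.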
