Haplotype led approaches in Plant breeding
Seminar-II Maruthi Prasad B P II PhD PAMB 1066 Dept. of GPB, UASB
POPULATION INCREASE!!!! Biotic and Abiotic stresses!!!! CLIMATE CHANGE!!!!
• Conventional breeding has been highly successful in developing high-yielding crop varieties • It is important to accelerate the pace of crop improvement programmes, especially for complex traits such as yield under stress conditions (Varshney et al., 2005)
Genetic Variations Trait Improvement Environmental Resilience Efficient Breeding
Marker allelic variations within the genome of the same species: 1. Single nucleotide polymorphisms (SNPs) 2. Segmental/nucleotide insertions/deletions (InDels) 3. Differences in the number of tandem repeats at a locus (SSRs) (Mammadov et al., 2012)
SSR (both strands shown; the number of AC repeats differs among alleles):
ACTGTCGACACACACACACGCTAGCT / TGACAGCTGTGTGTGTGTGCGATCGA
ACTGTCGACACACACACACACACACGCTAGCT / TGACAGCTGTGTGTGTGTGTGTGTGCGATCGA
ACTGTCGACACACACACACACACACACACACACGCTAGCT / TGACAGCTGTGTGTGTGTGTGTGTGTGTGTGTGCGATCGA
InDel (both strands shown; dashes mark the deleted segment):
CATCGCGAATTCCCATCG / GTAGCGCTTAAGGGTAGC
CATCG----------------CATCG / GTAGC----------------GTAGC
SNP (both strands shown; one base pair differs):
GAATTC / CTTAAG
GAACTC / CTTGAG
Targeting genetic variants associated with agronomic traits and identifying important underlying candidate genes have become a key area in crop genetic research. Markers can be grouped by detection method and throughput: • Low-throughput, hybridization-based markers: RFLPs • Medium-throughput, PCR-based markers: RAPD, AFLP, SSRs • High-throughput (HTP) sequence-based markers: SNPs
Single Nucleotide Polymorphism • A single nucleotide polymorphism (SNP, pronounced “snip”) is a genetic variation in which a single nucleotide (i.e., A, T, C, or G) is altered and the change is passed on through heredity. • SNP: single DNA base variation found in >1% of the population • Mutation: single DNA base variation found in <1% of the population • Example of a SNP: CTTAGCTT in 94% and CTTAGTTT in 6% of individuals • Example of a mutation: CTTAGCTT in 99.9% and CTTAGTTT in 0.1% of individuals
Mutations and SNPs [Figure: timeline from a common ancestor to the present, showing mutations and SNPs as the observed genetic variations]
Single Nucleotide Polymorphism • A SNP is usually assumed to be a binary (biallelic) variable • The probability of a repeat mutation at the same SNP locus is quite small • Tri-allelic cases are usually considered to be the effect of genotyping errors • The nucleotide at a SNP locus is called the major allele if its frequency is > 50% and the minor allele if its frequency is < 50% • Example: ACTTAGCTT (T: major allele, 94%) and ACTTAGCTC (C: minor allele, 6%)
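The major/minor allele definition can be illustrated with a short Python sketch. The function names and the sample of 100 chromosomes are assumptions made for illustration; only the 94% / 6% split comes from the slide.

```python
from collections import Counter

def allele_frequencies(alleles):
    """Return {allele: frequency} for the alleles observed at one SNP locus."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return {allele: count / total for allele, count in counts.items()}

def major_minor(alleles):
    """Return (major, minor) alleles of a biallelic SNP, ranked by frequency."""
    freqs = allele_frequencies(alleles)
    if len(freqs) != 2:
        raise ValueError("expected exactly two alleles at a biallelic SNP")
    ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[0][0], ranked[1][0]

# Hypothetical sample: 94 chromosomes carry T and 6 carry C (the slide's split).
sample = ["T"] * 94 + ["C"] * 6
freqs = allele_frequencies(sample)
major, minor = major_minor(sample)
print("major:", major, freqs[major])
print("minor:", minor, freqs[minor])
```

A locus with three observed alleles raises an error here, mirroring the slide's remark that tri-allelic cases are usually treated as genotyping errors.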
Single Nucleotide Polymorphism • SNPs are found in coding and (mostly) noncoding regions • They occur at a very high frequency: from about 1 in 1,000 bases up to 1 in 100 to 300 bases • SNP genotyping is easily automated • SNPs close to a particular gene can act as a marker for that gene • SNPs have become the preferred markers for association studies because of their high abundance and the availability of high-throughput SNP genotyping technologies.
Genotypes • The use of haplotype information has been limited because the individual genome is diploid. • To obtain the haplotype data, the two chromosome copies have to be separated first • In large sequencing projects, genotypes instead of haplotypes are collected due to cost considerations. [Figure: haplotype data versus genotype data at SNP1 and SNP2]
Problems of Genotypes • Genotypes only tell us the alleles at each SNP locus • But we don’t know the connection of alleles at different SNP loci • There could be several possible haplotypes for the same genotype • Example: the genotype A/C at SNP1 and G/T at SNP2 can come from the haplotype pair (A G, C T) or from the pair (A T, C G); we don’t know which haplotype pair is real
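The ambiguity can be made concrete with a small Python sketch that lists every haplotype pair consistent with an unphased genotype. The function name and the tuple representation of a genotype are assumptions for illustration.

```python
from itertools import product

def compatible_haplotype_pairs(genotype):
    """List the unordered haplotype pairs that can explain an unphased genotype.

    genotype: list of 2-tuples, one per SNP, holding the two alleles observed.
    """
    pairs = set()
    # At each SNP, choose which of the two alleles sits on the first haplotype.
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(g[c] for g, c in zip(genotype, choice))
        h2 = "".join(g[1 - c] for g, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

# Genotype from the slide: A/C at SNP1 and G/T at SNP2.
genotype = [("A", "C"), ("G", "T")]
print(compatible_haplotype_pairs(genotype))
# [('AG', 'CT'), ('AT', 'CG')]
```

With k heterozygous SNPs there are 2^(k-1) candidate pairs, which is why phasing becomes hard as the number of loci grows.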
“Haplotype-led approaches for increasing precision in plant breeding”
Outline of Presentation • Introduction • Haplotype construction and inference • Haplotype mapping • Tag SNPs and methods to select tSNPs • Application of haplotype-led approaches in plant breeding • Case studies • Conclusion
Haplotype • A string of SNPs that are linked and co-inherited together (polymorphic “frozen blocks”) • A haplotype is a group of genes in an organism that are inherited together from a single parent in a defined order (Bevan et al., 2017) • These variants tend to be inherited together, often because they are very close together in the same chromosome region and are therefore less likely to be separated by crossing over (Snowdon et al., 2015) [Figure: alleles at loci along a chromosome forming haplotypes]
Haplotypes • In terms of SNPs: “Two or more SNP alleles that tend to be inherited as a unit” (Bernardo, 2010) • A haplotype stands for a set of linked SNPs on the same chromosome not easily separable by recombination • Within each block, recombination is rare due to tight linkage • Example with three haplotypes (alleles at SNP1, SNP2, SNP3 in parentheses): -ACTTAGCTT- (C A T), -ACTTTGCTC- (C T C), -AATTTGCTC- (A T C)
Recombination Hotspots and Haplotype Blocks • Haplotype blocks are defined as contiguous series of SNPs that show very little evidence of historical recombination among the individuals (Gabriel et al., 2002)
Recombination Hotspots and Haplotype Blocks [Figure: haplotype patterns P1-P4 over SNP loci S1-S12 along a chromosome, with major and minor alleles marked; haplotype blocks are separated by recombination hotspots]
A Haplotype Block Example • Human chromosome 21 was partitioned into 4,135 haplotype blocks covering 24,047 SNPs by Patil et al. (Science, 2001). • Blue box: major allele • Yellow box: minor allele
HapMap (Source: The International HapMap Project) • The HapMap is a map of the haplotype blocks and the specific SNPs that identify the haplotypes • The haplotype map or "HapMap" acts as a tool to find genes and genetic variations that affect trait expression.
Steps in hapmap construction Third generation sequencing: Alleviating the bottlenecks in haplotype identification NGS Technology TGS Technology
Different phasing methods for haplotype construction/reconstruction • Reference-based phasing • De novo genome assembly (such as diploid and polyploid assembly) • Strain-resolved metagenome assembly (de novo re-assembly, single nucleotide variant-based assembly, read and contig binning)
Haplotagging: a novel sequencing strategy for rapid discovery of haplotypes
Steps in HapMap construction (Source: The International HapMap Project) 1. SNPs are identified in DNA samples from multiple individuals 2. Adjacent SNPs that are inherited together are compiled into haplotypes 3. “Tag” SNPs are identified within haplotypes that uniquely describe those haplotypes
Haplotype blocking methods (Saad et al., 2018) • Confidence interval test • Four gamete test • Solid spine of linkage disequilibrium
Confidence interval test • Less than 5% of weak LD is allowed within a haplotype block because, in addition to recombination events, forces such as recurrent mutation, gene conversion, or errors in genome assembly or genotyping can also weaken LD (Saad et al., 2018)
Four gamete test • A haplotype block partitioning method that assumes recombination events do not occur within a block • Three gamete condition: only three gametes are observed = no recombination, so the SNPs stay in one haplotype block • Four gamete condition: all four gametes are observed = a recombination event has occurred, so no blocking • A gamete must have a frequency > 0.01 to count towards a recombination event • Recombination events are only accepted between blocks
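The four gamete rule for one pair of SNPs can be sketched in Python as follows. The function name, the 0/1 allele coding, and the example haplotypes are assumptions for illustration; the 0.01 frequency cut-off is the one quoted on the slide.

```python
from collections import Counter

def shows_four_gametes(haplotypes, i, j, min_freq=0.01):
    """Return True if SNPs i and j carry all four two-locus gametes.

    haplotypes: phased haplotypes as strings of '0'/'1' alleles.
    A gamete is counted only if its frequency is above min_freq.
    """
    counts = Counter((h[i], h[j]) for h in haplotypes)
    n = len(haplotypes)
    observed = [gamete for gamete, c in counts.items() if c / n > min_freq]
    return len(observed) == 4

# Hypothetical data: three gametes only, so no evidence of recombination.
same_block = ["00", "01", "11"] * 10
# Hypothetical data: all four gametes present, so recombination is inferred.
recombined = ["00", "01", "10", "11"] * 5

print(shows_four_gametes(same_block, 0, 1))   # False
print(shows_four_gametes(recombined, 0, 1))   # True
```

A block-partitioning routine would extend a block SNP by SNP for as long as no pair inside it passes this test; that outer loop is left out of the sketch.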
Solid spine of linkage disequilibrium • Strong LD is observed between the first SNP and the last SNP and with all the intermediate SNPs • This is a scenario where a SNP marker exhibits strong and consistent associations with surrounding SNPs, indicating the presence of a stable haplotype block • The solid spine is a line of strong LD (> 0.8) that runs from one marker to the next along the legs of the LD triangle, which defines a particular haplotype block
Comparison among haplotype blocking methods • The four gamete test (FGT) differs from the other methods in that it does not require an LD threshold (Qian et al., 2017)
Haplotype Inference • The problem of inferring the haplotypes from a set of genotypes is called haplotype inference. • Most combinatorial methods use the maximum parsimony model to solve this problem. • This model assumes that the number of distinct haplotypes in a natural population is small • The solution of this problem is a minimum set of haplotypes that can explain the given genotypes
Maximum Parsimony • Find a minimum set of haplotypes to explain the given genotypes. [Figure: genotypes G1 and G2 at SNP1 and SNP2 with candidate haplotypes h1-h4; G1 can be explained by h1 + h2 or by h3 + h4, while G2 is explained by two copies of h1]
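A brute-force Python sketch of the maximum parsimony idea is given below. It tries every phasing of every genotype and keeps the combination that uses the fewest distinct haplotypes. The function names and the two example genotypes are assumptions for illustration, and exhaustive search like this is only practical for very small inputs.

```python
from itertools import product

def phasings(genotype):
    """All unordered haplotype pairs compatible with one unphased genotype."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(g[c] for g, c in zip(genotype, choice))
        h2 = "".join(g[1 - c] for g, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def max_parsimony(genotypes):
    """Pick one phasing per genotype so that the fewest distinct haplotypes are used."""
    best_used, best_combo = None, None
    for combo in product(*(phasings(g) for g in genotypes)):
        used = {h for pair in combo for h in pair}
        if best_used is None or len(used) < len(best_used):
            best_used, best_combo = used, combo
    return sorted(best_used), list(best_combo)

# Hypothetical genotypes over two SNPs.
g1 = [("A", "C"), ("G", "T")]   # heterozygous at both SNPs: two possible phasings
g2 = [("A", "A"), ("T", "T")]   # homozygous at both SNPs: must be AT + AT

haplotypes, resolution = max_parsimony([g1, g2])
print(haplotypes)   # ['AT', 'CG']
print(resolution)   # [('AT', 'CG'), ('AT', 'AT')]
```

Here the phasing AT + CG is preferred for the first genotype because AT is already needed to explain the second one, so two haplotypes suffice instead of three.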
Factors affecting haplotype map construction (Hamblin & Jannink, 2011) • SNP allele frequency distribution • Haplotype allele numbers • Linkage disequilibrium (LD)
Problems of Using SNPs for Association Studies • The number of SNPs is still too large to be used for association studies • There are millions of SNPs in a plant genome • To reduce the SNP genotyping cost, we wish to use as few SNPs as possible for association studies • Tag SNPs are a small subset of SNPs that is sufficient for performing association studies without losing the power of using all SNPs.
Brief glossary of terms (Halldorsson et al., 2004)
Examples of Tag SNPs • Suppose we wish to distinguish an unknown haplotype sample • We can genotype all SNPs to identify the haplotype sample [Figure: haplotype patterns P1-P4 over SNP loci S1-S12 next to an unknown haplotype sample; major and minor alleles marked]
Examples of Tag SNPs • In fact, it is not necessary to genotype all SNPs • SNPs S3, S4, and S5 can form a set of tag SNPs [Figure: haplotype patterns P1-P4 over SNP loci S1-S12, with the rows for S3, S4, and S5 shown separately]
Examples of Wrong Tag SNPs • SNPs S1, S2, and S3 cannot form a set of tag SNPs because P1 and P4 will be ambiguous [Figure: haplotype patterns P1-P4 over SNP loci S1-S12, with the rows for S1, S2, and S3 shown separately]
Examples of Tag SNPs • SNPs S1 and S12 can form a set of tag SNPs • This set of SNPs is the minimum solution in this example [Figure: haplotype patterns P1-P4 over SNP loci S1-S12, with the rows for S1 and S12 shown separately]
Steps for ‘tag SNP’ selection (Halldorsson et al., 2004) (1) Determining predictive neighborhoods (2) Minimizing the number of tagging SNPs (3) Tagging quality assessment
Haplotype Blocks and Tag SNPs • Recent studies have shown that the chromosome can be partitioned into haplotype blocks interspersed by recombination hotspots • Within a haplotype block, little or no recombination has occurred. • The SNPs within a haplotype block tend to be inherited together • Within a haplotype block, a small subset of SNPs (called tag SNPs) is sufficient to distinguish each pair of haplotype patterns in the block • We only need to genotype tag SNPs instead of all SNPs within a haplotype block
Problem Formulation • The relation between SNPs and haplotypes can be formulated as a bipartite graph • With four patterns P1-P4 there are C(4,2) = 6 pairs of patterns: (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) • S1 can distinguish (P1, P3), (P1, P4), (P2, P3), and (P2, P4) • S2 can distinguish (P1, P4), (P2, P4), and (P3, P4) [Figure: bipartite graph linking SNPs S1-S4 to the pairs of patterns they distinguish]
Observation • A group of SNPs can form a set of tag SNPs if each pair of patterns is connected by at least one edge • e.g., S1 and S3 can form a set of tag SNPs • e.g., S1 and S2 cannot be tag SNPs [Figure: bipartite graph linking SNPs S1-S4 to the pattern pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)]
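The pair-coverage condition can be checked with a short Python sketch. The allele table is partly assumed: the rows for S1 and S2 are chosen to reproduce the pattern pairs listed for them on the previous slide, while the rows for S3 and S4 are hypothetical because the original figure is not recoverable. The function names are also illustrative.

```python
from itertools import combinations

# Alleles of each SNP in patterns P1..P4 (0 = major, 1 = minor).
# S1 and S2 reproduce the pairs listed on the slide; S3 and S4 are assumed.
table = {
    "S1": (0, 0, 1, 1),
    "S2": (0, 0, 0, 1),
    "S3": (0, 1, 0, 1),
    "S4": (0, 1, 1, 1),
}

def distinguished_pairs(alleles):
    """Pattern pairs (1-based) whose alleles differ at this SNP."""
    return {(a + 1, b + 1)
            for a, b in combinations(range(len(alleles)), 2)
            if alleles[a] != alleles[b]}

def is_tag_set(snps, table):
    """True if every pair of patterns is distinguished by at least one SNP."""
    n = len(next(iter(table.values())))
    all_pairs = {(a + 1, b + 1) for a, b in combinations(range(n), 2)}
    covered = set().union(*(distinguished_pairs(table[s]) for s in snps))
    return covered == all_pairs

def minimum_tag_set(table):
    """Smallest subset of SNPs that is a tag set, found by exhaustive search."""
    names = sorted(table)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            if is_tag_set(subset, table):
                return subset
    return None

print(is_tag_set(("S1", "S3"), table))  # True
print(is_tag_set(("S1", "S2"), table))  # False: the pair (P1, P2) is not covered
print(minimum_tag_set(table))           # ('S1', 'S3')
```

Exhaustive search is used only to keep the sketch short; it grows exponentially with the number of SNPs, so it would not scale to a real block.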
Methods to select tSNPs • Principal component analysis (PCA): used to reduce the dimensions of the complete set of SNPs • Covariance matrix of SNPs → principal components analysis → SNPs that contribute most to the eigenvectors associated with the largest eigenvalues are considered more influential → selected SNPs are added to the set of tagging SNPs
Methods to select tSNPs • Shannon entropy: based on defining how well a subset of SNPs captures the variation in the complete set • Shannon entropy helps us quantify how much genetic diversity a particular SNP captures • SNP has high entropy → its alleles occur at more balanced frequencies → reflects greater diversity → selected as a tSNP • SNP has low entropy → most individuals carry the same allele → less diversity → less informative
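The per-SNP entropy idea can be illustrated with a minimal Python sketch. It computes only the Shannon entropy of the allele frequencies at a single SNP; the function name and the example allele counts are assumptions, and a full entropy-based tag SNP method would score subsets of SNPs rather than single loci.

```python
from collections import Counter
from math import log2

def shannon_entropy(alleles):
    """Shannon entropy (in bits) of the allele distribution at one SNP."""
    counts = Counter(alleles)
    n = len(alleles)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# Hypothetical SNPs scored on 100 chromosomes each.
balanced = ["A"] * 50 + ["G"] * 50   # both alleles equally common
skewed = ["T"] * 94 + ["C"] * 6      # one allele dominates

print(round(shannon_entropy(balanced), 3))
print(round(shannon_entropy(skewed), 3))
```

The balanced SNP reaches the maximum of 1 bit for a biallelic locus, while the skewed SNP scores much lower, matching the high-entropy versus low-entropy contrast on the slide.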
Linkage Disequilibrium • The problem of finding tag SNPs can also be solved from a statistical point of view • We can measure the correlation between SNPs and identify sets of highly correlated SNPs • For each set of correlated SNPs, only one SNP needs to be genotyped, and it can be used to predict the values of the other SNPs • Linkage disequilibrium (LD) is a measure that estimates such correlation between two SNPs
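Introduction to Linkage Disequilibrium • For SNP1 with alleles A/a and SNP2 with alleles B/b, the two SNPs are in LD when the haplotype frequencies differ from the products of the allele frequencies: • $P_{AB} \neq P_A P_B$ • $P_{Ab} \neq P_A P_b = P_A (1 - P_B)$ • $P_{aB} \neq P_a P_B = (1 - P_A) P_B$ • $P_{ab} \neq P_a P_b = (1 - P_A)(1 - P_B)$ [Figure: the four possible haplotypes AB, Ab, aB, and ab at SNP1 and SNP2]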
Linkage Disequilibrium Formulas • Mathematical formulas for computing LD or correlation: • $r^2$ or $\Delta^2$: $r^2 = \dfrac{(P_{AB} - P_A P_B)^2}{P_A (1 - P_A)\, P_B (1 - P_B)}$
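As a worked illustration, the Python sketch below computes D = P_AB - P_A P_B and the standard r² statistic, D² / (P_A (1 - P_A) P_B (1 - P_B)), from a list of phased two-SNP haplotypes. The function name, the allele labels, and the example data are assumptions for illustration.

```python
def ld_stats(haplotypes, a="A", b="B"):
    """Return (D, r2) for two SNPs from phased two-locus haplotypes.

    haplotypes: 2-character strings, first character = allele at SNP1,
    second character = allele at SNP2.
    """
    n = len(haplotypes)
    p_a = sum(h[0] == a for h in haplotypes) / n
    p_b = sum(h[1] == b for h in haplotypes) / n
    p_ab = sum(h[0] == a and h[1] == b for h in haplotypes) / n
    d = p_ab - p_a * p_b                      # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2

# Hypothetical data: alleles always travel together (complete LD).
complete_ld = ["AB"] * 50 + ["ab"] * 50
# Hypothetical data: all four haplotypes equally common (no LD).
no_ld = ["AB", "Ab", "aB", "ab"] * 25

print(ld_stats(complete_ld))  # (0.25, 1.0)
print(ld_stats(no_ld))        # (0.0, 0.0)
```

An r² of 1 means that genotyping one SNP fully predicts the other, which is the property that tag SNP selection based on LD relies on.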
Linkage Disequilibrium Bins • The statistical methods for finding tag SNPs are based on the analysis of LD among all SNPs • An LD bin is a set of SNPs such that SNPs within the same bin are highly correlated with each other • The value of a single SNP in one LD bin can predict the values of the other SNPs in the same bin • These methods try to identify the minimum set of LD bins
An Example of LD Bins (1/3) • SNP1 and SNP2 can not form an LD bin • e.g., A in SNP1 may imply either G or A in SNP2