Questions to be addressed

• Can multiple D genes be inserted?
• Violation of 12/23 rule
• Can D genes be inserted backwards?
• Is there a D gene preference?
• Is there a reading frame preference for D genes?
• If yes, is it part of the gene rearrangement?
• Who is doing the end trimming?

Data sets
• 6329 clonally unrelated rearrangements
• 1968 unmutated functional
• 3707 mutated functional
• 274 unmutated non-functional
• 380 mutated non-functional

P nucleotides

How many types of D genes?
• Conventional D genes
  • Identified in 81% of unmutated sequences and 64% of mutated sequences
• Inverted D genes
  • Long inverted D genes cannot be excluded
• Two D genes
• D genes with irregular RSS (DIR)
• Chromosome 15 OR

D gene usage
• 27 conventional D genes, 34 known alleles

D-gene usage and JH gene
• JH-proximal D genes are more often recombined to JH4 than to JH6, and JH-distal D genes are more often recombined to JH6

Inverted (palindromic) D genes
• Inverted D genes are not used (or are used extremely infrequently)

D genes with irregular RSS (recombination signal sequence) (DIR)
• Very long, >180 bp
• Contain a family 1 D gene
>DIR1 (between D6-6 and D1-7)
GGTGTTCCGCTAGCTGGGGCTCACAGTGCTCACCCCACACC
TAAAACGAGCCACAGCCTCCGGAGCCCCTGAAGGAGACCCC
GCCCACAAGCCCAGCCCCCACCCAGGAGGCCCCAGAGCACA
GGGCGCCCCGTCGGATTCTGAACAGCCCCGAGTCACAGTG
GGTATAACTGGAACTAC
>IGHD1-7-01|X13972|IGHD1-7-01|Homo sapiens|F|D-REGION
GGTATAACTGGAACTAC
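The claim that DIR1 contains a family 1 D gene can be checked directly against the two sequences shown on the slide. The following is a minimal sketch; the variable names are illustrative and not taken from the original analysis.

```python
# Sequences copied from the slide; variable names are hypothetical.
DIR1 = (
    "GGTGTTCCGCTAGCTGGGGCTCACAGTGCTCACCCCACACC"
    "TAAAACGAGCCACAGCCTCCGGAGCCCCTGAAGGAGACCCC"
    "GCCCACAAGCCCAGCCCCCACCCAGGAGGCCCCAGAGCACA"
    "GGGCGCCCCGTCGGATTCTGAACAGCCCCGAGTCACAGTG"
    "GGTATAACTGGAACTAC"
)
IGHD1_7 = "GGTATAACTGGAACTAC"

# True if the conventional D gene is embedded anywhere in DIR1.
contains_d_gene = IGHD1_7 in DIR1
# True if the D gene sits at the 3' end of the DIR1 sequence as shown.
d_gene_at_3prime_end = DIR1.endswith(IGHD1_7)

print(contains_d_gene, d_gene_at_3prime_end)
```

A hit against DIR1 in a rearranged sequence can therefore be indistinguishable from a hit against IGHD1-7 unless the match extends upstream of the embedded gene.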

D genes with irregular RSS (DIR)
• Very long, >180 bp
• Contain a family 1 D gene
• Found in 1% of sequences, inverted in 1.2%
• Some are explained as a family 1 gene plus N additions
• Median length of the remaining hits does not differ from that in permuted sequences
• => No evidence for use of DIR

Two D genes
• Two D genes found in 1% of sequences
• Frequency does not differ from that in permuted sequences
• Some are explained as one long D gene with a deletion
• Some are not possible because of the D gene locations
• Median length of the longest gene resembles normal D genes; that of the shortest resembles permuted sequences

Multiple D genes
• 65 sequences with two D genes
• Average length of shortest D genes: 11.6 bp
• Average length of longest D genes: 18.8 bp
• Average length of D genes in permuted sequences: 11.3 bp
• Average length of D genes in normal sequences: 17.8 bp
• => Multiple D genes are not present!
• Schematic labels: V gene, longest D, shortest D, J gene
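The comparison with permuted sequences can be sketched as follows. This is a minimal illustration, not the method used for the slides: it assumes that a "hit" is the longest exact substring shared with a germline D gene and that the control is built by shuffling the nucleotides of the same region. The function names, the number of permutations and the seed are all hypothetical.

```python
import random

def longest_match(query, gene):
    """Length of the longest exact substring shared by query and gene."""
    best = 0
    prev = [0] * (len(gene) + 1)
    for q in query:
        cur = [0] * (len(gene) + 1)
        for j, g in enumerate(gene, start=1):
            if q == g:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def permuted_baseline(query, genes, n_perm=200, seed=0):
    """Mean best-match length after shuffling the query nucleotides."""
    rng = random.Random(seed)
    letters = list(query)
    total = 0
    for _ in range(n_perm):
        rng.shuffle(letters)
        shuffled = "".join(letters)
        total += max(longest_match(shuffled, g) for g in genes)
    return total / n_perm

# IGHD1-7 as shown on the DIR slide, flanked by two made-up nucleotides.
d_genes = ["GGTATAACTGGAACTAC"]
junction = "AA" + d_genes[0] + "TT"

observed = max(longest_match(junction, g) for g in d_genes)
baseline = permuted_baseline(junction, d_genes)
print(observed, baseline)
```

A second D gene whose match length is close to the permuted baseline, as with the 11.6 bp versus 11.3 bp averages above, is what chance matching alone would produce.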

Chromosome 15 OR (orphon) genes
• 10 OR genes resembling D genes on chromosome 15
• High homology to conventional D genes

>IGHD5-12-01|X13972|IGHD5-12-01|Homo sapiens|F|D- (275 aa)
vs.
>IGHD5-OR15-5 |X55583 and X55584 (253 aa)
91.3% identity; global alignment score: 1563
FINDFASTADTEMPLATESDATIGHD-SATSEPMNIELHMEPRECTSMNIELANTIBDYH
:::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::
FINDFASTADTEMPLATESDATIGHDRSATSEPMNIELHMEPRECTSMNIELANTIBDYH
MMDGENANALYSISIRIXRGANISMIPCMMANDLINEPARAMETERSSETTLARGVISAN
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
MMDGENANALYSISIRIXRGANISMIPCMMANDLINEPARAMETERSSETTLARGVISAN
AMELISTDEFALTISNAMEXEXCLDENAMESMFINDENTRYBYCMMNSBMATCHAFINDA
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
AMELISTDEFALTISNAMEXEXCLDENAMESMFINDENTRYBYCMMNSBMATCHAFINDA
LLENTRIESSNAMEMSTBEINSTARTFNAMENMBERFFASTAENTRIESREADFRMFILE
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
LLENTRIESSNAMEMSTBEINSTARTFNAMENMBERFFASTAENTRIESREADFRMFILE
DTEMPLATESDATGTGGATATAGTGGCTACGATTAC
:::::::::::::
DTEMPLATESDAT-----------------------

Chromosome 15 OR (orphon) genes
• 10 OR genes resembling D genes on chromosome 15
• High homology to conventional D genes
• Very few OR15 hits in unmutated sequences
• Median length does not differ from that of hits in permuted sequences
• => No evidence for use of OR15 genes

D gene reading frames
• The recombination mechanism utilises each D gene reading frame at the same frequency
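To make the notion of D gene reading frames concrete, the sketch below translates a D gene in its three forward frames using the standard genetic code. The IGHD1-7 sequence is the one shown on the DIR slide; the function and variable names are illustrative, and the frames are labelled only by their nucleotide offset (0, 1, 2), not by any published reading-frame numbering.

```python
# Standard genetic code, codons ordered by T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate_frames(dna):
    """Translate a DNA string in the three forward frames (offsets 0, 1, 2)."""
    dna = dna.upper()
    frames = []
    for offset in range(3):
        # Only complete codons are translated; trailing bases are ignored.
        codons = [dna[p:p + 3] for p in range(offset, len(dna) - 2, 3)]
        frames.append("".join(CODON_TABLE[c] for c in codons))
    return frames

frames = translate_frames("GGTATAACTGGAACTAC")  # IGHD1-7 from the slide
print(frames)  # ['GITGT', 'V*LEL', 'YNWNY']
```

The three frames differ sharply in character: one contains a stop codon (`*`), which is why equal usage at the recombination step does not imply equal representation among functional sequences.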

N nucleotide dependence on end nucleotide
Position X    Position X+1
              A       T       G       C       P-value
A             0.292   0.146   0.292   0.271   0.04
T             0.260   0.290   0.207   0.243   0.016
G             0.204   0.172   0.453   0.172   0.0004
C             0.136   0.204   0.231   0.430   <0.0001
Expected      0.210   0.201   0.292   0.298   -
• N addition is not random but dependent on end nucleotide
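The dependence can be read off the table by comparing each observed frequency with the expected frequency for the same nucleotide. The sketch below does this for the diagonal cells only (same nucleotide at position X and X+1), using the values from the table. It assumes rows are position X and columns are position X+1; the diagonal is unaffected by that choice. The names are illustrative.

```python
# Frequencies copied from the table; keys are (position X, position X+1).
NUCLEOTIDES = ["A", "T", "G", "C"]
observed = {
    "A": [0.292, 0.146, 0.292, 0.271],
    "T": [0.260, 0.290, 0.207, 0.243],
    "G": [0.204, 0.172, 0.453, 0.172],
    "C": [0.136, 0.204, 0.231, 0.430],
}
expected = dict(zip(NUCLEOTIDES, [0.210, 0.201, 0.292, 0.298]))

# Ratio of observed to expected frequency for repeating the end nucleotide.
same_nucleotide_enrichment = {
    nt: observed[nt][NUCLEOTIDES.index(nt)] / expected[nt]
    for nt in NUCLEOTIDES
}

for nt, ratio in same_nucleotide_enrichment.items():
    print(f"{nt} followed by {nt}: {ratio:.2f}x expected")
```

Every ratio is above 1, so the first added nucleotide repeats the end nucleotide more often than expected, with G showing the strongest enrichment.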

Trimming of gene ends
• Average: 3.8 bp
• Trimming depends on the gene end and cannot be described solely as the removal of one nucleotide at a time

VDJsolver performance
• Figure panels: unmutated sequences; mutated sequences
• Significance markers: #: P<0.01; §: P<0.001

Results regarding recombination and diversity, and open questions
• DIR, OR15, multiple D genes and VH replacements are not used at a significant rate
• Inverted D genes are used rarely
• Not all D genes are used at the same frequency
• What determines whether a D gene is used?
• D gene usage is somewhat dependent on the JH gene
• Do multiple D-J recombination steps take place?
• All D gene reading frames are used at an equal rate at the recombination step
• At what step in development does selection for the hydrophilic reading frame occur?

Results regarding recombination and diversity, and open questions (cont.)
• N addition is not random but dependent on the end nucleotide
• Does nucleotide availability or the specificity of TdT determine the N addition?
• Trimming is not random but dependent on gene and sequence
• Which enzyme(s) are responsible for the trimming?

Numbering Schemes
• The Kabat numbering scheme is a widely adopted standard for numbering the residues in an antibody in a consistent manner. However, the scheme has problems!
• The Chothia numbering scheme is identical to the Kabat scheme, but places the insertions in CDR-L1 and CDR-H1 at the structurally correct positions. This means that topologically equivalent residues in these loops do get the same label (unlike the Kabat scheme).
• The IMGT unique numbering for all IG and TR V-REGIONs of all species relies on the high conservation of the structure of the variable region. This numbering, set up after aligning more than 5 000 sequences, takes into account and combines the definition of the framework (FR) and complementarity determining regions (CDR), structural data from X-ray diffraction studies, and the characterization of the hypervariable loops.
http://www.bioinf.org.uk/abs/#kabatnum
http://imgt.cines.fr/

Identification of CDR regions
CDR-L1
  Start: approximately residue 24
  Residue before: always C
  Residue after: always W; typically WYQ, but also WLQ, WFQ, WYL
  Length: 10 to 17 residues
CDR-L2
  Start: always 16 residues after the end of CDR-L1
  Residues before: generally IY, but also VY, IK, IF
  Length: always 7 residues
CDR-L3
  Start: always 33 residues after the end of CDR-L2
  Residue before: always C
  Residues after: always FGXG
  Length: 7 to 11 residues
CDR-H1
  Start: approximately residue 31 (always 9 after a C); the Chothia/AbM definition starts 5 residues earlier
  Residues before: always CXXXXXXXX
  Residues after: always W; typically WV, but also WI, WA
  Length: 5 to 7 residues (Kabat definition); 7 to 9 residues (Chothia definition); 10 to 12 residues (AbM definition)
CDR-H2
  Start: always 15 residues after the end of the Kabat/AbM definition of CDR-H1
  Residues before: typically LEWIG, but with a number of variations
  Residues after: K[RL]IVFT[AT]SIA (where residues in square brackets are alternatives at that position)
  Length: 16 to 19 residues (Kabat definition); the AbM definition and the most recent Chothia definition end 7 residues earlier; the earlier Chothia definition starts 2 residues later and ends 9 residues earlier
CDR-H3
  Start: always 33 residues after the end of CDR-H2 (always 3 after a C)
  Residues before: always CXX (typically CAR)
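The light-chain rules above translate fairly directly into pattern matching. The following is a minimal sketch, not a validated annotation tool: it finds CDR-L1 as the first 10 to 17 residue stretch flanked by C and W, places CDR-L2 and CDR-L3 by the fixed offsets, and checks the C and FGXG flanks of CDR-L3. The function name and the example sequence are illustrative assumptions; real sequences may need the approximate start position (residue 24) to disambiguate.

```python
import re

def find_light_chain_cdrs(seq):
    """Locate CDR-L1, CDR-L2 and CDR-L3 using the flank and offset rules."""
    # CDR-L1: preceded by C, followed by W, 10 to 17 residues long.
    l1 = re.search(r"(?<=C)[A-Z]{10,17}?(?=W)", seq)
    if l1 is None:
        return None
    # CDR-L2: starts 16 residues after the last residue of CDR-L1, length 7.
    l2_start = (l1.end() - 1) + 16
    l2_end = l2_start + 7
    # CDR-L3: starts 33 residues after the last residue of CDR-L2.
    l3_start = (l2_end - 1) + 33
    if l3_start >= len(seq) or seq[l3_start - 1] != "C":
        return None
    # CDR-L3: 7 to 11 residues, followed by FGXG.
    l3 = re.compile(r"[A-Z]{7,11}?(?=FG[A-Z]G)").match(seq, l3_start)
    if l3 is None:
        return None
    return {
        "CDR-L1": l1.group(),
        "CDR-L2": seq[l2_start:l2_end],
        "CDR-L3": l3.group(),
    }

# Illustrative kappa-like variable region (an assumption, not from the slides).
example = (
    "DIQMTQSPSSLSASVGDRVTITC"
    "RASQSISSYLN"
    "WYQQKPGKAPKLLIY"
    "AASSLQS"
    "GVPSRFSGSGSGTDFTLTISSLQPEDFATYYC"
    "QQSYSTPLT"
    "FGGGTKVEIK"
)
cdrs = find_light_chain_cdrs(example)
print(cdrs)
```

The heavy-chain rules could be handled the same way, but the differing Kabat, Chothia and AbM boundaries mean the chosen definition would have to be passed in explicitly.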
