Positional cloning of the Huntington’s disease (HD) gene

Positional cloning of the Huntington’s disease (HD) gene Mapping and cloning of the HD gene chromosome walking cDNA libraries Identifying the disease-causing mutations Studies of the HD gene: identifying orthologous proteins (BLAST) mouse knockouts (KO’s) transgenic mice Summary of other repeat expansion diseases

Goals for the next three lectures… -Try to fill in some gaps -Strengthen the connections between topics -Some new information: protein similarity (probably today & Monday) knockout mice (probably Monday) population genetics (Monday?) -Next Fridays lecture: no more than 30 minutes of new material course evaluations (~15-20 minutes) review/problem solving/QS10

If I do spend time reviewing topics on Friday it would be good to know what you need help with: -No more than 1-2 topics (1-2 sentences) -Send to: pallanck@u.washington.edu -Need to hear from you before Monday Solutions to Problem set 6 have been posted on the course website Lastly: If you feel that an error was made in the grading of your 2nd midterm exam, send an email message to Anne Paul summarizing the error, BY THE END OF THE DAY TODAY.

Huntington’s disease results from nerve cell degeneration in the basal ganglia HD brain Normal brain Huntington’s Disease (reminder from lecture 13) • A dominant genetic disease; affects ~ 8 people per/100,000 worldwide • Symptoms include abnormal body movements (chorea), cognitive decline, death • Symptoms result from neurodegeneration • Age of onset typically 40’s; ranges from infancy to elderly • Genetic anticipation (increasing disease severity in subsequent generations) often observed • No cure or treatment

Mapping of the Huntington’s disease gene The informative pedigree: • 5,000 related individuals from Venezuela segregating HD • Included 100 members currently affected by HD • Included >1,000 members with >25% risk The markers used: Few markers available so tested random, purified fragments of human genome Used these random fragments as probes to conduct Southern blot analysis to identify RFLPs On their 12th probe… the jackpot! - linkage of the RFLP to HD! 1983

max = 3cM 3 -2 Marker ‘D4S10’ shows linkage to HD 40 Results of linkage studies using the probe “G8” which recognizes the RFLP marker D4S10 30 20 10 LOD score (Z)  -10 -20 Does this result provide significant evidence of linkage? -30 -40

FISH telomere D4S10 (4p16) centromere Where is marker ‘D4S10’ located? Karyotype: ? HD gene ~3cM away from D4S10 How to tell? ~3x106bp

1983 1992

derived genotypes D4S141 1 C B 1 1 2 3 B 2 A C 2 2 3 0 A 2 C (B/C) 1 1 3 3 A 1 B B 1 1 3 5 A D4S115 D4S141 D4S141 D4S111 D4S115 D4S115 Y1P18 1 C B 2 2 3 0 A 1 B B 1 1 3 5 A (1/2) A (B/C) 2 2 3 3 B 1 C (B / 1 1 3 0 A 2 B C) 1 1 3 5 C D4S111 D4S111 R10 Y1P18 Y1P18 D4S98 R10 R10 D4S43 D4S98 D4S98 D4S10 D4S43 D4S43 D4S10 D4S10 HD HD gene likely to reside here HD 2 A (C / 2 2 3 3 B 1 B B) 1 1 3 5 C HD gene likely to reside here Narrowing of the HD region Looking for highly informative recombinants (haplotypes): telomere D4S141 D4S115 D4S111 Y1P18 HD ? R10 D4S98 Where are the informative recombinants? D4S43 D4S10 centromere What Next?

D4S180 D4S182 we know the sequence here A portion of the UCSC Genome browser window: and here What is the DNA sequence in this interval? use in a BLAST search CCAACTGACTTAAGC…………………….AGCCTAGCTTAGATGC Genetic and physical map of the HD region D4S10 D4S98 HD ? centromere telomere ~500kb If 2008: D4S180: AACTGACTTAA D4S182: CCTAGCTTAGAT We could also find the genes in this interval using the UCSC browser But it was 1992…

D4S180 D4S182 we know the sequence here A portion of the UCSC Genome browser window: and here What is the DNA sequence in this interval? If 2008: But it was 1992… Genetic and physical map of the HD region D4S10 D4S98 HD ? centromere telomere ~500kb D4S180: AACTGACTTAA D4S182: CCTAGCTTAGAT How was this done in 1992 (i.e., before the genome was sequenced)?

Identify D4S180 & D4S182-containing clones in genomic DNA library Use ends of those clone’s inserts to find other clones with overlapping inserts Repeat Chromosome walking (outline) Make radioactive probes from known sequence

replica on filter release the DNA bind it to filter which colonies match up with hyb spots? hyb * probe from D4S180 region * * Colony hybridization to find the first genomic DNA clone genomic DNA clones X-ray film

Colony hybridization (cont’d) The colonies you detect must have insert sequences complementary to your D4S180 probe! What next? • Pick one of these clones • Characterize it (restriction digest, etc.) • Make a probe from one end of its insert • Repeat colony hybridization

colony hyb Chromosome walking — finding the next clone Pick one end of the insert PCR amplify the region Label the PCR fragment with radioactive tag Amp end ori end The goal — find the colonies (clones) that contain this sequence  overlap your first clone

Colony hybridization (cont’d) The colonies you detect in the hybridization could have… - duplicates of your original plasmid - new plasmids with different (but overlapping) inserts How could you tell if they were the same as the original? Restriction digests or sequencing

identified using D4S182 probe identified using D4S180 probe probe Joining fragments HD gene a contig Assembling a contig Repeat the process until the clones obtained from the flanking markers join: insert in original clone probe insert in original clone

24 62 17 54 20 9 19 36 4 STS For example: Which portion of the genome is represented in this BAC’s insert? From lecture 13 II. Map location on genome STS = sequence tagged site… short, unique genomic sequence—not present anywhere else in the genome— that can be detected by PCR… ID tag for that portion of genome Test the BAC by PCR: Does it test positive* with PCR primers for STS 24? Does it test positive with PCR primers for STS62? …etc. *Test positive? What does that mean?

D4S180 D4S182 ~40kb each Cosmid (sort of like a plasmid) contig Genetic and physical map of the HD region D4S10 D4S98 HD ? centromere telomere ~500kb How do we identify the genes in a contig? Which one is the HD gene?

These are things that computers are great at-and some of the things that underlie the UCSC browser Look for signatures of genes—e.g., promoters Look for open reading frames Look for transcribed regions—e.g., make a cDNA library Identifying genes in DNA sequence Various approaches… ...TTGAAGACGAAAGGGCCTCGTGATACGCCTATTTTTATAGGTTAATGTCATGATAATAATGGTTTCTTAGACGTCAGGTGGCACTTTTCGGG... ...AACTTCTGCTTTCCCGGAGCACTATGCGGATAAAAATATCCAATTACAGTACTATTATTACCAAAGAATCTGCAGTCCACCGTGAAAAGCCC...

AAAAAAA-3’ 5’ added by cell during pre-mRNA maturation insert into plasmid, transform E. coli Making a cDNA library cDNA = complementary DNA complementary to mRNA Start with mRNA from a cell culture or tissue Copy into DNA using reverse transcriptase and poly-A tail One mRNA out of the pool shown here… TTTTTTT-5’

Genomic vs. cDNA libraries cDNA library make cDNA, insert into plasmid, etc. • only mRNA regions (exons) represented • frequency of clone proportional to amount of transcription of the gene

D4S180 D4S182 ~40kb each Used as probes to screen cDNA libraries Cosmid (sort of like a plasmid) contig IT-15 IT-11 IT-10C3 ADDA Genetic and physical map of the HD region D4S10 D4S98 HD ? centromere telomere ~500kb Which (if any) of these transcripts correspond to the HD gene?

Why wouldn’t all individuals of a genotype show the same phenotype? How was the HD gene identified? Compared sequences from normal and HD individuals Look for gene alterations specific to diseased population Focused on genes that are expressed in the nervous system Screened cDNA libraries prepared from normal brain mRNA Some potential complications: -non-disease causing (rare) polymorphisms distinguishing the diseased and normal population -incomplete penetrance -variable expressivity -Influence of other genes—many traits multigenic -Influence of environment -Observation errors!

CAG18 CAG21 A simple PCR test to measure CAG repeat length in IT-15: Unique sequences in IT-15 flanking the CAG repeat CAGn GTCn How was the HD gene identified? HD ? IT-15 IT-11 IT-10C3 ADDA Gene: 67 exons; >200 kb mRNA: 10,366 bases Protein: 3,144 aa; ~350kDa

HD Correlation of HD age of onset and CAG repeat length 100 42-86 CAG repeats in >150 HD individuals 80 11-34 CAG repeats in 173 normals 60 Onset age (years) 40 (98% between 11-24) 20 0 30 40 50 60 70 80 90 100 CAG repeat length How was the HD gene identified? Triplet repeat number normal 65 50 35 20 5 Further evidence that CAG repeat expansion mutation is the cause of HD: -Two HD patients with a new mutation (not seen in parents) also had a repeat expansion. -Length of repeat correlated with onset and severity.

CAG11-34 CAG42-?? disease allele IT-15 is the HD gene (AKA Huntingtin) HD ? IT-15 IT-11 IT-10C3 ADDA non-disease allele Gene: 67 exons; >200 kb mRNA: 10,366 bases Protein: 3,144 aa; ~350kDa

Why did it take so long to clone the HD gene? 1979-work begins to clone HD 1983-First marker linked to HD (a lucky break) 1993-HD gene cloned -There were very few markers for linkage studies in humans -There were several inconsistencies in the linkage data -The biology of HD was of limited help in selecting candidate genes (~60% of mRNAs transcribed in the brain) -It is not easy to identify disease causing mutations! "We applaud their discovery," adds another contender, Michael Hayden of the University of British Columbia, who found himself in the painful position of having proposed a different candidate HD gene in Nature the day before the consortium published their proof-positive results in Cell. -Virginia Morell (1993) Science 260, 28-30.

Onset at 2yrs Too young to show trait Onset in early 40’s Triplet repeat number 120 Expanded CAG repeats are unstable in the paternal germline 90 65 50 35 20 5 Repeat instability explains HD genetic anticipation in HD CAG repeats tend to expand upon paternal transmission:

A C G CAGCAGCAGCAGCAG 5’ GTCGTCGTCGTCGTCGTC TCGTCGTC G 3’ OR, less frequently CAGCAGCAGCAGCAGCAG 5’ GTCGTCGTCGTCGTCGTCGTC TC G 3’ G C T Why are long CAG repeats unstable? DNA polymerase A molecular model: CAGCAGCAGCAGCAGCAG 5’ GTCGTCGTCGTCGTCGTCGTC TCGTC G 3’ increases CAG repeat length by 1 CAG decreases CAG repeat length by 1 CAG

CAGCAGCAGCAGCAGCAGCAACAGCAA 5’ 3’ GTCGTCGTCGTCGTCGTCGTTGTCGTC 3’ 5’ CAGCAGCAGCAGCAGCAGCAGCAGCAA 5’ 3’ GTCGTCGTCGTCGTCGTCGTCGTCGTC 3’ 5’ Prone to expansion? Why are only long CAG repeats unstable? Short repeats often also contain some CAA codons: CAGCAGCAACAGCAGCAGCAACAGCAA 5’ 3’ GTCGTCGTTGTCGTCGTCGTTGTCGTC 3’ 5’

How do mutations in Huntingtin cause disease? HD is a dominant disorder: Given what you know about dominant mutations, provide possible genetic explanations for the HD phenotype. -Haploinsufficiency-half the amount of HD gene product insufficient (like W)? -Dominant negative-poison subunit (like rab27b)? -Expressed in wrong place (like Antennapedia) or wrong time (like lactase) -Protein with a new activity (like the ABO blood antigens)?

Wolf-Hirschhorn Syndrome (4p-) (The Human “Knockout” of the Huntington Locus ) • Microdeletion (contiguous gene deletion) syndrome • Growth retardation, with abnormal facies. • Cardiac, renal, and genital abnormalities. • Significantly, basal ganglia is intact; no movement disorder Rules out haploinsufficiency as cause of Huntington’s disease

How do mutations in Huntingtin cause disease? HD is a dominant disorder: Given what you know about dominant mutations, provide possible genetic explanations for the HD phenotype. -Haploinsufficiency-half the amount of HD gene product insufficient (like W)? -Dominant negative-poison subunit (like rab27b)? -Expressed in wrong place (like Antennapedia) or wrong time (like lactase) -Protein with a new activity (like the ABO blood antigens)?

-more mismatches are tolerated if appropriate hybridization conditions are met (salt and temperature). Allows non-identical, but closely-related sequences to hybridize. okay Does CAG expansion act in a dominant-negative fashion? If the repeat expansion in HD acts in a dominant-negative fashion, a homozygous LoF mutation should be equivalent But no homozygous LoF alleles of the HD gene have been seen in humans! Perhaps we can create mutations in the mouse HD gene! But, how do we find the mouse HD gene?

Colony hybridization with a human HD probe ultimately led to the identification of the mouse HD gene Human HD protein 3,144 aa Mouse HD protein 3,120 aa The two proteins match at >90% of their aa’s! If the sequences are conserved, the biological function is also likely to be conserved If the biological function is conserved, we can test whether a mouse bearing a homozygous HD lof mutation resembles the human disease Before continuing, let’s diverge and consider how this is done today-in some detail…(BLAST) -But we will focus on using BLAST to find similar proteins (unlike what you did in QS)

doubles in size about every 2 years! Finding the mouse HD gene computationally We need three things: A sequence database. Some way of saying how similar two sequences are. A really fast way of carrying out the similarity test. We have the genome sequences and gene structures already. We’ll diverge from HD for a bit and talk about point 2 now. Point 3 is more appropriate for a computer course. The method is called BLAST (basic local alignment search tool). You should be at least somewhat familiar with this from QS9.

Thinking about protein similarity Suppose we have the following aligned protein sequences: amino acid identities PWAVTASCH ||||||||| VYAVQASPH (human) amino acid identities (something else) PWAVTASCH ||||||||| PWGVHATCW (human) (something else) We can see that both of the “something else” sequences appear to be related to the human. But related to what extent? We need to be quantitative.

Amino acid structures Hydrophobic Polar Charged phenylalanine F

Amino acid frequency Amino acid frequencies in the entire universe of known protein sequences. common rare

log odds calculation likelihood of seeing amino acid pair in related protein log score = likelihood of seeing amino acid pair at random likelihood of seeing amino acid pair in related protein: • Related proteins taken from BLOCKS database (validated related proteins). • Simply count up how often a particular amino acid pair is seen. • Gives you the numerator likelihood above. likelihood of seeing amino acid pair at random: At random: (the factor of two is because it can be an A-B pair or a B-A pair) • Gives the denominator likelihood above.

21 (# D-D pairs) 74 (total # of pairs) From aa frequency table 0.05 X 0.05 (f of D-D) Amino acid pair frequencies in related proteins One block from BLOCKS database: CKS2_XENLA|Q91879 NIYYSDKYTDEHFEY CKS1_HUMAN|P33551 QIYYSDKYDDEEFEY CKS2_HUMAN|P33552 QIYYSDKYFDEHYEY CKS2_MOUSE|P56390 QIYYSDKYFDEHYEY CKS1_PATVU|P41384 QIYYSDKYFDEDFEY CKS1_DROME|Q24152 DIYYSDKYYDEQFEY CKS1_PHYPO|P55933 TIQYSEKYYDDKFEY CKS1_LEIME|Q25330 KILYSDKYYDDMFEY O23249 QIQYSEKYFDDTFEY O60191 NIHYSTRYSDDTHEY CKS1_SCHPO|P08463 QIHYSPRYADDEYEY CKS1_YEAST|P20486 SIHYSPRYSDDNYEY CKS1_CAEEL|Q17868 DFYYSNKYEDDEFEY D-D 21 pairs D-E 14 pairs D-P 14 pairs D-T 7 pairs D-N 7 pairs E-E 1 pair E-T 2 pairs E-P 4 pairs T-P 2 pairs T-T 1 pair T-N 1 pair LOD calcul. (e.g., D-D pair): One of 29,068 blocks - pair frequencies compiled from all blocks combined. log

log odds scores (side note) • Traditionally, we use log base 2 (pedigree LOD scores are base 10). • To make computing fast, scores are usually multiplied by 2 and then rounded to nearest integer (this is a detail). • Called “half-bit” scores (jargon for taking twice log base 2).

Log2 1 = 0 Log2 2 = 1 Log2 1/2 = -1 Remember: log odds scores (cont.) likelihood of seeing amino acid pair in related protein log score = likelihood of seeing amino acid pair at random If amino acid pair seen MORE often than expected at random? • odds > 1, score positive If amino acid pair seen LESS often than expected at random? • odds < 1, score negative

Values from a score matrix (half-bit scores) one-letter amino acid code score for alanine (A) - tryptophan (W) self match scores

Amino acid structures Hydrophobic Polar Charged phenylalanine F

I-L Example - similar amino acids get positive scores Qualitatively, what scores do you expect pairs of these to have? I-V L-V

Example - dissimilar amino acids get negative scores Qualitatively, what scores do you expect pairs among these groups to have? hydrophobic vs. charged

Positional cloning of the Huntington’s disease (HD) gene

Positional cloning of the Huntington’s disease (HD) gene

Presentation Transcript

Primer design for PCR cloning

Cloning into Plasmids

Cloning and Sequencing

Huntington’s Disease

Ch. 20 Biotechnology

HUNTINGTON'S DISEASE

positional cloning of human disease genes: a reversal of scientific priorities

Huntington’s Disease

Huntington’s Disease

Huntington’s Disease

Huntington Disease

Huntington’s Disease

Cloning

Accelerating positional cloning in mice using ancestral haplotype patterns

Molecular cloning

Genetic Counselling and Predictive testing for Huntington’s Disease

Huntington’s Disease

Genetic Cloning

Biology 423 Research Paper: Genetics behind cloning of a human gene:

Cloning and Sequencing

Huntington’s Disease

Huntington’s Chorea