Genomics: Looking at Life in New Ways
Mark D. Adams
Department of Genetics
Center for Computational Genomics
Center for Human Genetics
Genome publications Feb/2001 • ~30,000 genes, 3 million SNPs
Computing the Genome - Assembly
• Screener (8:37): Mask heterochromatin and ribo-DNA; tag known interspersed repeats.
• Overlapper (86:25): Find all overlaps of at least 40 bp allowing 6% mismatch. (1000X Blast)
• ASSEMBLER CORE:
  • Unitiger (38:29): Compute all consistent sub-assemblies = unitigs; identify those that cover unique DNA = U-unitigs
  • Scaffolder (4:12): Scaffold U-unitigs with confirmed shorts & longs, then with BAC ends
  • Repeat Rez I, II, III (5:44+4:21+19:53): Fill repeat gaps with:
    I. Doubly anchored mates
    II. O-path confirmed singly-anchored mates
    III. Greedy path completion using QVs
• Consensus (~25): Bayesian “SNP” consensus using quality values. Occurs throughout assembler core.
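The overlap criterion on this slide (40 bp, 6% mismatch) can be illustrated with a small sketch. This is not the assembler's actual overlapper: it assumes ungapped, substitution-only overlaps between the end of one read and the start of another, and the function name `best_overlap` and its parameters are hypothetical.

```python
def best_overlap(a: str, b: str, min_len: int = 40, max_mismatch: float = 0.06):
    """Longest suffix-of-a / prefix-of-b overlap within the mismatch budget.

    Illustrative only: ungapped comparison, no indels, no seeding.
    Returns (overlap_length, mismatches) or None if nothing qualifies.
    """
    # Try the longest possible overlap first and shrink toward min_len.
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix, prefix = a[-length:], b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_mismatch * length:
            return length, mismatches
    return None


if __name__ == "__main__":
    read_a = "A" * 50
    read_b = "A" * 48 + "C" * 2   # two substitutions near the end
    print(best_overlap(read_a, read_b))
```

A brute-force scan like this is quadratic per read pair, which is one reason production overlappers rely on seeding and indexing instead.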
Computing the Genome - Analysis • Gene Prediction • Repeat Elements • Large-scale structure • Did a genome-wide duplication occur in the evolution of the human genome?
After the genome…. • Define ‘finished’…. • Challenging regions • Centromeres • Annotation of genes (protein-coding and non-protein-coding) • Annotation of non-genic functional elements • More Genomes! • Identification of functionally important regions through analysis of conservation through evolution • ‘Comprehensive’ parts list • High-throughput mentality
Protein Structure Prediction/Comparison
F28C12.5  ------------MSQLTAEELDSQKCASEGLT-SVLTSITMKFNFLFITTVILLSYC-FT
F28C12.7  ------------MNKTAEDLLDSLKCASKDLS-SALTSVTIKFNCIFISTIVLISYC-FI
T06G6.1   ------------MNKTAEELLDSLKCASDGLA-SALTSVTLKFNCAFISTIVLISYC-FS
F28C12.2  ------------MNKTAEELD-SRNCASESLT-NALISITMKFNFIFIITVVLISYC-FT
F28C12.3  ------------MNKTAEELLDSRKCASEGLT-NALTSFMMKMNFSFIVT----------
F28C12.4  ------------MNKTAEELVESLRCASEGLT-NALTSITVKVSFVFLATVILLSYY-FA
T06G6.2   ------------MNKTAEEIVESRRCASEGLT-NALTSITVKMSSVLVVTVILLSYY-FA
F28C12.1  --------------MNQTELLESLKCASEGMV-KAMTSTTMKLNFVFIATVIFLSFY-FA
T26E3.9   ----------------MNELIDGPKCASEGIV-NAMTSIPVKISFLIIATVIFLSFY-FA
F18C5.6   ---------------------MSSECARSDVH-NVLTSDSMKFNHCFIISIIIISFF-TT
F18C5.8   ------------------MENLNPACASEDVK-NALTSPIMMLSHGFILMIIVVSFI-TT
AH6.7     --------------------MSSQKCASHLEI-ARLESLNFKISQLIYFVLIITTLF-FT
AH6.11    --------------------MSAPNCARKYDI-ARLSSLNFQISQYVYLSLISLTFI-FS
AH6.8     --------------------MSLTKCASKLEI-DRLISLNFRINQIIVLIPVFITFI-FT
AH6.14    --------------------MATIACASIIEQ-QRLRSSNFVIAQYIDLLCIVITFV-TT
[Figure labels: 1B0B, 1O1O]
Systems Biology DNA Protein Pathway/Partners Cell Organ/Tissue Organism Measurement Variation/Stimulus
Systems Biology Causality Complexity Coordination Robustness Resilience Systems Theory “The study of organization and behavior per se” (Wolkenhauer, Brief. Bioinform. 2:258, 2001)
Outline • Functional variation in the human genome • Extent of common protein variation • Genes that have evolved faster in human lineage • Mouse models of complex disease • Use of natural variation to infer a model of normal heart function
Aren’t there enough SNPs already? Yes! No! Yes! No!
• Depends on disease mapping strategy: direct (disease-causing allele) or inferred (genetic marker)
• Deficiency of missense SNPs
Risch 2000. Nature 405:847.
Identifying Common Sites of Variation March, 2001 <6,500 missense SNPs in 3,500 of 10,000 RefSeq genes
Identifying Common Sites of Variation SNP Discovery in: 20 Female Caucasians 19 Female African-Americans 1 Male chimpanzee
Re-sequencing Workflow
• Primer design
  • Unique primers are designed around coding exons and human-mouse conserved segments in 1 kbp upstream of transcript
  • Splice sites should be sequenced most of the time
  [Figure labels: 5’ UTR; conserved regions with TF binding sites; coding exons]
• Amplification & Sequencing
  • Re-arrayed primer and DNA plates are mixed to generate PCR and sequencing plates
  • Both strands are sequenced using the M13 tails on the primers
• SNP detection
  • Polyphred analysis
  • SNP scoring by expert system
  • Manual QA
• SNP annotation
  • SNPs mapped to the Celera reference genome and annotated with regard to gene location, mutation type, allele frequency, genotypes…
Data Source: Human and Chimp 25K genes 23K genes 20K genes
Summary of SNPs found • >18 million lanes run (compare to 36 million for shotgun sequencing human genome) • 23,363 genes assayed from 30,115 in the genome • 265,978 Total SNPs • ~75% are novel • 36,900 missense SNPs • Doubled the number that were previously known
Why are we different from chimpanzees? Proteins are 97-100% identical King and Wilson, Science 188:107-116 (1975) • The differing 1-3% is important • The important differences are in gene regulation • A small number of genes of divergent function with a disproportionate impact
Goal
• Identify genes that have shaped a particular species
• Identify human genes that may be more likely to be involved in human disease
[Figure: mouse-human-chimp tree labeled “Random drift”; divergence times 4.6 – 6.2 MY and 112 MY]
Goal
• Identify genes that have shaped a particular species
• Identify human genes that may be more likely to be involved in human disease
[Figure: mouse-human-chimp tree labeled “Natural Selection”; divergence times 4.6 – 6.2 MY and 112 MY]
Metric
• dN – Non-synonymous substitution rate
  • Nucleotide differences that CHANGE the amino acid sequence in orthologous proteins: CGC (Arg) → GGC (Gly)
• dS – Synonymous substitution rate
  • Nucleotide changes that do not change the amino acid: CGC (Arg) → CGG (Arg)
• dN/dS Ratio
  • dN = dS indicates neutral change
  • dN/dS < 1 indicates constraint/negative selection
  • dN/dS > 1 indicates possible positive selection
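The two codon examples on this slide can be checked mechanically against the standard genetic code. This is a minimal sketch: it only classifies a single codon change, whereas real dN and dS estimates are rates per site with corrections for multiple substitutions. The names `CODON_TABLE` and `classify` are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify(codon_a: str, codon_b: str) -> str:
    """Label a codon pair as identical, synonymous, or non-synonymous."""
    if codon_a == codon_b:
        return "identical"
    same_residue = CODON_TABLE[codon_a] == CODON_TABLE[codon_b]
    return "synonymous" if same_residue else "non-synonymous"


if __name__ == "__main__":
    print(classify("CGC", "GGC"))  # Arg vs Gly, the slide's dN example
    print(classify("CGC", "CGG"))  # Arg vs Arg, the slide's dS example
```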
Caveats • Low dS causes problems • Divide by ~0 problem • Must match true orthologs • Paralogous genes are subject to differing evolutionary pressures • Annotation and alignment must be correct
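The divide-by-~0 caveat can be made explicit in code. The sketch below refuses to report a ratio when dS is too small and otherwise applies the interpretation from the Metric slide. The cutoff `min_ds` is an arbitrary placeholder, not a value from the talk, and both function names are hypothetical.

```python
from typing import Optional


def dn_ds_ratio(dn: float, ds: float, min_ds: float = 1e-3) -> Optional[float]:
    """Return dN/dS, or None when dS is too low for a stable ratio.

    min_ds is an assumed cutoff chosen only for illustration.
    """
    if ds < min_ds:
        return None
    return dn / ds


def interpret(ratio: Optional[float]) -> str:
    """Map a ratio onto the categories used on the Metric slide."""
    if ratio is None:
        return "undefined (dS too low)"
    if ratio > 1:
        return "possible positive selection"
    if ratio < 1:
        return "constraint/negative selection"
    return "neutral change"


if __name__ == "__main__":
    for dn, ds in [(0.010, 0.020), (0.030, 0.020), (0.004, 0.0)]:
        print(dn, ds, interpret(dn_ds_ratio(dn, ds)))
```

Returning None rather than infinity keeps low-dS genes out of downstream rankings unless they are handled deliberately.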
[Flowchart, starting from a list of human genes]
• Human gene: determine coding sequence
• Chimp traces: build chimp transcript; determine what was “covered”; align to human; QC alignment; chimp gene passes
• Mouse: determine mouse ortholog; build mouse transcript; align to human; QC alignment; mouse gene passes
• Output: alignment files (2 or 3 species), then analysis
Data Set HUMAN CHIMP 7,645 coding sequence alignments MOUSE ORTHOLOG
Evidence Distribution 7645 MH Orthologs Evidence • Tblastx (+/-) • Syntenic anchor (+/-) • Syntenic block (+) • Shared protein family (+/-/0)
Method
• Generate three-species (human-chimp-mouse) coding sequence alignments
• Apply models of sequence divergence
• Identify genes that violate null hypothesis
[Figure: two mouse-human-chimp trees, “Null hypothesis” and “Gene with accelerated evolution on the human branch”; divergence times 4.6 – 6.2 MY and 112 MY]
Yang and Nielsen Evolutionary Model • Allows variation in the dN/dS ratio among lineages and among sites at the same time • Tests what is more likely: • all sites are either neutral (dN/dS =1) or evolve under negative selection (dN/dS < 1) • some sites are evolving under positive selection in the human (or chimp) lineage only Adapted from Mol. Biol. Evol. 19:908, 2002
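Asking "what is more likely" between two nested hypotheses is commonly framed as a likelihood ratio test. The sketch below assumes that framing. The symbols are introduced here for illustration: \ell_0 is the maximized log-likelihood when all sites are neutral or under negative selection, \ell_1 is the maximized log-likelihood when some sites may be under positive selection on the chosen lineage, and k is the difference in the number of free parameters. The exact model definitions, the value of k, and whether the chi-square approximation is appropriate should be taken from the cited paper.

```latex
\[
  2\,\Delta\ell \;=\; 2\,\bigl(\ell_{1} - \ell_{0}\bigr),
  \qquad
  \text{reject the null hypothesis when } 2\,\Delta\ell > \chi^{2}_{k,\,1-\alpha}
\]
```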
α-Tectorin and hearing
[Figure labels: tectorial membrane, hair cells]
• Protein plays a vital role in the tectorial membrane of the inner ear
• Single amino acid polymorphisms are associated with familial high frequency hearing loss
• Knockout mice are deaf
FOXP2 • Molecular evolution of FOXP2, a gene involved in speech and language • Enard, et al. Nature, 418:869, 2002 • “The ability to develop articulate speech relies on capabilities, such as fine control of the larynx and mouth, that are absent in chimpanzees and other great apes” • “FOXP2 seem to be required for acquisition of normal spoken language” • “FOXP2 … has been the target of selection during recent human evolution”
Enrichment of biological processes *= significant in 1 species **=significant in 2 species Model: dN/dS > 1 and >1 nonsyn sub, binomial test
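The slide names a binomial test without giving its setup. One plausible reading, offered only as an assumption, is a one-sided test: given n genes in a biological-process category and a background fraction p of genes that meet the model criteria, how surprising is it to see k or more such genes in the category? The function name `binomial_tail` and the example numbers are hypothetical.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """One-sided binomial p-value: P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Hypothetical category: 40 genes, 8 meet the criteria, background rate 5%.
    print(binomial_tail(8, 40, 0.05))
```

With many categories tested at once, some multiple-testing correction would normally be applied; the slide does not say which, if any, was used.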
Olfaction: human genes pseudogenes? Blue = pseudogene Red = gene Pseudogene status from HORDE: http://bioinformatics.weizmann.ac.il/HORDE/
C57BL/6J and A/J mice: Models for study of the metabolic syndrome
On a high fat, high sucrose diet:
Condition               C57BL/6J   A/J
Obesity                 ✓          X
Hypertension            ✓          X
Hyperglycemia           ✓          X
Hypertriglyceridemia    ✓          X
Low HDL Cholesterol     ✓          X
✓ Indicates that the strain develops the condition
X Indicates that the strain does not develop the condition
Functional Networks
Computational and genomic synthesis of complex systems from assays of component traits
• Attributes
  • applicable to all kinds of biological traits
    • here - subtle, naturally-occurring, non-pathologic variation
    • quantitative and qualitative biological properties
    • monogenic and polygenic traits
    • additive and epistatic traits
  • uses results from all kinds of assays
    • healthy individuals to learn about normal biological functions
    • abnormal conditions to learn about disease processes
Perturbation tests
• Traditional approach
  • Single gene mutations (endogenous challenge)
  • Drug treatments (exogenous challenges)
  • Both establish causal relations
• But
  • How do we interpret networks derived from perturbations that have dramatic effects?
• Alternative: Factorial design (after Fisher)
  1. Segregating populations
  2. Reference network based on normal variation
  3. Use to evaluate single gene mutations, modifier genes and drug perturbations
Nadeau, et al. Genome Research 13:2082, 2003
Heart: Proof-of-concept study
[Figure: echocardiography with transducer; labeled structures CW, AWRV, RV, SW, Aorta, LV, LA, PW]
Abbreviations
AWRV - anterior wall, right ventricle
CW - chest wall
LA - left atrium
LV - left ventricle
PW - posterior wall
RV - right ventricle
SW - septal wall
Echocardiography: Measures and calculations
[Figure: echocardiogram trace over time; labels CW, AWRV, RV cavity, SW, SWTh, LV cavity, ESD, EDD, PW, PWTh]
Abbreviations
EDD - end diastolic dimension
ESD - end systolic dimension
PWTh - posterior wall thickness
SWTh - septal wall thickness
HR = beats per min
Calculations
FS (fractional shortening) = (EDD - ESD) / EDD
LV mass = 1.06 x [(EDD + PWTh + SWTh)^3 - (EDD)^3]
Th/r = (PWTh + SWTh) / EDD
SV (stroke volume) = EDD^3 - ESD^3
CO (cardiac output) = SV x HR
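The calculations on this slide translate directly into code. The sketch below applies the formulas exactly as written. The class name `Echo` and its field names are illustrative, and the example values are the C57BL/6J means from the summary table in this deck. Units are an assumption: with dimensions in mm, stroke volume comes out in mm^3.

```python
from dataclasses import dataclass


@dataclass
class Echo:
    edd: float   # end diastolic dimension
    esd: float   # end systolic dimension
    pwth: float  # posterior wall thickness
    swth: float  # septal wall thickness
    hr: float    # heart rate, beats per min


def fractional_shortening(e: Echo) -> float:
    return (e.edd - e.esd) / e.edd


def lv_mass(e: Echo) -> float:
    return 1.06 * ((e.edd + e.pwth + e.swth) ** 3 - e.edd ** 3)


def relative_wall_thickness(e: Echo) -> float:
    return (e.pwth + e.swth) / e.edd


def stroke_volume(e: Echo) -> float:
    return e.edd ** 3 - e.esd ** 3


def cardiac_output(e: Echo) -> float:
    return stroke_volume(e) * e.hr


if __name__ == "__main__":
    b6 = Echo(edd=3.31, esd=2.01, pwth=0.49, swth=0.49, hr=433)
    print(f"FS   = {fractional_shortening(b6):.3f}")
    print(f"Th/r = {relative_wall_thickness(b6):.3f}")
    print(f"LV mass = {lv_mass(b6):.1f}")
    print(f"SV = {stroke_volume(b6):.2f}, CO = {cardiac_output(b6):.0f}")
```

Values computed from strain means need not match the tabulated means exactly, since the table presumably averages per-animal calculations.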
Summary of cardiovascular traits
Trait                      C57BL/6J        A/J
LV mass (g)                46.2 ± 14.1     32.7 ± 11.5   *
LV EDD (mm)                3.31 ± 0.42     2.83 ± 0.31   *
LV ESD (mm)                2.01 ± 0.32     1.49 ± 0.25   *
Exercise time (min)        9.6 ± 3.4       4.4 ± 1.9     *
LV frac. shortening (%)    39.1 ± 6.2      47.1 ± 6.9    *
Vcf (s^-1)                 8.8 ± 1.9       11.7 ± 2.6    *
SW Th (mm)                 0.49 ± 0.06     0.47 ± 0.07
PW Th (mm)                 0.49 ± 0.05     0.45 ± 0.08
LV mass / BW (mg/g)        1.96 ± 0.38     1.54 ± 0.43
Rel wall thickness         0.30 ± 0.04     0.32 ± 0.04
HR (echo; bpm)             433 ± 55        524 ± 45
HR (tail cuff; bpm)        615 ± 79        694 ± 75
Systolic BP (mm Hg)        122 ± 13        123 ± 20.8
Cardiac output (ml/min)    0.58 ± 0.19     0.50 ± 0.17
• These strains were not constructed to differ in CV functions
• Perturbations: subtle, naturally-occurring, non-pathologic
• B6: ‘athlete’s heart’, physiologic hypertrophy, exercise endurance
• Alternative genetic solutions to the same cardiovascular problem
Randomizing genomes in recombinant inbred strains
[Figure: chromosome maps (Chr 1, Chr 2, ... Chr X) for each strain]
Strain       A/J   B6    AXB1  AXB2  AXB3  BXA30
Ht rate      680   590   691   585   597   666
Exer time    233   582   540   597   241   255
Key features
• Probability of coincidental match for 2 strains: 0.50 (50% chance of fixing A or B allele).
• Probability of coincidental match for 30 strains: (0.50)^29 = <2 x 10^-9 !!!
(These results apply to a single-gene trait; probabilities are lower for polygenic traits)
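The arithmetic behind the key feature is easy to verify. The sketch assumes, as the slide does, a single-gene trait with each strain independently fixing the A or B allele with probability 0.5; the function name is hypothetical.

```python
def coincidental_match_probability(n_strains: int, p_allele: float = 0.5) -> float:
    """Chance that n strains all match a given pattern by coincidence.

    Follows the slide's convention: 0.5 for 2 strains, so the exponent is n - 1.
    """
    return p_allele ** (n_strains - 1)


if __name__ == "__main__":
    print(coincidental_match_probability(2))
    print(f"{coincidental_match_probability(30):.3e}")
```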
Methods: building functional networks
1. Type traits: measure each trait (T1 ... Tn) in each strain (S1 ... Sn; randomized genetics), giving a trait x strain table of values
2. Estimate cosegregation: compute the pairwise correlation between traits (r12, r13, r23, ... r1n), giving a trait x trait matrix
2a. Cluster analysis
2b. Identify significant relations: keep the signed significant correlations (e.g. +r12, +r14, +r24, -r34); the rest are left blank (--)
3. Identify networks: link traits (Trait 1 ... Trait n) through their significant relations
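The steps on this slide can be sketched in a few functions. This is a simplified stand-in rather than the authors' method: a fixed threshold on |r| replaces a proper significance test, connected components replace the cluster analysis, and the trait values are made up. All names (`pearson`, `significant_relations`, `networks`, `threshold`) are illustrative.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def significant_relations(traits, threshold=0.8):
    """Step 2/2b: signed correlations for trait pairs with |r| >= threshold."""
    edges = {}
    for a, b in combinations(traits, 2):
        r = pearson(traits[a], traits[b])
        if abs(r) >= threshold:
            edges[(a, b)] = r
    return edges


def networks(traits, edges):
    """Step 3: group traits linked, directly or indirectly, by kept relations."""
    adjacent = {name: set() for name in traits}
    for a, b in edges:
        adjacent[a].add(b)
        adjacent[b].add(a)
    seen, groups = set(), []
    for name in traits:
        if name in seen:
            continue
        stack, group = [name], set()
        while stack:
            current = stack.pop()
            if current not in group:
                group.add(current)
                stack.extend(adjacent[current] - group)
        seen |= group
        groups.append(group)
    return groups


if __name__ == "__main__":
    # Step 1: hypothetical trait values, one per strain, same strain order.
    traits = {
        "T1": [1, 2, 3, 4, 5],
        "T2": [2, 4, 6, 8, 10],
        "T3": [5, 4, 3, 2, 1],
        "T4": [1, -1, 0, -1, 1],
    }
    edges = significant_relations(traits)
    print(edges)
    print(networks(traits, edges))
```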
Segregation of CV traits in AXB / BXA RI strains
• multigenic variation
• positive cosegregation (r = 0.88, r^2 = 0.77)
• transgressive variation (traits exceeding parental values)