CANDID: A candidate gene identification tool, Part 2 • Janna Hutz • jehutz@artsci.wustl.edu • March 26, 2007
Review • Literature (well-characterized genes) • Protein domains (all genes) • Cross-species conservation (all genes)
Today’s agenda • Expression levels • Linkage data • Association data • CANDID performance measures
Candidate lists vs. single candidates • Candidate lists: complex trait or disease; disease with known heterogeneity • Single candidates: Mendelian trait; new disease; disease with clear, well-defined pathology
Candidate lists vs. single candidates • Microarray • SNP typing • Sequencing • Immunocytochemistry • Knockout model • [Figure: example SNP, ACT[A/G]GGA]
Example 4 • Goiter - thyroid gland problem • Iodine deficiency • Genetic causes
Example 4 • Iodine is not supplied • Iodine is present, but is not added to the molecule • Which gene is mutated?
Expression data • We know what tissue our gene is expressed in (thyroid). • How can we use this knowledge to help identify the candidate? • Wouldn’t it be nice if we had an expression database?
Expression databases • Our ideal expression database would have: • Expression data for the same genes across many different tissues • As many tissues as possible • As many genes as possible • Good documentation • Gene Atlas
Gene Atlas • Genomics Institute of the Novartis Research Foundation • 79 human tissues (160 samples) • 2 arrays • Affymetrix HG-U133A • GNF1H (custom) • 17,809 genes
Measure of gene expression • [Figure: arrays for heart, brain, thyroid, lung] • Our thyroid gene: • Gene that is brightest on the thyroid array? • Gene that is brightest on the thyroid array, compared to all the other arrays.
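The second option on this slide (brightness on the thyroid array relative to the other arrays) can be sketched as a simple ratio. This is only an illustration and not necessarily how CANDID computes its expression score: the function name, the toy intensities, and the use of the mean of the other tissues as the baseline are all assumptions.

```python
def relative_expression(intensities, tissue):
    """Ratio of a gene's intensity in `tissue` to its mean intensity elsewhere.

    `intensities` maps tissue name -> array intensity (hypothetical values).
    """
    others = [v for t, v in intensities.items() if t != tissue]
    if not others:
        raise ValueError("need at least one other tissue for comparison")
    baseline = sum(others) / len(others)
    if baseline == 0:
        return float("inf")
    return intensities[tissue] / baseline


# Toy intensities, made up for illustration only.
housekeeping = {"heart": 900.0, "brain": 950.0, "thyroid": 1000.0, "lung": 850.0}
thyroid_specific = {"heart": 20.0, "brain": 10.0, "thyroid": 600.0, "lung": 30.0}

# The housekeeping gene is brighter on the thyroid array in absolute terms,
# but the thyroid-specific gene stands out once other tissues are considered.
housekeeping_ratio = relative_expression(housekeeping, "thyroid")
specific_ratio = relative_expression(thyroid_specific, "thyroid")
```

A gene that is bright everywhere gets a ratio near 1, which is why the raw "brightest on the thyroid array" criterion alone is a weak filter.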
Measures of gene expression • Run CANDID, specifying that we’re interested in the thyroid. http://dsgweb.wustl.edu/llfs/secure_html/hutz/index.html User name: workshop Password: perl031907 • (We’ll need a tissue code for that.)
Example 4 - Results • Our favorite genes: • TP53 - rank is… • 16314th • KRAS - rank is… • 5229th • What genes are ranked most highly?
Example 4 - Results • 192 genes with expression score of 1 • The TPO gene is actually responsible for the phenotype described earlier • Its expression score = 1
Prior evidence • I’m not interested in examining all of the genes in the genome - just some of them. • Linkage and association
Linkage • CANDID can: • Weight regions with higher LOD scores • Limit analysis to certain regions • How does it do this?
Linkage scoring • $\text{linkage score} = \dfrac{\text{gene's LOD score}}{\text{maximum genome-wide LOD score}}$ • [Figure labels: 17, 3, 2]
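The ratio on this slide can be written as a one-line function. This is a minimal sketch: the function name is hypothetical, and treating negative LOD scores as zero and returning 0 when there is no positive genome-wide maximum are assumptions, not documented CANDID behavior.

```python
def linkage_score(gene_lod, max_genome_lod):
    """Gene's LOD score divided by the maximum genome-wide LOD score."""
    if max_genome_lod <= 0:
        # Assumption: no positive linkage signal anywhere means no linkage evidence.
        return 0.0
    # Assumption: negative LOD scores contribute nothing rather than a negative score.
    return max(gene_lod, 0.0) / max_genome_lod


peak_gene = linkage_score(3.0, 3.0)      # gene under the genome-wide peak
weaker_gene = linkage_score(1.5, 3.0)    # gene under a peak half as high
```

Dividing by the genome-wide maximum keeps the score between 0 and 1, so it can be weighted alongside the other criteria.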
Linkage files • How does CANDID get this linkage information? • CANDID takes two kinds of files • Unformatted output from GENEHUNTER and MERLIN • Custom linkage files
Custom linkage files • Simple format • Line 1 of the file must contain the word “custom” somewhere • Subsequent lines: Chromosome (tab) cM (tab) LOD score • But how do I get cM positions?
Mapmaker • Takes an input file in the format: Chromosome (tab) basepair (tab) LOD score • Outputs a new file in the format: Chromosome (tab) cM (tab) LOD score • Will be available on the CANDID website soon
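One common way to turn a base-pair position into a cM position is linear interpolation between flanking markers on a genetic map. The sketch below illustrates that idea only; it is not a description of how Mapmaker works. The marker map, the function name, and the choice to clamp positions outside the map to the nearest marker are all assumptions.

```python
from bisect import bisect_right


def bp_to_cm(bp, marker_map):
    """Interpolate a cM position for `bp`.

    `marker_map` is a list of (basepair, cM) pairs for one chromosome,
    sorted by basepair (hypothetical data).
    """
    positions = [p for p, _ in marker_map]
    if bp <= positions[0]:
        return marker_map[0][1]      # assumption: clamp before the first marker
    if bp >= positions[-1]:
        return marker_map[-1][1]     # assumption: clamp after the last marker
    i = bisect_right(positions, bp)  # positions[i - 1] <= bp < positions[i]
    (bp0, cm0), (bp1, cm1) = marker_map[i - 1], marker_map[i]
    return cm0 + (cm1 - cm0) * (bp - bp0) / (bp1 - bp0)


# Toy marker map, made up for illustration only.
toy_map = [(1_000_000, 0.0), (3_000_000, 2.0), (5_000_000, 6.0)]
midpoint_first_interval = bp_to_cm(2_000_000, toy_map)
midpoint_second_interval = bp_to_cm(4_000_000, toy_map)
```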
Example 5 • Pancreatic cancer • Deletion on chromosome 13 between 23.65 cM and 25.08 cM.
Creating a custom linkage file • [Figure: region marked from 23.65 to 25.08] • Example (fields separated by tabs):
custom
13	23.64	0
13	23.65	3
13	25.08	3
13	25.08	0
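A file in the custom format (a first line containing the word "custom", then chromosome, cM, and LOD score separated by tabs) is easy to check before submitting it. The sketch below is a hypothetical validator, not part of CANDID; the function name and the case-insensitive match on "custom" are assumptions.

```python
def parse_custom_linkage(text):
    """Parse custom linkage text into (chromosome, cM, LOD) tuples."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Assumption: the word "custom" is matched case-insensitively.
    if not lines or "custom" not in lines[0].lower():
        raise ValueError('line 1 must contain the word "custom"')
    records = []
    for ln in lines[1:]:
        chrom, cm, lod = ln.split("\t")  # raises ValueError if not 3 fields
        records.append((chrom.strip(), float(cm), float(lod)))
    return records


# The example file from the slide, with tab-separated fields.
example = "custom\n13\t23.64\t0\n13\t23.65\t3\n13\t25.08\t3\n13\t25.08\t0\n"
records = parse_custom_linkage(example)
```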
Running CANDID • Try running CANDID using only the linkage criterion. • Now, run CANDID with the linkage criterion and literature criterion (your choice of keywords) • Linkage weight = 1000 • Literature weight = 1
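To see why a linkage weight of 1000 against a literature weight of 1 behaves like a region filter with literature as the tie-breaker, here is a small sketch. It assumes the final score is a weighted sum of per-criterion scores that each lie between 0 and 1; that combination rule, the function name, and the toy genes are assumptions for illustration.

```python
def combined_score(scores, weights):
    """Weighted sum of per-criterion scores (assumed combination rule)."""
    return sum(weights.get(name, 0.0) * value for name, value in scores.items())


weights = {"linkage": 1000, "literature": 1}

# Toy genes with made-up scores in [0, 1].
genes = {
    "in_region_with_literature": {"linkage": 1.0, "literature": 0.8},
    "in_region_no_literature": {"linkage": 1.0, "literature": 0.0},
    "outside_region_with_literature": {"linkage": 0.0, "literature": 1.0},
}

ranked = sorted(genes, key=lambda g: combined_score(genes[g], weights), reverse=True)
```

Under these assumptions every gene inside the linked region outranks every gene outside it, and the literature score only orders genes within the region.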
Results • From OMIM: “Individuals with mutations in the BRCA2 gene, which predisposes to breast and ovarian carcinoma, have an increased risk of pancreatic cancer; germline mutations in BRCA2 are the most common inherited alteration identified in familial pancreatic cancer.”
Association • Increasing numbers of association studies • Increasing numbers of SNPs in each study • Can CANDID use this information, too?
Association • Database • dbSNP - 11.8 million human SNPs • Includes HapMap SNPs • Most comprehensive • Each SNP has an ID number prefixed with “rs”
Association • How does CANDID accept association data? • Custom file format - each line is: rs# (tab) p-value
Association scoring • For each gene, take the best p-value for that gene’s SNPs • Subtract that p-value from 1 • Unless you test SNPs in every gene, this can be kind of unfair…
Association scoring • Tested 10 genes • Gene 9 has a best p-value of 0.8 (bad) • Gene X was not tested • Should Gene 9 get a higher overall score than Gene X?
p-value threshold • User defines a p-value threshold • Let’s say it’s 0.1. • Any SNPs with p-values above 0.1 are not considered. • Now Gene 9 and Gene X have the same score (0).
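The scoring rule and the threshold can be put together in a few lines: keep only p-values at or below the threshold, take the best one, and subtract it from 1. The function name and the default threshold of 1.0 (meaning no filtering) are assumptions made for this sketch.

```python
def association_score(p_values, threshold=1.0):
    """1 minus the best (smallest) p-value among SNPs that pass the threshold.

    Genes with no tested SNPs, or none passing the threshold, score 0.
    """
    kept = [p for p in p_values if p <= threshold]
    if not kept:
        return 0.0
    return 1.0 - min(kept)


gene_9_unfiltered = association_score([0.8])            # about 0.2
gene_9_filtered = association_score([0.8], threshold=0.1)
gene_x_filtered = association_score([], threshold=0.1)  # gene was never tested
```

With the threshold at 0.1, the tested-but-unassociated gene and the untested gene both score 0, which removes the unfair advantage described above.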
Example 6 • Age-related Eye Disease Study • Macular degeneration
Example 6 • Make custom association file (fields separated by tabs):
rs3753396	0.0444
rs543879	0.0494
rs7724788	0.75
• Run CANDID with this association file
Results • CFH: rs3753396 (0.0444), rs543879 (0.0494) • SLC25A46: rs7724788 (0.75)
Preliminary evidence • Online Mendelian Inheritance in Man • 154 diseases linked to chromosome 1 • Literature, domains - chose keywords • Conservation • Expression - chose tissue codes
Ideal weights • Tested all combinations of weights in those 4 categories • Possible weights: (0, 0.1, … , 0.9, 1) • Which weight combination was the best, across all 154 diseases?
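The search space described here is small enough to enumerate directly. The sketch below builds every combination of the 11 possible weights over the 4 categories; the category names follow the previous slide, while the function name is hypothetical and whether the all-zero combination was actually evaluated is not stated in the slides.

```python
from itertools import product

CATEGORIES = ("literature", "domains", "conservation", "expression")
# Build 0, 0.1, ..., 1 from integers to avoid accumulating float error.
LEVELS = [i / 10 for i in range(11)]


def weight_grid():
    """Yield one dict of weights per combination of levels."""
    for combo in product(LEVELS, repeat=len(CATEGORIES)):
        yield dict(zip(CATEGORIES, combo))


grid = list(weight_grid())
n_combinations = len(grid)  # 11 ** 4
```

Each combination would then be scored by how highly it ranks the known disease gene, averaged over the 154 diseases.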
Top 10 weight combinations • 1. Literature = 1, everything else = 0 • 2. Literature = 0.9, everything else = 0 • 3. Literature = 0.8, everything else = 0 • 4. Literature = 0.7, everything else = 0 • … • 10. Literature = 0.1, everything else = 0 • 11. Literature = 1, domains = 0.1
More specifics • Literature only: average ranking = 425 • 425 of 38,697 genes, i.e. the 98.9th percentile • 44/154 genes ranked #1 for at least one set of weights • Chromosome 1: average ranking = 22 • 22 of 2,280 genes, i.e. the 99th percentile • 84/154 genes ranked #1 for at least one set of weights
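The percentile figures follow from the average rank and the number of genes considered. As a quick check of the arithmetic (the function name is hypothetical):

```python
def rank_percentile(average_rank, total_genes):
    """Percentile implied by a rank: the share of genes ranked below it."""
    return 100.0 * (1.0 - average_rank / total_genes)


genome_wide = round(rank_percentile(425, 38697), 1)
chromosome_1 = round(rank_percentile(22, 2280), 1)
```

Both values round to the percentiles quoted on the slide.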
Analysis of results • They make a lot of sense. • Genes in OMIM are, by definition, well-characterized. • Many diseases are rare, with particular names or keywords that would only appear in papers about the disease genes.
Next steps • Separate OMIM analysis into simple and complex traits • Get new ideal weights • See how well these ideal weights do in ranking candidates from chromosome 2.
Next steps • CANDID’s databases were last compiled in November 2006. • Find publications that have come out since then. • How well does CANDID do in ranking those genes?
Next steps • Many new whole-genome studies and microarray studies implicate lists of candidates. • If CANDID analyzes those phenotypes, how significant is the overlap of CANDID’s top genes and those papers’ top genes?
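One standard way to ask whether the overlap between two gene lists is larger than chance is a hypergeometric tail probability. The slides do not say which test would be used, so the sketch below is only one possible approach; the function and parameter names are hypothetical.

```python
from math import comb


def overlap_p_value(population, list_a, list_b, overlap):
    """P(overlap or more shared genes) if list_b were drawn at random.

    population: total number of genes considered
    list_a:     size of the first list (e.g. CANDID's top genes)
    list_b:     size of the second list (e.g. a paper's top genes)
    overlap:    number of genes observed in both lists
    """
    total = comb(population, list_b)
    upper = min(list_a, list_b)
    tail = sum(
        comb(list_a, k) * comb(population - list_a, list_b - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy numbers: 10 genes, two lists of 3, all 3 shared.
p_complete_overlap = overlap_p_value(10, 3, 3, 3)
```

Small p-values indicate that the two lists share more genes than random selection from the same gene set would be expected to produce.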
Next steps • Any other suggestions? • Any interesting data you have?
Acknowledgments • Mike Province • Howard McLeod • Aldi Kraja • Ingrid Borecki • Qunyuan Zhang • Ryan Christensen • John Martin