
CANDID: A candidate gene identification tool, Part 2

CANDID: A candidate gene identification tool, Part 2. Janna Hutz, jehutz@artsci.wustl.edu, March 26, 2007. Review: literature (well-characterized genes), protein domains (all genes), cross-species conservation (all genes). Today’s agenda: expression levels, linkage data, association data.


  1. CANDID: A candidate gene identification tool, Part 2 • Janna Hutz • jehutz@artsci.wustl.edu • March 26, 2007

  2. Review • Literature: well-characterized genes • Protein domains: all genes • Cross-species conservation: all genes

  3. Today’s agenda • Expression levels • Linkage data • Association data • CANDID performance measures

  4. Candidate lists vs. single candidates • Candidate lists: complex trait or disease; disease with known heterogeneity • Single candidates: Mendelian trait; new disease; disease with clear, well-defined pathology

  5. Candidate lists vs. single candidates • Microarray • SNP typing • Sequencing • Immunocytochemistry • Knockout model [Figure: SNP example, ACT[A/G]GGA]

  6. Example 4 • Goiter - thyroid gland problem • Iodine deficiency • Genetic causes

  7. Example 4 • Iodine is not supplied • Iodine is present, but is not added to the molecule • Which gene is mutated?

  8. Expression data • We know what tissue our gene is expressed in (thyroid). • How can we use this knowledge to help identify the candidate? • Wouldn’t it be nice if we had an expression database?

  9. Expression databases • Our ideal expression database would have: • Expression data for the same genes across many different tissues • As many tissues as possible • As many genes as possible • Good documentation • Gene Atlas

  10. Gene Atlas • Genomics Institute of the Novartis Research Foundation • 79 human tissues (160 samples) • 2 arrays: Affymetrix HG-U133A and GNF1H (custom) • 17,809 genes

  11. Measure of gene expression [Figure: expression arrays for heart, brain, thyroid, lung] • Our thyroid gene: • Gene that is brightest on the thyroid array? • Gene that is brightest on the thyroid array, compared to all the other arrays.
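
The contrast on this slide, raw brightness versus brightness relative to the other arrays, can be illustrated with a small sketch. The ratio used here (expression in the tissue of interest divided by the gene's highest expression in any tissue) is an assumption chosen for illustration; the slides do not state the formula CANDID actually uses, and the function name and sample values are hypothetical.

```python
from typing import Dict


def tissue_specificity(expression: Dict[str, float], tissue: str) -> float:
    """Return expression in `tissue` relative to the gene's peak tissue.

    Illustrative measure only (an assumption, not CANDID's documented score):
    1.0 means the gene is brightest in the requested tissue.
    """
    peak = max(expression.values())
    if peak <= 0:
        # No detectable expression in any tissue.
        return 0.0
    # Clamp negative (background-subtracted) values to zero.
    return max(expression[tissue], 0.0) / peak


# Hypothetical intensities for one gene across four arrays.
gene = {"heart": 50.0, "brain": 40.0, "thyroid": 200.0, "lung": 100.0}
print(tissue_specificity(gene, "thyroid"))
print(tissue_specificity(gene, "lung"))
```

A relative measure like this rewards genes that peak in the thyroid rather than genes that are simply bright everywhere.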

  12. Measures of gene expression • Run CANDID, specifying that we’re interested in the thyroid. http://dsgweb.wustl.edu/llfs/secure_html/hutz/index.html User name: workshop Password: perl031907 • (We’ll need a tissue code for that.)

  13. Example 4 - Results • Our favorite genes: • TP53 - rank is… • 16314th • KRAS - rank is… • 5229th • What genes are ranked most highly?

  14. Example 4 - Results • 192 genes with expression score of 1 • The TOP gene is actually responsible for the phenotype described earlier • Its expression score = 1

  15. Prior evidence • I’m not interested in examining all of the genes in the genome - just some of them. • Linkage and association

  16. Linkage • CANDID can: • Weight regions with higher LOD scores • Limit analysis to certain regions • How does it do this?

  17. Linkage scoring • Linkage score = (gene’s LOD score) / (maximum genome-wide LOD score) [Figure labels: 17, 3, 2]
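
The ratio on this slide can be written as a short function. This is a minimal sketch: the function name is hypothetical, and the handling of negative LOD scores and of a non-positive genome-wide maximum is an assumption, since the slide does not cover those cases.

```python
def linkage_score(gene_lod: float, max_genome_lod: float) -> float:
    """Scale a gene's LOD score by the maximum genome-wide LOD score."""
    if max_genome_lod <= 0:
        # Assumption: with no positive linkage signal, every gene scores 0.
        return 0.0
    # Assumption: negative LOD scores are clamped to 0.
    return max(gene_lod, 0.0) / max_genome_lod


print(linkage_score(3.0, 3.0))  # gene at the genome-wide peak
print(linkage_score(1.5, 3.0))  # gene at half the peak LOD
```

Dividing by the genome-wide maximum keeps the score between 0 and 1, so it can be weighted alongside the other criteria.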

  18. Linkage files • How does CANDID get this linkage information? • CANDID takes two kinds of files • Unformatted output from GENEHUNTER and MERLIN • Custom linkage files

  19. Custom linkage files • Simple format • Line 1 of the file must contain the word “custom” somewhere • Subsequent lines: Chromosome (tab) cM (tab) LOD score • But how do I get cM positions?
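
As a rough illustration of the format just described, the sketch below reads a custom linkage file from a string. The function name is hypothetical, and matching “custom” case-insensitively and skipping blank lines are assumptions; this is not CANDID's own parser.

```python
from typing import List, Tuple


def parse_custom_linkage(text: str) -> List[Tuple[str, float, float]]:
    """Parse 'chromosome<TAB>cM<TAB>LOD' records from a custom linkage file."""
    lines = text.splitlines()
    # Line 1 must contain the word "custom" somewhere.
    if not lines or "custom" not in lines[0].lower():
        raise ValueError("first line must contain the word 'custom'")
    records = []
    for line in lines[1:]:
        if not line.strip():
            continue  # assumption: blank lines are ignored
        chrom, cm, lod = line.split("\t")
        records.append((chrom.strip(), float(cm), float(lod)))
    return records


sample = "custom\n13\t23.64\t0\n13\t23.65\t3\n"
print(parse_custom_linkage(sample))
```

The chromosome is kept as a string so that values such as X would also be accepted.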

  20. Mapmaker • Takes an input file in the format: Chromosome (tab) basepair (tab) LOD score • Outputs a new file in the format: Chromosome (tab) cM (tab) LOD score • Will be available on the CANDID website soon

  21. Example 5: pancreatic cancer • Deletion on chromosome 13 between 23.65 cM and 25.08 cM.

  22. Creating a custom linkage file [Figure: region from 23.65 cM to 25.08 cM] • Example:
        custom
        13 23.64 0
        13 23.65 3
        13 25.08 3
        13 25.08 0

  23. Running CANDID • Try running CANDID using only the linkage criterion. • Now, run CANDID with the linkage criterion and literature criterion (your choice of keywords) • Linkage weight = 1000 • Literature weight = 1

  24. Results • From OMIM: “Individuals with mutations in the BRCA2 gene, which predisposes to breast and ovarian carcinoma, have an increased risk of pancreatic cancer; germline mutations in BRCA2 are the most common inherited alteration identified in familial pancreatic cancer.”

  25. But linkage is so last season…

  26. Association • Increasing numbers of association studies • Increasing numbers of SNPs in each study • Can CANDID use this information, too?

  27. Association • Database • dbSNP - 11.8 million human SNPs • Includes HapMap SNPs • Most comprehensive • Each SNP has an ID number prefixed with “rs”

  28. Association • How does CANDID accept association data? • Custom file format - each line is: rs# (tab) p-value
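
A file in this format could be read along the following lines. This is a sketch with a hypothetical function name; the validation rules (the ID must start with “rs” and the p-value must lie between 0 and 1) are assumptions, not documented CANDID behavior.

```python
from typing import Dict


def parse_association(text: str) -> Dict[str, float]:
    """Parse 'rs#<TAB>p-value' lines into a dict of SNP ID -> p-value."""
    pvalues: Dict[str, float] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines are ignored
        rs_id, p_text = line.split("\t")
        rs_id = rs_id.strip()
        p_value = float(p_text)
        if not rs_id.startswith("rs"):
            raise ValueError(f"not an rs number: {rs_id!r}")
        if not 0.0 <= p_value <= 1.0:
            raise ValueError(f"p-value out of range for {rs_id}: {p_value}")
        pvalues[rs_id] = p_value
    return pvalues


print(parse_association("rs3753396\t0.0444\nrs7724788\t0.75\n"))
```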

  29. Association scoring • For each gene, take the best p-value for that gene’s SNPs • Subtract that p-value from 1 • Unless you test SNPs in every gene, this can be kind of unfair…

  30. Association scoring • Tested 10 genes • Gene 9 has a best p-value of 0.8 (bad) • Gene X was not tested • Should Gene 9 get a higher overall score than Gene X?

  31. p-value threshold • User defines a p-value threshold • Let’s say it’s 0.1. • Any SNPs with p-values above 0.1 are not considered. • Now Gene 9 and Gene X have the same score (0).
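
The scoring rule and the threshold can be sketched together as follows. The function name and the default threshold of 0.1 (taken from the example on this slide) are illustrative; how CANDID maps SNPs to genes is not shown in the slides and is left out here.

```python
from typing import Iterable


def association_score(snp_pvalues: Iterable[float], threshold: float = 0.1) -> float:
    """Score one gene from the p-values of its tested SNPs.

    SNPs with p-values above the threshold are not considered; the score is
    1 minus the best (smallest) remaining p-value, or 0 if none remain.
    """
    kept = [p for p in snp_pvalues if p <= threshold]
    if not kept:
        return 0.0
    return 1.0 - min(kept)


print(association_score([0.8]))  # Gene 9: tested, best p-value 0.8 -> 0.0
print(association_score([]))     # Gene X: not tested -> 0.0
print(association_score([0.0444, 0.0494]))
```

With the threshold in place, a tested gene with a poor p-value and an untested gene end up with the same score of 0.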

  32. Example 6 • Age-related Eye Disease Study • Macular degeneration

  33. Example 6 • Make custom association file:
        rs3753396 0.0444
        rs543879 0.0494
        rs7724788 0.75
      • Run CANDID with this association file

  34. Results [Figure: braces group the SNPs below under the genes CFH and SLC25A46] • rs3753396 0.0444 • rs543879 0.0494 • rs7724788 0.75

  35. So just how well does this work anyway?

  36. Preliminary evidence • Online Mendelian Inheritance in Man • 154 diseases linked to chromosome 1 • Literature, domains - chose keywords • Conservation • Expression - chose tissue codes

  37. Ideal weights • Tested all combinations of weights in those 4 categories • Possible weights: (0, 0.1, … , 0.9, 1) • Which weight combination was the best, across all 154 diseases?
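
To give a sense of the size of this search, the sketch below enumerates every weight combination for the four categories. The category names are taken from the preceding slide; the code itself, and the choice to include the all-zero combination, are assumptions made for illustration.

```python
from itertools import product

CRITERIA = ("literature", "domains", "conservation", "expression")
# Possible weights: 0, 0.1, ..., 0.9, 1
WEIGHTS = [i / 10 for i in range(11)]


def weight_combinations():
    """Yield every assignment of a weight to each of the four criteria."""
    for combo in product(WEIGHTS, repeat=len(CRITERIA)):
        yield dict(zip(CRITERIA, combo))


combos = list(weight_combinations())
print(len(combos))  # 11 ** 4 = 14641
```

Each combination would then be scored by how highly it ranks the known disease gene, averaged across the 154 diseases.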

  38. Top 10 weight combinations • 1. Literature = 1, everything else = 0 • 2. Literature = 0.9, everything else = 0 • 3. Literature = 0.8, everything else = 0 • 4. Literature = 0.7, everything else = 0 • … • 10. Literature = 0.1, everything else = 0 • 11. Literature = 1, domains = 0.1

  39. More specifics • Literature only: average ranking = 425 • Rank 425 out of 38697 ≈ 98.9th percentile • 44/154 genes ranked #1 for at least one set of weights • Chromosome 1: average ranking = 22 • Rank 22 out of 2280 ≈ 99th percentile • 84/154 genes ranked #1 for at least one set of weights
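
The percentile figures on this slide can be checked with a one-line calculation, assuming the percentile is defined as 100 × (1 − rank / total). The function name is hypothetical.

```python
def rank_percentile(rank: float, total: int) -> float:
    """Convert an average rank out of `total` into a percentile."""
    return 100.0 * (1.0 - rank / total)


print(round(rank_percentile(425, 38697), 1))  # 98.9
print(round(rank_percentile(22, 2280), 1))    # 99.0
```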

  40. Analysis of results • They make a lot of sense. • Genes in OMIM are, by definition, well-characterized. • Many diseases are rare, with particular names or keywords that would only appear in papers about the disease genes.

  41. Next steps • Separate OMIM analysis into simple and complex traits • Get new ideal weights • See how well these ideal weights do in ranking candidates from chromosome 2.

  42. Next steps • CANDID’s databases were last compiled in November 2006. • Find publications that have come out since then. • How well does CANDID do in ranking those genes?

  43. Next steps • Many new whole-genome studies and microarray studies implicate lists of candidates. • If CANDID analyzes those phenotypes, how significant is the overlap of CANDID’s top genes and those papers’ top genes?

  44. Next steps • Any other suggestions? • Any interesting data you have?

  45. Any questions?

  46. Acknowledgments • Mike Province • Howard McLeod • Aldi Kraja • Ingrid Borecki • Qunyuan Zhang • Ryan Christensen • John Martin
