Disease Gene Prioritization

Disease Gene Prioritization Presented by Qian Huang

What is a disease • Disease is a condition of the living animal or plant or one of its parts that impairs normal function and is typically manifested by distinguishing signs and symptoms • Definition also describes the malfunction of individual cells or cell groups • Many diseases should be defined on a cellular level. • Sickle cell disease was first documented in 1904 • sickle cell disease became the first disease to be characterized on a molecular level in 1949 • The first genetic diseases was discovered

Genetic diseases • A genetic disease is any disease that is caused by an abnormality in an individual's genome • It is rarely that one gene is responsible for one function. An assembly of genes constitutes a functional module or a molecular pathway. • a molecular pathway leads to some specific end point in cellular functionality via a series of interactions between molecules in the cell. • Any changes in the normally molecular interactions and pathways may lead to disease • The specifics of a change determine the severity and the type of the resulting disease

Genetic diseases • inherited from the parents or caused by mutations • a number of different types of genetic diseases • Single gene disorder - Mendelian or monogenetic inheritance - caused by changes or mutations that occur in the DNA sequence of a single gene - over 4000 human diseases caused by single gene disorder -occur in about 1 out of every 200 births - dominant: Only one mutated copy of the gene will be necessary for a person to be affected. one affected parent, 50% chance - recessive: Two copies of the gene must be mutated for a person to be affected. Two unaffected people each carry one copy of the mutated gene, 25% chance the child affected

Genetic diseases • Chromosome abnormalities - distinct structures made up of DNA and protein - caused by abnormalities in chromosome number or structure - due to a problem with cell division • Multifactorial gene disorder - caused by a combination of environmental factors and mutations in multiple genes - heart disease, high blood pressure, cancer, diabetes, obesity

Genetic diseases • Identifying the relationship between human genetic diseases and their causal genes is important in human medical improvement • Revealing the genetic basis of human disease is a fundamental aim of the human genetic studies • The Human Genomic Project started in 1990 • The genomic studies rapid accumulate large amount of genomic data • a lot of computational methods were proposed to prioritize candidate casual genes by considering the relationship of candidate genes of a given phenotype and existing known disease genes

What is Gene Prioritization • Gene prioritization is the process of assigning likelihood of gene involvement in generating a disease phenotype. • narrows down, and arranges the set of genes to be tested experimentally. • based on various correlative evidence that associate each gene with the given disease and suggest possible causal links • Evidence comes from high-throughput experimentation, including gene expression and function, pathway involvement, and mutation effects

Why using Gene Prioritization • Proving a causal link between a gene and a disease experimentally is expensive and time-consuming • Using computational prioritization of candidate genes prior to experimental testing can drastically reduce the associated costs and improve the outcomes of targeted experimental studies • High-throughput experimental techniques has contributed significantly to the identification of disease-associated genes and mutations and reported a large number of data • Gene prioritization is a computational method to deal with the quantity of data, effectively translate the experimental data into legible disease-gene associations

Identification of disease-genes • Disease results from the changes of normal function • Four reasons of pathway function changes (1) changes in gene expression (2) changes in structure of the gene-product (3) introduction of new pathway members (4) environmental disruptions • Defining molecular pathways whose disrupted functionality is necessary and sufficient to cause the disease • All members of the affected pathways can be construed as disease genes • Identification of disease-genes is difficult

How to identify disease-genes • Disease genes are most often identified using: (1) genome wide association or linkage analysis studies (2) similarity or linkage to co-expression with known disease genes (3) participation in known disease-associated pathways or compartments. • Methods represented by direct and indirect evidence • Direct : evidence coming from own experimental work and from literature • Indirect: genes that are in any way related to already established disease-associated genes

Indirect evidence Very broadly, gene-disease associations are inferred from evidence of five aspects (1) Functional Evidence The suspect gene is a member of the same molecular pathways as other disease-genes (2)Cross-species Evidence The suspect gene has homologues implicated in generating similar phenotypes in other organisms

Indirect evidence (3) Same-compartment Evidence The suspect gene is active in disease-associated pathways (e.g. ion channels), cellular compartments (e.g. cell membrane), and tissues (e.g. Liver) (4) Mutation Evidence The suspect genes are affected by functionally deleterious mutations in genomes (5) Text Evidence There is ample co-occurrence of gene and disease terms in scientific texts

Overview of gene prioritization data flow

Molecular Interactions • Many gene prioritization tools used gene-gene (protein- protein) interaction and pathway information to prioritize candidate genes. • genes responsible for similar diseases often participate in the same interaction networks • MC4R is a receptor and known to be associated with severe obesity • The interactorsof MC4R may be predicted to be linked to obesity. • AgRP and POMC directly bind MC4R for varied purposes of the MC4R pathway. • mutations that negatively affect normal POMC production or processing have been shown to be obesity- associated • AgRP have been linked to food intake abnormalities

MC4R-centered protein-protein interaction network

Regulatory and genetic linkage • Co-regulation of genes has traditionally been thought to point to same molecular pathways and similar disease • Co-expressed genes often cluster together in different species lead to genetic linkage • Genes co-expressed with or genetically linked to other disease genes are also likely to be disease-associated • They also pose a problem

Problem with Regulatory and genetic linkage • A given disease-associated gene may be co-regulated with or linked to another disease-associated gene • The two diseases are not identical • It is difficult to distinguish the actual causes of disease and co-occurring with the disease-mutations due to genetic linkage.

Similar sequence/structure/ function • Prioritization tools often use functional similarity as an input feature • Predictors relying on functional similarity to determine disease association will link two genes sharing a same function • Functionally similar genes are likely to produce similar disease phenotypes, sequence/structure similarities are indicators of similar disease involvement • Disease genes are often associated with specific gene and protein features • higher exon number • longer gene length

Cross-species Evidence • Cross-species comparisons of orthologues and their associated phenotype • Finding related phenotypes across species suggests orthologous human candidate genes • MC4R is known to be associated with severe obesity • Polar bears have a V95I mutation on MC4R for their need to increase body fat to adapt to their environment • may have a similar (increased body fat) effect in humans

Cross-species Evidence • A correlation of gene co-expression across species is also useful for gene prioritization • Genes that are part of the same functional module are generally co-expressed • functionally unrelated genes also could be co-expressed • Comparing genes co-expressed in human and other organisms can be used to infer disease-genes • Acluster of functionally unrelated genes co-expressed in human and mouse contained a disease-gene KCNIP4 • The initial list of 1,762 genes mapped to 850 OMIM (Online Mendelian Inheritance in Man)phenotypes narrow to twenty times fewer possible disease-causing genes.

Compartment Evidence • Changes in gene expression in disease-affected compartments and tissues are associated with many complex diseases • Predict suspect gene in the disease-associated pathway, compartments and tissues. • multiple storage diseases all are caused by the impairment of the degradation pathways of the intracellular transport.

Mutant Evidence • Every genetic disease is associated with some sort of mutation that alters normal functionality • Selection of candidate genes for further analysis is often based on mutations in diseased individuals • not all observed mutations are associated with deleterious effects: (1) no effect at all - silent mutations (2) some is deleterious with respect to normal function (3) weakly beneficial • strongly deleterious mutation are relatively rare because they are rapidly removed by selection • A candidate gene carrying a deleterious mutation is more likely to be disease-associated than gene with other mutationor no mutationat all

Mutant Evidence • Structural variation (SV) insertions and deletions, inversions, translocations.. • Nucleotide polymorphisms SNPs (single nucleotide polymorphisms) MNPs (multi-nucleotide polymorphisms • 90% of human variation exists in the form of Nucleotide polymorphisms

Structural variation • Structural variation (SV) is the least studied of all types of mutations • less than 10% of human genetic variation is in the form of genome structural variants • Each of the structural variants is large, the total number of base pairs affected by SVs may actually be comparable to the number of base pairs affected by the much more common SNPs • Gross changes to genome sequence are very likely to be disease associated

Structural variation • High throughput detection of structural variants is difficult • there are only 180 thousand structural variants reported in one of the most complete mutation collections-DGV • Do not currently know what proportion of genetic disease is caused by SVs • Disease is caused by change of a sequence, all of the genes found in these regions of the genome are, by default, associated with the disease, but none of them can be considered primarily causal • If diseases that are associated with SVs, the prioritization of disease-causing genes is only finding those that are directly affected by the mutation

Nucleotide polymorphisms • A single human genome is expected to contain roughly 10–15 million SNPs per person • 93% of all human genes contain at least one SNP • MNPs are rare as compared to SNPs • nearly 43 million validated human SNPs - coding SNPs - non-coding SNPs

coding SNPs and non-coding SNPs (1) coding SNPs 17.5 million have been experimentally mapped to functionally distinct regions of the genome, only 0.4M are in coding region while the rest are in non-coding region Coding SNPs are over-represented in disease associations e.g. OMIM contains 2430 non-coding SNPs (0.0001% of all) and 5327 coding ones (0.01% of all) - synonymous (no effect on protein sequence) - non-synonymous (single amino acid substitution) (2) non-coding SNPs more prevalent than coding SNPs because the majority of the genome is non-coding

non-synonymous SNPs • more studied • two types: - Missense : change results in a different amino acid - Nonsense : produce a premature stop codon • nonsense mutants result in early termination of the protein, very often associated with disease • Missense SNPs alter the protein sequence without destroying it, may or may not be disease associated • most methods estimate that only 25–30% of the nsSNPs negatively affect protein function

Nucleotide polymorphisms • Identifying and annotating functional effects of SNPs and MNPs is important in the gene prioritization • Genes selected for further disease-association studies are more likely to contain a deleterious mutation • A number of methods were created for identifying mutations as functionally deleterious in regulatory regions • Coding synonymous SNPs have recently been shown to have the same chance of being involved in a disease as non-coding SNPs due to reasons such as codon usage bias • Few computational methods are able make predictions with their functional effects

Text Evidence • Huge amounts of datacould potentially improve the performance of any gene prioritization method • Specialized tools can be used to prioritize diseases associated genes • Researchers make their data computationally available from databases • Depositing knowledge obtained through reading and manual curation into databases

Text Evidence • Hidden in plain site in natural language text of scientific publications • A casual search in PubMed for the term breast cancer generates over two hundred thousand matches. Limiting the field to genetics of breast cancer reduces to fifty thousand. • Scientific text mining tools allow for intelligent identification of possible gene-gene and disease-gene correlations

The Inputs and Outputs • Functionality of prioritization methods is defined by previously known information about the disease and candidate search space • Disease information: disease-associated genes, affected tissues and pathways and relevant keywords • candidate search space : - automatically selected by the tool (the entire genome) - submitted by the user (suspect genes) • providing a list of very broad keywords may reduce the performance specificity, while incorrect candidate search space automatically decreases sensitivity • Output is produced based on input, produce ranked/ordered lists of genes • The prioritization accuracydepends on the accuracy and specificity of the inputs

The Processing • Gene prioritization methods use different algorithms to product all the data they extract • mathematical/statistical models/methods • fuzzy logic • artificial learning devices • Some methods use combinations of the above. • No one methodology is better than the others for all data inputs • refer to relevant tool publications and method-specific literatureto get details on computational methods used in the various approaches

Figure 5. Predicting gene-disease involvement using artificial neural networks (ANNs). Bromberg Y (2013) Chapter 15: Disease Gene Prioritization. PLoS Comput Biol 9(4): e1002902. doi:10.1371/journal.pcbi.1002902 http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1002902

Gene prioritization tools • Many different Gene prioritization toolshave been developed • Endeavour: - a web resource for the prioritization of candidate gene - inferring several models (based on various genomic data sources) - applying each model to the candidate genes to rank those candidates against the profile of the known genes and - merging the several rankings into a global ranking of the candidate genes

Gene prioritization tools • G2D (genes to diseases): - a web resource for prioritizing candidates genes - uses three algorithms based on different prioritization strategies - input: (1) the genomic region where the user is looking for the disease-causing mutation, (2) an additional piece of information depending on the algorithm used - output in every case is an ordered list of candidate genes in the region of interest

Summary • Gene prioritization methods are developed to link genes to diseases by extracting and combining the various information • Rely on experimental work such as disease gene linkage analysis and genome wide studies to establish the search space of candidate genes • Use mathematical and computational models of disease to filter the original set of genes based on gene and protein sequence, structure, function, interaction and expressioninformation • Translate the experimental data into legible disease-gene associations effectively

Disease Gene Prioritization