IMPROVED TECHNIQUES FOR THE IDENTIFICATION OF PSEUDOGENES

L. Coin and R. Durbin, Wellcome Trust Sanger Institute. Bioinformatics, 2004. Presented by: Oscar Sanchez Plazas

Outline • Problem definition • Previous works on pseudogene identification • Proposed method • Protein domain profile (Pfam) • Algorithm • Results and Discussion

Pseudogene Identification • Pseudogene: remnant of a genomic gene sequence that is no longer translated into a functional protein. • Non-processed (duplicated): product of genome duplication (paralogous); loss of function at the transcription or translation level. • Processed (~70%): product of retrotransposition; no introns, no promoter. (*) Plagiarized Errors and Molecular Genetics. Edward E. Max, M.D.

(*) http://www.pseudogene.org/definition.html

Pseudogenes • Significance: • Comparative genomics • Evolution of DNA, new gene expression patterns • Study of mechanisms for regulation of gene expression • Verification of gene sequences in databases

Pseudogenes • Are they functional? (Why such high conservation compared to prokaryotes?) “Pseudogenes exhibit evolutionary conservation of gene sequence, reduced nucleotide variability, excess synonymous over nonsynonymous nucleotide polymorphism, and other features that are expected in genes or DNA sequences that have functional roles” (1). (1) Pseudogenes: Are They “Junk” or Functional DNA? Evgeniy S. Balakirev, Francisco Ayala. 2003. • An expressed pseudogene regulates the messenger-RNA stability of its homologous coding gene. Hirotsune, S. et al. Nature. 2003. • The putatively functional Mkrn1-p1 pseudogene is neither expressed nor imprinted, nor does it regulate its source gene in trans. Gray TA, Wilson A, Fortin PJ, Nicholls RD. PNAS. 2006. (*) www.answersingenesis.org/tj/v17/i2/pseudogene.asp

Problem • Pseudogenes are sometimes mis-annotated as functional genes in gene sequence databases. Key insight: • Employ an evolutionary constraint model derived from a functional characterization of the gene product. • Constrained vs. neutral model

Previous approaches • Presence of stop codons and frameshifts. • Not very sensitive (~50% are detectable) (*) Large-scale analysis of pseudogenes in the human genome. Zhaolei Zhang, Mark Gerstein
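The disablement check on this slide can be illustrated with a minimal sketch. It assumes the input is a putative coding sequence that starts in frame; the function name `disablements` is hypothetical, and a length that is not a multiple of three is used only as a crude stand-in for a frameshift, which in practice is detected by aligning the candidate against its parent gene.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def disablements(cds):
    """Report simple disablements in a putative coding sequence.

    Assumption: cds starts at the first base of the first codon.
    A stop codon in the final codon position is treated as normal.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # Any stop codon before the last codon is premature.
    premature = [i for i, c in enumerate(codons[:-1]) if c in STOP_CODONS]
    return {
        "premature_stops": premature,
        # Crude frameshift proxy only; not an alignment-based test.
        "length_not_multiple_of_3": len(cds) % 3 != 0,
    }
```

A sequence that passes both checks can still be a pseudogene, which is the sensitivity limit the slide points to.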

Previous approaches • Ratio of non-synonymous to synonymous substitution rates (dN/dS) • Not very accurate, e.g. for genes under positive selection pressure. (*) Genome-wide survey of human pseudogenes. Torrents,D., Suyama,M., Zdobnov,E. and Bork,P.
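To make the synonymous / non-synonymous distinction concrete, here is a small sketch that counts the two kinds of codon differences between two aligned, in-frame, gap-free sequences. It uses the standard genetic code; the function name `count_codon_changes` is hypothetical. These are raw counts only: a real dN/dS estimate also normalizes by the number of synonymous and non-synonymous sites and corrects for multiple hits, which this sketch does not attempt.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def count_codon_changes(seq1, seq2):
    """Count synonymous and non-synonymous codon differences.

    Assumption: seq1 and seq2 are aligned, in frame, and gap-free.
    Codons containing unknown characters are skipped.
    """
    syn = nonsyn = 0
    for i in range(0, min(len(seq1), len(seq2)) - 2, 3):
        c1, c2 = seq1[i:i + 3].upper(), seq2[i:i + 3].upper()
        if c1 == c2 or c1 not in CODON_TABLE or c2 not in CODON_TABLE:
            continue
        if CODON_TABLE[c1] == CODON_TABLE[c2]:
            syn += 1      # same amino acid: synonymous
        else:
            nonsyn += 1   # different amino acid (or stop): non-synonymous
    return syn, nonsyn
```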

Model Proposed • PSILC: Pseudogene inference from loss of constraint (log-odds score) • Protein domain evolution (functional constraint) - null probability model (Pfam) • Neutral nucleotide model • Protein coding model

Domain Profile - HMM • Protein domains: structural, functional and evolutionary units of proteins • HMM profiles: the most sensitive models for domains • State types: match, insertion, deletion • Every match and insertion state has its own emission distribution (over {A,C,G,T} for a DNA profile; over the 20 amino acids for a Pfam protein profile) (*) genome.nasa.gov/MediaLib/hmm_project_fig2.jpg

(*) http://pfam.sanger.ac.uk//family/TAF
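As a rough illustration of position-specific emission distributions, the sketch below scores a sequence against the match states of a toy profile. It is a deliberate simplification: insertion and deletion states and all state-transition probabilities are ignored, the sequence is assumed to be already aligned one residue per match column, and the name `match_log_odds` and the toy DNA emissions in the usage note are invented for the example.

```python
import math

def match_log_odds(seq, match_emissions, background):
    """Sum of log2(emission / background) over match columns.

    Assumption: seq is pre-aligned, one symbol per match column, so no
    insert or delete states are needed in this toy version.
    """
    if len(seq) != len(match_emissions):
        raise ValueError("sequence must be pre-aligned to the match columns")
    return sum(
        math.log2(emit[symbol] / background[symbol])
        for symbol, emit in zip(seq, match_emissions)
    )
```

With a uniform background of 0.25 per base and two match columns that favour A and G respectively (0.7 for the favoured base, 0.1 for each other base), the sequence "AG" gets a positive score and "CT" a negative one.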

Model Proposed • Objective • Look at pattern of substitution in conserved protein domains • Algorithm • Input • Alignment A • Unrooted tree T • Profile HMM D (aligned with A) • Output • Score for a leaf of the tree which represents the belief that the node corresponds to a pseudogene.

Algorithm • Notation • Xn. : row corresponding to leaf-node n. • X.i : i-th column. • A\Xn. : Alignment A excluding Xn. • mj : j-th match column of profile HMM. • pn : parent node of n. • bn : branch from pn to n. • T\bn : Tree T excluding bn.

Algorithm • Input: Unrooted tree T, Alignment A, profile HMM D • Output: Log-odds scores: • A neutral nucleotide model compared to a Pfam domain encoding model (PSILC-nuc/dom) • A protein coding model compared to a Pfam domain encoding model (PSILC-prot/dom). Evolutionary model

Algorithm • Independence assumptions • xni is independent of the other columns in its row, given A\xn • xni is independent of the other columns in A\xn, given x.i\xni • Tree assumption: xni is independent of x.i\xni, given xpni
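One consequence of the column-independence assumptions is that a log-odds score for a row can be written as a sum of per-column terms. The sketch below shows only that bookkeeping step; the per-column probabilities are taken as given inputs, and the name `log_odds_score` is hypothetical rather than taken from the paper.

```python
import math

def log_odds_score(p_alternative, p_null):
    """Sum per-column log-odds of an alternative model against a null model.

    p_alternative[i] and p_null[i] are the probabilities each model assigns
    to the observed symbol in column i. Summing logs is only justified when
    the columns are treated as independent.
    """
    if len(p_alternative) != len(p_null):
        raise ValueError("both models must score the same columns")
    return sum(math.log(a / n) for a, n in zip(p_alternative, p_null))
```

A positive total means the alternative model (for example the neutral nucleotide model) explains the row better than the null domain model.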

Algorithm • Steps: • Calculate the distribution at xpni given the evolutionary constraints on the other branches. • For each residue/base at xpni, calculate the transition probability to xni given the evolutionary constraints. • pn is set as the root of T • Prior distribution: stationary distribution of Q
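The two steps above amount to marginalizing over the parent state. A minimal sketch follows, with one loudly stated substitution: the Jukes-Cantor (JC69) model is used for the transition probabilities because it has a closed form, not the HKY or WAG-based models named in the slides. The parent distribution is assumed to have been computed already from the rest of the tree, and both function names are hypothetical.

```python
import math

BASES = "ACGT"

def jc69_transition(a, b, t):
    """P(child = b | parent = a) along a branch of length t under JC69.

    t is measured in expected substitutions per site.
    """
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def leaf_probability(parent_dist, leaf_base, t):
    """Marginalize over the parent: sum_a P(parent = a) * P(a -> leaf_base)."""
    return sum(parent_dist[a] * jc69_transition(a, leaf_base, t) for a in BASES)
```

With a zero-length branch the leaf simply inherits the parent distribution, and for very long branches the result approaches the uniform stationary distribution of JC69.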

Evolutionary Model • Instantaneous rate matrix (Q)*: • DNA models: HKY model (^ - uniform) • Amino acid model: database estimates (WAG, ^) • π - steady-state distribution (vs. equilibrium): • Alternative models: π observed in A • Null model: distribution of the state in the HMM • Parameters (ML): • f: trade-off in mutation pressure (from-to) • r: evolutionary rate • κ: transition/transversion ratio (*) A Novel Use of Equilibrium Frequencies in Models of Sequence Evolution. Nick Goldman and Simon Whelan
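For reference, the plain HKY rate matrix can be built from the stationary frequencies π and the transition/transversion ratio κ as sketched below. This is the textbook HKY parameterization only; it does not include the f parameter of the Goldman and Whelan extension, and it is not rescaled to one expected substitution per unit time. The function name `hky_rate_matrix` is hypothetical.

```python
BASES = "ACGT"
PURINES = {"A", "G"}

def hky_rate_matrix(pi, kappa):
    """Return Q as a nested dict with Q[i][j] = rate from base i to base j.

    Off-diagonal: pi[j] * kappa for transitions (purine <-> purine or
    pyrimidine <-> pyrimidine), pi[j] for transversions.
    Diagonal: minus the sum of the other entries, so each row sums to 0.
    """
    q = {}
    for i in BASES:
        row = {}
        for j in BASES:
            if i == j:
                continue
            is_transition = (i in PURINES) == (j in PURINES)
            row[j] = pi[j] * (kappa if is_transition else 1.0)
        row[i] = -sum(row.values())
        q[i] = row
    return q
```

Two quick sanity checks for any Q built this way: π is stationary (the sum over i of pi[i] * Q[i][j] is 0 for every j), and detailed balance holds (pi[i] * Q[i][j] equals pi[j] * Q[j][i]).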

Algorithm • Directionality of the calculation • The score on an alignment of two transcripts x1, x2 is not symmetric (detailed balance). • If base x1i is more likely than x2i at a particular match state, but both are equally likely under the protein model, the score for x2 being a pseudogene is higher than the score for x1. • dN/dS does not have this property (a third sequence would have to be used). • Requires a Pfam model (independent)

Results Data: Chromosome 6 of the human genome • Manually annotated (pseudo)genes • Blast search-ENSEMBL e<10^-7 (>80%) (<99%) • Multiple alignment: ClustalW • Maximum-likelihood distances. • Nearest neighbor tree. • 598 (875) coding transcripts, 97 (158) pseudogenes

Results • ROC • Why is PSILC-prot/dom better than PSILC-nuc/dom?
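For readers less familiar with ROC comparisons, the area under a ROC curve can be computed directly from two lists of scores, as in the sketch below. It uses the pairwise-ranking definition (the probability that a randomly chosen positive outscores a randomly chosen negative, with ties counted as one half). The function name `auc` is hypothetical, and the numbers in the usage note are toy values, not results from the paper.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison.

    pos_scores: scores of true pseudogenes; neg_scores: scores of coding
    transcripts. Quadratic in input size, which is fine for a sketch.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = sum(
        (p > n) + 0.5 * (p == n)
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))
```

For example, `auc([3, 2], [1, 0])` is 1.0 (perfect separation), while `auc([2, 0], [1])` is 0.5 (no better than chance).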

Results • Better discrimination

Question • What is the main difference between the HMMs previously studied (e.g. pairwise alignment) and profile HMMs? Why are the latter important for the identification of pseudogenes?
