EXPLORING DEAD GENES

EXPLORING DEAD GENES Adrienne Manuel I400

What are they? • Dead genes are also called pseudogenes • Pseudogenes are nonfunctional copies of genes in genomic DNA • They result from reverse transcription of an mRNA transcript • Or from gene duplication and subsequent disablement

Expression of Pseudogenes • Evidently transcribed • Expression of pseudogenes varies • Snail (Lymnaea stagnalis): example of an organism that still has functioning

Pseudogenes, Good and Bad! • - Raised expression in tumor cells • + Useful in studying molecular evolution • + Helpful in determining rates of genomic DNA loss for an organism

Size and Distribution of Pseudogenes: Defining Populations and Subpopulations • G is the total population of confirmed and predicted protein-encoding genes • ΨG is the estimated population of pseudogenes that corresponds to G

GE is the set of genes with at least one verifying EST match • GM is a set of genes deemed to be highly expressed, derived from microarray expression data • The corresponding predicted pool of pseudogenes is denoted ΨGM
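The populations and subpopulations defined above can be pictured as plain sets. The following is only a minimal sketch: all identifiers are hypothetical, and it assumes that a pseudogene "corresponds" to a gene set when its closest matching gene belongs to that set, which the slides do not spell out.

```python
# Hypothetical identifiers, for illustration only (not real worm genes).
G = {"geneA", "geneB", "geneC", "geneD"}   # confirmed and predicted genes
est_matched = {"geneA", "geneB"}           # genes with at least one EST match
highly_expressed = {"geneB"}               # genes flagged by microarray data

G_E = G & est_matched        # subpopulation GE
G_M = G & highly_expressed   # subpopulation GM

# Assumed mapping from each pseudogene to its closest matching gene.
closest_gene = {"psi1": "geneA", "psi2": "geneB", "psi3": "geneD"}


def pseudogene_pool(genes, closest_gene):
    """Return the pseudogenes whose closest matching gene is in `genes`."""
    return {psi for psi, gene in closest_gene.items() if gene in genes}


Psi_G = pseudogene_pool(G, closest_gene)     # pool corresponding to G
Psi_GE = pseudogene_pool(G_E, closest_gene)  # pool corresponding to GE
Psi_GM = pseudogene_pool(G_M, closest_gene)  # pool corresponding to GM
```

With this toy data, ΨGM is a subset of ΨGE, which in turn is a subset of ΨG, mirroring the nesting of the gene sets.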

Data Files • The Sanger Sequencing Centre FTP site (ftp://ftp.sanger.ac.uk) holds the six complete sequences of the worm chromosomes • GFF data files hold annotations for genes and other genomic features that correspond to wormpep18 • The pseudogene population was assembled by means of a pipeline
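GFF files are tab-separated, with one feature per line (sequence name, source, feature type, start, end, score, strand, frame, attributes). The following is a minimal reading sketch, not the authors' code; the sample rows and the `Feature` type are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class Feature(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    strand: str


def parse_gff(lines: Iterable[str]) -> Iterator[Feature]:
    """Yield one Feature per data line, skipping comments and short rows."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 8:
            continue  # not a complete GFF record
        yield Feature(cols[0], cols[1], cols[2],
                      int(cols[3]), int(cols[4]), cols[6])


# Hypothetical sample rows, built with explicit tabs.
sample = [
    "# hypothetical annotation file",
    "\t".join(["CHROMOSOME_I", "curated", "exon", "100", "250", ".", "+", "."]),
    "\t".join(["CHROMOSOME_I", "repeat", "tandem", "900", "950", ".", "-", "."]),
]
features = list(parse_gff(sample))
exons = [f for f in features if f.feature == "exon"]
```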

Pipeline Step 1: Sanger Centre pseudogene annotations • Start with a list of 332 pseudogenes • The pseudogene population was derived by looking for gene disablement Step 2: FASTA matching to find potential pseudogenes

Pipeline (continued) • Worm genes were masked for low-complexity regions with the program SEG • TFASTX and TFASTY were then used to compare the complete wormpep18 against the worm genome • After the comparison, the pseudogene matches were refined in the following steps

Pipeline (continued) Step 3: reduction for overlaps on the genomic DNA • Significant matches of protein sequences to the DNA were reduced for redundancy where homologs match the same segment of DNA • Matches were then sorted Step 4: prevention of overcounting for adjacent matches • Initial matches may correspond to the same pseudogene • To avoid overcounting, matches were realigned
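Steps 3 and 4 can be pictured with a small interval sketch. It makes two assumptions: redundancy is resolved by keeping the match with the lowest e-value on each overlapping segment, and the realignment in Step 4 is replaced by a simpler merge of nearby matches to the same protein. The `Match` type, the `max_gap` value and the sample data are all hypothetical.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    protein: str
    chrom: str
    start: int
    end: int
    evalue: float


def overlaps(a: Match, b: Match) -> bool:
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def reduce_overlaps(matches: List[Match]) -> List[Match]:
    """Step 3 sketch: keep only the best match on each overlapping segment."""
    kept: List[Match] = []
    for m in sorted(matches, key=lambda m: m.evalue):
        if not any(overlaps(m, k) for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: (m.chrom, m.start))


def merge_adjacent(matches: List[Match], max_gap: int = 500) -> List[Match]:
    """Step 4 stand-in: merge nearby matches to the same protein."""
    merged: List[Match] = []
    for m in sorted(matches, key=lambda m: (m.chrom, m.protein, m.start)):
        prev = merged[-1] if merged else None
        if (prev is not None and prev.chrom == m.chrom
                and prev.protein == m.protein
                and m.start - prev.end <= max_gap):
            merged[-1] = Match(prev.protein, prev.chrom, prev.start,
                               max(prev.end, m.end),
                               min(prev.evalue, m.evalue))
        else:
            merged.append(m)
    return merged


# Hypothetical matches: P2 is a homolog hitting the same segment as P1,
# and the second P1 hit is a neighbouring fragment of the same pseudogene.
matches = [
    Match("P1", "I", 100, 400, 1e-20),
    Match("P2", "I", 150, 420, 1e-5),
    Match("P1", "I", 600, 900, 1e-10),
    Match("P3", "II", 50, 300, 1e-8),
]
non_redundant = reduce_overlaps(matches)
candidates = merge_adjacent(non_redundant)
```

Merging on a fixed gap is cruder than realignment, but it shows why two initial matches should not be counted as two pseudogenes.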

Pipeline (continued) Step 5: masking against Sanger Centre annotations and the transposon library • Potential pseudogenes were filtered for overlap with any other annotations in the Sanger Centre GFF files, e.g. exons of genes, tandem or inverted repeats Step 6: reduction for possible additional repeat elements • At this point there is a set of 3814 pseudogenic fragments
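The masking in Step 5 amounts to discarding any candidate that overlaps an annotated feature. A minimal sketch, assuming candidates and annotations are both reduced to (chromosome, start, end) triples; the coordinates are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), inclusive


def intervals_overlap(a: Interval, b: Interval) -> bool:
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def mask_against(candidates: List[Interval],
                 annotated: List[Interval]) -> List[Interval]:
    """Keep only candidates that overlap no annotated feature."""
    return [c for c in candidates
            if not any(intervals_overlap(c, a) for a in annotated)]


# Hypothetical candidates and annotations (exons, repeats, transposons).
potential = [("I", 100, 400), ("I", 1000, 1300), ("II", 50, 300)]
annotations = [("I", 350, 500), ("II", 400, 600)]
unmasked = mask_against(potential, annotations)
```

The quadratic scan is fine for a toy example; on whole chromosomes a sorted sweep or an interval tree would be the usual choice.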

Pipeline (final step) Step 7: reducing the match threshold • E-value match threshold reduced from 0.01 to 0.001 (a more stringent cutoff) Check the web! • http://bioinfo.mbb.yale.edu/genome/womr/pseudogene • To explore the pseudogene population, the data can be viewed either by searching for a protein name or by viewing a specific range of a chromosome

Size of Pseudogene Population • Composed of 2168 sequences, about 12% of the total gene complement • Factors that affect the size estimate: 1. Dead copies of transposable elements 2. The size is underestimated because pseudogenes with less obvious disablements aren't included 3. Some annotated genes might be pseudogenes whose disablement is undetectable 4. Pseudogenes still part of a functioning gene 5. Some apparent pseudogenes arise from sequencing errors 6. Possible genomic repeats

SUBPOPULATIONS • Highly expressed genes have fewer dead gene copies • The most reliable subset of the pseudogene population is about half the total for ΨG • 39% of pseudogenes are intronic; these kinds of pseudogenes aren't ailing families of proteins

Chromosomal Distributions • Pseudogenes are more abundant near the ends of the chromosomes (the “arms”) • For each chromosome, a proportion of dead genes is calculated
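The per-chromosome proportion can be computed from simple counts. The sketch below assumes the proportion means pseudogenes divided by genes plus pseudogenes on that chromosome; the slides do not give the exact definition, and the counts are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def dead_gene_proportion(gene_chroms: Iterable[str],
                         pseudogene_chroms: Iterable[str]) -> Dict[str, float]:
    """Return pseudogenes / (genes + pseudogenes) for each chromosome."""
    genes = Counter(gene_chroms)
    pseudo = Counter(pseudogene_chroms)
    return {chrom: pseudo[chrom] / (genes[chrom] + pseudo[chrom])
            for chrom in sorted(set(genes) | set(pseudo))}


# Hypothetical chromosome labels, one entry per gene or pseudogene.
gene_locations = ["I"] * 8 + ["II"] * 9
pseudogene_locations = ["I"] * 2 + ["II"] * 1
proportions = dead_gene_proportion(gene_locations, pseudogene_locations)
```

The same counting works for arms versus centers if each entry is labelled with a chromosome region instead of a whole chromosome.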

The data plot compares overall amino acid composition between the gene set (G) and the pseudogene set. • The percentage composition for each of the 20 amino acids is graphed in decreasing order of the implied amino acid composition in the pseudogene set. In the bottom part of the figure, the difference from G for each amino acid's composition is indicated by a bar.
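The quantities in such a plot can be derived as follows. This is a minimal sketch with hypothetical toy sequences: it computes the percentage composition of each set, orders the amino acids by decreasing composition in the pseudogene set, and takes the pseudogene-minus-gene difference for each residue.

```python
from collections import Counter
from typing import Dict, Iterable

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seqs: Iterable[str]) -> Dict[str, float]:
    """Percentage of each of the 20 amino acids over all sequences."""
    counts = Counter(aa for s in seqs for aa in s if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no amino acid residues found")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


# Hypothetical toy sequences, not real worm proteins.
gene_seqs = ["AAKL"]
pseudogene_seqs = ["ALLL"]

comp_gene = composition(gene_seqs)
comp_pseudo = composition(pseudogene_seqs)

# Amino acids in decreasing order of composition in the pseudogene set.
ranked = sorted(AMINO_ACIDS, key=lambda aa: comp_pseudo[aa], reverse=True)

# Bar heights: pseudogene percentage minus gene percentage.
difference = {aa: comp_pseudo[aa] - comp_gene[aa] for aa in AMINO_ACIDS}
```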

Listed are the largest sequence families in the worm, ranked by genes and by pseudogenes • They are named for a particular representative. Four of the top 10 paralogous gene families, ranked by number of genes, are functionally uncharacterized • Three of the top 10 pseudogene families are also among the biggest families when ranked by number of genes

Pseudofolds • These charts are ranked in terms of implied structural pseudofolds • Proteins encoded by the worm genome have been assigned to globular domain folds • The fold assignments come from the SCOP database

Why was this studied again? • To provide an initial estimate of the size, distribution and characteristics of the pseudogene population in C. elegans, in an attempt to estimate the total number in humans • Found few pseudogenes in the worm genome that are apparently due to processing • Found a large uncharacterized gene family that makes up 2/3 of dead genes • The arms of the chromosomes are an unreliable place for encoding genes but are more likely to spawn new proteins
