This study focuses on aligning transcribed sequences to the human and mouse genomes to delineate gene boundaries, investigate alternative splicing, and validate DoTS assemblies. Tools like BLAT and alignment quality criteria are employed to identify accurate alignments and estimate total gene numbers.
Aligning Transcribed Sequences to the Human and Mouse Genomes • Yongchang Gan, Jonathan Crabtree, Chris Stoeckert • Computational Biology and Informatics Laboratory (CBIL), Center for Bioinformatics, University of Pennsylvania
The Transcribed Sequences • dbEST expressed sequence tags (ESTs) • ~4 million human • ~2.5 million mouse • Highly variable quality • GenBank mRNAs and RefSeqs • Many are “full length”, high quality • Includes RIKEN cDNAs • Did not include GenBank HTC division
DoTS: Database of Transcribed Sequences • Cluster ESTs & mRNAs by similarity • Assemble the clusters with CAP4 • Goal is to produce one sequence per transcript • Annotate resulting consensus seqs. • Predict protein sequences • Run BLAST searches • Predict GO function • Link to RH maps, gene trap cell lines, expression data, MGI, GeneCards, etc. • Results at http://www.allgenes.org
DoTS “Singletons” • Sequences that do not assemble with anything else in the database • Singletons are usually ESTs • Represent either 5’ or 3’ end of a gene
The Genomes: Human • Recent events • June 2000: “working drafts” announced • Feb. 2001: first analyses published • Feb. 2002: UCSC exits assembly business • Current public draft sequence • July, 2002: NCBI Build #30 • June 28, 2002 freeze of GenBank data • 87% finished seq., est. 94-97% coverage
The Genomes: Mouse • Recent events (public sequence) • Late 2000: shotgun sequencing begun • Late 2001: first assemblies created • April 2002: Arachne chosen over Phusion • Current public draft sequence • April, 2002: MGSCv3 • February, 2002 freeze of ~7X shotgun • Estimated 90-95% coverage
Aligning transcripts with DNA • [Figure: a transcribed sequence (e.g., mRNA) with 5’ UTR, CDS and 3’ UTR regions, aligned to the genome (i.e., DNA), where it is split into exon 1, exon 2 and exon 3. Labeled “*** DRAMATIZATION ***”.]
What are the goals? • Find genes & delineate their boundaries • Investigate alternative splicing • Validate DoTS assemblies • Gain insight into sources of error • Assess whether anything is gained by assembling ESTs before aligning them
Potential “unsplicing” tools • BLAST • Good general-purpose local alignment tool • But not well-suited to this specific task • Special-purpose alignment tools • e.g., est2genome (Birney, Durbin), est_genome (Mott), sim4 (Florea et al.) • Perform well, but are very slow
Unsplicing: a first attempt • BLAST-sim4 heuristic algorithm • Employs a two-step approach • BLASTN: find candidate locations • sim4: perform precise alignments • Much faster than sim4 alone • But still slow for whole-genome analysis • Similar in spirit to Spidey (Wheelan et al.), which post-processes BLASTN results
Unsplicing: BLAT • BLAT: BLAST-Like Alignment Tool • Written by Jim Kent at UCSC • Indexes target db, not query sequence • Takes advantage of additional constraints • Adjusts exon boundaries using splice signals • Attempts to locate small exons • 500x speedup with no loss of sensitivity
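The "indexes target db, not query sequence" idea can be illustrated with a toy k-mer index. This is a minimal sketch and not BLAT's actual implementation: the function names `build_target_index` and `find_seed_hits`, the tiny `k`, and the example sequences are all assumptions made for illustration, and BLAT itself adds many refinements on top of seeding (hit clustering, extension, splice-site adjustment).

```python
from collections import defaultdict


def build_target_index(target: str, k: int = 11) -> dict:
    """Index the NON-overlapping k-mers of the target by start position.

    The index is built once for the target (the genome) and reused
    for every query, which is where the speedup comes from.
    """
    index = defaultdict(list)
    for pos in range(0, len(target) - k + 1, k):
        index[target[pos:pos + k]].append(pos)
    return index


def find_seed_hits(query: str, index: dict, k: int = 11) -> list:
    """Look up every OVERLAPPING k-mer of the query in the target index.

    Returns (query_pos, target_pos) pairs; each pair is only a seed
    that a real aligner would go on to extend and verify.
    """
    hits = []
    for qpos in range(len(query) - k + 1):
        for tpos in index.get(query[qpos:qpos + k], ()):
            hits.append((qpos, tpos))
    return hits


# Hypothetical toy data; k=4 only so the example stays readable.
target = "ACGTACGTTTGACCAGGATCCATGCAAGT"
index = build_target_index(target, k=4)
hits = find_seed_hits("GACCAGGA", index, k=4)
print(hits)  # [(2, 12)]
```

Indexing non-overlapping target k-mers keeps the index small, while scanning every overlapping query k-mer guarantees that any sufficiently long exact match still lands on an indexed word.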
Overview of alignment process • BLAT RefSeq mRNAs + DoTS sequences against respective genomes • Load alignments into database • Compute summary information • Including alignment “quality” • Merge selected alignments into “genes” • Eliminates redundancy in DoTS • Provides estimate of total gene number
BLAT Alignments: first step • Default parameters, repeats masked • All alignments covering >=10% of the query loaded into db • Summary information computed • e.g., max_query_gap, max_target_gap • polyA tails detected, 3’ and 5’ (!) • Alignment quality
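As a rough illustration of the summary step, the sketch below derives `max_query_gap` and `max_target_gap` from the block lists of a nucleotide PSL record (the `blockSizes`, `qStarts` and `tStarts` columns of BLAT's output). How the slides measure ">=10% of query" is not stated; here it is assumed to mean the summed block sizes divided by the query length. The function names and the example values are hypothetical.

```python
def parse_psl_list(field: str) -> list:
    """Parse a PSL comma-separated list such as '100,50,80,' into ints."""
    return [int(x) for x in field.split(",") if x]


def summarize_alignment(block_sizes, q_starts, t_starts, q_size,
                        min_query_fraction=0.10):
    """Compute gap summaries and the (assumed) load filter for one alignment."""
    max_query_gap = 0
    max_target_gap = 0
    for i in range(1, len(block_sizes)):
        prev_q_end = q_starts[i - 1] + block_sizes[i - 1]
        prev_t_end = t_starts[i - 1] + block_sizes[i - 1]
        # Unaligned query bases between consecutive blocks.
        max_query_gap = max(max_query_gap, q_starts[i] - prev_q_end)
        # Genomic bases skipped between blocks (candidate introns).
        max_target_gap = max(max_target_gap, t_starts[i] - prev_t_end)
    # ASSUMPTION: query coverage = aligned bases / query length.
    query_fraction = sum(block_sizes) / q_size
    return {
        "max_query_gap": max_query_gap,
        "max_target_gap": max_target_gap,
        "query_fraction": query_fraction,
        "loaded": query_fraction >= min_query_fraction,
    }


# Hypothetical three-block alignment of a 300 bp query.
summary = summarize_alignment(
    parse_psl_list("100,50,80,"),
    parse_psl_list("0,100,153,"),
    parse_psl_list("1000,2100,5150,"),
    q_size=300,
)
print(summary["max_query_gap"], summary["max_target_gap"], summary["loaded"])
```

A small `max_query_gap` together with large target gaps is the pattern expected for a cleanly spliced transcript: the query is aligned almost end to end while the genome is interrupted by introns.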
Alignment Quality • This results in many alignments • How to identify those that represent the actual location(s) of each transcript? • Assuming that: • The transcribed sequence is real • The corresponding genomic sequence(s) is/are accurate and complete • Use a heuristic approach
Defining Alignment Quality • (1) “Very good” • >= 95% average sequence identity • max_query_gap <= 5 bp • Both ends are consistent: • no more than 10 bp mismatch unless polyA • polyA rule cannot be used on both ends
Control experiment #1 • Compared: • “Very good” RefSeq alignments to hChr22/mChr5 • mRNA alignments in UCSC annotation database • FP: ~0; FN: ~18% and ~35% • (2) “Very good, but with gaps” • Same as “very good”, but mismatches are allowed if there is a sufficiently large genomic sequence gap (within 10X the mismatch length at the ends) • New false negative rates: ~15% and 13%
Control experiment #2 • RefSeqs that had “very good” alignments alone, but not when assembled with other sequences: • hChr22: 98/255 (38%) • mChr5: 109/271 (40%) • Mostly due to problems at ends of DoTS seqs. • (3) “Good” • Same as “very good w/ gaps” but allow: • max_query_gap <= 15 bp (vs. 5 bp) • Up to 50 bp of mismatch at each end (vs. 10 bp) • Reduces to 25/255 (~10%) and 33/271 (~12%)
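The quality classes defined over the last three slides can be sketched as a small classifier. This is an illustrative reading of the thresholds, not the authors' code: the `AlignmentSummary` fields and function names are hypothetical, and the "very good, but with gaps" exemption (class 2) is deliberately left out because the slides do not specify it precisely enough to implement.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlignmentSummary:
    percent_identity: float        # average sequence identity, in percent
    max_query_gap: int             # largest unaligned stretch inside the query (bp)
    start_mismatch: int            # unaligned query bases at the 5' end (bp)
    end_mismatch: int              # unaligned query bases at the 3' end (bp)
    start_is_polya: bool = False   # 5' unaligned region looks like a polyA/T tail
    end_is_polya: bool = False     # 3' unaligned region looks like a polyA tail


def ends_consistent(a: AlignmentSummary, max_end_mismatch: int) -> bool:
    """Both ends must align, allowing a polyA excuse on at most one end."""
    start_ok = a.start_mismatch <= max_end_mismatch
    end_ok = a.end_mismatch <= max_end_mismatch
    if start_ok and end_ok:
        return True
    if not start_ok and not end_ok:
        return False  # the polyA rule cannot be used on both ends
    return a.start_is_polya if not start_ok else a.end_is_polya


def classify(a: AlignmentSummary) -> Optional[str]:
    """Return 'very good', 'good', or None (class 2 is not modelled)."""
    if a.percent_identity < 95.0:
        return None
    if a.max_query_gap <= 5 and ends_consistent(a, 10):
        return "very good"
    if a.max_query_gap <= 15 and ends_consistent(a, 50):
        return "good"
    return None


# Hypothetical examples.
print(classify(AlignmentSummary(98.0, 3, 2, 40, end_is_polya=True)))  # very good
print(classify(AlignmentSummary(97.0, 12, 30, 0)))                    # good
print(classify(AlignmentSummary(90.0, 0, 0, 0)))                      # None
```

Checking the strictest class first means each alignment receives the best label it qualifies for, which matches the way the slides relax the criteria step by step.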
Alignment statistics: human • hDoTS (08/02) vs. human genome (NCBI 30) • Total DoTS sequences: 859,545 (~230,000) • Alignments loaded: 5,544,300 / 8,975,529
Alignment statistics: mouse • mDoTS (07/02) vs. mouse genome (MGSCv3) • Total DoTS sequences: 579,906 (~129,000) • Alignments loaded: 3,208,572 / 4,663,903
Merging adjacent/overlapping alignments into “genes” • Select BLAT alignments • Parameters: min. quality, min_target_gap • Merge overlapping alignments • Merge nearby alignments where an assembly in each has an EST from a common clone • Parameter: max distance (500 kb) • Merge nearby alignments • Parameter: max distance (75 bp) • Only merge alignments on the same strand • Identify genes with an intron of at least 15 bp
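The overlap and proximity merging steps can be sketched as a strand-aware interval merge. This is a simplified illustration under stated assumptions: alignments are reduced to hypothetical `(chrom, strand, start, end)` tuples, only the 75 bp "nearby" rule is modelled, and the clone-based merging (500 kb) and the intron filter are omitted.

```python
from collections import defaultdict


def merge_alignments(alignments, max_distance=75):
    """Merge same-strand alignments that overlap or lie within max_distance bp.

    alignments: iterable of (chrom, strand, start, end) tuples.
    Returns a list of merged (chrom, strand, start, end) "gene" spans.
    """
    by_locus = defaultdict(list)
    for chrom, strand, start, end in alignments:
        # Only alignments on the same chromosome AND strand may merge.
        by_locus[(chrom, strand)].append((start, end))

    genes = []
    for (chrom, strand), spans in sorted(by_locus.items()):
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start - cur_end <= max_distance:
                # Overlapping (negative distance) or close enough: extend.
                cur_end = max(cur_end, end)
            else:
                genes.append((chrom, strand, cur_start, cur_end))
                cur_start, cur_end = start, end
        genes.append((chrom, strand, cur_start, cur_end))
    return genes


# Hypothetical alignments on chr22.
alignments = [
    ("chr22", "+", 1000, 2000),
    ("chr22", "+", 1500, 2500),  # overlaps the first
    ("chr22", "+", 2550, 3000),  # 50 bp from the merged span, so it is merged
    ("chr22", "+", 4000, 5000),  # 1000 bp away, so it starts a new gene
    ("chr22", "-", 2600, 2900),  # opposite strand, so it is kept separate
]
genes = merge_alignments(alignments)
print(genes)
```

Sorting each strand's spans by start position lets a single left-to-right pass do the merging, so redundant DoTS sequences aligned to the same locus collapse into one span.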
Algorithm Calibration • Human chr22q (~34Mb) as test case • Sanger annotation release 2.3: 832 genes (341 gene, 118 gene_segment, 112 related, 109 predicted, 152 pseudogenes) • Focus on DiGeorge Critical Region • DGCR6 to ZNF74 (~ 1.6Mb) • Contains 24-33 genes based on literature (Sanger: 44 genes with 33 known) *Used DoTS 02/02 release vs Golden Path 12/01 release, and old BlatAlignment table (limited quality classes).
Known problems/issues • Incorrectly oriented DoTS assemblies • Distinguishing single-exon genes from genomic contaminants, antisense and/or functional non-coding RNAs • A large number of ESTs have no alignments at all [above 10% threshold] • Currently investigating why this is so…
Current and future work • Detailed assessment of results in 14Mb of mouse chr. 5 (CBIL + Bucan lab.) • Augment alignments with other sequence signals (Hatzigeorgiou lab.) • Incorporate alignments into DoTS build process from the outset
Acknowledgements • BLAT Alignments/Gene Merging • Yongchang Gan (see poster!) • Database of Transcribed Sequences (DoTS) • Brian Brunk, Steve Fischer, Deborah Pinney • Mouse Chr. 5 annotation project • Joan Mazzarelli • Maja Bucan lab. • Artemis Hatzigeorgiou lab. • Chris Stoeckert (PI, CBIL)
Is EST assembly still relevant? • Not every organism has a genome project • EST sequencing is still a relatively cheap way to survey a transcriptome • Though array-based approaches are also very powerful, if the sequence is known • Not every EST will necessarily align to the draft genome; may want to cluster the rest • Annotation component of DoTS is useful, regardless of the assembly method