From DoTS Assemblies to Genes via Genomic Alignment
• BLAT consensus sequences vs genomic
• Load alignments with 10% cutoff into GUS
• Compute alignment “quality”:
  • 1 = Very good
  • 2 = Very good with gaps
  • 3 = Good
  • 4 = Not so good
• Merge selected alignments into “genes”

BLATAlignmentQuality
• Very good (formerly “consistent”)
  • >= 95% identity (average)
  • max_query_gap <= 5
  • both ends consistent
  • no more than 10bp mismatch unless polyA
  • not polyA on both ends

BLATAlignmentQuality II
• Very good with gaps
  • same as very good, but internal and end mismatches allowed if there is a sufficiently large genomic sequence gap (within 10X mismatch length for ends)
• Good
  • same as very good, but with max_query_gap <= 15 (allow large internal gaps if there is a sufficiently large genomic sequence gap), and inconsistent ends allowed if unaligned_bases <= 50
• Not so good
  • everything else

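The quality rules on the two slides above can be sketched as a small classifier. This is only an illustration under stated assumptions, not the actual GUS code: the `Alignment` fields, the reading of "both ends consistent" as "each end has at most 10bp mismatch unless it is polyA, and polyA is not on both ends", and the precomputed flag `genomic_gap_explains_mismatch` (the slides do not say how a "sufficiently large genomic sequence gap" is detected) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    # All field names are hypothetical; they mirror the terms used on the slides.
    percent_identity: float
    max_query_gap: int
    end_mismatch_5p: int = 0   # unaligned/mismatched bases at the 5' end
    end_mismatch_3p: int = 0   # unaligned/mismatched bases at the 3' end
    polya_5p: bool = False     # mismatch at this end is a polyA stretch
    polya_3p: bool = False
    unaligned_bases: int = 0
    # Assumed to be computed elsewhere: True when internal and end mismatches
    # coincide with a sufficiently large gap in the genomic sequence.
    genomic_gap_explains_mismatch: bool = False


def ends_consistent(a: Alignment) -> bool:
    """Assumed reading: <= 10bp mismatch per end unless polyA, never polyA on both ends."""
    if a.polya_5p and a.polya_3p:
        return False
    ok_5p = a.end_mismatch_5p <= 10 or a.polya_5p
    ok_3p = a.end_mismatch_3p <= 10 or a.polya_3p
    return ok_5p and ok_3p


def quality(a: Alignment) -> int:
    """Return 1 (very good), 2 (very good with gaps), 3 (good) or 4 (not so good)."""
    if a.percent_identity < 95:
        return 4
    ends_ok = ends_consistent(a)
    if a.max_query_gap <= 5 and ends_ok:
        return 1
    both_polya = a.polya_5p and a.polya_3p
    if a.genomic_gap_explains_mismatch and not both_polya:
        return 2
    gap_ok = a.max_query_gap <= 15 or a.genomic_gap_explains_mismatch
    if gap_ok and (ends_ok or a.unaligned_bases <= 50):
        return 3
    return 4
```

For instance, an alignment with 98% identity and a 12bp query gap falls to "good" (3) in this sketch, and is promoted to "very good with gaps" (2) only when the gap flag is set.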
“Gene” creation algorithm
• Select BLAT alignments
  • Parameters: min quality, genomic region
• Merge overlapping alignments
• Merge nearby alignments with at least one EST sequence in each assembly from common clone
  • Parameter: max distance (default 20kb)
• Merge nearby alignments
  • Parameter: max distance (default 20bp)

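The merge steps above can be sketched as three left-to-right sweeps over alignments sorted by genomic start. This is a minimal illustration, not the actual implementation: the `Span` type, the function names, and the parameter names (`worst_quality`, `clone_dist`, `proximity_dist`) are hypothetical, coordinates are assumed inclusive and already restricted to one chromosome, strand and genomic region, and the clone and proximity steps only compare neighbouring spans, which is a simplification.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    # Hypothetical record: one alignment, or a "gene" built from several.
    start: int
    end: int
    assemblies: set
    clones: set = field(default_factory=set)  # clone ids of ESTs in the assemblies
    quality: int = 1                          # 1 = very good ... 4 = not so good


def merge_pass(spans, can_merge):
    """Sweep spans in start order, folding each into the previous one when allowed."""
    merged = []
    for s in sorted(spans, key=lambda s: (s.start, s.end)):
        if merged and can_merge(merged[-1], s):
            prev = merged[-1]
            prev.end = max(prev.end, s.end)
            prev.assemblies |= s.assemblies
            prev.clones |= s.clones
        else:
            # Copy so the caller's objects are never mutated.
            merged.append(Span(s.start, s.end, set(s.assemblies), set(s.clones), s.quality))
    return merged


def build_genes(alignments, worst_quality=1, clone_dist=20_000, proximity_dist=None):
    # Step 1: select alignments at or above the requested quality (lower code = better).
    selected = [a for a in alignments if a.quality <= worst_quality]
    # Step 2: merge overlapping alignments.
    genes = merge_pass(selected, lambda a, b: b.start <= a.end)
    # Step 3: merge nearby spans that share at least one clone.
    genes = merge_pass(
        genes, lambda a, b: b.start - a.end <= clone_dist and bool(a.clones & b.clones)
    )
    # Step 4 (optional): merge spans that are simply close together.
    if proximity_dist is not None:
        genes = merge_pass(genes, lambda a, b: b.start - a.end <= proximity_dist)
    return genes


example = [
    Span(100, 500, {1}, {"c1"}),
    Span(400, 900, {2}, {"c2"}),
    Span(5000, 6000, {3}, {"c2"}),
    Span(10000, 11000, {4}, {"c9"}),
    Span(11010, 12000, {5}, {"c8"}),
    Span(50000, 51000, {6}, {"c7"}, quality=4),
]
genes = build_genes(example)
```

In this example the first two spans merge by overlap, the third joins them through the shared clone `c2`, and the last quality-1 span is merged with its neighbour only when `proximity_dist` is set.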
Human Chromosome 22
• As test case to calibrate algorithm
• December 2001 Golden Path release (NCBI build 28?)
• Human DoTS February 2002 release (820965 consensus sequences)

SQL> select count(*)
     from blatalignment b, virtualsequence v
     where b.target_na_sequence_id = v.na_sequence_id
       and v.external_db_id = 4792 and v.chromosome = '22'
       and b.target_external_db_id = 4792 and b.query_table_id = 56
       and b.query_taxon_id = 8;

COUNT(*) = 129619

Focus on DiGeorge Critical Region
• DGCR6 to ZNF74 (~ 1.6Mb)
• Contains 24-44 genes based on literature (including latest Sanger annotation)
• Number of genes by our algorithm: 47
  • Input alignments: very good, multispan
  • Merge by overlap: on
  • Merge by clone: 20kb (default)
  • Merge by proximity: off

Choosing parameters

# DiGeorge Chromosome Region (DGCR6 - ZNF74, 1.6Mb)
# CBIL Gene Param*    Num CBIL*   Num Sanger*   Num Overlap*   Avg %overlap*
  qf=4, am, cm=10k    27/50       26/44         28             88.7 vs 71.3
  qf=4, am, cm=20k    24/47       26/44         27             81.4 vs 75.5
  qf=4, am, cm=50k    20/39       25/44         26             63.8 vs 77.6
  qf=6, am, cm=10k    26/69       29/44         30             77.7 vs 75.9
  qf=6, am, cm=20k    25/66       28/44         31             69.8 vs 80.5
  qf=6, am, cm=50k    17/54       24/44         25             53.0 vs 87.4

# Chr22 (Chr22q ~34M)
# CBIL Gene Param*    Num CBIL*   Num Sanger*   Num Overlap*   Avg %overlap*
  qf=4, am, cm=20k    335/737     352/829       383            70.7 vs 72.4
  qf=6, am, cm=20k    327/1074    377/829       399            64.9 vs 81.0

* qf: is quality filter for choosing Genomic Alignments of desired quality for gene boundary definition. 4: consistent and multi-span, 6: ok and multi-span
* am: is alignment overlap mediated merge of DoTS assemblies
* cm: clone information mediated merge of DoTS assemblies within specified distance
* i/j: j is total number of genes, and i is number of genes with overlap
* only overlaps of at least 5% of the genomic length of both genes are counted
* Avg %overlap: (1) same as above; (2) first number is w.r.t. CBIL gene, second Sanger.

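The comparison columns above could be computed along the following lines. This is a sketch under assumptions, since the slides do not spell out the exact procedure: genes are represented as hypothetical half-open `(start, end)` tuples on one chromosome, every gene pair meeting the 5% rule counts as one overlap, and the average percentages are taken over the counted pairs.

```python
def overlap_len(a, b):
    """Length of the shared region of two half-open (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def compare_gene_sets(cbil, sanger, min_frac=0.05):
    """Summarise overlaps between two gene sets, given as lists of (start, end)."""
    pairs = []  # (cbil index, sanger index, overlap length)
    for i, a in enumerate(cbil):
        for j, b in enumerate(sanger):
            ov = overlap_len(a, b)
            # Count a pair only if the overlap covers at least min_frac
            # of the genomic length of BOTH genes.
            if ov > 0 and ov >= min_frac * (a[1] - a[0]) and ov >= min_frac * (b[1] - b[0]):
                pairs.append((i, j, ov))
    n = len(pairs)
    pct_cbil = [100.0 * ov / (cbil[i][1] - cbil[i][0]) for i, _, ov in pairs]
    pct_sanger = [100.0 * ov / (sanger[j][1] - sanger[j][0]) for _, j, ov in pairs]
    return {
        "num_cbil": (len({i for i, _, _ in pairs}), len(cbil)),        # i/j
        "num_sanger": (len({j for _, j, _ in pairs}), len(sanger)),    # i/j
        "num_overlap": n,
        "avg_pct_cbil": sum(pct_cbil) / n if n else 0.0,
        "avg_pct_sanger": sum(pct_sanger) / n if n else 0.0,
    }


stats = compare_gene_sets(
    cbil=[(0, 100), (200, 300), (1000, 2000)],
    sanger=[(50, 150), (290, 1290)],
)
```

Here the pair `(200, 300)` and `(290, 1290)` shares 10 bases, which is 10% of the first gene but only 1% of the second, so it is not counted.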
Mouse Chromosome 5
• February 2002 Golden Path release (MIT Arachne build 3?)
• Mouse DoTS January 7, 2002 release (537403 consensus sequences)
• ENSEMBL/PHUSION assembly:
  • Known Ensembl Genes: 826
  • Novel Ensembl Genes: 448
  • Length: 151006098 bp

Focus on Mouse Chr5 proximal
• Telomere to Clock (1-83,965,868)
• UCSC RefSeqs: 178
• Number of genes by our algorithm: 449
  • Input alignments: very good, multispan
  • Merge by overlap: on
  • Merge by clone: 20kb (default)
  • Merge by proximity: off

In progress
• Revised BLATAlignment table
• Alignment of new releases of Human DoTS (Mouse already done)
• Alignments against Celera scaffolds
• Redo gene merge with new alignments: all good and above