UCSC Known Genes Version 3 Take 9
Known Gene History
• Initially based on Genie predictions constrained by BLAT mRNA alignments.
• David Kulp got busy at Affy.
• Switched to RefSeq.
• Jim got paranoid Riken RNAs would take over.
• Fan built KG 1.
• Mark got annoyed at low quality predictions.
• Fan & Mark built KG 2.
• Jim got annoyed at missing genes.
• KG 3.
• The perfect set … until KG 4.
Overall Pipeline
• Get alignments etc. from database.
• Remove antibody fragments.
• Clean alignments, project to genome.
• Cluster into splicing graph.
• Add EST, Exoniphy, OrthoSplice info.
• Walk unique transcripts out of graph.
• Assign coding regions (CDS) to transcripts.
• Classify into coding, antisense, noncoding.
• Remove weak transcripts.
• Assign accessions.
• Build gene-centric database tables.
Genbank & Alignment Issues
• Using global instead of local near-best alignment, with higher stringency.
• Including all Genbank RNA, not just mRNA.
• These changes are not yet reflected in the Genbank mRNA/RefSeq tracks.
• Collect data such as selenocysteine substitutions and alternative start codons from Genbank. These data are in the .ra files but not the SQL database.
Removing Antibody Var Regions
• Chromosomes 2, 14, 22 contain antibody regions.
• Thousands of transcripts for these in Genbank.
• Gaps are from genomic rearrangements, not splicing. Millions of possibilities.
• Identify regions by:
    • Searching for words like ‘immunoglobulin’ ‘variable’ to make initial set of Ab fragments.
    • Treat anything that overlaps these as Ab fragment too.
    • Cluster together putative Ab fragments.
    • Take 4 largest clusters as the 4 variable regions. (One is just a pseudogene of a real variable region.)
• Remove all alignments in Ab clusters.
• Replace with a single noncoding gene for each cluster near end of gene build.
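The identification steps above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline code: the alignment record fields (`name`, `chrom`, `start`, `end`, `description`), the function names, and the choice to measure cluster size by number of member alignments are all assumptions.

```python
KEYWORDS = ("immunoglobulin", "variable")  # seed words named on the slide


def overlaps(a, b):
    """True if two alignments share at least one base (half-open coordinates)."""
    return a["chrom"] == b["chrom"] and a["start"] < b["end"] and b["start"] < a["end"]


def find_antibody_clusters(alignments, n_largest=4):
    # Step 1: seed set from keyword matches in the (hypothetical) description field.
    seeds = [a for a in alignments
             if any(k in a["description"].lower() for k in KEYWORDS)]
    # Step 2: anything overlapping a seed is also treated as an Ab fragment.
    putative = [a for a in alignments if any(overlaps(a, s) for s in seeds)]
    # Step 3: cluster putative fragments by overlap with a sorted sweep.
    putative.sort(key=lambda a: (a["chrom"], a["start"]))
    clusters = []
    for a in putative:
        last = clusters[-1] if clusters else None
        if last and last["chrom"] == a["chrom"] and a["start"] < last["end"]:
            last["end"] = max(last["end"], a["end"])
            last["members"].append(a["name"])
        else:
            clusters.append({"chrom": a["chrom"], "start": a["start"],
                             "end": a["end"], "members": [a["name"]]})
    # Step 4: keep the largest clusters (size = member count, an assumption).
    clusters.sort(key=lambda c: len(c["members"]), reverse=True)
    return clusters[:n_largest]


example = [
    {"name": "a1", "chrom": "chr2", "start": 100, "end": 200,
     "description": "immunoglobulin kappa variable"},
    {"name": "a2", "chrom": "chr2", "start": 150, "end": 300,
     "description": "unknown mRNA"},
    {"name": "a3", "chrom": "chr2", "start": 290, "end": 400,
     "description": "hypothetical protein"},
    {"name": "a4", "chrom": "chr14", "start": 1000, "end": 1100,
     "description": "Ig heavy chain variable region"},
    {"name": "a5", "chrom": "chr5", "start": 50, "end": 80,
     "description": "actin"},
]
ab_clusters = find_antibody_clusters(example)
```

In this toy input, `a3` overlaps `a2` but no seed, so it is left out; whether the real build extends overlap transitively is not stated on the slide.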
Cleaning, projecting alignments
• BLAT sometimes leaves messy gappy ends.
• New heuristic:
    • For gaps of 6 bases or fewer on both mRNA and genome, just ignore the gap, filling in with genome if necessary.
    • Try to turn other gaps into introns, if they are not already, by wiggling one base on either side of the gap.
    • Break up alignments at remaining gaps that are not intronic. Intronic gaps are at least 16 bases, and have gt/ag or gc/ag ends.
    • After breaking up, throw away any pieces less than 18 bases long.
    • For refSeq mRNA only, join pieces back together after breaking up. Other mRNA can be joined by other transcripts (which may not suffer the same problems from polymorphism/error).
• Consider applying similar heuristic in mRNA track.
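A rough sketch of the gap rules above, using the thresholds from the slide (6, 16 and 18 bases). The function names and data layout are hypothetical, the one-base "wiggle" step is omitted, and treating an intron as a genome-only gap (no unaligned mRNA bases) is an assumption.

```python
MAX_SMALL_GAP = 6        # gaps this small on both sides are ignored
MIN_INTRON_BASES = 16    # minimum genomic gap to count as an intron
MIN_PIECE_BASES = 18     # pieces shorter than this are discarded
INTRON_ENDS = {("gt", "ag"), ("gc", "ag")}


def classify_gap(rna_gap, genome_gap, genome_gap_seq):
    """Classify one gap; genome_gap_seq is the lowercase genomic sequence inside it."""
    if rna_gap <= MAX_SMALL_GAP and genome_gap <= MAX_SMALL_GAP:
        return "ignore"
    ends = (genome_gap_seq[:2], genome_gap_seq[-2:])
    # Assumption: an intron has no unaligned mRNA bases across the gap.
    if rna_gap == 0 and genome_gap >= MIN_INTRON_BASES and ends in INTRON_ENDS:
        return "intron"
    return "break"


def split_pieces(block_sizes, gap_kinds):
    """Split an alignment at 'break' gaps and drop pieces under 18 aligned bases."""
    pieces, current = [], [block_sizes[0]]
    for size, kind in zip(block_sizes[1:], gap_kinds):
        if kind == "break":
            pieces.append(current)
            current = [size]
        else:
            current.append(size)
    pieces.append(current)
    return [p for p in pieces if sum(p) >= MIN_PIECE_BASES]


intron_seq = "gt" + "a" * 16 + "ag"   # 20-base gap with gt/ag ends
```

For example, `split_pieces([30, 10, 40], ["break", "break"])` leaves a 10-base middle piece, which falls under the 18-base cutoff and is dropped.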
Cluster into splicing graph
• Make graph where vertices are begin/ends of exons, edges are exons and introns.
• Multiple input transcripts can share vertices and edges.
• Went over this in some detail a few weeks back…
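As a small illustration of the graph described above, the sketch below builds vertices and edges from exon coordinates. The input layout (a dict of transcript name to exon list) and the function name are assumptions; the real build works on cleaned alignments per strand and chromosome.

```python
from collections import defaultdict


def build_splicing_graph(transcripts):
    """Build a splicing graph from {name: [(exon_start, exon_end), ...]}.

    Vertices are exon start/end positions; edges are exons and introns.
    Each edge remembers which input transcripts support it.
    """
    vertices = set()
    edges = defaultdict(set)  # (start, end, kind) -> supporting transcript names
    for name, exons in transcripts.items():
        exons = sorted(exons)
        for i, (start, end) in enumerate(exons):
            vertices.update((start, end))
            edges[(start, end, "exon")].add(name)
            if i + 1 < len(exons):
                # The intron runs from this exon's end to the next exon's start.
                edges[(end, exons[i + 1][0], "intron")].add(name)
    return vertices, dict(edges)


vertices, edges = build_splicing_graph({
    "rnaA": [(100, 200), (300, 400)],
    "rnaB": [(100, 200), (350, 400)],  # alternative 3' splice site
})
```

Here both transcripts share the first exon edge, while the alternative splice site adds one extra vertex and two extra edges.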
Adding Evidence to Graph
• Initial evidence for each edge comes from mRNAs.
• Add EST evidence if an edge is supported by at least 2 ESTs. (A single EST is likely the same clone as a single RNA…) Use only spliced ESTs.
• Make graph in mouse and map via chains. Reinforce orthologous human edges.
• Reinforce exon edges that overlap Exoniphy predictions.
• Evidence weight: refSeq 100, each mRNA 2, EST pair 1, mouse ortho 1, Exoniphy 1.
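The weights listed above can be combined per edge as in the sketch below. The slide does not say exactly how counts are combined, so two things here are assumptions: each refSeq contributes 100, and every full pair of spliced ESTs contributes 1. The function and parameter names are hypothetical.

```python
WEIGHTS = {"refSeq": 100, "mRNA": 2, "estPair": 1, "mouseOrtho": 1, "exoniphy": 1}


def edge_weight(ref_seqs=0, mrnas=0, spliced_ests=0,
                mouse_ortho=False, exoniphy=False):
    """Total evidence weight for one graph edge."""
    return (WEIGHTS["refSeq"] * ref_seqs
            + WEIGHTS["mRNA"] * mrnas
            # Assumption: a lone EST adds nothing; each pair of ESTs adds 1.
            + WEIGHTS["estPair"] * (spliced_ests // 2)
            + WEIGHTS["mouseOrtho"] * int(mouse_ortho)
            + WEIGHTS["exoniphy"] * int(exoniphy))
```

Under these assumptions a single mRNA scores 2, and one extra source of support (an EST pair, a mouse ortholog, or an Exoniphy overlap) lifts it to 3.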
Walking graph
• Weight of 3 on an edge is good enough.
• Rank input RNAs by whether they are refSeq, and by the number of good edges they use.
• If it has any good edges, output a transcript consisting of the edges used by the first RNA.
• Output a transcript based on the next RNA if the good edges it uses have not been output in the same order before.
• Continue until the last RNA is reached.
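One possible reading of the walk above, as a sketch. Several details are assumptions rather than the actual txWalk behavior: the record fields, emitting only the good edges of each RNA, and treating "output in same order before" as "appears as a contiguous run inside an already emitted transcript".

```python
GOOD_WEIGHT = 3  # threshold from the slide


def is_sub_run(short, long):
    """True if `short` occurs as a contiguous run inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def walk(rnas, edge_weights):
    """rnas: list of dicts with 'name', 'is_refseq', 'edges' (ordered edge ids)."""
    def good_edges(rna):
        return [e for e in rna["edges"] if edge_weights.get(e, 0) >= GOOD_WEIGHT]

    # refSeq first, then by number of good edges used (most first).
    ranked = sorted(rnas, key=lambda r: (r["is_refseq"], len(good_edges(r))),
                    reverse=True)
    emitted, output = [], []
    for rna in ranked:
        good = good_edges(rna)
        if not good:
            continue  # no well-supported edges at all
        if any(is_sub_run(good, prev) for prev in emitted):
            continue  # these edges were already output in this order
        emitted.append(good)
        output.append((rna["name"], good))
    return output


weights = {"e1": 100, "i1": 100, "e2": 100, "e3": 2, "i2": 4, "e4": 4}
rnas = [
    {"name": "AB_1", "is_refseq": False, "edges": ["e1", "i1", "e2"]},
    {"name": "NM_1", "is_refseq": True, "edges": ["e1", "i1", "e2"]},
    {"name": "AB_2", "is_refseq": False, "edges": ["e1", "i2", "e4"]},
    {"name": "AB_3", "is_refseq": False, "edges": ["e3"]},
]
walked = walk(rnas, weights)
```

In this toy input the refSeq is walked first, the mRNA duplicating it is skipped, the alternative isoform is emitted, and the RNA with only a weight-2 edge is dropped.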
Assigning Coding Regions
• Align UniProt and RefSeq proteins to txWalk transcripts. Mark regions they hit as possible CDS.
• Align Genbank/RefSeq RNAs to txWalk transcripts, map CDS from RNA records as possible CDS.
• Use bestorf program for another possible CDS.
• Assign an ad-hoc score to each possible CDS, choose highest scoring.
• More comparative genomics could really help here someday…
Classifying and Weeding
• The transcripts are classified into:
    • Coding: CDS survives trimming stage
    • Near-coding: overlap coding by at least 20 bases on same strand
    • Antisense: overlap coding by at least 20 bases on opposite strand
    • Noncoding: other transcripts
• Near-coding transcripts that show signs of incomplete splicing (retained intron, bleeds > 100 bases into intron) are removed.
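The four classes above could be assigned along the following lines. This is only a sketch: the record fields are hypothetical, overlap is measured exon-to-exon in half-open coordinates, and giving near-coding precedence over antisense when both apply is an assumption. The incomplete-splicing filter is not shown.

```python
MIN_OVERLAP = 20  # bases, from the slide


def exon_overlap(exons_a, exons_b):
    """Total bases shared between two exon lists (half-open intervals)."""
    total = 0
    for a_start, a_end in exons_a:
        for b_start, b_end in exons_b:
            total += max(0, min(a_end, b_end) - max(a_start, b_start))
    return total


def classify(tx, coding_txs):
    """Return 'coding', 'nearCoding', 'antisense' or 'noncoding' for one transcript."""
    if tx["has_cds"]:
        return "coding"
    label = "noncoding"
    for coding in coding_txs:
        if exon_overlap(tx["exons"], coding["exons"]) >= MIN_OVERLAP:
            if coding["strand"] == tx["strand"]:
                return "nearCoding"
            label = "antisense"
    return label


coding_set = [{"strand": "+", "exons": [(100, 200), (300, 400)], "has_cds": True}]
same_strand = {"strand": "+", "exons": [(150, 250)], "has_cds": False}
opposite = {"strand": "-", "exons": [(180, 260)], "has_cds": False}
barely = {"strand": "+", "exons": [(190, 260)], "has_cds": False}
```

The last example overlaps the coding transcript by only 10 bases, so it falls below the 20-base cutoff and stays noncoding.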
Assigning accessions
• Initial temporary identifiers of form <chrom>.<cluster>.<tx>.<accession>, e.g. chr22.210.5.AB209301
• Make permanent identifiers of form TX12345678.
• Find exact match in previous gene set, and reuse previous accession.
• Find compatible match (all introns alike) in old gene set, reuse accession, bump version.
• Make up new accession otherwise.
• Record genes in old set not in new.
• Version 7 -> version 9 mapping is actually a good test of this: 53025 exact, 4732 lost, 3736 new, 464 compatible.
• Move to UC1234567 format in v. 10?
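A sketch of the exact / compatible / new logic above. The data layout is hypothetical: a transcript is a hashable `(chrom, strand, exons)` tuple and the old set maps transcripts to `(accession, version)`. Starting new accessions at version 1, matching exact hits before compatible ones, and excluding single-exon transcripts from compatible matching are all assumptions.

```python
def introns(exons):
    """Intron coordinates implied by an ordered tuple of exons."""
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))


def assign_accessions(new_txs, old_set, next_number):
    """Return ([(accession, version, status), ...] in input order, lost accessions)."""
    results, used = {}, set()
    # Pass 1: exact matches keep accession and version.
    for tx in new_txs:
        if tx in old_set:
            acc, ver = old_set[tx]
            results[tx] = (acc, ver, "exact")
            used.add(acc)
    # Index the remaining old transcripts by their intron chain.
    compat = {}
    for (chrom, strand, exons), (acc, ver) in old_set.items():
        chain = introns(exons)
        if chain and acc not in used:
            compat.setdefault((chrom, strand, chain), (acc, ver))
    # Pass 2: compatible matches bump the version; everything else is new.
    for tx in new_txs:
        if tx in results:
            continue
        chrom, strand, exons = tx
        chain = introns(exons)
        hit = compat.pop((chrom, strand, chain), None) if chain else None
        if hit:
            acc, ver = hit
            results[tx] = (acc, ver + 1, "compatible")
            used.add(acc)
        else:
            results[tx] = (f"TX{next_number:08d}", 1, "new")
            next_number += 1
    lost = sorted(acc for acc, _ in old_set.values() if acc not in used)
    return [results[tx] for tx in new_txs], lost


old_set = {
    ("chr22", "+", ((100, 200), (300, 400))): ("TX00000001", 1),
    ("chr22", "+", ((500, 600), (700, 800))): ("TX00000002", 2),
    ("chr22", "-", ((900, 1000),)): ("TX00000003", 1),
}
new_txs = [
    ("chr22", "+", ((100, 200), (300, 400))),   # identical to an old transcript
    ("chr22", "+", ((480, 600), (700, 820))),   # same intron, different ends
    ("chr22", "+", ((2000, 2100),)),            # nothing like it in the old set
]
assigned, lost = assign_accessions(new_txs, old_set, next_number=4)
```

Counting the statuses in `assigned` plus the length of `lost` gives the same four tallies (exact, compatible, new, lost) quoted for the version 7 to version 9 mapping.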
Building gene-centric tables
• mmBlastTab, rnBlastTab etc. homolog tables. Blastp best plus syntenic weeding.
• kgXref and knownToXxx tables to relate gene to other databases and tables.
• kgAlias table to help search on gene names.
• gnfAtlas2Distance to measure expression similarity between genes for Gene Sorter, plus 3 other expression distance tables.
• humanVidalP2P and humanWankerP2P protein network distance tables.
• knownCanonical/knownIsoform tables to help people selectively view alt-splicing.
• pbXXX tables for proteome browser.
• In all about 10 hours of compute and indexing.
The Plan
• Next week
    • Test preliminary integration on hg18a.
    • Resolve issues with proteome browser.
    • Tinker on take 10, maybe take 11.
• Week after
    • Integration of final gene build into hg18a.
    • Move hg18.knownGenes to hg18.knownGenesOld.
    • Swap hg18a tables into hg18.
• Coming months
    • Continue to improve gene build.
    • Add new information from build into details pages.
    • Allow user filtering of which genes are shown.
    • Allow selection by names as well as IDs in table browser.
    • Present at Cold Spring Harbor. Write up paper.