UCSC Known Genes Version 3 Take 9

Known Gene History
• Initially based on Genie predictions constrained by BLAT mRNA alignments.
• David Kulp got busy at Affy.
• Switched to RefSeq.
• Jim got paranoid Riken RNAs would take over.
• Fan built KG 1.
• Mark got annoyed at low quality predictions.
• Fan & Mark built KG 2.
• Jim got annoyed at missing genes.
• KG 3.
• The perfect set … until KG 4.

Overall Pipeline
• Get alignments etc. from database.
• Remove antibody fragments.
• Clean alignments, project to genome.
• Cluster into splicing graph.
• Add EST, Exoniphy, OrthoSplice info.
• Walk unique transcripts out of graph.
• Assign coding regions (CDS) to transcripts.
• Classify into coding, antisense, noncoding.
• Remove weak transcripts.
• Assign accessions.
• Build gene-centric database tables.

Genbank & Alignment Issues
• Using global instead of local near-best alignment, also higher stringency.
• Including all Genbank RNA, not just mRNA.
• These changes not yet reflected in Genbank mRNA/RefSeq tracks.
• Collect data such as selenocysteine substitutions and alternative start codons from Genbank. These data are in the .ra files but not the SQL database.

Removing Antibody Var Regions
• Chromosomes 2, 14, 22 contain antibody regions.
• Thousands of transcripts for these in Genbank.
• Gaps are from genomic rearrangements, not splicing. Millions of possibilities.
• Identify regions by:
  • Searching for words like ‘immunoglobulin’ and ‘variable’ to make initial set of Ab fragments.
  • Treat anything that overlaps these as Ab fragment too.
  • Cluster together putative Ab fragments.
  • Take 4 largest clusters as the 4 variable regions. (One is just a pseudogene of a real variable region.)
• Remove all alignments in Ab clusters.
• Replace with a single noncoding gene for each cluster near end of gene build.
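
The region-finding steps can be sketched roughly as follows. This is an illustrative Python sketch, not the pipeline's code: the record fields (`name`, `chrom`, `start`, `end`, `description`), the keyword pattern, and the choice to rank clusters by member count (rather than genomic span) are all assumptions.

```python
import re

# Assumed keyword list; the slide only says "words like" these two.
AB_WORDS = re.compile(r"immunoglobulin|variable", re.IGNORECASE)


def find_ab_clusters(alignments, n_regions=4):
    """Return the n_regions largest clusters of putative antibody fragments.

    alignments: list of dicts with hypothetical keys 'name', 'chrom',
    'start', 'end' (half-open coordinates) and 'description'.
    """
    # Step 1: the initial set comes from a keyword search of the descriptions.
    seeds = [a for a in alignments if AB_WORDS.search(a["description"])]

    # Step 2: anything overlapping a seed is treated as an Ab fragment too.
    putative = [
        a for a in alignments
        if any(a["chrom"] == s["chrom"]
               and a["start"] < s["end"] and s["start"] < a["end"]
               for s in seeds)
    ]

    # Step 3: merge overlapping putative fragments into clusters.
    clusters = []
    for a in sorted(putative, key=lambda a: (a["chrom"], a["start"])):
        last = clusters[-1] if clusters else None
        if last and last["chrom"] == a["chrom"] and a["start"] < last["end"]:
            last["end"] = max(last["end"], a["end"])
            last["members"].append(a["name"])
        else:
            clusters.append({"chrom": a["chrom"], "start": a["start"],
                             "end": a["end"], "members": [a["name"]]})

    # Step 4: keep the largest clusters (size = member count, an assumption).
    clusters.sort(key=lambda c: len(c["members"]), reverse=True)
    return clusters[:n_regions]
```

Every alignment whose name appears in a returned cluster would then be dropped from the build and the cluster later replaced by one noncoding gene.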

Chr22 Ab Region (lambda light chain)

Cleaning, projecting alignments
• BLAT sometimes leaves messy gappy ends.
• New heuristic:
  • For gaps of 6 bases or fewer on both mRNA and genome, just ignore gap, filling in with genome if necessary.
  • Try to turn other gaps into introns if they are not already by wiggling one base on either side of gap.
  • Break up alignments at remaining gaps that are not intronic. Intronic gaps are at least 16 bases, and have gt/ag or gc/ag ends.
  • After break up throw away any pieces less than 18 bases long.
  • For RefSeq mRNA only, join pieces back together after breaking up. Other mRNA can be joined by other transcripts (which may not suffer the same problems from polymorphism/error).
• Consider applying similar heuristic in mRNA track.
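
A minimal sketch of the gap rules, using the thresholds from the slide (6, 16, 18 bases). The function names and the input representation are hypothetical. It assumes the gap sequence is given on the transcript's strand, that an intron has no gap on the mRNA side, and that piece length means aligned bases; the one-base "wiggle" and the RefSeq re-joining steps are left out.

```python
MAX_SMALL_GAP = 6    # gaps this small on both sides are ignored
MIN_INTRON = 16      # minimum genomic size of an intronic gap
MIN_PIECE = 18       # pieces shorter than this are thrown away
INTRON_ENDS = {("gt", "ag"), ("gc", "ag")}


def classify_gap(rna_gap, genome_gap, genome_gap_seq):
    """Label one alignment gap as 'ignore', 'intron' or 'break'.

    rna_gap / genome_gap: gap sizes in bases on the mRNA and the genome.
    genome_gap_seq: genomic sequence spanning the gap (transcript strand).
    """
    if rna_gap <= MAX_SMALL_GAP and genome_gap <= MAX_SMALL_GAP:
        return "ignore"
    seq = genome_gap_seq.lower()
    # Assumption: a true intron leaves no unaligned bases on the mRNA side.
    if (rna_gap == 0 and genome_gap >= MIN_INTRON
            and (seq[:2], seq[-2:]) in INTRON_ENDS):
        return "intron"
    return "break"


def split_at_breaks(block_sizes, gap_kinds):
    """Split an alignment at 'break' gaps and drop short pieces.

    block_sizes: aligned bases in each block, in order.
    gap_kinds[i]: label of the gap between block i and block i + 1.
    Returns a list of pieces, each a list of block sizes.
    """
    pieces, current = [], [block_sizes[0]]
    for size, kind in zip(block_sizes[1:], gap_kinds):
        if kind == "break":
            pieces.append(current)
            current = [size]
        else:
            current.append(size)
    pieces.append(current)
    return [p for p in pieces if sum(p) >= MIN_PIECE]
```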

Cleaning and projecting

Cluster into splicing graph
• Make graph where vertices are begin/ends of exons, edges are exons and introns.
• Multiple input transcripts can share vertices and edges.
• Went over this in some detail a few weeks back…
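
As a reminder of the data structure, here is a small Python sketch of such a graph. The function name and the representation (integer coordinates as vertices, `(start, end, kind)` tuples as edge keys) are assumptions made for illustration; a single chromosome and strand are assumed.

```python
from collections import defaultdict


def build_splicing_graph(transcripts):
    """Build a toy splicing graph.

    transcripts: dict mapping a transcript name to its exons, a sorted list
    of (start, end) pairs.
    Returns (vertices, edges): vertices is a set of exon boundary
    coordinates, edges maps (start, end, kind) to the set of transcript
    names that use that edge, where kind is 'exon' or 'intron'.
    """
    vertices = set()
    edges = defaultdict(set)
    for name, exons in transcripts.items():
        for i, (start, end) in enumerate(exons):
            vertices.update((start, end))
            edges[(start, end, "exon")].add(name)
            if i + 1 < len(exons):
                # The intron runs from this exon's end to the next exon's start.
                edges[(end, exons[i + 1][0], "intron")].add(name)
    return vertices, dict(edges)
```

Two transcripts that differ only in their last exon end share every other vertex and edge, which is how the graph collapses redundant input.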

Splicing graph and txWalk

Adding Evidence to Graph
• Initial evidence for each edge comes from mRNAs.
• An edge gains evidence if it is supported by at least 2 ESTs. (A single EST is likely the same clone as a single RNA…) Only spliced ESTs are used.
• Make graph in mouse and map via chains. Reinforce orthologous human edges.
• Reinforce exon edges that overlap Exoniphy predictions.
• Evidence weight: RefSeq 100, each mRNA 2, EST pair 1, mouse ortho 1, Exoniphy 1.
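
The weights can be combined into a per-edge score along these lines. The slide gives only the numbers, so the details here are assumptions: RefSeq support is counted once, the EST point is granted once when at least two spliced ESTs support the edge, and the function and parameter names are made up for the example.

```python
def edge_weight(has_refseq, mrna_count, spliced_est_count,
                has_mouse_ortholog, overlaps_exoniphy):
    """Sum the evidence weights listed on the slide for a single edge."""
    weight = 0
    if has_refseq:
        weight += 100               # RefSeq: 100 (assumed to count once)
    weight += 2 * mrna_count        # each mRNA: 2
    if spliced_est_count >= 2:
        weight += 1                 # EST pair: 1 (assumed to count once)
    if has_mouse_ortholog:
        weight += 1                 # orthologous mouse edge: 1
    if overlaps_exoniphy:
        weight += 1                 # Exoniphy overlap: 1
    return weight
```

Under these assumptions a lone mRNA scores 2, and a second mRNA or any one extra line of evidence lifts the edge to 3 or more.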

Walking graph
• Weight of 3 on an edge is good enough.
• Rank input RNA by whether RefSeq, and number of good edges they use.
• If any good edges, output a transcript consisting of the edges used by the first RNA.
• Output transcript based on next RNA if the good edges it uses have not been output in same order before.
• Continue until the last RNA is reached.
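
One way to read the walk is sketched below. The input layout and names are hypothetical, and "not been output in same order before" is interpreted as "the RNA's good edges are not a contiguous run inside an already-output transcript"; the real txWalk may apply a different test.

```python
GOOD_WEIGHT = 3  # threshold from the slide


def is_subpath(path, longer):
    """True if path occurs as a contiguous run inside longer."""
    n = len(path)
    return any(longer[i:i + n] == path for i in range(len(longer) - n + 1))


def walk_graph(rnas, edge_weights):
    """Emit one transcript per RNA that contributes a new path of good edges.

    rnas: list of dicts with hypothetical keys 'name', 'is_refseq' and
    'edges' (the ordered edge keys the RNA uses).
    edge_weights: dict mapping edge key to evidence weight.
    Returns a list of (source RNA name, tuple of good edges).
    """
    def good_edges(rna):
        return tuple(e for e in rna["edges"]
                     if edge_weights.get(e, 0) >= GOOD_WEIGHT)

    # RefSeq first, then RNAs using the most good edges.
    ranked = sorted(rnas,
                    key=lambda r: (r["is_refseq"], len(good_edges(r))),
                    reverse=True)
    output = []
    for rna in ranked:
        good = good_edges(rna)
        if good and not any(is_subpath(good, prev) for _, prev in output):
            output.append((rna["name"], good))
    return output
```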

Evidence, Walk, AltSplice

Assigning Coding Regions
• Align UniProt and RefSeq proteins to txWalk transcripts. Mark regions they hit as possible CDS.
• Align Genbank/RefSeq RNAs to txWalk transcripts, map CDS from RNA records as possible CDS.
• Use bestorf program for another possible CDS.
• Assign an ad-hoc score to each possible CDS, choose highest scoring.
• More comparative genomics could really help here someday…

CDS Mapping, Filtering

Classifying and Weeding
• The transcripts are classified into:
  • Coding: CDS survives trimming stage.
  • Near-coding: overlap coding by at least 20 bases on same strand.
  • Antisense: overlap coding by at least 20 bases on opposite strand.
  • Noncoding: other transcripts.
• Near-coding transcripts that show signs of incomplete splicing (retained intron, bleeds > 100 bases into intron) are removed.
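
The four classes can be expressed as a short decision function. This sketch assumes overlap is measured exon against exon on one chromosome, and that a same-strand overlap takes precedence when a transcript overlaps coding transcripts on both strands; the dict layout and names are made up for the example, and the incomplete-splicing filter is not shown.

```python
MIN_OVERLAP = 20  # bases, from the slide


def overlap_bases(exons_a, exons_b):
    """Total bases shared between two exon lists of (start, end) pairs."""
    return sum(max(0, min(ea, eb) - max(sa, sb))
               for sa, ea in exons_a
               for sb, eb in exons_b)


def classify(tx, coding_txs):
    """Classify tx as 'coding', 'near-coding', 'antisense' or 'noncoding'.

    tx and each entry of coding_txs are dicts with hypothetical keys
    'strand', 'exons' and 'has_cds'.
    """
    if tx["has_cds"]:
        return "coding"
    label = "noncoding"
    for coding in coding_txs:
        if overlap_bases(tx["exons"], coding["exons"]) >= MIN_OVERLAP:
            if coding["strand"] == tx["strand"]:
                return "near-coding"   # same-strand overlap wins (assumption)
            label = "antisense"
    return label
```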

Assigning accessions
• Initial temporary identifiers of form <chrom>.<cluster>.<tx>.<accession>, e.g. chr22.210.5.AB209301.
• Make permanent identifiers of form TX12345678.
• Find exact match in previous gene set, and reuse previous accession.
• Find compatible match (all introns alike) in old gene set, reuse accession, bump version.
• Make up new accession otherwise.
• Record genes in old set not in new.
• Version 7 -> version 9 mapping actually a good test of this: 53025 exact, 4732 lost, 3736 new, 464 compatible.
• Move to UC1234567 format in v. 10?
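
The matching logic might look something like this. It is a sketch under several assumptions: transcripts are compared as exon tuples on one chromosome and strand, versions are written as a `.N` suffix (the slide does not give the version syntax), single-exon transcripts can only match exactly, and the case where several new transcripts match the same old one is not handled.

```python
def introns(exons):
    """Intron (start, end) pairs implied by a sorted exon tuple."""
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))


def assign_accessions(new_txs, old_set, next_number):
    """Carry accessions forward from an old gene set.

    new_txs: list of exon lists, each a list of (start, end) pairs.
    old_set: dict mapping 'TXnnnnnnnn.v' accession to a tuple of exons.
    next_number: first unused accession number.
    Returns (accessions for new_txs in order, sorted lost old accessions).
    """
    by_exons = {exons: acc for acc, exons in old_set.items()}
    by_introns = {introns(exons): acc
                  for acc, exons in old_set.items() if len(exons) > 1}
    assigned, reused = [], set()
    for exons in new_txs:
        exons = tuple(exons)
        if exons in by_exons:
            # Exact match: reuse accession and version unchanged.
            acc = by_exons[exons]
            reused.add(acc)
        elif len(exons) > 1 and introns(exons) in by_introns:
            # Compatible match (all introns alike): reuse and bump version.
            old_acc = by_introns[introns(exons)]
            reused.add(old_acc)
            base, version = old_acc.split(".")
            acc = f"{base}.{int(version) + 1}"
        else:
            # No match: make up a new accession.
            acc = f"TX{next_number:08d}.1"
            next_number += 1
        assigned.append(acc)
    lost = sorted(set(old_set) - reused)
    return assigned, lost
```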

Building gene-centric tables
• mmBlastTab, rnBlastTab etc. homolog tables. Blastp best plus syntenic weeding.
• kgXref and knownToXxx tables to relate gene to other databases and tables.
• kgAlias table to help search on gene names.
• gnfAtlas2Distance to measure expression similarity between genes for Gene Sorter. 3 other expression distance tables.
• humanVidalP2P and humanWankerP2P protein network distance tables.
• knownCanonical/knownIsoform tables to help people selectively view alt-splicing.
• pbXXX tables for proteome browser.
• In all about 10 hours of compute and indexing.

The Plan
• Next week
  • Test preliminary integration on hg18a.
  • Resolve issues with proteome browser.
  • Tinker on take 10, maybe take 11.
• Week after
  • Integration of final gene build into hg18a.
  • Move hg18.knownGenes to hg18.knownGenesOld.
  • Swap hg18a tables into hg18.
• Coming months
  • Continue to improve gene build.
  • Add new information from build into details pages.
  • Allow user filtering of which genes are shown.
  • Allow selection by names as well as IDs in table browser.
  • Present at Cold Spring Harbor. Write up paper.
