ASTD ‘Alternative Splicing and Transcript Diversity database’
What/who are we?
• Firstly, AltExtron
• Secondly, ASD (Alternative Splicing Database) and the AltSplice pipeline
  • A database of alternative splice events and the resulting isoform splice patterns of genes from human and other model species.
• Thirdly, for grant purposes, ATD (Alternative Transcript Diversity database) and the AltTrans pipeline
  • Addresses the formation of transcript isoforms on a genome-wide scale by creating a value-added database of full-length alternative transcripts from human and other model species.
• We also host the AEdb database of manual annotations
• ASD and ATD have been blended into one pipeline, so now we are ASTD, the Alternative Splicing and Transcript Diversity database: www.ebi.ac.uk/astd
Pipeline in a nutshell
1. Ensembl gene slices & EMBL EST/mRNA/HTC/HInv download
2. Immunoglobulin filtering (Blast)
3. Redundant gene filtering (Blat)
4. Genes vs EST/mRNA alignment (Blast)
5. HSP collection
6. Intron/exon delineation
7. Splice patterns delineation
8. Events prediction
9. Data generation
Additional pipelines shown in the diagram: Poly(A), TSS, Peptide, SNP, Conservation
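As a rough illustration of steps 5 and 6, the sketch below treats the gaps between consecutive aligned blocks as candidate introns and checks each one for the canonical GT...AG dinucleotides. It is a toy example under stated assumptions, not the ASTD implementation: the function names, the 1-based inclusive coordinates, the forward-strand-only check and the example sequence are all hypothetical.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # hypothetical convention: 1-based, inclusive genomic coordinates


def delineate_introns(blocks: List[Block]) -> List[Block]:
    """Return the gaps between consecutive aligned blocks as candidate introns."""
    ordered = sorted(blocks)
    introns = []
    for (_, left_end), (right_start, _) in zip(ordered, ordered[1:]):
        # Keep only real gaps; adjacent or overlapping blocks yield no intron.
        if right_start > left_end + 1:
            introns.append((left_end + 1, right_start - 1))
    return introns


def has_consensus_sites(genome: str, intron: Block) -> bool:
    """True if the intron starts with GT and ends with AG (forward strand only)."""
    start, end = intron
    seq = genome[start - 1:end].upper()
    return len(seq) >= 4 and seq.startswith("GT") and seq.endswith("AG")


# Toy gene slice: two aligned blocks separated by a GT...AG intron.
genome = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
blocks = [(1, 6), (19, 24)]
introns = delineate_introns(blocks)
consensus = [has_consensus_sites(genome, intron) for intron in introns]
```

A real pipeline would also have to handle the reverse strand and imprecise alignment boundaries, which this sketch ignores.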
Limitations of the pipeline …
• Pipeline defines consensus splice sites
• True biology is removed:
  • Dicistronic transcripts
  • Nested genes
  • Single-exon genes
  • Small exons
  • Large introns
Manual annotation would resolve these issues.
Improvements …
• New, user-friendly web interfaces
• New database schema that is normalised, extendable and maintainable
• Pipeline improvements: some steps are now automated, bugs have been corrected, and Blat replaces Blast for filtering redundant genes
• Database allows external features (Ensembl and VEGA annotations) to be included for comparison with our transcripts
• Schema allows export of data in standard formats: GTF2, GFF3, EMBL flat file, FASTA and Excel spreadsheet
• Transcripts for the complete genome, not restricted to those with alternative splice events
• Introduction of unique identifiers
• Addition of datasets as input to the pipeline: HTC and HInv
• Extension of 5’ and 3’ UTRs to capture more TSSs and poly(A) sites
• Annotation of TSSs (by aligning 5’ capped mRNAs from human and mouse to transcripts) and poly(A) sites to generate full-length transcripts
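To make the GFF3 export concrete, here is a minimal sketch of how one feature could be serialised into the nine tab-separated GFF3 columns (seqid, source, type, start, end, score, strand, phase, attributes). It does not reproduce the actual ASTD export: the function name, the "ASTD" source label, the identifier "tx1" and the coordinates are hypothetical placeholders.

```python
from urllib.parse import quote


def gff3_line(seqid, source, feature_type, start, end, strand,
              attributes, score=".", phase="."):
    """Format one feature as a GFF3 line with nine tab-separated columns."""
    # Percent-encode attribute values so that ';', '=', '&', ',' and tabs
    # cannot break the column 9 syntax; spaces and ':' are left readable.
    attr_text = ";".join(
        f"{key}={quote(str(value), safe=' :')}"
        for key, value in attributes.items()
    )
    columns = [seqid, source, feature_type, str(start), str(end),
               score, strand, phase, attr_text]
    return "\t".join(columns)


# Hypothetical transcript record; identifiers and coordinates are invented.
line = gff3_line("chr1", "ASTD", "mRNA", 1000, 5000, "+",
                 {"ID": "tx1", "Name": "example transcript"})
```

A complete file would start with a `##gff-version 3` header line and link exons to their transcript through a `Parent` attribute.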
www.ebi.ac.uk/astd - Query tools
Three query tools are available to retrieve entries:
• Simple text search on the main page
• Genome browsing
• Advanced search
Gene information
Splice event … 1
Splice event … 2
Peptide information
Statistics
• Human
  • Number of genes with an ASTD transcript: 16715
  • Number of genes with an ASTD transcription_start_site: 4936
  • Number of genes with an ASTD polyA_site: 15376
  • Number of genes with an ASTD splicing event: 11316
  • Number of genes with multiple ASTD transcripts: 14101
  • Proportion of genes undergoing alternative splicing: 68 %
  • Proportion of genes undergoing alternative polyadenylation: 92 %
  • Proportion of genes undergoing alternative transcription_start_sites: 30 %
• Mouse
  • Number of genes with an ASTD transcript: 16491
  • Number of genes with an ASTD transcription_start_site: 948
  • Number of genes with an ASTD polyA_site: 13556
  • Number of genes with an ASTD splicing event: 9474
  • Number of genes with multiple ASTD transcripts: 13028
  • Proportion of genes undergoing alternative splicing: 57 %
  • Proportion of genes undergoing alternative polyadenylation: 82 %
  • Proportion of genes undergoing alternative transcription_start_sites: 6 %
• Rat
  • Number of genes with an ASTD transcript: 10424
  • Number of genes with an ASTD transcription_start_site: 503
  • Number of genes with an ASTD polyA_site: 8842
  • Number of genes with an ASTD splicing event: 2865
  • Number of genes with multiple ASTD transcripts: 6344
  • Proportion of genes undergoing alternative splicing: 27 %
  • Proportion of genes undergoing alternative polyadenylation: 85 %
  • Proportion of genes undergoing alternative transcription_start_sites: 5 %
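The slide does not say how the proportions were derived. One reading that fits the figures is that each percentage is the number of genes with the relevant feature divided by the number of genes with an ASTD transcript, rounded to the nearest whole percent. The sketch below checks that reading against the counts above; the choice of denominator is an assumption, and the dictionary keys are hypothetical labels.

```python
# Gene counts copied from the statistics above.
counts = {
    "human": {"transcript": 16715, "tss": 4936, "polyA": 15376, "splicing": 11316},
    "mouse": {"transcript": 16491, "tss": 948, "polyA": 13556, "splicing": 9474},
    "rat": {"transcript": 10424, "tss": 503, "polyA": 8842, "splicing": 2865},
}


def percent(part: int, whole: int) -> int:
    """Share of `whole` represented by `part`, rounded to a whole percent."""
    return round(100 * part / whole)


# Assumed denominator: genes with at least one ASTD transcript.
computed = {
    species: (
        percent(c["splicing"], c["transcript"]),  # alternative splicing
        percent(c["polyA"], c["transcript"]),     # alternative polyadenylation
        percent(c["tss"], c["transcript"]),       # alternative TSS
    )
    for species, c in counts.items()
}

for species, (splicing, poly_a, tss) in computed.items():
    print(f"{species}: splicing {splicing} %, polyA {poly_a} %, TSS {tss} %")
```

Under this assumption the computed values match the reported percentages for all three species.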
Controlled vocabularies/ontologies
• GO
• SOFA
• eVOC
• Splice event ontology
• MeSH terms
Future … 1
• Addition of new species
• Experimental validation of transcript structure and alternative poly(A) sites
• Use of EMBL CDS as another source of alignments to the genome
• More frequent releases: every 3 months
• Addition of regulatory motifs: ESS, ESE, ISS and ISE
• microRNA target sites from the EURASNET NoE (University of Basel)
Future … 2
• Introduction of unique identifiers means:
  • Addition as xrefs in EMBL, so transcripts in the INSDC can be grouped into one gene
  • Addition into UniParc, so translations can be linked to UniProt IsoIds and again grouped as variants of one gene
  • UniParc translations also undergo a full InterPro scan plus TM and SignalP predictions, so the data are precomputed rather than generated on the fly
Future … 3
• The EBI sequence database group and Ensembl have merged, forming the Hinxton Sequencing Forum (HSF)
• The outcome is that ASTD will be the vehicle to augment the Ensembl transcript views
• Full-length transcripts with TSS, splice events and poly(A)
• Definition of the ‘major transcript set’ using annotation of features to transcripts, e.g. expression state, exon array and splice junction array evidence
• VEGA/Havana annotations also included
• Time scale: within 2 years
Acknowledgements
• The ASTD Team:
  • Gautier Koscielny
  • Vincent Le Texier
  • Eleanor Whitfield
  • Chellapa Gopalakrishnan
  • Vasudev Kumanduri
• Sequence Database Group and External Services
• ASD consortium (Stefan Stamm for AEdb)
• ATD consortium (Daniel Gautheret for AltPAS)
• EURASNET consortium
• The ASTD project at EBI is supported by a grant from the EC: Eurasnet Network of Excellence (LSHG-CT-2005-518238).