Midterm Project
Database Schema • GeneIDTable • Information about “gene” and corresponding “protein” • gene_id, gene_name, gene_seq, protein_id, protein_name, protein_seq, gene_type • gene_id – primary key (type varchar(255)) • gene_type type varchar(255) • All other entries are of type longtext
Database Schema • GeneFuncTable • Information about “gene functions” • gene_id, gene_fun, comment • gene_id – foreign key (type varchar(255)) • All other entries are of type longtext
Database Schema • ProteinFuncTable • Information about “protein functions” • protein_id, protein_fun, comment • All entries are of type longtext
Database Schema • PathwayFuncTable • Information about “pathway functions” • pathway_id, pathway_name, pathway_fun, pathway_loc, comment • All entries are of type longtext
Database Schema • PathwayTable • Information about “gene pathway association” • gene_id, pathway_id • gene_id type varchar(255) • pathway_id type longtext
Database Schema • BiologicalProcessTable • Gene Ontology related table • Information about “biological processes” of a particular gene • gene_id, GO_num, biological_process • gene_id – foreign key (type varchar(255)) • All other entries are of type longtext
Database Schema • CellularComponentTable • Gene Ontology related table • Information about “cellular component” • gene_id, GO_num, cellular_component • gene_id – foreign key (type varchar(255)) • All other entries are of type longtext
Database Schema • MolecularFunctionTable • Gene Ontology related table • Information about “molecular functions” • gene_id, GO_num, molecular_function • gene_id – foreign key (type varchar(255)) • All other entries are of type longtext
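The table definitions above can be written out as SQL. The following is a minimal sketch, not the course's own setup script: it assumes a MySQL-style dialect (suggested by the longtext type) and assumes that each gene_id foreign key refers to GeneIDTable. The class and method names are hypothetical. Only two tables are shown; the others follow the same pattern.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Sketch: builds CREATE TABLE statements for two tables of the schema. */
final class SchemaSketch {

    /** Joins column definitions and table constraints into one statement. */
    static String createTable(String table, Map<String, String> columns, String... constraints) {
        StringJoiner body = new StringJoiner(", ", "CREATE TABLE " + table + " (", ")");
        for (Map.Entry<String, String> column : columns.entrySet()) {
            body.add(column.getKey() + " " + column.getValue());
        }
        for (String constraint : constraints) {
            body.add(constraint);
        }
        return body.toString();
    }

    /** GeneIDTable: gene_id is the primary key; gene_type is also varchar(255). */
    static String geneIdTable() {
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("gene_id", "VARCHAR(255)");
        columns.put("gene_name", "LONGTEXT");
        columns.put("gene_seq", "LONGTEXT");
        columns.put("protein_id", "LONGTEXT");
        columns.put("protein_name", "LONGTEXT");
        columns.put("protein_seq", "LONGTEXT");
        columns.put("gene_type", "VARCHAR(255)");
        return createTable("GeneIDTable", columns, "PRIMARY KEY (gene_id)");
    }

    /** BiologicalProcessTable: the reference to GeneIDTable is an assumption. */
    static String biologicalProcessTable() {
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("gene_id", "VARCHAR(255)");
        columns.put("GO_num", "LONGTEXT");
        columns.put("biological_process", "LONGTEXT");
        return createTable("BiologicalProcessTable", columns,
                "FOREIGN KEY (gene_id) REFERENCES GeneIDTable (gene_id)");
    }

    public static void main(String[] args) {
        System.out.println(geneIdTable() + ";");
        System.out.println(biologicalProcessTable() + ";");
    }
}
```

Running main prints the two statements, which can then be pasted into a database client after checking them against the dialect actually in use.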
Steps to Follow – Step 1 • Get the RefSeq accession number of your species from the NCBI Genome database • e.g. NC_000913 for Escherichia coli K12
Steps to Follow – Step 2 • Download the required files from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov) • genomes/Bacteria/[species name]/[RefSeq #].gbk (main information for genes, proteins, and GO functions) • e.g. genomes/Bacteria/Escherichia_coli_k12/NC_000913.gbk • genomes/Bacteria/[species name]/[RefSeq #].ffn (gene sequences) • e.g. genomes/Bacteria/Escherichia_coli_k12/NC_000913.ffn
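The path pattern in Step 2 can be filled in programmatically. This is a small sketch; the class and method names are hypothetical, and it only builds the path strings without downloading anything.

```java
/** Sketch: builds the NCBI FTP paths described in Step 2. */
final class NcbiFtpPaths {

    /** FTP site named in Step 2. */
    static final String FTP_ROOT = "ftp://ftp.ncbi.nlm.nih.gov";

    /** Returns genomes/Bacteria/[species name]/[RefSeq #].[extension]. */
    static String genomeFile(String speciesDir, String accession, String extension) {
        return "genomes/Bacteria/" + speciesDir + "/" + accession + "." + extension;
    }

    /** Prefixes the relative path with the FTP site. */
    static String fullUrl(String speciesDir, String accession, String extension) {
        return FTP_ROOT + "/" + genomeFile(speciesDir, accession, extension);
    }

    public static void main(String[] args) {
        // The two files needed for the E. coli K12 example.
        System.out.println(fullUrl("Escherichia_coli_k12", "NC_000913", "gbk"));
        System.out.println(fullUrl("Escherichia_coli_k12", "NC_000913", "ffn"));
    }
}
```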
Steps to Follow – Step 3 • Go to the KEGG list of selected organisms (http://www.genome.jp/kegg/catalog/org_list.html) • Find your species and click its organism code in the second column (e.g. eco for E. coli) • Go to “pathway maps” to get the pathway information to put into the PathwayFuncTable
Steps to Follow – Step 4 • Use the eutils functions of NCBI Entrez to get the file that contains the gene pathway association (http://eutils.ncbi.nlm.nih.gov/entrez/eutils/) • Use esearch to search for your species in the gene database • http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=database&term=query&usehistory=y • Use efetch to fetch the result file • http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=database&WebEnv=WebEnvString&query_key=key
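The two URL templates in Step 4 can be assembled as follows. This is a sketch under stated assumptions: the class and method names are hypothetical, and the helper that pulls WebEnv and QueryKey out of the esearch reply uses a simple regular expression rather than an XML parser. No network request is made here.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: builds the esearch and efetch URLs from Step 4. */
final class EutilsUrls {

    static final String BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/";

    /** esearch with usehistory=y, so the result can be fetched later. */
    static String esearchUrl(String db, String term) {
        return BASE + "esearch.fcgi?db=" + encode(db)
                + "&term=" + encode(term) + "&usehistory=y";
    }

    /** efetch using the WebEnv and query_key values taken from the esearch reply. */
    static String efetchUrl(String db, String webEnv, String queryKey) {
        return BASE + "efetch.fcgi?db=" + encode(db)
                + "&WebEnv=" + encode(webEnv) + "&query_key=" + encode(queryKey);
    }

    /** Returns the text of the first <tag>...</tag> element, or null if absent. */
    static String extractTag(String xml, String tag) {
        Matcher m = Pattern.compile("<" + tag + ">([^<]*)</" + tag + ">").matcher(xml);
        return m.find() ? m.group(1) : null;
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(esearchUrl("gene", "Escherichia Coli K12 [orgn]"));
        // WebEnvString and key are the placeholders used in Step 4, not real values.
        System.out.println(efetchUrl("gene", "WebEnvString", "key"));
    }
}
```

In practice, extractTag would be applied to the esearch response with the tag names WebEnv and QueryKey, and the two values passed on to efetchUrl.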
Steps to Follow – Step 5 • Edit the .gbk file to remove the beginning and end parts • Parse the .gbk and .ffn files to fill all the tables except the PathwayFuncTable and the PathwayTable • Link to the sample parser file • Parse.java
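For the .ffn half of Step 5, the file is in FASTA format: each record is a header line starting with ">" followed by one or more sequence lines. The sketch below reads such records into a map. It is not the course's Parse.java; the class name and the sample records in main are made up for illustration, and mapping each header to a gene_id is left out because it depends on the header layout of the downloaded file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: reads FASTA records (as in a .ffn file) into header -> sequence. */
final class FfnReaderSketch {

    static Map<String, String> readFasta(Reader source) {
        Map<String, String> records = new LinkedHashMap<>();
        String header = null;
        StringBuilder sequence = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                if (line.startsWith(">")) {
                    // A new header closes the previous record, if any.
                    if (header != null) {
                        records.put(header, sequence.toString());
                    }
                    header = line.substring(1);
                    sequence.setLength(0);
                } else if (header != null) {
                    sequence.append(line); // sequences may span several lines
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (header != null) {
            records.put(header, sequence.toString()); // last record
        }
        return records;
    }

    public static void main(String[] args) {
        // Made-up sample records, not real E. coli data.
        String sample = ">geneA example\nATGAAA\nCGT\n>geneB example\nATGTTT\n";
        readFasta(new StringReader(sample))
                .forEach((h, s) -> System.out.println(h + " -> " + s.length() + " bases"));
    }
}
```

To read a downloaded file, pass a java.io.FileReader for the .ffn path instead of the StringReader.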
Steps to Follow – Step 6 • Parse the file returned by eutils to get the gene pathway association • Link to the sample parsePath file • ParsePath.java
Database Name Format • Example species Escherichia Coli K12 • Species name: Escherichia_Coli_K12 • Database name: escherichia_coli_k12
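The naming rule above amounts to two string transformations. A minimal sketch follows, with hypothetical class and method names; it assumes that the species name is formed by replacing spaces with underscores, which matches the example but is not stated explicitly.

```java
import java.util.Locale;

/** Sketch: derives the species name and database name shown in the example. */
final class DatabaseNames {

    /** "Escherichia Coli K12" becomes "Escherichia_Coli_K12". */
    static String speciesName(String displayName) {
        return displayName.trim().replaceAll("\\s+", "_");
    }

    /** The database name is the species name in lower case. */
    static String databaseName(String speciesName) {
        return speciesName.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String species = speciesName("Escherichia Coli K12");
        System.out.println(species);
        System.out.println(databaseName(species));
    }
}
```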
Sample Output File • outputFile.txt (output file after parsing .gbk and .ffn files) • outputPath.txt (output file after parsing gene pathway association file) • PathwayFunc.txt (output file after analyzing KEGG pathways)
To Find the Number of Genes • Search for your species in the NCBI gene database • e.g. Escherichia Coli K12 [orgn] • Compare the number of genes in your result against the number reported by this search
Submit your project (the 3 output files, plus the parsers if you changed them) to: • vgummulu@cise.ufl.edu • Any questions: • yizhang@cise.ufl.edu • anupamd@ufl.edu