TreeGeneBrowser: phylogenetic data mining of gene sequences from public databases

Jacobsen, Saleeba, Poidinger and Littlejohn. University of Sydney, and Entigen Corp.

Ways to search the sequence databases • Text searching of the sequence annotations • Sequence similarity searching, such as BLAST • A new way: considering the database entries from different biological perspectives. This gives researchers more options for finding sequence data that addresses specific questions.

How? • Sequence databases frequently contain data with significant phylogenetic information. Each sequence can be viewed based on the species it was derived from and the taxonomy of that species, but also by considering the degree of relatedness of taxa to each other. • Measurement of phylogenetic information can be approached from two directions: the history of a group of related organisms and the evolution of a specific gene.

GOBASE sequence database • Has been curated so that each sequence/feature is associated with its corresponding gene. As a consequence, sequences of the same name in GOBASE are homologs, and so are suitable for phylogenetic analysis. • Relies on the NCBI taxonomy • Could be used to determine gene distribution across the taxonomy and areas of the taxonomy under-represented for specific genes.

System and Methods • Two separate processes: selection and editing of a user tree, followed by scoring of genes based on their presence in the saved user tree. • The initial user tree is a subsection of the NCBI taxonomy. • A gene or sequence is identified with a tip in the user tree if a descendant of that tip has the sequence associated with it.
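The rule for identifying a gene with a tip can be sketched as a walk over the taxonomy below that tip. This is a minimal illustration, not the tool's actual code: the node IDs, the gene names, the `CHILDREN` and `GENES_AT` mappings and both function names are hypothetical, and it assumes that sequences recorded at the tip itself also count.

```python
# Hypothetical taxonomy fragment: node ID -> direct children.
CHILDREN = {1: [2, 3], 2: [4], 3: [5, 6], 4: [], 5: [], 6: []}
# Hypothetical gene observations: node ID -> gene names recorded there.
GENES_AT = {4: {"cox1"}, 5: {"cox1"}, 6: {"nad1"}}

def genes_under(node, children, genes_at):
    """Collect genes recorded at node or at any of its descendants."""
    found = set(genes_at.get(node, ()))
    stack = list(children.get(node, ()))
    while stack:
        current = stack.pop()
        found |= genes_at.get(current, set())
        stack.extend(children.get(current, ()))
    return found

def presence(tips, children, genes_at):
    """Map each user-tree tip to the set of genes identified with it."""
    return {tip: genes_under(tip, children, genes_at) for tip in tips}

# Nodes 2 and 3 are the tips of a small user tree rooted at node 1.
tip_genes = presence([2, 3], CHILDREN, GENES_AT)
```

Here tip 3 picks up both genes because its descendants 5 and 6 carry them, even though nothing is recorded at node 3 directly.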

Implementation • Gene data from GOBASE • Taxonomy data from NCBI • Back-end DBMS: MySQL • Programming language: Python

Databases • Two databases are constructed from NCBI taxonomy: one contains each node ID and its scientific name and the other has each node ID with its parent node ID, and an additional field with a list of all children of the node.
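The two taxonomy tables can be pictured with plain dictionaries. The sketch below is only an in-memory stand-in for the MySQL tables: the rows are invented, `build_tables` is a hypothetical name, and it assumes the "list of all children" holds direct children only. It follows the NCBI convention that the root node is listed as its own parent.

```python
from collections import defaultdict

# Invented rows: (node_id, parent_id, scientific_name).
ROWS = [
    (1, 1, "root"),
    (2, 1, "Fungi"),
    (3, 1, "Metazoa"),
    (4, 2, "Saccharomyces cerevisiae"),
    (5, 3, "Homo sapiens"),
    (6, 3, "Mus musculus"),
]

def build_tables(rows):
    """Return (names, nodes).

    names maps node ID -> scientific name.
    nodes maps node ID -> {"parent": parent ID, "children": [child IDs]}.
    """
    names = {node_id: name for node_id, _, name in rows}
    children = defaultdict(list)
    for node_id, parent_id, _ in rows:
        if node_id != parent_id:  # the root is its own parent; skip it
            children[parent_id].append(node_id)
    nodes = {
        node_id: {"parent": parent_id, "children": sorted(children[node_id])}
        for node_id, parent_id, _ in rows
    }
    return names, nodes

names, nodes = build_tables(ROWS)
```

Storing the child list alongside the parent ID trades some redundancy for cheap downward traversal, which is what user tree selection needs.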

Databases continued Three databases are created from GOBASE: 1. The feature ID of each GOBASE entry, its taxon and gene name, as well as the GenBank accession number of the underlying sequence. 2. Gene table: consists of an entry for each taxon that has a particular gene, plus additional derived entries for higher-level nodes in the NCBI taxonomy. 3. Feature table: contains a list of feature IDs and their corresponding gene and taxon data, with additional derived entries as for the gene table.

User tree selection • User tree is selected initially by choosing a node in the NCBI taxonomy as the root node, and an initial user tree is presented, consisting of the first four layers of children from the ‘user root node’.
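Selecting the initial user tree amounts to a breadth-first walk that stops four layers below the chosen root. The following is a sketch under that reading; `initial_user_tree` and the chain-shaped example taxonomy are hypothetical.

```python
def initial_user_tree(root, children, depth=4):
    """Return {node: [children kept]} for root plus `depth` layers below it."""
    tree = {}
    frontier = [root]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            kids = list(children.get(node, ()))
            tree[node] = kids
            next_frontier.extend(kids)
        frontier = next_frontier
    for node in frontier:
        tree[node] = []  # the deepest layer kept becomes the tips
    return tree

# Hypothetical taxonomy shaped as a chain 0 -> 1 -> 2 -> ... -> 6.
CHAIN = {i: [i + 1] for i in range(6)}
tree = initial_user_tree(0, CHAIN)
```

In the chain example, nodes 1 to 4 are the four layers below the root, so node 4 becomes a tip and nodes 5 and 6 are left out of the initial display.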

User tree editing Four ways to edit: • Nodes can be deleted (removed from the analysis), including all their children • Nodes can be contracted, which removes the children of that node from the user tree while they remain part of the analysis • Nodes can similarly be ‘expanded’, where the children of a node are included in the user tree • Simple editing of the topology by ‘cut and paste’
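The four editing operations can be sketched on a user tree stored as a dictionary from node to child list. All names here are hypothetical. The sketch assumes that deleted nodes are remembered in an `excluded` set so that scoring can skip them, whereas contracted nodes simply become tips and their hidden descendants still contribute. `move` does not check for moving a node underneath itself.

```python
def subtree(node, tree):
    """Return node and everything below it in the user tree."""
    out, stack = [], [node]
    while stack:
        current = stack.pop()
        out.append(current)
        stack.extend(tree.get(current, ()))
    return out

def delete(tree, node, excluded):
    """Drop node and its subtree; record node as excluded from analysis."""
    for n in subtree(node, tree):
        tree.pop(n, None)
    for kids in tree.values():
        if node in kids:
            kids.remove(node)
    excluded.add(node)

def contract(tree, node):
    """Hide everything below node so that node becomes a tip."""
    for kid in list(tree[node]):
        for n in subtree(kid, tree):
            tree.pop(n, None)
    tree[node] = []

def expand(tree, node, taxonomy_children):
    """Show node's taxonomy children as new tips."""
    kids = list(taxonomy_children.get(node, ()))
    tree[node] = kids
    for kid in kids:
        tree.setdefault(kid, [])

def move(tree, node, new_parent):
    """'Cut and paste': re-attach node, with its subtree, under new_parent."""
    for kids in tree.values():
        if node in kids:
            kids.remove(node)
    tree[new_parent].append(node)

TAXONOMY = {1: [2, 3], 2: [4], 3: [5, 6], 4: [], 5: [], 6: []}
tree = {node: list(kids) for node, kids in TAXONOMY.items()}
excluded = set()

contract(tree, 3)          # nodes 5 and 6 leave the display
expand(tree, 3, TAXONOMY)  # and come back as tips
delete(tree, 2, excluded)  # nodes 2 and 4 leave the analysis
```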

Algorithm • Genes from GOBASE are classified based on their gene name • Each gene is marked as present or absent • If the user tree tips are at subspecies level, the species level is also checked for gene sequences

Scoring for genes • Each gene is given a score based on which tips of the user tree it is present at. • Five different scoring schemes: • Equil-taxon weighting: a tip’s score is the inverse of the product of the number of children at each node from the root to that tip • Vane-Wright: a tip’s score is the inverse of the number of nodes from root to tip • May: a tip’s score is the inverse of the sum of the number of children at each node from root to tip • Pairwise unique: the pairwise distance between any two tips is the number of nodes separating them • Pairwise shared: the pairwise distance between any two tips is the inverse of the number of nodes they share from the root.
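The scheme definitions above leave some details open, so the sketch below fixes them by assumption: the path "from root to tip" includes the root but not the tip itself, a gene's score is the sum of the scores of the tips where it is present, and "nodes separating" two tips counts the nodes on the path between them, including their most recent common ancestor. The tree and all function names are hypothetical, and the slides do not say how the pairwise distances are turned into a gene score, so only the distances are shown.

```python
from fractions import Fraction
from math import prod

# Hypothetical user tree: node -> parent (the root maps to None).
PARENT = {"root": None, "A": "root", "B": "root",
          "a1": "A", "a2": "A", "a3": "A", "b1": "B"}

def path_from_root(tip, parent):
    """Ancestors of tip, root first; the tip itself is excluded (assumption)."""
    path, node = [], parent[tip]
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]

def child_counts(parent):
    """Number of direct children of each internal node."""
    counts = {}
    for par in parent.values():
        if par is not None:
            counts[par] = counts.get(par, 0) + 1
    return counts

def tip_score(tip, parent, scheme):
    path = path_from_root(tip, parent)
    counts = child_counts(parent)
    fanout = [counts[node] for node in path]
    if scheme == "equil_taxon":
        return Fraction(1, prod(fanout))
    if scheme == "vane_wright":
        return Fraction(1, len(path))
    if scheme == "may":
        return Fraction(1, sum(fanout))
    raise ValueError(f"unknown scheme: {scheme}")

def gene_score(tips_present, parent, scheme):
    """Sum of tip scores over the tips carrying the gene (assumption)."""
    return sum(tip_score(tip, parent, scheme) for tip in tips_present)

def shared_nodes(t1, t2, parent):
    """Number of ancestors two tips share, counted from the root."""
    count = 0
    for x, y in zip(path_from_root(t1, parent), path_from_root(t2, parent)):
        if x != y:
            break
        count += 1
    return count

def pairwise_unique(t1, t2, parent):
    """Nodes on the path between two tips, common ancestor included."""
    shared = shared_nodes(t1, t2, parent)
    own = len(path_from_root(t1, parent)) + len(path_from_root(t2, parent))
    return own - 2 * shared + 1

def pairwise_shared(t1, t2, parent):
    return Fraction(1, shared_nodes(t1, t2, parent))
```

Under these assumptions, tip `a1` scores 1/6 (equil-taxon), 1/2 (Vane-Wright) and 1/5 (May), and the equil-taxon scores of all four tips sum to 1, so a gene present everywhere would reach the maximum score of 1 under that scheme.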

Scoring for genes continued • The scores for a given gene can’t be compared meaningfully between methods. • The results can be visualized in two major ways: • As listings of genes and their corresponding scores • As the user tree with a mapping of either the presence of individual genes or the overall distribution of sequences.

Discussion • The different methods score the genes similarly: the overall ranking is similar for all methods. • This method simplifies the identification of genes that may be useful for phylogenetic research, which speeds up the preparatory work for phylogenetic analysis considerably. • In cases where a gene has been identified as suitable, but representative sequences are not available for all groups considered, this method is useful not only for identifying the taxa without sequence data, but can also be used to prepare multiple sequence alignments of the known sequences.

Discussion continued • This method is not suitable for databases that are not curated to a high standard, because they lack consistency in gene naming and similar annotation. • The NCBI taxonomy covers only around 70,000 of an estimated 1.5 to 1.8 million species, so scores from this method do not reflect any species or taxonomic groups for which no sequence data is available.
