150 likes | 342 Views
A Novel Knowledge Based Method to Predicting Transcription Factor Targets. cyd@picb.ac.cn. Background.
E N D
A Novel Knowledge Based Method to Predicting Transcription Factor Targets cyd@picb.ac.cn
Background • It is termed as the major aspect of transcription regulation that transcription factor regulates target genes’ expression. Extensive efforts have been made in discovering transcription factors’ target genes both in wet and dry labs. Since transcription factors as well as their targets may participate the same biological pathways and share similar biological functions, we can inference the regulatory relationship by analyzing the Gene Ontology annotations and potential transcription factor DNA binding sites. • Hence, the computational method to predict potential transcription factor target we developed could be useful in transcription regulatory mechanism researches.
Background • TF-TFBS-TFT triplets • Transcription factors(TF) regulate transcription factor target(TFT) through binding to transcription factor DNA binding sites(TFBS).
Background • Our predicting strategy GO encoding TF False Predictor 0/1 encoding Hybridization Space TFBS True GO encoding TFT It is TRUE that Transcription factors(TF) regulate transcription factor target(TFT) through binding to transcription factor DNA binding sites(TFBS).
Materials and Methods • Positive dataset • For transcription factors as well as their targets, binding sites, the original dataset came from TRANSFAC v7.0. Then the original dataset was filtered as following steps: (1) Remove the TFs with no SwissProt Accessions, as well as TFTs. (2) Remove TFBSs with length less than 5bp or longer than 25bp. (3) Finally, positive dataset with 3430 TF-TFBS-TFT triplet which covered 143TF, 571TFBS and 1416TFT was built • Negative dataset • Negative dataset was randomly generated based on positive dataset as following steps; (1) Random number i was generated from uniform distribution on interval [1,143] , j from interval [1,571] and k from [1,1416] . (2) The ith TF, jth TFBS and kth TFT was selected from the positive dataset. Then a new triple was constructed through combining those three elements. This new triple is ignored if it does appear in the original positive dataset, otherwise would be pushed in the negative dataset. (3) Repeat step 1,2 and 3 until the size of negative dataset reached 6860, which is two times that of positive dataset. (3) Finally, a negative dataset with 6860 TF-TFBS-TFT triples which covered 140TF, 559TFBS and 1317TFT was obtained
Numeric representation system • TF Gene Ontology representation system • By using Uniprot2GO mapping provided by GOA Uniprot 34.0 on November 21st 2005 [ http://www.ebi.ac.uk/GOA/ ], functional annotations of TFs provided by GO were obtained. • Each TF can be represented in a 9525D (Dimensional) vector through using each of the 9525 GO items as the vector base, e.g. for a given TF that hit a GO item which is the ith number of the 9525 GO items, then the ith component of the 9525D vector will be set to 1, otherwise 0. • Thus, the TF sample can be formulated as where,
Numeric representation system • TFTs are encoded by using the same approach as TFs • Each TFT can be represented in a 9525D (Dimensional) vector through using each of the 9525 GO items as the vector base where,
Numeric representation system • Short nucleotide sequence TFBS are encoded using the 0/1 encoding system which can be briefed as follows • Firstly, TFBSs with length less than 25bp are extended to exact 25bp through adding ‘N’ suffixes, e.g. ‘CCCCACGTAGCTAGACGTAG’ will be extended to ‘CCCCACGTAGCTAGACGTAGNNNNN’, meanwhile make no change for these TFBSs with length exact 25bp. • Then, these TFBSs can be represented in a 100D (Dimensional) vector, e.g. ‘ACGTAGCTAGACGTAGCTAGNNNNN’ will be represented in a 100D binary vector as 0010'0010'0010'0010'0001'0010'0100'1000'0001'0100'0010'1000'0001'0100'0001'0010'0100'1000'0001'0000'0000'0000'0000'0000 , meanwhile each nucleotide was encoded with a 4D binary vector as • Finally, each TFBS can be formulated as where, d can be either 0 or 1.
The hybridization space • To fascinate predicting the interactions between TF and TFT, a numeric representation to cover TF-TFBS-TFT triplet is developed. This can be done as follows. Suppose Tx , Dy and Gz are the xth TF, yth TFBS, zth TFT, respectively. The x − y − z TF-TFBS-TFT triplet TDG( x, y, z ) can be expressed as
The hybridization space TF regulate TFT through binding TFBS 1 Predictor 0 NOT
The predictor • The Nearest Neighbor Algorithm • Once the numeric representation is built, the predictor performed in this contribution can be briefly as follows. Suppose there are N triplets ( R1 R2 , ..., Ri ,..., RN ) with known classification label ( L1 L2 , ..., Li,..., LN) , where Li ∈ {true false} and true indicates it is indeed a true triplet that TF act on TFT through TFBS, and false otherwise. • Given a novel triplet R , is it true? To investigate this problem, distance D(R, Ri) (1≤ i ≤ N ) is defined where, Ri · R is the inner-product of R and Ri , ||R|| and ||Ri|| are the modulus of R and Ri , respectively.
The predictor • The Nearest Neighbor Algorithm • Once the distance is calculated. The category of R can be predicted to be same as that of its nearest neighbor. • If there is a tie, which means there are more than one nearest neighbor.
Results and Discussion Jackknife cross-validation test Excluding the transcription factors with no GO annotations as well as the transcription factor targets, finally 19150D (9525+9525+100) vector were built for 2693 true triples and 5337 artificial triplets. The result is obtained when k is set to 0.5.
Conclusion • Identifying transcription factor’s targets is one of the basic researches in transcription regulatory area. In this contribution, a knowledge based method was proposed to identify TF-TFT relationships through integrating Gene Ontology annotations and transcription factor DNA binding preference. The predictor we built acquired a fairly good performance as 96.5%, which indicates the computational method we developed could be a useful tool in transcription regulatory mechanism researches.