The use of 4-grams as vectors in protein classification and sequence comparison to measure sequence identity and detect close sequences with low sequence identity.
The use of 4-grams for Protein Classification and Sequence Comparison
Dror Tobi, ShannChing Chen, Ivet Bahar
The 4-gram Concept
4-gram: a short sequence of four amino acids (e.g. QLIR, AASD, FGTY).
• Each sequence or group of sequences is represented as a vector in the 20^4-dimensional (160,000-dimensional) space of 4-grams.
• The % sequence identity between two sequences correlates with the cosine of the angle α between their vectors.
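As a minimal sketch of the concept, the snippet below lists the 4-grams of a short peptide. It assumes overlapping windows that advance one residue at a time, which the slide does not state explicitly; the function name `four_grams` is hypothetical.

```python
def four_grams(sequence, n=4):
    """Return all overlapping n-grams of an amino-acid sequence.

    Assumption: the window slides one residue at a time.
    """
    sequence = sequence.upper()
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


print(four_grams("QLIRAASD"))
# A sequence shorter than four residues has no 4-grams.
print(four_grams("QLI"))
```

With 20 standard amino acids there are 20**4 = 160,000 possible 4-grams, which is the dimension of the vector space.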
Representation of Sequence(s) as 4-gram Vector(s)
Three steps:
• Calculating 4-gram frequencies in the examined DB
• Calculating 4-gram frequencies for a given sequence or a given family of sequences
• Creating a 4-gram vector using a weight function
1. Calculating 4-gram frequencies in DB
As a reference DB we chose Swiss-Prot. A table of the number of occurrences of each 4-gram was created:
AAAA 10929
AAAR 2230
. . .
VVVV 1402
The table enables us to calculate the database frequency of 4-gram i as: [equation not preserved in the source]
2. Calculating 4-gram frequencies of a sequence (or family)
The 4-gram frequencies for a given sequence or family of sequences are calculated using a hash table. Each 4-gram (xxxx) is entered into the hash table together with its count (n), and the 4-gram family frequency is calculated from these counts.
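The counting step can be sketched with Python's dict-backed `Counter` standing in for the hash table. The normalisation (average occurrences per family member) follows the definition given with the weight function; the function name and the toy family are hypothetical, and overlapping windows are assumed.

```python
from collections import Counter


def family_average_counts(sequences, n=4):
    """Average number of times each n-gram appears per sequence in a family."""
    totals = Counter()  # hash table: n-gram -> total occurrences in the family
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            totals[seq[i:i + n]] += 1
    return {gram: count / len(sequences) for gram, count in totals.items()}


toy_family = ["AAAAR", "AAAAK"]  # made-up sequences, for illustration only
averages = family_average_counts(toy_family)
print(averages)
```

A single sequence is simply a family of size one, so the same function covers both cases.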
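3. The 4-gram weight function
The weight Wi of 4-gram i for sequence/family f is defined as: [equation not preserved in the source]
The definition uses the average number of times 4-gram i appears in family f and compares the occurrence of 4-gram i in f with its occurrence in the database:
• If 4-gram i is more frequent in f than in the database, then Wi > 0
• If the two are equal, then Wi = 0 (no important contribution)
• If 4-gram i is less frequent in f than in the database, then Wi < 0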
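The authors' weight formula is not preserved in this transcript. Purely as an illustration, the sketch below assumes a log-ratio of family frequency to database frequency, which is one simple function that reproduces the three sign conditions; it should not be read as the definition actually used. The name `log_ratio_weight` is hypothetical.

```python
import math


def log_ratio_weight(family_freq, database_freq):
    """Hypothetical weight: log(family frequency / database frequency).

    Assumption: this is NOT the authors' formula, only a function with the
    same sign behaviour (positive when over-represented in the family,
    zero when equal, negative when under-represented).
    """
    if family_freq <= 0 or database_freq <= 0:
        raise ValueError("frequencies must be positive")
    return math.log(family_freq / database_freq)


print(log_ratio_weight(0.002, 0.001) > 0)
print(log_ratio_weight(0.001, 0.001) == 0)
print(log_ratio_weight(0.0005, 0.001) < 0)
```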
Building a 4-gram Vector (cont’d)
• A 4-gram vector of length k is built from the k 4-grams with the highest |Wi| values. These 4-grams are referred to as the k most discriminative 4-grams.
• The k most discriminative 4-grams are selected using a heap data structure.
• The vector elements are sorted by 4-gram identity using the quicksort algorithm.
Vector layout (positions 1, 2, …, k):
Identity: xxxx1 xxxx5 xxxx9 xxxx1001 xxxx1050
Weight:   w1    w5    w9    w1001    w1050
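A minimal sketch of the selection and sorting steps, assuming the weights are already available as a dict. `heapq.nlargest` performs the heap-based top-k selection; Python's built-in `sorted` uses Timsort rather than quicksort, which yields the same ordering. The function name and the toy weights are hypothetical.

```python
import heapq


def build_vector(weights, k):
    """Keep the k 4-grams with the largest |W|, sorted by 4-gram identity."""
    top_k = heapq.nlargest(k, weights.items(), key=lambda item: abs(item[1]))
    return sorted(top_k)  # tuples sort by the 4-gram string first


toy_weights = {"AAAA": 0.1, "VVVV": -2.0, "AAAR": 1.5, "QLIR": 0.0}
print(build_vector(toy_weights, k=2))
```

Strongly negative weights are kept as well, since selection is by absolute value: an under-represented 4-gram is as discriminative as an over-represented one.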
Comparing two Vectors
Vector similarity is measured by the cosine of the angle α between the two vectors.
Vector 1: xxxx1 w1, xxxx5 w5, xxxx9 w9, xxxx1001 w1001, xxxx1050 w1050
Vector 2: xxxx5 w5, xxxx6 w6, xxxx9 w9, xxxx1001 w1001, xxxx1056 w1056
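Because both vectors are sorted by 4-gram identity, the dot product can be accumulated in a single merge-style pass. The sketch below assumes each vector is a list of (4-gram, weight) pairs; the function name and toy vectors are hypothetical.

```python
import math


def cosine(vec_a, vec_b):
    """Cosine between two sparse vectors given as (4-gram, weight) lists
    that are sorted by 4-gram."""
    i = j = 0
    dot = 0.0
    while i < len(vec_a) and j < len(vec_b):
        gram_a, w_a = vec_a[i]
        gram_b, w_b = vec_b[j]
        if gram_a == gram_b:      # shared 4-gram contributes to the dot product
            dot += w_a * w_b
            i += 1
            j += 1
        elif gram_a < gram_b:     # advance whichever pointer is behind
            i += 1
        else:
            j += 1
    norm_a = math.sqrt(sum(w * w for _, w in vec_a))
    norm_b = math.sqrt(sum(w * w for _, w in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


a = [("AAAA", 1.0), ("AAAR", 1.0)]
b = [("AAAR", 1.0), ("VVVV", 1.0)]
print(cosine(a, b))
```

Only 4-grams present in both vectors contribute to the numerator, so two vectors with no shared 4-grams have a cosine of zero.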
EC4 family classification
EC4 Test
• 1769 families (containing a total of 10,919 enzymes) defined at level 4 of the EC classification (at Expasy) were considered (*).
• A 4-gram vector (model, probe vector) was built for each EC4 family.
• The cosine between the probe vector of a given EC4 family and the 4-gram vector of each sequence in Swiss-Prot was calculated.
• All sequences were rank-ordered by their cosine values.
(*) out of a total of ~4000 in SWISS-PROT release 27.7, excluding families that do not contain any sequences
Success Definition
% success is defined as the % of family members having a cosine value higher than that of any non-family sequence in the Swiss-Prot DB.
Example: a family (F00X) with five members, F001-F005. This is a case of 80% success. Family members are colored blue on the slide; P0SD is a non-family sequence.
F001 0.567
F003 0.456
F005 0.354
F002 0.333
P0SD 0.301
F004 0.255
…..
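The definition can be checked against the slide's own example. The sketch below assumes the ranking is available as (sequence ID, cosine) pairs and the family as a set of IDs; the function name is hypothetical.

```python
def percent_success(ranked, family):
    """Percentage of family members scoring above every non-family sequence."""
    best_non_family = max(
        (cos for seq_id, cos in ranked if seq_id not in family),
        default=float("-inf"),
    )
    hits = sum(
        1 for seq_id, cos in ranked
        if seq_id in family and cos > best_non_family
    )
    return 100.0 * hits / len(family)


ranked = [
    ("F001", 0.567), ("F003", 0.456), ("F005", 0.354),
    ("F002", 0.333), ("P0SD", 0.301), ("F004", 0.255),
]
family = {"F001", "F002", "F003", "F004", "F005"}
print(percent_success(ranked, family))
```

Four of the five members rank above P0SD, which gives the 80% quoted on the slide.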
EC 1.14.12.3: a case of failure
EC 1.14.12.3 is a family of four proteins. When we tested this family against Swiss-Prot, no family member had a cosine value higher than the highest cosine value among non-family members.
[Figure: EC 1.14.12.3 phylogenetic tree]
• This dioxygenase system consists of four proteins: the two subunits of the hydroxylase component (BEDC1 and BEDC2), a ferredoxin (BEDB) and a ferredoxin reductase (BEDA).
Sequence homogeneity is a prerequisite for successful 4-gram classification
[Figure labels: Sub Family, Family vector, Sub Family]
Preliminary Conclusions
• 4-gram classification is a fast way to classify/cluster sequences: 120,000 comparisons take ~4 min on a regular desktop.
• Sequence homogeneity within a family is a prerequisite for successful classification.
• The EC classification groups enzymes according to their function, which does not necessarily correlate with classification based on sequence similarity.
Uses of 4-grams in Sequence Search
The 4-gram vector “as is” measures “sequence identity” and can therefore easily detect close sequences (>55% identity).
But what about sequences with low sequence identity (30-55%)?
Case of P03579 / P03581
43.6% identity; Global alignment score: 414

               10        20        30        40        50        60
P03579 MPYTINSPSQFVYLSSAYADPVQLINLCTNALGNQFQTQQARTTVQQQFADAWKPVPSMT
       : :.: .:::.::..  ::: . ..:   :: .:.::::..:  ... .  .   : :
P03581 MAYSIPTPSQLVYFTENYADYIPFVNRLINARSNSFQTQSGRDELREILIKSQVSVVSPI
               10        20        30        40        50        60

                70        80        90       100       110
P03579 VRFPASD-FYVYRYNSTLDPLITALLNSFDTRNRIIEVDNQPAPNTTEIVNATQRVDDAT
        ::::   .:.:  . ... . ::::.: :::::.:::.:.   .:.: .::..:.:::.
P03581 SRFPAEPAYYIYLRDPSISTVYTALLQSTDTRNRVIEVENSTNVTTAEQLNAVRRTDDAS
               70        80        90       100       110       120

     120       130       140       150
P03579 VAIRASINNLANELVRGTGMFNQAGFETASGLVW--TTTPAT-
       .::. ....: . :. :::.::...::.::::.:  :::: :
P03581 TAIHNNLEQLLSLLTNGTGVFNRTSFESASGLTWLVTTTPRTA
              130       140       150       160

Cos(P03579, P03581) = 0.04
Improving Sensitivity using homology 4-grams
P03579 MPYTINSPSQFVYLSSAY
       : :.: .:::.::..  :
P03581 MAYSIPTPSQLVYFTENY
Identity 4-grams (identity vector): SPSQ
Homology 4-grams (homology vector): APSQ, NPSQ, TPSQ, …, SPSK
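The slide does not say how homology 4-grams are generated. The sketch below assumes they are single-residue substitutions of an identity 4-gram, drawn from a table of similar residues. The table `toy_similar` is a hypothetical stand-in, chosen only to reproduce the examples listed on the slide; a real implementation would presumably derive it from a substitution matrix.

```python
def homology_4grams(gram, similar):
    """Single-substitution variants of a 4-gram.

    Assumption: 'similar' maps a residue to the residues considered
    homologous to it; this mechanism is a guess, not the authors' method.
    """
    variants = []
    for pos, residue in enumerate(gram):
        for substitute in similar.get(residue, ""):
            variants.append(gram[:pos] + substitute + gram[pos + 1:])
    return variants


toy_similar = {"S": "ANT", "Q": "K"}  # hypothetical similarity table
print(homology_4grams("SPSQ", toy_similar))
```

Under this assumption, the identity 4-gram SPSQ from P03579 produces TPSQ, which occurs in P03581 at the aligned position, so the homology vector can register a match the identity vector misses.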
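Including homology in vector comparison
[Figure: the query sequence is represented by an identity vector and a homology vector; the vector of the unknown sequence makes angle αi with the identity vector and angle αh with the homology vector]
Score = cos(αi) + λ cos(αh)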
Correlation between cosine value and Sequence alignment % identity
Conclusions
• The use of homology 4-grams improves detection of distant sequences (30-55% sequence identity).
• The 4-gram based method also seems suitable for sequence search.
• After precalculation of the sequences’ 4-gram vectors, two sequences can be compared with time complexity of O(1) with respect to sequence length, since the cost depends only on the fixed vector length k.