The use of 4-grams for Protein Classification and Sequence Comparison

Dror Tobi, ShannChing Chen, Ivet Bahar

The 4-gram Concept
4-gram: a short sequence of four amino acids, e.g. QLIR, AASD, FGTY.
• Each sequence or group of sequences is represented as a vector in the 20^4-dimensional space of 4-grams (20 amino acids give 160,000 possible 4-grams).
• The % sequence identity between two sequences correlates with the cosine value of their vectors.
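As a minimal sketch of the concept, the snippet below lists the 4-grams of a short sequence and computes the size of the 4-gram space. It assumes overlapping windows that advance one residue at a time, which the slides do not state explicitly; the function name `four_grams` and the toy sequence are illustrative only.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def four_grams(sequence, n=4):
    """Return every overlapping n-gram of the sequence, in order.

    Assumption: windows overlap and advance one residue at a time.
    """
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


# Toy sequence assembled from two 4-grams shown on the slide (hypothetical input).
grams = four_grams("QLIRAASD")

# Number of distinct 4-grams over a 20-letter alphabet.
space_size = len(AMINO_ACIDS) ** 4
```

A sequence shorter than four residues yields an empty list under this sketch.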

Representation of Sequence(s) as 4-gram Vector(s)
Three steps:
• Calculating 4-gram frequencies in the examined DB
• Calculating 4-gram frequencies for a given sequence or a given family of sequences
• Creating a 4-gram vector using a weight function

1. Calculating 4-gram frequencies in DB
As a reference DB we chose Swiss-Prot. A table of the number of occurrences of each 4-gram was created:
AAAA 10929
AAAR 2230
. .
VVVV 1402
The table enables us to calculate the database frequency of each 4-gram i.
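The occurrence table can be sketched as follows. The normalisation used here (count divided by the total number of 4-grams in the database) is an assumption, since the exact formula is not shown in this transcript; `count_four_grams`, `database_frequency` and the toy database are hypothetical.

```python
from collections import Counter


def count_four_grams(sequences, n=4):
    """Count how often each n-gram occurs across all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts


def database_frequency(counts):
    """Assumed normalisation: occurrences divided by the total 4-gram count."""
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


toy_db = ["AAAAA", "AAAR"]  # stand-in for the Swiss-Prot sequences
db_counts = count_four_grams(toy_db)
db_freq = database_frequency(db_counts)
```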

2. Calculating 4-gram frequencies of a sequence (or family)
The 4-gram frequencies for a given sequence or family of sequences are calculated using a hash table. Each 4-gram (xxxx) is entered into the hash table together with its count (n), from which the 4-gram family frequency is calculated.
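A possible sketch of the hash-table step, using a Python dictionary as the hash table. It takes the family frequency to be the average number of times a 4-gram appears per family member, following the definition given with the weight function; the function name and the toy family are illustrative.

```python
from collections import defaultdict


def family_frequency(family, n=4):
    """Average number of occurrences of each n-gram per sequence in the family."""
    totals = defaultdict(int)  # hash table: 4-gram -> total count in the family
    for seq in family:
        for i in range(len(seq) - n + 1):
            totals[seq[i:i + n]] += 1
    return {gram: total / len(family) for gram, total in totals.items()}


# A single sequence is treated as a family with one member.
freq = family_frequency(["AAAAA", "AAAAR"])
```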

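3. The 4-gram weight function
The weight Wi of 4-gram i for sequence/family f is defined by comparing the family frequency of 4-gram i, i.e. the average number of times 4-gram i appears in family f, with its database frequency:
• If the family frequency > the database frequency, then Wi > 0
• If the family frequency = the database frequency, then Wi = 0 (no important contribution)
• If the family frequency < the database frequency, then Wi < 0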
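The weight function itself is not recoverable from this transcript, only its sign behaviour. The sketch below uses a log ratio purely as a stand-in that reproduces that behaviour; it should not be read as the authors' formula. It also assumes both frequencies are expressed on the same scale. The name `weight` and the `pseudo` parameter are hypothetical.

```python
import math


def weight(family_freq, db_freq, pseudo=1e-9):
    """Stand-in weight: log ratio of family frequency to database frequency.

    Assumption: NOT the original formula, only a function with the same signs
    (positive if over-represented, zero if equal, negative if under-represented).
    The small pseudo-count avoids taking log(0) or dividing by zero.
    """
    return math.log((family_freq + pseudo) / (db_freq + pseudo))


over = weight(0.2, 0.1)    # family frequency above database frequency
equal = weight(0.1, 0.1)   # same frequency
under = weight(0.05, 0.1)  # family frequency below database frequency
```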

Building a 4-gram Vector (cont'd)
A 4-gram vector of length k is built from the k 4-grams with the highest |Wi| values. These 4-grams are referred to as the k most discriminative 4-grams. The selection of the k most discriminative 4-grams is done using a heap data structure. The vector elements are sorted according to their 4-gram identity using the quicksort algorithm.
Position: 1 2 … k
Identity: xxxx1 xxxx5 xxxx9 xxxx1001 xxxx1050
Weight: w1 w5 w9 w1001 w1050
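The selection and sorting steps might look like this in Python. `heapq.nlargest` performs the heap-based selection; note that Python's built-in `sorted` uses Timsort rather than quicksort, which changes nothing about the result. The function name `build_vector` and the toy weights are illustrative.

```python
import heapq


def build_vector(weights, k):
    """Keep the k 4-grams with the largest |W|, then sort them by 4-gram identity.

    weights: dict mapping 4-gram -> weight.
    Returns a list of (4-gram, weight) pairs in alphabetical 4-gram order.
    """
    top_k = heapq.nlargest(k, weights.items(), key=lambda item: abs(item[1]))
    return sorted(top_k)  # 4-grams are unique, so this orders by identity


toy_weights = {"AAAA": 0.1, "CCCC": -2.0, "DDDD": 1.5, "EEEE": 0.0}
vector = build_vector(toy_weights, k=2)
```

Negative weights are kept when their magnitude is large, since under-represented 4-grams are also discriminative.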

Comparing two Vectors
Vector similarity is measured by the cosine of the angle α between the two vectors.
Vector 1: xxxx1 w1, xxxx5 w5, xxxx9 w9, xxxx1001 w1001, xxxx1050 w1050
Vector 2: xxxx5 w5, xxxx6 w6, xxxx9 w9, xxxx1001 w1001, xxxx1056 w1056
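Because the vector elements are sorted by 4-gram identity, the dot product can be accumulated in a single merge-style pass over both vectors. The following is a sketch under that assumption; the representation as lists of (4-gram, weight) pairs and the name `cosine` are choices made here, not taken from the slides.

```python
import math


def cosine(u, v):
    """Cosine between two sparse vectors given as identity-sorted (4-gram, weight) lists."""
    i = j = 0
    dot = 0.0
    while i < len(u) and j < len(v):
        if u[i][0] == v[j][0]:      # shared 4-gram: contributes to the dot product
            dot += u[i][1] * v[j][1]
            i += 1
            j += 1
        elif u[i][0] < v[j][0]:     # 4-gram only in u: skip it
            i += 1
        else:                       # 4-gram only in v: skip it
            j += 1
    norm_u = math.sqrt(sum(w * w for _, w in u))
    norm_v = math.sqrt(sum(w * w for _, w in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


u = [("AAAA", 1.0), ("CCCC", 2.0)]
v = [("AAAA", 1.0), ("DDDD", 2.0)]
similarity = cosine(u, v)
```

The pass visits each element at most once, so the cost grows with the vector length k and not with the length of the underlying sequences.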

EC4 family classification
EC4 Test
• 1769 families (containing a total of 10,919 enzymes) defined at the EC level 4 classification (at Expasy) were considered (*).
• A 4-gram vector (model, probe vector) was built for each EC4 family.
• The cosine between the probe vector for a given EC4 family and the 4-gram vector of each sequence in Swiss-Prot was calculated.
• All sequences were rank-ordered based on their cosine values.
(*) out of a total of ~4000 in SWISS-PROT release 27.7, excluding families that do not contain any sequences

Success Definition
% success is defined as the % of family members having a cosine value higher than any non-family sequence in the Swiss-Prot DB.
Example: a family (F00X) that has five members, F001-5. Four of the five members rank above the first non-family sequence (P0SD), so this is a case of 80% success.
F001 0.567
F003 0.456
F005 0.354
F002 0.333
P0SD 0.301
F004 0.255
…..
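The success measure can be sketched by walking down the ranked list until the first non-family sequence appears. The data below are the values from the example; the function name `percent_success` is hypothetical, and ties between a family member and a non-family sequence are not handled specially.

```python
def percent_success(ranked, family):
    """Percentage of family members ranked above every non-family sequence.

    ranked: list of (sequence id, cosine) pairs sorted by decreasing cosine.
    family: set of sequence ids belonging to the family.
    """
    hits = 0
    for seq_id, _cos in ranked:
        if seq_id not in family:
            break  # first non-family sequence reached
        hits += 1
    return 100.0 * hits / len(family)


ranked = [("F001", 0.567), ("F003", 0.456), ("F005", 0.354),
          ("F002", 0.333), ("P0SD", 0.301), ("F004", 0.255)]
family = {"F001", "F002", "F003", "F004", "F005"}
success = percent_success(ranked, family)  # 4 of 5 members precede P0SD
```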

EC4 Initial Results

EC 1.14.12.3: a case of failure
EC 1.14.12.3 is a family of four proteins. When we tested this family against Swiss-Prot, no family member had a higher cosine value than the highest cosine value of the non-family members.
EC 1.14.12.3 Phylogenetic tree
• THIS DIOXYGENASE SYSTEM CONSISTS OF FOUR PROTEINS: THE TWO SUBUNITS OF THE HYDROXYLASE COMPONENT (BEDC1 AND BEDC2), A FERREDOXIN (BEDB) AND A FERREDOXIN REDUCTASE (BEDA).

Sequence homogeneity is a prerequisite for successful 4-gram classification
(Figure labels: Sub Family, Family vector, Sub Family)

Preliminary Conclusions
• 4-gram classification is a fast way to classify/cluster sequences. 120,000 comparisons take ~4 min on a regular desktop.
• Sequence homogeneity within a family is a prerequisite for successful classification.
• The EC classification classifies enzymes according to their function, which does not necessarily correlate with classification based upon sequence similarity.

Use of 4-grams in Sequence Search
The 4-gram vector "as is" measures "sequence identity" and therefore can easily detect close sequences (>55% identity).
But what about sequences with low sequence identity (30-55%)?

Case of P03579 / P03581
43.6% identity; Global alignment score: 414

P03579 MPYTINSPSQFVYLSSAYADPVQLINLCTNALGNQFQTQQARTTVQQQFADAWKPVPSMT
P03581 MAYSIPTPSQLVYFTENYADYIPFVNRLINARSNSFQTQSGRDELREILIKSQVSVVSPI

P03579 VRFPASD-FYVYRYNSTLDPLITALLNSFDTRNRIIEVDNQPAPNTTEIVNATQRVDDAT
P03581 SRFPAEPAYYIYLRDPSISTVYTALLQSTDTRNRVIEVENSTNVTTAEQLNAVRRTDDAS

P03579 VAIRASINNLANELVRGTGMFNQAGFETASGLVW--TTTPAT-
P03581 TAIHNNLEQLLSLLTNGTGVFNRTSFESASGLTWLVTTTPRTA

Cos(P03579, P03581) = 0.04

Improving Sensitivity using homology 4-grams
P03579 MPYTINSPSQFVYLSSAY
P03581 MAYSIPTPSQLVYFTENY
Identity 4-grams (identity vector): SPSQ
Homology 4-grams (homology vector): APSQ NPSQ TPSQ … SPSK
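One way to generate homology 4-grams like those listed for SPSQ is to substitute a single residue with a similar one. The slides do not say how similar residues are chosen, so the `SIMILAR` table below is a hypothetical stand-in picked only to reproduce the substitutions shown (S to A, N or T; Q to K); a real implementation would presumably derive it from a substitution matrix.

```python
# Hypothetical similarity table, NOT taken from the slides.
SIMILAR = {"S": "ANT", "Q": "K"}


def homology_four_grams(gram, similar=SIMILAR):
    """Return all variants of the 4-gram that differ by one 'similar' substitution."""
    variants = []
    for pos, residue in enumerate(gram):
        for substitute in similar.get(residue, ""):
            variants.append(gram[:pos] + substitute + gram[pos + 1:])
    return variants


variants = homology_four_grams("SPSQ")
```

With this toy table the variants include APSQ, NPSQ, TPSQ and SPSK, along with the substitutions at the third position.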

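Including homology in vector comparison
The query sequence is represented by an identity vector and a homology vector. αi is the angle between the unknown sequence and the identity vector, and αh is the angle between the unknown sequence and the homology vector.
Score = cos(αi) + λ cos(αh)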
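The combined score might be computed as sketched below, reading the two angles as those between the unknown sequence's vector and the query's identity and homology vectors respectively. Vectors are plain dictionaries here for brevity. The value of the weighting factor is not given in the slides, so the default `lam=0.5` is purely an assumption, as are the function names.

```python
import math


def cosine(u, v):
    """Cosine between two sparse vectors stored as dicts (4-gram -> weight)."""
    dot = sum(w * v[g] for g, w in u.items() if g in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_score(identity_vec, homology_vec, unknown_vec, lam=0.5):
    """Score = cos(identity angle) + lam * cos(homology angle); lam is assumed."""
    return cosine(identity_vec, unknown_vec) + lam * cosine(homology_vec, unknown_vec)


identity_vec = {"SPSQ": 1.0}  # query's identity 4-grams
homology_vec = {"TPSQ": 1.0}  # query's homology 4-grams
unknown_vec = {"TPSQ": 1.0}   # unknown sequence shares only a homology 4-gram
score = combined_score(identity_vec, homology_vec, unknown_vec)
```

In this toy case the identity term is zero and the whole score comes from the homology term, which is the situation the homology vector is meant to rescue.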

4-gram Search Results

Correlation between cosine value and Sequence alignment % identity

Conclusions
• The use of homology 4-grams improves detection of distant sequences (30-55% sequence identity).
• The 4-gram based method also seems suitable for sequence search.
• After precalculation of the sequences' 4-gram vectors, two sequences can be compared with time complexity O(1), i.e. independent of sequence length for a fixed vector length k.
