Hamming Distance • Very efficient, but only defined for strings of the same length. • It simply counts the number of positions at which the characters differ. • Won't help much for us.
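A minimal sketch of the idea in Python (the function name is just illustrative):

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance is only defined for equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(a, b))

print(hamming_distance("karolin", "kathrin"))  # 3
```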
Levenshtein distance • It measures distance as the number of "operations" required to transform one string into another. • These operations are insertion, deletion and substitution. • Damerau-Levenshtein distance also includes transposition. • This may be useful for spelling correction, but I am not sure how efficient it will be in our case.
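A sketch of the standard dynamic-programming formulation with unit costs (not any particular library's implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance using insertions, deletions and substitutions, each costing 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```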
Needleman-Wunsch • This algorithm is like Damerau-Levenshtein but with a weighted edit distance; it is used in biology. • Mainly used for sequence alignment. • So obviously we don't need it.
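For reference, the scoring part is a short dynamic program; the match/mismatch/gap weights below are illustrative assumptions, since the algorithm leaves them configurable:

```python
def needleman_wunsch_score(a: str, b: str,
                           match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score with configurable (weighted) match/mismatch/gap values."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

print(needleman_wunsch_score("GATTACA", "GCATGCU"))  # 0 with these weights
```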
Smith–Waterman algorithm • Like the Needleman-Wunsch algorithm, this is also mainly used for alignment (local rather than global). • It is also used in biology. • The Gotoh variant is also used to find alignments.
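The only real change from the sketch above is that scores are floored at zero and the best cell anywhere in the matrix is taken, which is what makes the alignment local; again the weights are just assumptions:

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Local alignment score: cells are floored at 0, answer is the best cell."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best

print(smith_waterman_score("martha", "marhta"))  # 8 with these weights
```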
Jaro-Winkler Similarity • The order of occurrence is an essential part of determining similarity. • For instance, in the strings "martha" and "marhta" the transposed "th" and "ht" still count as matching characters because they are within 2 characters of each other, so the pair scores very highly (though not as a complete match). • The more transpositions found between the two strings, the smaller the overall matching weight.
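A sketch of the computation; the 0.1 prefix weight and 4-character prefix cap are the commonly used defaults, not something fixed by these slides:

```python
def jaro_winkler(a: str, b: str, p: float = 0.1) -> float:
    """Jaro similarity plus the Winkler bonus for a shared prefix (up to 4 chars)."""
    if not a and not b:
        return 1.0
    window = max(len(a), len(b)) // 2 - 1
    a_match = [False] * len(a)
    b_match = [False] * len(b)

    # Characters "match" if equal and within the window of each other.
    matches = 0
    for i, ca in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_match[j] and b[j] == ca:
                a_match[i] = b_match[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: half the matched characters that appear out of order.
    k = transpositions = 0
    for i, ca in enumerate(a):
        if a_match[i]:
            while not b_match[k]:
                k += 1
            if ca != b[k]:
                transpositions += 1
            k += 1
    transpositions //= 2

    jaro = (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3

    # Winkler prefix bonus.
    prefix = 0
    for ca, cb in zip(a, b):
        if ca != cb or prefix == 4:
            break
        prefix += 1
    return jaro + prefix * p * (1 - jaro)

print(round(jaro_winkler("martha", "marhta"), 3))  # ~0.961, high but not 1.0
```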
Matching coefficient • This is essentially the same as Hamming distance with one change: position is not important. • It simply counts the number of terms present in both strings: |a ∩ b| • It doesn't take into account the sizes of a and b. • There are some metrics that use the same idea but include the sizes of a and b; those follow on the next slides, and any one of them may be helpful for us.
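A minimal sketch, assuming whitespace tokenization into word sets (the tokenization choice is an assumption here):

```python
def matching_coefficient(a: str, b: str) -> int:
    """|a ∩ b| over word tokens: how many distinct terms the two strings share."""
    return len(set(a.split()) & set(b.split()))

print(matching_coefficient("the quick brown fox", "the slow brown dog"))  # 2
```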
Jaccard coefficient • The sentence is tokenized into words, and the words are then compared with the other sentence's words. • |a ∩ b| / |a ∪ b| • This is one of the most efficient algorithms. • The Overlap coefficient is similar, with a slight modification to the formula: |a ∩ b| / min(|a|, |b|)
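A sketch of both formulas under the same whitespace-tokenization assumption:

```python
def jaccard(a: str, b: str) -> float:
    """|a ∩ b| / |a ∪ b| over word tokens."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def overlap(a: str, b: str) -> float:
    """|a ∩ b| / min(|a|, |b|) over word tokens."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / min(len(sa), len(sb)) if sa and sb else 0.0

print(jaccard("the quick brown fox", "the slow brown dog"))  # 2/6 ≈ 0.33
print(overlap("the quick brown fox", "the slow brown dog"))  # 2/4 = 0.5
```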
Sørensen Similarity • Same as Jaccard similarity but with a different formula. • Similarity = 2 × |a ∩ b| / (|a| + |b|) • This is identical to Dice's coefficient. • These may all be considered normalised versions of the simple matching coefficient.
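And the corresponding sketch, again over whitespace-tokenized word sets:

```python
def sorensen_dice(a: str, b: str) -> float:
    """2·|a ∩ b| / (|a| + |b|) over word tokens; identical to Dice's coefficient."""
    sa, sb = set(a.split()), set(b.split())
    return 2 * len(sa & sb) / (len(sa) + len(sb)) if sa or sb else 1.0

print(sorensen_dice("the quick brown fox", "the slow brown dog"))  # 2*2/8 = 0.5
```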
Other metrics • Other metrics such as SFS, Tau, confusion probability, skew divergence, cosine similarity, TF-IDF, etc. are either not useful for us or involve heavy computation, which is not feasible in our case.