Wikipedia

String metric

In mathematics and computer science, a string metric (also known as a string similarity metric or string distance function) is a metric that measures the distance ("inverse similarity") between two text strings, for use in approximate string matching or comparison and in fuzzy string searching. Unlike string matching, a string metric is required to satisfy the triangle inequality. For example, the strings "Sam" and "Samuel" can be considered close.[1] A string metric returns a number that gives an algorithm-specific indication of distance.

The most widely known string metric is a rudimentary one called the Levenshtein distance (also known as edit distance).[2] It operates on two input strings, returning the number of insertions, deletions and substitutions needed to transform one input string into the other. The field has since expanded beyond simplistic string metrics such as Levenshtein distance to include phonetic, token, grammatical and character-based methods of statistical comparison.
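
The edit-counting idea can be illustrated with a minimal Python sketch of the usual dynamic-programming formulation. The function name `levenshtein` and the unit cost for every edit are assumptions made for illustration, not something the article prescribes.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `a` into `b` (unit costs assumed)."""
    # previous[j] is the distance between the prefix of `a` processed so far
    # and the first j characters of `b`; it starts as the empty prefix of `a`.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb, or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("Sam", "Samuel"))      # 3
```

Only two rows of the table are kept at a time, so memory use is proportional to the length of the second string.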

String metrics are used heavily in information integration and are currently used in areas including fraud detection, fingerprint analysis, plagiarism detection, ontology merging, DNA analysis, RNA analysis, image analysis, evidence-based machine learning, database data deduplication, data mining, incremental search, data integration, malware detection,[3] and semantic knowledge integration.

List of string metrics

  • Levenshtein distance, or its generalization edit distance
  • Damerau–Levenshtein distance
  • Sørensen–Dice coefficient
  • Block distance or L1 distance or City block distance
  • Hamming distance
  • Jaro–Winkler distance
  • Simple matching coefficient (SMC)
  • Jaccard similarity or Jaccard coefficient or Tanimoto coefficient
  • Tversky index
  • Overlap coefficient
  • Variational distance
  • Hellinger distance or Bhattacharyya distance
  • Information radius (Jensen–Shannon divergence)
  • Skew divergence
  • Confusion probability
  • Tau metric, an approximation of the Kullback–Leibler divergence
  • Fellegi and Sunter metric (SFS)
  • Maximal matches
  • Grammar-based distance
  • TFIDF distance metric[4]
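
Several of the listed measures compare sets of tokens rather than counting edits. As a rough sketch, the Jaccard and Sørensen–Dice coefficients can be computed over character bigrams as follows. The choice of bigrams as tokens, the helper names `bigrams`, `jaccard` and `dice`, and the convention of returning 1.0 when both token sets are empty are all assumptions made for this example.

```python
def bigrams(s: str) -> set:
    """Set of adjacent character pairs in `s` (assumed tokenization)."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard coefficient: |X ∩ Y| / |X ∪ Y| over bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0  # assumed convention for two empty token sets
    return len(x & y) / len(x | y)


def dice(a: str, b: str) -> float:
    """Sørensen–Dice coefficient: 2 |X ∩ Y| / (|X| + |Y|) over bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0  # assumed convention for two empty token sets
    return 2 * len(x & y) / (len(x) + len(y))


# "night" and "nacht" share only the bigram "ht".
print(dice("night", "nacht"))     # 0.25
print(jaccard("night", "nacht"))  # 1/7, about 0.143
```

Both functions return a similarity between 0 and 1; subtracting the result from 1 gives a corresponding distance.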

Selected string measures examples

Name: Example

Hamming distance: "karolin" and "kathrin" have a distance of 3.

Levenshtein distance and Damerau–Levenshtein distance: "kitten" and "sitting" have a distance of 3.
  1. kitten → sitten (substitution of "s" for "k")
  2. sitten → sittin (substitution of "i" for "e")
  3. sittin → sitting (insertion of "g" at the end)

Jaro–Winkler distance: for JaroWinklerDist("MARTHA","MARHTA"), the underlying Jaro similarity is $d_j = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m-t}{m}\right) = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{6-1}{6}\right) \approx 0.944$, where
  • $m$ is the number of matching characters;
  • $t$ is half the number of transpositions ("MARTHA"[3]!=H, "MARHTA"[3]!=T).

Most frequent k characters: MostFreqKeySimilarity('research', 'seeking', 2) = 2
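
The Hamming and Jaro–Winkler examples above can be reproduced with a short Python sketch. The function names are hypothetical. The matching window of half the longer length minus one, the prefix scaling factor `p = 0.1` and the prefix cap of 4 characters are the commonly used values and are assumed here; the article itself does not state them.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters match only if equal and no further apart than this window.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    m = 0  # number of matching characters
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t is half the number of matched characters that are out of order.
    seq1 = [c for c, hit in zip(s1, matched1) if hit]
    seq2 = [c for c, hit in zip(s2, matched2) if hit]
    t = sum(x != y for x, y in zip(seq1, seq2)) / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    dj = jaro(s1, s2)
    prefix = 0
    for x, y in zip(s1[:4], s2[:4]):
        if x != y:
            break
        prefix += 1
    return dj + prefix * p * (1 - dj)


print(hamming("karolin", "kathrin"))                # 3
print(round(jaro("MARTHA", "MARHTA"), 3))           # 0.944
print(round(jaro_winkler("MARTHA", "MARHTA"), 3))   # 0.961
```

For "MARTHA" and "MARHTA" all six characters match and the swapped "T" and "H" count as one transposition, which gives the Jaro value of about 0.944; the shared prefix "MAR" then raises the Jaro–Winkler value to about 0.961 under the assumed parameters.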


References

  1. Lu, Jiaheng; et al. (2013). "String similarity measures and joins with synonyms". Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data: 373–384. doi:10.1145/2463676.2465313. ISBN 9781450320375.
  2. Navarro, Gonzalo (2001). "A guided tour to approximate string matching". ACM Computing Surveys. 33 (1): 31–88. doi:10.1145/375360.375365.
  3. Dolev, Shlomi; Ghanayim, Mohammad; Binun, Alexander; Frenkel, Sergey; Sun, Yeali S. (2017). "Relationship of Jaccard and edit distance in malware clustering and online identification". 16th IEEE International Symposium on Network Computing and Applications: 369–373.
  4. Cohen, William; Ravikumar, Pradeep; Fienberg, Stephen (2003-08-01). "A Comparison of String Distance Metrics for Name-Matching Tasks": 73–78.

