Function weighSimilarityByCharacter

  • Measures similarity between two strings, taking into account the characters they share and the positions of those characters. Jaro-Winkler is often used in record linkage and data cleansing to improve the accuracy of string matching, particularly for names and addresses, because it gives extra weight to a common prefix. It is usually better suited to short strings such as words than Levenshtein distance, and differs from it in four ways:

    1. Edit operations: Levenshtein counts insertions, deletions, and substitutions, while Jaro counts the characters that match within a limited window and the transpositions among those matches.
    2. Sensitivity to string length: Levenshtein is more sensitive to overall string length, while Jaro normalizes for length in its formula.
    3. Prefix matching: The Jaro-Winkler variant explicitly rewards matching prefixes, which Levenshtein does not.
    4. Scale of results: Levenshtein produces an edit distance (usually converted to a similarity score), while Jaro directly produces a similarity score.
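
    The points above can be illustrated with a minimal sketch of the standard Jaro-Winkler formula. This is not this library's source: the names jaro and jaroWinklerSketch are hypothetical, and the prefix scale of 0.1 and the prefix cap of 4 characters are the conventional defaults, assumed here rather than taken from the implementation.

```typescript
// Hypothetical sketch of the textbook Jaro similarity; not the library's code.
function jaro(s1: string, s2: string): number {
  if (s1 === s2) return 1;
  const len1 = s1.length;
  const len2 = s2.length;
  if (len1 === 0 || len2 === 0) return 0;

  // Characters only count as a match if they are within this distance.
  const matchWindow = Math.max(0, Math.floor(Math.max(len1, len2) / 2) - 1);
  const matched1: boolean[] = [];
  const matched2: boolean[] = [];

  let matches = 0;
  for (let i = 0; i < len1; i++) {
    const lo = Math.max(0, i - matchWindow);
    const hi = Math.min(len2 - 1, i + matchWindow);
    for (let j = lo; j <= hi; j++) {
      if (!matched2[j] && s1[i] === s2[j]) {
        matched1[i] = true;
        matched2[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0;

  // Walk the matched characters of both strings in order and count
  // the positions where they disagree; each pair is one transposition.
  let k = 0;
  let outOfOrder = 0;
  for (let i = 0; i < len1; i++) {
    if (!matched1[i]) continue;
    while (!matched2[k]) k++;
    if (s1[i] !== s2[k]) outOfOrder++;
    k++;
  }
  const transpositions = outOfOrder / 2;

  // Each term is a ratio, which is how Jaro normalizes for string length.
  return (
    (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3
  );
}

// Winkler's adjustment: boost the Jaro score by the length of the shared
// prefix (capped at 4). The scale p = 0.1 is an assumed default.
function jaroWinklerSketch(s1: string, s2: string, p: number = 0.1): number {
  const j = jaro(s1, s2);
  const maxPrefix = Math.min(4, s1.length, s2.length);
  let prefix = 0;
  while (prefix < maxPrefix && s1[prefix] === s2[prefix]) prefix++;
  return j + prefix * p * (1 - j);
}
```

    Under these assumptions, jaroWinklerSketch("MARTHA", "MARHTA") evaluates to roughly 0.961: all six characters match, one transposition lowers the Jaro score to about 0.944, and the shared prefix "MAR" raises it again.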

    See also: A Comprehensive List of Similarity Search Algorithms

    Parameters

    • s1: string

      First string

    • s2: string

      Second string

    Returns number

    String similarity score between 0 and 1, where 1 means the strings are identical