package fuzzy_compare

  1. Overview
  2. Docs
Fastest bounded Levenshtein comparator over generic structures

Install

Dune Dependency

Authors

Maintainers

Sources

1.0.0.tar.gz
md5=455dc4fbe810bffb087b84ab638994fc
sha512=52d41e9686b77206dd3b9efdfef3c9511cb8a66927af7548ee5454258232a7f1e3aeb1106ae49f75bc3d04dada962a58ed8e2cfcd5d902f0d5465357dd3541dd

Description

This library does not calculate the edit distance.

Rather, it provides extremely efficient automata that answer whether 2 values are within a predetermined number of edits of one another.

Once generated, an automaton can be reused to compare any 2 values in around 2-8 µs.

Published: 28 Aug 2022

README

fuzzy_compare

Extremely fast Levenshtein comparator.

Checks whether 2 values are within a predetermined number of edits of one another.

  • most comparisons take under 5 µs, depending on the length of the values

  • a Functor is provided to enable comparisons across any arbitrary types

  • string comparisons are provided (functorized) out of the box

  • reuse the same automaton across all comparisons with the same max_edits, regardless of the type of the values being compared

  • max_edits must be between 0 and 3 (inclusively) due to the astronomical scaling factor during graph building

Example

(* Create an automaton *)
let within3 = Fuzzy_compare.create ~max_edits:3 in

(* true *)
let b1 = Fuzzy_compare.String.eval within3 "kitten" "kitsch" in

(* false *)
let b2 = Fuzzy_compare.String.eval within3 "kittens" "kitsch" in

Unicode / arbitrary types

This example demonstrates how to compare unicode string as well as how to compare your own types.

For unicode we'll need uunf and uuseg:

opam install uunf uuseg

First apply the Make functor:

(* Fuzzy_compare.Make is documented in [fuzzy_compare.mli] *)
module Utf8 = Fuzzy_compare.Make (struct
  type t = string

  type cmp = string [@@deriving equal]

  type index = string array

  let index x =
    (* Normalize to Unicode NFC,
       then break the string into an array of Grapheme Clusters *)
    Uunf_string.normalize_utf_8 `NFC x
    |> Uuseg_string.fold_utf_8 `Grapheme_cluster (fun acc s -> s :: acc) []
    |> Array.of_list_rev

  let length = Array.length

  let get = Array.get

  let fold = Array.fold
end)

Then proceed as usual:

let within2 = Fuzzy_compare.create ~max_edits:2 in

(* false because each emoji is made of multiple bytes *)
let b1 = Fuzzy_compare.String.eval within2 "🏁🏴🚩" "🚩🏴🏁" in


(* true thanks to Unicode awareness *)
let b1 = Utf8.eval within2 "🏁🏴🚩" "🚩🏴🏁" in

(* false *)
let b2 = Utf8.eval within2 "🏁🏴🚩" "🚩🏳️" in

Thanks

Thanks to Paul Masurel and Jules Jacob for explaining clearly the beauty of the algorithms contained within the "Fast String Correction with Levenshtein-Automata" paper, and thanks to Klaus Schulz and Stoyan Mihov for writing the paper.

This library implements the most sophisticated (and most efficient) version described by Paul Masurel, with several additional low level optimizations.

Recommended reading: https://julesjacobs.com/2015/06/17/disqus-levenshtein-simple-and-fast.html https://fulmicoton.com/posts/levenshtein/

Dependencies (3)

  1. core_kernel >= "v0.14.0" & < "v0.15.0"
  2. dune >= "1.9.0"
  3. ocaml >= "4.10.0"

Dev Dependencies (2)

  1. uunf with-test
  2. uuseg with-test

Used by

None

Conflicts

None

OCaml

Innovation. Community. Security.