package spelll

  1. Overview
  2. Docs

Levenshtein distance and index

We take inspiration from this blog for the main algorithm and ideas. However some parts are adapted

Abstraction over Strings

Due to the existence of several encodings and string representations we abstract over the type of strings. A string is a finite array of characters (8-bits char, unicode runes, etc.) which provides a length operation and a function to access the n-th character.

module type STRING = sig ... end

Continuation list

This data structure is used to represent a list of result that is evaluated only as far as the user wants. If the user only wants a few elements, she doesn't pay for the remaining ones.

In particular, when matching a string against a (big) set of indexed strings, we return a continuation list so that, even if there are many results, only those actually asked for are evaluated.

type 'a klist = [
  1. | `Nil
  2. | `Cons of 'a * (unit -> 'a klist)
]
val klist_to_list : 'a klist -> 'a list

Helper.

Signature

The signature for a given string representation provides 3 main things:

  • a edit_distance function to compute the edit distance between strings
  • an automaton type that is built from a string s and a maximum distance n, and only accepts the strings s' such that edit_distance s s' <= n.
  • an Index module that can be used to map many strings to values, like a regular string map, but for which retrieval is fuzzy (for a given maximal distance).

A possible use of the index could be:

open Batteries;;

let words = File.with_file_in "/usr/share/dict/english"
  (fun i -> IO.read_all i |> String.nsplit ~by:"\\n");;

let words = List.map (fun s->s,s) words;;
let idx = Spelll.Index.of_list words;;

Spelll.Index.retrieve ~limit:1 idx "hell" |> Spelll.klist_to_list;;

Here we use Batteries to read a dictionary file into a list of words; then we create an index that maps every string to itself (a set of strings, really), and finally we find every string at distance at most 1 from "hell" (including "hello" for instance).

module type S = sig ... end
module Make (Str : STRING) : S with type string_ = Str.t and type char_ = Str.char_
include S with type char_ = char and type string_ = string
type char_ = char
type string_ = string
Edit Distance
val edit_distance : string_ -> string_ -> int

Edition distance between two strings. This satisfies the classical distance axioms: it is always positive, symmetric, and satisfies the formula distance a b + distance b c >= distance a c

Automaton

An automaton, built from a string s and a limit n, that accepts every string that is at distance at most n from s.

type automaton

Levenshtein automaton

val of_string : limit:int -> string_ -> automaton

Build an automaton from a string, with a maximal distance limit. The automaton will accept strings whose edit_distance to the parameter is at most limit.

val of_list : limit:int -> char_ list -> automaton

Build an automaton from a list, with a maximal distance limit

val match_with : automaton -> string_ -> bool

match_with a s matches the string s against a, and returns true if the distance from s to the word represented by a is smaller than the limit used to build a

Index for one-to-many matching
module Index : sig ... end
val debug_print : out_channel -> automaton -> unit