package containers

  1. Overview
  2. Docs
Legend:
Library
Module
Module type
Parameter
Class
Class type

Levenshtein distance

We take inspiration from http://blog.notdot.net/2010/07/Damn-Cool-Algorithms-Levenshtein-Automata for the main algorithm and ideas. However some parts are adapted

type 'a sequence = ('a -> unit) -> unit
type 'a gen = unit -> 'a option

Abstraction over Strings

Due to the existence of several encodings and string representations we abstract over the type of strings. A string is a finite array of characters (8-bits char, unicode runes, etc.) which provides a length operation and a function to access the n-th character.

module type STRING = sig ... end

Continuation list

This data structure is used to represent a list of result that is evaluated only as far as the user wants. If the user only wants a few elements, she doesn't pay for the remaining ones.

In particular, when matching a string against a (big) set of indexed strings, we return a continuation list so that, even if there are many results, only those actually asked for are evaluated.

type 'a klist = unit -> [ `Nil | `Cons of 'a * 'a klist ]
val klist_to_list : 'a klist -> 'a list

Helper for short lists.

Signature

The signature for a given string representation provides 3 main things:

  • a edit_distance function to compute the edit distance between strings
  • an automaton type that is built from a string s and a maximum distance n, and only accepts the strings s' such that edit_distance s s' <= n.
  • an Index module that can be used to map many strings to values, like a regular string map, but for which retrieval is fuzzy (for a given maximal distance).

A possible use of the index could be:

let words = CCIO.with_in "/usr/share/dict/words"
  (fun i -> CCIO.read_all i |> CCString.Split.list_cpy ~by:"\n");;

let words = List.map (fun s->s,s) words;;
let idx = CCLevenshtein.Index.of_list words;;

CCLevenshtein.Index.retrieve ~limit:1 idx "hell" |> CCLevenshtein.klist_to_list;;
module type S = sig ... end

Functor

module Make (Str : STRING) : S with type string_ = Str.t and type char_ = Str.char_

Default instance: string

include S with type char_ = char and type string_ = string
type char_ = char
type string_ = string
Edit Distance
val edit_distance : string_ -> string_ -> int

Edition distance between two strings. This satisfies the classical distance axioms: it is always positive, symmetric, and satisfies the formula distance a b + distance b c >= distance a c

Automaton

An automaton, built from a string s and a limit n, that accepts every string that is at distance at most n from s.

type automaton

Levenshtein automaton

val of_string : limit:int -> string_ -> automaton

Build an automaton from a string, with a maximal distance limit. The automaton will accept strings whose edit_distance to the parameter is at most limit.

val of_list : limit:int -> char_ list -> automaton

Build an automaton from a list, with a maximal distance limit

val match_with : automaton -> string_ -> bool

match_with a s matches the string s against a, and returns true if the distance from s to the word represented by a is smaller than the limit used to build a

Index for one-to-many matching
module Index : sig ... end
val debug_print : out_channel -> automaton -> unit