unstrctrd 0.3 · OCaml Package

Unstrctrd.

Unstrctrd (Unstructured) is a lexer/parser according to RFC 822. It accepts any input that respects the ABNF described by RFC 5322 (including the obsolete forms). To put its purpose in context: an email header, part of the DEB format, and an HTTP/1.1 header all respect, at the very least, one form, the unstructured form, which allows a value to be split across lines with a folding-whitespace (FWS) token.

This token makes it possible to limit any value to 80 characters per line:

To: Romain Calascibetta\r\n
 <romain@calascibetta.org>

Other forms, such as an email address or a subject, should therefore be, at the very least, a subset of this form. The goal of this library is to delegate the complexity of this form to a small, basic library.
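To make the folding rule concrete, here is a minimal stdlib-only sketch of "unfolding": it drops every CRLF that is immediately followed by a space or a tab and keeps the whitespace itself. The name unfold is hypothetical and this is not Unstrctrd's implementation; it works on raw bytes and ignores the obsolete forms that the library accepts.

```ocaml
(* Stdlib-only sketch (hypothetical helper, not part of Unstrctrd):
   remove each CRLF that is directly followed by a space or a tab. *)
let unfold (s : string) : string =
  let len = String.length s in
  let buf = Buffer.create len in
  let is_wsp c = c = ' ' || c = '\t' in
  let rec go i =
    if i >= len then ()
    else if i + 2 < len && s.[i] = '\r' && s.[i + 1] = '\n' && is_wsp s.[i + 2]
    then go (i + 2) (* skip the CRLF, keep the following whitespace *)
    else (Buffer.add_char buf s.[i]; go (i + 1))
  in
  go 0;
  Buffer.contents buf
```

Applied to the folded header above (terminated by a final CRLF), the sketch yields the header on a single line, with the terminating CRLF left in place because no whitespace follows it.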

Unstrctrd handles UTF-8 as well (RFC 6532). Any input should always end with CRLF. Otherwise, you can use safely_decode.

A usual process with Unstrctrd is to use of_string and delete FWS with fold_fws, like:

let ( >>= ) = Result.bind
let parse str = of_string str >>= fun (_, t) -> Ok (fold_fws t) ;;

You can canonicalize a string too: parse the given string, delete FWS, and regenerate the string without any FWS, such as:

# let canon str =
    let (_, t) = safely_decode str in
    let t = replace_invalid_bytes ~f:(fun _ -> None) t in
    let t = fold_fws t in
    to_utf_8_string t ;;
# canon "Hello\r\n World!" ;;
- : string = "Hello World!"

type elt = [
| `Uchar of Stdlib.Uchar.t
| `WSP of wsp
| `LF
| `CR
| `FWS of wsp
| `d0
| `OBS_NO_WS_CTL of obs
| `Invalid_char of invalid_char
]

and wsp = private string

and obs = private char

and invalid_char = private char

type t = private elt list

type error = [
| `Msg of string
]

val empty : t

val length : t -> int

val of_string : string -> (int * t, [> error ]) Stdlib.result

of_string raw tries to parse raw and extract the unstructured form. raw should, at least, end with CRLF.

val safely_decode : string -> int * t

safely_decode str parses the given string and returns a t along with the number of bytes it consumed. The process systematically appends a CRLF to the given string so that it never fails.

val replace_invalid_bytes : f:(invalid_char -> elt option) -> t -> t

replace_invalid_bytes ~f t replaces or deletes invalid bytes in the given t: f returns Some elt to substitute an element, or None to drop the byte. You can, for instance, replace them with `Uchar Uutf.u_rep.
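The idea of substituting the replacement character for invalid bytes can be sketched on a plain string with the standard library alone. This is an illustration, not how Unstrctrd does it: the name sanitize is hypothetical, the sketch assumes OCaml 4.14 or later (for String.get_utf_8_uchar), and it uses Stdlib's Uchar.rep (U+FFFD) in place of Uutf.u_rep.

```ocaml
(* Stdlib-only sketch (hypothetical helper, assumes OCaml >= 4.14):
   re-encode a string as UTF-8, replacing every invalid byte sequence
   with [rep], which defaults to U+FFFD. *)
let sanitize ?(rep = Uchar.rep) (s : string) : string =
  let buf = Buffer.create (String.length s) in
  let rec go i =
    if i < String.length s then begin
      let d = String.get_utf_8_uchar s i in
      let u =
        if Uchar.utf_decode_is_valid d then Uchar.utf_decode_uchar d else rep
      in
      Buffer.add_utf_8_uchar buf u;
      (* advance by the number of bytes the decoder consumed *)
      go (i + Uchar.utf_decode_length d)
    end
  in
  go 0;
  Buffer.contents buf
```

Valid UTF-8 passes through unchanged, while a stray byte such as 0xFF comes out as the three-byte encoding of U+FFFD.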

val of_list : elt list -> (t, [> error ]) Stdlib.result

of_list lst tries to coerce lst to t. It verifies that lst cannot produce a terminating CRLF token (e.g. [`CR; `LF]).
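As a rough illustration of the kind of check described here, the sketch below scans a list of polymorphic variants for a `CR directly followed by a `LF. The name has_crlf is hypothetical, and the actual validation performed by of_list may be stricter or different; this only covers the example given above.

```ocaml
(* Stdlib-only sketch (hypothetical helper, not Unstrctrd's code):
   detect a `CR immediately followed by a `LF in a list of elements. *)
let rec has_crlf = function
  | `CR :: `LF :: _ -> true
  | _ :: rest -> has_crlf rest
  | [] -> false
```

A list for which has_crlf returns true is one that, under this assumption, would be rejected with an error instead of being coerced.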

val to_utf_8_string : ?rep:Stdlib.Uchar.t -> t -> string

to_utf_8_string t returns a valid UTF-8 string of t. The given t must not contain `Invalid_char; you should probably clean it up with replace_invalid_bytes first.

val iter : f:(elt -> unit) -> t -> unit

val fold : f:('a -> elt -> 'a) -> 'a -> t -> 'a

val map : f:(elt -> elt) -> t -> t

val wsp : len:int -> elt

val tab : len:int -> elt

val fws : ?tab:bool -> int -> elt

val without_comments : t -> (t, [> error ]) Stdlib.result

without_comments t tries to delete every comment in t. A comment is a part that begins with '(' and ends with ')'. If an unmatched parenthesis is found, an error is returned.
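The behaviour described above can be sketched on a plain string with the standard library. The name strip_comments and the error message are hypothetical; the sketch assumes that comments may nest and ignores escaped parentheses, which may not match what without_comments does on a t.

```ocaml
(* Stdlib-only sketch (hypothetical helper, not Unstrctrd's code):
   drop everything between matching parentheses and fail if a
   parenthesis has no partner. Nesting is tracked with a counter. *)
let strip_comments (s : string) : (string, [> `Msg of string ]) result =
  let buf = Buffer.create (String.length s) in
  let depth = ref 0 in
  let unbalanced = ref false in
  String.iter
    (fun c ->
      match c with
      | '(' -> incr depth
      | ')' -> if !depth = 0 then unbalanced := true else decr depth
      | c -> if !depth = 0 then Buffer.add_char buf c)
    s;
  if !unbalanced || !depth <> 0 then Error (`Msg "Unbalanced parenthesis")
  else Ok (Buffer.contents buf)
```

Note that the text around a comment is kept verbatim, so the whitespace on both sides of a removed comment survives.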

val fold_fws : t -> t

val split_at : index:int -> t -> t * t

val split_on : 
  on:[ `WSP | `FWS | `Uchar of Stdlib.Uchar.t | `Char of char | `LF | `CR ] ->
  t ->
  (t * t) option

split_on ~on t is either the pair (t0, t1) of the two (possibly empty) subparts of t that are delimited by the first match of on or None if on can't be matched in t.

The invariant t0 ^ sep ^ t1 = t holds.
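The contract of split_on, including the invariant, can be illustrated on plain strings. The sketch below is an analogy only: split_on_char_once is a hypothetical stdlib-only helper that splits on the first occurrence of a single character, whereas the library function works on a t and accepts richer separators.

```ocaml
(* Stdlib-only sketch (hypothetical helper, not Unstrctrd's code):
   split a string at the first occurrence of [on], or return None
   when [on] does not occur. The separator itself is dropped. *)
let split_on_char_once ~(on : char) (s : string) : (string * string) option =
  match String.index_opt s on with
  | None -> None
  | Some i ->
    let t0 = String.sub s 0 i in
    let t1 = String.sub s (i + 1) (String.length s - i - 1) in
    Some (t0, t1)
```

With this helper, gluing t0, the separator and t1 back together gives the original string, which mirrors the invariant stated above; either part may be empty when the separator sits at one end.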
