Unstrctrd (Unstructured) is a lexer/parser according to RFC822. It accepts any input which respects the ABNF described by RFC5322 (including the obsolete forms). To put the purpose in context: an email header, a part of the DEB format, or an HTTP 1.1 header all respect, at least, one form, the unstructured form, which allows a value to be split across lines with a folding-whitespace token.
This token makes it possible to limit any value to 80 characters per line:
To: Romain Calascibetta\r\n <email@example.com>
Other forms, such as an email address or a subject, should then be, at least, a subset of this form. The goal of this library is to delegate the complexity of this form to a small and basic library.
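To make folding concrete, here is a minimal sketch of the reverse operation (unfolding) on a raw string, written without Unstrctrd. The helper name `unfold` is hypothetical and is not part of the library; it only removes each CRLF that is immediately followed by a space or a tab, which is how RFC5322 describes unfolding.

```ocaml
(* Hypothetical helper, not part of Unstrctrd: remove every CRLF that is
   immediately followed by a space or a horizontal tab. *)
let unfold (s : string) : string =
  let len = String.length s in
  let buf = Buffer.create len in
  let rec go i =
    if i >= len then ()
    else if
      i + 2 < len
      && s.[i] = '\r'
      && s.[i + 1] = '\n'
      && (s.[i + 2] = ' ' || s.[i + 2] = '\t')
    then go (i + 2) (* skip the CRLF, keep the whitespace that follows *)
    else begin
      Buffer.add_char buf s.[i];
      go (i + 1)
    end
  in
  go 0;
  Buffer.contents buf

let () =
  (* prints: To: Romain Calascibetta <email@example.com> *)
  print_endline (unfold "To: Romain Calascibetta\r\n <email@example.com>")
```

A CRLF that is not followed by whitespace is left untouched, since it ends the value rather than folding it.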
Unstrctrd handles UTF-8 as well (RFC6532). Any input should always be terminated by CRLF. Otherwise, you can use
A usual process with Unstrctrd is to use of_string and delete FWS with
let ( >>= ) = Result.bind ;;
let parse str = of_string str >>= fun (_, t) -> Ok (fold_fws t) ;;
You can canonicalize a string too: parse the given string, delete FWS, and regenerate the string without any FWS, such as:
# let canon str =
    let (_, t) = safely_decode str in
    let t = replace_invalid_bytes ~f:(fun _ -> None) t in
    let t = fold_fws t in
    to_utf_8_string t ;;
# canon "Hello\r\n World!" ;;
- : string = "Hello World!"
type elt = [
  | `Uchar of Uchar.t
  | `WSP of wsp
  | `FWS of wsp
  | `OBS_NO_WS_CTL of obs
  | `Invalid_char of invalid_char
]
type t = private elt list
val empty : t
val length : t -> int
of_string raw tries to parse raw and extract the unstructured form. raw should, at least, be terminated by CRLF.
val safely_decode : string -> int * t
safely_decode str parses the given string and returns a t and how many bytes it consumed. The process systematically appends a CRLF to the given string so that it never fails.
val replace_invalid_bytes : f:(invalid_char -> elt option) -> t -> t
replace_invalid_bytes ~f t replaces or deletes invalid bytes in the given t. You probably can replace them by
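The replace-or-delete behaviour can be sketched on a simplified model. The `elt` type below is a reduced stand-in with plain `string` and `char` payloads, and `replace_invalid` is a hypothetical name; neither is the library's definition. Substituting U+FFFD is only one possible choice, taken here for illustration.

```ocaml
(* Simplified stand-in for Unstrctrd's elt, for illustration only. *)
type elt = [ `Uchar of Uchar.t | `WSP of string | `Invalid_char of char ]

(* Same shape as replace_invalid_bytes: [f] returns [Some e] to substitute
   an element, or [None] to drop the invalid byte. *)
let replace_invalid ~(f : char -> elt option) (t : elt list) : elt list =
  List.filter_map (function `Invalid_char c -> f c | e -> Some e) t

let input : elt list =
  [ `Uchar (Uchar.of_char 'a'); `Invalid_char '\xff'; `WSP " " ]

(* Assumption: substitute the Unicode replacement character U+FFFD. *)
let replaced = replace_invalid ~f:(fun _ -> Some (`Uchar Uchar.rep)) input

(* Or delete invalid bytes altogether. *)
let dropped = replace_invalid ~f:(fun _ -> None) input
```

Returning an option from `f` lets a single function cover both substitution and deletion.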
of_list lst tries to coerce
t. It verifies that
lst can not produce CRLF terminating token (eg.
to_utf_8_string t returns a valid UTF-8 string of t. The given t must not contain `Invalid_char; you probably should clean it up with
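The constraint on `Invalid_char can be read this way: only code points have a UTF-8 encoding, while a raw invalid byte has none. As a minimal sketch using only the OCaml standard library, the hypothetical helper `utf_8_of_uchars` (not part of Unstrctrd) encodes a list of code points:

```ocaml
(* Hypothetical helper: encode a list of code points as a UTF-8 string
   with the standard library's Buffer. *)
let utf_8_of_uchars (us : Uchar.t list) : string =
  let buf = Buffer.create 16 in
  List.iter (Buffer.add_utf_8_uchar buf) us;
  Buffer.contents buf

let () =
  (* U+00E9 is encoded on two bytes, 0xC3 0xA9. *)
  assert (utf_8_of_uchars [ Uchar.of_int 0x48; Uchar.of_int 0xE9 ] = "H\xC3\xA9")
```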
val wsp : len:int -> elt
val tab : len:int -> elt
val fws : ?tab:bool -> int -> elt
without_comments t tries to delete any comment of t. A comment is a part which begins with '(' and ends with ')'. If we find an unmatched parenthesis, we return an error.
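The rule above can be sketched on plain strings. `strip_comments` is a hypothetical helper, not the library's implementation. It assumes that comments may nest and it ignores escaped parentheses; both are simplifications made for this example.

```ocaml
(* Hypothetical helper on plain strings: drop every parenthesised comment.
   Assumption: comments may nest, so we track the nesting depth. *)
let strip_comments (s : string) : (string, [ `Msg of string ]) result =
  let buf = Buffer.create (String.length s) in
  let rec go i depth =
    if i >= String.length s then
      if depth = 0 then Ok (Buffer.contents buf)
      else Error (`Msg "unclosed '('")
    else
      match s.[i] with
      | '(' -> go (i + 1) (depth + 1)
      | ')' when depth = 0 -> Error (`Msg "unexpected ')'")
      | ')' -> go (i + 1) (depth - 1)
      | c ->
        (* Keep the character only when we are outside any comment. *)
        if depth = 0 then Buffer.add_char buf c;
        go (i + 1) depth
  in
  go 0 0
```

For instance, `strip_comments "a (b (c) d) e"` yields `Ok "a  e"`, while `strip_comments "a (b"` yields an error because the opening parenthesis is never closed.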
val split_on : on:[ `WSP | `FWS | `Uchar of Uchar.t | `Char of char | `LF | `CR ] -> t -> (t * t) option
split_on ~on t is either the pair (t0, t1) of the two (possibly empty) subparts of t that are delimited by the first match of on, or None if on can't be matched in t. When a pair is returned, t0 ^ sep ^ t1 = t holds.
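The same contract can be illustrated on ordinary lists. `split_on_first` is a hypothetical generic helper, not the library's function: it splits at the first element equal to `on` and leaves the separator out of both halves, so concatenating the halves around the separator gives back the input.

```ocaml
(* Hypothetical generic version on lists: split at the first element equal
   to [on]; the separator itself is in neither half. *)
let split_on_first ~(on : 'a) (t : 'a list) : ('a list * 'a list) option =
  let rec go acc = function
    | [] -> None
    | x :: rest when x = on -> Some (List.rev acc, rest)
    | x :: rest -> go (x :: acc) rest
  in
  go [] t

let () =
  let t = [ 'a'; ':'; 'b'; ':'; 'c' ] in
  match split_on_first ~on:':' t with
  | Some (t0, t1) ->
    (* Only the first separator is consumed. *)
    assert (t0 = [ 'a' ] && t1 = [ 'b'; ':'; 'c' ]);
    (* The invariant: both halves around the separator rebuild the input. *)
    assert (t0 @ [ ':' ] @ t1 = t)
  | None -> assert false
```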