open_packaging

A library for parsing Microsoft's Open Packaging Specification (most commonly used for Microsoft
README


The repo contains three libraries for reading data from Microsoft's document
formats ("Office Open XML").

  • open_packaging parses Office Open XML's "Open Packaging Conventions"
    (the container format for all Microsoft Office documents)

  • spreadsheetml parses the XML data in SpreadsheetML (i.e. Excel's XLSX
    format)

  • easy_xlsx reads XLSX documents, applies the formatting in the document,
    and returns the result as a list of sheets (with a sheet name and then
    a string list list of data). The goal of this library is to give the
    same results as if you exported the XLSX file to CSV and then read it with
    a standard CSV reader.

open_packaging and spreadsheetml are relatively safe to use but incomplete
(it should be obvious what they're mising -- if a field doesn't exist, I
haven't got to it yet). Everything that does exist should be parsed properly.

easy_xlsx is in very early stages. It should properly give read XLSX files
and output correct types, but the SpreadsheetML spec doesn't list all of the
built-in format strings, so some types may not be handled correctly. At the
moment, easy_xlsx will bail out in any case where it can't understand the
formatting, although I'd be open to patches to make this optional.

Building

Install dependencies:

opam pin add -n easy_xlsx .
opam depext easy_xlsx
opam install --deps-only easy_xlsx

Then build:

make

You can run the tests if you want:

make test

The Makefile is just a thin wrapper around jbuilder, so you can use
jbuilder commands too if you prefer.

Helping

If you want to help with this, create an issue with what you'd like to work
on, and mention it if you need me to help with anything (point you at the
relevant spec, give my opinion on approaches, etc.).

Some things I could use help with:

  • Implement more of the specs. See ECMA 376.
    The editions only contain changes, so most of what's interesting is in
    the 1st edition. Part 2 is the most interesting for the Open Packaging
    Conventions and Part 4 has specifics for SpreadsheetML (or the other office
    formats if you want to start a library for them).

  • Add more tests. Right now there's a set of extremely high level tests
    that we get the sound output as OpenOffice's CSV export, but being so high
    level means a lot of things are completely untested until we're 100%
    finished, which isn't a good situation to be in. It's easy to have a typo
    when implementing this spec, so I'd like to aim for 100% test coverage.

  • The cell formatting needs a lot of work. It's what I'm likely to work on
    next, but if you think you can do it first I'd be happy to let someone
    else do it (let me know if you start working on this though, so we don't
    duplicate effort).

  • Make CamlZip support a string or bigstring interface.
    Right now we can only read actual files on the filesystem since that's the
    interface CamlZip gives us. I'd like to be able to read data from memory.

  • Make a pure-OCaml ZIP library. Then we won't need any system dependencies
    and should work with js_of_ocaml too.

Install
Published
01 Jun 2018
Sources
easy_xlsx-1.0.tbz
md5=63407b5190cf6f529343a29953c8ff2e
Dependencies
ounit
with-test
jbuilder
>= "1.0+beta18"
bisect_ppx
build & >= "1.3.0"
base
= "v0.10.0"
ocaml
>= "4.04.2"
Reverse Dependencies