constantinpape/z5

z5 / z5py

The new z5 / z5py v3 release adds support for the zarr v3 format and S3 object storage. It removes the xtensor dependency and is now available via PyPI in addition to conda-forge.

C++ library (z5) with Python bindings (z5py) for the zarr and n5 file formats.

This library supports:

  • Zarr format v2 and v3.
  • The n5 format.
  • Access to zarr files on the filesystem and S3 object storage; n5 is only supported on the file system.

Support for the following compression codecs:

Installation

Conda

Conda packages for the relevant systems and python versions are hosted on conda-forge:

$ conda install -c conda-forge z5py

Pip

Wheels are published on PyPI:

$ pip install z5py

The PyPI wheels are built with all compression codecs but without the S3 backend (z5py.S3File will raise "z5 was not compiled with s3 support"). For S3 support, install via conda or build from source with -DWITH_S3=ON.

From Source

The easiest way to build the library from source is using a conda-environment with all necessary dependencies. You can find the conda environment files for build environments in .environments/unix

To set up the conda environment and install the package on unix:

$ conda env create -f environments/unix/z5-dev.yaml
$ conda activate z5-dev
$ mkdir bld
$ cd bld
$ cmake -DWITH_ZLIB=ON -DWITH_BZIP2=ON -DCMAKE_INSTALL_PREFIX=/path/to/install ..
$ make install

Note that in the CMakeLists.txt, we try to infer the active conda-environment automatically. If this fails, you can set it manually via -DCMAKE_PREFIX_PATH=/path/to/conda-env. To specify where to install the package, set:

  • CMAKE_INSTALL_PREFIX: where to install the C++ headers
  • PYTHON_MODULE_INSTALL_DIR: where to install the python package (set to site-packages of active conda env by default)

If you want to include z5 in another C++ project, note that the library itself is header-only. However, you need to link against the compression codecs that you use.

If you don't want to use conda for dependency management, the following dependencies are necessary:

Examples / Usage

Python

The Python API is very similar to h5py. Some differences are:

  • The constructor of File takes the boolean argument use_zarr_format, which determines whether the zarr or N5 format is used (if set to None, an attempt is made to automatically infer the format).
  • There is no need to close File, hence the with block isn't necessary (but supported).
  • Linked datasets (my_file['new_ds'] = my_file['old_ds']) are not supported.
  • Broadcasting is only supported for scalars in Dataset.__setitem__.
  • Arbitrary leading and trailing singleton dimensions can be added/removed/rolled through in Dataset.__setitem__
  • Compatibility of exception handling is a goal, but not necessarily guaranteed.
  • Because zarr/N5 are usually used with large data, z5py compresses blocks by default where h5py does not. The default compressors are
    • zarr: "blosc"
    • n5: "gzip"

Some examples:

import z5py
import numpy as np

# create a file and a dataset
f = z5py.File('array.zr', use_zarr_format=True)
ds = f.create_dataset('data', shape=(1000, 1000), chunks=(100, 100), dtype='float32')

# write array to a roi
x = np.random.random_sample(size=(500, 500)).astype('float32')
ds[:500, :500] = x

# broadcast a scalar to a roi
ds[500:, 500:] = 42.

# read array from a roi
y = ds[250:750, 250:750]

# create a group and create a dataset in the group
g = f.create_group('local_group')
g.create_dataset('local_data', shape=(100, 100), chunks=(10, 10), dtype='uint32')

# open dataset from group or file
ds_local1 = f['local_group/local_data']
ds_local2 = g['local_data']

# read and write attributes
attributes = ds.attrs
attributes['foo'] = 'bar'
baz = attributes['foo']
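
Because the data is stored in chunks, a read such as ds[250:750, 250:750] overlaps several chunks of the (100, 100) grid used above. The following sketch counts them for a regular chunk grid. `chunks_touched` is a hypothetical helper written for illustration; it is not part of z5py and says nothing about how z5 schedules the actual IO.

```python
def chunks_touched(roi, chunks):
    """Per-axis ranges of chunk indices overlapped by a half-open ROI.

    roi is a list of (start, stop) pairs with start < stop.
    """
    return [
        range(start // chunk, (stop - 1) // chunk + 1)
        for (start, stop), chunk in zip(roi, chunks)
    ]


ranges = chunks_touched(roi=[(250, 750), (250, 750)], chunks=(100, 100))
print([list(r) for r in ranges])  # [[2, 3, 4, 5, 6, 7], [2, 3, 4, 5, 6, 7]]

# Total number of chunks is the product of the per-axis counts.
n_chunks = 1
for r in ranges:
    n_chunks *= len(r)
print(n_chunks)  # 36
```

Choosing regions that line up with chunk boundaries keeps this number as small as possible for a given amount of data.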

C++

z5 aims to support different storage implementations. The default is the filesystem; implementations that also support AWS S3 and Google Cloud Storage are work in progress. The API provides factory functions such as createFile and createDataset in the factory header. These functions must be called with the corresponding handle, such as z5::filesystem::handle::File or z5::s3::handle::File, to select the backend.

The library is intended to be used with an in-memory multiarray that holds the data. Data is passed in and out via a lightweight, non-owning strided view, z5::multiarray::ArrayView / ConstArrayView: a data pointer plus shape and element strides, compatible with numpy arrays. See the implementation and the IO functions readSubarray / writeSubarray in array_access.hxx. To interface with another multiarray implementation, wrap its buffer in an ArrayView.
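
To make "shape and element strides" concrete, the following Python sketch computes C-order strides in units of elements and uses them to locate an element in a flat buffer. It only illustrates the memory layout that such a view describes. `c_order_strides` and `flat_index` are hypothetical helper names chosen for this sketch, not the z5 implementation.

```python
def c_order_strides(shape):
    """Element strides of a C-contiguous (row-major) array with this shape."""
    strides = [1] * len(shape)
    # The last axis is contiguous; each earlier axis steps over all later ones.
    for axis in range(len(shape) - 2, -1, -1):
        strides[axis] = strides[axis + 1] * shape[axis + 1]
    return tuple(strides)


def flat_index(index, strides):
    """Position of a multi-dimensional index in the flat buffer."""
    return sum(i * s for i, s in zip(index, strides))


shape = (150, 200, 100)
strides = c_order_strides(shape)
print(strides)                          # (20000, 100, 1)
print(flat_index((1, 2, 3), strides))   # 20203
```

Note that NumPy reports strides in bytes, so dividing them by the item size gives element strides.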

Some examples:

#include <string>
#include <vector>

#include "json.hpp"

// factory functions to create files, groups and datasets
#include "z5/factory.hxx"
// handles for z5 filesystem objects
#include "z5/filesystem/handle.hxx"
// strided-view io for multi-arrays
#include "z5/multiarray/array_access.hxx"
// attribute functionality
#include "z5/attributes.hxx"

int main() {

  // get handle to a File on the filesystem
  z5::filesystem::handle::File f("data.zr");
  // if you wanted to use a different backend, for example AWS, you
  // would need to use this instead:
  // z5::s3::handle::File f;

  // create the file in zarr format
  const bool createAsZarr = true;
  z5::createFile(f, createAsZarr);

  // create a new zarr dataset
  const std::string dsName = "data";
  std::vector<size_t> shape = { 1000, 1000, 1000 };
  std::vector<size_t> chunks = { 100, 100, 100 };
  auto ds = z5::createDataset(f, dsName, "float32", shape, chunks);

  // write array to roi; the data lives in a C-contiguous buffer that we
  // wrap in a (non-owning) strided view
  z5::types::ShapeType offset1 = { 50, 100, 150 };
  z5::types::ShapeType shape1 = { 150, 200, 100 };
  std::vector<float> buffer1(150 * 200 * 100, 42.0);
  z5::multiarray::ConstArrayView<float> array1(buffer1.data(), shape1,
                                               z5::multiarray::cOrderStrides(shape1));
  z5::multiarray::writeSubarray<float>(ds, array1, offset1.begin());

  // read array from roi (values that were not written before are filled with a fill-value)
  z5::types::ShapeType offset2 = { 100, 100, 100 };
  z5::types::ShapeType shape2 = { 300, 200, 75 };
  std::vector<float> buffer2(300 * 200 * 75);
  z5::multiarray::ArrayView<float> array2(buffer2.data(), shape2,
                                          z5::multiarray::cOrderStrides(shape2));
  z5::multiarray::readSubarray<float>(ds, array2, offset2.begin());

  // get handle for the dataset
  const auto dsHandle = z5::filesystem::handle::Dataset(f, dsName);

  // read and write json attributes
  nlohmann::json attributesIn;
  attributesIn["bar"] = "foo";
  attributesIn["pi"] = 3.141593;
  z5::writeAttributes(dsHandle, attributesIn);

  nlohmann::json attributesOut;
  z5::readAttributes(dsHandle, attributesOut);
  
  return 0;
}

C

There are external efforts to implement a C API wrapper for z5. You can check it out here.

R

There exists a prototype by @gdkrmr to provide R bindings for z5. It is still in an early stage, but looks very promising.

Citation

If you use this library in your research, please cite it via the associated DOI:

@misc{pape_z5_2019,
  doi = {10.5281/ZENODO.3585752},
  url = {https://zenodo.org/record/3585752},
  author = {Pape, Constantin},
  title = {constantinpape/z5},
  publisher = {Zenodo},
  year = {2019}
}

Current Limitations / TODOs

  • No thread / process synchronization: writing to the same chunk in parallel leads to undefined behavior.
  • Only little-endian data and C-order are supported for the zarr format.
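
Since there is no synchronization, one way to stay safe is to give each worker regions that are aligned to chunk boundaries, so that no two workers ever touch the same chunk. The sketch below generates such regions as tuples of slices. `chunk_aligned_blocks` is a hypothetical helper written for illustration, not a z5py function.

```python
from itertools import product


def chunk_aligned_blocks(shape, chunks):
    """Yield one tuple of slices per chunk, clipped to the dataset shape."""
    # Start coordinates of the chunks along each axis.
    starts_per_axis = [range(0, size, chunk) for size, chunk in zip(shape, chunks)]
    for starts in product(*starts_per_axis):
        yield tuple(
            slice(start, min(start + chunk, size))
            for start, chunk, size in zip(starts, chunks, shape)
        )


blocks = list(chunk_aligned_blocks((1000, 1000), (100, 100)))
print(len(blocks))   # 100
print(blocks[0])     # (slice(0, 100, None), slice(0, 100, None))
```

Each block can then be assigned to exactly one thread or process, which writes only to its own blocks.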

A note on axis ordering

Internally, n5 uses column-major (i.e. x, y, z) axis ordering, while z5 uses row-major (i.e. z, y, x). While this is mostly handled internally, it means that the metadata does not transfer 1 to 1, but needs to be reversed for most shapes. Concretely:

|             | n5            | z5            |
|-------------|---------------|---------------|
| Shape       | s_x, s_y, s_z | s_z, s_y, s_x |
| Chunk-Shape | c_x, c_y, c_z | c_z, c_y, c_x |
| Chunk-Ids   | i_x, i_y, i_z | i_z, i_y, i_x |
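
Assuming the conversion is the plain reversal that the table describes, moving metadata between the two conventions amounts to reversing each tuple. The helper names and the example values below are made up for illustration and are not part of z5py.

```python
def n5_to_z5(values):
    """Reverse n5 (x, y, z) ordered metadata into z5 (z, y, x) order."""
    return tuple(reversed(values))


def z5_to_n5(values):
    """Reverse z5 (z, y, x) ordered metadata into n5 (x, y, z) order."""
    return tuple(reversed(values))


n5_shape = (512, 256, 64)    # s_x, s_y, s_z as stored in the n5 metadata
n5_chunks = (64, 64, 32)     # c_x, c_y, c_z

print(n5_to_z5(n5_shape))    # (64, 256, 512)
print(n5_to_z5(n5_chunks))   # (32, 64, 64)
```

The same reversal applies to chunk ids, and applying it twice returns the original ordering.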
