Encoding and decoding

cf_xarray aims to support encoding and decoding variables using CF conventions not yet implemented by Xarray.

Compression by gathering

The “compression by gathering” convention could be used for either pandas.MultiIndex objects or pydata/sparse arrays.
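
As background, the convention stores only the points that carry data, plus a list variable holding each point's flat, C-order index into the uncompressed dimensions. The sketch below illustrates that idea with plain NumPy. The grid shape, index arrays, and data values are made up for illustration, and this is not cf_xarray's implementation.

```python
import numpy as np

# Hypothetical 2 x 3 (lat, lon) grid in which only three points carry data.
shape = (2, 3)
lat_idx = np.array([0, 0, 1])
lon_idx = np.array([0, 2, 1])
compressed = np.array([10.0, 20.0, 30.0])  # made-up values, one per kept point

# Gathering: one flat C-order index per kept point (lat_idx * 3 + lon_idx).
landpoint = np.ravel_multi_index((lat_idx, lon_idx), shape)

# Scattering back: recover the (lat, lon) positions and fill a full grid,
# leaving NaN wherever no data was stored.
lat_back, lon_back = np.unravel_index(landpoint, shape)
full = np.full(shape, np.nan)
full[lat_back, lon_back] = compressed

print(landpoint)
print(full)
```

Only `compressed` and `landpoint` need to be stored; the full grid can be rebuilt from them and the sizes of the compressed dimensions.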

MultiIndex

cf_xarray provides encode_multi_index_as_compress() and decode_compress_to_multi_index() to encode and decode MultiIndex-ed dimensions using “compression by gathering”.

Here’s a test dataset with a MultiIndex-ed landpoint dimension:

import cf_xarray as cfxr
import numpy as np
import pandas as pd
import xarray as xr

# Wrap the MultiIndex explicitly; passing it directly triggers a FutureWarning in xarray
mindex = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("lat", "lon"))
coords = xr.Coordinates.from_pandas_multiindex(mindex, "landpoint")
ds = xr.Dataset(
    {"landsoilt": ("landpoint", np.random.randn(4), {"foo": "bar"})},
    coords,
)
ds
<xarray.Dataset> Size: 128B
Dimensions:    (landpoint: 4)
Coordinates:
  * landpoint  (landpoint) object 32B MultiIndex
  * lat        (landpoint) object 32B 'a' 'a' 'b' 'b'
  * lon        (landpoint) int64 32B 1 2 1 2
Data variables:
    landsoilt  (landpoint) float64 32B -2.014 -1.37 -0.2165 -0.7698

First, encode. The landpoint variable gains a "compress" attribute, which the text repr below does not show:

encoded = cfxr.encode_multi_index_as_compress(ds, "landpoint")
encoded
<xarray.Dataset> Size: 96B
Dimensions:    (landpoint: 4, lat: 2, lon: 2)
Coordinates:
  * lat        (lat) object 16B 'a' 'b'
  * lon        (lon) int64 16B 1 2
  * landpoint  (landpoint) int64 32B 0 1 2 3
Data variables:
    landsoilt  (landpoint) float64 32B -2.014 -1.37 -0.2165 -0.7698
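
The integer landpoint values in the output above are consistent with flat C-order indices into the (lat, lon) grid. The sketch below shows how such indices can be derived from a MultiIndex's integer codes. `flat_codes` is a hypothetical helper written for this illustration, not part of cf_xarray, and the partial index is an invented example of a dimension where only some (lat, lon) pairs are present.

```python
import numpy as np
import pandas as pd


def flat_codes(mindex):
    """Hypothetical helper: flat C-order index of each MultiIndex entry."""
    # Each level's codes give the position of an entry within that level.
    codes = tuple(np.asarray(c, dtype=np.intp) for c in mindex.codes)
    # levshape is the number of unique values in each level.
    return np.ravel_multi_index(codes, mindex.levshape)


# Same index as the test dataset: every (lat, lon) pair is present.
full_index = pd.MultiIndex.from_product(
    [["a", "b"], [1, 2]], names=("lat", "lon")
)

# Invented example: only two of the four possible pairs are present.
partial_index = pd.MultiIndex.from_tuples(
    [("a", 1), ("b", 2)], names=("lat", "lon")
)

print(flat_codes(full_index))
print(flat_codes(partial_index))
```

With the full product index every grid position is kept, so nothing is actually compressed. The partial index shows the case the convention targets, where only the listed positions are stored.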

At this point, encoded can be written to a CF-compliant file, for example with xarray.Dataset.to_netcdf(). After reading that file back, decode using:

decoded = cfxr.decode_compress_to_multi_index(encoded, "landpoint")
decoded
<xarray.Dataset> Size: 128B
Dimensions:    (landpoint: 4)
Coordinates:
  * landpoint  (landpoint) object 32B MultiIndex
  * lat        (landpoint) object 32B 'a' 'a' 'b' 'b'
  * lon        (landpoint) int64 32B 1 2 1 2
Data variables:
    landsoilt  (landpoint) float64 32B -2.014 -1.37 -0.2165 -0.7698

The dataset round-trips exactly:

ds.identical(decoded)
True

Sparse arrays

This is currently unsupported, but a pull request is welcome!