Encoding and decoding

cf_xarray aims to support encoding and decoding variables using CF conventions not yet implemented by Xarray.

Compression by gathering

The “compression by gathering” convention could be used for either pandas.MultiIndex objects or pydata/sparse arrays.
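
As a rough illustration of the idea (plain NumPy, not cf_xarray API), compression by gathering stores only the points of interest along a single compressed dimension, together with their flattened positions in the full grid. The array names and the use of NaN to mark unwanted points are assumptions made for this sketch:

```python
import numpy as np

# Hypothetical 2 x 3 (lat, lon) grid; NaN marks points we do not want to store.
full = np.array([[1.0, np.nan, 3.0], [np.nan, 5.0, np.nan]])

# Gather: keep the valid values plus their row-major flat indices into the grid.
flat_index = np.flatnonzero(~np.isnan(full))
gathered = full.ravel()[flat_index]

# Scatter: rebuild the full grid from the gathered values and the indices.
restored = np.full(full.size, np.nan)
restored[flat_index] = gathered
restored = restored.reshape(full.shape)
```

Here flat_index plays the role of the compressed coordinate, and the grid shape is what the "compress" attribute lets a reader recover from the named dimensions.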

MultiIndex

cf_xarray provides encode_multi_index_as_compress() and decode_compress_to_multi_index() to encode and decode MultiIndex-ed dimensions using “compression by gathering”.

Here’s a test dataset:

import cf_xarray as cfxr
import numpy as np
import pandas as pd
import xarray as xr

# Wrap the MultiIndex explicitly; implicit promotion is deprecated in xarray.
mindex = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("lat", "lon"))
ds = xr.Dataset(
    {"landsoilt": ("landpoint", np.random.randn(4), {"foo": "bar"})},
    xr.Coordinates.from_pandas_multiindex(mindex, "landpoint"),
)
ds
<xarray.Dataset> Size: 128B
Dimensions:    (landpoint: 4)
Coordinates:
  * landpoint  (landpoint) object 32B MultiIndex
  * lat        (landpoint) object 32B 'a' 'a' 'b' 'b'
  * lon        (landpoint) int64 32B 1 2 1 2
Data variables:
    landsoilt  (landpoint) float64 32B -0.3698 0.6078 -0.1147 -0.5806

First, encode (note the "compress" attribute on the landpoint variable):

encoded = cfxr.encode_multi_index_as_compress(ds, "landpoint")
encoded
<xarray.Dataset> Size: 96B
Dimensions:    (landpoint: 4, lat: 2, lon: 2)
Coordinates:
  * lat        (lat) object 16B 'a' 'b'
  * lon        (lon) int64 16B 1 2
  * landpoint  (landpoint) int64 32B 0 1 2 3
Data variables:
    landsoilt  (landpoint) float64 32B -0.3698 0.6078 -0.1147 -0.5806
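
The integer values of landpoint can be read as row-major flat indices into the (lat, lon) grid. The sketch below reproduces them with pandas and NumPy only; it illustrates the convention and is not necessarily how cf_xarray computes them internally:

```python
import numpy as np
import pandas as pd

mindex = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("lat", "lon"))

# Size of each level, i.e. the shape of the uncompressed (lat, lon) grid.
shape = tuple(len(level) for level in mindex.levels)

# Encode: combine the per-level integer codes into one flat index per point.
flat_index = np.ravel_multi_index(tuple(mindex.codes), shape)

# Decode: split the flat indices back into per-level codes and look up labels.
lat_codes, lon_codes = np.unravel_index(flat_index, shape)
lat = mindex.levels[0][lat_codes]
lon = mindex.levels[1][lon_codes]
```

For this dataset flat_index comes out as 0, 1, 2, 3, matching the landpoint values in the encoded output, and the decoded labels reproduce the original lat and lon levels.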

At this point, encoded can be written to a CF-compliant file, for example with xarray.Dataset.to_netcdf(). After reading that file back, decode it using:

decoded = cfxr.decode_compress_to_multi_index(encoded, "landpoint")
decoded
<xarray.Dataset> Size: 128B
Dimensions:    (landpoint: 4)
Coordinates:
  * landpoint  (landpoint) object 32B MultiIndex
  * lat        (landpoint) object 32B 'a' 'a' 'b' 'b'
  * lon        (landpoint) int64 32B 1 2 1 2
Data variables:
    landsoilt  (landpoint) float64 32B -0.3698 0.6078 -0.1147 -0.5806

The data round-trips perfectly:

ds.identical(decoded)
True

Sparse arrays

Sparse arrays are not currently supported, but a pull request is welcome!