Encoding and decoding
cf_xarray aims to support encoding and decoding variables using CF conventions not yet implemented by Xarray.
Compression by gathering
The “compression by gathering” convention can be used to encode either pandas.MultiIndex objects or pydata/sparse arrays.
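As a rough illustration of the idea behind the convention, the sketch below gathers the valid points of a small grid into a one-dimensional list, together with an index list that records where each point belongs. The toy grid, the variable names, and the use of zero-based, row-major indices are assumptions made for this example; it uses plain Python and does not call cf_xarray.

```python
# Toy 2x3 (lat, lon) grid; None marks points with no data (e.g. ocean points).
grid = [
    [None, 1.5, None],
    [2.5, None, 3.5],
]
n_lat, n_lon = len(grid), len(grid[0])

# Gather: keep only the valid points and record each one's flat,
# zero-based, row-major index into the (lat, lon) grid.
landpoint = [
    i * n_lon + j
    for i in range(n_lat)
    for j in range(n_lon)
    if grid[i][j] is not None
]
values = [grid[flat // n_lon][flat % n_lon] for flat in landpoint]

# Scatter: rebuild the full grid from the index list and the gathered values.
restored = [[None] * n_lon for _ in range(n_lat)]
for flat, value in zip(landpoint, values):
    i, j = divmod(flat, n_lon)
    restored[i][j] = value

print(landpoint)  # [1, 3, 5]
print(values)     # [1.5, 2.5, 3.5]
```

Only the three valid values and their three indices need to be stored, and the full grid can still be rebuilt from them.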
MultiIndex
cf_xarray provides encode_multi_index_as_compress() and decode_compress_to_multi_index() to encode and decode MultiIndex-ed dimensions using “compression by gathering”.
Here’s a test dataset:
import cf_xarray as cfxr
import numpy as np
import pandas as pd
import xarray as xr

# Wrap the MultiIndex explicitly; passing it directly as a coordinate is deprecated.
midx = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=("lat", "lon"))
ds = xr.Dataset(
    {"landsoilt": ("landpoint", np.random.randn(4), {"foo": "bar"})},
    coords=xr.Coordinates.from_pandas_multiindex(midx, "landpoint"),
)
ds
<xarray.Dataset> Size: 128B
Dimensions:    (landpoint: 4)
Coordinates:
  * landpoint  (landpoint) object 32B MultiIndex
  * lat        (landpoint) object 32B 'a' 'a' 'b' 'b'
  * lon        (landpoint) int64 32B 1 2 1 2
Data variables:
    landsoilt  (landpoint) float64 32B -0.3698 0.6078 -0.1147 -0.5806
First, encode (note the "compress" attribute on the landpoint variable):
encoded = cfxr.encode_multi_index_as_compress(ds, "landpoint")
encoded
<xarray.Dataset> Size: 96B
Dimensions:    (landpoint: 4, lat: 2, lon: 2)
Coordinates:
  * lat        (lat) object 16B 'a' 'b'
  * lon        (lon) int64 16B 1 2
  * landpoint  (landpoint) int64 32B 0 1 2 3
Data variables:
    landsoilt  (landpoint) float64 32B -0.3698 0.6078 -0.1147 -0.5806
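To see how the integer landpoint values in the encoded dataset relate to the original index levels, here is a minimal sketch. It assumes the "compress" attribute names lat and lon in that order and that the indices are zero-based and row-major (lon varies fastest). The lists are copied by hand from the output above rather than read from the dataset, so this illustrates the arithmetic and is not cf_xarray's implementation.

```python
# Values copied from the encoded dataset shown above.
lat = ["a", "b"]
lon = [1, 2]
landpoint = [0, 1, 2, 3]

# Each flat index splits into (lat position, lon position) by dividing
# by the number of lon values; the remainder is the lon position.
pairs = []
for flat in landpoint:
    i, j = divmod(flat, len(lon))
    pairs.append((lat[i], lon[j]))

# Going the other way recovers the flat indices from the level pairs.
reencoded = [lat.index(a) * len(lon) + lon.index(b) for a, b in pairs]

print(pairs)      # [('a', 1), ('a', 2), ('b', 1), ('b', 2)]
print(reencoded)  # [0, 1, 2, 3]
```

The recovered pairs match the lat and lon levels of the original MultiIndex. On arrays, NumPy's unravel_index and ravel_multi_index perform the same two conversions.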
At this point, we can write encoded to a CF-compliant netCDF file, for example with xarray.Dataset.to_netcdf().
After reading that file back, decode it using:
decoded = cfxr.decode_compress_to_multi_index(encoded, "landpoint")
decoded
<xarray.Dataset> Size: 128B
Dimensions:    (landpoint: 4)
Coordinates:
  * landpoint  (landpoint) object 32B MultiIndex
  * lat        (landpoint) object 32B 'a' 'a' 'b' 'b'
  * lon        (landpoint) int64 32B 1 2 1 2
Data variables:
    landsoilt  (landpoint) float64 32B -0.3698 0.6078 -0.1147 -0.5806
The dataset roundtrips perfectly:
ds.identical(decoded)
True
Sparse arrays
Encoding sparse arrays is currently unsupported, but a pull request is welcome!