API Reference

intake_virtual_icechunk provides a few core public components:

intake_virtual_icechunk.core.IcechunkCatalog: the main intake catalog implementation, registered as the virtual_icechunk driver.
intake_virtual_icechunk._source.IcechunkDataSource: the per-entry data source returned when you index into an IcechunkCatalog.
intake_virtual_icechunk.cat.VirtualIcechunkCatalogModel: the JSON sidecar model used to persist catalog metadata and reopen a store later.
intake_virtual_icechunk.source._build.IcechunkStoreBuilder: builds an Icechunk store from a pre-built intake-esm catalog.
intake_virtual_icechunk.source._containers.VirtualChunkContainerModel: stores enough virtual chunk container configuration to round-trip a catalog safely.

The following API summary is auto-generated.

class intake_virtual_icechunk.core.IcechunkCatalog(*args, **kwargs)

An intake plugin for reading an Icechunk store built from an intake-esm catalog.

The store contains one Zarr group per dataset, written by IcechunkStoreBuilder. Per-entry metadata (the attributes used for searching) is stored in each group’s .zattrs. This catalog mirrors the esm_datastore API so that switching between the two is straightforward.

Registered as the virtual_icechunk intake driver, so it is accessible via intake.open_virtual_icechunk().

Parameters:

store (str): Path or URI to the Icechunk store. Supported schemes: local path, s3://, gs:// / gcs://, az://.
storage_options (dict, optional): Credential/config keyword arguments forwarded to the Icechunk storage backend (e.g. {'from_env': True} for S3).
sidecar_options (dict, optional): obstore config kwargs used only to read the JSON sidecar. When omitted, storage_options is reused for the sidecar read.
xarray_kwargs (dict, optional): Keyword arguments forwarded to xarray.open_zarr().
virtual_chunk_model (dict or VirtualChunkContainerModel, optional): Pre-loaded virtual chunk container configuration. Supplying this skips the sidecar read and is mainly used by from_json and search.
catalog_id (str, optional): Catalog identifier loaded from a JSON sidecar.
intake_kwargs (dict, optional): Additional keyword arguments passed through to Catalog.
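
The store schemes listed above can be told apart with the standard library alone. The following is a minimal sketch and not the package's own resolver: `classify_store` and the backend labels it returns are hypothetical names chosen for illustration.

```python
from urllib.parse import urlparse

# Hypothetical mapping from URI scheme to a backend label; the labels are
# illustrative and are not identifiers used by intake_virtual_icechunk.
_SCHEME_TO_BACKEND = {
    "s3": "s3",
    "gs": "gcs",
    "gcs": "gcs",
    "az": "azure",
}


def classify_store(store: str) -> str:
    """Return a backend label for a store path or URI."""
    scheme = urlparse(store).scheme
    if scheme == "":
        return "local"  # plain path without a scheme
    try:
        return _SCHEME_TO_BACKEND[scheme]
    except KeyError:
        raise ValueError(f"Unsupported store scheme: {scheme!r}") from None
```

Note that a Windows path with a drive letter would parse as having a scheme, so a real resolver would need extra handling for that case.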

Attributes:

columns_with_iterables

Return a set of column names that contain iterable values (e.g. lists).

This is needed to know which columns to unpack when doing searches with iterable query values.

df

Return a DataFrame of all catalog entry metadata.

Each row corresponds to one Zarr group (catalog entry). The key column holds the group path; remaining columns are drawn from each group’s .zattrs as written by IcechunkStoreBuilder.

Methods

`from_json`(json_file, *[, xarray_kwargs, ...])	Load an `IcechunkCatalog` from a JSON sidecar file.
`keys`()	Get keys for the catalog entries (one per top-level Zarr group in the store).
`save`(name, *[, directory, json_dump_kwargs])	Save a JSON sidecar file pointing to this catalog's Icechunk store.
`search`(**query)	Search for entries in the catalog by matching group `.zattrs`.
`to_dask`(*args, **kwargs)	Return a dask container for this data source.
`to_dataset_dict`([xarray_kwargs, ...])	Load catalog entries into a dictionary of xarray Datasets.
`to_xarray`(**kwargs)	Return the catalog as a single xarray Dataset.
`unique`()	Get the number of unique values for each column in the catalog DataFrame.

Examples

Open a catalog saved by IcechunkStoreBuilder:

>>> import intake
>>> cat = intake.open_virtual_icechunk('/path/to/store')
>>> cat.keys()
['CMIP.BCC.BCC-ESM1.historical', 'CMIP.BCC.BCC-ESM1.ssp585']
>>> ds = cat['CMIP.BCC.BCC-ESM1.historical'].to_xarray()

Or load from a JSON sidecar:

>>> cat = IcechunkCatalog.from_json('/path/to/catalog.json')

classmethod from_json(json_file, *, xarray_kwargs=None, storage_options=None)

Load an IcechunkCatalog from a JSON sidecar file.

Parameters:

json_file (str): Path or URL to the catalog JSON file produced by save() or IcechunkStoreBuilder.
xarray_kwargs (dict, optional): Keyword arguments forwarded to xarray.open_zarr().
storage_options (dict, optional): obstore config kwargs for reading the JSON file itself (not for the Icechunk store; those are embedded in the JSON).

__init__(store, *, storage_options=None, sidecar_options=None, xarray_kwargs=None, virtual_chunk_model=None, catalog_id=None, **intake_kwargs)

The parameters below come from the docstring of the intake Catalog base class rather than from this constructor's signature. The constructor's own parameters are described in the class-level parameter list.

Parameters:

entries (dict, optional): Mapping of {name: entry}.
name (str, optional): Unique identifier for the catalog. This takes precedence over whatever is stated in the cat file itself. Defaults to None.
description (str, optional): Description of the catalog. This takes precedence over whatever is stated in the cat file itself. Defaults to None.
metadata (dict): Additional information about this data.
ttl (float, optional): Lifespan (time to live) of the cached modification time, in seconds. Defaults to 1.
getenv (bool): Whether parameter default fields can take values from the environment.
getshell (bool): Whether parameter default fields can run shell commands.
persist_mode ({'always', 'default', 'never'}): Controls the use of persisted sources. With 'always', a persisted version of a data source is used if it exists; with 'never', the original source is always used; with 'default', persisted sources are used if they have not expired, and are re-persisted and then used if they have.
storage_options (dict): If using a URL beginning with 'intake://' (remote Intake server), parameters to pass to requests when issuing HTTP commands; otherwise parameters to pass to the remote backend file system. Ignored for normal local files.

keys()

Get keys for the catalog entries (one per top-level Zarr group in the store).

Returns:

list of str: Group path keys, one per dataset written by IcechunkStoreBuilder. When this catalog is the result of search(), only the matching keys are returned.

save(name, *, directory=None, json_dump_kwargs=None)

Save a JSON sidecar file pointing to this catalog’s Icechunk store.

Parameters:

name (str): Stem of the output file (without the .json extension).
directory (str, optional): Directory to write the file to. Defaults to the current working directory.
json_dump_kwargs (dict, optional): Additional keyword arguments forwarded to json.dump().

search(**query)

Search for entries in the catalog by matching group .zattrs.

Parameters:

require_all_on (str or list of str, optional): If specified, the given column(s) must match all values in the query. Retained mainly for backward compatibility with intake-esm; whether it should be kept is still undecided.
**query: Each keyword maps to a .zattrs attribute name. The value may be a scalar or a list of allowed values.

Returns:

IcechunkCatalog: A new catalog containing only the matching entries. The underlying Icechunk store is shared — it is not re-opened.

Examples

>>> cat.search(source_id='BCC-ESM1')
>>> cat.search(experiment_id=['historical', 'ssp585'])
>>> cat.search(source_id='BCC-ESM1', experiment_id='historical')
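
The matching rule described for `**query` can be illustrated without the library. This sketch assumes that keywords are combined with AND and that a list value means "any of these"; that reading is consistent with the examples above but is not confirmed by them, and `require_all_on` is not modelled. The helper names are hypothetical.

```python
from collections.abc import Iterable


def _as_set(value) -> set:
    """Normalise a scalar or a non-string iterable to a set of values."""
    if isinstance(value, Iterable) and not isinstance(value, (str, bytes)):
        return set(value)
    return {value}


def matches(attrs: dict, query: dict) -> bool:
    """True if every queried attribute shares a value with its allowed set."""
    for name, allowed in query.items():
        if name not in attrs:
            return False
        if not (_as_set(attrs[name]) & _as_set(allowed)):
            return False
    return True


def search(entries: dict, **query) -> dict:
    """Return the subset of ``entries`` (key -> attrs) that matches ``query``."""
    return {key: attrs for key, attrs in entries.items() if matches(attrs, query)}
```

Here `entries` stands in for the per-group attributes; the real catalog returns a new IcechunkCatalog rather than a dict.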

to_dask(*args, **kwargs): Return a dask container for this data source

to_dataset_dict(xarray_kwargs=None, progressbar=True, preprocess=None, storage_options=None)

Load catalog entries into a dictionary of xarray Datasets.

Parameters:

xarray_kwargs (dict, optional): Keyword arguments forwarded to xarray.open_zarr(). Merged with (and taking precedence over) the xarray_kwargs supplied at construction time.
progressbar (bool, optional): If True, display a progress bar while loading datasets.
preprocess (callable, optional): A callable with the signature preprocess(ds: xr.Dataset) -> xr.Dataset applied to each dataset immediately after loading, mirroring the preprocess argument of xarray.open_mfdataset().
storage_options (dict, optional): Storage credentials/config merged with (and taking precedence over) the catalog-level storage_options before constructing each data source. Retained for API parity with intake-esm; the already-opened Icechunk store object does not use these options.

Returns:

dict of str -> xarray.Dataset: One Dataset per catalog entry, keyed by the group path.
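
The precedence rule for `xarray_kwargs` and the role of `preprocess` can be sketched as follows. This is an illustration under stated assumptions, not the method's implementation: `merge_kwargs`, `load_all` and the `opener` callable are hypothetical stand-ins, and the opener replaces the real call to xarray.open_zarr().

```python
from typing import Any, Callable, Optional


def merge_kwargs(base: Optional[dict], override: Optional[dict]) -> dict:
    """Merge two kwarg dicts; keys in ``override`` win."""
    return {**(base or {}), **(override or {})}


def load_all(
    keys: list,
    opener: Callable[..., Any],
    construct_kwargs: Optional[dict] = None,
    call_kwargs: Optional[dict] = None,
    preprocess: Optional[Callable[[Any], Any]] = None,
) -> dict:
    """Open every key with the merged kwargs, then apply ``preprocess``."""
    kwargs = merge_kwargs(construct_kwargs, call_kwargs)
    out = {}
    for key in keys:
        ds = opener(key, **kwargs)
        out[key] = preprocess(ds) if preprocess is not None else ds
    return out
```

Call-time keyword arguments override construction-time ones key by key, so unrelated construction-time settings are preserved.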

to_xarray(**kwargs)

Return the catalog as a single xarray Dataset.

Only valid when the catalog contains exactly one entry.

Parameters:

**kwargs: Passed through to to_dataset_dict().

Returns:

xarray.Dataset

Raises:

ValueError: If the catalog contains zero or more than one entry.
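
The single-entry requirement amounts to a length check on the dictionary of loaded datasets. A minimal sketch, with `single_entry` as a hypothetical helper name and an error message that is not taken from the library:

```python
def single_entry(datasets: dict):
    """Return the only value in ``datasets`` or raise ValueError."""
    if len(datasets) != 1:
        raise ValueError(
            f"Expected exactly one catalog entry, found {len(datasets)}; "
            "narrow the catalog with search() first."
        )
    return next(iter(datasets.values()))
```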

unique()

Get the number of unique values for each column in the catalog DataFrame.

Iterable-valued columns are exploded before counting so their values are counted individually rather than as whole tuples.
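
The effect of exploding iterable-valued columns before counting can be shown on plain row dictionaries. This sketch does not use pandas and is not the library's code; `count_unique` is a hypothetical helper and the rows are made-up metadata.

```python
from collections.abc import Iterable


def count_unique(rows: list) -> dict:
    """Count distinct values per column, exploding list-like cells."""
    seen: dict = {}
    for row in rows:
        for column, value in row.items():
            values = seen.setdefault(column, set())
            if isinstance(value, Iterable) and not isinstance(value, (str, bytes)):
                values.update(value)  # explode: each element counts separately
            else:
                values.add(value)
    return {column: len(values) for column, values in seen.items()}
```

Without the explode step, `("tas", "pr")` and `("tas",)` would count as two distinct values even though they share an element.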

property columns_with_iterables

Return a set of column names that contain iterable values (e.g. lists).

This is needed to know which columns to unpack when doing searches with iterable query values.

property df

Return a DataFrame of all catalog entry metadata.

Each row corresponds to one Zarr group (catalog entry). The key column holds the group path; remaining columns are drawn from each group’s .zattrs as written by IcechunkStoreBuilder.
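
The row layout described for `df`, and the detection behind `columns_with_iterables`, can be sketched with plain dictionaries. Both function bodies are illustrative assumptions rather than the properties' implementations, and the attribute values are invented.

```python
def build_rows(group_attrs: dict) -> list:
    """Turn {group_path: attrs} into row dicts with a leading ``key`` field."""
    return [{"key": path, **attrs} for path, attrs in group_attrs.items()]


def columns_with_iterables(rows: list) -> set:
    """Names of columns in which at least one cell is a list, tuple or set."""
    return {
        column
        for row in rows
        for column, value in row.items()
        if isinstance(value, (list, tuple, set, frozenset))
    }
```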

class intake_virtual_icechunk._source.IcechunkDataSource(*args, **kwargs)

An intake-compatible Data Source for a single Zarr group in an Icechunk store.

This is the per-entry source returned by IcechunkCatalog when a key is looked up. It mirrors ESMDataSource so the two plugins feel identical to callers.

Parameters:

key (str): The catalog key / Zarr group path for this dataset.
store (icechunk.IcechunkStore): An already-opened, zarr-compatible IcechunkStore. Obtain one via IcechunkCatalog._zarr_store (or icechunk.Repository.open(...).readonly_session('main').store). Passing a pre-opened store avoids re-opening the repository for every data source.
group (str): Zarr group path within the store to open.
storage_options (dict, optional): Retained for API compatibility; not used when store is already an IcechunkStore.
xarray_kwargs (dict, optional): Keyword arguments forwarded to xarray.open_zarr().
intake_kwargs (dict, optional): Additional keyword arguments passed through to DataSource.

Attributes:

ds: The xarray Dataset for this data source.

Methods

`close`()	Drop the open dataset from memory.
`to_xarray`()	Return the xarray Dataset (with dask-backed arrays).

__init__(key, store, group, *, storage_options=None, xarray_kwargs=None, intake_kwargs=None)

close(): Drop the open dataset from memory.

to_xarray(): Return the xarray Dataset (with dask-backed arrays).

property ds: The xarray Dataset for this data source.

class intake_virtual_icechunk.cat.VirtualIcechunkCatalogModel(*, id='', version='1.0.0', store, virtual_chunk_model=None, description=None, title=None, last_updated=None, storage_options={})

Pydantic model for a Virtual Icechunk catalog sidecar file.

The sidecar JSON is a lightweight pointer to an Icechunk store together with catalog-level metadata. All per-entry (dataset) metadata is stored in each Zarr group’s .zattrs, written by IcechunkStoreBuilder.

Methods

`load`(json_file[, storage_options])	Load a catalog model from a JSON sidecar file.
`save`(name, *, store[, json_dump_kwargs])	Save the catalog model to a JSON sidecar file.

Examples

Save a catalog pointer:

>>> model = VirtualIcechunkCatalogModel(
...     store='s3://my-bucket/my-catalog.icechunk',
...     virtual_chunk_model=virtual_chunk_model,
...     storage_options={'from_env': True},
...     description='My climate catalog',
... )
>>> from obstore.store import from_url
>>> model.save('my-catalog', store=from_url('file:///path/to/output'))

Load it back:

>>> model = VirtualIcechunkCatalogModel.load('/path/to/output/my-catalog.json')
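
To make the idea of a "lightweight pointer" concrete, the sketch below writes and re-reads a JSON document whose keys follow the constructor signature above. Whether the file written by the real model uses exactly these keys is an assumption, and all values are invented.

```python
import json
import tempfile
from pathlib import Path

# Field names follow the constructor signature; the values are made up.
sidecar = {
    "id": "my-catalog",
    "version": "1.0.0",
    "store": "s3://my-bucket/my-catalog.icechunk",
    "virtual_chunk_model": None,
    "description": "My climate catalog",
    "title": None,
    "last_updated": None,
    "storage_options": {"from_env": True},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "my-catalog.json"
    path.write_text(json.dumps(sidecar, indent=2))
    reloaded = json.loads(path.read_text())  # round-trip through the file
```

The sidecar carries only catalog-level metadata; per-entry metadata stays in each group's .zattrs.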

classmethod load(json_file, storage_options=None)

Load a catalog model from a JSON sidecar file.

Parameters:

json_file (str): Path or URL to the JSON sidecar file.
storage_options (dict, optional): obstore config kwargs for reading the JSON file itself (e.g. S3 credentials). These are independent of the catalog’s own storage_options, which are stored inside the JSON and used to open the Icechunk store.

save(name, *, store, json_dump_kwargs=None)

Save the catalog model to a JSON sidecar file.

Parameters:

name (str): Stem of the output file. A trailing '.json' is stripped and then re-added, so the output file always ends in a single .json extension.
store (ObjectStore): An obstore store (e.g. S3Store, LocalStore) pointing at the directory into which the sidecar should be written.
json_dump_kwargs (dict, optional): Additional keyword arguments forwarded to json.dump().
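
The handling of the file stem can be sketched in a few lines. `sidecar_filename` is a hypothetical helper, and it assumes that only one trailing '.json' is stripped before the extension is re-added.

```python
def sidecar_filename(name: str) -> str:
    """Return ``name`` with one trailing ``.json`` stripped and re-added."""
    stem = name[: -len(".json")] if name.endswith(".json") else name
    return f"{stem}.json"
```

Under this assumption, 'my-catalog' and 'my-catalog.json' both produce the same output file name.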

model_config = {'validate_assignment': True}: Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class intake_virtual_icechunk.source._build.IcechunkStoreBuilder(*, esm_datastore_path, icechunk_store_path, esm_datastore_kwargs=None, icechunk_storage_options=None, xarray_kwargs=None, drop_cols=None, cols_to_deiter=None)

Build a real Icechunk store by copying data from an intake-esm datastore.

Given a pre-built intake-esm catalog, this builder iterates over every dataset group in the catalog, opens the constituent files with xarray.open_mfdataset, and writes each dataset as a named Zarr group inside a single Icechunk store. Data chunks are copied into the Icechunk store, so source files do not need to remain accessible once the build is complete.

The resulting store requires no virtual chunk container configuration: it can be opened with IcechunkCatalog directly, without supplying any source-data credentials.

Parameters:

esm_datastore_path (Path or str): Path to an existing intake-esm catalog JSON file. Stored internally as a string.
icechunk_store_path (Path or str): Path or URI at which to create the Icechunk store. Supported schemes: local path, s3://, gs:// / gcs://, az://.
esm_datastore_kwargs (dict, optional): Keyword arguments forwarded to intake.open_esm_datastore.
icechunk_storage_options (dict, optional): Keyword arguments forwarded to the Icechunk storage backend for the target store. See intake_virtual_icechunk.utils._resolve_storage().
xarray_kwargs (list[dict] | dict | None, optional): Keyword arguments forwarded to xarray.open_mfdataset / open_dataset when reading each group’s source files (e.g. {'decode_times': False}). A single dict is applied to every dataset. A list of dicts must have the same length as the number of datasets in the datastore and is applied per dataset. If None, default arguments are used. Combine-related kwargs are dropped when passed to open_dataset.
drop_cols (list[str], optional): Column names in the intake-esm catalog’s assets dataframe to omit from attached Zarr group metadata. The asset path column is always omitted.
cols_to_deiter (list[str], optional): Columns whose deduplicated iterable metadata should be stored as a scalar by taking the first value.
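
The three accepted forms of `xarray_kwargs` (None, a single dict, or a list with one dict per dataset) can be normalised to one kwargs dict per dataset. This is a sketch of that idea with a hypothetical helper name; it does not reproduce the builder's code and does not model the dropping of combine-related kwargs.

```python
from typing import Union


def per_dataset_kwargs(
    xarray_kwargs: Union[list, dict, None], n_datasets: int
) -> list:
    """Expand ``xarray_kwargs`` into one kwargs dict per dataset."""
    if xarray_kwargs is None:
        return [{} for _ in range(n_datasets)]
    if isinstance(xarray_kwargs, dict):
        # Copy so that later edits to one dataset's kwargs do not leak.
        return [dict(xarray_kwargs) for _ in range(n_datasets)]
    if len(xarray_kwargs) != n_datasets:
        raise ValueError(
            f"Got {len(xarray_kwargs)} kwargs dicts for {n_datasets} datasets"
        )
    return [dict(kw) for kw in xarray_kwargs]
```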

Methods

`build`()	Build the Icechunk store by copying real data from the source assets.
`rechunk`([chunks, shards])	Configure re-chunking (and optional zarr v3 sharding) applied at build.

__init__(*, esm_datastore_path, icechunk_store_path, esm_datastore_kwargs=None, icechunk_storage_options=None, xarray_kwargs=None, drop_cols=None, cols_to_deiter=None): Initialise a builder that copies real data into the Icechunk store.

build()

Build the Icechunk store by copying real data from the source assets.

For each dataset group in the intake-esm catalog:

Collects the asset file paths that belong to the group.
Opens the files with xarray.open_mfdataset.
Writes the dataset as a named Zarr group in the Icechunk store.
Writes the group’s groupby_attrs values into .zattrs.

After all groups are written, a JSON sidecar is saved alongside the store for use with from_json(). The sidecar does not contain a virtual chunk container entry because the store holds real data chunks.

Notes

Like the virtual builder path, this implementation records per-group failures in self.failed_list and continues building the remaining groups.
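
The record-and-continue behaviour described in the note can be sketched as a loop that catches per-group errors. The function name, the `write_group` callable and the shape of the entries appended to the failure list are all assumptions made for illustration.

```python
def build_all(groups: dict, write_group) -> list:
    """Write every group, collecting failures instead of stopping."""
    failed_list = []
    for key, paths in groups.items():
        try:
            write_group(key, paths)
        except Exception as exc:  # broad on purpose: one bad group must not abort
            failed_list.append((key, repr(exc)))
    return failed_list
```

Inspecting the returned list after a build shows which groups need to be retried.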

rechunk(chunks=None, shards=None)

Configure re-chunking (and optional zarr v3 sharding) applied at build.

Call before build(). If never called, source chunking is written unchanged. Returns self so calls can be chained.

Parameters:

chunks

On-disk zarr chunk shape. One of:

None: inherit the source file chunking (default).
"auto": let dask pick a shape using its configured array.chunk-size.
a byte size: a string such as "128MiB" or "128" (128 bytes), or an int number of bytes (parsed by dask.utils.parse_bytes()); dask picks a shape of roughly that size.
a mapping {dim: length}: explicit per-dimension chunk lengths; dimensions not listed keep their source chunking.

shards

On-disk zarr v3 shard shape (a shard bundles many chunks into one storage object). One of:

None: no sharding (default).
"auto": target the median on-disk size of the first ~10 source files (≈ one shard object per input file). The value is resolved immediately, when rechunk() is called, rather than at build time.
a byte size: as for chunks.
a mapping {dim: length}: explicit per-dimension shard lengths.

A shard must be an integer multiple of the chunk on every dimension; a shard smaller than the chunk disables sharding for that variable (with a warning).
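
The divisibility rule for shards can be checked per dimension. This is a minimal sketch that assumes both shapes are given as {dim: length} mappings covering the same dimensions; `shard_is_valid` is a hypothetical helper, and the library's handling of invalid shapes is not reproduced here.

```python
def shard_is_valid(chunk: dict, shard: dict) -> bool:
    """True if ``shard`` is a whole multiple of ``chunk`` on every dimension."""
    return all(
        shard[dim] >= length and shard[dim] % length == 0
        for dim, length in chunk.items()
    )
```

For example, with chunks of {'time': 12, 'lat': 90}, a shard of {'time': 120, 'lat': 180} bundles 10 x 2 = 20 chunks per shard.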

class intake_virtual_icechunk.source._containers.VirtualChunkContainerModel(url_prefix, store_type, open_kwargs=<factory>)

Serializable VirtualChunkContainer configuration for catalog sidecars.

Icechunk requires virtual chunk access to be configured explicitly when a repository is reopened. This model stores the non-secret parts of that configuration in the catalog JSON sidecar so read-side consumers can reconstruct an equivalent VirtualChunkContainer later.

Only explicitly safe kwargs are preserved in open_kwargs; credential-like values are intentionally omitted from the serialised form.

Methods

`from_dict`(d)	Construct the model from a dictionary, typically decoded from JSON.
`from_virtual_chunk_container`(vc_container[, ...])	Build a serialisable model from a live Icechunk container.
`to_dict`()	Return a plain dictionary suitable for JSON serialisation.
`to_virtual_chunk_container`()	Recreate an Icechunk `VirtualChunkContainer` from this model.

classmethod from_dict(d)

Construct the model from a dictionary, typically decoded from JSON.

Returns None if d is None.

static from_virtual_chunk_container(vc_container, store_options=None)

Build a serialisable model from a live Icechunk container.

Parameters:

vc_container: The configured Icechunk virtual chunk container.
store_options: Source-store options from the builder. Only keys listed in _VCC_SAFE_KWARGS are retained in the serialised output.
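
Filtering options against an allowlist is what keeps credentials out of the sidecar. The sketch below shows the pattern only: the real contents of _VCC_SAFE_KWARGS are not shown in this reference, so the keys in `SAFE_KWARGS` are placeholders and `keep_safe` is a hypothetical helper.

```python
# Hypothetical allowlist; the actual keys in _VCC_SAFE_KWARGS may differ.
SAFE_KWARGS = frozenset({"region", "endpoint_url", "anonymous"})


def keep_safe(store_options):
    """Return only allowlisted keys, dropping everything else."""
    return {k: v for k, v in (store_options or {}).items() if k in SAFE_KWARGS}
```

An allowlist fails closed: an unrecognised key is dropped by default instead of being serialised by accident.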

__init__(url_prefix, store_type, open_kwargs=<factory>)

to_dict(): Return a plain dictionary suitable for JSON serialisation.

to_virtual_chunk_container(): Recreate an Icechunk VirtualChunkContainer from this model.
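
The to_dict / from_dict round trip, including the documented "None in, None out" behaviour of from_dict, can be sketched with a standard-library dataclass. `ContainerConfig` is a stand-in with the same three fields as the documented model; it is not the model itself and it does not build an Icechunk container.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ContainerConfig:
    """Stand-in with the same three fields as the documented model."""

    url_prefix: str
    store_type: str
    open_kwargs: dict = field(default_factory=dict)

    def to_dict(self) -> dict:
        """Return a plain dict that json.dumps can serialise."""
        return asdict(self)

    @classmethod
    def from_dict(cls, d: Optional[dict]) -> Optional["ContainerConfig"]:
        """Rebuild from a dict; mirrors the documented None passthrough."""
        return None if d is None else cls(**d)
```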