API Reference

intake_virtual_icechunk provides a few core public components:

  • intake_virtual_icechunk.core.IcechunkCatalog: the main intake catalog implementation, registered as the virtual_icechunk driver.

  • intake_virtual_icechunk._source.IcechunkDataSource: the per-entry data source returned when you index into an IcechunkCatalog.

  • intake_virtual_icechunk.cat.VirtualIcechunkCatalogModel: the JSON sidecar model used to persist catalog metadata and reopen a store later.

  • intake_virtual_icechunk.source._build.IcechunkStoreBuilder: builds an Icechunk store from a pre-built intake-esm catalog by copying the source data into it.

  • intake_virtual_icechunk.source._containers.VirtualChunkContainerModel: stores enough virtual chunk container configuration to round-trip a catalog safely.

The following API summary is auto-generated.

class intake_virtual_icechunk.core.IcechunkCatalog(*args, **kwargs)

An intake plugin for reading an Icechunk store built from an intake-esm catalog.

The store contains one Zarr group per dataset, written by IcechunkStoreBuilder. Per-entry metadata (the attributes used for searching) is stored in each group’s .zattrs. This catalog mirrors the esm_datastore API so that switching between the two is straightforward.

Registered as the virtual_icechunk intake driver, so it is accessible via intake.open_virtual_icechunk().

Parameters:
store: str

Path or URI to the Icechunk store. Supported schemes: local path, s3://, gs:// / gcs://, az://.

storage_options: dict, optional

Credential/config keyword arguments forwarded to the Icechunk storage backend (e.g. {'from_env': True} for S3).

sidecar_options: dict, optional

obstore config kwargs used only to read the JSON sidecar. When omitted, storage_options is reused for the sidecar read.

xarray_kwargs: dict, optional

Keyword arguments forwarded to xarray.open_zarr().

virtual_chunk_model: dict or VirtualChunkContainerModel, optional

Pre-loaded virtual chunk container configuration. Supplying this skips the sidecar read and is mainly used by from_json and search.

catalog_id: str, optional

Catalog identifier loaded from a JSON sidecar.

intake_kwargs: dict, optional

Additional keyword arguments passed through to Catalog.
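The scheme list for store and the sidecar_options fallback can be illustrated with a small standalone sketch. It is not the library's own resolution code: the helper names classify_store and resolve_sidecar_options and the backend labels are hypothetical, and only the scheme list and the fallback rule come from the parameter descriptions above.

```python
from urllib.parse import urlparse

# Hypothetical mapping from URI scheme to a backend label. The schemes come
# from the parameter description; the labels are illustrative only.
_SCHEME_TO_BACKEND = {"s3": "s3", "gs": "gcs", "gcs": "gcs", "az": "azure"}


def classify_store(store: str) -> str:
    """Return a backend label for a store path or URI."""
    scheme = urlparse(store).scheme
    if scheme == "":
        # No scheme: treat the value as a local filesystem path.
        return "local"
    if scheme in _SCHEME_TO_BACKEND:
        return _SCHEME_TO_BACKEND[scheme]
    raise ValueError(f"Unsupported store scheme: {scheme!r}")


def resolve_sidecar_options(storage_options=None, sidecar_options=None) -> dict:
    """Reuse storage_options for the sidecar read when sidecar_options is omitted."""
    chosen = storage_options if sidecar_options is None else sidecar_options
    return dict(chosen or {})
```

Note that an explicitly empty sidecar_options dict is treated here as "supplied", so it does not fall back to storage_options; whether the library draws the same distinction is an assumption.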

Attributes:
columns_with_iterables

Return a set of column names that contain iterable values (e.g. lists).

This is needed to know which columns to unpack when doing searches with iterable query values.

df

Return a DataFrame of all catalog entry metadata.

Each row corresponds to one Zarr group (catalog entry). The key column holds the group path; remaining columns are drawn from each group’s .zattrs as written by IcechunkStoreBuilder.

Methods

from_json(json_file, *[, xarray_kwargs, ...])

Load an IcechunkCatalog from a JSON sidecar file.

keys()

Get keys for the catalog entries (one per top-level Zarr group in the store).

save(name, *[, directory, json_dump_kwargs])

Save a JSON sidecar file pointing to this catalog's Icechunk store.

search(**query)

Search for entries in the catalog by matching group .zattrs.

to_dask(*args, **kwargs)

Return a dask container for this data source.

to_dataset_dict([xarray_kwargs, ...])

Load catalog entries into a dictionary of xarray Datasets.

to_xarray(**kwargs)

Return the catalog as a single xarray Dataset.

unique()

Get the number of unique values for each column in the catalog DataFrame.

Examples

Open a catalog saved by IcechunkStoreBuilder:

>>> import intake
>>> cat = intake.open_virtual_icechunk('/path/to/store')
>>> cat.keys()
['CMIP.BCC.BCC-ESM1.historical', 'CMIP.BCC.BCC-ESM1.ssp585']
>>> ds = cat['CMIP.BCC.BCC-ESM1.historical'].to_xarray()

Or load from a JSON sidecar:

>>> from intake_virtual_icechunk.core import IcechunkCatalog
>>> cat = IcechunkCatalog.from_json('/path/to/catalog.json')

classmethod from_json(json_file, *, xarray_kwargs=None, storage_options=None)

Load an IcechunkCatalog from a JSON sidecar file.

Parameters:
json_file: str

Path or URL to the catalog JSON file produced by save() or IcechunkStoreBuilder.

xarray_kwargs: dict, optional

Keyword arguments forwarded to xarray.open_zarr().

storage_options: dict, optional

obstore config kwargs for reading the JSON file itself (not for the Icechunk store — those are embedded in the JSON).

__init__(store, *, storage_options=None, sidecar_options=None, xarray_kwargs=None, virtual_chunk_model=None, catalog_id=None, **intake_kwargs)
Parameters:
entries: dict, optional

Mapping of {name: entry}

name: str, optional

Unique identifier for catalog. This takes precedence over whatever is stated in the cat file itself. Defaults to None.

description: str, optional

Description of the catalog. This takes precedence over whatever is stated in the cat file itself. Defaults to None.

metadata: dict

Additional information about this data

ttl: float, optional

Lifespan (time to live) of cached modification time. Units are in seconds. Defaults to 1.

getenv: bool

Can parameter default fields take values from the environment

getshell: bool

Can parameter default fields run shell commands

persist_mode: [‘always’, ‘default’, ‘never’]

Defines the use of persisted sources. If ‘always’, a persisted version of a data source is used if it exists; if ‘never’, the original source is always used. If ‘default’, persisted sources are used if they have not expired, and are re-persisted and then used if they have.

storage_options: dict

If using a URL beginning with ‘intake://’ (remote Intake server), parameters to pass to requests when issuing http commands; otherwise parameters to pass to remote backend file-system. Ignored for normal local files.

keys()

Get keys for the catalog entries (one per top-level Zarr group in the store).

Returns:
list of str

Group path keys, one per dataset written by IcechunkStoreBuilder. When this catalog is the result of search(), only the matching keys are returned.

save(name, *, directory=None, json_dump_kwargs=None)

Save a JSON sidecar file pointing to this catalog’s Icechunk store.

Parameters:
name: str

Stem of the output file (without the .json extension).

directory: str, optional

Directory to write the file to. Defaults to the current working directory.

json_dump_kwargs: dict, optional

Additional keyword arguments forwarded to json.dump().

search(**query)

Search for entries in the catalog by matching group .zattrs.

Parameters:
require_all_on: str or list of str, optional

If specified, the given column(s) must match all values in the query. Retained mainly for backward compatibility with intake-esm; whether this parameter should be kept is undecided.

**query

Each keyword maps to a .zattrs attribute name. The value may be a scalar or a list of allowed values.

Returns:
IcechunkCatalog

A new catalog containing only the matching entries. The underlying Icechunk store is shared — it is not re-opened.
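The matching rule described for **query can be sketched in plain Python. This is an illustration under stated assumptions, not the catalog's implementation: entries is a hypothetical dict of group key to attribute dict, a list-valued query is read as "any of these values", and an iterable-valued attribute matches when it shares at least one value with the query.

```python
def _as_set(value):
    """Normalise a scalar or an iterable of values to a set."""
    if isinstance(value, (list, tuple, set, frozenset)):
        return set(value)
    return {value}


def matches(attrs: dict, query: dict) -> bool:
    """True if every queried attribute shares at least one value with the query."""
    for name, allowed in query.items():
        if name not in attrs:
            return False
        if not (_as_set(attrs[name]) & _as_set(allowed)):
            return False
    return True


def search(entries: dict, **query) -> dict:
    """Return only the entries whose attributes satisfy every query keyword."""
    return {key: attrs for key, attrs in entries.items() if matches(attrs, query)}


# Hypothetical per-group attributes, keyed by group path.
entries = {
    "CMIP.BCC.BCC-ESM1.historical": {
        "source_id": "BCC-ESM1",
        "experiment_id": "historical",
        "variable_id": ["tas", "pr"],
    },
    "CMIP.BCC.BCC-ESM1.ssp585": {
        "source_id": "BCC-ESM1",
        "experiment_id": "ssp585",
        "variable_id": ["tas"],
    },
}

historical_only = search(entries, experiment_id="historical")
with_precipitation = search(entries, variable_id="pr")
```

Multiple keywords combine with AND, while multiple values for one keyword combine with OR, which is the behaviour the three calls in the Examples below rely on.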

Examples

>>> cat.search(source_id='BCC-ESM1')
>>> cat.search(experiment_id=['historical', 'ssp585'])
>>> cat.search(source_id='BCC-ESM1', experiment_id='historical')

to_dask(*args, **kwargs)

Return a dask container for this data source.

to_dataset_dict(xarray_kwargs=None, progressbar=True, preprocess=None, storage_options=None)

Load catalog entries into a dictionary of xarray Datasets.

Parameters:
xarray_kwargs: dict, optional

Keyword arguments forwarded to xarray.open_zarr(). Merged with (and taking precedence over) the xarray_kwargs supplied at construction time.

progressbar: bool, optional

If True, display a progress bar while loading datasets.

preprocess: callable, optional

A callable with the signature preprocess(ds: xr.Dataset) -> xr.Dataset applied to each dataset immediately after loading, mirroring the preprocess argument of xarray.open_mfdataset().

storage_options: dict, optional

Storage credentials/config merged with (and taking precedence over) the catalog-level storage_options before constructing each data source. Retained for API parity with intake-esm; the already-opened Icechunk store object does not use these options.

Returns:
dict of str -> xarray.Dataset

One Dataset per catalog entry, keyed by the group path.
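The keyword precedence and the preprocess hook can be shown with a stand-in loader. Everything here is a sketch: to_dataset_dict_sketch and open_entry are hypothetical names, and plain dicts stand in for xarray Datasets so the example runs without any store.

```python
def to_dataset_dict_sketch(keys, open_entry, base_kwargs=None,
                           xarray_kwargs=None, preprocess=None):
    """Open every key, letting call-time kwargs override construction-time ones."""
    # Later entries win in a dict merge, so call-time values take precedence.
    merged = {**(base_kwargs or {}), **(xarray_kwargs or {})}
    datasets = {}
    for key in keys:
        ds = open_entry(key, **merged)
        if preprocess is not None:
            # Applied immediately after loading, once per dataset.
            ds = preprocess(ds)
        datasets[key] = ds
    return datasets


def open_entry(key, **kwargs):
    """Stand-in for opening one Zarr group; returns a dict, not a Dataset."""
    return {"key": key, "kwargs": kwargs}


result = to_dataset_dict_sketch(
    ["a", "b"],
    open_entry,
    base_kwargs={"chunks": {}, "decode_times": True},
    xarray_kwargs={"decode_times": False},
    preprocess=lambda ds: {**ds, "preprocessed": True},
)
```

Here decode_times ends up False for every entry because the call-time value overrides the one given at construction, while chunks is carried over unchanged.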

to_xarray(**kwargs)

Return the catalog as a single xarray Dataset.

Only valid when the catalog contains exactly one entry.

Parameters:
**kwargs

Passed through to to_dataset_dict().

Returns:
xarray.Dataset

Raises:
ValueError

If the catalog contains zero or more than one entry.
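A minimal sketch of the single-entry rule, assuming the datasets have already been loaded into a dict as to_dataset_dict() returns them. The helper name only_dataset and the error message are hypothetical.

```python
def only_dataset(datasets: dict):
    """Return the sole value of a one-entry mapping, else raise ValueError."""
    if len(datasets) != 1:
        raise ValueError(
            f"Expected exactly one catalog entry, found {len(datasets)}; "
            "narrow the catalog with search() first."
        )
    return next(iter(datasets.values()))
```

In practice this means a catalog is usually narrowed with search() until one key remains before to_xarray() is called.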

unique()

Get the number of unique values for each column in the catalog DataFrame.

Iterable-valued columns are exploded before counting so their values are counted individually rather than as whole tuples.
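The explode-then-count behaviour can be sketched without pandas. The rows below are hypothetical catalog metadata, and the helper names are illustrative; the real methods operate on the catalog DataFrame.

```python
def is_iterable_value(value) -> bool:
    """Treat lists, tuples and sets as iterable values; strings stay scalar."""
    return isinstance(value, (list, tuple, set, frozenset))


def columns_with_iterables(rows) -> set:
    """Names of columns that hold at least one iterable value."""
    return {col for row in rows for col, value in row.items()
            if is_iterable_value(value)}


def nunique(rows) -> dict:
    """Count distinct values per column, exploding iterable values first."""
    seen = {}
    for row in rows:
        for col, value in row.items():
            values = value if is_iterable_value(value) else [value]
            seen.setdefault(col, set()).update(values)
    return {col: len(values) for col, values in seen.items()}


rows = [
    {"source_id": "BCC-ESM1", "variable_id": ["tas", "pr"]},
    {"source_id": "BCC-ESM1", "variable_id": ["tas"]},
]
counts = nunique(rows)
```

Without the explode step, variable_id would be counted as two distinct whole values (one per row); with it, the count reflects the two individual variables tas and pr.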

property columns_with_iterables

Return a set of column names that contain iterable values (e.g. lists).

This is needed to know which columns to unpack when doing searches with iterable query values.

property df

Return a DataFrame of all catalog entry metadata.

Each row corresponds to one Zarr group (catalog entry). The key column holds the group path; remaining columns are drawn from each group’s .zattrs as written by IcechunkStoreBuilder.
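The row layout can be pictured as one record per group, with the group path under key and the remaining fields copied from that group's attributes. This sketch uses a list of dicts rather than a DataFrame, and group_attrs is hypothetical sample data.

```python
def build_rows(group_attrs: dict) -> list:
    """One record per group: the group path under 'key', then its attributes."""
    # Assumes no attribute is itself named 'key'; if one were, it would
    # overwrite the group path in this simple merge.
    return [{"key": key, **attrs} for key, attrs in group_attrs.items()]


group_attrs = {
    "CMIP.BCC.BCC-ESM1.historical": {"source_id": "BCC-ESM1", "experiment_id": "historical"},
    "CMIP.BCC.BCC-ESM1.ssp585": {"source_id": "BCC-ESM1", "experiment_id": "ssp585"},
}
rows = build_rows(group_attrs)
```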

class intake_virtual_icechunk._source.IcechunkDataSource(*args, **kwargs)

An intake-compatible Data Source for a single Zarr group in an Icechunk store.

This is the per-entry source returned by IcechunkCatalog when a key is looked up. It mirrors ESMDataSource so the two plugins feel identical to callers.

Parameters:
key: str

The catalog key / Zarr group path for this dataset.

store: icechunk.IcechunkStore

An already-opened, zarr-compatible IcechunkStore. Obtain one via IcechunkCatalog._zarr_store (or icechunk.Repository.open(...).readonly_session('main').store). Passing a pre-opened store avoids re-opening the repository for every data source.

group: str

Zarr group path within the store to open.

storage_options: dict, optional

Retained for API compatibility; not used when store is already an IcechunkStore.

xarray_kwargs: dict, optional

Keyword arguments forwarded to xarray.open_zarr().

intake_kwargs: dict, optional

Additional keyword arguments passed through to DataSource.

Attributes:
ds

The xarray Dataset for this data source.

Methods

close()

Drop the open dataset from memory.

to_xarray()

Return the xarray Dataset (with dask-backed arrays).

__init__(key, store, group, *, storage_options=None, xarray_kwargs=None, intake_kwargs=None)

close()

Drop the open dataset from memory.

to_xarray()

Return the xarray Dataset (with dask-backed arrays).

property ds

The xarray Dataset for this data source.
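One plausible reading of ds, to_xarray() and close() is an open-on-first-access cache that close() clears. The sketch below shows that pattern only; LazySource and opener are hypothetical names, and whether IcechunkDataSource caches in exactly this way is an assumption.

```python
class LazySource:
    """Open the dataset on first access, cache it, and drop it on close()."""

    def __init__(self, key, opener):
        self.key = key
        self._opener = opener  # callable that opens the dataset for a key
        self._ds = None

    @property
    def ds(self):
        if self._ds is None:
            self._ds = self._opener(self.key)
        return self._ds

    def to_xarray(self):
        return self.ds

    def close(self):
        # Drop the cached object; the next access opens it again.
        self._ds = None


calls = []


def opener(key):
    """Stand-in opener that records each call and returns a plain dict."""
    calls.append(key)
    return {"group": key}


source = LazySource("CMIP.BCC.BCC-ESM1.historical", opener)
first = source.to_xarray()
second = source.ds  # served from the cache, no second open
```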

class intake_virtual_icechunk.cat.VirtualIcechunkCatalogModel(*, id='', version='1.0.0', store, virtual_chunk_model=None, description=None, title=None, last_updated=None, storage_options={})

Pydantic model for a Virtual Icechunk catalog sidecar file.

The sidecar JSON is a lightweight pointer to an Icechunk store together with catalog-level metadata. All per-entry (dataset) metadata is stored in each Zarr group’s .zattrs, written by IcechunkStoreBuilder.

Methods

load(json_file[, storage_options])

Load a catalog model from a JSON sidecar file.

save(name, *, store[, json_dump_kwargs])

Save the catalog model to a JSON sidecar file.

Examples

Save a catalog pointer:

>>> from obstore.store import from_url
>>> from intake_virtual_icechunk.cat import VirtualIcechunkCatalogModel
>>> model = VirtualIcechunkCatalogModel(
...     store='s3://my-bucket/my-catalog.icechunk',
...     virtual_chunk_model=virtual_chunk_model,  # defined elsewhere; defaults to None
...     storage_options={'from_env': True},
...     description='My climate catalog',
... )
>>> model.save('my-catalog', store=from_url('file:///path/to/output'))

Load it back:

>>> model = VirtualIcechunkCatalogModel.load('/path/to/output/my-catalog.json')

classmethod load(json_file, storage_options=None)

Load a catalog model from a JSON sidecar file.

Parameters:
json_file: str

Path or URL to the JSON sidecar file.

storage_options: dict, optional

obstore config kwargs for reading the JSON file itself (e.g. S3 credentials). These are independent of the catalog’s own storage_options, which are stored inside the JSON and used to open the Icechunk store.
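To make the two kinds of storage_options concrete, the sketch below writes and re-reads a sidecar-like JSON document with the standard library. The field names are taken from the constructor signature above; the exact on-disk layout the model produces is an assumption, and the values are illustrative.

```python
import json
import tempfile
from pathlib import Path

# Field names follow the VirtualIcechunkCatalogModel signature; the values
# and the flat layout are assumptions made for illustration.
sidecar = {
    "id": "my-catalog",
    "version": "1.0.0",
    "store": "s3://my-bucket/my-catalog.icechunk",
    "virtual_chunk_model": None,
    "description": "My climate catalog",
    "title": None,
    "last_updated": None,
    # Options for opening the Icechunk store travel inside the JSON ...
    "storage_options": {"from_env": True},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "my-catalog.json"
    path.write_text(json.dumps(sidecar, indent=2))
    # ... whereas options for reading this file are supplied by the caller
    # and are not stored anywhere in the document.
    loaded = json.loads(path.read_text())
```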

save(name, *, store, json_dump_kwargs=None)

Save the catalog model to a JSON sidecar file.

Parameters:
name: str

Stem of the output file. A trailing ‘.json’ is stripped and then re-added, so the output file name always ends with a single .json extension.

store: ObjectStore

An obstore store (e.g. S3Store, LocalStore) pointing at the directory into which the sidecar should be written.

json_dump_kwargs: dict, optional

Additional keyword arguments forwarded to json.dump().
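The handling of name can be sketched as a one-line normalisation. The function name sidecar_filename is hypothetical, and the sketch assumes Python 3.9 or later for str.removesuffix.

```python
def sidecar_filename(name: str) -> str:
    """Strip one trailing '.json' if present, then append '.json'."""
    return name.removesuffix(".json") + ".json"
```

Both 'my-catalog' and 'my-catalog.json' therefore map to the same file name, my-catalog.json.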

model_config = {'validate_assignment': True}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class intake_virtual_icechunk.source._build.IcechunkStoreBuilder(*, esm_datastore_path, icechunk_store_path, esm_datastore_kwargs=None, icechunk_storage_options=None, xarray_kwargs=None, drop_cols=None, cols_to_deiter=None)

Build a real Icechunk store by copying data from an intake-esm datastore.

Given a pre-built intake-esm catalog, this builder iterates over every dataset group in the catalog, opens the constituent files with xarray.open_mfdataset, and writes each dataset as a named Zarr group inside a single Icechunk store. Data chunks are copied into the Icechunk store, so source files do not need to remain accessible once the build is complete.

The resulting store requires no virtual chunk container configuration: it can be opened with IcechunkCatalog directly, without supplying any source-data credentials.

Parameters:
esm_datastore_path: Path or str

Path to an existing intake-esm catalog JSON file. Stored internally as a string.

icechunk_store_path: Path or str

Path or URI at which to create the Icechunk store. Supported schemes: local path, s3://, gs:// / gcs://, az://.

esm_datastore_kwargs: dict, optional

Keyword arguments forwarded to intake.open_esm_datastore.

icechunk_storage_options: dict, optional

Keyword arguments forwarded to the Icechunk storage backend for the target store. See intake_virtual_icechunk.utils._resolve_storage().

xarray_kwargs: list[dict] | dict | None, optional

Keyword arguments forwarded to xarray.open_mfdataset or xarray.open_dataset when reading each group’s source files (e.g. {'decode_times': False}). A single dict is applied to every dataset. A list of dicts must have the same length as the number of datasets in the datastore and is applied per dataset. If None, default arguments are used. Combine-related keyword arguments are dropped when they would be passed to open_dataset.

drop_cols: list[str], optional

Column names in the intake-esm catalog’s assets dataframe to omit from attached Zarr group metadata. The asset path column is always omitted.

cols_to_deiter: list[str], optional

Columns whose deduplicated iterable metadata should be stored as a scalar by taking the first value.
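The xarray_kwargs rules (one dict for all datasets, or one dict per dataset, with combine-related keys dropped for open_dataset) can be sketched as follows. The helper names are hypothetical, the set of "combine-related" keys is an assumed subset of the open_mfdataset-only arguments, and the rule that a single-file group goes through open_dataset is also an assumption.

```python
# Assumed subset of keyword names that xarray.open_mfdataset accepts but
# xarray.open_dataset does not; the builder's actual list may differ.
_COMBINE_KWARGS = frozenset(
    {"combine", "concat_dim", "compat", "join", "data_vars", "coords", "combine_attrs"}
)


def kwargs_per_dataset(xarray_kwargs, n_datasets: int) -> list:
    """Expand None, a dict, or a list of dicts into one dict per dataset."""
    if xarray_kwargs is None:
        return [{} for _ in range(n_datasets)]
    if isinstance(xarray_kwargs, dict):
        return [dict(xarray_kwargs) for _ in range(n_datasets)]
    if len(xarray_kwargs) != n_datasets:
        raise ValueError(
            f"Got {len(xarray_kwargs)} kwargs dicts for {n_datasets} datasets"
        )
    return [dict(kwargs) for kwargs in xarray_kwargs]


def kwargs_for_open(kwargs: dict, n_files: int) -> dict:
    """Drop combine-related keys when a single file would use open_dataset."""
    if n_files > 1:
        return dict(kwargs)
    return {k: v for k, v in kwargs.items() if k not in _COMBINE_KWARGS}
```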

Methods

build()

Build the Icechunk store by copying real data from the source assets.

__init__(*, esm_datastore_path, icechunk_store_path, esm_datastore_kwargs=None, icechunk_storage_options=None, xarray_kwargs=None, drop_cols=None, cols_to_deiter=None)

Initialise a builder that copies real data into the Icechunk store.

build()

Build the Icechunk store by copying real data from the source assets.

For each dataset group in the intake-esm catalog:

  1. Collects the asset file paths that belong to the group.

  2. Opens the files with xarray.open_mfdataset.

  3. Writes the dataset as a named Zarr group in the Icechunk store.

  4. Writes the group’s groupby_attrs values into .zattrs.
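The metadata written in step 4 interacts with drop_cols and cols_to_deiter. The sketch below shows one way the per-group attributes could be derived from asset rows; it assumes each column's values are deduplicated into a list across the group's rows, which the reference does not spell out, and the helper name, the path column name and the sample rows are all hypothetical.

```python
def group_attrs(rows, path_col="path", drop_cols=(), cols_to_deiter=()):
    """Collect deduplicated column values for one group of asset rows."""
    skip = {path_col, *drop_cols}  # the asset path column is always omitted
    attrs = {}
    for row in rows:
        for col, value in row.items():
            if col in skip:
                continue
            values = attrs.setdefault(col, [])
            if value not in values:  # deduplicate while preserving order
                values.append(value)
    for col in cols_to_deiter:
        if attrs.get(col):
            attrs[col] = attrs[col][0]  # store a scalar: the first value
    return attrs


rows = [
    {"path": "a.nc", "source_id": "BCC-ESM1", "variable_id": "tas", "version": "v1"},
    {"path": "b.nc", "source_id": "BCC-ESM1", "variable_id": "pr", "version": "v1"},
]
attrs = group_attrs(rows, drop_cols=["version"], cols_to_deiter=["source_id"])
```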

After all groups are written, a JSON sidecar is saved alongside the store for use with from_json(). The sidecar does not contain a virtual chunk container entry because the store holds real data chunks.

Notes

Like the virtual builder path, this implementation records per-group failures in self.failed_list and continues building the remaining groups.
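The record-and-continue behaviour can be sketched as a loop that catches per-group errors. This is a generic illustration: build_all and write_group are hypothetical, and the shape of each failed_list entry (a key plus the error text) is an assumption.

```python
def build_all(groups: dict, write_group):
    """Write every group, recording failures instead of aborting the build."""
    written = []
    failed_list = []
    for key, paths in groups.items():
        try:
            write_group(key, paths)
        except Exception as exc:
            # Record the failure and move on to the next group.
            failed_list.append((key, repr(exc)))
            continue
        written.append(key)
    return written, failed_list


def write_group(key, paths):
    """Stand-in writer that fails for groups with no source files."""
    if not paths:
        raise FileNotFoundError(f"no source files for {key}")


written, failed_list = build_all(
    {"group.a": ["a1.nc", "a2.nc"], "group.b": [], "group.c": ["c1.nc"]},
    write_group,
)
```

After a build it is therefore worth inspecting failed_list, since a store can be created successfully while still missing some groups.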

class intake_virtual_icechunk.source._containers.VirtualChunkContainerModel(url_prefix, store_type, open_kwargs=<factory>)

Serializable VirtualChunkContainer configuration for catalog sidecars.

Icechunk requires virtual chunk access to be configured explicitly when a repository is reopened. This model stores the non-secret parts of that configuration in the catalog JSON sidecar so read-side consumers can reconstruct an equivalent VirtualChunkContainer later.

Only explicitly safe kwargs are preserved in open_kwargs; credential-like values are intentionally omitted from the serialised form.

Methods

from_dict(d)

Construct the model from a dictionary, typically decoded from JSON.

from_virtual_chunk_container(vc_container[, ...])

Build a serialisable model from a live Icechunk container.

to_dict()

Return a plain dictionary suitable for JSON serialisation.

to_virtual_chunk_container()

Recreate an Icechunk VirtualChunkContainer from this model.

classmethod from_dict(d)

Construct the model from a dictionary, typically decoded from JSON.

Returns None if d is None.
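The to_dict() / from_dict() round trip, including the None passthrough, can be sketched with a plain dataclass. ContainerConfig is a hypothetical stand-in that borrows the three field names from the signature above; the real model may be implemented differently, and the field values are illustrative.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ContainerConfig:
    """Stand-in with the same three fields as the documented signature."""

    url_prefix: str
    store_type: str
    open_kwargs: dict = field(default_factory=dict)

    def to_dict(self) -> dict:
        return asdict(self)

    @classmethod
    def from_dict(cls, d):
        # A sidecar without a container entry decodes to None; pass it through.
        if d is None:
            return None
        return cls(**d)


config = ContainerConfig("s3://source-bucket/", "s3", {"region": "us-east-1"})
restored = ContainerConfig.from_dict(json.loads(json.dumps(config.to_dict())))
```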

static from_virtual_chunk_container(vc_container, store_options=None)

Build a serialisable model from a live Icechunk container.

Parameters:
vc_container

The configured Icechunk virtual chunk container.

store_options

Source-store options from the builder. Only keys listed in _VCC_SAFE_KWARGS are retained in the serialised output.
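The allow-list filtering applied to store_options can be sketched as below. The contents of SAFE_KWARGS are hypothetical and do not reproduce the library's _VCC_SAFE_KWARGS; the point is only that keys outside the allow-list, such as credentials, never reach the serialised form.

```python
# Hypothetical allow-list; the real _VCC_SAFE_KWARGS may contain other keys.
SAFE_KWARGS = frozenset({"region", "endpoint_url", "anonymous"})


def keep_safe(store_options, safe=SAFE_KWARGS) -> dict:
    """Keep only allow-listed keys so secrets are not written to the sidecar."""
    return {k: v for k, v in (store_options or {}).items() if k in safe}


open_kwargs = keep_safe(
    {"region": "us-east-1", "access_key_id": "AKIA...", "secret_access_key": "..."}
)
```

An allow-list is the conservative choice here: an unrecognised option is dropped by default rather than leaked, at the cost of readers having to supply it again when reopening the store.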

__init__(url_prefix, store_type, open_kwargs=<factory>)

to_dict()

Return a plain dictionary suitable for JSON serialisation.

to_virtual_chunk_container()

Recreate an Icechunk VirtualChunkContainer from this model.