API Reference
intake_virtual_icechunk provides a few core public components:
- intake_virtual_icechunk.core.IcechunkCatalog: the main intake catalog implementation, registered as the virtual_icechunk driver.
- intake_virtual_icechunk._source.IcechunkDataSource: the per-entry data source returned when you index into an IcechunkCatalog.
- intake_virtual_icechunk.cat.VirtualIcechunkCatalogModel: the JSON sidecar model used to persist catalog metadata and reopen a store later.
- intake_virtual_icechunk.source._build.IcechunkStoreBuilder: builds a virtual Icechunk store from a pre-built intake-esm catalog.
- intake_virtual_icechunk.source._containers.VirtualChunkContainerModel: stores enough virtual chunk container configuration to round-trip a catalog safely.
The following API summary is auto-generated.
- class intake_virtual_icechunk.core.IcechunkCatalog(*args, **kwargs)
An intake plugin for reading an Icechunk store built from an intake-esm catalog.
The store contains one Zarr group per dataset, written by IcechunkStoreBuilder. Per-entry metadata (the attributes used for searching) is stored in each group’s .zattrs. This catalog mirrors the esm_datastore API so that switching between the two is straightforward.
Registered as the virtual_icechunk intake driver, so it is accessible via intake.open_virtual_icechunk().
- Parameters:
- store: str
Path or URI to the Icechunk store. Supported schemes: local path, s3://, gs:// or gcs://, az://.
- storage_options: dict, optional
Credential/config keyword arguments forwarded to the Icechunk storage backend (e.g. {'from_env': True} for S3).
- sidecar_options: dict, optional
obstore config kwargs used only to read the JSON sidecar. When omitted, storage_options is reused for the sidecar read.
- xarray_kwargs: dict, optional
Keyword arguments forwarded to xarray.open_zarr().
- virtual_chunk_model: dict or VirtualChunkContainerModel, optional
Pre-loaded virtual chunk container configuration. Supplying this skips the sidecar read and is mainly used by from_json and search.
- catalog_id: str, optional
Catalog identifier loaded from a JSON sidecar.
- intake_kwargs: dict, optional
Additional keyword arguments passed through to Catalog.
- Attributes:
- columns_with_iterables
Return a set of column names that contain iterable values (e.g. lists).
This is needed to know which columns to unpack when doing searches with iterable query values.
- df
Return a DataFrame of all catalog entry metadata.
Each row corresponds to one Zarr group (catalog entry). The key column holds the group path; remaining columns are drawn from each group’s .zattrs as written by IcechunkStoreBuilder.
Methods
from_json(json_file, *[, xarray_kwargs, ...]): Load an IcechunkCatalog from a JSON sidecar file.
keys(): Get keys for the catalog entries (one per top-level Zarr group in the store).
save(name, *[, directory, json_dump_kwargs]): Save a JSON sidecar file pointing to this catalog's Icechunk store.
search(**query): Search for entries in the catalog by matching group .zattrs.
to_dask(*args, **kwargs): Return a dask container for this data source.
to_dataset_dict([xarray_kwargs, ...]): Load catalog entries into a dictionary of xarray Datasets.
to_xarray(**kwargs): Return the catalog as a single xarray Dataset.
unique(): Get the number of unique values for each column in the catalog DataFrame.
Examples
Open a catalog saved by IcechunkStoreBuilder:
>>> import intake
>>> cat = intake.open_virtual_icechunk('/path/to/store')
>>> cat.keys()
['CMIP.BCC.BCC-ESM1.historical', 'CMIP.BCC.BCC-ESM1.ssp585']
>>> ds = cat['CMIP.BCC.BCC-ESM1.historical'].to_xarray()
Or load from a JSON sidecar:
>>> from intake_virtual_icechunk.core import IcechunkCatalog
>>> cat = IcechunkCatalog.from_json('/path/to/catalog.json')
- classmethod from_json(json_file, *, xarray_kwargs=None, storage_options=None)
Load an IcechunkCatalog from a JSON sidecar file.
- Parameters:
- json_file: str
Path or URL to the catalog JSON file produced by save() or IcechunkStoreBuilder.
- xarray_kwargs: dict, optional
Keyword arguments forwarded to xarray.open_zarr().
- storage_options: dict, optional
obstore config kwargs for reading the JSON file itself (not for the Icechunk store; those are embedded in the JSON).
- __init__(store, *, storage_options=None, sidecar_options=None, xarray_kwargs=None, virtual_chunk_model=None, catalog_id=None, **intake_kwargs)
- Parameters:
- entries: dict, optional
Mapping of {name: entry}
- name: str, optional
Unique identifier for catalog. This takes precedence over whatever is stated in the cat file itself. Defaults to None.
- description: str, optional
Description of the catalog. This takes precedence over whatever is stated in the cat file itself. Defaults to None.
- metadata: dict
Additional information about this data
- ttl: float, optional
Lifespan (time to live) of cached modification time. Units are in seconds. Defaults to 1.
- getenv: bool
Whether parameter default fields can take values from the environment.
- getshell: bool
Whether parameter default fields can run shell commands.
- persist_mode: [‘always’, ‘default’, ‘never’]
Defines the use of persisted sources. If ‘always’, a persisted version of a data source is used whenever it exists. If ‘never’, the original source is always used. If ‘default’, persisted sources are used if they have not expired, and are re-persisted and then used if they have.
- storage_options: dict
If using a URL beginning with ‘intake://’ (remote Intake server), parameters to pass to requests when issuing http commands; otherwise parameters to pass to remote backend file-system. Ignored for normal local files.
- keys()
Get keys for the catalog entries (one per top-level Zarr group in the store).
- Returns:
- list of str
Group path keys, one per dataset written by IcechunkStoreBuilder. When this catalog is the result of search(), only the matching keys are returned.
- save(name, *, directory=None, json_dump_kwargs=None)
Save a JSON sidecar file pointing to this catalog’s Icechunk store.
- Parameters:
- name: str
Stem of the output file (without the .json extension).
- directory: str, optional
Directory to write the file to. Defaults to the current working directory.
- json_dump_kwargs: dict, optional
Additional keyword arguments forwarded to json.dump().
- search(**query)
Search for entries in the catalog by matching group .zattrs.
- Parameters:
- require_all_on: str or list of str, optional
If specified, the given column(s) must match all values in the query. Retained mainly for backward compatibility with intake-esm; it may not be kept.
- **query
Each keyword maps to a .zattrs attribute name. The value may be a scalar or a list of allowed values.
- Returns:
- IcechunkCatalog
A new catalog containing only the matching entries. The underlying Icechunk store is shared — it is not re-opened.
Examples
>>> cat.search(source_id='BCC-ESM1')
>>> cat.search(experiment_id=['historical', 'ssp585'])
>>> cat.search(source_id='BCC-ESM1', experiment_id='historical')
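The matching rule described for search (each keyword names a .zattrs attribute, and its value is either a scalar or a list of allowed values) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the matches and search helpers and the sample entries mapping are hypothetical, and the sketch ignores require_all_on and iterable-valued columns.

```python
def matches(attrs, query):
    """Return True if every queried attribute equals one of its allowed values."""
    for name, wanted in query.items():
        # Treat a scalar as a one-element list of allowed values.
        allowed = wanted if isinstance(wanted, (list, tuple, set)) else [wanted]
        if attrs.get(name) not in allowed:
            return False
    return True


def search(entries, **query):
    """Keep only the entries whose attributes satisfy the whole query."""
    return {key: attrs for key, attrs in entries.items() if matches(attrs, query)}


# Hypothetical per-group attributes, keyed by group path.
entries = {
    "CMIP.BCC.BCC-ESM1.historical": {
        "source_id": "BCC-ESM1",
        "experiment_id": "historical",
    },
    "CMIP.BCC.BCC-ESM1.ssp585": {
        "source_id": "BCC-ESM1",
        "experiment_id": "ssp585",
    },
}

hits = search(entries, source_id="BCC-ESM1", experiment_id=["historical", "ssp585"])
one = search(entries, experiment_id="historical")
```

Keywords combine with AND, while the values inside one list combine with OR, which is why the first query keeps both entries and the second keeps one.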
- to_dask(*args, **kwargs)
Return a dask container for this data source
- to_dataset_dict(xarray_kwargs=None, progressbar=True, preprocess=None, storage_options=None)
Load catalog entries into a dictionary of xarray Datasets.
- Parameters:
- xarray_kwargs: dict, optional
Keyword arguments forwarded to xarray.open_zarr(). Merged with (and taking precedence over) the xarray_kwargs supplied at construction time.
- progressbar: bool, optional
If True, display a progress bar while loading datasets.
- preprocess: callable, optional
A callable with the signature preprocess(ds: xr.Dataset) -> xr.Dataset applied to each dataset immediately after loading, mirroring the preprocess argument of xarray.open_mfdataset().
- storage_options: dict, optional
Storage credentials/config merged with (and taking precedence over) the catalog-level storage_options before constructing each data source. Retained for API parity with intake-esm; the already-opened Icechunk store object does not use these options.
- Returns:
- dict of str -> xarray.Dataset
One Dataset per catalog entry, keyed by the group path.
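The precedence rule for xarray_kwargs and the per-dataset preprocess hook can be illustrated without opening a real store. The sketch below is an assumption-laden stand-in: merge_kwargs, load_entries, and fake_opener are hypothetical names, and plain dictionaries take the place of xarray Datasets.

```python
def merge_kwargs(constructor_kwargs, call_kwargs):
    """Merge two optional dicts; keys in call_kwargs override constructor_kwargs."""
    merged = dict(constructor_kwargs or {})
    merged.update(call_kwargs or {})
    return merged


def load_entries(keys, opener, base_kwargs=None, call_kwargs=None, preprocess=None):
    """Open every key with the merged kwargs and apply preprocess to each result."""
    kwargs = merge_kwargs(base_kwargs, call_kwargs)
    result = {}
    for key in keys:
        ds = opener(key, **kwargs)
        if preprocess is not None:
            ds = preprocess(ds)
        result[key] = ds
    return result


def fake_opener(key, **kwargs):
    """Hypothetical stand-in for opening one Zarr group as a dataset."""
    return {"key": key, "kwargs": kwargs}


datasets = load_entries(
    ["a", "b"],
    fake_opener,
    base_kwargs={"chunks": {}, "decode_times": True},
    call_kwargs={"decode_times": False},
    preprocess=lambda ds: {**ds, "preprocessed": True},
)
```

Copying the constructor-level dictionary before updating it keeps the catalog's stored defaults unchanged between calls.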
- to_xarray(**kwargs)
Return the catalog as a single xarray Dataset.
Only valid when the catalog contains exactly one entry.
- Parameters:
- **kwargs
Passed through to to_dataset_dict().
- Returns:
- xarray.Dataset
- Raises:
- ValueError
If the catalog contains zero or more than one entry.
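The single-entry contract can be expressed as a small guard over a dictionary such as the one to_dataset_dict() returns. This is a sketch only; single_dataset is a hypothetical helper and the error message is illustrative rather than the library's wording.

```python
def single_dataset(datasets):
    """Return the only value in a dict, or raise ValueError if there is not exactly one."""
    if len(datasets) != 1:
        raise ValueError(
            f"Expected exactly one catalog entry, found {len(datasets)}."
        )
    # next(iter(...)) takes the sole value without building an intermediate list.
    return next(iter(datasets.values()))


only = single_dataset({"CMIP.BCC.BCC-ESM1.historical": "dataset-placeholder"})
```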
- unique()
Get the number of unique values for each column in the catalog DataFrame.
Iterable-valued columns are exploded before counting so their values are counted individually rather than as whole tuples.
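The effect of exploding iterable-valued columns before counting can be seen in a small standard-library sketch. It assumes catalog rows are plain dictionaries; count_unique and the sample rows are hypothetical and do not use the catalog DataFrame.

```python
def count_unique(rows):
    """Count distinct values per column, flattening list-like cells first."""
    seen = {}
    for row in rows:
        for column, value in row.items():
            # A list-like cell contributes each element; a scalar contributes itself.
            values = value if isinstance(value, (list, tuple, set)) else [value]
            seen.setdefault(column, set()).update(values)
    return {column: len(values) for column, values in seen.items()}


rows = [
    {"source_id": "BCC-ESM1", "variable_id": ["tas", "pr"]},
    {"source_id": "BCC-ESM1", "variable_id": ["tas"]},
    {"source_id": "BCC-ESM1", "variable_id": ["pr"]},
]
counts = count_unique(rows)
```

Counting whole cells would report three distinct variable_id values here, whereas exploding first reports two (tas and pr).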
- property columns_with_iterables
Return a set of column names that contain iterable values (e.g. lists).
This is needed to know which columns to unpack when doing searches with iterable query values.
- property df
Return a DataFrame of all catalog entry metadata.
Each row corresponds to one Zarr group (catalog entry). The key column holds the group path; remaining columns are drawn from each group’s .zattrs as written by IcechunkStoreBuilder.
- class intake_virtual_icechunk._source.IcechunkDataSource(*args, **kwargs)
An intake-compatible Data Source for a single Zarr group in an Icechunk store.
This is the per-entry source returned by IcechunkCatalog when a key is looked up. It mirrors ESMDataSource so the two plugins feel identical to callers.
- Parameters:
- key: str
The catalog key / Zarr group path for this dataset.
- store: icechunk.IcechunkStore
An already-opened, zarr-compatible IcechunkStore. Obtain one via IcechunkCatalog._zarr_store (or icechunk.Repository.open(...).readonly_session('main').store). Passing a pre-opened store avoids re-opening the repository for every data source.
- group: str
Zarr group path within the store to open.
- storage_options: dict, optional
Retained for API compatibility; not used when store is already an IcechunkStore.
- xarray_kwargs: dict, optional
Keyword arguments forwarded to xarray.open_zarr().
- intake_kwargs: dict, optional
Additional keyword arguments passed through to DataSource.
- Attributes:
- ds
The xarray Dataset for this data source.
Methods
close(): Drop the open dataset from memory.
to_xarray(): Return the xarray Dataset (with dask-backed arrays).
- __init__(key, store, group, *, storage_options=None, xarray_kwargs=None, intake_kwargs=None)
- close()
Drop the open dataset from memory.
- to_xarray()
Return the xarray Dataset (with dask-backed arrays).
- property ds
The xarray Dataset for this data source.
- class intake_virtual_icechunk.cat.VirtualIcechunkCatalogModel(*, id='', version='1.0.0', store, virtual_chunk_model=None, description=None, title=None, last_updated=None, storage_options={})
Pydantic model for a Virtual Icechunk catalog sidecar file.
The sidecar JSON is a lightweight pointer to an Icechunk store together with catalog-level metadata. All per-entry (dataset) metadata is stored in each Zarr group’s .zattrs, written by IcechunkStoreBuilder.
Methods
load(json_file[, storage_options]): Load a catalog model from a JSON sidecar file.
save(name, *, store[, json_dump_kwargs]): Save the catalog model to a JSON sidecar file.
Examples
Save a catalog pointer:
>>> from intake_virtual_icechunk.cat import VirtualIcechunkCatalogModel
>>> model = VirtualIcechunkCatalogModel(
...     store='s3://my-bucket/my-catalog.icechunk',
...     virtual_chunk_model=virtual_chunk_model,
...     storage_options={'from_env': True},
...     description='My climate catalog',
... )
>>> from obstore.store import from_url
>>> model.save('my-catalog', store=from_url('file:///path/to/output'))
Load it back:
>>> model = VirtualIcechunkCatalogModel.load('/path/to/output/my-catalog.json')
- classmethod load(json_file, storage_options=None)
Load a catalog model from a JSON sidecar file.
- Parameters:
- json_file: str
Path or URL to the JSON sidecar file.
- storage_options: dict, optional
obstore config kwargs for reading the JSON file itself (e.g. S3 credentials). These are independent of the catalog’s own storage_options, which are stored inside the JSON and used to open the Icechunk store.
- save(name, *, store, json_dump_kwargs=None)
Save the catalog model to a JSON sidecar file.
- Parameters:
- name: str
Stem of the output file. If it ends with ‘.json’, the suffix is stripped and re-added so that the file name always carries exactly one .json extension.
- store: ObjectStore
An obstore store (e.g. S3Store, LocalStore) pointing at the directory into which the sidecar should be written.
- json_dump_kwargs: dict, optional
Additional keyword arguments forwarded to json.dump().
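The file-name handling described for the name parameter amounts to stripping one trailing .json suffix and appending it again. A minimal sketch, assuming nothing about the library's internals (sidecar_filename is a hypothetical helper):

```python
def sidecar_filename(name):
    """Return name with one trailing '.json' stripped (if present) and re-added."""
    suffix = ".json"
    stem = name[: -len(suffix)] if name.endswith(suffix) else name
    return stem + suffix


with_ext = sidecar_filename("my-catalog.json")
without_ext = sidecar_filename("my-catalog")
```

Both calls yield the same file name, so callers need not remember whether the stem already includes the extension.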
- model_config = {'validate_assignment': True}
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- class intake_virtual_icechunk.source._build.IcechunkStoreBuilder(*, esm_datastore_path, icechunk_store_path, esm_datastore_kwargs=None, icechunk_storage_options=None, xarray_kwargs=None, drop_cols=None, cols_to_deiter=None)
Build a real Icechunk store by copying data from an intake-esm datastore.
Given a pre-built intake-esm catalog, this builder iterates over every dataset group in the catalog, opens the constituent files with xarray.open_mfdataset, and writes each dataset as a named Zarr group inside a single Icechunk store. Data chunks are copied into the Icechunk store, so source files do not need to remain accessible once the build is complete.
The resulting store requires no virtual chunk container configuration: it can be opened with IcechunkCatalog directly, without supplying any source-data credentials.
- Parameters:
- esm_datastore_path: Path or str
Path to an existing intake-esm catalog JSON file. Stored internally as a string.
- icechunk_store_path: Path or str
Path or URI at which to create the Icechunk store. Supported schemes: local path, s3://, gs:// or gcs://, az://.
- esm_datastore_kwargs: dict, optional
Keyword arguments forwarded to intake.open_esm_datastore.
- icechunk_storage_options: dict, optional
Keyword arguments forwarded to the Icechunk storage backend for the target store. See intake_virtual_icechunk.utils._resolve_storage().
- xarray_kwargs: dict, optional
Keyword arguments forwarded to xarray.open_mfdataset when reading each group’s source files (e.g. {'decode_times': False}).
- drop_cols: list[str], optional
Column names in the intake-esm catalog’s assets dataframe to omit from attached Zarr group metadata. The asset path column is always omitted.
- cols_to_deiter: list[str], optional
Columns whose deduplicated iterable metadata should be stored as a scalar by taking the first value.
- xarray_kwargs: list[dict] | dict | None
Passed to xarray.open_mfdataset/open_dataset. If a list of dicts is passed, it must be the same length as the number of datasets in the datastore, and each dict is applied to the corresponding dataset. If a single dict is passed, the same arguments are passed to each dataset. If None, default arguments are used. Combine-related kwargs are dropped when passed to open_dataset.
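The three accepted shapes of xarray_kwargs (a list of dicts, a single dict, or None) can be normalised to one dict per dataset. The following is a sketch under stated assumptions, not the builder's code: kwargs_per_dataset is a hypothetical helper, and it does not model the dropping of combine-related kwargs.

```python
def kwargs_per_dataset(xarray_kwargs, n_datasets):
    """Expand xarray_kwargs into a list holding one independent dict per dataset."""
    if xarray_kwargs is None:
        # No kwargs supplied: every dataset uses the defaults.
        return [{} for _ in range(n_datasets)]
    if isinstance(xarray_kwargs, dict):
        # One dict supplied: every dataset receives its own copy.
        return [dict(xarray_kwargs) for _ in range(n_datasets)]
    if len(xarray_kwargs) != n_datasets:
        raise ValueError(
            f"Got {len(xarray_kwargs)} kwargs dicts for {n_datasets} datasets."
        )
    return [dict(kwargs) for kwargs in xarray_kwargs]


shared = kwargs_per_dataset({"decode_times": False}, 3)
per_dataset = kwargs_per_dataset([{"decode_times": False}, {}], 2)
defaults = kwargs_per_dataset(None, 2)
```

Returning copies rather than the caller's dictionaries means later per-dataset adjustments cannot leak between datasets.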
Methods
build(): Build the Icechunk store by copying real data from the source assets.
- __init__(*, esm_datastore_path, icechunk_store_path, esm_datastore_kwargs=None, icechunk_storage_options=None, xarray_kwargs=None, drop_cols=None, cols_to_deiter=None)
Initialise a builder that copies real data into the Icechunk store.
- build()
Build the Icechunk store by copying real data from the source assets.
For each dataset group in the intake-esm catalog:
1. Collects the asset file paths that belong to the group.
2. Opens the files with xarray.open_mfdataset.
3. Writes the dataset as a named Zarr group in the Icechunk store.
4. Writes the group’s groupby_attrs values into .zattrs.
After all groups are written, a JSON sidecar is saved alongside the store for use with from_json(). The sidecar does not contain a virtual chunk container entry because the store holds real data chunks.
Notes
Like the virtual builder path, this implementation records per-group failures in self.failed_list and continues building the remaining groups.
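The record-and-continue behaviour described in the notes can be sketched as a loop that catches per-group errors. Everything here is illustrative: build_groups, fake_write, and the (key, message) shape of the failure records are assumptions, since the actual structure of failed_list is not documented above.

```python
def build_groups(groups, write_group):
    """Write each group in turn; record failures and carry on with the rest."""
    written, failed = [], []
    for key, paths in groups.items():
        try:
            write_group(key, paths)
        except Exception as exc:  # one bad group must not abort the whole build
            failed.append((key, repr(exc)))
            continue
        written.append(key)
    return written, failed


def fake_write(key, paths):
    """Hypothetical writer that fails when a group has no asset files."""
    if not paths:
        raise ValueError(f"no assets for {key}")


written, failed = build_groups(
    {"a": ["a.nc"], "b": [], "c": ["c.nc"]},
    fake_write,
)
```

Inspecting the failure list after the loop lets a caller decide whether a partially built store is acceptable or the failed groups should be retried.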
- class intake_virtual_icechunk.source._containers.VirtualChunkContainerModel(url_prefix, store_type, open_kwargs=<factory>)
Serializable VirtualChunkContainer configuration for catalog sidecars.
Icechunk requires virtual chunk access to be configured explicitly when a repository is reopened. This model stores the non-secret parts of that configuration in the catalog JSON sidecar so read-side consumers can reconstruct an equivalent VirtualChunkContainer later.
Only explicitly safe kwargs are preserved in open_kwargs; credential-like values are intentionally omitted from the serialised form.
Methods
from_dict(d): Construct the model from a dictionary, typically decoded from JSON.
from_virtual_chunk_container(vc_container[, ...]): Build a serialisable model from a live Icechunk container.
to_dict(): Return a plain dictionary suitable for JSON serialisation.
to_virtual_chunk_container(): Recreate an Icechunk VirtualChunkContainer from this model.
- classmethod from_dict(d)
Construct the model from a dictionary, typically decoded from JSON.
Returns None if d is None.
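A JSON round trip through to_dict and from_dict, including the None pass-through, can be sketched with a standard-library dataclass. ContainerConfig is a hypothetical stand-in that borrows the three documented field names; it is not the real model and does not create an Icechunk container.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ContainerConfig:
    """Hypothetical stand-in with the three documented fields."""

    url_prefix: str
    store_type: str
    open_kwargs: dict = field(default_factory=dict)

    def to_dict(self) -> dict:
        """Return a plain dictionary suitable for json.dumps."""
        return asdict(self)

    @classmethod
    def from_dict(cls, d: Optional[dict]) -> Optional["ContainerConfig"]:
        """Rebuild the config from a dictionary; pass None through unchanged."""
        if d is None:
            return None
        return cls(**d)


original = ContainerConfig("s3://my-bucket/", "s3", {"region": "us-east-1"})
restored = ContainerConfig.from_dict(json.loads(json.dumps(original.to_dict())))
missing = ContainerConfig.from_dict(None)
```

Passing None through lets a sidecar without a virtual chunk container entry load with the same code path as one that has it.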
- static from_virtual_chunk_container(vc_container, store_options=None)
Build a serialisable model from a live Icechunk container.
- Parameters:
- vc_container
The configured Icechunk virtual chunk container.
- store_options
Source-store options from the builder. Only keys listed in _VCC_SAFE_KWARGS are retained in the serialised output.
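Retaining only allow-listed keys can be sketched as a dictionary filter. The contents of _VCC_SAFE_KWARGS are not listed here, so the SAFE_KEYS set below is purely hypothetical, as are the keep_safe helper and the sample option names.

```python
# Hypothetical allow-list; the real _VCC_SAFE_KWARGS may contain different keys.
SAFE_KEYS = frozenset({"region", "endpoint_url", "anonymous"})


def keep_safe(store_options, safe_keys=SAFE_KEYS):
    """Return only the allow-listed options, dropping anything else."""
    return {
        key: value
        for key, value in (store_options or {}).items()
        if key in safe_keys
    }


filtered = keep_safe(
    {
        "region": "us-east-1",
        "secret_access_key": "not-for-serialisation",
        "anonymous": False,
    }
)
```

An allow-list fails closed: an option that nobody thought to classify is dropped rather than written into the sidecar.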
- __init__(url_prefix, store_type, open_kwargs=<factory>)
- to_dict()
Return a plain dictionary suitable for JSON serialisation.
- to_virtual_chunk_container()
Recreate an Icechunk VirtualChunkContainer from this model.