# DiskCache Adapter
This adapter uses [diskcache](https://grantjenks.com/docs/diskcache/tutorial.html) to cache node execution on disk. The cache key is a tuple of the function's
`(source code, input a, ..., input n)`. As a result, a function is executed only once for a given combination of
source code hash and inputs. The cache is stored in a directory of your choice and can be shared across different runs of your
code. That way, as you develop, if neither the inputs nor the code have changed, the function is not executed again and
the result is retrieved from the cache instead.
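To make the keying scheme concrete, here is a minimal sketch using only the standard library. It is not the adapter's actual implementation: `make_cache_key` is a hypothetical helper, and the use of SHA-256 and of sorting inputs by name are assumptions made for illustration.

```python
import hashlib


def make_cache_key(source_code: str, **inputs) -> tuple:
    """Hypothetical helper: combine a source-code hash with the input values."""
    code_hash = hashlib.sha256(source_code.encode("utf-8")).hexdigest()
    # Sort by parameter name so keyword order does not change the key.
    return (code_hash, *(inputs[name] for name in sorted(inputs)))


# Two versions of the same node, written as source text for illustration.
source_v1 = "def add(a, b):\n    return a + b\n"
source_v2 = "def add(a, b):\n    return a + b + 1\n"

key_first_call = make_cache_key(source_v1, a=1, b=2)
key_repeat_call = make_cache_key(source_v1, b=2, a=1)
key_after_edit = make_cache_key(source_v2, a=1, b=2)
key_new_input = make_cache_key(source_v1, a=1, b=3)

# Same code and same inputs give the same key, so the cached result is reused.
same_call_hits = key_first_call == key_repeat_call
# Editing the function or changing an input gives a new key, so the node reruns.
code_change_misses = key_first_call != key_after_edit
input_change_misses = key_first_call != key_new_input
```

Because the source hash is part of the key, editing a function invalidates only that function's entries; results for unchanged functions with unchanged inputs stay valid.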
> 💡 This can be a great tool for developing inside a Jupyter notebook or other interactive environments.
`diskcache` provides features to:
- set maximum cache size
- set automated eviction policy once maximum size is reached
- allow custom `Disk` implementations to change the serialization protocol (e.g., pickle, JSON)
> ⚠ The default `Disk` serializes objects using the `pickle` module. Changing Python or library versions could break your
> cache (both keys and values). Learn more about [caveats](https://grantjenks.com/docs/diskcache/tutorial.html#caveats).
> ❓ To store artifacts robustly, please use Hamilton materializers or the
> [CachingGraphAdapter](https://github.com/DAGWorks-Inc/hamilton/tree/main/examples/caching_nodes/caching_graph_adatper) instead.
> The `CachingGraphAdapter` stores tagged nodes directly on the file system using common formats (JSON, CSV, Parquet, etc.).
> However, it isn't aware of your function version and requires you to manually manage your disk space.
# How to use it
## Use the hook
Find it under plugins at `hamilton.plugins.h_diskcache` and add it to your Driver definition.
```python
from hamilton import driver
from hamilton.plugins import h_diskcache
import functions
dr = (
driver.Builder()
.with_modules(functions)
.with_adapters(h_diskcache.DiskCacheAdapter())
.build()
)
```
## Inspect the hook
To inspect the caching behavior in real time, configure the plugin's logger:
```python
import logging

logger = logging.getLogger("hamilton.plugins.h_diskcache")
logger.setLevel(logging.DEBUG)  # or logging.INFO
logger.addHandler(logging.StreamHandler())  # print log records to stderr
```
- `INFO` logs only the total cache size after the Driver executes
- `DEBUG` logs the inputs for each node and states whether the value came `from cache` or was `executed`
## Clear cache
The utility function `h_diskcache.evict_all_except_driver` allows you to clear cached values for all nodes except those in the passed driver.
This is an efficient tool to clear old artifacts as your project evolves.
```python
from hamilton import driver
from hamilton.plugins import h_diskcache
import functions
dr = (
driver.Builder()
.with_modules(functions)
.with_adapters(h_diskcache.DiskCacheAdapter())
.build()
)
h_diskcache.evict_all_except_driver(dr)
```
## Cache settings
Find all the cache settings in the [diskcache docs](https://grantjenks.com/docs/diskcache/api.html#constants).