Data retention policies
The data plane uses an object store bucket (S3, GCS, or ABS) to hold the raw data of your workflows — files, directories, dataframes, models, and other large values offloaded from task inputs and outputs. Bucket-level retention and lifecycle policies (such as S3 lifecycle rules) on this bucket purge raw data and can break historical executions, caches, and trace-based recovery.
Where metadata vs. raw data lives
It helps to be precise about what “metadata” means here, because the term is used in two very different ways elsewhere in the industry.
Metadata in Union.ai lives in the control plane database. It includes:
- Task, trigger and app definitions (including their default input values).
- Execution status, history, schedules, and audit trail.
- Pointers (URIs) to each run’s inputs.pb / outputs.pb: the database records where each task’s inputs and outputs live, not the values themselves.
Raw data lives in the data plane object-store bucket. It includes:
- flyte.io.File / flyte.io.Dir contents.
- flyte.io.DataFrame payloads.
- Models and other pickled / large objects.
- Deck data and artifact payloads.
- Trace checkpoints.
Every task’s inputs and outputs are serialized to inputs.pb / outputs.pb in the data plane object-store bucket, and the database stores only a pointer (URI) to them. Within those files, small values (primitives, small dataclasses) are inlined by value, while values too large to inline — flyte.io.File, flyte.io.DataFrame, models, and similar — are offloaded to separate objects in the bucket and referenced by URI. Either way, the values themselves live in the data plane; the control plane holds only the pointer.
Impact of raw data loss
A retention policy that purges raw data leaves the metadata in the control plane database intact, but the references it holds to offloaded values become dangling pointers. The effects:
| Area | Impact |
|---|---|
| UI and APIs | Execution detail views still render (status, timing, structure all come from the DB), but input/output previews for offloaded values, Deck views, and artifact payload links resolve to purged blobs and fail with “resource not found.” |
| Execution engine | Re-runs or downstream tasks that consume a purged upstream output fail at runtime. In-flight tasks that depend on a node whose output was just purged fail. |
| Caching | A cache hit may resolve to a pointer whose underlying raw data has been purged, producing cache misses, task re-execution, or failure. |
| Traces | Trace checkpoints used by @flyte.trace for fine-grained recovery are stored in the bucket; if purged, resume-from-checkpoint is not possible for affected executions. |
| Operations | The DB record of what ran and when, the pointers to each task’s inputs/outputs, and the small inline values in inputs.pb/outputs.pb are preserved. The large offloaded inputs/outputs are lost wherever the raw data has been purged. |
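
The dangling-pointer effect described above can be checked for ahead of time. The following is a minimal sketch, not a Union.ai API: `find_dangling`, the `exists` callback, and the example URIs are all hypothetical. In practice, `exists` would wrap an object-store existence check such as a HEAD request.

```python
from typing import Callable, Iterable


def find_dangling(uris: Iterable[str], exists: Callable[[str], bool]) -> list[str]:
    """Return the URIs whose underlying object is no longer in the bucket."""
    return [uri for uri in uris if not exists(uri)]


# Hypothetical pointers, standing in for what the control plane database records.
recorded = [
    "s3://example-bucket/run-1/inputs.pb",
    "s3://example-bucket/run-1/outputs.pb",
    "s3://example-bucket/run-1/offloaded/model.pkl",
]

# Stand-in for the bucket contents after a lifecycle rule purged the offloaded object.
surviving = {
    "s3://example-bucket/run-1/inputs.pb",
    "s3://example-bucket/run-1/outputs.pb",
}

dangling = find_dangling(recorded, surviving.__contains__)
print(dangling)
```

Here the database-side pointers all still exist, but one of them no longer resolves to an object, which is the situation that surfaces as “resource not found” in the UI.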
Applying retention deliberately
Retention on the raw-data bucket is a legitimate cost-management strategy as long as you accept the trade-off: historical executions that referenced purged data will no longer be re-runnable from cache or recoverable from trace checkpoints, and their UI views will show missing blobs. New executions are unaffected.
The trade-offs matter most for:
- Cached task outputs — purging the cached raw data invalidates the cache; affected tasks re-execute on the next call.
- Trace checkpoints — purging prevents resume-from-checkpoint for executions whose checkpoints have aged out.
- Historical execution previews — purged raw data will show as “resource not found” in the UI even though the DB still has the rest of the record.
Data correctness is not silently violated: re-runs read from current raw data, and the DB record is the source of truth for what executed. You’re trading off the ability to recover or inspect old offloaded values.
Designing lifecycle rules
The Union.ai data plane organizes execution data under a single configured storage prefix, with sub-prefixes per project, domain, run, and action. Two broad categories of object share this layout:
- Execution working files: inputs.pb and outputs.pb per run/attempt, Deck HTML reports, and similar small per-execution artifacts. These are required for in-flight workflows to complete and for historical-execution input/output and Deck previews to render. Despite some legacy naming conventions, this is not Union.ai metadata in the customer-facing sense; that lives in the control plane database (see above).
- Offloaded raw data: flyte.io.File / flyte.io.Dir contents, flyte.io.DataFrame payloads, checkpoint data, and other values too large to inline. By default these land under the same configured storage prefix; they can be routed elsewhere per run via flyte.with_runcontext(raw_data_path=...) (see Run context).
When designing S3 lifecycle rules (or the GCS/ABS equivalent), scope expiration to the offloaded raw-data subpaths rather than applying a bucket-wide rule. The execution working files (inputs.pb, outputs.pb, Decks) must remain durable for in-flight executions to complete and for historical-execution previews to render. Typical patterns are rules scoped to domain/project prefixes, or to per-run raw-data paths that have been routed to dedicated buckets via raw_data_path.
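As an illustration of a prefix-scoped rule, the sketch below builds an S3 lifecycle configuration in the shape that boto3's `put_bucket_lifecycle_configuration` accepts. The helper name `scoped_expiration_rule`, the prefix, the rule ID, and the bucket name are assumptions made for the example; the actual raw-data subpaths depend on how your data plane is configured.

```python
def scoped_expiration_rule(prefix: str, days: int, rule_id: str) -> dict:
    """Build an S3 lifecycle configuration that expires objects under one prefix only."""
    if not prefix:
        # An empty prefix matches every object, i.e. a bucket-wide rule.
        raise ValueError("refusing to build a bucket-wide rule; pass a non-empty prefix")
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Expiration": {"Days": days},
            }
        ]
    }


# Hypothetical prefix: substitute the raw-data subpath used by your deployment.
config = scoped_expiration_rule(
    prefix="raw-data/my-project/development/",
    days=30,
    rule_id="expire-dev-raw-data",
)
print(config)

# Applying the rule requires boto3 and AWS credentials, so it is not executed here:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-raw-data-bucket",
#     LifecycleConfiguration=config,
# )
```

Note that `put_bucket_lifecycle_configuration` replaces the bucket's entire lifecycle configuration, so any existing rules need to be included in the same call.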
Validate any retention rule in a non-production environment before applying it broadly. For the full developer-facing map of what’s in the bucket, see Where your data lives.
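One lightweight way to validate a rule before applying it is a dry run against an object listing. The sketch below is an approximation under stated assumptions: `would_expire` is a hypothetical helper, the `exec/` and `raw/` key layout is invented for illustration, and the age check ignores provider-specific details such as S3 rounding expiration to the next midnight UTC.

```python
from datetime import datetime, timedelta, timezone


def would_expire(objects, prefix, days, now):
    """Return the keys a prefix-scoped expiration rule would delete (approximate)."""
    cutoff = now - timedelta(days=days)
    return [
        key
        for key, last_modified in objects
        if key.startswith(prefix) and last_modified <= cutoff
    ]


now = datetime(2024, 6, 1, tzinfo=timezone.utc)

# Hypothetical listing of (key, last_modified) pairs from a non-production bucket.
objects = [
    ("exec/run-1/inputs.pb", now - timedelta(days=90)),
    ("exec/run-1/outputs.pb", now - timedelta(days=90)),
    ("raw/run-1/model.pkl", now - timedelta(days=90)),
    ("raw/run-2/frame.parquet", now - timedelta(days=5)),
]

scoped = would_expire(objects, prefix="raw/", days=30, now=now)
bucket_wide = would_expire(objects, prefix="", days=30, now=now)

print("scoped rule deletes:", scoped)
print("bucket-wide rule deletes:", bucket_wide)
```

Comparing the two results shows why scoping matters: in this listing the scoped rule touches only aged offloaded data, while the bucket-wide rule would also remove inputs.pb and outputs.pb.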