Where your data lives
When you run a Flyte task, your data ends up in two stores: a database in the control plane and an object-store bucket in the data plane. This page is the developer-facing map of which is which, and clears up the word “metadata,” which Flyte uses for several unrelated things.
The two stores
| Control-plane database | Data-plane object store | |
|---|---|---|
| Backing tech | Postgres (plus a few internal coordination stores) | S3, GCS, or ABS bucket |
| What’s in it | Every record Flyte uses to describe your runs, plus pointers to where each run’s inputs and outputs live | Every run’s inputs and outputs, and all bulk/offloaded content |
| Lifetime | Durable; long-lived history | Durable, but you can apply lifecycle/retention rules |
Who manages each store depends on your deployment model:
- Serverless — both the control-plane database and the data-plane bucket are managed by Union.ai.
- BYOC and Self-managed — the data-plane bucket lives in your cloud account; the control-plane database is managed by Union.ai (BYOC) or by you (Self-managed).
The database is the source of truth for what executed. The bucket is where your runs’ actual input and output values live.
What goes in the database
The control-plane database holds everything Flyte needs to enumerate, schedule, and replay your work. Specifically:
- Registrations — every task you’ve deployed, every trigger you’ve registered, every project and domain. A task’s definition includes its default input values, which are stored inline as part of the registration.
- Execution records — every run, every action (task / trace / condition) inside that run, attempts, phases, timing, error messages, parent/child relationships.
- Schedules and triggers —
Cron, event triggers, and their revision history. - Pointers to runtime inputs and outputs — the database stores the URI of each run’s
inputs.pb/outputs.pb, not the values themselves. (One exception: an awaited condition / approval action stores the value that satisfies it inline.) - Caches — the cache key → output-URI mapping for
@env.task(cache=...).
The values your tasks actually pass at runtime — even a bare int — do not live in the database. They are written to inputs.pb / outputs.pb in the bucket, and the database keeps only the pointer. See the next section.
(Internally, Flyte uses several backing databases — Postgres for registrations and run history, separate stores for in-flight action coordination and caches. For developer purposes the only thing that matters is that they’re all small-record, structured stores; none of them hold bulk content.)
What goes in the bucket
Every run’s inputs and outputs are written to the bucket as inputs.pb / outputs.pb, and the database stores a pointer (URI) to them. Within those files, small scalar values are inlined directly while large values are offloaded to separate objects and referenced by URI. The bucket holds:
- Task inputs, serialized as
inputs.pbper run. - Task outputs, serialized as
outputs.pbper attempt. - Offloaded values —
flyte.io.File,flyte.io.Dir,flyte.io.DataFrame, pickled objects, models, anything large. - Decks — the HTML reports your task renders.
- Trace checkpoints — used by
@flyte.traceto resume partial work. - Fast-registered code bundles — what
flyte deployandflyte run --copy-style allupload so the cluster can run your local Python. - Image-build contexts — when Union.ai builds a container image from an
Imagedefinition that requires a build context.
The layout under your bucket is <project>/<domain>/..., with the bulk of execution artifacts under per-run, per-action subprefixes (<run-name>/<action>/... for outputs / Decks / checkpoints) and sibling prefixes for offloaded inputs and SDK uploads (code bundles, image-build contexts). You don’t typically need to know the exact paths; you do need to know that everything above lives behind one configured bucket prefix.
What “metadata” means
The word “metadata” appears in several places and means a different thing each time. The two senses that matter for developers:
1. “Metadata” as in the control-plane database (Flyte’s usage)
When Flyte documentation says “metadata is preserved” or “metadata lives in the control plane,” it means the database records above: registrations (including task default values), run history, and status. It does not mean “the contents of the bucket.”
This is the sense most relevant to you: the database is durable, and losing the bucket does not lose your execution history — it loses the large values those history records pointed at.
2. “Metadata bucket” (a deployment/ops term you may see)
The Helm chart and some operational guides refer to a “metadata bucket” or metadataContainer. This is a legacy name. The bucket it refers to does not hold the database-style metadata above — it holds inputs.pb, outputs.pb, Decks, checkpoints, code bundles, and offloaded data. In other words, it holds exactly the “bucket” contents listed in the previous section.
If you see “metadata bucket” in an ops context, read it as “the data-plane object-store bucket.” The naming is unfortunate; the contents are what you’d expect from a data bucket.
You can largely ignore other appearances of the word in API surfaces (TaskMetadata, ActionMetadata, and metadata_path on RunContext, which is a local scratch directory used only by from_local() execution) — those are small property bags or local scratch paths and don’t change where your data is stored.
Per-run customization: raw_data_path
By default, offloaded values (File, Dir, DataFrame, checkpoints) land alongside everything else under the deployment’s configured bucket prefix. You can route them to a different prefix — including a different bucket entirely — for a single run:
import flyte
flyte.init_from_config()
run = flyte.with_runcontext(
raw_data_path="s3://my-other-bucket/some/prefix",
).run(my_task, x=1)This is the supported way to send a sensitive run to an isolated bucket, point at a bucket with different lifecycle rules, or otherwise route offloaded data per run. The inputs.pb / outputs.pb themselves still land in the deployment’s bucket; only the raw offloaded contents move.
See
Run context for the full set of with_runcontext options.
What happens if the bucket is purged
If a retention rule deletes objects out of the bucket, the database records that pointed at them are not deleted — but their pointers now dangle. Concretely:
- Execution history, status, timing, structure: still visible in the UI. They come from the database.
- Input/output previews, Deck views, artifact payloads: show “not found” if the underlying bytes were purged.
- Cache hits for purged outputs: the cached pointer is dead, the task re-executes.
- Trace resumption: not possible if the checkpoint blob is gone.
- Re-running an old execution: fails if any input it needs has been purged.
This is the trade-off behind retention policies: you save storage cost at the price of being able to inspect or re-run old executions whose offloaded values have aged out. New executions are unaffected.
Lifecycle / retention rules should be scoped to the offloaded-data prefixes, not applied bucket-wide — inputs.pb and outputs.pb are needed for in-flight executions to complete, so purging them mid-run breaks things.
The short version
- Database = the system of record. Holds registrations (including task default values), run history, schedules, and pointers to each run’s inputs/outputs.
- Bucket = the object-store bucket. Holds every run’s
inputs.pb/outputs.pb, Decks, checkpoints, code bundles, and offloadedFile/Dir/DataFramecontents. - “Metadata” in docs usually means database-side records. “Metadata bucket” in Helm/ops is legacy naming for the data-plane bucket — it does not hold database metadata.
flyte.with_runcontext(raw_data_path=...)is your knob to send offloaded data elsewhere per run.