The anti-vendor-lock-in warehouse
Peter Hicks
I have always felt a modicum of self-reproach about all the inadequate vendors I've kept around because the marginal cost of leaving felt like such a burden that I never wanted to incur the extra short-term work. It's a bananas way to live, possessed by the demons of the sunk-cost fallacy. It is also how a remarkable amount of modern data infrastructure gets purchased, renewed, defended in planning meetings, and eventually embalmed in architecture diagrams for worse or worse.
Spark should not be impossible
For Spark, we did not originally set out to roll our own runtime. We just wanted people to run their PySpark jobs against warehouse tables with OpenLineage and OpenTelemetry turned on, but we ended up spending too much time on eternally boring tasks: assembling the runtime, authenticating to the warehouse, loading the right dependencies, debating virtual environment bundling, configuring VPCs, egressing logs to CloudWatch, and adjusting the PySpark code itself to emit useful telemetry. That initial burden was the product lesson. The hard part, for too many teams, is not writing the PySpark logic. It is making the execution environment, catalog, storage, and lineage transport agree with each other long enough for the useful work to begin.
So our managed serverless runtime is designed to be the structured starting point we wanted for ourselves. You should be able to bring a PySpark job, submit it through the CLI, and get lineage, logs, tables, and operational visibility without spending the remainder of the month negotiating with IAM permissions, dependency jars, warehouse paths, and transport layers.
And if your workload outgrows our managed runtime, you should be able to move. Your team might need persistent clusters for cache-heavy jobs, custom Spark images, private network routes, stricter tenancy controls, direct access to internal services, or executor-level tuning that a shared serverless runtime may not accommodate. In that case, the path is not to rewrite the pipeline to fit oleander. It is to run Spark in your own environment, add the OpenLineage Spark integration, point it at the same catalog and lineage endpoint, and keep the graph intact while the compute moves under your control.
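As a rough sketch of what that hand-off can look like, the snippet below configures a self-hosted PySpark session with the OpenLineage Spark listener over its HTTP transport. Treat the integration version, endpoint URL, and namespace as placeholders to swap for your own values, not as oleander's documented setup.

```python
from pyspark.sql import SparkSession

# Sketch: a self-hosted Spark session that keeps emitting OpenLineage events.
# The package version, lineage endpoint, and namespace below are placeholders.
spark = (
    SparkSession.builder
    .appName("self_hosted_pipeline")
    # OpenLineage's Spark integration, resolved from Maven at startup
    .config("spark.jars.packages", "io.openlineage:openlineage-spark_2.12:1.25.0")
    .config("spark.extraListeners", "io.openlineage.spark.agent.OpenLineageSparkListener")
    # HTTP transport pointed at the same lineage endpoint the managed runtime reports to
    .config("spark.openlineage.transport.type", "http")
    .config("spark.openlineage.transport.url", "https://your-lineage-endpoint.example.com")
    .config("spark.openlineage.namespace", "prod")
    .getOrCreate()
)

# The PySpark job body itself does not change; reads and writes through the
# same catalog keep contributing to the same lineage graph.
```

The managed path, meanwhile, stays a single CLI call: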
# write your code, bundle up your env, and ship...
oleander spark jobs submit entrypoint.py \
  --namespace prod \
  --name first_oleander_task \
  --mode STREAMING \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.7,org.postgresql:postgresql:42.7.4 \
  --wait

A data lake without the hostage situation
The Lake side follows the same principle. We manage an Iceberg catalog for each organization because most teams do not want to begin their day by becoming catalog administrators and learning about the nuances of table compaction. Managed tables, query access, scheduled work, and lineage capture are all there from the start.
We provide unrestricted access to the oleander-managed Iceberg catalog and let you bring your own if you're up for the infra work. The tables are Iceberg tables. The metadata is Iceberg metadata. If you need to connect outside systems, inspect metadata directly, bring your own query engine, or migrate to your own catalog strategy later, that is a supported path rather than an awkward negotiation.
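For a feel of what outside access can look like, here is a hedged sketch using pyiceberg, assuming the catalog is reachable over the Iceberg REST protocol; the endpoint URI and token are placeholders, and the table name just reuses the example below.

```python
from pyiceberg.catalog import load_catalog

# Sketch: reading a managed Iceberg table from outside, assuming the catalog
# is exposed over the Iceberg REST protocol. URI and token are placeholders.
catalog = load_catalog(
    "oleander",
    **{
        "type": "rest",
        "uri": "https://your-catalog-endpoint.example.com",
        "token": "YOUR_API_TOKEN",
    },
)

table = catalog.load_table("default.sf_311")

# Plain Iceberg metadata and data files: inspect the schema, scan with any engine.
print(table.schema())
rows = table.scan(limit=100).to_arrow()
```

Swapping pyiceberg for DuckDB, Trino, or your own Spark cluster is the same idea: it is all just Iceberg.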
-- Transfer all data from the oleander-managed Iceberg table `oleander.default.sf_311`
-- into a custom Iceberg catalog as `my_catalog.my_db.sf_311`.
-- There it is; you own your data...
CREATE TABLE my_catalog.my_db.sf_311 AS
SELECT *
FROM oleander.default.sf_311;

The context graph is the product
The important layer is not that a Spark task ran in our runtime or in yours. The durable artifact is the knowledge graph created along the way. No amount of AI prompting can replace it: without proper instrumentation and reasonable data modeling, there is simply no context to hand your LLM.
That means the graph implementation has to respect open boundaries too. Lineage should not disappear when compute changes hands. It should not fork into a private dialect whenever a team uses a different execution engine. It should be possible to reason about a table written by Spark, queried by DuckDB, or scheduled by dbt or Airflow.
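To make that concrete, a single OpenLineage run event is just a small JSON document naming a run, a job, and its input and output datasets. The values below are illustrative, but the shape is the same whichever engine emits it.

```python
import json
from datetime import datetime, timezone
from uuid import uuid4

# Illustrative OpenLineage run event: a job finishes and declares what it read
# and wrote. Names and namespaces here are examples, not fixed conventions.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid4())},
    "job": {"namespace": "prod", "name": "first_oleander_task"},
    "inputs": [{"namespace": "oleander", "name": "default.sf_311"}],
    "outputs": [{"namespace": "oleander", "name": "default.sf_311_counts"}],
    "producer": "https://github.com/OpenLineage/OpenLineage",
    "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json",
}

print(json.dumps(event, indent=2))
```

Because the Spark, dbt, Airflow, and other integrations all emit this same shape, a table written by one engine and read by another stays a single node in the graph.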
Our bet is that the data warehouse should become less of a walled city and more of a cartography project with modular mappings.
Education is power
OpenLineage is, at its core, just a schema, but it comes with its own lexicon and context-specific integrations. So a significant part of oleander is educational. We try to explain the tooling, publish examples, show how Spark instrumentation works, document Iceberg catalog access, and make the graph model feel less mystical. Our success depends on the community's willingness to adopt and learn the standard.
Here's what we have thus far:
- OpenLineage validator: Validate your OpenLineage events against our implementation, which is backed by a fairly extensive test suite across several integrations.
- Lineage graph visualizer: See what a list of lineage events looks like as a graph.
- Parquet: Inspect and navigate Parquet schemas directly; helpful for debugging or for learning the format.
You can find and use all of these tools without registering (registration is free, of course), so you can learn and experiment before committing to anything. If you're just getting started with OpenLineage, PySpark, or Iceberg and want to see how the pieces fit together, they're a great place to start.