Spark & Iceberg

Peter Hicks

staff
Tags: spark, iceberg, openlineage, oleander, data-engineering

After a January riddled with pneumonia, folly, and an extended period of decerealization (the inability to taste cereal) brought on by the aforementioned ailments, we are pleased to finally bring our managed Spark & Iceberg to the masses! For a little more context on the chaos before this launch, see "i remain confuddled". This is the culmination of years of work in the open source community with OpenLineage. As far as we know, it is the first solution to unify compute, OpenLineage metadata, data (catalogs and a query engine), logs, and observability under a single umbrella.

CLI

The primary access point for running Spark workflows is our new CLI, which syncs to oleander pipelines & lineage by default. It's never been easier to get a powerful PySpark execution environment. In addition to managing Spark, you can also run queries and connect your catalogs to your own DuckDB instance. For more information, check out our documentation.

    # install the cli
    brew tap OleanderHQ/tap
    brew install oleander-cli

    # yep, that's kind of it... bring your spark task and run with it.
    oleander spark jobs submit process_poisonous_flowers --wait

Lake lineage compatibility

Datasets you interact with in Spark are exactly the same as the ones you interact with in our data lake, which provides full lineage coverage across all data operations inside the oleander system. When you write or schedule a query, the provenance of its output is tracked indefinitely, even if that output is never used again. Any query written in Spark or raw DuckDB SQL is instrumented, and its lineage is derived on your behalf.

    -- just write queries on spark derived datasets in Iceberg
    SELECT
        f.species_name,
        f.toxicity_level,
        r.region_name,
        c.classification,
        COUNT(e.incident_id) AS incident_count,
        AVG(e.severity_score) AS avg_severity
    FROM iceberg.default.flowers f
    JOIN iceberg.default.regions r ON f.region_id = r.id
    LEFT JOIN iceberg.default.classifications c ON f.classification_id = c.id
    LEFT JOIN iceberg.default.incidents e ON f.species_id = e.species_id
    WHERE f.toxicity_level >= 7
      AND r.climate_zone = 'temperate'
    GROUP BY f.species_name, f.toxicity_level, r.region_name, c.classification
    HAVING COUNT(e.incident_id) > 0
    ORDER BY f.toxicity_level DESC, avg_severity DESC
    LIMIT 10;
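To make "lineage derived on your behalf" a little more concrete, here is a rough sketch of what a derived lineage record for a query like the one above could look like. It is not how oleander derives lineage internally: the regex table matcher is a deliberately naive stand-in for a real SQL parser (it would trip over CTEs, subqueries, or `EXTRACT(... FROM ...)`), and the namespace, producer URI, job name, output table, and the spec version in `schemaURL` are all assumptions made for illustration. Only the overall event shape (`eventType`, `run`, `job`, `inputs`, `outputs`) follows the OpenLineage run event format.

```python
import json
import re
import uuid
from datetime import datetime, timezone

# Naive matcher for the table name that follows FROM or JOIN.
# A real integration would use a SQL parser instead of a regex.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)

# Shortened version of the query above.
QUERY = """
SELECT f.species_name, r.region_name, c.classification,
       COUNT(e.incident_id) AS incident_count
FROM iceberg.default.flowers f
JOIN iceberg.default.regions r ON f.region_id = r.id
LEFT JOIN iceberg.default.classifications c ON f.classification_id = c.id
LEFT JOIN iceberg.default.incidents e ON f.species_id = e.species_id
GROUP BY f.species_name, r.region_name, c.classification
"""


def input_tables(sql):
    """Return the distinct tables a query reads, in order of first appearance."""
    tables = []
    for name in TABLE_REF.findall(sql):
        if name not in tables:
            tables.append(name)
    return tables


def lineage_event(sql, job_name, output_table, namespace="example-namespace"):
    """Build an OpenLineage-style COMPLETE run event as a plain dict."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        # Hypothetical producer URI, for illustration only.
        "producer": "https://example.com/lineage-sketch",
        # The spec version in this URL is an assumption.
        "schemaURL": "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent",
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": t} for t in input_tables(sql)],
        "outputs": [{"namespace": namespace, "name": output_table}],
    }


if __name__ == "__main__":
    # The job name and output table are hypothetical.
    event = lineage_event(
        QUERY, "toxic_flower_report", "iceberg.default.toxic_flower_report"
    )
    print(json.dumps(event, indent=2))
```

The point of the sketch is the shape of the record: every input table the query touches and the dataset it produces end up attached to a single run, which is what lets the output's provenance be traced later.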

Bring your own catalog (or not)

We provide both a DuckLake and an Iceberg catalog for each oleander organization, but that doesn't stop users from adding their own new or existing self-managed catalogs and scheduling syncs of existing data into the oleander platform. At present, only Iceberg catalogs can be used from within Spark, but that should be powerful enough to run complex data operations.
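For readers who haven't wired an Iceberg catalog into Spark before, the sketch below shows the generic Spark properties involved when the catalog speaks the Iceberg REST protocol. How oleander registers catalogs on your behalf isn't covered here and may need none of this; the catalog name, URI, and warehouse path are hypothetical, and the REST catalog type is an assumption. The property names themselves are the standard Iceberg Spark runtime settings.

```python
from typing import Dict, List, Optional


def iceberg_catalog_conf(
    name: str, uri: str, warehouse: Optional[str] = None
) -> Dict[str, str]:
    """Spark properties that register an Iceberg REST catalog under `name`."""
    prefix = f"spark.sql.catalog.{name}"
    conf = {
        # Enables Iceberg's SQL extensions (MERGE INTO, CALL procedures, ...).
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
        # Registers the catalog implementation under the chosen name.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
    }
    if warehouse:
        conf[f"{prefix}.warehouse"] = warehouse
    return conf


def as_submit_flags(conf: Dict[str, str]) -> List[str]:
    """Render the properties as spark-submit style `--conf key=value` pairs."""
    flags: List[str] = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


if __name__ == "__main__":
    # Hypothetical catalog name and endpoint.
    conf = iceberg_catalog_conf("my_catalog", "https://catalog.example.com/api")
    print(" ".join(as_submit_flags(conf)))
```

With a catalog registered this way, tables are addressed as `my_catalog.<namespace>.<table>`, the same three-part naming used by `iceberg.default.flowers` in the query above.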

Give it a try

There is no commitment needed to try oleander out. We're confident that you'll love the oleander ecosystem much more than EMR or Dataproc, as we do for our own workflows and queries.