Lake is (a)live

Willy Lulciuc

staff
Tags: duckdb, openlineage, data-lake, sql, collaboration

Lake with us

A few weeks ago, we released the precursor to this: our DuckDB WebAssembly Parquet viewer for public use. That project was born from the annoyance that we couldn’t find a good way to read and filter large(ish) Parquet files in the browser.

Along the way, we realized DuckDB can run inside serverless functions as ephemeral compute backed by object storage. We also found and augmented DuckLake, which fits our infrastructure a bit better than Iceberg or Delta for registering and tracking tables.

Now you have a place to just write SQL and have all your actions captured out of the box, for free, for all of history. You no longer have to wonder how data percolated from table to table; it’s all tracked.
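To make "tracked" a little more concrete, here is a minimal sketch of the idea, not our actual implementation. It assumes a hypothetical `LineageLog` helper, uses Python's built-in `sqlite3` as a stand-in for DuckDB so it runs with no extra dependencies, and has the caller declare each query's input tables by hand (a real system would derive them from the query itself). The table names are made up for illustration.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class LineageLog:
    """Hypothetical append-only record of which tables each derived table came from."""
    # output table -> list of (input tables, SQL text) entries, in run order
    edges: dict = field(default_factory=dict)

    def run(self, con, output, inputs, select_sql):
        """Materialize `output` from `select_sql` and record its declared inputs."""
        con.execute(f"CREATE TABLE {output} AS {select_sql}")
        self.edges.setdefault(output, []).append((tuple(inputs), select_sql))

    def sources(self, table):
        """Walk upstream and return the root tables that `table` derives from."""
        if table not in self.edges:
            return {table}  # never produced by a tracked query, so it is a source
        roots = set()
        for inputs, _sql in self.edges[table]:
            for parent in inputs:
                roots |= self.sources(parent)
        return roots


# sqlite3 stands in for DuckDB here; the example data is invented.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (city TEXT, fare REAL)")
con.executemany("INSERT INTO trips VALUES (?, ?)",
                [("SF", 12.0), ("SF", 8.0), ("NYC", 20.0)])
con.execute("CREATE TABLE cities (city TEXT, state TEXT)")
con.executemany("INSERT INTO cities VALUES (?, ?)", [("SF", "CA"), ("NYC", "NY")])

log = LineageLog()
log.run(con, "fares_by_city", ["trips"],
        "SELECT city, SUM(fare) AS total FROM trips GROUP BY city")
log.run(con, "fares_by_state", ["fares_by_city", "cities"],
        "SELECT c.state, f.total FROM fares_by_city f JOIN cities c USING (city)")

# Which original tables does the final table trace back to?
print(sorted(log.sources("fares_by_state")))  # ['cities', 'trips']
```

Because every derived table is created through `run`, asking where `fares_by_state` came from is a simple walk over the recorded edges back to the tables nobody derived.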

Collaborative joins

We spent time perusing Hugging Face datasets to test some of our OpenLineage features on Spark and Airflow pipelines, and were left wondering: what if users could simply access shared datasets instead of duplicating them in private environments or lakes?

Within our lake, we support public and private datasets that can be shared across users and even organizations. Use of any dataset will always have a paper trail back to the original source as it was first contributed to oleander. That source table remains queryable and immutable once it’s added to the ecosystem, which preserves long-term data fidelity.

There’s also voting and search to help people find datasets relevant to their purpose. This is partially a social experiment, but we're excited to see where it goes.

A few words on moderation

Because this is a light social network for data, we’ll keep an eye out for harmful datasets uploaded to the community. Low-quality datasets or public datasets orphaned for a long period of time will be cleared out of the lake according to our retention policy.

So please: upload data, query it, collaborate with others, and lay waste to it.
