i remain confuddled

Peter Hicks

staff
Tags: opinion, satire, spark, aws, iceberg, EMR

I've been working adjacent to the data engineering realm for around 6 years now and am now certain that I know less than when I got started. The obsessive emergence of open table formats has broken my fragile spirit as a human person who aspires to have substandard USA freedom to die and/or get addicted to synthetic opioids™ healthcare for when they contract the superflu, an ear infection, & pneumonia at the same time and spend the better part of January 2026 in a fever dream semi-conscious state. Never have I felt more listless (unclear if related to my hallucinatory state of being) than when perusing endless series of AI-proliferated documentation that strings buzzwords together better than any b-school middle manager I've ever been trapped by at a developer conference. I've been reduced to the point where I now describe myself as a gardener at the wondrous garden startup 'oleander', so I at least don't have to palate the nonchalant resentment of strangers when sauntering around the botanical gardens in San Francisco for trampling upon the cultural fiber of the city.

The word salad of concatenations between data, house, ware, lake, table, format, catalog, and manifest has left me nonplussed as to how anyone finds meaning in any natural language at all. I have come full circle to the point where I don't wish to speak about software anymore since I'm fearful of joining the misinformation parade due to my own lack of understanding on these matters.

My frustration culminated when trying to deploy a spark cluster to EMR on AWS on top of Iceberg tables. It was a 3 day journey fraught with failure, seemingly random VPC settings, unsafely typed spark configuration smashed together with sketchy runtime Jar imports, IAM permission overrides, and 4 other separate AWS services, all for what should be standard functionality in any organization. Amazon unironically calls their catalog service Glue as an ode to how they gave up attempting to have a reasonable, fully cohesive product and instead ended up with a mess of services needed to catalog, query, map reduce, and store data. It left me with a lingering sensation of despair that this is all the industry is offering to people after the MapReduce revolution of yesteryear. You can find the blog post where I attempted to veil my contempt for this whole process: Chat stats with Spark & OpenLineage.
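For the uninitiated, the configuration sprawl I'm complaining about looks something like the sketch below: a spark-submit invocation wiring Iceberg to the Glue catalog. The `--conf` keys are real Iceberg/Spark knobs, but the runtime version, bucket, catalog name, and job file are placeholders I made up for illustration — and none of it covers the VPC, IAM, or EMR plumbing that ate the other two and a half days.

```shell
# A representative (not prescriptive) spark-submit for Iceberg on EMR with
# AWS Glue as the catalog. Versions and names below are placeholders.
spark-submit \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.3 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
  --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
  --conf spark.sql.catalog.my_catalog.warehouse=s3://example-bucket/warehouse \
  my_job.py
# ...and this is before the VPC endpoints, the Glue IAM policy, the S3 IAM
# policy, and the EMR instance profile that has to tie them all together.
```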

I tried to distill table formats into what I think is a reasonable experience with our oleander lake, without the transcendental super-genius aura of some of our industry thought leaders. I am just your average, below average landscaper who prefers a zen garden I suppose. I think what we've got is just about the best place to get started to have an interactive, fully featured data lake that you can grow with as you learn more.

I'm at the point where I'm considering rm -rfing our entire roadmap in favor of just managing Spark, mostly to save myself and others from the misery of the AWS console, helm charts, k8s configs, and the horror of Terraform files.

i think i will release this in the coming weeks... stay tuned...

PS: I'm also adding some superserious, supercilious, & supersillious company goals:

  1. Getting oleander employees to correctly not capitalize the o in oleander and subsequently not capitalize random nouns like data engineer. until then i will indiscriminately not capitalize letters that should be capitalized to compensate for this.
  2. Find a few tormented souls like myself that have experienced these same Spark & Iceberg configuration nightmares.
  3. Have a product that does not require configs.
  4. Stumble upon some small semblance of authenticity still lingering on the internet in my profession.
  5. Convincing my co-workers that using AI (or the even worse "agentic") as an adjective is a violent form of customer antipathy. oleander customers are aware enough to know that LLM magic 8 ball context injection is not an accurate enough solution for data lineage and provenance, and pretending it is is disingenuous. Guarantees require data modeling, caching, query optimization, and all the other stuff we've been doing in our careers as engineers.