Back to blogging

It’s been a while since I wrote anything here. My last post was back in mid-2024, and before that… well, let’s just say consistency wasn’t my strongest suit when it came to blogging 😊

If you’ve read any of my older posts, you probably noticed they were mostly about SQL Server — SSMS tips, query tricks, things I stumbled upon while working as a SQL Server and BI consultant. That world was my bread and butter for years, and I genuinely enjoyed sharing small discoveries that made my (and hopefully your) day-to-day work a bit easier.

What changed

Over the last couple of years, my professional life took a pretty significant turn. I moved from being primarily a SQL Server consultant to working as a data engineer in the Azure ecosystem. Databricks, Synapse, Microsoft Fabric — these became my daily tools. I’ve been working on multiple projects, sometimes as a consultant advising teams, but more often as a hands-on team member writing code alongside other engineers.

One thing I particularly loved about SQL Server was performance tuning — digging into execution plans, hunting down missing indexes, figuring out why a query that ran fine yesterday suddenly decided to go rogue. That detective work was genuinely one of my favorite parts of the job. And the good news? That mindset translates beautifully to PySpark. Understanding how Spark distributes work, why a particular join causes a shuffle, or how to read a query plan — it’s a different engine, but the same curiosity applies.

SQL Server is still part of my world, but only at the margins. The center of gravity has shifted.

Why PySpark

The topic that excites me most right now is PySpark. Not in a “let me explain what a DataFrame is” kind of way — there are plenty of introductory resources out there. What I want to write about comes from my own experience of working with real production PySpark code across multiple teams and projects. The patterns I’ve seen, the mistakes we kept making (myself included), and the lessons learned along the way.

I’ve reviewed thousands of lines of PySpark notebooks across various projects, and I keep seeing the same patterns everywhere — code that’s hard to maintain, variables like df reused for completely unrelated datasets, and a general tendency to copy-paste rather than build reusable components. That’s not a sign of bad engineers — it’s just what happens when people focus on getting results fast without established conventions to guide them. I have opinions about this, and I want to share them 😊
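To make the naming point concrete, here’s a tiny sketch of the anti-pattern. It’s written in plain Python (lists of dicts instead of DataFrames) so it stands on its own, but the same idea applies one-to-one to PySpark code; all the names here (orders, refunds, filter_large) are hypothetical, not from any real project.

```python
# Anti-pattern: one generic name ("df") reused for unrelated datasets.
df = [{"order_id": 1, "amount": 100.0}, {"order_id": 2, "amount": 40.0}]
df = [r for r in df if r["amount"] > 50]   # now it's a filtered subset
df = [{"refund_id": 9, "amount": 10.0}]    # same name, completely unrelated data

# Clearer: descriptive names, plus a small reusable function instead of
# copy-pasting the same filter into every notebook cell.
def filter_large(rows, threshold):
    """Keep only rows whose amount exceeds the threshold."""
    return [r for r in rows if r["amount"] > threshold]

orders = [{"order_id": 1, "amount": 100.0}, {"order_id": 2, "amount": 40.0}]
large_orders = filter_large(orders, 50)
refunds = [{"refund_id": 9, "amount": 10.0}]
```

Three lines in, the first version has already lost track of what df means; the second version reads like a sentence and the helper can be reused across datasets.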

A word about AI

I want to be upfront about something. This blog — and especially the upcoming PySpark series — is written with the help of AI tools. Drafting the posts, structuring the content: AI plays a role in all of that. I think it would be dishonest not to mention it.

That said, every post reflects my own experiences from working in PySpark environments. The anti-patterns I’ll describe? I’ve seen them (and sometimes written them) myself. The solutions? I’ve tested them in real projects. AI helps me write blog posts more efficiently, but the substance comes from years of hands-on work. Everything gets reviewed and verified by me personally before it goes live.

What’s next

I’m starting a series on functional programming patterns in PySpark — or rather, on the imperative anti-patterns that plague most Spark notebooks and how to fix them. If you’re a data engineer working with PySpark in Fabric or Databricks, I think you’ll find it useful.

Thanks for reading, and welcome back — both of us 😊