<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>PySpark on KaPa Consulting</title><link>https://kapa-consulting.sk/categories/pyspark/</link><description>Recent content in PySpark on KaPa Consulting</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Fri, 10 Apr 2026 10:00:00 +0100</lastBuildDate><atom:link href="https://kapa-consulting.sk/categories/pyspark/index.xml" rel="self" type="application/rss+xml"/><item><title>PySpark Functional Programming: Stop Writing Imperative Spark Pipelines</title><link>https://kapa-consulting.sk/post/2026/04/2026-04-10-pyspark-functional-programming-intro/</link><pubDate>Fri, 10 Apr 2026 10:00:00 +0100</pubDate><guid>https://kapa-consulting.sk/post/2026/04/2026-04-10-pyspark-functional-programming-intro/</guid><description>&lt;p>In my recent project I ran into a situation where I had to review a set of PySpark notebooks in Microsoft Fabric — 14 notebooks, some of them over 3000 lines long, hundreds of cells, multiple data domains crammed into a single file. The code worked, but reading it felt like archaeology. Every notebook started the same way: &lt;code>df = spark.read...&lt;/code>, then &lt;code>df = df.withColumn(...)&lt;/code> repeated dozens of times, sprinkled with &lt;code>display(df)&lt;/code> calls and bare &lt;code>except:&lt;/code> blocks. 
I kept asking myself — how did we end up writing Spark code like this?&lt;/p></description></item><item><title>Notebooks in Production vs Spark Job Definitions — Which One Should You Use?</title><link>https://kapa-consulting.sk/post/2026/03/2026-03-24-pyspark-notebooks-in-production-vs-spark-jobs/</link><pubDate>Tue, 10 Mar 2026 18:00:00 +0100</pubDate><guid>https://kapa-consulting.sk/post/2026/03/2026-03-24-pyspark-notebooks-in-production-vs-spark-jobs/</guid><description>&lt;p>While working on multiple projects across all major Spark platforms — Apache Spark, Databricks, Azure Synapse, and most recently Microsoft Fabric — I noticed one thing that stays remarkably consistent: notebooks are by far the most popular way developers write and run their PySpark code. They feel fast and interactive during development — and they are, for exploration. What most people do not realize until later is that notebooks carry hidden overhead that adds up in production. But by then, the code is already written in a notebook, so naturally&amp;hellip; they just deploy the notebook(s). Most of the time they do so without even knowing there are other, arguably better, alternatives; or they know, but do not want to rewrite the code.&lt;/p></description></item></channel></rss>