PySpark Functional Programming: Stop Writing Imperative Spark Pipelines
In my recent project I ran into a situation where I had to review a set of PySpark notebooks in Microsoft Fabric — 14 notebooks, some of them over 3000 lines long, hundreds of cells, multiple data domains crammed into a single file. The code worked, but reading it felt like archaeology. Every notebook started the same way: df = spark.read..., then df = df.withColumn(...) repeated dozens of times, sprinkled with display(df) calls and bare except: blocks. I kept asking myself — how did we end up writing Spark code like this?
That question turned into a deep dive. I analyzed those 14 production notebooks line by line and found 78 distinct anti-pattern instances across 11 categories. The patterns were remarkably consistent — the same mistakes appearing in notebook after notebook, project after project. And the root cause was always the same: we’re writing imperative code on top of a framework that was designed to be functional.
This post is the introduction to a series where I’ll walk through the most common functional programming violations in PySpark pipelines and show how to fix them. But first, let’s understand why functional programming isn’t just academic theory for Spark — it’s how Spark was designed to work.
Spark was built for functional programming
Most data engineers don’t think of themselves as functional programmers, but every time you write a PySpark transformation, you’re already using functional concepts. Here’s why.
Spark DataFrames are immutable. When you call .filter(), .withColumn(), or .select(), Spark doesn’t modify the existing DataFrame — it creates a new one. The original remains unchanged. This is fundamental to Spark’s design since the RDD API days.
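The same contract is easy to see in plain Python with an immutable value type. This is a toy analogy, not Spark code: each "transformation" on a frozen dataclass returns a new object and leaves the original untouched, exactly the behaviour .withColumn() has on a DataFrame.

```python
from dataclasses import dataclass, replace

# Toy analogy for DataFrame immutability: a "transformation" on a frozen
# dataclass returns a NEW instance; the original is never modified.
@dataclass(frozen=True)
class Record:
    name: str
    status: str

original = Record(name="alice", status="inactive")
updated = replace(original, status="active")  # returns a new Record

# `original` is untouched, just like a DataFrame after .withColumn()
```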
Spark uses lazy evaluation. Transformations don’t execute immediately — they build a DAG of operations. Nothing actually happens until you call an action like .collect(), .count(), or .write(). This means Spark’s Catalyst optimizer can rearrange, prune, and fuse your operations before a single byte is processed.
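Python generators give a feel for this, as a loose analogy: building the pipeline runs nothing, and work happens only when a terminal operation consumes it.

```python
# Loose analogy for lazy evaluation: a generator pipeline builds a plan,
# and nothing executes until an "action" (list, sum, next) consumes it.
executed = []

def numbers():
    for n in range(5):
        executed.append(n)   # side effect so we can observe execution
        yield n

pipeline = (n * 2 for n in numbers() if n % 2 == 0)  # nothing has run yet
assert executed == []                                # still lazy

result = list(pipeline)                              # the "action" runs it all
```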
And since PySpark 3.0, we have the .transform() method — a built-in way to chain custom DataFrame → DataFrame functions. This is functional composition at the API level. Microsoft Fabric runs on Spark 3.4+, so this is available everywhere on the platform.
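To see what .transform() buys you mechanically, here is a minimal pure-Python stand-in. The Frame class below is hypothetical, not Spark: its transform method just applies a Frame-to-Frame function and returns the result, which is the whole trick that makes custom steps chainable.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]

class Frame:
    """Hypothetical stand-in for a DataFrame, showing the .transform() shape."""
    def __init__(self, rows: List[Row]):
        self.rows = rows

    def transform(self, func: "Callable[[Frame], Frame]") -> "Frame":
        # Mirrors DataFrame.transform: apply a Frame -> Frame function.
        return func(self)

def keep_active(frame: Frame) -> Frame:
    return Frame([r for r in frame.rows if r["status"] == "active"])

def add_greeting(frame: Frame) -> Frame:
    return Frame([{**r, "greeting": f"hi {r['name']}"} for r in frame.rows])

# Custom steps chain exactly like built-in DataFrame methods:
result = (
    Frame([{"name": "ada", "status": "active"},
           {"name": "bob", "status": "inactive"}])
    .transform(keep_active)
    .transform(add_greeting)
)
```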
So if Spark is functional by design, why does most PySpark code look imperative?
Why we write imperative Spark code
A few reasons, actually. Notebook-driven development encourages mutable state — you run cell by cell, each one modifying df, and you mentally track what df currently holds. Python’s flexibility doesn’t help either — there’s no compiler telling you that reassigning df five times in a row is a bad idea. And there’s the copy-paste culture — when the first notebook works, its patterns get carried into the next one, imperative habits and all.
Most data engineers learn Python as an imperative language. Functional concepts like map, reduce, and composition rarely feature in a typical data engineering curriculum. So we end up writing Spark code the same way we’d write a procedural Python script — and that’s where the problems start.
The real cost
These aren’t just style issues. Imperative patterns cause concrete, measurable problems in production.
Look at how the same transformation pipeline reads in each style:
# Imperative — df is a moving target
from pyspark.sql.functions import col, concat, current_timestamp, lit

df = spark.read.format("delta").load(path)
df = df.filter(col("status") == "active")
df = df.withColumn("full_name", concat(col("first"), lit(" "), col("last")))
df = df.withColumn("loaded_at", current_timestamp())
df = df.drop("_tmp")
display(df)  # triggers a Spark action/job — surprise!
# Functional — each step is named, no variable is ever reused
raw_customers = spark.read.format("delta").load(path)

active_customers = (
    raw_customers
    .filter(col("status") == "active")
    .withColumn("full_name", concat(col("first"), lit(" "), col("last")))
    .withColumn("loaded_at", current_timestamp())
    .drop("_tmp")
)

display(active_customers)  # caller decides when to materialise
When df is reassigned five times and a cell fails midway, df is in an unknown state. Re-running cells out of order produces silent wrong results. With the functional version, every intermediate name is stable — you can inspect raw_customers or active_customers independently, at any point.
Other costs I encountered across those 14 notebooks: functions that called display() internally couldn’t be unit tested at all. .rdd.isEmpty() forced an RDD conversion and could trigger a full table scan when .head(1) would often have been less work. The same 210-line dedup pattern was copy-pasted six times — when the logic needed updating, only two copies were fixed, and the other four silently produced wrong results. And 22 hardcoded abfss:// storage account paths meant the notebook could only ever run in the dev environment.
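The hardcoded-path problem has a simple functional fix: derive paths from configuration that is passed in, rather than from literals baked into the notebook. A minimal sketch follows; the environment names and the path template are hypothetical, not taken from the original notebooks.

```python
from dataclasses import dataclass

# Hypothetical config object: the one place where environment differences live.
@dataclass(frozen=True)
class LakeConfig:
    account: str
    container: str

    def table_path(self, table: str) -> str:
        # abfss://<container>@<account>.dfs.core.windows.net/<table>
        return (
            f"abfss://{self.container}@{self.account}"
            f".dfs.core.windows.net/{table}"
        )

dev = LakeConfig(account="devlake", container="bronze")
prod = LakeConfig(account="prodlake", container="bronze")

# The same notebook code runs anywhere; only the injected config changes.
dev_path = dev.table_path("customers")
```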
The core rule
If there’s one thing I want you to take away from this series, it’s this:
Never reassign a DataFrame variable. If you write df = df.withColumn(...), you're fighting the paradigm.
This single rule — treating DataFrame variables as immutable — cascades into every other improvement. Once you stop reassigning df, you naturally gravitate toward method chaining, .transform() composition, pure functions, and testable code.
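The payoff of that rule is testability: a transformation that takes data in and returns data out can be unit tested with no display() and no hidden state. Sketched here on plain list-of-dict rows rather than DataFrames, to keep the example self-contained; the function name is illustrative.

```python
from typing import Dict, List

Row = Dict[str, str]

def with_full_name(rows: List[Row]) -> List[Row]:
    """Pure transformation: no reassignment, no side effects, new rows out."""
    return [{**r, "full_name": f"{r['first']} {r['last']}"} for r in rows]

# A unit test is now trivial: feed rows in, assert on the rows that come out.
sample = [{"first": "Grace", "last": "Hopper"}]
result = with_full_name(sample)
```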
What this series covers
In the upcoming posts, we’ll tackle each category of anti-patterns one by one:
- Immutability & Pure Transformations — why df = df.withColumn(...) is the most common anti-pattern and how method chaining with descriptive variable names fixes it
- Eliminating Side Effects — why print() and display() inside transformation functions break Spark's lazy model, and how Python's logging module is the proper alternative
- Composition with .transform() and reduce() — how to replace sequential df = f(df) calls with composable, testable, chainable pipelines
- Performance Through Functional Design — 6 performance traps hidden in imperative code, from .rdd.isEmpty() to unnecessary .count() calls
- Configuration, Naming & Error Handling — hardcoded paths, magic strings, bare except: blocks, and wildcard imports — the hygiene issues that make pipelines fragile
- Functional Architecture: Pipeline Layers — how to split a 3000-line monolithic notebook into 5 composable layers: Ingestion → Cleansing → Business Logic → Aggregation → Output
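The composition idea from the third post can be previewed with functools.reduce: a pipeline is just a list of data-to-data functions folded over an input. Illustrated with plain Python functions here; in the series the same shape applies to DataFrame-to-DataFrame steps.

```python
from functools import reduce
from typing import Callable, List

Step = Callable[[List[int]], List[int]]

def drop_negatives(xs: List[int]) -> List[int]:
    return [x for x in xs if x >= 0]

def double(xs: List[int]) -> List[int]:
    return [x * 2 for x in xs]

# Instead of result = f(result); result = g(result); ... fold the steps:
steps: List[Step] = [drop_negatives, double]
result = reduce(lambda data, step: step(data), steps, [-1, 2, 3])
```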
Each post will follow the same pattern: show the anti-pattern from real production code, explain why it’s a problem, and present the idiomatic functional solution.
I hope this series will help some of you look at your PySpark notebooks with fresh eyes 😊
Thanks for reading!