<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>PySpark on KaPa Consulting</title><link>https://kapa-consulting.sk/categories/pyspark/</link><description>Recent content in PySpark on KaPa Consulting</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Fri, 10 Apr 2026 10:00:00 +0100</lastBuildDate><atom:link href="https://kapa-consulting.sk/categories/pyspark/index.xml" rel="self" type="application/rss+xml"/><item><title>PySpark Functional Programming: Stop Writing Imperative Spark Pipelines</title><link>https://kapa-consulting.sk/post/2026/04/2026-04-10-pyspark-functional-programming-intro/</link><pubDate>Fri, 10 Apr 2026 10:00:00 +0100</pubDate><guid>https://kapa-consulting.sk/post/2026/04/2026-04-10-pyspark-functional-programming-intro/</guid><description>&lt;p>In my recent project I ran into a situation where I had to review a set of PySpark notebooks in Microsoft Fabric — 14 notebooks, some of them over 3000 lines long, hundreds of cells, multiple data domains crammed into a single file. The code worked, but reading it felt like archaeology. Every notebook started the same way: &lt;code>df = spark.read...&lt;/code>, then &lt;code>df = df.withColumn(...)&lt;/code> repeated dozens of times, sprinkled with &lt;code>display(df)&lt;/code> calls and bare &lt;code>except:&lt;/code> blocks. 
I kept asking myself — how did we end up writing Spark code like this?&lt;/p></description></item><item><title>Notebooks in Production vs Spark Job Definitions — Which One Should You Use?</title><link>https://kapa-consulting.sk/post/2026/03/2026-03-24-pyspark-notebooks-in-production-vs-spark-jobs/</link><pubDate>Tue, 10 Mar 2026 18:00:00 +0100</pubDate><guid>https://kapa-consulting.sk/post/2026/03/2026-03-24-pyspark-notebooks-in-production-vs-spark-jobs/</guid><description>&lt;p>While working on multiple projects across all major Spark platforms — Apache Spark, Databricks, Azure Synapse, and most recently Microsoft Fabric — I noticed one thing that stays remarkably consistent: notebooks are by far the most popular way developers write and run their PySpark code. They feel fast and interactive during development — and they are, for exploration. What most people do not realize until later is that notebooks carry hidden overhead that adds up in production. But by then, the code is already written in a notebook, so naturally&amp;hellip; they just deploy the notebook(s). Most of the time they do so without even knowing there are other, arguably better, alternatives; or they know, but do not want to rewrite the code.&lt;/p></description></item></channel></rss>