PySpark

Applications and Examples

As I mentioned in the Spark (Scala) page group, the native Spark language is Scala, so everything I know about the why and the how lives there. In this group, I will only include applications and examples.

Since both PySpark and Spark (Scala) are merely different interfaces to the same framework, you will see a lot of similarities between the two. Generally speaking (take that with a grain of salt), whatever syntax works in Spark Scala usually works in PySpark. One major difference: in PySpark you have to add () at the end of every method call. For example, printSchema in Scala becomes printSchema() in PySpark. This applies to chained methods as well: Scala's df.groupBy("col1").count.show becomes df.groupBy("col1").count().show() in PySpark.
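Here's a minimal sketch of that difference; the SparkSession setup, the people.json file, and the col1 column are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

# Hypothetical session and input file, just to have a DataFrame to work with.
spark = SparkSession.builder.appName("pyspark-examples").getOrCreate()
df = spark.read.json("people.json")

# Scala: df.printSchema  ->  PySpark needs the parentheses:
df.printSchema()

# Scala: df.groupBy("col1").count.show  ->  every method in the chain gets ():
df.groupBy("col1").count().show()
```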

You can of course use single quotes for column names in PySpark; I prefer double quotes so the code stays closer to Scala.
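For instance, these two lines are equivalent (df being the hypothetical DataFrame from the sketch above):

```python
df.select("col1").show()  # double quotes, matches Scala style
df.select('col1').show()  # single quotes, works just as well in Python
```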

Your best bet is always visiting and revisiting the official docs for whichever version of Spark your cluster manager runs, then going to the appropriate API Docs (there's a tab for it at the top of the official docs page). For example, the PySpark docs for Spark 2.4.4 are at https://spark.apache.org/docs/2.4.4/api/python/index.html, and for the latest version (3.1) at https://spark.apache.org/docs/latest/api/python/reference/index.html

At the time I wrote this page, Spark 3.1.1 was already out and had been pushed to Databricks, a cluster management platform for Apache Spark.
