PySpark

Applications and Examples

As I said in the Spark Scala page group, Spark's native language is Scala, so everything I know about the why and the how lives there. In this group, I will only include applications and examples.

Since PySpark and Spark (Scala) are merely different interfaces to the same framework, you will see a lot of similarities between the two. Generally speaking (take that with a grain of salt), whatever syntax works in Spark Scala usually works in PySpark. One major difference: in PySpark you have to add () at the end of every method call. For example, printSchema in Scala becomes printSchema() in PySpark. This applies to chained methods as well, e.g. Scala's df.groupBy("col1").count.show becomes df.groupBy("col1").count().show() in PySpark.
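
A minimal sketch of that difference, assuming a DataFrame named df is already loaded in the session:

    # Scala allows dropping parentheses on no-argument methods:
    #   df.groupBy("col1").count.show
    # PySpark requires explicit parentheses on every method call:
    df.printSchema()
    df.groupBy("col1").count().show()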

You can of course use single quotes for the column names in PySpark; I prefer to stick with double quotes to stay closer to Scala.
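
For example, assuming the same df, both of these are equivalent in PySpark:

    df.select("col1").show()   # double quotes, reads the same as the Scala version
    df.select('col1').show()   # single quotes, equally valid Python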

Your best bet is always to visit and revisit the official docs for whatever version of Spark your cluster manager is running, then go to the appropriate API Docs (there's a tab for it at the top of the official docs page). For example, to get to the PySpark official docs for Spark 2.4.4, I'd go to https://spark.apache.org/docs/2.4.4/api/python/index.html, or for the latest version (3.1) to https://spark.apache.org/docs/latest/api/python/reference/index.html.
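
To check which version your cluster is actually running (and hence which docs to open), a quick sketch, assuming spark is the active SparkSession (it is pre-created in Databricks and Zeppelin notebooks):

    print(spark.version)        # e.g. '3.1.1' -- version of the running Spark cluster
    import pyspark
    print(pyspark.__version__)  # version of the PySpark client library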

At the time I wrote this page, Spark 3.1.1 was already out and pushed to DataBricks, a cluster-management platform for Apache Spark.
