Essential Imports & General Notes

Essential packages to import. General notes.

Get the full details in the "Reading Files, Essential Imports & Docs" page, in the "Spark Scala" page group of this book. Below, I highlight PySpark-specific details.

Most Essential Libraries to Import

from pyspark.sql.types import *
from pyspark.sql.functions import *

You'll see that many resources, if not all, import the functions module under an alias, either "f" or "F", like so:

import pyspark.sql.functions as f  # or F

Although this means more typing, it arguably makes your notebook/code more readable and easier to debug. It may even be required as a coding convention in your company.

If you do import the column functions library as "f", don't forget that every function call needs the f. prefix, e.g. f.col().
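
For example, here is a minimal sketch of the aliased style, assuming a DataFrame df with made-up column names ("id", "amount") used purely for illustration:

import pyspark.sql.functions as f

# every column function now carries the f. prefix
df_filtered = df.filter(f.col("amount") > 100).select("id", f.round("amount", 2).alias("amount_rounded"))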

Basic Information About the Dataframe

df.printSchema()

df.dtypes
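
As a quick sketch of what these return, assuming an active SparkSession named spark and a toy DataFrame built in-session (column names are just for illustration):

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

schema = StructType([
    StructField("id", StringType(), True),
    StructField("amount", DoubleType(), True),
])
toy_df = spark.createDataFrame([("a", 1.5), ("b", 2.0)], schema)

toy_df.printSchema()   # prints a tree of column names, types, and nullability
toy_df.dtypes          # returns a list of (column_name, type_string) tuples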

General Notes

Calling Methods / Functions

In PySpark, the syntax is Spark with a Python flavor. As such, you must add empty parentheses after each method call. For example, to print the names and types of a dataframe's columns, you type df.printSchema() in PySpark, instead of just df.printSchema in Spark Scala.

Sometimes, the same function can be used in Scala's camelCase or Python's snake_case, e.g. dropDuplicates(), where drop_duplicates() is an alias for dropDuplicates(), meaning you can use either. Parentheses are still required in PySpark in either case, as in the sketch below.
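
A minimal sketch to illustrate, assuming a DataFrame df that may contain duplicate rows:

# both spellings are equivalent; parentheses are required either way
deduped_camel = df.dropDuplicates()
deduped_snake = df.drop_duplicates()

# you can also deduplicate on a subset of columns
deduped_subset = df.dropDuplicates(["id"])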

Reading the Newly Revamped 3.1 Docs

I admit it's irritating at first to navigate the new docs for either API. So here is how it works:

  • You still need to know where to look beforehand in terms of API, and package (module). Python API means PySpark, and Scala API means Spark Scala.

  • Googling "spark docs latest" or "spark 3.1 official docs" will lead you to the familiar page https://spark.apache.org/docs/latest/ with main tabs on top. Hovering over the "API Docs" tab shows a dropdown list of the APIs. Choose Python for PySpark, and Scala for Spark Scala.

  • Let's choose Python. That will lead you to another confusing page, https://spark.apache.org/docs/latest/api/python/index.html. Click on the "API Reference" tab up top. There! You have all the modules (packages) laid out nicely. Select "Functions" under "Spark SQL" to see all the column functions available to you, or select "Feature" under "MLlib (DataFrame-based)" to see details about feature engineering tools for ML, like Bucketizer, StandardScaler, VectorAssembler, etc. See the "Machine Learning with PySpark" page in the "PySpark" page group of this guidebook for examples and details.

Note about persist

df.persist() doesn't actually persist anything until an action, like .show() or .count(), is called on the persisted object. After that first action, subsequent actions run very fast. .persist() is similar to .cache(), and objects saved with either can be released with .unpersist().
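
A minimal sketch of the typical pattern, assuming a DataFrame df that is reused several times downstream:

from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)  # nothing is materialized yet
df.count()                                # first action triggers the actual caching
df.show(5)                                # subsequent actions reuse the cached data

df.unpersist()                            # free the storage when you're done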

To Save, Write, Rewrite a PySpark Dataframe

df.write.mode("overwrite").option("header","true").format("parquet").save("blob_storage_path_here/chosen_file_name_my_initials")

It's a good idea to use an expressive name, include the date (so you know which version of the dataframe this is), and end the file name with your initials in capitals (so you can easily locate your files later while browsing the blob storage path).
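
A sketch of that naming convention; the base path is kept as the placeholder from above, and the dataframe name, date format, and initials ("AB") are made up for illustration:

from datetime import date

base_path = "blob_storage_path_here"                     # placeholder path
file_name = f"customer_orders_{date.today():%Y%m%d}_AB"  # expressive name + date + initials

(df.write
   .mode("overwrite")
   .option("header", "true")
   .format("parquet")
   .save(f"{base_path}/{file_name}"))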

Read CSV files into a Spark df

spark.read.format("csv").load("path", inferSchema=True, header=True)

As always in Spark, the path can be a folder of similar CSVs (or similar parquets), i.e. the partitions of the same dataset when it was saved.
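
For example, here is a sketch of reading a whole folder of partitioned CSVs back into one DataFrame (the path is a placeholder), using the equivalent .option(...) style instead of keyword arguments to load():

df_csv = (spark.read
          .format("csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("blob_storage_path_here/my_csv_folder"))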
