Introduction

and a disclaimer

This book is a collection of my observations and implementations while teaching myself Spark Scala and scouring the internet for answers. Huge thanks to the StackOverflow community and all the generous contributors on online forums everywhere; they are the unknown soldiers whose efforts we all thrive on.

I'm a data scientist trained in Python. When I first started working with Big Data, I had almost zero prior knowledge of Scala and only a theoretical understanding of distributed systems. You might find some parts of the code I share here inefficient, or not as elegant as they could be; I welcome and encourage all constructive comments and corrections. Send me a message on my LinkedIn to continue adding to this open guide and make it a better one for all of us out there working with Big Data. Let me know if you're interested in coediting this guidebook!

Although my updates to my GitHub repositories have petered out recently because I've focused on learning Spark and other Data Science topics, here are the links for the sake of completeness: my GitHub Pages portfolio, and my GitHub repositories.

How to Read This Book

Explanations, details, and the main ways to achieve something in Spark are found in the "Spark Scala" pages group. The "PySpark" pages group has examples and applications in PySpark without deeper explanation; refer to the "Spark Scala" pages for more details on a topic.

Note about Spark

A practical understanding of Spark is that it mixes SQL, with methods called on objects in a Spark syntax, with a Scala or Python interface, and with special hybrid commands to bridge the three; all in addition to your cluster manager platform's way of doing things. So there's a lot going on, and you will end up learning something new every day.
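
To make that concrete, here is a minimal sketch in PySpark (the toy DataFrame and names are made up for illustration) of the same kind of work expressed in all three styles:

```python
from pyspark.sql import SparkSession

# Hypothetical toy data, just to illustrate the three styles.
spark = SparkSession.builder.appName("mixing-styles").getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["letter", "number"])

# 1. Spark syntax: methods called on DataFrame objects.
df.filter(df.number > 1).select("letter").show()

# 2. Plain SQL, after registering the DataFrame as a temporary view.
df.createOrReplaceTempView("letters")
spark.sql("SELECT letter FROM letters WHERE number > 1").show()

# 3. A hybrid bridge: SQL expressions inside a DataFrame method.
df.selectExpr("letter", "number * 10 AS tens").show()
```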

Note about PySpark and my story with it

Since Spark is written in Scala (technically, Spark is defined as a distributed computing framework built atop Scala), PySpark is merely a Python interface to it. That is, you can write functions and everything else in Python, and even convert to a Pandas data frame and import Python-native libraries. However, you will be using Spark's context and syntax when dealing with its objects and data frames, which in turn are referred to as Spark SQL data frames, since you can use SQL syntax in many places (as we will see later).
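
As a rough sketch of what that interplay looks like (again, the data and names here are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("pyspark-interface").getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2)], ["letter", "number"])

# Plain Python logic, wrapped as a UDF so Spark can apply it to a column.
@udf(returnType=IntegerType())
def double(n):
    return n * 2

df = df.withColumn("doubled", double("number"))

# SQL syntax on the same Spark SQL data frame...
df.createOrReplaceTempView("t")
spark.sql("SELECT letter, doubled FROM t").show()

# ...and back to a plain Pandas data frame for Python-native libraries.
pdf = df.toPandas()
print(type(pdf))  # <class 'pandas.core.frame.DataFrame'>
```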

I started out with PySpark, then quickly changed to using Spark Scala for a year and a half, then went back to using PySpark as my work requirements changed. The links to official docs will somewhat reflect that: you'll see references to the Scala API from Spark 2.4.3 or 2.4.4, but references to the Python API from Spark 3.1, which was the latest version at the time of writing this book.

There are a lot of similarities in how the two interfaces work, but also major differences. Spark Scala seems to be preferred because it's allegedly faster, and because it gets improvements and updates from the Spark developers first. Although I'm a Pythonista, I prefer to write in Spark Scala, because it feels more native to Spark. And while Scala is a huge pain to learn at first, once you learn it well enough to get everything you need done with it, you might come to appreciate its rigorous structure as I did; PySpark seems illogical at times, since it combines Spark's peculiarities with the fluid nature of Python.
