Spark Scala Fundamentals
for loops, if statements, function definitions, lists, etc.
Control Flow and Functions
For-Loops
Replace Python's in keyword with <-, add parentheses around the loop condition, and replace the colon with braces.
For a numerical range in the condition, use (i <- 1 to 9), for example.
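A minimal sketch of the rules above. The variable names (total, names, joined) are made up for illustration.

```scala
// Python equivalent: for i in range(1, 10): total += i
var total = 0
for (i <- 1 to 9) {
  total += i        // "1 to 9" includes both 1 and 9
}
println(total)      // prints 45

// Looping over a collection uses the same <- arrow
val names = List("a", "b", "c")
var joined = ""
for (name <- names) {
  joined += name
}
println(joined)     // prints abc
```

If the upper bound should be excluded, as with Python's range, use until instead of to.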
If-else statements
Similar to for-loops, you need parentheses around the condition and braces around the statements. When typing line by line in the spark-shell, else if and else should go on the same line as the closing brace of the previous branch, not on a new line; otherwise the shell evaluates the if on its own before it sees the else.
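A small sketch of that layout. The value temperature and the labels are hypothetical.

```scala
val temperature = 25   // hypothetical input

var label = ""
if (temperature > 30) {
  label = "hot"
} else if (temperature > 15) {   // "else if" sits right after the closing brace
  label = "mild"
} else {
  label = "cold"
}
println(label)   // prints mild
```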
Functions
Here I cover native Scala functions, not Spark User Defined Functions, which have a page of their own in this book.
The best way to illustrate the rules is through an example. I'll show you below how to generate a list of strings representing sequential dates for a given month and year, in the format "yyyy-mm-dd".
To use it,
Output: res0: List[String] = List(2019-2-01, 2019-2-02, 2019-2-03, 2019-2-04, 2019-2-05, 2019-2-06, 2019-2-07)
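The function listing itself is not reproduced on this page, so here is a minimal sketch that would give output of the shape shown above. The name generateDates and its three parameters are my assumptions, not necessarily the original definition.

```scala
// generateDates is a hypothetical name; parameters are assumed for illustration
def generateDates(year: Int, month: Int, numDays: Int): List[String] = {
  var dates = List[String]()           // var, because the list is reassigned in the loop
  for (day <- 1 to numDays) {
    // zero-pad single-digit days so that 1 becomes "01"
    val dayText = if (day < 10) s"0$day" else s"$day"
    dates = dates :+ s"$year-$month-$dayText"
  }
  dates                                // the last expression is the return value
}

val firstWeek = generateDates(2019, 2, 7)
println(firstWeek.head)   // prints 2019-2-01
```

Note that the month is inserted as-is, matching the output above. For a strict "yyyy-mm-dd" string the month would need the same zero-padding as the day.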
☞ NOTES:
You must declare the type of each parameter. The return type can be inferred for non-recursive functions, but it is good practice to declare it as well.
Use var for variables that are going to change value. Remember, val is immutable.
Unlike Python, indentation doesn't matter in Spark Scala, though it's always good practice to keep spacing neat for ease of reading and debugging.
Notice how we do text formatting here, s"$value-text-$anotherValue", which is similar to Python's f"{value}-text-{another_value}".
If you want to use the function as a custom User Defined Function (UDF) on a DataFrame column, which takes in Column values, then you have to register it with Spark beforehand. There's a dedicated page for that in this guide.
Functions on Data Frames
See "DataFrame Manipulations" page of this book.
Error Handling - Exceptions
Similar to try/except in Python, there's try/catch/finally in Spark Scala (Source).
To catch an unknown or general error, see this answer. The "case" clause doesn't have to be a one-liner; it can have braces and multiple steps inside.
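A minimal sketch of try/catch/finally with a multi-step "case" body. The function parseCount is a made-up example, not something from the linked source.

```scala
// parseCount is a hypothetical helper used only to trigger an exception
def parseCount(text: String): Int = {
  try {
    text.trim.toInt
  } catch {
    case e: NumberFormatException => {
      // a case body may hold several statements inside braces
      println(s"Could not parse '$text': ${e.getMessage}")
      0
    }
  } finally {
    // runs whether or not an exception was thrown
    println("finished parsing")
  }
}

println(parseCount("42"))    // prints 42
println(parseCount("abc"))   // prints 0
```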
I was dealing with event data partitioned by day, that is, a separate folder for each day, and inside each folder there were several partitions, each for a specific message type.
To catch any general error, I could replace the error name in the "case" clause with unknown: Exception.
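A sketch of the general fallback, assuming a plain text file is being read rather than my partitioned event data. The function countLines and the path are hypothetical.

```scala
import java.io.FileNotFoundException
import scala.io.Source

// countLines is a hypothetical helper; it returns 0 for a missing file
// and -1 for any other failure
def countLines(path: String): Int = {
  try {
    val source = Source.fromFile(path)
    try source.getLines().size finally source.close()
  } catch {
    case e: FileNotFoundException =>
      println(s"Nothing to read at $path")
      0
    case unknown: Exception =>
      // general fallback: matches any exception not handled above
      println(s"Unexpected error: ${unknown.getMessage}")
      -1
  }
}

println(countLines("/no/such/folder/part-0000.txt"))   // prints 0
```

The order of the "case" clauses matters: the first matching clause wins, so the general one goes last.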
Importing a Package (Scala File) You Made Earlier
1. Make your library in the format:
2. Call it from another script:
☞ The source also has the code to run a .scala file from the terminal.
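The two listings are not reproduced on this page. As a rough sketch of the idea, assuming the library is a plain object (the names MyLibrary, addOne and Main are hypothetical, and the original may well use a package declaration instead):

```scala
// --- MyLibrary.scala (hypothetical file name) ---
object MyLibrary {
  def addOne(x: Int): Int = x + 1
}

// --- Main.scala (hypothetical file name) ---
import MyLibrary._   // brings addOne into scope

object Main {
  def main(args: Array[String]): Unit = {
    println(addOne(2))   // prints 3
  }
}
```

In the spark-shell, a file like this can be brought in with the :load command followed by the file path before calling its functions.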
Lists
Create an Empty List
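val a = List()
returns a List[Nothing], i.e. the element type is Nothing. Alternatively, specify a type: var a = List[String]()
VERY IMPORTANT: we must use var, not val, if we intend to append to it, because appending builds a new list that has to be reassigned to the same name.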
Append to an Empty List
Add to the beginning of the list with ::=, like so: my_list ::= "new_element" (Source)
Add to the end of the list with :+, like so: my_list = my_list :+ "new_element" (Source)
IMPORTANT: you must reassign the list, or the new changes won't persist. Another way is to add two lists.
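Putting the pieces above together in a short sketch (the element values are arbitrary):

```scala
// var, because each append produces a new list that is assigned back
var my_list = List[String]()

my_list ::= "first"              // prepend: List(first)
my_list = my_list :+ "last"      // append:  List(first, last)
my_list ::= "zero"               // prepend: List(zero, first, last)

// Without reassignment the original list is left untouched
val other = my_list :+ "extra"
println(my_list)   // prints List(zero, first, last)
println(other)     // prints List(zero, first, last, extra)
```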
Create an Empty Sequence and Append to it
This is useful for creating a DataFrame with .toDF().
With a for-loop example
Idea adapted to Sequence, from Source. Note: what works for Lists usually works for Sequences. There are different approaches to creating a DataFrame, listed here, but the Sequence one is the easiest and avoids defining schemas.
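A minimal sketch of the for-loop approach. The row contents and the column names in the final comment are made up, and the .toDF() call is left as a comment because it needs a SparkSession with its implicits imported.

```scala
// Start with an empty, typed sequence of (name, value) rows
var rows = Seq[(String, Int)]()

for (i <- 1 to 3) {
  val row = (s"item_$i", i * 10)
  rows = rows :+ row            // reassign, same as with Lists
}

println(rows)   // prints List((item_1,10), (item_2,20), (item_3,30))

// In the spark-shell (assumes spark.implicits._ is in scope):
// val df = rows.toDF("name", "value")
```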
Add Two Lists Together
Use three colons, firstList ::: secondList, or .concat, List.concat(firstList, secondList), or two pluses, firstList ++ secondList.
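All three forms give the same result, as this short sketch with arbitrary values shows:

```scala
val firstList = List(1, 2)
val secondList = List(3, 4)

val withColons = firstList ::: secondList
val withConcat = List.concat(firstList, secondList)
val withPluses = firstList ++ secondList

println(withColons)   // prints List(1, 2, 3, 4)
```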
Writing and Reading a List To/From a Text File
Refer to "Writing and Reading a Text File" page in this book.
Creating a List of DataFrames, and Combine them Together
Refer to "DataFrame Manipulations" page of this book.
String Formatting
In Scala we have s" ", which works similarly to f" " or " ".format() in Python. That is, you can add a variable to the text, e.g. s"We got $value inches of rain this month"
Here, our variable is value. The dollar sign $ is how you tell Scala this is a variable to insert.
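A quick sketch with a made-up value. Braces after the dollar sign allow a whole expression rather than a single variable.

```scala
val value = 3   // hypothetical rainfall figure

val message = s"We got $value inches of rain this month"
println(message)   // prints We got 3 inches of rain this month

// ${...} inserts the result of an expression
val doubled = s"Twice that is ${value * 2} inches"
println(doubled)   // prints Twice that is 6 inches
```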
☞ NOTE: Use the f" " format, not s" ", when you want to round decimals in a println statement.
Syntax to show a value with two decimal places and convert the result back to a Double, all at once: f"$value%1.2f".toDouble
e.g. println(f"R-squared metric for LR $lrMetric%1.2f") prints "R-squared metric for LR 68.02"
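A small sketch of both uses. The value assigned to lrMetric is invented so that it rounds to the figure above, and the outputs assume a locale that uses "." as the decimal separator.

```scala
val lrMetric = 68.0189   // hypothetical metric value

// Round for display only
val message = f"R-squared metric for LR $lrMetric%1.2f"
println(message)   // prints R-squared metric for LR 68.02

// Round and convert the formatted text back to a Double
val rounded: Double = f"$lrMetric%1.2f".toDouble
println(rounded)   // prints 68.02
```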