Spark Scala Fundamentals

for loops, if statements, function definitions, lists, etc.

Control Flow and Functions

For-Loops

val myList:List[String] = List("apples","oranges","bananas")

for (item <- myList) {
     println(item + " is a fruit.")
     }

Replace in from Python with <- , add parenthsis around condition. Replace colon with braces.

For a numerical range in condition, use (i <- 1 to 9) for example.

If-else statements

val x:List[Int] = List(1,2,3,4,5)

for (item <- x) { 
    if(item <3 ){
        println("number is less than three")
        } else if (item == 3) {
            println("number is three")
            } else {
            println("number is bigger than three")
                    }  
}

Similar to for-loops, you need parentheses around condition, and braces around statements. I think else if and else should be on the same line immediately after the closing brace of the previous condition, not on a new line.

Functions

Here I put the native Scala functions, Not the Spark User Defined Functions, which has a page of its own in this book.

Best way to illustrate rules are through examples. I'll show you below how to generate a list of strings representing sequential dates from a given month and year, in the format "yyyy-mm-dd"

def genDates(numDays: Int, mm: Int, yyyy: Int): List[String] = {
    var days = List[String]()
    if (numDays < 9) {
      for (i <- 1 to numDays) {
        days = days :+ s"$yyyy-$mm-0$i" 
      }
    } else {
      var firstDays = List[String]()
      var lastDays = List[String]()
      for (i <- 1 to 9) {
        firstDays = firstDays :+ s"$yyyy-$mm-0$i"
      }
      for (i <- 10 to numDays) {
        lastDays = lastDays :+ s"$yyyy-$mm-$i"
      }
      days = firstDays ++ lastDays
    }
    return days
}

To use it,

genDates(numDays=7, mm=2, yyyy=2019)

Output: res0: List[String] = List(2019-2-01, 2019-2-02, 2019-2-03, 2019-2-04, 2019-2-05, 2019-2-06, 2019-2-07)

☞ NOTES:

  • You must define the types for each variable; as well as the type of the output of the function.

  • Use var for variables that are going to change value. Remember val is immutable.

  • Unlike Python, spacing doesn't matter in Spark Scala, though it's always a good practice to keep spacing neat for ease of reading and debugging.

  • Notice how we do text formatting here s"$value-text-$anotherValue" which is similar to Python's f"{value}-text-{another_value}"

  • If you want to use the function as a custom User Defined Function (UDF) on a dataframe column, which takes in Column values, then you have to register it with Spark beforehand. There's a dedicated page for that in this guide.

Functions on Data Frames

See "DataFrame Manipulations" page of this book.

Error Handling - Exceptions

Similar to try/except in Python, there's try/catch/finally in Spark Scala Source.

To catch unknown or general error, see this answer. The "case" statement doesn't have to be one-liner, it can have braces and multiple steps inside.

I had partitions by day, that is a separate folder for each day, then inside that, there are serveral partitions with a specific message type. I was dealing with event data,

//test for corrupt or non-existant files
for (d <- dates) {
    try {
        spark.read.parquet(s"$lake_path/data/PartitionDateKey=$d/PartitionTypeKey=telemetryEvent")
    } catch {
        case ae: org.apache.spark.sql.AnalysisException => println(s"$d not found")
    }
}

To catch any general error, I could replace the error name in the "case" clause with unknown: Exception:

for (d <- dates) {
    try {
        spark.read.parquet(s"$lake_path/data/PartitionDateKey=$d/PartitionTypeKey=telemetryEvent")
    } catch {
        case unknown: Exception => println(s"$d not found")
    }
}

Importing a Package (Scala File) You Made Earlier

Source

  1. Make your library in the format:

package logic
class Logic{
  def hello = "hello"
}

2. Call it from another script:

package runtime
import logic.Logic  // import
object Main extends Application{
println(new Logic hello) // instantiation and invocation
}

☞ The source has also the code to run a .scala file from terminal.

Lists

Create an Empty List

val a = List() returns List(Nothing) i.e. data type Nothing. or specify a type var a = List[String]() VERY IMPORTANT: we must use val if we intend to append to it.

Append to an Empty List

  • add to the beginning of the list

    ::= like so my_list ::= "new_element" Source

  • add to the end of the list

    my_list = my_list :+ "new_element" Source

    IMPORTNAT must update the list, or the new changes won't persist. Another way is to add two lists.

Create an Empty Sequence and Append to it

useful to create dataframe with .toDF()

With a for-loop example

var dfSeq : Seq[(String, String, Double, String, Double, String)] = Seq()

for (item <- devsList) {
    var itemDF = felix2.filter($"DeviceId"===item)
    var itemRes_asis = runML(itemDF, "perturbedY")
    var itemRes_log = runML(itemDF, "logY")
    dfSeq = dfSeq :+ itemRes_asis
    dfSeq = dfSeq :+ itemRes_log
}
var resDF = dfSeq.toDF("Device", "bestR2Model","r2Score", "bestMSEmodel", "MSEscore", "runtimeTreatment")
resDF.show

Idea adapted to Sequence, from Source Note: usually what works for Lists, works for Sequences. There are different approaches to create a data frame, listed here but the Sequence one is the easiest, avoids schemas.

Add Two Lists Together

  • 3 colons, firstList ::: secondList or

  • .concat List.concat(firstList, secondList), or

  • two pluses firstList ++ secondList

Writing and Reading a List To/From a Text File

Refer to "Writing and Reading a Text File" page in this book.

Creating a List of DataFrames, and Combine them Together

Refer to "DataFrame Manipulations" page of this book.

String Formatting

In Scala we have s" " which works similarly to f" " or " ".format() in Python. That is, you can add a variable to the text. e.g. s"We got $value inches of rain this month" Here, our variable is value .

The dollar sign $ is how you tell Scala this is a variable to insert.

NOTE: Use f" " format, Not the s" ", when you want to round decimals in println statement. Syntax to show a value as two decimal places, and change its type to DoubleType, all at once f"$value%1.2f".toDouble e.g. println(f"R-squared metric for LR $lrMetric%1.2f") to print: "R-squared metric for LR 68.02"

Last updated