for loops, if statements, function definitions, lists, etc.
Control Flow and Functions
For-Loops
val myList: List[String] = List("apples", "oranges", "bananas")
for (item <- myList) {
  println(item + " is a fruit.")
}
Replace Python's in with <-, add parentheses around the condition, and replace the colon with braces.
For a numerical range in the condition, use (i <- 1 to 9), for example.
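To make the range concrete, here is a small sketch (the variable names are mine) summing 1 through 9 with `to`, which includes the upper bound; `until` would exclude it:

```scala
// Inclusive range: 1 to 9 yields 1, 2, ..., 9 (use `until` to stop at 8)
var total = 0
for (i <- 1 to 9) {
  total += i
}
println(total) // 45
```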
If-else statements
val x: List[Int] = List(1, 2, 3, 4, 5)
for (item <- x) {
  if (item < 3) {
    println("number is less than three")
  } else if (item == 3) {
    println("number is three")
  } else {
    println("number is bigger than three")
  }
}
Similar to for-loops, you need parentheses around the condition and braces around the statements. Keep else if and else on the same line as the closing brace of the previous block; in the REPL especially, an else starting on a new line can be read as the beginning of a new statement.
Functions
This section covers native Scala functions, not Spark User Defined Functions, which have a page of their own in this book.
The best way to illustrate the rules is through examples. Below, I'll show you how to generate a list of strings representing sequential dates for a given month and year, in the format "yyyy-mm-dd".
You must declare the type of each parameter, as well as the return type of the function.
Use var for variables that will change value; remember that val is immutable.
Unlike Python, indentation is not syntactically significant in Scala, though it's always good practice to keep it neat for ease of reading and debugging.
Notice the string interpolation s"$value-text-$anotherValue", which is similar to Python's f"{value}-text-{another_value}".
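A minimal sketch pulling these rules together; the function name and parameters here are invented for illustration:

```scala
// Parameter types and the return type are declared explicitly
def describeRain(inches: Double, month: String): String = {
  // s-interpolation embeds the values, like a Python f-string
  s"We got $inches inches of rain in $month"
}

println(describeRain(2.5, "April")) // We got 2.5 inches of rain in April
```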
If you want to use the function as a custom User Defined Function (UDF) on a DataFrame column, i.e. taking Column values, you must register it with Spark first. There's a dedicated page for that in this guide.
Functions on Data Frames
See "DataFrame Manipulations" page of this book.
Error Handling - Exceptions
Similar to try/except in Python, Scala has try/catch/finally (Source).
To catch an unknown or general error, see this answer. The case statement doesn't have to be a one-liner; it can have braces and multiple steps inside.
My data was partitioned by day, that is, a separate folder for each day; inside each of those there were several partitions for specific message types. I was dealing with event data.
To catch any general error, I could replace the exception name in the case clause with unknown: Exception:
☞ The source also includes the code to run a .scala file from the terminal.
Lists
Create an Empty List
val a = List() returns List[Nothing], i.e. the element type is Nothing. Or specify a type: var a = List[String](). VERY IMPORTANT: we must use var, not val, if we intend to append to it, because appending produces a new list that has to be reassigned.
IMPORTANT: you must reassign the list, or the changes won't persist. Another way is to add two lists together.
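A small sketch of the reassignment pattern (the list name is mine):

```scala
// Must be var: :+ returns a NEW list, so we reassign to persist the change
var fruits = List[String]()
fruits = fruits :+ "apples"
fruits = fruits :+ "oranges"
println(fruits) // List(apples, oranges)
```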
Create an Empty Sequence and Append to it
Useful for creating a DataFrame with .toDF().
With a for-loop example
Idea adapted from Lists to Sequences (Source). Note: usually, what works for Lists works for Sequences. There are different approaches to creating a DataFrame, listed here, but the Sequence one is the easiest and avoids schemas.
Add Two Lists Together
Three colons: firstList ::: secondList, or
List.concat: List.concat(firstList, secondList), or
two pluses: firstList ++ secondList
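All three forms produce the same result, as this small sketch shows:

```scala
val firstList = List(1, 2)
val secondList = List(3, 4)

val viaColons = firstList ::: secondList           // three colons
val viaConcat = List.concat(firstList, secondList) // List.concat
val viaPlus   = firstList ++ secondList            // two pluses

// each is List(1, 2, 3, 4)
println(viaColons == viaConcat && viaConcat == viaPlus) // true
```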
Writing and Reading a List To/From a Text File
Refer to "Writing and Reading a Text File" page in this book.
Creating a List of DataFrames, and Combine them Together
Refer to "DataFrame Manipulations" page of this book.
String Formatting
In Scala we have s" ", which works similarly to f" " or " ".format() in Python; that is, you can embed a variable in the text, e.g. s"We got $value inches of rain this month".
Here, our variable is value .
The dollar sign $ is how you tell Scala this is a variable to insert.
☞ NOTE: Use the f" " interpolator, not s" ", when you want to round decimals in a println statement.
Syntax to show a value with two decimal places and convert it to a Double, all at once: f"$value%1.2f".toDouble. E.g. println(f"R-squared metric for LR $lrMetric%1.2f") prints: "R-squared metric for LR 68.02".
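A quick sketch of the rounding behavior (the metric value here is made up):

```scala
val lrMetric = 68.0234
// %1.2f rounds to two decimal places inside the f-interpolated string
println(f"R-squared metric for LR $lrMetric%1.2f") // R-squared metric for LR 68.02
// ...and .toDouble converts the formatted string back to a Double
val rounded = f"$lrMetric%1.2f".toDouble // 68.02
```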
def genDates(numDays: Int, mm: Int, yyyy: Int): List[String] = {
  // var, because we reassign the (immutable) list on every append
  var days = List[String]()
  if (numDays <= 9) {
    for (i <- 1 to numDays) {
      // %02d zero-pads the month; "0$i" pads the single-digit days
      days = days :+ f"$yyyy-$mm%02d-0$i"
    }
  } else {
    var firstDays = List[String]()
    var lastDays = List[String]()
    for (i <- 1 to 9) {
      firstDays = firstDays :+ f"$yyyy-$mm%02d-0$i"
    }
    for (i <- 10 to numDays) {
      lastDays = lastDays :+ f"$yyyy-$mm%02d-$i"
    }
    days = firstDays ++ lastDays // add the two lists together
  }
  days // the last expression is the return value; no explicit return needed
}
genDates(numDays = 7, mm = 2, yyyy = 2019)
// test for corrupt or non-existent files
for (d <- dates) {
  try {
    spark.read.parquet(s"$lake_path/data/PartitionDateKey=$d/PartitionTypeKey=telemetryEvent")
  } catch {
    case ae: org.apache.spark.sql.AnalysisException => println(s"$d not found")
  }
}
// same loop, but catching any general Exception
for (d <- dates) {
  try {
    spark.read.parquet(s"$lake_path/data/PartitionDateKey=$d/PartitionTypeKey=telemetryEvent")
  } catch {
    case unknown: Exception => println(s"$d not found")
  }
}
package logic

class Logic {
  def hello = "hello"
}

package runtime

import logic.Logic // import

object Main extends App { // Application is deprecated; use App
  println((new Logic).hello) // instantiation and invocation
}
var dfSeq: Seq[(String, String, Double, String, Double, String)] = Seq()
for (item <- devsList) {
  val itemDF = felix2.filter($"DeviceId" === item)
  val itemRes_asis = runML(itemDF, "perturbedY")
  val itemRes_log = runML(itemDF, "logY")
  dfSeq = dfSeq :+ itemRes_asis
  dfSeq = dfSeq :+ itemRes_log
}
val resDF = dfSeq.toDF("Device", "bestR2Model", "r2Score", "bestMSEmodel", "MSEscore", "runtimeTreatment")
resDF.show()