Creating a Data Frame

How to make a data frame, and how to write its schema

Unlike Python, creating a data frame in Spark requires you to define its schema, which is the data frame structure, i.e. the column names, types, and nullability. To build that the right way, you need to know about Scala data types and the properties and limitations of each. I find that confusing, but here are a couple of resources on Scala data types and how to write schemas; you can research more on your own.

Spark schemas, Spark data types (official docs), Scala data types. Another type I ran into is BigInt from Scala's math library, but LongType from Spark is usually good enough for big integers. Using a .to*** method on a number changes its type; for example, 19.toFloat gives 19.0 and 19.45.toInt gives 19 (it truncates rather than rounds). Marginal note: if you want to divide two numbers and get a decimal, at least one of them has to be of Float or Double type.
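To make that concrete, here's a quick sketch of the conversion and division behavior in plain Scala (the values are arbitrary):

```scala
// .to*** conversions change a number's type
val asFloat = 19.toFloat    // 19.0, an Int widened to Float
val asInt = 19.45.toInt     // 19, truncated (not rounded)

// Division: two Ints give an Int; mixing in a Double gives a decimal
val intDivision = 7 / 2     // 3
val decDivision = 7 / 2.0   // 3.5
```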

Creating a Data Frame Directly, Without a Function

There's a couple of ways to do it, but in general, creating a data frame in Spark Scala has two parts,

  • the Sequence of columns' values,

  • the schema of the data frame, i.e. column names and types.

The easiest way I found is here

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql._ //for Row
import spark.implicits._

val inventory = Seq(
  (10, "Finance", Seq(100, 200, 300, 400, 500)),
  (20, "IT", Seq(10, 20, 50, 100))
).toDF("dept_id", "dept_nm", "emp_details")

inventory.show(false) //"false" so we see the full column, no trimming.

Output

There's a more in-depth tutorial as well. For making a data frame with Array columns, look here.

//create the columns' values
val arrayStructureData = Seq(
  Row("James,,Smith", List(2, 3, 5), List("Spark", "Java"), "OH", "CA"),
  Row("Michael,Rose,", List(4, 6, 3), List("Spark", "Java"), "NY", "NJ"),
  Row("Robert,,Williams", List(10, 15, 6), List("Spark", "Python"), "UT", "NV")
)

//create the schema
val arrayStructureSchema = new StructType().
  add("name", StringType).
  add("num", ArrayType(IntegerType)).
  add("languagesAtWork", ArrayType(StringType)).
  add("currentState", StringType).
  add("previousState", StringType)

//put it together
val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayStructureData), arrayStructureSchema)
df.printSchema
df.show
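For reference, printSchema on the schema above should print a tree along these lines (fields added with .add are nullable by default):

```
root
 |-- name: string (nullable = true)
 |-- num: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- languagesAtWork: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- currentState: string (nullable = true)
 |-- previousState: string (nullable = true)
```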

Output

Creating a Data Frame With a Function

There are cases where you need to build a data frame dynamically; for example, a for-loop that calls a function, gathers each result into a Sequence, and then converts that Sequence to a data frame using toDF as shown above.

☞ See page "Machine Learning in Spark" page of this book, where I built a data frame to view and compare results of different models, on different data (dataframes).

Writing code in functions is one of the best coding practices you can adopt; it makes your code reusable, packageable, clean, and easy to read and debug. Always write in functions.

Basically, here's the structure,

  1. Make a function that gathers the values you want and outputs them as a tuple of the same types as your desired result data frame.

  2. Create an empty Sequence typed with the types of each of the values.

  3. Write a for-loop that calls the function and appends the resulting tuple to the empty Sequence created in the previous step.

  4. Use the .toDF method and add column names for each of your values/columns.

Here's the generic syntax of the steps; I'm using 4 values of types (String, Double, String, Long) as an example.

//1. the function
def myFunction(whatever_args_and_their_types): (String, Double, String, Long) = {
	
	//steps to get the wanted values
	
	//now put them in the tuple, in the desired format
	val resultTuple = (value1, f"$value2%1.2f".toDouble, value3, value4.toLong)
	
	return resultTuple
}

//2. prepare the dataframe column types
var dfSeq : Seq[(String, Double, String, Long)] = Seq() 

//3. build the for-loop
for (item <- listofDFs) {
	//get the tuple of wanted values from the function
	val theTuple = myFunction(whatever_args)
	//append it to the Sequence
	dfSeq = dfSeq :+ theTuple
}

//4. make the data frame
var resultDF = dfSeq.toDF("colName1","colName2","colName3","colName4")

resultDF.show
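As a concrete sketch of the same pattern (every name here is made up for illustration, and a running Spark session with spark.implicits._ imported is assumed):

```scala
import spark.implicits._

//1. a function that returns the wanted values as a typed tuple
def deptSummary(deptName: String): (String, Double, String, Long) = {
  val score = deptName.length * 1.5
  (deptName, f"$score%1.2f".toDouble, deptName.toUpperCase, deptName.length.toLong)
}

//2. the empty, typed Sequence
var dfSeq: Seq[(String, Double, String, Long)] = Seq()

//3. loop and append
for (dept <- Seq("Finance", "IT", "HR")) {
  dfSeq = dfSeq :+ deptSummary(dept)
}

//4. convert to a data frame
val resultDF = dfSeq.toDF("dept_nm", "score", "dept_upper", "name_len")
resultDF.show
```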

Convert a List to a Data Frame

I haven't needed this yet myself. Read this post for some ideas.
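That said, a minimal sketch, assuming spark.implicits._ is imported: a plain Scala List converts with the same toDF call used above.

```scala
import spark.implicits._

//a List of simple values -> a one-column data frame
val numsDF = List(1, 2, 3).toDF("value")

//a List of tuples -> a multi-column data frame
val pairsDF = List(("a", 1), ("b", 2)).toDF("letter", "count")

numsDF.show
pairsDF.show
```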
