Creating a Data Frame

How to make a data frame, and how to write its schema


Unlike Python, creating a data frame in Spark requires you to define its schema, which is the dataframe structure, i.e. the column names, types, and nullability. To build that the right way, you need to know about Scala data types and the properties and limitations of each. I find that confusing, but here are a couple of resources on Scala data types and how to write schemas (linked at the end of this page); you can research more on your own.
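
For example, a Spark schema is a StructType built from StructFields, and each field spells out exactly those three things: a name, a Spark data type, and whether nulls are allowed. A minimal sketch (the column names here are made up for illustration):

import org.apache.spark.sql.types._

// each StructField = column name + Spark data type + nullability
val exampleSchema = StructType(Seq(
  StructField("customer_id", LongType, nullable = false),
  StructField("balance", DoubleType, nullable = true),
  StructField("segment", StringType, nullable = true)
))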

Another type I ran into is 'BigInt' from Scala's math library, but mostly LongType from Spark is good enough for big integers. Usually, calling a .to*** method on a number changes its type; for example, 19.toFloat gives 19.0 and 19.45.toInt gives 19. Marginal note: if you want to divide two numbers and get a decimal result, at least one of the numbers has to be a Float or a Double.
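
A few quick illustrations of those conversions (the numbers are arbitrary):

19.toFloat       // 19.0: Int -> Float
19.45.toInt      // 19: Double -> Int, the decimals are dropped
19.toLong        // 19: Int -> Long, handy when a schema expects LongType

7 / 2            // 3: dividing two integers gives integer division
7 / 2.toDouble   // 3.5: make one side a Double (or Float) to get a decimal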

Creating a Data Frame Directly, Without a Function

There are a couple of ways to do it, but in general, creating a data frame in Spark Scala has two parts:

  • the Sequence of columns' values,

  • the schema of the data frame, i.e. the column names and types.

The easiest way I found is

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.sql._ // for Row
import spark.implicits._

val inventory = Seq(
  (10, "Finance", Seq(100, 200, 300, 400, 500)),
  (20, "IT", Seq(10, 20, 50, 100))
).toDF("dept_id", "dept_nm", "emp_details")

inventory.show(false) // "false" so we see the full column content, no trimming

Output
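
Roughly, inventory.show(false) prints something like this (exact spacing can differ slightly across Spark versions):

+-------+-------+-------------------------+
|dept_id|dept_nm|emp_details              |
+-------+-------+-------------------------+
|10     |Finance|[100, 200, 300, 400, 500]|
|20     |IT     |[10, 20, 50, 100]        |
+-------+-------+-------------------------+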

//create the columns' values
val arrayStructureData = Seq(
  Row("James,,Smith", List(2, 3, 5), List("Spark", "Java"), "OH", "CA"),
  Row("Michael,Rose,", List(4, 6, 3), List("Spark", "Java"), "NY", "NJ"),
  Row("Robert,,Williams", List(10, 15, 6), List("Spark", "Python"), "UT", "NV")
)

//create the schema
val arrayStructureSchema = new StructType().
  add("name", StringType).
  add("num", ArrayType(IntegerType)).
  add("languagesAtWork", ArrayType(StringType)).
  add("currentState", StringType).
  add("previousState", StringType)

//put it together
val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayStructureData), arrayStructureSchema)
df.printSchema
df.show

Output
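
Roughly, df.printSchema and df.show print something like this (exact spacing can differ slightly across Spark versions):

root
 |-- name: string (nullable = true)
 |-- num: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- languagesAtWork: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- currentState: string (nullable = true)
 |-- previousState: string (nullable = true)

+----------------+-----------+---------------+------------+-------------+
|            name|        num|languagesAtWork|currentState|previousState|
+----------------+-----------+---------------+------------+-------------+
|    James,,Smith|  [2, 3, 5]|  [Spark, Java]|          OH|           CA|
|   Michael,Rose,|  [4, 6, 3]|  [Spark, Java]|          NY|           NJ|
|Robert,,Williams|[10, 15, 6]|[Spark, Python]|          UT|           NV|
+----------------+-----------+---------------+------------+-------------+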

Creating a Data Frame With a Function

There are cases where you need to build a dataframe dynamically; for example, a for-loop calls a function that builds up a Sequence, and then you convert that Sequence to a dataframe using toDF as shown above.

☞ See the "Machine Learning in Spark" page of this book, where I built a data frame to view and compare the results of different models on different data (dataframes).

Writing code in functions is one of the best coding practices you should adopt; it makes your code reusable, packageable, clean, and easy to read and debug. Always write in functions.

Basically, here's the structure,

  1. Make a function that gathers the values you want and outputs them in a tuple whose types match the columns of your desired result data frame.

  2. Create an empty Sequence typed as a tuple of each of those value types.

  3. Write a for-loop that calls the function and appends the resulting tuple to the empty Sequence created in the previous step.

  4. Use the .toDF method and add column names for each of your values/columns.

Here's the generic syntax of the steps; I'm using 4 values of types (String, Double, String, Long) as an example.

//1. the function
def myFunction(whatever_args_and_their_types): (String, Double, String, Long) = {
	
	//steps to get the wanted values
	
	//now put them in the tuple, in the desired format
	val resultTuple = (value1, f"$value2%1.2f".toDouble, value3, value4.toLong)
	
	return resultTuple
}

//2. prepare the dataframe column types
var dfSeq : Seq[(String, Double, String, Long)] = Seq() 

//3. build the for-loop
for (item <- listofDFs) {
	//get the tuple of wanted values from the function
	var theTuple = myFunction(whatever_args)
	//append it to the Sequence
	dfSeq = dfSeq :+ theTuple
}

//4. make the data frame
var resultDF = dfSeq.toDF("colName1", "colName2", "colName3", "colName4")

resultDF.show
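
To make the generic sketch concrete, here is a small hypothetical instance of the four steps, reusing the inventory data frame created earlier on this page (the function name, the "ok" status value, and the column names are made up for illustration):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import spark.implicits._

//1. the function: returns a tuple matching the result data frame's columns
def summarize(dfName: String, df: DataFrame): (String, Double, String, Long) = {
  val rowCount = df.count
  val avgDeptId = df.agg(avg("dept_id")).first.getDouble(0)
  (dfName, f"$avgDeptId%1.2f".toDouble, "ok", rowCount)
}

//2. the empty, typed Sequence
var summarySeq: Seq[(String, Double, String, Long)] = Seq()

//3. loop over the data frames and append each result tuple
for ((nm, df) <- Seq(("inventory", inventory))) {
  summarySeq = summarySeq :+ summarize(nm, df)
}

//4. convert to a data frame and name the columns
val summaryDF = summarySeq.toDF("df_name", "avg_dept_id", "status", "row_count")
summaryDF.show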

Convert a List to a Data Frame

I haven't needed this myself yet; the post and tutorial linked at the end of this page have some ideas, including making a dataframe with Array columns.
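
A minimal sketch of what this usually looks like, assuming spark.implicits._ is imported (the lists and column names are made up):

import spark.implicits._

//a plain Scala List becomes a one-column data frame
val prices = List(10.5, 23.1, 5.0)
val pricesDF = prices.toDF("price")

//a List of tuples becomes a multi-column data frame
val people = List(("Alice", 29), ("Bob", 41))
val peopleDF = people.toDF("name", "age")

peopleDF.show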

Links referenced on this page:

  • Spark schemas
  • Spark data types - official docs
  • Scala data types
  • more in depth tutorial
  • post