Creating a Data Frame
How to make a data frame, and how to write its schema
Unlike in Python, creating a data frame in Spark generally requires you to think about its schema, which is the data frame's structure, i.e. the column names, types, and nullability. To build that the right way, you need to know about Scala and Spark data types, and the properties and limitations of each. I find that confusing, but here are a couple of resources on data types and how to write schemas; you can research more on your own.
Resources: Spark schemas, Spark data types (official docs), Scala data types.
Another type I ran into is BigInt from Scala's math library, but mostly LongType from Spark is good enough for big integers. Usually, calling a .to*** method on a number changes its type. For example, 19.toFloat gives 19.0 and 19.45.toInt gives 19.
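As a quick illustration of the .to*** conversions mentioned above, here is a minimal sketch in plain Scala (no Spark needed). The variable names are only for illustration.

```scala
// Converting between numeric types with the .to*** methods.
val asFloat: Float = 19.toFloat          // Int -> Float
val asInt: Int = 19.45.toInt             // Double -> Int, the fractional part is dropped, not rounded
val asLong: Long = 19.toLong             // Int -> Long
val asDouble: Double = "19.45".toDouble  // String -> Double, throws NumberFormatException on bad input

println(s"$asFloat $asInt $asLong $asDouble") // prints "19.0 19 19 19.45"
```

Note that .toInt truncates toward zero, so it is not a substitute for rounding.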
Marginal note: If you want to divide two numbers and get a decimal, at least one of the numbers has to be of Float or Double type; dividing two integers drops the fractional part.
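A small sketch of the division behaviour described in the note, in plain Scala with illustrative names:

```scala
val a = 7
val b = 2

val intResult = a / b              // Int / Int: the remainder is discarded
val doubleResult = a.toDouble / b  // Double / Int: b is widened to Double

println(intResult)    // prints 3
println(doubleResult) // prints 3.5
```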
Creating a Data Frame Directly, Without a Function
There are a couple of ways to do it, but in general, creating a data frame in Spark Scala has two parts:
the Sequence (Seq) of values, one element per row,
the schema of the data frame, i.e. column names and types.
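To make the two parts concrete, here is a minimal sketch that builds the rows and the schema separately and combines them with createDataFrame. The column names and values are made up for illustration, and the local SparkSession is an assumption (in spark-shell or a notebook, `spark` usually exists already).

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, LongType, StringType, StructField, StructType}

// Local session for experimenting; reuse the existing `spark` if you have one.
val spark = SparkSession.builder().master("local[*]").appName("schema-sketch").getOrCreate()

// Part 1: the values, one Row per data frame row (hypothetical data).
val rows = Seq(
  Row("alice", 3.5, 10L),
  Row("bob", null, 25L) // null is allowed here because "score" is nullable
)

// Part 2: the schema, i.e. column name, Spark type, and nullability.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("score", DoubleType, nullable = true),
  StructField("visits", LongType, nullable = false)
))

val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
df.printSchema()
df.show()
```

Writing the schema by hand is more verbose, but it gives you control over the exact Spark types and over which columns may contain nulls.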
The easiest way I found is here
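The original snippet is not reproduced here, so the following is only a minimal sketch of the toDF approach, assuming a local SparkSession and made-up column names and values. Spark infers the column types from the Scala types in each tuple.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimenting; reuse the existing `spark` if you have one.
val spark = SparkSession.builder().master("local[*]").appName("toDF-sketch").getOrCreate()
import spark.implicits._ // brings .toDF into scope for Seq

// Each tuple is one row; the arguments to toDF are the column names.
val df = Seq(
  ("alice", 3.5, 10L),
  ("bob", 4.0, 25L)
).toDF("name", "score", "visits")

df.printSchema() // types are inferred from the tuple: string, double, long
df.show()
```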
More in-depth tutorial. For making a data frame with Array columns, look here.
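For Array columns, one possible sketch (not the linked tutorial's code) is to put a Scala Seq inside each tuple; with spark.implicits in scope, Spark maps it to an array column. The names and values below are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.ArrayType

// Local session for experimenting; reuse the existing `spark` if you have one.
val spark = SparkSession.builder().master("local[*]").appName("array-sketch").getOrCreate()
import spark.implicits._

// The second element of each tuple is a Seq[Int], which becomes an array column.
val df = Seq(
  ("alice", Seq(1, 2, 3)),
  ("bob", Seq(4, 5))
).toDF("name", "values")

df.printSchema()
df.show()
```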
Creating a Data Frame With a Function
There are cases where you need to build a data frame dynamically, for example with a for-loop that calls a function which builds a Sequence; you then convert that Sequence to a data frame using toDF as shown above.
☞ See the "Machine Learning in Spark" page of this book, where I built a data frame to view and compare the results of different models on different data (data frames).
Writing code in functions is one of the best coding practices you can adopt; it makes your code reusable, packageable, clean, and easy to read and debug. Always write in functions.
Basically, here's the structure:
Make a function that gathers the values you want and outputs them in a tuple whose types match the columns of your desired result data frame.
Create an empty Sequence typed with the type of each of the values.
Use a for-loop to call the function and append each resulting tuple to the empty Sequence created in the previous step.
Use the .toDF method and add a column name for each of your values/columns.
Here's the generic syntax of the steps, using 4 values of types (String, Double, String, Long) as an example:
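The original snippet is not reproduced here, so what follows is a sketch of the four steps under stated assumptions: the function name `evaluate`, its placeholder body, and the model and dataset names are all hypothetical, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimenting; reuse the existing `spark` if you have one.
val spark = SparkSession.builder().master("local[*]").appName("loop-sketch").getOrCreate()
import spark.implicits._

// Step 1: a function that returns one row as a tuple (String, Double, String, Long).
// `evaluate` and its body are hypothetical stand-ins for real work.
def evaluate(model: String, dataset: String): (String, Double, String, Long) = {
  val score = model.length.toDouble / 10 // stand-in for a real metric
  val rowCount = dataset.length.toLong   // stand-in for a real row count
  (model, score, dataset, rowCount)
}

// Step 2: an empty Seq typed like the function's result.
var results = Seq.empty[(String, Double, String, Long)]

// Step 3: loop, call the function, and append each tuple.
for (model <- Seq("modelA", "modelB"); dataset <- Seq("train", "test")) {
  results = results :+ evaluate(model, dataset)
}

// Step 4: convert to a data frame and name the columns.
val resultsDF = results.toDF("model", "score", "dataset", "row_count")
resultsDF.show()
```

If you prefer to avoid the mutable `var`, the same Seq can be produced with `for (...) yield evaluate(model, dataset)`; the appending loop is kept here because it mirrors the steps listed above.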
Convert a List to a Data Frame
I haven't needed this yet myself. Read this post for some ideas.
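As one idea, and only as a sketch: with spark.implicits in scope, toDF is available on a List the same way it is on a Seq. The values and column names below are made up, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.SparkSession

// Local session for experimenting; reuse the existing `spark` if you have one.
val spark = SparkSession.builder().master("local[*]").appName("list-sketch").getOrCreate()
import spark.implicits._

// A List of single values becomes a one-column data frame.
val names = List("alice", "bob", "carol")
val namesDF = names.toDF("name")

// A List of tuples becomes one column per tuple element.
val pairs = List(("alice", 10L), ("bob", 25L))
val pairsDF = pairs.toDF("name", "visits")

namesDF.show()
pairsDF.show()
```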