Creating a Data Frame
How to make a data frame, and how to write its schema
Unlike Python, creating a data frame in Spark requires you to define its schema, which is the data frame's structure, i.e. the column names, types, and nullability. To build it the right way, you need to know the Scala data types and the properties and limitations of each. I find that confusing, but here are a couple of resources on Scala data types and how to write schemas; you can research more on your own.
Another type I ran into is 'BigInt' from Scala's math library, but mostly LongType from Spark is good enough for big integers. Usually, calling a .to*** method (such as .toInt or .toFloat) on a number changes its type. For example, 19.toFloat gives 19.0 and 19.45.toInt gives 19.
Marginal note: If you want to divide two numbers and get a decimal, at least one of the numbers has to be of Float or Double type.
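To make the conversion and division notes concrete, here is a small plain-Scala sketch (no Spark needed). The value names are made up for illustration.

```scala
// Numeric conversions with the .toXxx methods
val asFloat: Float = 19.toFloat        // 19.0
val asInt: Int = 19.45.toInt           // 19 (the fractional part is dropped, not rounded)

// Int / Int is integer division, so the fraction is lost
val intDivision = 7 / 2                // 3

// Converting one operand to Double gives a decimal result
val decimalDivision = 7.toDouble / 2   // 3.5

// BigInt can hold values that no longer fit in a Long
val tooBigForLong = BigInt(Long.MaxValue) + 1
```

Note that .toInt truncates toward zero, so use math.round if you want rounding instead.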
There are a couple of ways to do it, but in general, creating a data frame in Spark Scala has two parts:
the Sequence of the columns' values,
the schema of the data frame, i.e. column names and types.
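The two parts can be written out explicitly with a StructType schema and createDataFrame. This is only a sketch: the local SparkSession, the column names, and the sample rows are all assumptions for illustration (in spark-shell or a notebook, a `spark` session usually exists already).

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, LongType, StringType, StructField, StructType}

// Assumption: a local session created just for this example
val spark = SparkSession.builder().master("local[*]").appName("schema-sketch").getOrCreate()

// Part 1: the values, one Row per record (made-up sample data)
val rows = Seq(
  Row("apple", 1.25, 10L),
  Row("pear", 0.80, 25L)
)

// Part 2: the schema, i.e. column name, type, and nullability for each column
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("price", DoubleType, nullable = true),
  StructField("quantity", LongType, nullable = true)
))

// Combine the two parts into a data frame
val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
df.printSchema()
df.show()
```

The value types inside each Row have to match the schema (for example 10L, not 10, for a LongType column), otherwise Spark fails when it evaluates the data frame.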
The easiest way I found is to call .toDF on a Sequence of values and pass the column names.
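The toDF approach referred to in this section can be sketched as follows. This is not the original snippet; the local SparkSession, the column names, and the sample values are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session created just for this example
val spark = SparkSession.builder().master("local[*]").appName("todf-sketch").getOrCreate()
import spark.implicits._   // brings .toDF into scope for a Seq of tuples

// Each tuple is one row; Spark infers the column types from the tuple's types
val fruitDf = Seq(
  ("apple", 1.25, 10L),
  ("pear", 0.80, 25L)
).toDF("name", "price", "quantity")

fruitDf.show()
fruitDf.printSchema()
```

Here you only supply the column names; the types (string, double, long) come from the Scala types of the tuple elements.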
There are cases where you need to build a data frame dynamically, for example with a for-loop that calls a function which builds a Sequence; you then convert that Sequence to a data frame using toDF as shown above.
☞ See the "Machine Learning in Spark" page of this book, where I built a data frame to view and compare the results of different models on different data (dataframes).
Writing code in functions is one of the best coding practices you can adopt; it makes your code reusable, packageable, clean, and easy to read and debug. Always write in functions.
Basically, here's the structure:
Make a function that gathers the values you want and outputs them in a tuple of the same type as a row of your desired result data frame.
Create an empty Sequence whose element type is a tuple of the types of those values.
Use a for-loop to call the function and append the resulting tuple to the empty Sequence created in the previous step.
Use the .toDF method and add a column name for each of your values/columns.
Here's the generic syntax of the steps. I'm using 4 values of types (String, Double, String, Long) as an example.
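A minimal sketch of those four steps, using a (String, Double, String, Long) tuple. This is not the original snippet: the function `summarize`, its placeholder calculations, the model and dataset names, and the column names are all hypothetical, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session created just for this example
val spark = SparkSession.builder().master("local[*]").appName("loop-sketch").getOrCreate()
import spark.implicits._

// Step 1: a (hypothetical) function that returns one row as a tuple
def summarize(model: String, dataset: String): (String, Double, String, Long) = {
  val score = model.length.toDouble / 10   // placeholder, stands in for a real metric
  val rowCount = dataset.length.toLong     // placeholder, stands in for a real count
  (model, score, dataset, rowCount)
}

// Step 2: an empty Sequence typed like the tuple
var results = Seq.empty[(String, Double, String, Long)]

// Step 3: loop, call the function, and append each tuple
for (model <- Seq("modelA", "modelB"); dataset <- Seq("train", "test")) {
  results = results :+ summarize(model, dataset)
}

// Step 4: convert to a data frame and name the columns
val resultsDf = results.toDF("model", "score", "dataset", "row_count")
resultsDf.show()
```

If you prefer to avoid the var, a for/yield expression or a map over the inputs builds the same Sequence without mutation.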
For making a data frame with Array columns, look .
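As a rough illustration of an Array column (a sketch of my own, not taken from the linked resource), a Seq inside each tuple becomes an array column when you call toDF. The local SparkSession, column names, and values are assumptions.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session created just for this example
val spark = SparkSession.builder().master("local[*]").appName("array-sketch").getOrCreate()
import spark.implicits._

// The second element of each tuple is a Seq, which Spark stores as an array column
val tagsDf = Seq(
  ("apple", Seq("red", "sweet")),
  ("lemon", Seq("yellow", "sour", "citrus"))
).toDF("name", "tags")

tagsDf.printSchema()
tagsDf.show(truncate = false)
```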
I haven't needed this yet myself. Read this for some ideas.