Operations on Multiple Columns at Once
Transforming a dataframe, many columns in one command. Mapping.
NOTE: To use this renaming method, the number of new names must equal the number of original columns: you have to supply a name for every column, keeping the old name for any column you don't want to change. And mind the order!
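A minimal sketch of that renaming, assuming a local Spark session (the dataframe and column names here are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("rename").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("old_id", "old_val")

// One new name per original column, in the same order.
val newColumnNames = Seq("id", "value")
val renamed = df.toDF(newColumnNames: _*)
```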
To get the length of any List, Seq, or Array, use .size, e.g. df.columns.size. Comparing the sizes of the original columns and the new name list is a quick check that they line up correctly.
You can check the type of any object using .getClass, like so: df.toDF(newColumnNames: _*).getClass.getName
This is especially handy if you want to add a prefix or suffix to every column name, as in the example below, where the suffix is "_df1".
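A sketch of the suffix case, assuming a dataframe named df1 (the dataframe and its columns are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("suffix").getOrCreate()
import spark.implicits._

val df1 = Seq((1, "a")).toDF("id", "name")

// Append "_df1" to every existing column name, then rename all at once.
val suffixed = df1.toDF(df1.columns.map(_ + "_df1"): _*)
```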
There are a couple of ways to change a column's type; .cast is one of them. Below, we make a Sequence of tuples and map them directly onto the dataframe.
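One way to sketch that mapping, assuming columns named id and price with string data (names and target types are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("cast").getOrCreate()
import spark.implicits._

val df = Seq(("1", "9.99"), ("2", "5.00")).toDF("id", "price")

// (column name, target type) pairs, using the lower-case type names.
val newTypes = Seq(("id", "long"), ("price", "double"))

// Map each pair to a cast column and select them all in one statement.
val casted = df.select(newTypes.map { case (c, t) => col(c).cast(t).as(c) }: _*)
```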
☞ NOTE: There's a difference in how you declare types. Here, we use lower-case, one-word names. If we're writing schemas with StructType and StructField, then "double" would be DoubleType, "string" would be StringType, and so on.
☞ Read the docs for org.apache.spark.sql.Column; its .cast function lists the data types available to you: string, boolean, byte, short, int, long, float, double, decimal, date, timestamp.
Say you have a list of columns that you often select. Put their names in a list, then map the col function over them inside the select statement. Like so,
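A sketch of that pattern (the dataframe and the column names are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("select").getOrCreate()
import spark.implicits._

val df = Seq((1, "a", true)).toDF("id", "name", "active")

val usualCols = Seq("id", "name")           // the columns you often select
val selected = df.select(usualCols.map(col): _*)  // map col over the names
```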
Remember, the :_* operator expands a list/seq/array so its elements are passed as individual (vararg) arguments.
You can use the same technique with other functions, like drop,
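For instance, dropping a whole list of columns at once (names here are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("drop").getOrCreate()
import spark.implicits._

val df = Seq((1, "a", true)).toDF("id", "name", "active")

// drop accepts varargs of column names, so expand the list with :_*
val colsToDrop = Seq("name", "active")
val dropped = df.drop(colsToDrop: _*)
```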
We can extend what we do for one column: use another na.fill() call to fill a different value into another set of columns. The same Seq method (or List or Array) works whether there's one column or many.
☞ IMPORTANT: The type of the fill value must match the type of the column(s) you're filling. This is especially important if you have a LongType or a big DoubleType column and you want to substitute zeros for nulls: you need a BIGINT(0) or 0L instead of just 0. If plain 0 errors out, try both to see which one works.
To fill in multiple columns with the same value, use this,
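A sketch, including a second chained na.fill for a different value on another set of columns (dataframe and column names are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("fill").getOrCreate()
import spark.implicits._

val df = Seq(
  (Some(1L), None: Option[Long], None: Option[String]),
  (None: Option[Long], Some(5L), Some("x"))
).toDF("a", "b", "c")

// Same value for several numeric columns; note 0L to match LongType...
val filledNums = df.na.fill(0L, Seq("a", "b"))
// ...then chain another na.fill for a different value on another set.
val filled = filledNums.na.fill("missing", Seq("c"))
```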
Other ways to achieve the same thing can be found in the official Spark Scala API docs, searching for the org.apache.spark.sql.DataFrameNaFunctions package. Basically, you can also pass a Map,
Or
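The Map-based variant might look like this sketch, which sets a different replacement per column in one call (dataframe and column names are assumptions):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("fillMap").getOrCreate()
import spark.implicits._

val df = Seq(
  (None: Option[Long], None: Option[String]),
  (Some(5L), Some("x"))
).toDF("a", "c")

// Keys are column names; values are the per-column replacements.
val filled = df.na.fill(Map("a" -> 0L, "c" -> "missing"))
```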
For all information about UDFs, read the "User Defined Functions - UDF" page in this book.
You can also use functions to alter several columns at once in the dataframe and output a new dataframe. See the "Functions on Data Frames" section of the "DataFrame Manipulations" page of this book for full details.
From a post, which mentions a different way to define a list of strings.
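For example, splitting a space-separated string is one alternative to writing out a Seq literal (a pure-Scala sketch; the names are illustrative):

```scala
// Two equivalent ways to define a list of column names.
val names1 = Seq("id", "name", "age")
val names2 = "id name age".split(" ").toSeq
```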