Operations on Multiple Columns at Once

Transforming a dataframe by operating on many columns in one command, mostly by mapping over them.

Renaming

val newColumnNames = Seq("newName1", "newName2", "newName3")
df = df.toDF(newColumnNames: _*)

NOTE: To use this renaming method, the number of new names must equal the number of existing columns, i.e. you have to supply a name for every column, keeping the names of the ones you don't want to change. And mind the order: the new names are assigned positionally!

To get the length of any List, Seq, or Array, use .size, e.g. df.columns.size. That lets you check that the original and new column lists line up correctly. You can check the type of any object using .getClass, like so: df.toDF(newColumnNames: _*).getClass.getName. Credit
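
For example, a minimal sketch of that sanity check, reusing the same hypothetical names as above:

// hypothetical new names; there must be exactly one per existing column
val newColumnNames = Seq("newName1", "newName2", "newName3")

// compare sizes before renaming, so toDF doesn't fail or mis-assign names
println(s"original: ${df.columns.size}, new: ${newColumnNames.size}")

val renamed = df.toDF(newColumnNames: _*)
println(renamed.getClass.getName)  // e.g. org.apache.spark.sql.Dataset in Spark 2+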

Another Way to Rename: Aliasing

This is useful specifically when you want to add a prefix/suffix to every column name. This source mentions the PySpark example below, where the suffix is "_df1":

df1_r = df1.select(*(col(x).alias(x + '_df1') for x in df1.columns))
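
In the Scala API, the same idea can be sketched like this (assuming the same "_df1" suffix and that col is imported from org.apache.spark.sql.functions):

import org.apache.spark.sql.functions.col

// alias every column of df1 with a "_df1" suffix
val df1_r = df1.select(df1.columns.map(c => col(c).alias(c + "_df1")): _*)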

Change Type

There are a couple of ways to change a column's type; .cast is one of them. Below, we make a Sequence of (column name, type) tuples, map each pair to a cast column expression, and pass the result to select.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val types = Seq(("col1","string"), ("col2","timestamp"), ("col3","string"), ("col4","long"), ("col5","double"), ("col6","double"))
df = df.select(types.map { case (c, t) => col(c).cast(t) }: _*)

☞ NOTE: There's a difference in how you declare types. Here, we use lowercase, single-word strings. If we were writing schemas with StructType and StructField, then "double" would be DoubleType, "string" would be StringType, and so on. Answer Credit

☞ Read the docs for the org.apache.spark.sql.Column class; its .cast function lists the data types available to you: string, boolean, byte, short, int, long, float, double, decimal, date, timestamp.
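
To make the note above concrete, here is a small sketch showing both ways of naming the target type in .cast: the lowercase string name and the DataType object from org.apache.spark.sql.types (reusing the hypothetical column "col5" from above):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// both expressions cast "col5" to double
df.select(col("col5").cast("double"))
df.select(col("col5").cast(DoubleType))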

Passing a List

Say you have a set of columns that you often select. You can put their names in a list, then map the col function over them within the select statement. Like so,

import org.apache.spark.sql.functions._

//the list of column names you want
val wantedCols: List[String] = List("col1","col2",...)

//`map` a `col` function to them, and pass to `select`
df.select(wantedCols.map(col): _*).show

Remember the : _* syntax is Scala's varargs expansion: it passes each element of the list/seq/array as a separate argument.

You can use the same technique for other functions, like drop,

val wantedCols: List[String] = List("col1","col2",...)

df.drop(wantedCols: _*).show(false)

Credit

Filling Nulls

We can extend what we do for one column to several columns.

From this post,

  • Use another na.fill() call to fill a different value into another set of columns.

  • Use the same Seq method (or List or Array) whether there is one column or many.

df.na.fill("value", Seq("col_1","col_2",..))
    .na.fill("another_value", Seq("col_14","col_39", ...))

☞ IMPORTANT: The type of the fill value must match the type of the column(s) you're filling. This is especially important if you have a LongType or a big DoubleType column and you want to substitute zeros for nulls: you may need BIGINT(0) or 0L instead of a plain 0. If plain 0 errors out, try those.
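
As a minimal sketch, assuming a hypothetical LongType column "col4" and a Spark version recent enough to have the Long overload of na.fill:

// pass a Long literal (0L), not an Int 0, so the fill value's type
// matches the column's LongType
df.na.fill(0L, Seq("col4"))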

To fill in multiple columns with the same value, use this,

val colNames = Seq[String]("col1", "col2", "col3")
df.na.fill(valueToFill, colNames)

The source mentions a different way to define a list of strings.

Other ways to achieve the same thing are found in the official Spark Scala API docs, under the org.apache.spark.sql.DataFrameNaFunctions class.

Basically, you can also pass a Map from column names to fill values,

df.na.fill(Map(
  "A" -> "unknown",
  "B" -> 1.0
))

Or, in Java,

import com.google.common.collect.ImmutableMap;

df.na.fill(ImmutableMap.of("A", "unknown", "B", 1.0));

Applying the Same Function

For all information about UDFs, read the "User Defined Functions - UDF" page in this book.

Answer Credit

// the columns to transform; applyFunction is assumed to be defined (e.g. a UDF)
val columnsToMap = df.select("rbc", "cad", "pe", "ane").columns

// apply the same function to each column, reassigning the result each time
var tempdf = df
columnsToMap.foreach { column =>
  tempdf = tempdf.withColumn(column, applyFunction(col(column)))
}

tempdf.show(false)
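
An alternative sketch of the same loop, avoiding the mutable tempdf, is a foldLeft over the column names (still assuming the hypothetical applyFunction from above):

// start from df and apply the function to each listed column in turn
val mapped = columnsToMap.foldLeft(df) { (acc, column) =>
  acc.withColumn(column, applyFunction(col(column)))
}

mapped.show(false)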

☞ NOTE:

You can also use functions that alter several columns at once in the dataframe and output a new dataframe. See full details in the "Functions on Data Frames" section of the "DataFrame Manipulations" page of this book.
