Accessing Variables in Data Frames

How to grab columns, and items in them

col , $ , and Single Quote

To access a column, and apply functions to it, grab it with either col or $ with double quotations, or just a single quote. They're needed when you want to access a variable within the column, or when you need to apply a function on the column.

col is imported from org.apache.spark.sql.functions.col

Example,

//either of the following prints the column "dept_id"
df.select($"dept_id").show
df.select(col("dept_id")).show
df.select('dept_id).show

You can use it in functions too, e.g. df.select(max('dept_id)).show returns the max. Max and any other function needs to be imported from the functions library as well.

Using SQL Syntax

If you're more used to SQL, you can register the dataframe as a temporary table, then use SQL syntax on it directly with spark.sql statement. Read more on functions available for SQL syntax in the official Spark 2.4.3 SQL API docs. Here's an example below,

//register it as SQL table
df.createOrReplaceTempView("mydftable")

//use SQL syntax directly
val df2 = spark.sql("SELECT col1 AS c1, col2+col3 AS sumCol FROM mydftable ")
df2.show

For Nested/Multi-Level Columns

Sometimes you have nested columns, you'll know when you do a df.printSchema and you literally see a second level in the tree under a given column.

In this case, do df.select($"colName.varName") , and that can be as many levels as you want, so like colName.var1name.var2name...

Get an Element from an Array Column

Can be done with any column picking function like select, selectExpr, or withColumn

Two ways,

  • Directly index into it, df.select($"arrayColName"(0)) or df.selectExpr("arrayColName[0]") in SQL expression. See below for code examples.

  • Or use .getItem() like this df.withColumn("newCol", $"arrayColName".getItem(0)) Credit https://stackoverflow.com/a/52121738/11381214

Like Python, first item index is zero here.

💡 Pro Tip - Using 'alias' and 'as' When Selecting Elements from Array Columns

Without using a name to the newly extracted column from the array one, the name of the new column to be as long as the expression to get it.

To overcome that, you can use .alias or its SQL expression equivalent as .

//with `as`
df.select('arrayColName(0) as "first_item").show

//with `.alias`
df.select('arrayColName.getItem(0).alias("first_item")).show

//with selectExpr in SQL expression
df.selectExpr("arrayColName[0] AS first_item").show


//different items from same column
df.selectExpr("col1[0] AS first", "col1[1]" AS second, "col1[2]" AS third)

//or different items from different columns
df.selectExpr("col1[0] AS new_name", "col2[2]" AS new_name2, "col3[1]" AS new_name3)

☞ NOTE: Remember that a select statement can add modified columns and unmodified ones. If you're using $"" to select columns, paird with .alias on even one of the columns, then you must use the $for all columns, even if alias isn’t used, or if we don’t need to select elements from the rest of the columns.

i.e. this will work, df.select($"col1"(0).alias("first"), $"col2")

but the following won't, df.select($"col1"(0).alias("first"), "col2")

So you can make it easier for yourself, if you use the single quote to select and operate on columns instead, like this inventory.select('col1(0).alias("first"), 'col2).show Or, df.select('col1(0) as "first", 'col2).show

☞ For other needs to handle array columns, see "Array Columns" page in this book.

Last updated