Accessing Variables in Data Frames
How to grab columns, and items in them
col
, $
, and Single Quote
col
, $
, and Single QuoteTo access a column, and apply functions to it, grab it with either col
or $
with double quotations, or just a single quote. They're needed when you want to access a variable within the column, or when you need to apply a function on the column.
col
is imported from org.apache.spark.sql.functions.col
Example,
You can use it in functions too, e.g. df.select(max('dept_id)).show
returns the max. Max and any other function needs to be imported from the functions library as well.
Using SQL Syntax
If you're more used to SQL, you can register the dataframe as a temporary table, then use SQL syntax on it directly with spark.sql
statement. Read more on functions available for SQL syntax in the official Spark 2.4.3 SQL API docs. Here's an example below,
For Nested/Multi-Level Columns
Sometimes you have nested columns, you'll know when you do a df.printSchema
and you literally see a second level in the tree under a given column.
In this case, do df.select($"colName.varName")
, and that can be as many levels as you want, so like colName.var1name.var2name...
Get an Element from an Array Column
Can be done with any column picking function like select
, selectExpr
, or withColumn
Two ways,
Directly index into it,
df.select($"arrayColName"(0))
ordf.selectExpr("arrayColName[0]")
in SQL expression. See below for code examples.Or use
.getItem()
like thisdf.withColumn("newCol", $"arrayColName".getItem(0))
Credit https://stackoverflow.com/a/52121738/11381214
Like Python, first item index is zero here.
💡 Pro Tip - Using 'alias' and 'as' When Selecting Elements from Array Columns
Without using a name to the newly extracted column from the array one, the name of the new column to be as long as the expression to get it.
To overcome that, you can use .alias
or its SQL expression equivalent as
.
☞ NOTE: Remember that a select statement can add modified columns and unmodified ones. If you're using $""
to select columns, paird with .alias
on even one of the columns, then you must use the $
for all columns, even if alias isn’t used, or if we don’t need to select elements from the rest of the columns.
i.e. this will work, df.select($"col1"(0).alias("first"), $"col2")
but the following won't,
df.select($"col1"(0).alias("first"), "col2")
So you can make it easier for yourself, if you use the single quote to select and operate on columns instead, like this inventory.select('col1(0).alias("first"), 'col2).show
Or, df.select('col1(0) as "first", 'col2).show
☞ For other needs to handle array columns, see "Array Columns" page in this book.
Last updated