Filtering & Nulls
.filter command, how to fill nulls
Filling Nulls
See "Operations on Multiple Columns at Once" and "Operations on One Column" pages of this book.
Filtering on Nulls or Non-Nulls
Use the SQL string syntax df.filter("colname IS NOT NULL")
or the column methods df.filter($"colname".isNotNull)
or .isNull
Note that df.filter($"colname" != null) does not work: null comparisons follow SQL three-valued logic and never evaluate to true, so use isNull/isNotNull instead.
Several other ways are shown in the answers of this post, including the when-otherwise clause, which is useful for conditional filling. For example, to replace nulls with 0 while keeping other values intact: df.withColumn("colname", when($"colname".isNull, 0).otherwise($"colname")).show
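A minimal sketch pulling these together (assuming a local SparkSession and a toy dataframe; names are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

val spark = SparkSession.builder.master("local[*]").appName("nulls").getOrCreate()
import spark.implicits._

// Toy data: Option[Int] so some rows hold null
val df = Seq(Some(1), None, Some(3)).toDF("colname")

df.filter($"colname".isNotNull).show()   // keeps the non-null rows
df.filter("colname IS NULL").show()      // keeps the null row
// Fill nulls with 0; otherwise keeps the original value
df.withColumn("colname", when($"colname".isNull, 0).otherwise($"colname")).show()
```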
Filtering Rules and Operators
You can use SQL syntax, all in one string, or methods on columns. For example, both df.filter("colname > 0 AND colname < 10")
and df.filter(($"colname">0) && ($"colname"<10))
will work.
Another example: you can use Spark syntax df.filter($"colname" === "something")
or SQL syntax df.filter("colname = 'something'")
(Spark SQL accepts == as well as the standard SQL =.)
☞ NOTE: I found that you can't mix the SQL string syntax with column expressions in one filter statement, i.e. when binding them together with a logical operator like || or &&. Instead, chain another .filter clause after.
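For instance, instead of combining a SQL string and a Column expression with &&, chain two .filter calls; successive filters AND together (a sketch, assuming df has a numeric colname):

```scala
// This does NOT compile: a SQL string mixed with a Column expression
// df.filter("colname > 0" && $"colname" < 10)

// Chain the filters instead
df.filter("colname > 0").filter($"colname" < 10).show()
```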
Filtering Multiple Columns at Once, For Same Conditions
We can map a formatted string to each column in the list, and pass the result to .filter.
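The original code block is missing from this copy; a sketch that reproduces the filterString described below (column names col1, col2, col3 are taken from the example output):

```scala
val colnames = List("col1", "col2", "col3")
val filterString = colnames.map(c => s"($c >= 0.0 AND $c <= 24.0)").mkString(" OR ")
// filterString: String = (col1 >= 0.0 AND col1 <= 24.0) OR (col2 >= 0.0 AND col2 <= 24.0) OR (col3 >= 0.0 AND col3 <= 24.0)
df.filter(filterString).show()
```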
What happened here? The first two lines generate the result filterString: String = (col1 >= 0.0 AND col1 <= 24.0) OR (col2 >= 0.0 AND col2 <= 24.0) OR (col3 >= 0.0 AND col3 <= 24.0) They map every element of the list into the formatted string, then join the whole thing into one string at the end, so you can pass it as SQL syntax to the .filter clause.
Another way is using .map and .reduce:
df.filter(colnames.map(name => (col(name) >= 0.1) && (col(name) < 0.2)).reduce(_ or _)).show
I didn't use this solution because it didn't work properly for me on more than one column. Also, it's overkill for something that can be done easily with a .filter clause on one column.
💡 A Word of Advice
What you learned about logic and De Morgan's laws for "not", "or", and "and" in your programming classes applies here. But I always, always test it before I go with it. For example, if you want no nulls and no zeros, you would type df.filter("colName IS NOT NULL AND colName != 0")
Just test it out before you run with it. Things like that, and operator precedence, can botch your entire work.
Some Other Useful Filtering Functions
Since Spark deals with SQL dataframes, you will see many similarities between the two languages when working with dataframes; you can even write a SQL expression and use it directly in filtering or selecting columns.
Some of the other very useful functions are like, isin, and contains, among others. Here's how to use them.
Suppose you have a list called sampleDevices; then you can keep only the rows whose DeviceId appears in it with df.filter($"DeviceId".isin(sampleDevices: _*))
Suppose you're looking for rows where a certain column contains a certain text; then you can use
df.filter($"DeviceId".contains("3A"))
Not Like: put the "not" up front,
df.filter(!$"dateHour".like("%2019-11-01%"))
There are also startsWith and endsWith functions to specify the wanted starting and ending string, respectively.
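A small sketch of these column predicates side by side (the device IDs and column name are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("predicates").getOrCreate()
import spark.implicits._

val df = Seq("3A-01", "7F-02", "3A-09").toDF("DeviceId")
val sampleDevices = List("3A-01", "7F-02")

df.filter($"DeviceId".isin(sampleDevices: _*)).show()  // rows whose value is in the list
df.filter($"DeviceId".contains("3A")).show()           // substring match
df.filter(!$"DeviceId".like("%-02")).show()            // NOT LIKE, "not" up front
df.filter($"DeviceId".startsWith("3A")).show()         // prefix match
```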
☞ For the full list of allowed operations on columns, visit the Spark official docs page for the Scala API and look up org.apache.spark.sql.Column
Filter on an Aggregate, and Return the Whole Row from the Dataframe
As I explain in the "Aggregations" page of this book, and in this answer, the full one-line solution is:
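The original one-liner is missing from this copy; a common shape for this pattern computes the aggregate first and filters on it (a sketch, assuming a numeric column named score):

```scala
import org.apache.spark.sql.functions.max

// Return the whole row(s) holding the maximum value of "score":
// the inner agg collects the max to the driver, the outer filter keeps matching rows
df.filter($"score" === df.agg(max($"score")).first().get(0)).show()
```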