Filtering & Nulls

The .filter command, and how to fill nulls

Filling Nulls

See "Operations on Multiple Columns at Once" and "Operations on One Column" pages of this book.

Filtering on Nulls or Non-Nulls

Use the SQL string syntax df.filter("colname IS NOT NULL"), or the Column methods df.filter($"colname".isNotNull) and .isNull. Avoid df.filter($"colname" =!= null): comparisons against null evaluate to null in SQL, so that filter returns no rows; stick to .isNull / .isNotNull.
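As a quick illustration on a toy DataFrame with a hypothetical amount column, both styles keep exactly the non-null rows:

// assumes an existing SparkSession named spark and import spark.implicits._
val df = Seq(("a", Some(1.5)), ("b", None: Option[Double])).toDF("id", "amount")

df.filter("amount IS NOT NULL").show   // SQL string syntax
df.filter($"amount".isNotNull).show    // Column method syntax
df.filter($"amount".isNull).show       // the opposite: only the null rows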

Several other ways are shown in the answers to this post, including the when-otherwise clause, which is useful for conditional filling, e.g. df.withColumn("colname", when($"colname".isNull, 0).otherwise(1)).show (this turns the column into a 0/1 flag; use .otherwise($"colname") if you want to keep the original non-null values).
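A minimal sketch of the when-otherwise pattern on the same toy amount column; the second variant keeps the original values and only replaces the nulls:

import org.apache.spark.sql.functions.when

// flag nulls: 0 where amount is null, 1 otherwise
df.withColumn("amountFlag", when($"amount".isNull, 0).otherwise(1)).show

// fill nulls: 0.0 where amount is null, keep the original value otherwise
df.withColumn("amount", when($"amount".isNull, 0.0).otherwise($"amount")).show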

Filtering Rules and Operators

You can use SQL syntax, all in one string, or methods on columns. For example, both df.filter("colname > 0 AND colname < 10") and df.filter(($"colname" > 0) && ($"colname" < 10)) will work. As another example, you can use Spark's Column syntax df.filter($"colname" === "something") or SQL syntax df.filter("colname == 'something'").

☞ NOTE: I found that you can't mix SQL string syntax with Column expressions in one filter statement, i.e. you can't bind them together with a logical operator like || or &&; instead, chain another .filter clause after it, as in the sketch below.
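For example, a minimal sketch of the chaining workaround, reusing the hypothetical amount column from above:

// one condition in SQL string syntax, the other as a Column expression,
// chained as two .filter calls instead of being combined with &&
df.filter("amount > 0").filter($"amount" < 10).show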

Filtering Multiple Columns at Once, for the Same Conditions

We can map each column in the list to a formatted condition string, join the pieces, and pass the result to .filter:

//columns we need to filter on same conditions
val colnames = List[String]("col1","col2","col3")

//create the SQL string
val filterString = colnames.map(c => s"($c >=0.0 AND  $c <= 24.0)").mkString(" OR ")

//apply it
df.filter(filterString).show

Credit

What happened here?

The first two lines generate filterString: String = (col1 >=0.0 AND col1 <= 24.0) OR (col2 >=0.0 AND col2 <= 24.0) OR (col3 >=0.0 AND col3 <= 24.0). Each element of the list is mapped into the formatted condition string, and mkString joins the pieces into one string at the end, so you can pass it as SQL syntax to the .filter clause.

Another way, using .map and .reduce

// requires: import org.apache.spark.sql.functions.col
df.filter(colnames.map(name => (col(name) >= 0.1) && (col(name) < 0.2)).reduce(_ or _)).show

I didn't use this solution because it didn't work properly on more than one column. Also, it's overkill for something that can be done easily in a .filter clause for one column.

💡 A Word of Advice

What you learned about logic and De Morgan's laws for "not", "or", and "and" in your programming classes applies here. But I always, always test it before I go with it. So for example, if you want no nulls and no zeros, you would type df.filter("colName IS NOT NULL AND colName != 0")

Just test it out before you run with it. Things like that, and operator ordering, can botch your entire work.
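For instance, by De Morgan's laws the two filters below should keep exactly the same rows, which is easy to confirm on a small DataFrame before trusting it on real data (the amount column is again hypothetical):

// "not (null or zero)" ...
df.filter(!($"amount".isNull || $"amount" === 0)).show

// ... should keep the same rows as "not null and not zero"
df.filter($"amount".isNotNull && $"amount" =!= 0).show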

Some Other Useful Filtering Functions

Since Spark deals with SQL DataFrames, you will see many similarities between the two languages when working with DataFrames; you can even write a SQL expression and use it directly when filtering or selecting columns. Some of the other very useful functions are isin and contains, among others. Here's how to use those (a combined sketch follows the list),

  • Suppose you have a list called sampleDevices; then you can keep only the rows whose DeviceId appears in it with df.filter($"DeviceId".isin(sampleDevices: _*))

  • Suppose you're looking for rows in which a certain column contains a certain text; then you can use df.filter($"DeviceId".contains("3A"))

  • Not Like: put the "not" up front, df.filter(!$"dateHour".like("%2019-11-01%"))

There are also startsWith and endsWith functions to specify the wanted starting and ending string, respectively.
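Here's a combined sketch of these functions on a made-up DeviceId/dateHour DataFrame with a hypothetical sampleDevices list (assumes import spark.implicits._):

// toy data; the column names follow the examples above
val devices = Seq(
  ("3A-100", "2019-11-01 10"),
  ("7B-200", "2019-11-02 11")
).toDF("DeviceId", "dateHour")

val sampleDevices = List("3A-100", "5C-300")

devices.filter($"DeviceId".isin(sampleDevices: _*)).show   // membership in a list
devices.filter($"DeviceId".contains("3A")).show            // substring match
devices.filter(!$"dateHour".like("%2019-11-01%")).show     // "not like"
devices.filter($"DeviceId".startsWith("7B")).show          // prefix match
devices.filter($"dateHour".endsWith("11")).show            // suffix match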

☞ For a full list of allowed operations on columns, visit the Spark official docs page for the Scala API and look up org.apache.spark.sql.Column

Filter on an Aggregate, and Return the Whole Row from the Dataframe

As I explain in the "Aggregations" page of this book, and in this answer, the full one-line solution is,

// requires: import org.apache.spark.sql.functions.max
// keep every row whose col1 equals the overall maximum of col1
df.filter($"col1" === df.select(max($"col1")).first.getDouble(0)).show
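A self-contained sketch of the same idea on a toy DataFrame (the column names are made up; assumes import spark.implicits._). It first computes max(col1) as a plain Double, then filters the full rows against it:

import org.apache.spark.sql.functions.max

val scores = Seq(("a", 1.0), ("b", 3.5), ("c", 3.5)).toDF("id", "col1")

// max(col1) is 3.5, so the "b" and "c" rows come back in full
scores.filter($"col1" === scores.select(max($"col1")).first.getDouble(0)).show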
