Filtering & Nulls
.filter command, how to fill nulls
See "Operations on Multiple Columns at Once" and "Operations on One Column" pages of this book.
Use the SQL-string syntax df.filter("colname IS NOT NULL"), or the column methods df.filter($"colname".isNotNull) and .isNull. Note that df.filter($"colname" != null) does not work: != here is an ordinary Scala comparison that returns a Boolean, not a Column expression, so stick with isNull / isNotNull or the SQL string.
Several other ways are shown in the answers to this question, including the when-otherwise clause, which is useful for conditional filling: df.withColumn("colname", when($"colname".isNull, 0).otherwise($"colname")).show fills nulls with 0 and leaves the other values intact.
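A minimal, self-contained sketch of these approaches, assuming a toy DataFrame with a nullable "amount" column (the data, column names, and fill value here are illustrative, not from the original):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

val spark = SparkSession.builder().appName("nulls-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Toy data: the second row has a null "amount".
val df = Seq(("a", Some(1.0)), ("b", None), ("c", Some(3.5))).toDF("id", "amount")

// Keep only non-null rows, SQL-string style and column-method style.
df.filter("amount IS NOT NULL").show()
df.filter($"amount".isNotNull).show()

// Conditional filling: replace nulls with 0.0, keep everything else as-is.
df.withColumn("amount", when($"amount".isNull, 0.0).otherwise($"amount")).show()

// The built-in na.fill does the same for the named column.
df.na.fill(0.0, Seq("amount")).show()
```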
You can use the SQL-string syntax, all in one string, or methods on columns. For example, both df.filter("colname > 0 AND colname < 10") and df.filter(($"colname" > 0) && ($"colname" < 10)) will work.
Another example: you can use Spark syntax df.filter($"colname" === "something") or SQL syntax df.filter("colname == 'something'").
☞ NOTE: I found that you can't mix the SQL-string syntax with Column expressions in one filter statement, i.e. when binding them together with a logical operator like || or &&; instead, chain another .filter clause after it.
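A short sketch of the two syntaxes side by side, and of chaining .filter calls instead of mixing them (continuing with the SparkSession and implicits from the sketch above; the DataFrame here is illustrative):

```scala
// Toy data with a numeric and a string column.
val events = Seq((5, "something"), (12, "other"), (7, "something")).toDF("colname", "label")

// Same predicate, written as one SQL string and as a Column expression.
events.filter("colname > 0 AND colname < 10").show()
events.filter(($"colname" > 0) && ($"colname" < 10)).show()

// A SQL string and a Column can't be combined with && in one call,
// so chain a second .filter instead.
events
  .filter("label == 'something'")
  .filter($"colname" < 10)
  .show()
```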
We can also map a formatted string over each column in a list and pass the result to .filter, as in the sketch below.
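The original code block isn't reproduced in this extract; here is a minimal sketch of the idea, assuming a list colnames of numeric column names and the bounds shown in the output discussed next:

```scala
// Assumes a DataFrame `df` with numeric columns col1, col2, col3.
val colnames = List("col1", "col2", "col3")

// Substitute each column name into the formatted string, then join the pieces with OR.
val filterString = colnames
  .map(name => s"($name >= 0.0 AND $name <= 24.0)")
  .mkString(" OR ")

df.filter(filterString).show()
```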
What happened here?
Mapping over the list produces filterString: String = (col1 >= 0.0 AND col1 <= 24.0) OR (col2 >= 0.0 AND col2 <= 24.0) OR (col3 >= 0.0 AND col3 <= 24.0). Each element of the list is substituted into the formatted string, and the pieces are joined into a single string at the end, which can then be passed as SQL syntax to the .filter clause.
Another way, using .map and .reduce: df.filter(colnames.map(name => (col(name) >= 0.1) && (col(name) < 0.2)).reduce(_ or _)).show
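For reference, a self-contained form of that expression, with the import it needs (reusing the colnames and df assumed in the sketch above):

```scala
import org.apache.spark.sql.functions.col

// One range predicate per column, OR-ed together into a single Column.
val rangePredicate = colnames
  .map(name => (col(name) >= 0.1) && (col(name) < 0.2))
  .reduce(_ or _)

df.filter(rangePredicate).show()
```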
I didn't use this solution because it didn't work properly for me on more than one column. It's also overkill for something that can be done easily in a .filter clause for a single column.
What you learned about logic and De Morgan's laws for "not", "or", and "and" in your programming classes applies here, but I always test it before relying on it. For example, if you want to drop both nulls and zeros, you would type df.filter("colName IS NOT NULL AND colName != 0")
Just test it before you run with it; details like that, and operator ordering, can botch your entire work.
Since Spark deals with SQL DataFrames, you will see many similarities between the two languages when working with them; you can even write a SQL expression and use it directly when filtering or selecting columns.
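For instance, expr lets you drop a raw SQL expression into a filter or select (a sketch; the expressions and column name are assumptions):

```scala
import org.apache.spark.sql.functions.{col, expr}

// Assumes a DataFrame `df` with a numeric column "colname".
df.filter(expr("colname IS NOT NULL AND colname != 0")).show()
df.select(col("colname"), expr("colname * 2 AS colname_doubled")).show()
```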
Some of the other very useful functions are like, isin, and contains, among others. Here's how to use them:
Suppose you have a list called sampleDevices; then you can keep only the rows whose DeviceId appears in it with df.filter($"DeviceId".isin(sampleDevices: _*))
Suppose you're looking for rows where a certain column contains a certain text; then you can use df.filter($"DeviceId".contains("3A"))
Not like: put the "not" up front, df.filter(!$"dateHour".like("%2019-11-01%"))
There are also startsWith and endsWith functions to match the desired starting and ending string, respectively.
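A consolidated sketch of these column predicates on a toy DataFrame (the data, list contents, and patterns are illustrative):

```scala
// Continues with the SparkSession and implicits from the first sketch.
val devices = Seq(
  ("3A-17", "2019-11-01 10"),
  ("9B-02", "2019-11-02 14"),
  ("3A-40", "2019-12-05 09")
).toDF("DeviceId", "dateHour")

// isin: keep rows whose DeviceId appears in the list (note the `: _*` expansion).
val sampleDevices = Seq("3A-17", "9B-02")
devices.filter($"DeviceId".isin(sampleDevices: _*)).show()

// contains: rows whose DeviceId contains the text "3A".
devices.filter($"DeviceId".contains("3A")).show()

// not like: negate the pattern with `!` up front.
devices.filter(!$"dateHour".like("%2019-11-01%")).show()

// startsWith / endsWith: match the beginning or end of the string.
devices.filter($"DeviceId".startsWith("3A")).show()
devices.filter($"dateHour".endsWith("09")).show()
```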
☞ For the full list of allowed operations on columns, visit the Spark official docs page for the Scala API and look up org.apache.spark.sql.Column
As I explain in the "Aggregations" page of this book, and at this link, the full one-line solution is:
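The one-liner itself isn't captured in this extract. Purely as an illustration of the kind of all-columns, aggregation-plus-null-check one-liner this usually refers to (a common community pattern, not necessarily the author's exact line):

```scala
import org.apache.spark.sql.functions.{col, count, when}

// Count the nulls in every column of `df` in a single select.
df.select(df.columns.map(c => count(when(col(c).isNull, c)).alias(c)): _*).show()
```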