Filtering & Nulls

.filter command, how to fill nulls


Filling Nulls

See "Operations on Multiple Columns at Once" and "Operations on One Column" pages of this book.

Filtering on Nulls or Non-Nulls

Use the SQL-string syntax df.filter("colname IS NOT NULL"), or the Column methods df.filter($"colname".isNotNull) and df.filter($"colname".isNull). Note that comparing against null directly, e.g. df.filter($"colname" != null), does not work; null checks need IS NOT NULL / IS NULL or the isNull/isNotNull methods.

Several other ways are shown in the answers of this post, including the when-otherwise clause, which is useful for conditional filling: df.withColumn("colname", when($"colname".isNull, 0).otherwise(1)).show
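
To see these side by side, here is a small self-contained sketch you can paste into a spark-shell (the column names and sample values are made up for illustration):

import org.apache.spark.sql.functions._
import spark.implicits._

val sample = Seq(("a", Some(1.0)), ("b", None), ("c", Some(3.0))).toDF("id", "colname")

sample.filter("colname IS NOT NULL").show                                   // SQL-string form
sample.filter($"colname".isNotNull).show                                    // Column-method form
sample.withColumn("colname", when($"colname".isNull, 0).otherwise(1)).show  // conditional fill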

Filtering Rules and Operators

You can use the SQL-string syntax, all in one string, or methods on columns. For example, both df.filter("colname > 0 AND colname < 10") and df.filter(($"colname" > 0) && ($"colname" < 10)) will work. Another example: you can use the Spark syntax df.filter($"colname" === "something") or the SQL syntax df.filter("colname == 'something'").

☞ NOTE: I found that you can't mix the SQL-string syntax with the Column syntax in one filter statement, i.e. you can't bind them together with a logical operator like || or &&. Instead, chain another .filter clause after it.
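
As a minimal sketch of that chaining workaround (the column name is hypothetical); note that chained .filter calls combine with AND, so an OR across the two conditions has to be written entirely in one syntax:

// SQL-string condition first, then a Column condition chained with a second .filter
df.filter("colname IS NOT NULL").filter($"colname" > 0).show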

Filtering Multiple Columns at Once, For Same Conditions

We can map each column in the list to a formatted condition string, then pass the combined string to .filter:

//columns we need to filter on same conditions
val colnames = List[String]("col1","col2","col3")

//create the SQL string
val filterString = colnames.map(c => s"($c >=0.0 AND  $c <= 24.0)").mkString(" OR ")

//apply it
df.filter(filterString).show

What happened here?

The first two lines produce filterString: String = (col1 >=0.0 AND col1 <= 24.0) OR (col2 >=0.0 AND col2 <= 24.0) OR (col3 >=0.0 AND col3 <= 24.0). Each element of the list is mapped into the formatted condition, and mkString joins the pieces with " OR " into a single string. You can then pass that string as SQL syntax to the .filter clause.

Another way, using .map and .reduce

df.filter(colnames.map(name => (col(name) >= 0.1) && (col(name) < 0.2)).reduce(_ or _)).show

I didn't use this solution because it didn't work properly on more than one column. Also, it's overkill for something that can be done easily in a .filter clause for one column.

💡 A Word of Advice

What you learned about logic and De Morgan's laws for "not", "or", and "and" in your programming classes applies here. But I always, always test it before I go with it. For example, if you want to exclude both nulls and zeros, you would type df.filter("colName IS NOT NULL AND colName != 0")

Just test it out before you run with it. Things like that, and the ordering and precedence of conditions, can botch your entire work.
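
As an illustration of that kind of sanity check, here is a throwaway sketch (the column name and values are made up; assumes a spark-shell session so spark.implicits._ is available). By De Morgan's laws all three counts should agree:

import spark.implicits._

val test = Seq(("a", Some(0.0)), ("b", Some(5.0)), ("c", None)).toDF("id", "colName")

val a = test.filter(!($"colName".isNull || $"colName" === 0)).count   // negated "null or zero"
val b = test.filter($"colName".isNotNull && $"colName" =!= 0).count   // De Morgan equivalent
val c = test.filter("colName IS NOT NULL AND colName != 0").count     // SQL-string version
// a, b, and c should all be 1 here: only the 5.0 row survives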

Some Other Useful Filtering Functions

Since Spark deals with SQL DataFrames, you will see many similarities between the two languages when working with them; you can even insert a SQL expression and use it directly when filtering or selecting columns. Some other very useful functions include isin and contains, among others. Here's how to use them:

  • Suppose you have a list called sampleDevices; you can then keep only the rows whose DeviceId appears in it with df.filter($"DeviceId".isin(sampleDevices: _*))

  • Suppose you're looking for rows where a certain column contains certain text; then you can use df.filter($"DeviceId".contains("3A"))

  • Not Like: put the "not" up front, df.filter(!$"dateHour".like("%2019-11-01%"))

There are also startsWith and endsWith functions to match a desired starting or ending string, respectively.
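
Putting those together, here is a minimal sketch (the DataFrame df, its columns DeviceId and dateHour, and the sample values are hypothetical; assumes spark.implicits._ is in scope):

import spark.implicits._

val sampleDevices = List("3A-01", "7F-22")

df.filter($"DeviceId".isin(sampleDevices: _*)).show   // value appears in the list
df.filter($"DeviceId".contains("3A")).show            // substring match
df.filter(!$"dateHour".like("%2019-11-01%")).show     // NOT LIKE
df.filter($"DeviceId".startsWith("3A")).show          // prefix match
df.filter($"DeviceId".endsWith("-22")).show           // suffix match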

☞ For the full list of allowed operations on columns, visit the Spark official docs page for the Scala API and look up org.apache.spark.sql.Column

Filter on an Aggregate, and Return the Whole Row from the Dataframe

As I explain in the "Aggregations" page of this book, and in this answer, the full one-line solution is:

df.filter($"col1" === df.select(max($"col1")).first.getDouble(0)).show
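
If you prefer, the same result can be split into two explicit steps, which some find easier to read; a minimal sketch, assuming col1 is a Double column:

import org.apache.spark.sql.functions._

// step 1: bring the aggregate back to the driver as a plain Scala value
val maxVal = df.select(max($"col1")).first.getDouble(0)

// step 2: keep the full row(s) that attain that maximum
df.filter($"col1" === maxVal).show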