Correlations in PySpark & Selecting Variables Above a Correlation Threshold
Linear correlations are an important part of the EDA step. This page shows how to compute Pearson and Spearman correlations in PySpark and make sense of them, and how to quickly isolate the columns whose linear correlation exceeds a chosen threshold.
Correlations in PySpark
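A minimal sketch of the two common approaches, using a small made-up DataFrame with numeric columns x, y, and z: df.stat.corr gives a single pairwise Pearson value, while pyspark.ml.stat.Correlation produces the full matrix for either Pearson or Spearman.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

spark = SparkSession.builder.getOrCreate()

# Toy numeric data for illustration; substitute your own DataFrame.
df = spark.createDataFrame(
    [(1.0, 2.0, 3.0), (2.0, 4.1, 5.9), (3.0, 6.2, 9.1), (4.0, 7.9, 12.3)],
    ["x", "y", "z"],
)
num_cols = ["x", "y", "z"]

# Pairwise Pearson correlation between two columns:
print(df.stat.corr("x", "y"))

# Full correlation matrix: assemble the numeric columns into one
# vector column, as pyspark.ml.stat.Correlation requires.
vec_df = VectorAssembler(inputCols=num_cols, outputCol="features").transform(df)

pearson = Correlation.corr(vec_df, "features", "pearson").head()[0]
spearman = Correlation.corr(vec_df, "features", "spearman").head()[0]
```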
Convert the correlation DataFrames to pandas DataFrames, and take absolute values of the correlations
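A sketch of that conversion, reusing pearson and num_cols from the snippet above. Taking absolute values means strong negative correlations are treated the same as strong positive ones when thresholding later.

```python
import pandas as pd

# Label the matrix with the original column names so pairs stay readable.
corr_pd = pd.DataFrame(pearson.toArray(), index=num_cols, columns=num_cols)
abs_corr_pd = corr_pd.abs()
```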
Select values above a specific threshold, i.e. a chosen level of correlation
The technique below selects the values in a correlation DataFrame that are above a certain correlation value and returns them as a list of tuples, so that we can convert the result back into a Spark DataFrame.
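One possible version of that function, reusing abs_corr_pd and spark from the snippets above; the name select_above_threshold and the 0.8 cutoff are illustrative choices, not a fixed API.

```python
def select_above_threshold(corr_pd, threshold):
    """Return (var_1, var_2, value) tuples for every off-diagonal
    correlation strictly greater than the threshold."""
    pairs = []
    for i, row_name in enumerate(corr_pd.index):
        for j, col_name in enumerate(corr_pd.columns):
            # i < j keeps the upper triangle only, skipping the
            # diagonal and mirrored duplicates.
            if i < j and corr_pd.iloc[i, j] > threshold:
                pairs.append((row_name, col_name, float(corr_pd.iloc[i, j])))
    return pairs

high_corr = select_above_threshold(abs_corr_pd, 0.8)

# Convert the list of tuples back into a Spark DataFrame. This assumes
# at least one pair passed the threshold; otherwise pass an explicit schema.
high_corr_df = spark.createDataFrame(high_corr, ["var_1", "var_2", "abs_corr"])
high_corr_df.show()
```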
This function can be used not only with correlations, but also to select values from any pandas DataFrame with respect to a specific numeric variable/column, as in the short example below.
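For instance, the same row-selection idea on an ordinary pandas DataFrame; the sales_pd data and its columns here are hypothetical.

```python
import pandas as pd

sales_pd = pd.DataFrame({"store": ["a", "b", "c"], "revenue": [120.0, 80.0, 150.0]})

# Rows whose numeric column exceeds the threshold, as plain tuples.
rows = list(sales_pd[sales_pd["revenue"] > 100].itertuples(index=False, name=None))
# [('a', 120.0), ('c', 150.0)]
```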
With a slight modification, you can also combine it with Spark's aggregated countDistinct function to pick the categorical columns that have more than a certain number of categories/levels, say to drop them later or to apply other treatment, such as rolling them up to a higher level or simply bucketing them into frequent and rare categories. Change the filter condition in the function to >= or <= as needed.
You can then select, for example, those columns with more than 100 levels, as in the sketch below.
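A sketch of that combination; cat_df, my_table, and the column names are hypothetical stand-ins for your own categorical data.

```python
from pyspark.sql import functions as F

cat_df = spark.table("my_table")                # hypothetical source table
cat_cols = ["city", "product_code", "user_id"]  # hypothetical categoricals

# One countDistinct aggregate per categorical column, collected to pandas.
distinct_counts = cat_df.agg(
    *[F.countDistinct(F.col(c)).alias(c) for c in cat_cols]
).toPandas()

# Apply the threshold: keep the columns with more than 100 levels
# (flip the comparison to >= or <= as discussed above).
high_card_cols = [c for c in cat_cols if int(distinct_counts.at[0, c]) > 100]
```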