Correlations in PySpark & Selecting Variables Based on That Correlation Threshold
Correlations in PySpark
import numpy as np
import pandas as pd
from pyspark.ml.stat import Correlation
# Pearson correlation matrix of the vector column 'features'
pearson = Correlation.corr(df, 'features', 'pearson').collect()[0][0]
print(str(pearson).replace("nan", "NaN"))
np.savetxt("path/pearsonCorrMatrix.csv", pearson.toArray(), delimiter=',')
spearman = Correlation.corr(df, 'features', method='spearman').collect()[0][0]
np.savetxt("path/spearmanCorrMatrix.csv", spearman.toArray(), delimiter=',')
# Loading the already saved correlation matrices
spearmanCorrMatrix = np.loadtxt("path/spearmanCorrMatrix.csv", delimiter=',')
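pearsonCorrMatrix = np.loadtxt("path/pearsonCorrMatrix.csv", delimiter=',')
# Label rows and columns with the variable names (df.columns must match the
# order of the variables in 'features') and take absolute values
pearson_df = pd.DataFrame(pearsonCorrMatrix,
                          columns=df.columns,
                          index=df.columns).abs()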
spearman_df = pd.DataFrame(spearmanCorrMatrix,
                           columns=df.columns,
                           index=df.columns).abs()
Select values with a specific threshold, i.e. with a chosen level of correlation
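The selection step itself is not shown here, so what follows is only a minimal sketch of one common way to do it, not necessarily the approach originally intended. It assumes a square pandas correlation DataFrame shaped like pearson_df or spearman_df. The helper names pairs_above_threshold and columns_to_drop, the toy matrix values, and the 0.8 threshold are all hypothetical and chosen purely for illustration.

```python
import numpy as np
import pandas as pd


def pairs_above_threshold(corr_df, threshold):
    """List each variable pair whose absolute correlation is >= threshold."""
    abs_corr = corr_df.abs()
    # Keep only the strict upper triangle so every pair appears once
    # and the diagonal (self-correlation) is ignored.
    mask = np.triu(np.ones(abs_corr.shape, dtype=bool), k=1)
    pairs = abs_corr.where(mask).stack().reset_index()
    pairs.columns = ["var_1", "var_2", "abs_corr"]
    # NaN entries (masked cells or undefined correlations) never pass the filter.
    selected = pairs[pairs["abs_corr"] >= threshold]
    return selected.sort_values("abs_corr", ascending=False).reset_index(drop=True)


def columns_to_drop(corr_df, threshold):
    """For each highly correlated pair, mark the later column for removal."""
    abs_corr = corr_df.abs()
    mask = np.triu(np.ones(abs_corr.shape, dtype=bool), k=1)
    upper = abs_corr.where(mask)
    return [col for col in upper.columns if (upper[col] >= threshold).any()]


# Toy correlation matrix with made-up values (stand-in for pearson_df).
names = ["a", "b", "c", "d"]
toy = pd.DataFrame(
    [[1.00, 0.95, 0.10, -0.85],
     [0.95, 1.00, 0.20, -0.75],
     [0.10, 0.20, 1.00, 0.05],
     [-0.85, -0.75, 0.05, 1.00]],
    index=names, columns=names)

high_pairs = pairs_above_threshold(toy, 0.8)  # pairs (a, b) and (a, d)
to_drop = columns_to_drop(toy, 0.8)           # ['b', 'd']
print(high_pairs)
print(to_drop)
```

With the real data, the same calls could be made on pearson_df or spearman_df. Dropping the later column of each pair is an arbitrary tie-break; which variable of a correlated pair to keep is usually a modelling decision.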