Create Categories/Buckets Manually, and KS test
Motivation
We want to compare the distributions of two variables. We use the Kolmogorov-Smirnov (KS) test to decide whether the two distributions are the same or statistically significantly different.
The only problem: the two variables have different sample sizes, which prevents us from immediately running the KS test, or they're simply too large if you're working in a Spark dataframe. The solution to both hurdles is to discretize the two variables by putting the values of each into the same buckets (categories), chosen so that the intervals cover all possible values of both variables. Each bucket then holds the count of values falling in that interval, for each variable.
How to go about it?
There's a Bucketizer in the Spark ML module, so why categorize a column ourselves? Because the Bucketizer picks and chooses. That is, if you have values ranging from 0 to 100 and you want buckets of width, say, 10, but you don't have any values in, say, the 60's or 70's, you'll end up with buckets like 1, 2, 3, 4, 5, 8, 9, skipping the 60's and 70's altogether, which ruins the analysis. Especially when you want to test whether two columns follow the same distribution with the Kolmogorov-Smirnov test.
This is a template that works in all cases, covering edge cases you didn't think about in your data, like values that are negative or very large. I do that by making the low end $-\infty$ and the high end $+\infty$.
Creating the buckets
In this case, condcol is the column holding the binary Type A / Type B flag of the data, i.e. 0/1. varcol is the variable I need to put in buckets (categorize). And anothercol is something else you possibly want to group by as well, like dates for example, but you don't need to add it.
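Here is a minimal sketch of that bucketing step, assuming a Spark DataFrame named df with the columns described above. The bucket edges, the helper name bucketize, and the names bucketed_df and aggd_bucketed are illustrative, not the exact original snippet.

```python
import pandas as pd
import pyspark.sql.functions as f
from pyspark.sql.types import StringType

# Fixed bucket edges: -inf and +inf at the ends guarantee every value,
# no matter how negative or how large, lands in some bucket.
edges = [float("-inf")] + list(range(0, 101, 10)) + [float("inf")]

@f.pandas_udf(StringType())
def bucketize(v: pd.Series) -> pd.Series:
    # pd.cut labels each value with the interval it falls into, e.g. "(0.0, 10.0]".
    return pd.cut(v, bins=edges).astype(str)

bucketed_df = df.withColumn("bucket", bucketize(f.col("varcol")))

# Count rows per (condition, bucket); add "anothercol" to the groupBy
# if you also want to split by it.
aggd_bucketed = bucketed_df.groupBy("condcol", "bucket").count()
```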
IMPORTANT: If you try to do bucketed_df.display() and it errors out with a ValueError ("truth value ... is ambiguous"), then replace f.pandas_udf with f.udf where the UDF is defined in the snippet above (and rewrite the function to take a single value instead of a pd.Series).
This kind of "truth value is ambiguous" error shows up with Pandas sometimes. pandas_udf was supposed to be faster than udf in PySpark, since Python UDFs in PySpark are notoriously slow in comparison to Spark Scala's UDFs.
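For reference, a scalar f.udf variant of the same bucketizer might look like this. It is illustrative: it reuses the edges list from the sketch above and matches pd.cut's default right-closed intervals.

```python
@f.udf(returnType=StringType())
def bucketize(v):
    # Scalar version: called once per value instead of once per pd.Series.
    if v is None:
        return None
    for lo, hi in zip(edges[:-1], edges[1:]):
        if lo < v <= hi:  # right-closed, matching pd.cut's default
            return f"({lo}, {hi}]"
    return None
```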
☞ Generally, it's advised to do frequent checks with .display(), .show(), or .count(). The reason is that those force Spark to execute the transformations you've been feeding it up to that point, so if there are any hidden errors, those commands should reveal them.
Convert to Pandas to prepare to run KS test
Worth noting: in my use case here, I had a binary column that held a condition, basically whether the data is Type A or Type B, where those types are something specific to the data I was dealing with. The idea was to test whether the distribution of a certain variable, call it X, differs between Type A data and Type B data. I refer to that as "two variables" to keep it consistent with the KS test explanations you'd find online.
aggd_bucketed_pandas = aggd_bucketed.toPandas()
You can consider the step that computes tcond1 and tcond2 redundant, but it converts the wanted buckets with their counts into sorted Pandas Series, which are easier to compare and work with.
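A sketch of that step, assuming aggd_bucketed_pandas has the condcol, bucket, and count columns produced above, with 0 and 1 as the two condition values:

```python
# Split the bucket counts by condition into two sorted Pandas Series.
tcond1 = (aggd_bucketed_pandas[aggd_bucketed_pandas["condcol"] == 0]
          .set_index("bucket")["count"].sort_index())
tcond2 = (aggd_bucketed_pandas[aggd_bucketed_pandas["condcol"] == 1]
          .set_index("bucket")["count"].sort_index())
```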
Plotting and KS testing
KS test results will be printed on the plot
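A minimal sketch of that plot, assuming the tcond1 and tcond2 Series from the previous step; the figure size, labels, and formatting are illustrative.

```python
import matplotlib.pyplot as plt
from scipy import stats

# Two-sample KS test on the bucketed counts.
ks_stat, p_value = stats.ks_2samp(tcond1, tcond2)

fig, ax = plt.subplots(figsize=(10, 5))
tcond1.plot(ax=ax, label="Type A", drawstyle="steps-mid")
tcond2.plot(ax=ax, label="Type B", drawstyle="steps-mid")
ax.set_xlabel("bucket")
ax.set_ylabel("count")
# Print the KS results on the plot itself, as described above.
ax.set_title(f"KS statistic = {ks_stat:.4f}, p-value = {p_value:.4g}")
ax.legend()
plt.show()
```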
Interpreting Results of the Kolmogorov-Smirnov test
When doing the 2-sample KS test, if the KS statistic is small or the p-value is high, then we can't reject the hypothesis that the distributions of the two samples are the same (scipy docs).
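For example, a conventional decision rule at the 5% level (the threshold here is illustrative, not from the original):

```python
alpha = 0.05  # conventional significance level
if p_value > alpha:
    print("Cannot reject H0: the two distributions may be the same.")
else:
    print("Reject H0: the two distributions differ significantly.")
```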
Resources
Numpy set UFuncs: w3schools.com/python/numpy/numpy_ufunc_set_operations.asp