Related to ML

Explanations, examples, code snippets, and other useful info related to Machine Learning in Spark

Standardization - Standard Scaler

We saw in the "Machine Learning with PySpark" page of the PySpark page group of this gitbook that StandardScaler applies the formula (x-mu)/sigma (StandardScaler - PySpark 3.1 official docs). Here, I want to show you what that does to your data, and remind you of another way to unskew right-skewed data, i.e. data with a long right tail.
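
If you want to see the scaler itself on a Spark data frame rather than on plain Python lists, here's a minimal sketch (the toy data frame and column names are made up for illustration). One thing to keep in mind: withMean defaults to False, so you have to set it explicitly to get the full (x-mu)/sigma behaviour rather than just dividing by sigma. Also note that Spark's scaler uses the sample standard deviation, so the numbers will differ slightly from the np.std version below.

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler

spark = SparkSession.builder.getOrCreate()

#toy data frame with one numeric column "x"
df = spark.createDataFrame([(3.0,), (4.0,), (5.0,), (9.0,), (17.0,)], ["x"])

#StandardScaler works on vector columns, so assemble "x" into a vector first
df_vec = VectorAssembler(inputCols=["x"], outputCol="x_vec").transform(df)

#withMean=True and withStd=True give the full (x-mu)/sigma standardization
scaler = StandardScaler(inputCol="x_vec", outputCol="x_scaled",
                        withMean=True, withStd=True)
scaler.fit(df_vec).transform(df_vec).show(truncate=False)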

Remember that standardizing a variable doesn't change the shape of its distribution, only its scale. In other words, if it's right-skewed with an x-axis range of [0, 15], standardizing doesn't convert it to a Gaussian distribution; it only shrinks the x-axis range to make it more compact. To "unskew" a right tail, you need to take the log of the variable's values. The more data you have, the better you will see the effect of taking the log. Let's see that in action:

import numpy as np
import matplotlib.pyplot as plt

#create sample data
lst=[3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,
4,4,4,4,5,5,5,5,5,5,5,5,6,6,6,6,6,7,7,7,7,8,8,8,8,9,9,10,11,12,13,14,15,16,17] 
   
#compute the standardized sample data: (x-mu)/sigma
mu, sigma = np.mean(lst), np.std(lst)
lst2=[(i-mu)/sigma for i in lst]

#plot them
_,ax=plt.subplots(nrows=1,ncols=2)
ax[0].hist(lst)
ax[0].set_title("data as-is")
ax[1].hist(lst2)
ax[1].set_title("standarized data")
plt.show()

#create the logarithm of the original data
lst3=[np.log(i) for i in lst]

#plot them
_,ax=plt.subplots(nrows=1,ncols=2)
ax[0].hist(lst)
ax[0].set_title("data as-is")
ax[1].hist(lst3)
ax[1].set_title("log data")
plt.show()
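
One caveat worth remembering: np.log is only defined for positive values, so if your variable can be zero, np.log1p, i.e. log(1+x), is a common drop-in alternative. A hypothetical variant of the snippet above:

#log1p handles zeros safely: log(1+x) instead of log(x)
lst3_alt=[np.log1p(i) for i in lst]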

Creating Categories, and Their Labels, Manually

As I mentioned in the "Machine Learning with PySpark" page, Bucketizer doesn't fill all the categories of your splits if no value of the variable (column) falls into a particular split interval. Sometimes you need all of them, say, when you're comparing two variables and want to plot their histograms, or when you want to know exactly which value goes into which category and/or give it a custom label.
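
To make that concrete, here's a tiny sketch of the behaviour (the toy data frame and splits are made up for illustration): nothing falls into the middle interval, so its bucket simply never shows up in the output.

from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer

spark = SparkSession.builder.getOrCreate()

#toy data: no value falls in the [10, 20) interval
df = spark.createDataFrame([(1.0,), (2.0,), (25.0,)], ["amount"])

bucketizer = Bucketizer(splits=[0.0, 10.0, 20.0, 30.0],
                        inputCol="amount", outputCol="bucket")

#only buckets 0.0 and 2.0 appear; bucket 1.0 is missing because it's empty
bucketizer.transform(df).groupBy("bucket").count().show()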

I'm going to create a UDF for that. Now, I know there's a nice Pandas function, pandas.cut, that does this, but we don't have a Pandas data frame, and we should steer clear of converting a Spark SQL data frame into Pandas. As I mentioned earlier in the Spark Scala page group of this guidebook, a Spark data frame is distributed, while a Pandas data frame is a single piece that must fit entirely in the memory of your cluster's driver (master) node. So if you have a biggish data frame, Spark might not be able to convert it and fit it in memory at all; and even if it did, that would eat up the memory needed for other operations, and your notebook would soon fail. It's best practice to do everything you can with Spark's native methods, and to use UDFs only when no native function suffices.
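
As a teaser before that subpage, here's a minimal sketch of the idea, with made-up intervals, labels, and column names: because you write the mapping yourself, every category (and its label) exists whether or not any value lands in it.

from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

#toy data frame; intervals and labels are made up for illustration
df = spark.createDataFrame([(1.0,), (2.0,), (25.0,)], ["amount"])

def label_amount(x):
    #you decide the intervals and their labels, empty or not
    if x is None:
        return "missing"
    elif x < 10:
        return "low (0-10)"
    elif x < 20:
        return "medium (10-20)"
    else:
        return "high (20+)"

label_udf = f.udf(label_amount, StringType())
df.withColumn("amount_label", label_udf("amount")).show()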

Since it's a long explanation with code snippets, see the "Create Categories/Buckets and KS Test" subpage.

How to include zip codes in an ML model

One way is to use a Python package to extract the longitude and latitude of the zip code's center point and use those in the model instead. A few libraries that deal with zip codes: uszipcode, geopy, googlemaps (googlemaps needs an API key and has limits on requests). You can then apply the Python function to the zip code column of the dataframe as an f.pandas_udf or as a classic udf (not recommended in PySpark as it's slow). Google search phrases for more information: "how to handle zip codes for ml models with pyspark" or "including zip codes in ml models python". Another way is to dummy-encode them, or you can group them into geographical regions by proximity (source).
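
As an illustration of the latitude/longitude approach, here's a minimal sketch assuming the uszipcode package is installed and using its SearchEngine/by_zipcode interface; the column names and sample zip codes are made up, and you'd want proper handling for zip codes the package doesn't know.

import pandas as pd
import pyspark.sql.functions as f
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, DoubleType
from uszipcode import SearchEngine

spark = SparkSession.builder.getOrCreate()

#toy data frame with a zip code column
df = spark.createDataFrame([("10001",), ("94105",)], ["zip_code"])

@f.pandas_udf(ArrayType(DoubleType()))
def zip_to_lat_lng(zips: pd.Series) -> pd.Series:
    #one SearchEngine per batch; returns [lat, lng] or None per zip code
    search = SearchEngine()
    def lookup(z):
        res = search.by_zipcode(z)
        return [float(res.lat), float(res.lng)] if res and res.lat else None
    return zips.apply(lookup)

(df.withColumn("lat_lng", zip_to_lat_lng("zip_code"))
   .withColumn("lat", f.col("lat_lng")[0])
   .withColumn("lng", f.col("lat_lng")[1])
   .show())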
