Saving and Loading a Model with and without MLFlow

MLFlow is an independent library that is very useful for saving a trained model along with everything related to it, and DataBricks integrates it well. It can be used with both distributed and non-distributed libraries like Spark ML, scikit-learn, tensorflow, keras, H2O, etc. You can even train a model in a Python library like scikit-learn, then load the saved model and apply it to a Spark dataframe instead of having to convert the dataframe to Pandas, using mlflow.pyfunc.spark_udf; this is like applying the loaded Python model as a UDF to the Spark dataframe (a sketch follows below). DataBricks has great tutorial articles and example notebooks to walk you through many of the things you'll need in your work. You can run a notebook, in DataBricks, from within another notebook with the command %run "other_notebook_path"; the cell holding this command can't have anything else in it, not even a comment, but you can still add a title to the cell in DataBricks.
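
Here is roughly what that looks like with mlflow.pyfunc.spark_udf (a minimal sketch; the run id, model name, feature column names, and spark_df below are placeholders, not taken from the example in this post):

import mlflow.pyfunc
from pyspark.sql.functions import struct

# wrap the logged Python model as a Spark UDF, then score the Spark dataframe directly
predict_udf= mlflow.pyfunc.spark_udf(spark, "runs:/<run_id>/my_sklearn_model")
scored_df= spark_df.withColumn("prediction", predict_udf(struct("feature_1", "feature_2")))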

Here's an example of how I saved a Spark ML model:

# in DataBricks, you need to install MLFlow for the notebook session, from within the notebook itself; the latest version is always recommended. 
dbutils.library.installPyPI("mlflow")
dbutils.library.restartPython()

# load the libraries and modules here as you usually do, including importing mlflow itself.

# load, clean, prepare, split your data

# take a note of the versions of all the libraries you used! this is very important. 
# here, I used native Spark ML, so nothing else besides MLFlow needs to be noted
print(mlflow.__version__)

# logging the model, metrics, and parameters with MLFlow
with mlflow.start_run() as run: 
	# define the constant parameters to feed to model, for ease of modification or debugging
	train_set= reg_train # the name of my split train set
	test_set= reg_test 
	features_column= "features"
	label_column= TARGET_VARIABLE
	
	maxDepth=5
	maxIter=80
	subsamplingRate=0.5
	maxBins=64
	seed= RANDOM_SEED_VALUE
	
	# instantiate and fit model
	gbtr= GBTRegressor(labelCol= label_column, 
	                   predictionCol="prediction_gbtr", 
	                   maxDepth= maxDepth, 
	                   maxIter= maxIter,
	                   subsamplingRate= subsamplingRate, 
	                   maxBins= maxBins, 
	                   seed= seed)
	gbtr.setFeaturesCol(features_column)
	gbtr_model= gbtr.fit(train_set)
	
	#create metrics
	re= RegressionEvaluator(predictionCol='prediction_gbtr', 
	labelCol=label_column, metricName='rmse') #'rmse', 'mse', 'r2', 'mae', 'var'
	test_preds= gbtr_model.transform(test_set)
	rmse=re.evaluate(test_preds)
	#you can create other metrics too, not just one
	
	# log metrics 
	mlflow.log_metrics({"RMSE":rmse})
	# log parameters
	mlflow.log_params({'featuresCol':features_column, "predictionCol":"prediction_gbtr", "labelCol":label_column, 
	"maxDepth":maxDepth, "maxIter":maxIter, "subsamplingRate":subsamplingRate, "maxBins":maxBins, "seed":seed})
	# log model
	mlflow.spark.log_model(gbtr_model, "GBTR_OHE_feb232022")

Loading a model saved WITH MLFlow

In another notebook, install the exact versions of every library you used in your model-training notebook. Then do the exact same cleaning on your new data as you did on your training data. Next, find the MLFlow experiment, because you are going to need the run id to load the model. In DataBricks, you do that by going to the Machine Learning platform (the second button from the top in the left menu), then searching for your username and the name of the notebook where you trained your model and saved it with MLFlow. Click on the notebook name, then click on the run you want (in case there was more than one saved in that notebook), then click on the MLmodel file in the list below. You will see details about the saved model, including the run_id. Grab that, along with the model name you used when you saved the model, which is called artifact_path in the MLmodel file you're looking at. Then run the following one-line code in your other notebook:

trained_model= "runs:/xxx...xxx/GBTR_OHE_feb232022"

or equivalently,

run_id= 'xxx....xxx' # about 32 characters of numbers and letters
rel_model_path= "GBTR_OHE_feb232022"

trained_model= mlflow.spark.load_model("runs:/" + run_id + "/" + rel_model_path)
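
By the way, if you'd rather look up the run id in code instead of clicking through the UI, mlflow.search_runs can list recent runs (a minimal sketch; run it in the training notebook, or pass the training notebook's experiment id via the experiment_ids parameter; the column selection is just what I'd typically look at):

import mlflow

# returns a pandas DataFrame of runs; grab the run_id of the run you want
runs= mlflow.search_runs(order_by=["start_time DESC"])
print(runs[["run_id", "artifact_uri", "start_time"]].head())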

Then predict on your now-cleaned, prepped data, exactly the way we did with the training data before:

preds_df= trained_model.transform(clean_df)

Notes

Logging a model with MLFlow

The model-logging command has to go through the MLFlow module that matches the library you used for training (mlflow.spark in the example above). The name you use to save your model is the same name you'll use later when loading it.
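
For other libraries it's the same pattern through the matching flavor module. A rough sketch (the model objects and artifact names here are placeholders, not from the example above):

import mlflow
import mlflow.sklearn

with mlflow.start_run():
    # a scikit-learn model goes through the sklearn flavor
    mlflow.sklearn.log_model(fitted_sklearn_model, "my_sklearn_model")
    # a Spark ML model goes through mlflow.spark.log_model(...),
    # a keras model through mlflow.keras.log_model(...), and so on

Loading later uses the same flavor, e.g. mlflow.sklearn.load_model("runs:/<run_id>/my_sklearn_model").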

About the DataBricks installation command dbutils.library.installPyPI

I trust you can search online and find all you need about it. Here are some quick practical observations: leaving out the "version" parameter automatically installs the latest version available on your DataBricks runtime. The "repo" parameter is sometimes used by your company for authentication purposes; for example, they may ask you to run a notebook they've prepared beforehand, holding all the secret keys, which spits out a repo string for you to use in any library installation command in your DataBricks notebooks.
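
Put together, a pinned install might look something like this (a sketch only; the version number and repo URL are placeholders to replace with your own):

# pin a specific version, and point at a private index if your company uses one
dbutils.library.installPyPI("mlflow", version="1.24.0", repo="https://<your-private-index>/simple")
dbutils.library.restartPython()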

Saving a trained model WITHOUT MLFlow

VERY IMPORTANT: make sure to print out the version of any library you used to train your model, whether that is scikit-learn, keras, tensorflow, etc. This is crucial when saving a model with Python's or Spark's native capabilities, because you will need to install those exact versions in the notebook where you load the saved model to use it for prediction. The serialized (pickled/saved) model records the library version it was created with in its top lines, and it will require the exact same version when you need to use it again.

With Spark

Following the above example,

# in the training notebook
gbtr_model.save("DBFS_path")
# in the loading notebook
from pyspark.ml.regression import GBTRegressionModel
gbtr_model= GBTRegressionModel.load("DBFS_path")

If the above command didn't work, try GBTRegressor.load("DBFS_path"). DBFS is the DataBricks "local" storage where you can save artifacts that aren't parquet files; read more about DBFS in the DataBricks documentation. Of course, you will need the os library to make a new folder if it doesn't exist:

import os
os.makedirs("/dbfs/...", exist_ok=True)

If you want to stick to dbutils alone, you can achieve the same thing,

dbutils.fs.ls("/dbfs/FileStore/shared_uploads/myfolder/models/gbtr_model_0")
# to create a folder in DataBricks FileStore, 
dbutils.fs.mkdirs("./FileStore/shared_uploads/myfolder/models/")

from pyspark.ml.feature import PCA, PCAModel

pca= PCA(k=20, inputCol='features', outputCol='pca_features')
pca_model= pca.fit(df)   # keep the fitted model so it can be saved
df2= pca_model.transform(df)

# saving the fitted pca model
pca_path= "/dbfs/FileStore/shared_uploads/myfolder/models/pca_model_v0"
pca_model.save(pca_path)

# loading a previously saved pca model
pca_model= PCAModel.load(pca_path)

Important Notes

  • You need to test it out each time: sometimes you can load with the estimator class name itself, like PCA.load(pca_path), and other times you need the Model object to load with, for example,

gbtr_model.save("./FileStore/shared_uploads/myfolder/models/gbtr_with_pca_v0")

from pyspark.ml.regression import GBTRegressionModel
gbtr_model= GBTRegressionModel.load("/dbfs/FileStore/shared_uploads/myfolder/models/gbtr_with_pca_v0")
  • Test the saved path(s)! I also noticed this issue with DataBricks paths, at the time of writing this (May 2022): you need to specify the DBFS path a little differently between writing to and loading from, just like I have here. Test out writing and loading in the same notebook you have your trained model in, before you shut down the cluster, to make sure your model has been successfully written where you expected it to be (see the sketch right after this list).
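
Here's roughly what that check could look like, following the GBT example from earlier (a sketch; the paths are placeholders in my folder convention, and gbtr_model and test_set come from the training example above):

# in the training notebook, before shutting down the cluster
from pyspark.ml.regression import GBTRegressionModel

gbtr_model.save("./FileStore/shared_uploads/myfolder/models/gbtr_check_v0")
reloaded_model= GBTRegressionModel.load("/dbfs/FileStore/shared_uploads/myfolder/models/gbtr_check_v0")
# if this prints a prediction, the model was written and read back successfully
print(reloaded_model.transform(test_set).select("prediction_gbtr").head())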

With Python

This can be done with the pickle or joblib libraries; the syntax is similar.

However, you must note the library versions you're using, because you will need the exact same versions when you load the model later with this Pythonic method.
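
The examples below use pickle; for reference, the joblib equivalent would look something like this (a sketch; the fitted model object and path are placeholders):

import joblib

# save the fitted model
joblib.dump(fitted_model_name, "/path/your_model_name.joblib")
# load it back, in a notebook with the same library versions installed
fitted_model_name= joblib.load("/path/your_model_name.joblib")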

Here's how to do it in the DataBricks workspace.

Saving

import sklearn
print(sklearn.__version__)
# repeat process to get all related modeling libraries

# saving a model can also be done with the joblib library. 
import pickle
pickle.dump(fitted_model_name, 
	open("/path/your_model_name.sav", "wb")) # "wb" means write binary

Loading

# in another notebook, or when you want to load and use the saved model
# whatever library versions you had in the training notebook, you should have here also!! 
# for example,
dbutils.library.installPyPI("xgboost", version="x.xx")
dbutils.library.installPyPI("scikit-learn", version="x.xx.x")
dbutils.library.restartPython()

# do the rest of your imports as usual
import pickle
model_name= pickle.load(
	open("/path/your_model_name.sav", "rb")) # "rb" means read binary

Resources

  • Save and Load Machine Learning Models in Python with scikit-learn
  • How to Save and Load Your Keras Deep Learning Model
  • For text files and JSON files, see this good blog post: Object Serialization with Pickle and JSON in Python
  • Another great resource on saving fitted models with the Joblib, Pickle, and JSON libraries: How to Save and Load ML Models in Python

Important Empirical Note about PCA Behavior in PySpark 3.0.1

I don't know if this is because of the difference in scale of the input variables, which in my case were about 10,000 columns; or if it's a PySpark 3.0 thing, or a theory thing (I haven't checked).

But it seems like applying PCA gives only 1-3 principal components that explain more than 0.0 of the variance, while the rest sit at 0.

Standardizing with StandardScaler didn't resolve this issue, but it seems like scaling with MinMaxScaler did. ☜
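
For reference, here's a sketch of what I mean, scaling with MinMaxScaler before fitting PCA and then inspecting the explained variance (the column names and k are placeholders):

from pyspark.ml.feature import MinMaxScaler, PCA

# rescale the assembled feature vector to [0, 1] before PCA
scaler= MinMaxScaler(inputCol="features", outputCol="scaled_features")
scaled_df= scaler.fit(df).transform(df)

pca_model= PCA(k=20, inputCol="scaled_features", outputCol="pca_features").fit(scaled_df)
print(pca_model.explainedVariance)  # variance ratio per principal component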
