Saving and Loading a Model With and Without MLflow
MLflow is an independent library that is very useful for saving a trained model along with everything related to it, and Databricks integrates it well. It can be used with distributed and non-distributed libraries alike: Spark ML, scikit-learn, TensorFlow, Keras, H2O, etc. You can even train a model with a Python library like scikit-learn, then load the saved model and use it for prediction on a Spark DataFrame, instead of having to convert the DataFrame to pandas, using mlflow.pyfunc.spark_udf
This is like applying the loaded Python model as a UDF to the Spark DataFrame. Databricks has great tutorial articles and example notebooks to walk you through much of what you will need in your work. You can run a notebook, in Databricks, from within another notebook with the command %run "other_notebook_path"
The cell containing this command can't have anything else in it, not even a comment. You can, however, add a title to the cell in Databricks.
Here's an example of how I saved a Spark ML model:
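A minimal sketch of logging a fitted Spark ML model with MLflow; the model variable and the artifact name are illustrative, not the original notebook's code:

```python
import mlflow
import mlflow.spark

# "gbt-model" is the artifact_path; you will reuse this exact
# name later when loading the model from the run.
with mlflow.start_run():
    mlflow.spark.log_model(fitted_model, artifact_path="gbt-model")
```

On Databricks, the run (with its MLmodel file) then shows up under the notebook's experiment in the Machine Learning UI.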
Loading a model saved WITH MLFlow
In another notebook, install the exact versions of every library you used in your model-training notebook. Then apply the exact same cleaning to your new data as you did to your training data. Find the MLflow experiment, because you are going to need the run ID to load the model. In Databricks, you do that by going to the Machine Learning platform (the second button from the top in the left menu), then searching for your username and the name of the notebook where you trained your model and saved it with MLflow. Click on the notebook name, then click on the run you want (in case more than one was saved in that notebook), then click on the MLmodel file in the list below. You will see details about the saved model, including run_id. Grab that, along with the model name you used when you saved the model, which is called artifact_path in the MLmodel file you're looking at. Run the following one-line code in your other notebook:
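A sketch of that one-liner, assuming you put the run_id and artifact_path from the MLmodel file into the variables below:

```python
import mlflow.spark

# run_id and artifact_path come from the MLmodel file in the UI
model = mlflow.spark.load_model(f"runs:/{run_id}/{artifact_path}")
```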
or equivalently,
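The original alternative snippet isn't shown; one common equivalent (my assumption) is loading the same run through MLflow's generic pyfunc flavor, which works regardless of which library trained the model:

```python
import mlflow.pyfunc

# Same run_id / artifact_path as above, loaded as a generic
# python_function model instead of a Spark-flavored one.
model = mlflow.pyfunc.load_model(f"runs:/{run_id}/{artifact_path}")
```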
Then predict on your now cleaned, prepped data, exactly the way we did on the training data before:
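For a Spark-flavored model, prediction is the usual transform call; new_data_df is a placeholder for your cleaned DataFrame:

```python
# Apply the loaded Spark ML model to the prepped DataFrame
predictions = model.transform(new_data_df)
predictions.select("prediction").show(5)
```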
Notes
Logging model with MLFlow
The model-logging command has to go through the corresponding MLflow module, the one that matches the library you used for training. The name you use to save your model is the one you will use later when loading it.
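For example (the model variables and artifact names here are illustrative; each library gets its matching MLflow module):

```python
import mlflow

# scikit-learn model -> mlflow.sklearn
mlflow.sklearn.log_model(sk_model, artifact_path="sk-model")

# Spark ML model -> mlflow.spark
mlflow.spark.log_model(spark_model, artifact_path="spark-model")

# Keras model -> mlflow.keras
mlflow.keras.log_model(keras_model, artifact_path="keras-model")
```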
About the Databricks installation command
dbutils.library.install
I trust you can search for it online and find all you need. Here are some quick practical observations: leaving out the version parameter automatically installs the latest version available on your Databricks runtime. The repo parameter is sometimes used by your company for authentication purposes; for example, they may ask you to run a notebook they've prepared beforehand, holding all the secret keys, which spits out a repo string for you to use in any library-installation command in your Databricks notebooks.
Saving a trained model WITHOUT MLFlow
VERY IMPORTANT: make sure to print out the version of every library you used to train your model, whether that is scikit-learn, Keras, TensorFlow, etc. This is crucial when saving a model with Python's or Spark's native capabilities, because you will need to install those exact versions in the notebook where you import the saved model to use it for prediction. The serialized (pickled/saved) model records the library version used in its top lines, and it will require the exact same version when you need to use it again.
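One way to record those versions using only the standard library (the package names listed are just examples; list whichever libraries your model actually uses):

```python
from importlib.metadata import version, PackageNotFoundError

# Record these versions; install the exact same ones in the
# notebook that later loads the saved model.
for pkg in ["scikit-learn", "tensorflow", "keras"]:
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed here")
```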
With Spark
Following the example above:
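A sketch of Spark's native model persistence, assuming a fitted GBT model named gbt_model and an illustrative DBFS path:

```python
# Save the fitted model to DBFS; overwrite() replaces any
# previous save at that path.
gbt_model.write().overwrite().save("dbfs:/FileStore/models/gbt_model")

# Load it back with the matching fitted-Model class
from pyspark.ml.regression import GBTRegressionModel
loaded = GBTRegressionModel.load("dbfs:/FileStore/models/gbt_model")
```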
If the above command didn't work, try GBTRegressor.load("DBFS_path")
DBFS is the Databricks "local" storage where you can save artifacts that aren't Parquet files; read more about DBFS. Of course, you will need the os library to make a new folder if it doesn't exist.
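A minimal sketch, with an illustrative path; through the local file API, DBFS is exposed under the /dbfs/ prefix:

```python
import os

# DBFS appears to ordinary Python file I/O under /dbfs/
model_dir = "/dbfs/FileStore/models"  # illustrative path
os.makedirs(model_dir, exist_ok=True)  # no error if it already exists
```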
If you want to stick to dbutils alone, you can achieve the same thing:
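For example (path illustrative; dbutils is available in Databricks notebooks without an import):

```python
# Create the folder on DBFS
dbutils.fs.mkdirs("dbfs:/FileStore/models")

# Verify it exists
display(dbutils.fs.ls("dbfs:/FileStore/"))
```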
Important Notes
You need to test each time: sometimes you can load with the class name itself, like pca.load(pca_path); other times you need the Model object to load with, for example:
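A sketch of the two variants for PCA; pca_path is a placeholder for wherever you saved the model:

```python
from pyspark.ml.feature import PCA, PCAModel

# Sometimes the estimator class works...
pca_model = PCA.load(pca_path)

# ...and sometimes you need the fitted Model class instead:
pca_model = PCAModel.load(pca_path)
```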
Test the saved path(s)! I also noticed this issue with Databricks paths at the time of writing (May 2022): you need to specify the DBFS path a little differently between writing to it and loading from it, just as I have here. Test writing and loading in the same notebook you trained your model in, before you shut down the cluster, to make sure your model has actually been written where you expected.
With Python
This can be done with the pickle or joblib libraries, with similar syntax.
However, you must note the versions of the libraries you're using, because you will need the exact same versions when you load the model later with this Pythonic method.
Here's how to do it in the Databricks workspace.

Saving
Loading
Resources
- Save and Load Machine Learning Models in Python with scikit-learn
- How to Save and Load Your Keras Deep Learning Model
- For text files and JSON files, see this good blog post: Object Serialization with Pickle and JSON in Python
- Another great resource on saving fitted models with joblib, pickle, and JSON: How to Save and Load ML Models in Python
Important Empirical Note about PCA Behavior in PySpark 3.0.1
I don't know if this is because of the difference in scale of the input variables, which in my case were about 10,000 columns; or if it's a PySpark 3.0 thing, or a theory thing (I haven't checked).
But it seems like applying PCA gives only 1–3 principal components that explain more than 0.0 of the variance, while the rest stand at 0.
Standardizing with StandardScaler didn't resolve this issue, but it seems like scaling with MinMaxScaler did. ☜
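A sketch of that fix in PySpark; column names, k, and the DataFrame df are illustrative:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import MinMaxScaler, PCA

# Scale features to [0, 1] with MinMaxScaler before PCA,
# instead of standardizing with StandardScaler.
scaler = MinMaxScaler(inputCol="features", outputCol="scaled_features")
pca = PCA(k=10, inputCol="scaled_features", outputCol="pca_features")

model = Pipeline(stages=[scaler, pca]).fit(df)

# Check how much variance each component now explains
print(model.stages[-1].explainedVariance)
```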