DataBricks Useful Commands

Visualization on DataBricks

Databricks actually provides a Tableau-like visualization solution. The display() function gives a friendly UI for generating whatever aggregate plots you like; the plot button shows up under the cell that ran the display() command. By default it builds the graph from the first 1000 rows. Once you're satisfied with your graph, you may want to click the "aggregate over whole data" link that shows up in the Plot Options UI.
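
For example, with a Spark DataFrame already in scope (df here is just a placeholder name), the plot button appears under the rendered output:

# df is assumed to be an existing Spark DataFrame in your notebook
display(df)
# the Plot Options button shows up under the table that gets rendered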

Or you can build a histogram yourself, either with .histogram on the underlying RDD or with .hist after converting a column to pandas --> see python - Pyspark: show histogram of a data frame column - Stack Overflow
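
A hedged sketch of those two patterns, assuming df exists and "value" is a hypothetical numeric column name:

# option 1: compute the histogram distributed, on the underlying RDD
bins, counts = df.select("value").rdd.flatMap(lambda row: row).histogram(20)

# option 2: pull the single column down to pandas and use pandas' .hist()
df.select("value").toPandas()["value"].hist(bins=20)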

See Introduction to DataFrames - Scala | Databricks on AWS for operations on DataFrames in Scala, from DataBricks, and the Plot Data from Apache Spark | Python/v3 | Plotly guide. You can also use Matplotlib on a Python object, like a pandas DataFrame or a NumPy array.

NOTE: If your matplotlib figure doesn't show, even with plt.show(), wrap it in the display command: display(plt.show())
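
A minimal sketch of that pattern; df, "x", and "y" below are placeholder names:

import matplotlib.pyplot as plt

pdf = df.limit(1000).toPandas()   # convert a small slice of the Spark DataFrame to pandas
plt.plot(pdf["x"], pdf["y"])      # plot with plain matplotlib
display(plt.show())               # wrapping in display() forces the figure to render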

Folders in a path, DataBricks

If you just use dbutils.fs.ls(path), it returns a confusing list that mixes folder names with other fields, which is hard to scan through, visually or programmatically. The function below parses that for you and extracts only the folder names.

def files_in_path(path):
    lst = [str(item).split("name=")[1].split(",")[0].split("'")[1].strip("/") for item in dbutils.fs.ls(path)]
    return lst

Read the docs on DataBricks' dbutils.
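
You can also pull the documentation up right in a cell:

dbutils.help()      # lists the available dbutils modules
dbutils.fs.help()   # lists the file system commands (ls, cp, mv, rm, mkdirs, ...)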

DataBricks File System Workspace Storage, DBFS

Aside from the cloud blob storage, whether that is Azure, AWS, or Google Cloud, there's also a "local" DataBricks storage path that starts with /dbfs. We used to be able to navigate it with a GUI by clicking the topmost button on the left menu in the DataBricks notebook interface, but not anymore. It still exists nonetheless.

To explore what files currently exist in it, use dbutils.fs.ls("/") or the os Python library equivalent, os.listdir("/dbfs").

I use this path to save my PySpark models while I'm testing. You can also save small .txt, .json, and .csv files there.

To keep things orderly, I recommend creating your own folder, e.g. "/dbfs/FileStore/shared_uploads/my_folder/models/". I believe "/dbfs/FileStore/shared_uploads/" exists by default. Create your folder and its subfolders using the os library or the DataBricks utilities dbutils.

Don't confuse that with the local path that starts with "./FileStore/shared_uploads/", because it technically doesn't exist, yet somehow you can save models to it but not load from it. It confused me at the time of writing this page. Again, verify for yourself: load your model to make sure it has been saved where you intended before you shut down your notebook, so you don't lose your work.
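
Here is a hedged sketch of that whole workflow; the folder name, model variable, and model type are assumptions (I'm assuming a fitted PySpark ML PipelineModel), but the save-then-load-back check at the end is the part that matters:

import os
from pyspark.ml import PipelineModel

# create your own folder under the shared uploads area, via the /dbfs local mount
model_dir = "/dbfs/FileStore/shared_uploads/my_folder/models/"
os.makedirs(model_dir, exist_ok=True)

# save a fitted model (fitted_model is assumed to be a trained PipelineModel)
save_path = "dbfs:/FileStore/shared_uploads/my_folder/models/my_model_v1"
fitted_model.write().overwrite().save(save_path)

# load it back right away to verify it landed where you intended
restored = PipelineModel.load(save_path)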

Deleting a table

No need to remind you to use extra caution when deleting stuff, you already know. Right? Right?! I use this to drop tables that I have saved. I make sure to end the name of each table I create with my initials, so it's easy to retrieve and locate. Use DataBricks' dbutils.fs.rm("path", True), where True means delete recursively; it returns True if the removal succeeded. Or use the SQL context's DROP TABLE:

%sql
DROP TABLE [IF EXISTS] {table_name}
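
If you'd rather stay in a Python cell, the same drop can go through spark.sql; the table name here is just a placeholder:

# Python equivalent of the SQL cell above; table name is hypothetical
spark.sql("DROP TABLE IF EXISTS my_table_ab")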

If using Azure Synapse Analytics, the magic command is a double %% instead of just one. See the Azure SQL docs entry.

Passing arguments to a notebook when running from another notebook

You know that you can call/run a DataBricks notebook from your current notebook with the command

%run "path"

whether that is an absolute or a relative path. Make sure there's nothing else in the cell with the %run command, not even a comment; otherwise it won't work.

You can also pass variables to change or override the ones in the notebook you're calling. One way is to already know which variables you want to tweak when writing the first notebook; then, when you call it, you set the values of those variables in the calling notebook, e.g.

var1 = "some_value"

%run "./first_notebook"

This assumes that your imported notebook is usually a collection of functions and procedures you want to use over and over. Another way is to use DataBricks' dbutils.notebook.run; here's the template:

dbutils.notebook.run(
    "notebook-name",
    3600,
    {"parameterFloat": 2.5, "parameterString": "abc"})

Where 3600 is the timeout in seconds.
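
On the other side, the called notebook typically reads those arguments through widgets; a sketch assuming the same parameter names as the template above (note the values arrive as strings):

# inside "notebook-name": declare widgets with defaults, then read them;
# arguments passed by dbutils.notebook.run override the defaults, as strings
dbutils.widgets.text("parameterFloat", "0.0")
dbutils.widgets.text("parameterString", "")
param_float = float(dbutils.widgets.get("parameterFloat"))
param_string = dbutils.widgets.get("parameterString")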

Resources: notebook workflows, and a great Medium post comparing the two methods. In that post the author mentions that you can't actually override functions and variables with dbutils.notebook.run; I haven't tried it myself.

NOTE

In other Spark notebook providers, for example Azure Synapse Analytics, you can accomplish that with this template:

%run "other_notebook_path" {"parameterInt": 1, "parameterFloat":2.5, "parameterBool":True, "parameterString":"abc"}

Resource: Azure Synapse Docs section: Notebook Reference. At the time of writing this post, Azure Synapse Analytics lagged behind DataBricks in adopting Spark 3.0, and it's bare-bones in terms of Spark: it doesn't have all the nice optimizations that DataBricks provides for you. It does have nice features, though, like IDE-like code-typing assistance and a notebook outline. They call their Spark clusters "Spark Pools". But Microsoft provides really good documentation for Azure Synapse Analytics, and one can argue that having all your data needs, in theory, in one place is helpful. So it's a trade-off.

Collection of quick things to do in DataBricks

displaying x number of rows

Using display() on a dataframe shows the first 1000 rows by default, which you can scroll through. If for some reason you want just the first, say, 10 rows, you'd need to use .limit, like this: df.limit(10).display() or display(df.limit(10)). Using df.display(10) has no effect.

displaying matplotlib graphs

If for some reason your plot didn't show, even after adding plt.show() at the end of your plotting snippet, use display(plt.show()) instead; that will print it out for sure.

exploring and editing blob storage

DataBricks has its own tool for that, called dbutils, with its submodule fs. Search for it online for all your needs to copy, move, delete, etc., a table in blob storage. Use dbutils.fs.ls("blob_storage_path") to explore it. The result is a list that needs some text editing with .split() if you want only the name of the folder. Here's my practical, not-so-pretty way of doing it:

def files_in_path(path):
    lst = [str(item).split("name=")[1].split(",")[0].split("'")[1].strip("/") for item in dbutils.fs.ls(path)]
    return lst
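
For what it's worth, a slightly tidier variant: the FileInfo objects that dbutils.fs.ls returns already expose a .name attribute, so the string surgery can be skipped.

def files_in_path(path):
    # FileInfo.name is the entry name; folders end with '/', so strip it off
    return [item.name.strip("/") for item in dbutils.fs.ls(path)]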

installing libraries in a session

You can install a library on the cluster itself, but I think it's better practice to install it in the notebook, on the Spark session, instead; that way you always know the version and can upgrade libraries easily.

#this is the very first cell in the notebook, after the authentication cell if any
dbutils.library.installPyPI("package", version="xx.x.xx", repo="repo", extras="extras")
#repeat the command for all the libraries you need
dbutils.library.restartPython()

After that, import the libraries you just installed, and any other library from Spark or Python in the cell after.
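
On newer Databricks runtimes the %pip magic does the same per-notebook install; a minimal sketch, with a placeholder package name and version:

%pip install some_package==1.2.3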

Reveal the Python version in the current cluster

import sys
sys.version

Reading Delta tables

Use spark.read.format("delta") with the timestampAsOf or versionAsOf read option to time-travel.
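
A sketch of both time-travel options; the path and the values are placeholders for wherever your Delta table lives:

# read a Delta table as it was at a given timestamp
df_at_time = (spark.read.format("delta")
              .option("timestampAsOf", "2021-01-01")
              .load("/mnt/my_container/my_delta_table"))

# or as it was at a specific version number
df_at_version = (spark.read.format("delta")
                 .option("versionAsOf", 2)
                 .load("/mnt/my_container/my_delta_table"))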
