Zeppelin Notebooks
The notebook you'll see on Spark cluster-management platforms
This is the default notebook type on Apache Spark cluster-management platforms (like Databricks, among others). It works similarly to Jupyter notebooks, only much better. And here is why,
You can use different programming languages in different cells of the same notebook. Just declare the language with a magic command, shown below.
It does that by running a separate interpreter for each language you use. Variables can't be passed from one language to another, unless it's a DataFrame registered as a SQL temp view with `createOrReplaceTempView` (more on that below).
By default, it shows you when you ran each cell and how long it took.
It shows you a progress bar on your query execution, and a button to see the logs and performance of the nodes in your cluster, along with cluster logs and other pertinent information.
Depending on the cluster-management platform you're using, you might have other option buttons under each cell, like exporting the result to an Excel file or plotting it.
Changing a Cell's Language
For Markdown, use `%md`.
For Scala, use `%scala`.
For SQL, use `%sql`.
Well, you get the gist.
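For instance, three cells in the same notebook might look like the sketch below; each `%...` line begins a separate cell. These are the Databricks spellings of the magics; Zeppelin proper spells some differently (e.g. `%spark` for Scala), and the table name is hypothetical.

```
%md
### Rendered as Markdown

%scala
// A Scala cell
val numbers = (1 to 10).toList

%sql
-- A SQL cell
SELECT COUNT(*) FROM my_table
```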
And also, you can run another Zeppelin notebook from your workspace (or another user's) by typing the `%run` magic followed by the relative path to the notebook, given that the notebook returns a result by design. So, in a way, it's similar to importing a .py file as your custom library.
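A minimal sketch of such a cell (the path is made up, relative to the current notebook's location in the workspace):

```
%run ./utils/shared_setup
```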
Passing Result from Scala Cell to PySpark/Python Cell
Suppose I have a dummy DataFrame built in Scala, but I want to work with it in PySpark. Then, you can have a cell where you create your Scala dummy DataFrame,
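A sketch of that Scala cell (the data and view name are made up; `.toDF` works because `spark.implicits._` is pre-imported in notebook Scala cells):

```
%scala
// Build a dummy DataFrame (hypothetical data)
val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "letter")

// Register it as a temp view so the next (Python) cell can read it
df.createOrReplaceTempView("dummy_view")
```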
☞ Notice the magic at the beginning of the Zeppelin notebook cell, and the command at the very end of the cell that passes the DataFrame on to Python.
Then you'll pass it to Python and read it in the next cell, like so,
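The matching Python cell, reading the view the Scala cell registered above:

```
%python
# Pull the temp view back as a PySpark DataFrame
df = spark.table("dummy_view")
df.show()
```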
Shortcuts
Creating New Cells
If you're currently inside a cell, hit the `Esc` key, then press `A` to create a cell above, or `B` to create a cell below.
Deleting a Cell
If you're currently inside a cell, hit the `Esc` key, then press `D` twice in quick succession.
Commenting and Uncommenting
`Cmd` + `/` comments out or uncomments the line you're at, regardless of the language. It works in Markdown cells as well.
You can also comment or uncomment a selection of lines if you highlight them.
Find and Replace
You're familiar with `Cmd` + `F` to find something in a webpage. Your Spark cluster manager's Zeppelin notebook has a better way to do it, with a Find and Replace entry in a dropdown menu. Which shortcut it is depends on your platform; in Databricks, for example, the shortcut is listed next to the entry in the Edit dropdown menu. Those shortcuts work when you're not actively working in a cell.
Undo Delete Cell
This is one of the most useful features for me personally, as I'm quick to delete cells. You can find it in your platform's Edit dropdown menu; usually there's a shortcut listed next to the entry.
Libraries
Importing Your Own Library
Similar to having your own `.py` file containing all the functions you frequently use, you can import a notebook from your workspace, usually with the magic command `%run "relative_path_to_notebook_in_workspace"`; then you'll have all your functions loaded and ready to use in your current notebook. At least this is the case for Databricks; read more here: DataBricks tips and tricks for Data Scientists. Whatever platform you use will probably have education and learning material on how to make the most of its product.
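As a hedged illustration, suppose a workspace notebook called my_helpers defines a function clean_column_names (both names are hypothetical). After the `%run` cell executes, the function is in scope in your current notebook:

```
%run "./shared/my_helpers"

%python
# clean_column_names was defined inside the my_helpers notebook (hypothetical)
df = clean_column_names(df)
```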
Downloading Libraries to Cluster
There are two ways,
Permanent download on the cluster, so you can import it and use it anytime you use the cluster, from any notebook. Do that from the platform's cluster management button, a GUI that guides you through downloading the library you want in the version you want, or a bunch of them at once.
Temporary download for the notebook you're working on. This is usually done with a magic command, or by other means depending on your platform. For Databricks, you can use `dbutils.library` to do that; see the sketch after this list. To see all the methods of `dbutils`, check the resources below, then use `Cmd` + `F` to locate "dbutils.library" on the page for full details.
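A minimal sketch of the temporary route, using the `dbutils.library` calls documented for older Databricks runtimes (the package and version are just examples; newer runtimes recommend the `%pip` magic instead):

```
%python
# Install a library for this notebook session only (example package/version)
dbutils.library.installPyPI("nltk", version="3.5")

# Restart Python so the freshly installed library can be imported
dbutils.library.restartPython()
```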
☞ Resources,
I'm not sure whether Databricks utilities usage depends on the VM owner, i.e. Azure vs. AWS; but it seems I can find different documentation for Databricks on each cloud, like Databricks Utilities for AWS and Azure dbutils, as separate Azure and AWS Databricks documentation.