Zeppelin Notebooks

The notebook type you'll see on Spark cluster management platforms

This is the default notebook type for Apache Spark cluster management platforms (like DataBricks and others). It works similarly to Jupyter Notebooks, except much better. And here is why,

  • You can use different programming languages in the same notebook, in different cells. Just declare each cell's language with a magic command, as shown below.

    • It does that by having a different interpreter for each language you use. Variables can't be passed from one language to another, unless it's a DataFrame registered as a temporary SQL view with createOrReplaceTempView (more on that below).

  • By default, it tells you when you ran the cell and how long it took.

  • It shows you a progress bar for your query execution, and a button to see logs and the performance of the nodes in your cluster, along with cluster logs and other pertinent information.

  • Depending on the cluster management platform you're using, you might have other option buttons under each cell, like exporting the result to an Excel file or plotting it.

Changing a Cell's Language

For markdown,

%md
### Title goes here

For Scala,

%scala
// your scala code goes here

For SQL,

%sql
-- your SQL query goes here
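
For Python (that's %python on DataBricks; a plain Zeppelin install typically uses %pyspark for Spark-enabled Python),

%python
# your PySpark code goes here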

Well you get the gist.

And also, you can run another Zeppelin notebook from your workspace (or another user's), by typing the run magic followed by the relative path to the notebook,

%run "./another_notebook_relative_path"

This works given that your notebook returns a result by design. So, in a way, it's similar to importing a .py file as your own custom library.

Passing Result from Scala Cell to PySpark/Python Cell

Suppose I have a dummy DataFrame built in Scala, but I want to work with it in PySpark. You can have a cell where you create your Scala dummy DataFrame,

%scala
// imports needed for Row and the schema types
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// create the rows' values
val arrayStructureData = Seq(
  Row("James,,Smith", List(2, 3, 5), List("Spark", "Java"), "OH", "CA"),
  Row("Michael,Rose,", List(4, 6, 3), List("Spark", "Java"), "NY", "NJ"),
  Row("Robert,,Williams", List(10, 15, 6), List("Spark", "Python"), "UT", "NV")
)

// create the schema
val arrayStructureSchema = new StructType()
  .add("name", StringType)
  .add("num", ArrayType(IntegerType))
  .add("languagesAtWork", ArrayType(StringType))
  .add("currentState", StringType)
  .add("previousState", StringType)

// put it together
val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayStructureData), arrayStructureSchema)
df.printSchema
df.show

// register a temp view so the next (Python) cell can read it
df.createTempView("df")

☞ Notice the magic at the beginning of the Zeppelin notebook cell, and the temp view registration at the very end of the cell, which is what makes the DataFrame available to other languages.

Then you can read it back in Python in the next cell like so,

# read the temp view registered in the Scala cell back as a DataFrame
df = sqlContext.table("df")
df.printSchema()
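
Because the temp view is registered in the session's catalog, any language cell can also query it with SQL. For instance, from a Python cell (a minimal sketch, reusing the column names from the Scala schema above),

%python
# query the temp view registered in the Scala cell via SQL
spark.sql("SELECT name, currentState FROM df").show()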

Shortcuts

Creating New Cells

If you're currently inside a cell, hit the ESC key, then press A to create a cell above, or B to create a cell below.

Deleting a Cell

If you're currently inside a cell, hit the ESC key, then press D twice quickly.

Commenting and Uncommenting

Cmd + / comments out or uncomments the line you're at, regardless of the language. It works in Markdown cells as well.

You can also comment and uncomment a selection of lines if you highlight them.

Find and Replace

You're familiar with Cmd + F to find something on a webpage. Your Spark cluster manager's Zeppelin notebook will have a better way to do it, with a Find and Replace option in a dropdown menu. Which shortcut that is depends on your platform; e.g. in DataBricks, the shortcut is listed next to the selection in the Edit dropdown menu. Those shortcuts work when you're not actively working in a cell.

Undo Delete Cell

This is one of the most useful features for me personally, as I'm quick to delete cells. You can find it in the Edit dropdown menu of your platform; usually there's a shortcut listed next to the selection.

Libraries

Importing Your Own Library

Similarly to having your own .py file containing all the functions you frequently use, you can import a notebook from your workspace, usually with the MAGIC command %run "relative_path_to_notebook_in_workspace". Then you'll have all your functions loaded and ready to use in your current notebook. At least this is the case for DataBricks; read more in DataBricks tips and tricks for Data Scientists. Any platform you use will probably have education and learning material on how to utilize their product and make the most out of it.
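
For example, assuming a hypothetical workspace notebook called shared_helpers that defines a clean_column_names function, the first cell runs it,

%run "./shared_helpers"

and then a later cell can use the function directly,

%python
# clean_column_names is a hypothetical helper defined in the shared_helpers notebook
df = clean_column_names(df)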

Downloading Libraries to Cluster

There are two ways,

  • Permanent installation on the cluster. Thus, you can import it and use it anytime you use the cluster, for any notebook. Do that from the platform's cluster management button, which is a GUI guiding you to download the library you want (or a bunch of them) in the version you want.

  • Temporary installation for the notebook you're working on. This is usually done with a MAGIC command or by other means, depending on your platform. For DataBricks, you can use dbutils.library to do that, as sketched below. See all methods of dbutils in the resources below, then use Cmd + F to locate "dbutils.library" on the page for full details.
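
For instance, on DataBricks (older runtimes) a notebook-scoped install looks roughly like this sketch; the package name and version here are just placeholders, and newer runtimes replace this with the %pip magic,

%python
# install a library for this notebook only (DataBricks dbutils; newer runtimes use %pip install instead)
dbutils.library.installPyPI("plotly", version="4.14.3")
dbutils.library.restartPython()  # restart Python so the newly installed package can be imported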

☞ Resources,

I'm not sure whether DataBricks utilities usage depends on the cloud provider, i.e. Azure vs. AWS, but I can find different DataBricks documentation for each cloud, like DataBricks Utilities for AWS and Azure dbutils, as separate AWS and Azure DataBricks documentation.
