Spark Scala

Explanations, details, workflow, and syntax of Spark, with Scala

Spark as a framework offers three main interfaces to make it more convenient for data professionals: Scala, Python, and R.

Spark is written in Scala, which in turn runs on the Java Virtual Machine, so you can import and use Java libraries directly. For that reason, the native Scala interface is said to be faster than PySpark or SparkR, and new Spark features land in the Scala API first. However, if you're not already familiar with Scala, there will be an initial learning curve, as it is a stricter and more particular language than Python.
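To illustrate the point about Java interoperability, here is a minimal sketch (plain Scala, no Spark session needed) that calls classes from the Java standard library directly from Scala code:

```scala
// Scala code can use Java standard-library classes with a plain import.
import java.time.LocalDate
import java.util.UUID

// java.time API, called from Scala exactly as it would be from Java
val today    = LocalDate.of(2021, 1, 15)
val nextWeek = today.plusDays(7)
println(nextWeek)            // 2021-01-22

// java.util works the same way
val id = UUID.randomUUID().toString
println(id.length)           // 36 (canonical UUID string length)
```

The same applies inside a Spark job: any Java library on the driver/executor classpath is usable from Spark Scala code.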

I initially started with PySpark, then quickly switched to Spark Scala because it made more sense to me; I didn't like the mix of Python's fluidity and Spark's preciseness. Also, at the time I was putting these notes together, there was far less online community support for PySpark than for Spark Scala.

If you do end up wanting to learn Scala, I included a page in this group/section to get you up to speed quickly with the fundamentals: functions, logic, types, loops, and if-else statements. You will also see these used throughout the Spark Scala section.
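As a quick taste of those fundamentals, here is a small hedged sketch (all names are illustrative, not from the fundamentals page itself) showing a function, an if-else expression, explicit types, and a loop:

```scala
// if-else is an expression in Scala: it returns a value
def label(n: Int): String =
  if (n % 2 == 0) "even" else "odd"

// explicit type annotation on a value
val nums: List[Int] = List(1, 2, 3, 4)

// higher-order function: apply label to every element
val labels = nums.map(label)   // List(odd, even, odd, even)

// a basic imperative loop with a mutable variable
var total = 0
for (n <- nums) total += n
println(total)                 // 10
```

Note the functional style (`map` over a collection) alongside the imperative loop; Spark Scala code leans heavily on the former.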

For all of the above reasons, this group of pages contains everything I know about Spark. The PySpark section merely contains applications and examples in PySpark.

