Spark for Data Scientists
  • Introduction
  • Spark Scala
      • Reading Files, Essential Imports, Docs & Optimization
      • Accessing Variables in Data Frames
      • Operations on One Column
      • Operations on Multiple Columns at Once
      • Filtering & Nulls
      • Window
      • DataFrame Manipulations
      • Aggregations
      • Array Columns
      • Spark Scala Fundamentals
      • User Defined Functions - UDF
      • Writing and Reading a Text File
      • Schema: Extracting, Reading, Writing to a Text File
      • Creating a Data Frame
      • Estimating, Partitioning, Writing/Saving a DataFrame
      • Machine Learning in Spark
      • Catch-It-All Page
  • PySpark
      • Essential Imports & General Notes
      • Creating a Data Frame
      • UDFs
      • Operations on Multiple Columns at Once
      • Correlations in PySpark & Selecting Variables Based on That Correlation Threshold
      • Merging and Cleaning Up Duplicate Columns
      • Machine Learning with PySpark
      • Full Worked Random Forest Classifier Example
      • Related to ML
        • Modeling in PySpark as a Function for Faster Testing
        • Saving and Loading a Model, With and Without MLflow
        • Pipeline in PySpark 3.0.1, By Example
        • CountVectorizer to One-Hot Encode Multiple Columns at Once
        • Cross Validation in Spark
        • Creating Categories/Buckets Manually, and the KS Test
        • Data Frame Partitioning: Exhaustive and Mutually Exclusive Partitions
      • Collection of Notes: Catch-It-All Page
      • Related Python Snippets
      • Appendix - Plotting in Python
  • Zeppelin Notebooks
    • Databricks Useful Commands