Creating a Data Frame

Creating a data frame of toy data for testing

I'm going to extend the example shown in the Spark Scala page group, on the "Creating a Data Frame" page.

from pyspark.sql import Row, SparkSession

# get the active SparkSession (predefined as `spark` in most notebooks)
spark = SparkSession.builder.getOrCreate()

# create a tuple of Row objects; createDataFrame will distribute it
arrayStructureData = (
    Row("James,,Smith", [2, 3, 20], ["Spark", "Java"], "OH", "CA", 123, 456.78, 0.1),
    Row("Michael,Rose,", [4, 6, 3], ["Spark", "Java"], "NY", "NJ", 75, 234.01, 0.2),
    Row("Robert,,Williams", [10, 15, 6], ["Spark", "Python"], "UT", "NV", 82, 987.02, 0.7),
    Row("John,,Doe", [20, 25, 62], ["C++", "Python"], "TN", "TN", 98, 332.30, 0.9),
    Row("Jane,,Doe", [50, 55, 65], ["Spark", "C++"], "TX", "MN", 61, 980.23, 0.8),
    Row("Jack,,Smith", [11, 34, 98], ["JavaScript", "Go"], "CA", "MI", 110, 937.94, 0.5),
    Row("Jillan,,Bernard", [2, 1, 9], ["R", "Python"], "CT", "NY", 132, 128.95, 0.6),
    Row("Phillip,,Kraft", [1, 13, 0], ["Python", "Stata"], "RI", "FL", 95, 563.63, 0.1),
    Row("Karl,,Lund", [74, 92, 14], ["Go", "Python"], "CA", "WA", 55, 614.84, 0.4),
)

# define the column names
COLUMNS = ["name", "arr1", "lang", "state0", "state1", "num1", "num2", "num3"]

# make them into a data frame
dummy_df = spark.createDataFrame(arrayStructureData, COLUMNS)

dummy_df.printSchema()
dummy_df.show()
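When you pass only column names like this, Spark infers the types (strings, arrays, longs, doubles). If you'd rather pin the types down explicitly, you can pass a StructType schema instead of just names. A minimal sketch, reusing the arrayStructureData tuple from above; the choice of types here (e.g. IntegerType vs. LongType) is mine, not fixed by the example:

from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType,
    DoubleType, ArrayType,
)

# explicit schema: one StructField per column, (name, type, nullable)
schema = StructType([
    StructField("name", StringType(), True),
    StructField("arr1", ArrayType(IntegerType()), True),
    StructField("lang", ArrayType(StringType()), True),
    StructField("state0", StringType(), True),
    StructField("state1", StringType(), True),
    StructField("num1", IntegerType(), True),
    StructField("num2", DoubleType(), True),
    StructField("num3", DoubleType(), True),
])

typed_df = spark.createDataFrame(arrayStructureData, schema)
typed_df.printSchema()

With an explicit schema Spark skips type inference, which avoids surprises such as plain Python ints coming through as longs.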
There are other ways to create a data frame, like using toDF() or importing from different types of files. See this link for details.
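For instance, both of those routes look roughly like this (the file path and the column names below are made up for illustration):

# toDF() on an RDD (or on a list of tuples parallelized into one)
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2)])
df_from_rdd = rdd.toDF(["letter", "number"])

# reading from a file; the path here is hypothetical
df_from_csv = spark.read.csv("/path/to/data.csv", header=True, inferSchema=True)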