Creating a Data Frame

Creating a data frame of toy data for testing

I'm going to extend the example shown on the Spark Scala page, "Creating a Data Frame".

from pyspark.sql import Row, SparkSession

# create (or reuse) a SparkSession
spark = SparkSession.builder.appName("toy-data").getOrCreate()

# create a tuple of Row objects (a plain local collection;
# createDataFrame will distribute it across the cluster)
arrayStructureData = (
    Row("James,,Smith", [2, 3, 20], ["Spark", "Java"], "OH", "CA", 123, 456.78, 0.1),
    Row("Michael,Rose,", [4, 6, 3], ["Spark", "Java"], "NY", "NJ", 75, 234.01, 0.2),
    Row("Robert,,Williams", [10, 15, 6], ["Spark", "Python"], "UT", "NV", 82, 987.02, 0.7),
    Row("John,,Doe", [20, 25, 62], ["C++", "Python"], "TN", "TN", 98, 332.30, 0.9),
    Row("Jane,,Doe", [50, 55, 65], ["Spark", "C++"], "TX", "MN", 61, 980.23, 0.8),
    Row("Jack,,Smith", [11, 34, 98], ["JavaScript", "Go"], "CA", "MI", 110, 937.94, 0.5),
    Row("Jillan,,Bernard", [2, 1, 9], ["R", "Python"], "CT", "NY", 132, 128.95, 0.6),
    Row("Phillip,,Kraft", [1, 13, 0], ["Python", "Stata"], "RI", "FL", 95, 563.63, 0.1),
    Row("Karl,,Lund", [74, 92, 14], ["Go", "Python"], "CA", "WA", 55, 614.84, 0.4)
)

# define the column names
COLUMNS = ["name", "arr1", "lang", "state0", "state1", "num1", "num2", "num3"]

# make them into a data frame
dummy_df = spark.createDataFrame(arrayStructureData, COLUMNS)

dummy_df.printSchema()
dummy_df.show()

There are other ways to create a data frame, such as calling toDF() or reading from different file formats. See this link for details.
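As a minimal sketch of the toDF() route: an RDD of tuples can be converted directly, with the column names passed as arguments. The data and names here are illustrative, and a running SparkSession named `spark` is assumed.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("toy-data").getOrCreate()

# parallelize a small list of tuples into an RDD,
# then convert it to a data frame with named columns
small_df = spark.sparkContext.parallelize(
    [("Ada", 36), ("Grace", 45)]
).toDF(["name", "num"])

small_df.show()

This is handy for quick tests where defining Row objects and a separate column list would be overkill.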
