Writing and Reading a Text File

Saving & Loading list of strings to a .txt file

.txt

Writing a List of Strings To a Text File (With New Line Character)

def writeFile(filePath: String, listObject: List[String]): Unit = {
    import java.io._
    // append a newline to each element so every list item lands on its own line in the file
    val stringSeq = listObject.map(r => r + "\n")
    val file = new File(filePath)
    val bw = new BufferedWriter(new FileWriter(file))
    for (line <- stringSeq) {
        bw.write(line)
    }
    bw.close()
}

To use it, do writeFile("path/mytextfile.txt", myList)

This function appends a newline character to each element of your list; otherwise everything would be saved as one long string in the text file, without quotes.

Reading a List of Strings From a Text File

def readFile(filePath: String): List[String] = {
    // read the text file as an RDD of lines, then collect it back to the driver as a List
    spark.sparkContext.textFile(filePath).collect.toList
}

To use it, do val myList = readFile("path/mytextfile.txt")

☞ String formatting is left to the "Spark Scala Fundamentals" page of this book.

Saving Without a New Line Character

By default, the writer saves the content as one long string with no newline characters; that is why we had to add the newline character manually above in order to save a list. Sometimes, though, we do want to save one long string, as we did when we extracted the schema of a data frame and saved it as JSON. Find all the details in the "Schema: Extracting, Reading, Writing to a Text File" page of this book.

import java.io._

val theString = "some string I have"
val theNewFileObject = new File("path/filename.txt")
val bw = new BufferedWriter(new FileWriter(theNewFileObject))
bw.write(theString)
bw.close()
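
For instance (a minimal sketch, where df stands for any existing data frame), the schema-as-JSON string mentioned above can be written out with the same pattern:

import java.io._

// df.schema.json serializes the data frame's schema as one long JSON string
val schemaJson = df.schema.json
val bw = new BufferedWriter(new FileWriter(new File("path/schema.json")))
bw.write(schemaJson)
bw.close()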

Reading One Long String, No New Line Character

import scala.io.Source

val string1 = "path/filename.txt"
val source = Source.fromFile(string1)
// read all the lines and join them into one long string
val jstring1 = source.getLines.mkString
source.close()

☞ If Reading a Text File Errored Out

Sometimes, when you're on a cluster and trying to read a text file using .collect(), you might get an error related to Hadoop, such as:

Name: java.lang.IllegalAccessError
Message: tried to access method com.google.common.base.Stopwatch.<init>()V from class org.apache.hadoop.mapred.FileInputFormat
StackTrace:   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)

Work around it by reading the text file through the DataFrame reader instead, like so:

val basePath = "path_to_the_datalake"
val filePath = s"$basePath/workspace/haya_toumy/filename.txt"

// read the text file as a DataFrame, no header, no schema inference
val text1DF = spark.read.
    option("inferSchema", "false").
    option("header", "false").
    csv(filePath)

// collect and rebuild the first row as one comma-separated string
val s1 = text1DF.rdd.map(_.mkString(",")).collect()(0)
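
If all you need is the lines back as plain strings, a possible alternative (a sketch, assuming Spark 2.0+ where the Dataset-based text reader is available) is to skip the RDD text API entirely:

// each line of the file becomes one element of a Dataset[String]
val lines1 = spark.read.textFile(filePath).collect().toList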

Flatten and Read JSONs

Converting a data frame's Rows to an RDD of strings, in case you need it:

import org.apache.spark.sql._ // for Row
import org.apache.spark.rdd._ // for RDD
import spark.implicits._      // for the Encoder needed by df.map below

// convert the data frame to an RDD of Rows, then turn each Row into its string form
val rows: RDD[Row] = df.rdd
val flatRows = rows.map(row => row.toString)

// OR
sc.parallelize(Seq(row_name.toString())) // make an RDD from a single Row's string
df.map(x => x.toString).rdd              // convert the whole data frame to an RDD of strings
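
Once the rows are JSON strings, a possible way to read them back into a data frame (a minimal sketch, assuming Spark 2.2+; jsonStrings is a hypothetical list holding one JSON object per element) is Spark's JSON reader over a Dataset[String]:

import spark.implicits._

// hypothetical flattened JSON strings, one JSON object per element
val jsonStrings = List("""{"id": 1, "name": "a"}""", """{"id": 2, "name": "b"}""")

// Spark infers the schema from the JSON strings and parses nested fields into columns
val jsonDF = spark.read.json(jsonStrings.toDS)
jsonDF.printSchema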
