Writing and Reading a Text File

Saving & loading a list of strings to a .txt file


Writing a List of Strings To a Text File (With New Line Character)


def writeFile(filePath: String, listObject: List[String]): Unit = {
    import java.io._
    // append a newline to each element so each one lands on its own line in the file
    val lines = listObject.map(_ + "\n")
    val file = new File(filePath)
    val bw = new BufferedWriter(new FileWriter(file))
    for (line <- lines) {
        bw.write(line)
    }
    bw.close()
}

To use it, do writeFile("path/mytextfile.txt", myList)

This function appends a newline character to each element of your list; otherwise the list would be saved as one long string in the text file, without quotes.
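For example, with a small placeholder list (the path and values here are just for illustration):

val myList = List("alpha", "beta", "gamma") // made-up values
writeFile("path/mytextfile.txt", myList)

the saved file then has one element per line:

alpha
beta
gamma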

Reading a List of Strings From a Text File

def readFile(filePath: String): List[String] = {
    // read the file as an RDD of lines, then collect it to the driver as a List
    val stringList = spark.sparkContext.textFile(filePath).collect.toList
    stringList
}

To use it, do val myList = readFile("path/mytextfile.txt")

☞ String formatting is left to the "Spark Scala Fundamentals" page of this book.

Saving Without a New Line Character

By default, BufferedWriter saves the content as one long string with no newline characters; as you saw above, we had to add the newline character manually to save a list line by line. But sometimes we do want one long string, as when we extracted the schema of a DataFrame and saved it as JSON (a short sketch of that use case follows the snippet below). Find all the details on the "Schema: Extracting, Reading, Writing to a Text File" page of this book.

import java.io._

val theString = "some string I have"
val theNewFileObject = new File("path/filename.txt")

// write the string to the file exactly as is, with no newline added
val bw = new BufferedWriter(new FileWriter(theNewFileObject))
bw.write(theString)
bw.close()
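As a small sketch of that schema use case (df here stands for any existing DataFrame, and the output path is a placeholder), the same pattern writes the schema as one long JSON string:

import java.io._

// df.schema.json returns the DataFrame's schema as a single JSON string
val schemaString = df.schema.json

val bw = new BufferedWriter(new FileWriter(new File("path/schema.json"))) // placeholder path
bw.write(schemaString)
bw.close()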

Reading One Long String, No New Line Character

import scala.io.Source

val string1 = "path/filename.txt"

// open the file, join all its lines into one long string (no newline characters), then close it
val source1 = Source.fromFile(string1)
val jstring1 = source1.getLines.mkString
source1.close()
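If the long string you read back is a schema that was saved as JSON (the use case above), you can turn it back into a StructType; this is just a sketch, and the full walkthrough is on the schema page mentioned earlier:

import org.apache.spark.sql.types.{DataType, StructType}

// jstring1 is the long string read above; parse it back into a Spark schema
val schemaFromFile = DataType.fromJson(jstring1).asInstanceOf[StructType]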


☞ If Reading a Text File Errors Out

Sometimes, when you're on a cluster and trying to read a text file with .collect(), you might get an error related to Hadoop, with the compiler saying:

Name: java.lang.IllegalAccessError
Message: tried to access method com.google.common.base.Stopwatch.<init>()V from class org.apache.hadoop.mapred.FileInputFormat
StackTrace:   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:200)

Solve it by reading the text file through Spark's DataFrame reader instead, like so:

val basePath = "path_to_the_datalake"
val filePath = s"$basePath/workspace/haya_toumy/filename.txt"

// read the file as a DataFrame, without inferring a schema or treating the first line as a header
val text1DF = spark.read.
  option("inferSchema", "false").
  option("header", "false").
  csv(filePath)

// join each row's fields with commas, collect to the driver, and keep the first line
val s1 = text1DF.rdd.map(_.mkString(",")).collect()(0)
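The last line keeps only the first row. If the file has several lines and you want them all back as a List[String], a small variation of the same sketch collects everything:

// keep every line of the file instead of only the first one
val allLines = text1DF.rdd.map(_.mkString(",")).collect().toList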

Flatten and Read JSONs

Converting a DataFrame to an RDD of Rows, in case you need it:

import org.apache.spark.sql._ // for Row
import org.apache.spark.rdd._ // for RDD

// convert the data frame to an RDD of Rows
val rows: RDD[Row] = df.rdd
// flatten each Row into its individual cell values
val flatRows = rows.flatMap(row => row.toSeq)

// OR
sc.parallelize(Seq(row_name.toString())) // make an RDD from a Row's string form
df.map(x => x.toString()).rdd            // convert the data frame to an RDD of strings
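Here's a self-contained sketch of the first approach, using a tiny made-up DataFrame (the column names and values are only for illustration):

import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
import spark.implicits._

// a tiny DataFrame just for demonstration
val df = Seq(("a", 1), ("b", 2)).toDF("letter", "number")

// DataFrame -> RDD[Row]
val rows: RDD[Row] = df.rdd

// flatten each Row into its individual cell values
val flatRows = rows.flatMap(row => row.toSeq)

flatRows.collect() // Array(a, 1, b, 2)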
