Persistence (Caching) in Apache Spark

Computation in Spark can be expensive, so it pays to decide which results are worth keeping around and which can simply be recomputed. To keep an RDD cached, so it does not need to be recalculated on every action, we can use the persist() method.

import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object LocalPi {

  val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("test")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(conf)
    val input = sc.parallelize(List(1, 2, 3, 4))
    val result = input.map(x => (x * x, x + x))

    // Cache the RDD on disk so later actions reuse it instead of recomputing it.
    result.persist(StorageLevel.DISK_ONLY)

    println(result.count())
    println(result.collect.mkString(","))
  }
}
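DISK_ONLY is only one of several storage levels you can pass to persist(). Two points worth noting: cache() is shorthand for persist(StorageLevel.MEMORY_ONLY), and persisting is lazy, so nothing is actually cached until an action such as count() runs. Here is a minimal sketch of the other common choices (the object name StorageLevels is illustrative, not from the original example):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object StorageLevels {

  val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("test")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(conf)
    val input = sc.parallelize(List(1, 2, 3, 4))

    // cache() == persist(StorageLevel.MEMORY_ONLY):
    // keep the RDD as deserialized objects in JVM memory.
    val squares = input.map(x => x * x).cache()

    // MEMORY_AND_DISK spills partitions that do not fit in memory
    // to disk instead of recomputing them later.
    val doubles = input.map(x => x + x)
    doubles.persist(StorageLevel.MEMORY_AND_DISK)

    // Persisting is lazy: the data is materialized only when an action runs.
    println(squares.count())
    println(doubles.count())

    sc.stop()
  }
}

MEMORY_AND_DISK is a common middle ground: it trades some disk I/O for avoiding recomputation when partitions do not fit in memory.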

To remove the data from the cache, use the unpersist() method:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object LocalPi {

  val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("test")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(conf)
    val input = sc.parallelize(List(1, 2, 3, 4))
    val result = input.map(x => (x * x, x + x))

    result.persist(StorageLevel.DISK_ONLY)
    println(result.count())
    println(result.collect.mkString(","))

    // After unpersist(), the next action recomputes the RDD from scratch.
    result.unpersist()
    println(result.count())
    println(result.collect.mkString(","))
  }
}
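One detail worth knowing: RDD.unpersist() also accepts a blocking flag that controls whether the call waits for the cached blocks to actually be removed before returning (its default value has changed across Spark releases, so check the docs for your version). A minimal sketch:

    // Wait until all cached partitions are freed before continuing.
    result.unpersist(blocking = true)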

Hope it helps~~

PEACE~~
