
Persistence Caching in Apache Spark

Donald Le
Dec 31, 2020


Computation in Spark is expensive, so it pays to decide which data is worth recomputing and which is not. To keep an RDD cached (so it does not have to be recalculated every time an action runs), we can use the persist method:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object LocalPi {

  val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("test")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(conf)
    val input = sc.parallelize(List(1, 2, 3, 4))
    val result = input.map(x => (x * x, x + x))
    // Persist the RDD on disk so the second action below reuses the
    // stored partitions instead of recomputing the map.
    result.persist(StorageLevel.DISK_ONLY)
    println(result.count())
    println(result.collect.mkString(","))
  }
}
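DISK_ONLY is only one of the available storage levels. Here is a minimal sketch of a couple of common alternatives (the squares RDD is made up for illustration and reuses the sc context from the example above); note that an RDD keeps whatever level it was first assigned until you unpersist it:

val squares = sc.parallelize(1 to 100).map(x => x * x)

// Keep deserialized partitions in memory; anything that does not fit is recomputed.
squares.persist(StorageLevel.MEMORY_ONLY)

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
// squares.cache()

// MEMORY_AND_DISK spills partitions that do not fit in memory to disk instead.
// squares.persist(StorageLevel.MEMORY_AND_DISK)

println(squares.count())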

To remove the data from the cache, use the unpersist() method. Once unpersist() has been called, the cached blocks are dropped and the next action on the RDD recomputes it from scratch:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.{SparkConf, SparkContext}

object LocalPi {

  val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("test")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(conf)
    val input = sc.parallelize(List(1, 2, 3, 4))
    val result = input.map(x => (x * x, x + x))
    result.persist(StorageLevel.DISK_ONLY)
    println(result.count())
    println(result.collect.mkString(","))
    // Drop the cached blocks; the two actions below recompute the RDD.
    result.unpersist()
    println(result.count())
    println(result.collect.mkString(","))
  }
}
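If you want to check whether an RDD is still persisted, its getStorageLevel method reports the current level. A minimal sketch, assuming these lines are added inside main around the existing unpersist() call:

// Inspect the storage level before and after unpersist().
println(result.getStorageLevel)   // reflects the DISK_ONLY level while persisted
result.unpersist()
println(result.getStorageLevel)   // no longer reports DISK_ONLY once unpersisted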

Hope it helps~~

PEACE~~
