Persistence Caching in Apache Spark

Donald Le
Dec 31, 2020

Computation in Spark can be expensive, so we should be deliberate about which data is worth recomputing and which should be kept around. To cache a dataset so that it does not need to be recalculated on every action, we can use the persist method.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object LocalPi {

  val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("test")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(conf)
    val input = sc.parallelize(List(1, 2, 3, 4))
    val result = input.map(x => (x * x, x + x))
    // Persist the RDD to disk so it is not recomputed by later actions.
    result.persist(StorageLevel.DISK_ONLY)
    println(result.count())                  // first action computes the RDD and stores it
    println(result.collect().mkString(","))  // second action reuses the persisted data
    sc.stop()
  }
}
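As a side note, persist accepts other storage levels besides DISK_ONLY, and cache() is simply shorthand for persist(StorageLevel.MEMORY_ONLY). The sketch below (object name, app name, and data are illustrative assumptions, not from the original post) shows the memory-based variants and how to release a cached RDD with unpersist().

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheVariants {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("cache-variants"))

    val squares = sc.parallelize(1 to 1000).map(x => x * x)

    // cache() == persist(StorageLevel.MEMORY_ONLY):
    // partitions are kept as deserialized objects in executor memory.
    squares.cache()
    println(squares.count())               // computes and caches the partitions
    println(squares.take(5).mkString(",")) // served from the cache, no recomputation

    // Drop the cached partitions once they are no longer needed.
    squares.unpersist()

    // MEMORY_AND_DISK spills partitions that do not fit in memory to disk
    // instead of recomputing them from the lineage.
    squares.persist(StorageLevel.MEMORY_AND_DISK)
    println(squares.count())

    sc.stop()
  }
}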

