flatMap vs map in Apache Spark

Donald Le
Dec 31, 2020 · 2 min read

Sometimes we want to produce multiple output elements for each input element. The operation that does this is called flatMap(). As with map(), the function we provide to flatMap() is called individually for each element in our input RDD. Instead of returning a single element, though, the function returns an iterator of values. Rather than producing an RDD of iterators, we get back an RDD consisting of the elements from all of the iterators.

For example, to print each string to the screen using map():

import org.apache.spark.{SparkConf, SparkContext}

object LocalPi {

  val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("test")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(conf)
    val input = sc.parallelize(List("akka netflix", "production", "ata", "abc"))
    // map() produces exactly one output element per input element;
    // the identity function keeps each string as it is
    val result = input.map(x => x)
    println(result.collect().mkString(","))
  }
}

The result:

akka netflix,production,ata,abc
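The identity function x => x returns each element unchanged, which keeps the example minimal; in practice map() is used to transform each element. Here is a small sketch of a more typical use, reusing the same input RDD (the uppercase transformation is just an illustrative choice):

// map() emits exactly one output element per input element
val upper = input.map(x => x.toUpperCase)
println(upper.collect().mkString(","))
// prints: AKKA NETFLIX,PRODUCTION,ATA,ABC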

When each element is itself a traversable value, such as a String, which is a collection of characters, we can use flatMap instead. Here it flattens each string into its individual characters:

import org.apache.spark.{SparkConf, SparkContext}

object LocalPi {

  val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("test")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(conf)
    val input = sc.parallelize(List("akka netflix", "production", "ata", "abc"))
    // flatMap() flattens each string into its individual characters,
    // since a String is itself a collection of Chars
    val result = input.flatMap(x => x)
    println(result.collect().mkString(","))
  }
}

The result will be shown as:

a,k,k,a, ,n,e,t,f,l,i,x,p,r,o,d,u,c,t,i,o,n,a,t,a,a,b,c
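Splitting strings into characters shows the flattening clearly, but the more common flatMap pattern is tokenizing lines into words. A minimal sketch under the same setup (sc is the SparkContext from the examples above; split(" ") is just an illustrative tokenizer):

// flatMap() can emit zero or more output elements per input element
val lines = sc.parallelize(List("akka netflix", "production", "ata", "abc"))
val words = lines.flatMap(line => line.split(" "))
println(words.collect().mkString(","))
// prints: akka,netflix,production,ata,abc

Note how "akka netflix" contributes two elements to the output while the other strings contribute one each, which map() could not do.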

Hope it helps

~~PEACE~~

References

Learning Spark: Lightning-Fast Big Data Analysis (O'Reilly)
