Member-only story

Flatmap vs map in Apache Spark

2 min readDec 31, 2020

Sometimes we want to produce multiple output elements for each input element. The operation to do this is called flatMap() . As with map() , the function we provide to flatMap() is called individually for each element in our input RDD. Instead of returning a single element, we return an iterator with our return values. Rather than producing an RDD of iterators, we get back an RDD that consists of the elements from all of the iterators.

For example, to print to the screen each string using map:

object LocalPi {

  val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("test")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(conf)
    val input = sc.parallelize(List("akka netflix", "production", "ata", "abc"))
    val result = input.map(x => x)
    println(result.collect().mkString(","))
  }

}

Show the result

akka netflix,production,ata,abc

When we need to work with traversable value, we will use flatMap, which will divide each word into multiple characters.

object LocalPi {

  val conf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("test")

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(conf)
    val input = sc.parallelize(List("akka netflix", "production", "ata", "abc"))
    val result = input.flatMap(x => x)
    println(result.collect().mkString(","))
  }

}

The result will be shown as:

a,k,k,a, ,n,e,t,f,l,i,x,p,r,o,d,u,c,t,i,o,n,a,t,a,a,b,c

Hope it helps

~~PEACE~~

References

Learning Spark Lightning-Fast Big Data Analysis

Flatmap vs map in Apache Spark

References

Written by Donald Le

No responses yet