
Apache Spark: Read a File from the Hadoop File System

Donald Le
Dec 30, 2020



The default path for the Hadoop file system is configured in core-site.xml, for example:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://host:port</value>
  </property>
</configuration>

To read the file from Spark, we need a SparkContext.

import org.apache.spark.SparkContext
val sc = SparkContext.getOrCreate()
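
In newer Spark versions (2.x and later), the usual entry point is a SparkSession, from which the underlying SparkContext can be obtained. A minimal sketch, assuming a session is being built from scratch (the app name below is illustrative, not from this post):

import org.apache.spark.sql.SparkSession

// Build (or reuse) a SparkSession, then take its underlying SparkContext
val spark = SparkSession.builder()
  .appName("hdfs-read-example") // hypothetical name, for illustration only
  .getOrCreate()
val sc = spark.sparkContext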

Then we can get a reference to the text file by passing its HDFS path:

val textFile = sc.textFile("hdfs://host:9000/user/ubuntu/books/alice.txt")
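
Because fs.defaultFS already points at the cluster, the scheme and authority can usually be omitted from the path. A minimal sketch, assuming fs.defaultFS is set to hdfs://host:9000 in the core-site.xml shown above:

// Equivalent read: Spark resolves the bare path against fs.defaultFS
val textFileShort = sc.textFile("/user/ubuntu/books/alice.txt")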

For example, get the first line of textFile:

textFile.first()
String = The Project Gutenberg EBook of Alice’s Adventures in Wonderland, by Lewis Carroll
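
From here the RDD supports the usual transformations and actions. As a quick sketch of further processing over the same file (a classic word count, printing only the five most frequent words to keep output small):

// Split lines into words, count occurrences, and print the top 5
val counts = textFile
  .flatMap(_.split("\\s+"))
  .filter(_.nonEmpty)
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.sortBy(_._2, ascending = false).take(5).foreach(println)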

Happy coding ~~
