
Apache Spark: Read a File from the Hadoop File System

Donald Le
Dec 30, 2020



The default path for the Hadoop file system is configured in core-site.xml, for example:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://host:port</value>
  </property>
</configuration>

To read the file from Spark, we need a SparkContext.

import org.apache.spark.SparkContext
val sc = SparkContext.getOrCreate()
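
In newer Spark versions (2.x and later), the usual entry point is a SparkSession, from which the underlying SparkContext can be obtained. A minimal sketch, assuming a session is being built from scratch (the app name below is illustrative, not from this post):

import org.apache.spark.sql.SparkSession

// Build (or reuse) a SparkSession, then take its underlying SparkContext
val spark = SparkSession.builder()
  .appName("hdfs-read-example") // hypothetical name, for illustration only
  .getOrCreate()
val sc = spark.sparkContext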

Then we can get a reference to the text file by passing its HDFS path:

val textFile = sc.textFile("hdfs://host:9000/user/ubuntu/books/alice.txt")
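
Because fs.defaultFS already points at the cluster, the scheme and authority can usually be omitted from the path. A minimal sketch, assuming fs.defaultFS is set to hdfs://host:9000 in the core-site.xml shown above:

// Equivalent read: Spark resolves the bare path against fs.defaultFS
val textFileShort = sc.textFile("/user/ubuntu/books/alice.txt")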

For example, get the first line of textFile:

textFile.first()
String = The Project Gutenberg EBook of Alice’s Adventures in Wonderland, by Lewis Carroll
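
From here the RDD supports the usual transformations and actions. As a quick sketch of further processing over the same file (a classic word count, printing only the five most frequent words to keep output small):

// Split lines into words, count occurrences, and print the top 5
val counts = textFile
  .flatMap(_.split("\\s+"))
  .filter(_.nonEmpty)
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.sortBy(_._2, ascending = false).take(5).foreach(println)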

Happy coding ~~
