Accessing Hadoop HDFS Data Using Node.js and the WebHDFS REST API
HDFS files are a popular means of storing data. Learn how to use Node.js and the WebHDFS RESTful API to manipulate HDFS data stored in Hadoop.
Apache Hadoop exposes services for accessing and manipulating HDFS content with the help of the WebHDFS REST API. To check out the official documentation, click here.
Available Services
Below are the available services (the common URL pattern they share is sketched after this list):
1) File and Directory Operations
1.1 Create and Write to a File: CREATE (HTTP PUT)
1.2 Append to a File: APPEND (HTTP POST)
1.3 Open and Read a File: OPEN (HTTP GET)
1.4 Make a Directory: MKDIRS (HTTP PUT)
1.5 Rename a File/Directory: RENAME (HTTP PUT)
1.6 Delete a File/Directory: DELETE (HTTP DELETE)
1.7 Status of a File/Directory: GETFILESTATUS (HTTP GET)
1.8 List a Directory: LISTSTATUS (HTTP GET)
2) Other File System Operations
2.1 Get Content Summary of a Directory: GETCONTENTSUMMARY (HTTP GET)
2.2 Get File Checksum: GETFILECHECKSUM (HTTP GET)
2.3 Get Home Directory: GETHOMEDIRECTORY (HTTP GET)
2.4 Set Permission: SETPERMISSION (HTTP PUT)
2.5 Set Owner: SETOWNER (HTTP PUT)
2.6 Set Replication Factor: SETREPLICATION (HTTP PUT)
2.7 Set Access or Modification Time: SETTIMES (HTTP PUT)
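All of these operations share the same request URL pattern; the operation is selected with the op query parameter, and user.name is optional depending on your cluster's security setup. A minimal sketch of the pattern, with placeholders for host, port, path, and user:
http://<HOST>:<HTTP_PORT>/webhdfs/v1/<HDFS_PATH>?op=<OPERATION>&user.name=<USER>
For example, listing a directory simply means issuing an HTTP GET with ?op=LISTSTATUS on that directory's path.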
Enabling the WebHDFS API
Make sure the config parameter dfs.webhdfs.enabled is set to true in the hdfs-site.xml file (this config file can be found inside {your_hadoop_home_dir}/etc/hadoop).
<configuration>
<property>
.....
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>
Connecting to WebHDFS From Node.js
I am hoping you are familiar with Node.js and package installation; if not, please go through the basics first. There is an npm module, "webhdfs," that provides a wrapper around the Hadoop WebHDFS APIs. You can install the webhdfs package using npm:
npm install webhdfs
After the above step, you can write a Node.js program to access this API. Below are a few steps to help you out.
Import Dependent Modules
Below are external modules to be imported:
const WebHDFS = require("webhdfs");
const request = require("request");
Prepare Connection URL
Let us prepare the connection URL:
let url = "http://<<your hdfs host name here>>";
let port = 50070; //change here if you are using a different port
let dir_path = "<<path to hdfs folder>>";
let path = "/webhdfs/v1/" + dir_path + "?op=LISTSTATUS&user.name=hdfs";
let full_url = url+':'+port+path;
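For example, with a hypothetical host of namenode.example.com and a dir_path of user/hdfs/data, full_url would resolve to something like:
http://namenode.example.com:50070/webhdfs/v1/user/hdfs/data?op=LISTSTATUS&user.name=hdfs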
List a Directory
Access the API and get the results:
request(full_url, function(error, response, body) {
  if (!error && response.statusCode == 200) {
    console.log(".. response body..", body);
    let jsonStr = JSON.parse(body);
    let myObj = jsonStr.FileStatuses.FileStatus;
    let objLength = Object.entries(myObj).length;
    console.log("..Number of files in the folder: ", objLength);
  } else {
    console.log("..error occurred!..", error);
  }
});
Here is the sample request and response of the LISTSTATUS API:
https://hadoop.apache.org/docs/r1.0.4/webhdfs.html#LISTSTATUS
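For reference, the LISTSTATUS response is a JSON document containing a FileStatuses.FileStatus array, which is exactly what the code above parses. A trimmed sketch of the shape, with illustrative values only:
{
  "FileStatuses": {
    "FileStatus": [
      {
        "pathSuffix": "sample-file.txt",
        "type": "FILE",
        "length": 24930,
        "owner": "hdfs",
        "group": "supergroup",
        "permission": "644",
        "replication": 1,
        "blockSize": 134217728,
        "modificationTime": 1320173277227,
        "accessTime": 1320171722771
      }
    ]
  }
}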
Get and Display Content of an HDFS File
Assign the HDFS file name with a path:
let hdfs_file_name = '<<HDFS file path>>' ;
The below code will connect to HDFS using the WebHDFS client instead of the request module we used in the above section:
let hdfs = WebHDFS.createClient({
  user: "<<user>>",
  host: "<<host/IP>>",
  port: 50070, //change here if you are using a different port
  path: "/webhdfs/v1"
});
The below code reads and displays the contents of an HDFS file:
let remoteFileStream = hdfs.createReadStream( hdfs_file_name );
remoteFileStream.on("error", function onError(err) { //handles error while read
// Do something with the error
console.log("...error: ", err);
});
let dataStream = [];
remoteFileStream.on("data", function onChunk(chunk) { //on read success
// Do something with the data chunk
dataStream.push(chunk);
console.log('..chunk..',chunk);
});
remoteFileStream.on("finish", function onFinish() { //on read finish
console.log('..on finish..');
console.log('..file data..',dataStream);
});
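The chunks collected in dataStream are Node.js Buffers, so if you want the file contents as a single string, the finish handler above could instead concatenate and decode them. A minimal sketch:
remoteFileStream.on("finish", function onFinish() { //on read finish
  // Join the buffered chunks and decode them as UTF-8 text
  let fileData = Buffer.concat(dataStream).toString("utf8");
  console.log('..file data..', fileData);
});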
Here is the sample request and response of the OPEN API:
https://hadoop.apache.org/docs/r1.0.4/webhdfs.html#OPEN
How to Read All Files in a Directory
This is not straightforward, as there is no direct method, but we can achieve it by combining the two operations above: list the directory, then read the files in that directory one by one, as sketched below.
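Here is a minimal sketch, assuming full_url, dir_path, and the hdfs client are the same variables defined in the sections above, with error handling kept to a minimum:
request(full_url, function(error, response, body) { //LISTSTATUS request from above
  if (error || response.statusCode != 200) {
    return console.log("..error listing directory..", error);
  }
  let files = JSON.parse(body).FileStatuses.FileStatus;
  files.forEach(function(file) {
    if (file.type !== "FILE") { return; } //skip sub-directories
    let file_path = dir_path + "/" + file.pathSuffix;
    let stream = hdfs.createReadStream(file_path); //OPEN each file via the WebHDFS client
    stream.on("error", function(err) {
      console.log("..error reading " + file_path + "..", err);
    });
    stream.on("data", function(chunk) {
      console.log(".." + file_path + " chunk..", chunk.toString());
    });
  });
});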
Conclusion
I hope this gave you an idea of how to connect to HDFS and perform basic operations using Node.js and the WebHDFS module. All the best!