Import and Ingest Data Into HDFS Using Kafka in StreamSets
Learn about reading data from different data sources such as Amazon Simple Storage Service (S3) and flat files, and writing the data into HDFS using Kafka in StreamSets.
StreamSets provides state-of-the-art data ingestion to easily and continuously ingest data from various origins such as relational databases, flat files, and AWS, and to write it to systems such as HDFS, HBase, and Solr. Its configuration-driven UI helps you design data ingestion pipelines in minutes. Data is routed, transformed, and enriched during ingestion, making it ready for consumption and delivery to downstream systems.
Kafka, used as an intermediate data store, makes it easy to replay ingestion, consume the same dataset across multiple applications, and perform data analysis.
In this blog, let's discuss reading data from different data sources such as Amazon Simple Storage Service (S3) and flat files, and writing that data into HDFS using Kafka in StreamSets.
Prerequisites
- Install Java 1.8
- Install streamsets-datacollector-2.6.0.1
Use Case
Import and ingest data from different data sources into HDFS using Kafka in StreamSets.
Data Description
Network data from outdoor field sensors is used as the source. Additional fields, dummy data, empty data, and duplicate data were added to the source file. The dataset has a total record count of 600K, including 3.5K duplicate records.
Sample data:
{"ambient_temperature":"16.70","datetime":"Wed Aug 30 18:42:45 IST
2017","humidity":"76.4517","lat":36.17,"lng":-
119.7462,"photo_sensor":"1003.3","radiation_level":"201","sensor_id":"c6698873b4f14b995c9e66ad0d8f29e3","
sensor_name":"California","sensor_uuid":"probe-2a2515fc","timestamp":1504098765}
Synopsis
- Read data from the local file system and produce data to Kafka.
- Read data from Amazon S3 and produce data to Kafka.
- Consume streaming data produced by Kafka.
- Remove duplicate records.
- Persist data into HDFS.
- View data loading statistics.
Reading Data From Local File System and Producing Data to Kafka
To read data from the local file system, perform the following:
- Create a new pipeline.
- Configure the File Directory origin to read files from a directory.
- Set Data Format as JSON and JSON content as Multiple JSON objects.
- Use the Kafka Producer processor to produce data into Kafka. (Note: If no Kafka processors are available, install the Apache Kafka package and restart SDC.)
- Produce the data under the topic sensor_data.
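For reference, a minimal standalone sketch of this step outside StreamSets is shown below, using the kafka-python client to read one JSON object per line from a local file and produce each record to the sensor_data topic. The file path and broker address are assumptions, not values from the original pipeline.

import json
from kafka import KafkaProducer

# Assumed broker address; replace with your Kafka broker URI.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Read multiple JSON objects, one per line, as the directory-based origin does.
with open("/data/sensors/sensor_data.json") as source:  # hypothetical input file
    for line in source:
        line = line.strip()
        if line:
            producer.send("sensor_data", value=json.loads(line))

producer.flush()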
Reading Data From Amazon S3 and Producing Data to Kafka
To read data from Amazon S3 and produce data into Kafka, perform the following:
- Create another pipeline.
- Use the Amazon S3 origin processor to read data from S3. (Note: If no Amazon S3 processors are available, install the Amazon Web Services 1.11.123 package from the Package Manager.)
- Configure the processor by providing the Access Key ID, Secret Access Key, Region, and Bucket name.
- Set the data format as JSON.
- Produce data under the same Kafka topic, sensor_data.
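The same step can be sketched outside StreamSets with boto3 and kafka-python: stream the JSON file from S3 and produce each record to the same sensor_data topic. The bucket name, object key, and broker address below are placeholders; credentials and region are assumed to come from your AWS configuration.

import json
import boto3
from kafka import KafkaProducer

s3 = boto3.client("s3")  # uses your configured AWS credentials and region
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Hypothetical bucket and key holding the sensor data file.
obj = s3.get_object(Bucket="sensor-data-bucket", Key="sensor_data.json")
for line in obj["Body"].iter_lines():
    if line:
        producer.send("sensor_data", value=json.loads(line))

producer.flush()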
Consuming Streaming Data Produced by Kafka
To consume streaming data produced by Kafka, perform the following:
- Create a new pipeline.
- Use the Kafka Consumer origin to consume the data produced to Kafka.
- Configure the processor by providing the following details:
- Broker URI
- ZooKeeper URI
- Topic: Set the topic name to sensor_data (the same topic produced to in the previous sections)
- Set the data format as JSON.
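In the same spirit, a minimal kafka-python consumer for the sensor_data topic might look like the sketch below. The broker address and consumer group are assumptions; note that this client connects to the brokers directly rather than through ZooKeeper.

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sensor_data",
    bootstrap_servers="localhost:9092",            # Broker URI (assumed)
    group_id="sensor-data-consumers",              # hypothetical consumer group
    auto_offset_reset="earliest",                  # start from the beginning of the topic
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    record = message.value
    print(record["sensor_id"], record["ambient_temperature"])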
Removing Duplicate Records
To remove duplicate records using the Record Deduplicator processor, perform the following:
- Under the Deduplication tab, configure the following settings to compare records and find duplicates:
- Max. Records to Compare
- Time to Compare
- Compare
- Fields to Compare (for example, find duplicates based on sensor_id and sensor_uuid)
- Move the duplicate records to Trash.
- Store the unique records in HDFS.
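The idea behind the deduplication can be illustrated with a short Python sketch: records sharing the same sensor_id and sensor_uuid are treated as duplicates, the first occurrence is kept, and the rest are dropped. This is only an illustration of the logic, not the Record Deduplicator's implementation, and unlike the processor it keeps every key in memory rather than bounding the comparison window.

def deduplicate(records):
    # Keep the first record seen for each (sensor_id, sensor_uuid) pair.
    seen = set()
    for record in records:
        key = (record.get("sensor_id"), record.get("sensor_uuid"))
        if key in seen:
            continue       # duplicate: discarded (sent to Trash in the pipeline)
        seen.add(key)
        yield record       # unique: passed downstream (written to HDFS)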
Persisting Data into HDFS
To load data into HDFS, perform the following:
- Configure the Hadoop FS destination processor from the HDP 2.6 stage library.
- Select data format as JSON. (Note: The core-site.xml and hdfs-site.xml files are placed in the hadoop-conf directory, /var/lib/sdc-resources/hadoop-conf. The sdc-resources directory is created when StreamSets is installed.)
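To spot-check the files written by the Hadoop FS destination, a small sketch using the third-party hdfs Python package (over WebHDFS) is shown below. The NameNode URL, user, and output directory are placeholders and assume WebHDFS is enabled on the cluster.

import json
from hdfs import InsecureClient

client = InsecureClient("http://namenode-host:50070", user="sdc")  # assumed WebHDFS endpoint
output_dir = "/user/sdc/sensor_data"                               # hypothetical output directory

# Peek at the first output file and print a couple of fields per record.
first_file = client.list(output_dir)[0]
with client.read(f"{output_dir}/{first_file}", encoding="utf-8", delimiter="\n") as reader:
    for line in reader:
        if line:
            record = json.loads(line)
            print(record["sensor_id"], record["sensor_name"])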
Viewing Data Loading Statistics
Data loading statistics, after removing duplicates from different sources, look as follows:
Published at DZone with permission of Rathnadevi Manivannan.