Spark Tutorial: Validating Data in a Spark DataFrame - Part One
There's more than one way to skin a cat... four easy methods to validate data in a Spark DataFrame.
Recently, while working on a modular, metadata-based ingestion engine that I am developing using Spark, we got into a discussion about data validation. Data validation is a natural next step after data ingestion, which is how we arrived at the topic.
You might be wondering, "What is so special about data validation? Is it because of Spark?" Spark is part of the reason for this article, but more importantly, the exercise demonstrates the power of Spark and illustrates the principle that there is more than one way to achieve the same goal.
The task at hand was pretty simple: we wanted to create a flexible and reusable library of classes that would make data validation (over Spark DataFrames) a breeze. In this article, I will cover a couple of techniques/idioms used for data validation. In particular, I will use the null check (are the contents of a column null?). To keep things simple, I will assume that the data to be validated has been loaded into a Spark DataFrame named "df".
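To make the examples easy to try, here is a minimal sketch of how such a DataFrame could be set up. The SparkSession, the column names, and the sample rows are assumptions for illustration only; in practice, "df" would come from your ingestion pipeline.
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// Assumed setup for illustration: a local SparkSession and a tiny DataFrame
// with a nullable "name" column.
val spark = SparkSession.builder()
  .appName("DataValidationExamples")
  .master("local[*]")
  .getOrCreate()

import spark.implicits._

val df = Seq(
  (1, Some("Alice")),
  (2, None),          // invalid record: name is null
  (3, Some("Bob"))
).toDF("id", "name")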
Method One: Filtering
One of the simplest methods of performing validation is to filter out the invalid records. The call to do so is:
val newDF = df.filter(col("name").isNull)
A variant of this technique is:
val newDF = df.filter(col("name").isNull).withColumn("nameIsNull", lit(true))
This technique is overkill, primarily because all the records in newDF
are those records where the name column is null. Hence, adding a new column with a "true" value is totally unnecessary, as every row will have 'true' in that column.
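If the goal is to quarantine the bad rows rather than flag them, a natural companion step (a sketch, not from the original text) is to keep the complementary set of valid records as well:
// Sketch: split the DataFrame into invalid (null name) and valid (non-null name) subsets.
val invalidDF = df.filter(col("name").isNull)
val validDF   = df.filter(col("name").isNotNull)
println(s"Invalid rows: ${invalidDF.count()}, valid rows: ${validDF.count()}")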
Method Two: When/Otherwise
The second technique is to use the "when" and "otherwise" constructs.
// Note: use isNull and === "" (Column operations); a plain === null or .equals("") would not work as intended.
val newDF = df.withColumn("nameIsNull",
  when(col("name").isNull || col("name") === "", true).otherwise(false))
This method adds a new column that indicates the result of the null (or empty) check for the name column. With this technique, the new column will contain either "true" or "false" for every row, depending on the contents of the name column.
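The same pattern also scales to several columns. The sketch below is an assumption built on top of the article's example: it folds a list of column names over the DataFrame and adds one flag column per input column.
// Sketch: apply the same when/otherwise null-or-empty check to a list of columns.
// Only "name" is used here; add further column names as needed.
val columnsToCheck = Seq("name")
val flaggedDF = columnsToCheck.foldLeft(df) { (acc, c) =>
  acc.withColumn(s"${c}IsNull",
    when(col(c).isNull || col(c) === "", true).otherwise(false))
}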
Method Three: Using Expr
Another technique is to use the "expr" feature.
val newDF = df.withColumn("nameIsNull", expr("case when name is null then true else false end"))
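If you prefer to stay entirely in SQL syntax, the same check can also be written with selectExpr; this is just a sketch of an equivalent call, not part of the original example.
// Sketch: the same null check expressed through selectExpr.
val exprDF = df.selectExpr("*", "case when name is null then true else false end as nameIsNull")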
Method Four: Overkill
Now, look at this technique.
var dfFinal: DataFrame = null
val dfTrue = df.filter(col("name").isNull).withColumn("nameIsNull", lit(true))
val dfFalse = df.filter(col("name").isNotNull).withColumn("nameIsNull", lit(false))
if (dfTrue != null) {
  dfFinal = dfTrue.union(dfFalse)
} else {
  dfFinal = dfFalse
}
While valid, this technique is clearly overkill. Not only is it more elaborate than the previous methods, but it also does double the work: it scans the DataFrame twice, once to evaluate the "true" condition and once more to evaluate the "false" condition.
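One way to see this double work is to look at the physical plan; for a file-backed DataFrame, the union of the two filters will typically show the source being scanned twice. The row-count sanity check below is an added sketch, not part of the original example.
// Sketch: inspect the plan and confirm the split-and-union did not lose or duplicate rows.
dfFinal.explain()
assert(dfFinal.count() == df.count())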
In this article, I have covered a few techniques that can be used to achieve the simple task of checking whether a Spark DataFrame column contains null. These techniques not only illustrate the flexibility of Spark, but also prove the point that the same end goal can be reached in multiple ways.
Obviously, there are some tradeoffs, but sometimes we may want to choose a method that is simple to understand and maintain, rather than use a technique just because the API provides it. In the next article, I will cover how something similar can be achieved using UDFs.