Datafaker: An Alternative to Using Production Data

As developers or testers, we frequently have the need to test our systems. But getting access to realistic or useful data isn't always easy.

Erik Pragt

May. 22, 22 · Tutorial

As developers or testers, we frequently need to test our systems. In this process, be it unit testing, integration testing, or any other form of testing, the data is often the leading and deciding factor. But getting access to good data isn't always easy. Sometimes the data is quite sensitive, like medical or financial data. At other times, there's not enough data (for example, when attempting a load test), or the data you're looking for is hard to find. For cases like these, there's a solution: Datafaker.

Datafaker is a JVM library for generating production-like fake data. This data can be generated as part of your unit tests or in the form of external files, such as CSV or JSON files, so it can serve as the input to other systems. This article will show you what Datafaker is, what it can do, and how you can use it effectively to improve your testing strategy.

What Is Datafaker?

Datafaker is a library written in Java and can be used from popular JVM languages such as Java, Kotlin, or Groovy. It started as a fork of the no longer maintained Javafaker, but it has seen many improvements since its inception. Datafaker consists of a core that handles the generation of data, and on top of that has a wide variety of domain-specific data providers. Such providers can be very useful, for example, to generate real-looking addresses, names, phone numbers, credit cards, and other data, or are sometimes a bit more on the light side, such as when generating the characters of the TV show Friends or the IT Crowd. No matter your use case, there's a high chance that Datafaker can provide data to your application. And when there's no provider of data available, Datafaker offers a pluggable system to create your own providers!

How to Use Datafaker

Datafaker is published to Maven Central on a regular basis, so the easiest way to get started with Datafaker is to use a dependency management tool like Maven or Gradle. To get started with Datafaker using Maven, you can include the dependency as follows:

    XML

    <dependency>
        <groupId>net.datafaker</groupId>
        <artifactId>datafaker</artifactId>
        <version>1.4.0</version>
    </dependency>

Above, we're using version 1.4.0, the latest version at the time of writing this article. To make sure you're using the latest version, please check Maven Central.

Once the library has been included in your project, the easiest way to generate data is as follows:

    Java

    import net.datafaker.Faker;

    Faker faker = new Faker();
    System.out.println(faker.name().fullName()); // Prints a random name, e.g. Vicky Nolan

If you need more information, there's an excellent getting started with Datafaker guide in the Datafaker documentation.

A few things are going on here that may not be immediately visible. Whenever you run the above code, it prints a random full name, consisting of a first name and a last name. Because the example uses the default locale (English) and a random seed, the name will be different on every run. But if we want something a bit more predictable, perhaps in a different language, we can do the following:

    Java

    import java.util.Locale;
    import java.util.Random;

    long seed = 1;
    Faker faker = new Faker(new Locale("nl"), new Random(seed));
    System.out.println(faker.name().fullName());

In the above example, we generate a random Dutch full name, but since we're using a fixed seed now, we know that no matter how often we run our code, the program will produce the same random values on every run. This helps a great deal if we want our test data to be slightly more repeatable, for example when we're doing a regression test.
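The effect of a fixed seed can be seen without Datafaker at all. The following is a minimal sketch using only `java.util.Random`; the `SeededNames` class and its list of names are hypothetical stand-ins for a data provider and are not part of Datafaker's API. Two generators created with the same seed pick the same sequence of values:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical stand-in for a data provider; not part of Datafaker.
final class SeededNames {
    private static final String[] NAMES = {"Anna", "Bram", "Daan", "Emma", "Sanne"};

    // Picks `count` names using the given source of randomness.
    static List<String> pick(Random random, int count) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            result.add(NAMES[random.nextInt(NAMES.length)]);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> firstRun = pick(new Random(1L), 5);
        List<String> secondRun = pick(new Random(1L), 5);
        // Same seed, same sequence of picks.
        System.out.println(firstRun.equals(secondRun)); // prints "true"
    }
}
```

Passing the `Random` in from the outside, as the `Faker` constructor above does, is what lets a test choose between fresh data on every run and a fixed, repeatable data set.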

While the above example shows how to generate names, it's possible to generate a very wide range of random data. Examples of these are addresses, phone numbers, credit cards, colors, codes, etc. A full list of these can be found in the documentation (https://www.datafaker.net/documentation/providers/). Besides these, Datafaker also provides more technical options, such as random enums and lists, to make it easier to generate your random test data.

Custom Collections

In case you need to generate a larger set of test data, Datafaker provides several options to do so. One of these options is to use Fake Collections. Fake collections allow the creation of large sets of data in memory by providing a set of data suppliers to the collection method. This is best demonstrated using an example:

    Java

    List<String> names = faker.<String>collection()
        .suppliers(
            () -> faker.name().firstName(),
            () -> faker.name().lastName())
        .minLen(5)
        .maxLen(10)
        .build().get();

The above will create a collection of Strings with at least 5 elements, but with a maximum of 10 elements. Each element will either be a first name or a last name. It's possible to create many variations of the above, and similar examples are possible even when the data types are different:

    Java

    List<Object> data = faker.collection()
        .suppliers(
            () -> faker.date().future(10, TimeUnit.DAYS),
            () -> faker.medical().hospitalName(),
            () -> faker.number().numberBetween(10, 50))
        .minLen(5)
        .maxLen(10)
        .build().get();

    System.out.println(data);

This will generate a list of Objects, since the `future`, `hospitalName` and `numberBetween` generators all have different return types.
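To make the idea behind fake collections more concrete, here is a conceptual sketch in plain Java. It is not Datafaker's implementation, which may differ: the `CollectionSketch` class and its `build` method are hypothetical, and treating the maximum length as inclusive is an assumption. The sketch picks a random size within the bounds and fills each slot from a randomly chosen supplier:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Supplier;

// Conceptual sketch only; Datafaker's own implementation may differ.
final class CollectionSketch {
    // Builds a list whose size lies between minLen and maxLen (both inclusive),
    // filling each slot from a randomly chosen supplier.
    static <T> List<T> build(Random random, int minLen, int maxLen,
                             List<Supplier<T>> suppliers) {
        int size = minLen + random.nextInt(maxLen - minLen + 1);
        List<T> result = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            Supplier<T> supplier = suppliers.get(random.nextInt(suppliers.size()));
            result.add(supplier.get());
        }
        return result;
    }

    public static void main(String[] args) {
        Random random = new Random();
        // Suppliers with different return types force the element type to Object.
        List<Supplier<Object>> suppliers = new ArrayList<>();
        suppliers.add(() -> "a string");
        suppliers.add(() -> random.nextInt(50));
        System.out.println(build(random, 5, 10, suppliers));
    }
}
```

As in the Datafaker example, mixing suppliers that return a String and an Integer means the resulting list can only be typed as a list of Objects.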

Custom Providers

While Datafaker provides a lot of generators out of the box, it's possible that a generator you need is missing, or that some of the generators work slightly differently from what your use case needs. To support cases like this, it's possible to create your own data provider, either by providing a YML configuration file or by hardcoding the possible values in your code.

To create a provider of data, there are two steps involved: creating the data provider and registering the data provider in your custom Faker. An example can be found below, in which we'll create a specific provider for generating turtle names:

    Java

    class Turtle {
        private static final String[] TURTLE_NAMES = new String[]{"Leonardo", "Raphael", "Donatello", "Michelangelo"};
        private final Faker faker;

        public Turtle(Faker faker) {
            this.faker = faker;
        }

        public String name() {
            return TURTLE_NAMES[faker.random().nextInt(TURTLE_NAMES.length)];
        }
    }

Since the methods to access the built-in providers are defined on the Faker class itself, we need to create our own custom Faker class, which will extend the original Faker class so we can use all existing data providers, plus our own:

    Java

    class MyCustomFaker extends Faker {
        public Turtle turtle() {
            return getProvider(Turtle.class, () -> new Turtle(this));
        }
    }

Using the custom faker is similar to what we've seen before:

    Java

    MyCustomFaker faker = new MyCustomFaker();
    System.out.println(faker.turtle().name());

If you want to know more about creating your own provider, or using YML files to provide the data, the Datafaker custom provider documentation provides more information on this subject.

Exporting Data

Sometimes, you want to do more than generate the data in memory, and you might need to provide some data to an external program. A commonly used approach for this would be to provide the data in CSV files. Datafaker provides such a feature out of the box, and besides generating CSV files, it also has the option to generate JSON, YML, or XML files without the need for external libraries. Creating such data is similar to creating collections of data, which we've seen above.

Files can be generated in several ways. The simplest case is a document in which every column is filled with independent random data. To generate CSV output of this kind, use the `toCsv` method of the `Format` class. An example can be found below:

    Java

    System.out.println(
        Format.toCsv(
                Csv.Column.of("first_name", () -> faker.name().firstName()),
                Csv.Column.of("last_name", () -> faker.name().lastName()),
                Csv.Column.of("address", () -> faker.address().streetAddress()))
            .header(true)
            .separator(",")
            .limit(5).build().get());

In the example above, 5 rows of data are generated, and each row consists of a first name, last name, and street address. It's possible to customize the generation of the CSV, for example by including or excluding the header, or by using a different separator character. More information on the different options, and examples of how to generate XML, YML, or JSON files, can be found in the Datafaker file formats documentation.
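The example above only prints the CSV. When another program needs to read it, the generated String can be written to disk with the JDK alone. The following is a minimal sketch under stated assumptions: `CsvFileWriter` and `writeToTempFile` are hypothetical names, and the literal CSV text is a placeholder for the String produced by `Format.toCsv(...)`:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not part of Datafaker.
final class CsvFileWriter {
    // Writes the given CSV text to a new temporary file and returns its path.
    static Path writeToTempFile(String csv) throws IOException {
        Path file = Files.createTempFile("fake-data", ".csv");
        Files.write(file, csv.getBytes(StandardCharsets.UTF_8));
        return file;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder text; in practice this would be the String returned by Format.toCsv(...).
        String csv = "\"first_name\",\"last_name\"\n\"Vicky\",\"Nolan\"\n";
        Path file = writeToTempFile(csv);
        System.out.println("Wrote " + Files.size(file) + " bytes to " + file);
    }
}
```

Writing with an explicit charset keeps the file readable by the consuming system regardless of the platform default encoding.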

Exporting Data With Some Constraints

CSV can also be generated conditionally, for cases where there are constraints between the values in a row. Imagine we want to generate a document containing a person's name, a field of interest, and a sample from that field. For the sake of simplicity, we'll consider two fields of interest: Music and Food. For "Music" we want the sample to be a music genre, and for "Food" we want it to be a dish, for example:

    Plain Text

    "name";"field";"sample"
    "Le Ferry";"Music";"Funk"
    "Mrs. Florentino Schuster";"Food";"Scotch Eggs"

To do that, we need to generate a collection of such data.

First, let's create the rules for generating the objects, for instance:

    Java

    class Data {
        private Faker faker = new Faker();

        private String name;
        private String field;
        private String interestSample;

        public Data() {
            name = faker.name().name();
            field = faker.options().option("Music", "Food");
            switch (field) {
                case "Music": interestSample = faker.music().genre(); break;
                case "Food": interestSample = faker.food().dish(); break;
            }
        }

        public String getName() {
            return name;
        }

        public String getField() {
            return field;
        }

        public String getInterestSample() {
            return interestSample;
        }
    }

Now we can use the Data class to generate CSV data, as demonstrated below:

    Java

    String csv = Format.toCsv(
            new Faker().<Data>collection()
                .suppliers(Data::new)
                .maxLen(10)
                .build())
        .headers(() -> "name", () -> "field", () -> "sample")
        .columns(Data::getName, Data::getField, Data::getInterestSample)
        .separator(";")
        .header(true)
        .build().get();

This will generate a CSV string with headers and columns containing random data, but with constraints between the columns we specified.

Conclusion

This article gave an overview of some of the options provided by Datafaker and how Datafaker can help in addressing your testing needs.

For suggestions, bugs, or other feedback, head over to the Datafaker project site and feel free to leave some feedback.
