Predictive Services Development With Java and Weka

In this article, we briefly describe how to create predictive services with Java and the Weka library (Waikato Environment for Knowledge Analysis, developed at the University of Waikato).

Aleksei Chaika

May. 03, 23 · Tutorial

We will develop a Java application around a real-life example: predicting real estate prices in Seattle from three parameters (area, distance from the center, and number of bedrooms). The application will learn from a set of known prices, and we will then analyze its predictions.

The city of Seattle was chosen arbitrarily for this article.

Brief Introduction to Machine Learning in Java

If you have never heard of machine learning, or you're simply afraid of it, please don't be :)

Machine learning is a rapidly growing field of artificial intelligence (AI) that involves the development of algorithms and models that enable computers to learn and make predictions or decisions based on data. Machine learning could be used for image and object recognition, natural language processing (NLP), customer relationship management, recommendation systems, predictive maintenance, and much more.

Indeed, it is not that easy to develop everything from scratch, but Java provides a rich ecosystem of libraries and tools that can be leveraged for machine learning tasks, such as data preprocessing, feature engineering, model training, and evaluation.

Here are just some of the tools that help developers implement machine learning algorithms and build predictive models: Deeplearning4j (DL4J, Deep Learning for Java), MLlib (Apache Spark), Smile (Statistical Machine Intelligence and Learning Engine), Encog (a neural network and machine learning framework), JOONE (Java Object Oriented Neural Engine), and Weka (Waikato Environment for Knowledge Analysis).

Development of Predictive Application

In this article, we will develop an application responsible for predicting real estate prices in Seattle depending on parameters (area, distance from the center, and bedrooms).

We will use a machine learning library called Weka (Waikato Environment for Knowledge Analysis, The University of Waikato), which provides a wide range of machine learning algorithms and tools for data mining, feature selection, and model evaluation. Weka is open-source software that is widely used in both academic and industry settings for a variety of machine learning tasks.

1. Application Dependencies

First, to use Weka in a Java application, we have to add it to the classpath. The required artifacts are available in the Maven repository; add the dependency to the application's build file:

Using Maven:

XML

<dependencies>
    <!-- https://mvnrepository.com/artifact/nz.ac.waikato.cms.weka/weka-stable -->
    <dependency>
        <groupId>nz.ac.waikato.cms.weka</groupId>
        <artifactId>weka-stable</artifactId>
        <version>3.8.6</version>
    </dependency>
</dependencies>

Using Gradle:

Groovy

// https://mvnrepository.com/artifact/nz.ac.waikato.cms.weka/weka-stable
implementation group: 'nz.ac.waikato.cms.weka', name: 'weka-stable', version: '3.8.6'

2. Initial Dataset

Machine learning applications are mostly designed to autonomously learn from data and improve their performance over time.

But in this article, we will simply provide the application with an initial dataset, i.e., some real estate prices in Seattle. To do that, let's create an ARFF file and collect some data there.

% All comments have to start with '%' character

% Relation name
@relation prices

% Fields order
@attribute area numeric     % area, sqft
@attribute bedrooms numeric % bedrooms, bd
@attribute distance numeric % distance to the center, mi
@attribute price numeric    % price, $

% Declaring examples
@data
% Area, sqft    Bedrooms    Distance, mi    Price, $
3150            4           1.2             1349000
2290            3           1.2             1050000
2940            6           3.0             1850000
1107            2           3.2             729950
1122            2           6.6             599950
1350            2           6.9             598000
1390            3           12.2            449950
1660            5           12.3            619950
1540            4           12.9            1024900

Such a file can be placed in the application's resources folder or anywhere else the application can read it.

The file contains only nine records. This is enough for this article, but not for real-world applications. To achieve high accuracy, much more data should be provided, automated learning techniques should be used, and data should be collected continuously.

However, it's important to carefully evaluate and validate the results obtained from automated processes to ensure the reliability and accuracy of the machine learning models in real-world applications.
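Before training, it can help to sanity-check the dataset, for example by counting the records and checking how far from the center the data reaches. The sketch below is a simplified, plain-Java reader for the numeric rows of an ARFF text. It is not Weka's parser, and the class name ArffRows and its parse method are hypothetical names introduced only for this illustration. It assumes numeric attributes only, with values separated by whitespace or commas.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified reader for the numeric rows of an ARFF text (illustration only, not Weka's parser). */
final class ArffRows {

    /** Returns the numeric rows found after the @data line, ignoring '%' comments and blank lines. */
    static List<double[]> parse(String arffText) {
        List<double[]> rows = new ArrayList<>();
        boolean inData = false;
        for (String rawLine : arffText.split("\\R")) {
            // Drop everything after a '%' comment marker, then trim whitespace
            int comment = rawLine.indexOf('%');
            String line = (comment >= 0 ? rawLine.substring(0, comment) : rawLine).trim();
            if (line.isEmpty()) {
                continue;
            }
            if (!inData) {
                // Skip the header (@relation, @attribute) until the @data marker
                inData = line.equalsIgnoreCase("@data");
                continue;
            }
            // Values may be separated by commas and/or whitespace
            String[] tokens = line.split("[,\\s]+");
            double[] row = new double[tokens.length];
            for (int i = 0; i < tokens.length; i++) {
                row[i] = Double.parseDouble(tokens[i]);
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Two rows taken from the dataset above; column 2 is the distance in miles
        String sample = String.join("\n",
                "@relation prices",
                "@attribute area numeric",
                "@attribute bedrooms numeric",
                "@attribute distance numeric",
                "@attribute price numeric",
                "@data",
                "% Area  Bedrooms  Distance  Price",
                "3150 4 1.2 1349000",
                "1540 4 12.9 1024900");
        List<double[]> rows = parse(sample);
        double maxDistance = rows.stream().mapToDouble(r -> r[2]).max().orElse(Double.NaN);
        System.out.println("Records: " + rows.size());
        System.out.println("Farthest record, mi: " + maxDistance);
    }
}
```

Knowing the record count and the range of each attribute makes it easier to judge later which predictions fall inside the data the model has actually seen.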

3. Java Code

First, we have to train our application. In this article, we do that by passing it an ARFF file with the initial dataset. Let's load this data and set the class index to the price attribute:

Java

import weka.core.Instances;
import weka.core.converters.ConverterUtils;

// Load the Seattle real estate prices from the ARFF file
ConverterUtils.DataSource source = new ConverterUtils.DataSource("prices.arff");
Instances data = source.getDataSet();
// Use the last attribute (price) as the class, i.e. the value to predict
data.setClassIndex(data.numAttributes() - 1);

Second, we have to create a classifier. Let's use a linear regression classifier in this article:

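Java

import weka.classifiers.Classifier;
import weka.classifiers.functions.LinearRegression;
import weka.core.DenseInstance;
import weka.core.Instance;

// Create a classifier based on linear regression
Classifier classifier = new LinearRegression();

// Train the classifier on the dataset
classifier.buildClassifier(data);

// Create an Instance that will hold the inputs for predictions
Instance instance = new DenseInstance(data.numAttributes());
instance.setDataset(data);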
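To make the idea behind the classifier more concrete, here is a minimal plain-Java sketch of ordinary least squares, the textbook technique underlying linear regression. It is not Weka's implementation: Weka's LinearRegression has additional options such as attribute selection and ridge regularization, so its coefficients can differ. SimpleOls, fit, and predict are hypothetical names used only for this illustration, and the data in main is a toy example rather than the article's dataset.

```java
import java.util.Arrays;

/** Ordinary least squares via the normal equations (illustration only). */
final class SimpleOls {

    /**
     * Fits y = w[0] + w[1]*x1 + ... + w[k]*xk and returns the weights w.
     * Assumes the features are not perfectly collinear.
     */
    static double[] fit(double[][] features, double[] targets) {
        int n = features.length;
        int k = features[0].length + 1; // +1 for the intercept
        // Augmented matrix [X^T X | X^T y] of the normal equations
        double[][] a = new double[k][k + 1];
        for (int r = 0; r < n; r++) {
            double[] x = new double[k];
            x[0] = 1.0; // constant term for the intercept
            System.arraycopy(features[r], 0, x, 1, k - 1);
            for (int i = 0; i < k; i++) {
                for (int j = 0; j < k; j++) {
                    a[i][j] += x[i] * x[j];
                }
                a[i][k] += x[i] * targets[r];
            }
        }
        // Gauss-Jordan elimination with partial pivoting
        for (int col = 0; col < k; col++) {
            int pivot = col;
            for (int row = col + 1; row < k; row++) {
                if (Math.abs(a[row][col]) > Math.abs(a[pivot][col])) {
                    pivot = row;
                }
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            if (Math.abs(a[col][col]) < 1e-12) {
                throw new IllegalArgumentException("Features are collinear; cannot solve");
            }
            for (int row = 0; row < k; row++) {
                if (row == col) {
                    continue;
                }
                double factor = a[row][col] / a[col][col];
                for (int j = col; j <= k; j++) {
                    a[row][j] -= factor * a[col][j];
                }
            }
        }
        double[] w = new double[k];
        for (int i = 0; i < k; i++) {
            w[i] = a[i][k] / a[i][i];
        }
        return w;
    }

    /** Applies the fitted weights to one set of feature values. */
    static double predict(double[] w, double... x) {
        double y = w[0];
        for (int i = 0; i < x.length; i++) {
            y += w[i + 1] * x[i];
        }
        return y;
    }

    public static void main(String[] args) {
        // Toy data generated from y = 1 + 2*x1 + 3*x2 (not the article's dataset)
        double[][] x = {{0, 0}, {1, 0}, {0, 1}, {1, 1}, {2, 1}};
        double[] y = {1, 3, 4, 6, 8};
        double[] w = fit(x, y);
        System.out.println("Weights: " + Arrays.toString(w));
        System.out.println("Prediction for (3, 2): " + predict(w, 3, 2));
    }
}
```

The key point is that the trained model is nothing more than one weight per input plus an intercept, which is why its predictions change by a fixed amount for every extra mile or bedroom.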

And now, we can write the code for a price prediction depending on the area, bedrooms, and distance to the center:

Java

import weka.classifiers.Evaluation;

public void predictPrice(double area, int bedrooms, double milesAway) throws Exception {
    // Fill in the inputs in the same order as the attributes in the ARFF file
    instance.setValue(0, area);
    instance.setValue(1, bedrooms);
    instance.setValue(2, milesAway);

    // Predict the price
    double predictedPrice = classifier.classifyInstance(instance);
    System.out.println("Predicted price: " + predictedPrice);

    // Evaluate the model on the training data itself. For a numeric class,
    // errorRate() returns the root mean squared error (in dollars here).
    Evaluation eval = new Evaluation(data);
    eval.evaluateModel(classifier, data);
    System.out.println("Calculation error rate: " + eval.errorRate());
}
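According to the Weka Javadoc, Evaluation.errorRate() returns the root mean squared error when the class attribute is numeric, as it is here. The sketch below shows how that metric is computed in plain Java. RegressionError and rootMeanSquaredError are hypothetical names, and the sample numbers are made up for illustration.

```java
/** Root mean squared error between actual and predicted values (illustration only). */
final class RegressionError {

    static double rootMeanSquaredError(double[] actual, double[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("Arrays must be non-empty and of equal length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double difference = predicted[i] - actual[i];
            sumOfSquares += difference * difference;
        }
        // Average the squared differences, then take the square root
        return Math.sqrt(sumOfSquares / actual.length);
    }

    public static void main(String[] args) {
        // Made-up prices, not taken from the article's dataset
        double[] actual = {100, 200, 300};
        double[] predicted = {110, 190, 300};
        System.out.println("RMSE: " + rootMeanSquaredError(actual, predicted));
    }
}
```

Because the result is in the same unit as the target (dollars in this article), it can be compared directly with typical prices. Note that an error measured on the training data tends to be optimistic compared with an error measured on unseen data.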
  

The code mentioned in this article can be found on GitHub.

4. Running Application

We're now ready to launch the application.

It reads the file with the initial dataset to learn prices in Seattle, trains a linear regression model, and is then ready to predict prices from the parameters (area, distance from the center, bedrooms).

We asked the application to predict prices for several properties, and the output is:

-- Predicting price for [area - 2000.0 sqft, bedrooms - 4, miles away - 1.0 mi]
Predicted price: 1367914.915677933
...
-- Predicting price for [area - 2000.0 sqft, bedrooms - 4, miles away - 10.0 mi]
Predicted price: 857712.4223102874
Calculation error rate: 167943.05185960032

Analyzing Results

Prediction results are collected in the table below:

Input                                             Output
Area, sqft    Bedrooms, bd    Miles away, mi      Predicted price, $
2000          4               1                   1 367 915
2000          4               2                   1 311 226
2000          4               3                   1 254 537
1000          3               5                   905 812
1500          2               7                   557 087
2000          4               10                  857 712

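The predicted prices are close to those on the Seattle real estate market, so the application's predictions look reasonably accurate.

The application reports a calculation error rate of 167 943 $. For a numeric class, Weka's errorRate() returns the root mean squared error, here measured on the same nine records the model was trained on. That is roughly 12-20% of the predicted prices in the 850 000 $ to 1 400 000 $ range, which also seems to be a good result for such a small dataset.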

But in some cases, predictions may differ from real prices much more. We didn't give the application enough data, and the price is not determined by the passed parameters (area, bedrooms, and distance to the center) alone. Real price estimation is much more complicated; we didn't consider factors such as the distance to a school and its rating, the neighborhood, the condition of the property, and so on.

In addition to that, the application didn't learn prices far away from Seattle at all. The farthest object from the center in the dataset is located 12.9 miles away. Let's take a look at predictions for real estate properties located 25 and 30 miles away from Seattle:

Input                                             Output
Area, sqft    Bedrooms, bd    Miles away, mi      Predicted price, $
2500          4               25                  7 375
2500          4               30                  -276 070
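The negative price follows directly from the linearity of the model. The following is a back-of-the-envelope check that uses only the rounded predictions published in the tables of this article. It is an inference from those numbers, not output from Weka.

```latex
\begin{align*}
\text{slope between 25 and 30 mi} &= \frac{-276\,070 - 7\,375}{30 - 25} = -56\,689 \ \text{\$/mi} \\
\text{slope between 1 and 10 mi} &= \frac{857\,712 - 1\,367\,915}{10 - 1} \approx -56\,689 \ \text{\$/mi} \\
\text{distance at which the price reaches } 0 &\approx 25 + \frac{7\,375}{56\,689} \approx 25.1 \ \text{mi}
\end{align*}
```

In other words, the model subtracts about 56 700 $ for every additional mile, no matter how far away the property is, so a four-bedroom property beyond roughly 25 miles gets a negative price. Extending the 2000 sqft row by the same slope gives about 1 367 915 - 24 × 56 689 ≈ 7 379 $ at 25 miles, almost identical to the 7 375 $ predicted for 2500 sqft, which suggests that the fitted model gives area little or no weight.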

We provided the application with such a small and non-diverse dataset that it has learned that there is no life outside of Seattle at all. Accurate and relevant data is crucial when training ML applications.
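One simple way to avoid reporting such extrapolated values is to check whether a query lies inside the range of the training data before trusting the prediction. The sketch below hard-codes the minimum and maximum of each input attribute from the nine-row dataset in this article. TrainingRange and covers are hypothetical names, and a per-attribute range check is only a rough assumption about what counts as "seen" data.

```java
/** Checks whether a query lies within the per-attribute range of the training data (illustration only). */
final class TrainingRange {

    // Minimum and maximum of each input attribute in the article's nine-row dataset:
    // area (sqft), bedrooms, distance to the center (mi)
    private static final double[] MIN = {1107, 2, 1.2};
    private static final double[] MAX = {3150, 6, 12.9};

    static boolean covers(double area, int bedrooms, double milesAway) {
        double[] query = {area, bedrooms, milesAway};
        for (int i = 0; i < query.length; i++) {
            if (query[i] < MIN[i] || query[i] > MAX[i]) {
                return false; // at least one input is outside what the model has seen
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("2000 sqft, 4 bd, 10 mi covered: " + covers(2000, 4, 10));
        System.out.println("2500 sqft, 4 bd, 25 mi covered: " + covers(2500, 4, 25));
    }
}
```

An application could refuse to predict, or at least print a warning, whenever covers returns false.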

But results within Seattle itself are still accurate and correspond to the real estate market.

Conclusion

In conclusion, using Java and libraries like Weka offers a powerful and flexible approach to developing machine learning applications.

In this article, we developed an application that predicts prices in Seattle fairly accurately, in line with the real estate market. But the application never learned about prices outside of Seattle, so its results there were wrong.

That's why it is important to highlight that successful machine learning applications require careful consideration of data quality, feature and model selection (we didn't consider school ratings or the condition of a property), and model evaluation to ensure reliable and accurate results.