Predictive Services Development With Java and Weka
In this article, we will briefly describe how to create predictive services with Java and the Weka library (Waikato Environment for Knowledge Analysis, The University of Waikato).
We will develop a Java application and consider a real-life example — the prediction of real estate prices in Seattle depending on parameters (area, distance from center, bedrooms). The application will learn real estate prices, and we will analyze its predictions.
The city of Seattle was chosen randomly for this article.
Brief Introduction to Machine Learning in Java
If you have never heard of machine learning, or you're simply afraid of it, please don't be :)
Machine learning is a rapidly growing field of artificial intelligence (AI) that involves the development of algorithms and models that enable computers to learn and make predictions or decisions based on data. Machine learning can be used for image and object recognition, natural language processing (NLP), customer relationship management, recommendation systems, predictive maintenance, and much more.
Indeed, it is not that easy to develop everything from scratch, but Java provides a rich ecosystem of libraries and tools that can be leveraged for machine learning tasks, such as data preprocessing, feature engineering, model training, and evaluation.
Tools that help developers implement machine learning algorithms and build predictive models include Deeplearning4j (DL4J, Deep Learning for Java), MLlib (Apache Spark), Smile (Statistical Machine Intelligence and Learning Engine), Encog (Embedded Neural Network and Genetic Programming Framework), JOONE (Java Object Oriented Neural Engine), and Weka (Waikato Environment for Knowledge Analysis).
Development of Predictive Application
In this article, we will develop an application responsible for predicting real estate prices in Seattle depending on parameters (area, distance from the center, and bedrooms).
We will use a machine learning library called Weka (Waikato Environment for Knowledge Analysis, The University of Waikato), which provides a wide range of machine learning algorithms and tools for data mining, feature selection, and model evaluation. Weka is open-source software that is widely used in both academic and industry settings for a variety of machine learning tasks.
1. Application Dependencies
First, to start using Weka in a Java application, we have to add it to the classpath. The required artifact is available in the Maven repository; just add it to your build file:
- using Maven
<dependencies>
    <!-- https://mvnrepository.com/artifact/nz.ac.waikato.cms.weka/weka-stable -->
    <dependency>
        <groupId>nz.ac.waikato.cms.weka</groupId>
        <artifactId>weka-stable</artifactId>
        <version>3.8.6</version>
    </dependency>
</dependencies>
- using Gradle
// https://mvnrepository.com/artifact/nz.ac.waikato.cms.weka/weka-stable
implementation group: 'nz.ac.waikato.cms.weka', name: 'weka-stable', version: '3.8.6'
2. Initial Dataset
Machine learning applications are mostly designed to autonomously learn from data and improve their performance over time.
But in this article, we will simply provide the application with an initial dataset, i.e., some real estate prices in Seattle. To do that, let's create an ARFF file and collect some data there.
% All comments have to start with '%' character
% Relation name
@relation prices
% Fields order
@attribute area numeric      % area, sqft
@attribute Bedrooms numeric  % bedrooms, bd
@attribute distance numeric  % distance to the center, mi
@attribute price numeric     % price, $
% Declaring examples
@data
% Area, sqft  Bedrooms  Distance, mi  Price, $
3150  4  1.2   1349000
2290  3  1.2   1050000
2940  6  3.0   1850000
1107  2  3.2   729950
1122  2  6.6   599950
1350  2  6.9   598000
1390  3  12.2  449950
1660  5  12.3  619950
1540  4  12.9  1024900
Such a file can simply be placed in the application resources folder or in any other location the application can read.
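If the records are collected by code rather than typed by hand, the same file can be generated. The following is only a minimal sketch: the class `ArffSketch` and its method `toArff` are hypothetical names, not part of Weka, and it writes comma-separated data rows (the usual ARFF convention) instead of the whitespace-separated rows shown above.

```java
import java.util.Locale;

/** Hypothetical helper (not part of Weka): renders price records as ARFF text. */
final class ArffSketch {

    /** Each row is {area in sqft, bedrooms, distance in miles, price in $}. */
    static String toArff(double[][] rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation prices\n");
        // Same attribute names and order as in the file above
        for (String name : new String[] {"area", "Bedrooms", "distance", "price"}) {
            sb.append("@attribute ").append(name).append(" numeric\n");
        }
        sb.append("@data\n");
        for (double[] r : rows) {
            // Locale.ROOT keeps '.' as the decimal separator on any machine
            sb.append(String.format(Locale.ROOT, "%.0f,%.0f,%.1f,%.0f\n",
                    r[0], r[1], r[2], r[3]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Two records taken from the dataset above
        double[][] rows = {
            {3150, 4, 1.2, 1349000},
            {2290, 3, 1.2, 1050000}
        };
        System.out.print(toArff(rows));
    }
}
```

The returned string can then be written to `prices.arff` with any file API.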
The file contains only a handful of records. This is enough for this article, but not for a real-world application. To achieve high accuracy, much more data should be provided, automated learning techniques should be used, and data should be collected continuously.
However, it's important to carefully evaluate and validate the results obtained from automated processes to ensure the reliability and accuracy of the machine learning models in real-world applications.
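One common way to validate a model is to hold back part of the data and measure the error only on records the model has not seen. The sketch below is an assumption about how that could look, not something the application in this article does: `HoldOutSketch` and `split` are hypothetical names, and the 80/20 ratio is only an example. Weka's `Evaluation` class also offers built-in cross-validation, which may be a better fit for a dataset this small.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper: splits row indices into a training part and a test part. */
final class HoldOutSketch {

    /** Returns two lists: indices for training first, indices for testing second. */
    static List<List<Integer>> split(int rowCount, double trainFraction, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < rowCount; i++) {
            indices.add(i);
        }
        // A fixed seed makes the shuffle, and therefore the split, repeatable
        Collections.shuffle(indices, new Random(seed));
        int trainSize = (int) Math.round(rowCount * trainFraction);

        List<List<Integer>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(indices.subList(0, trainSize)));
        parts.add(new ArrayList<>(indices.subList(trainSize, rowCount)));
        return parts;
    }

    public static void main(String[] args) {
        // 9 records, as in the dataset above
        List<List<Integer>> parts = split(9, 0.8, 42L);
        System.out.println("Train rows: " + parts.get(0));
        System.out.println("Test rows:  " + parts.get(1));
    }
}
```

With nine records this leaves seven rows for training and two for testing, which illustrates why a larger dataset is needed before the measured error means much.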
3. Java Code
First, we have to train our application. In this article, we do that by passing it an ARFF file with the initial dataset. Let's load this data and set the class index to the price attribute:
import weka.core.Instances;
import weka.core.converters.ConverterUtils;

// Loading Seattle real estate prices from the ARFF file
ConverterUtils.DataSource source = new ConverterUtils.DataSource("prices.arff");
Instances data = source.getDataSet();
// Setting the last attribute (price) as the class attribute
data.setClassIndex(data.numAttributes() - 1);
Second, we have to create a classifier. Let's use a linear regression classifier in this article:
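import weka.classifiers.Classifier;
import weka.classifiers.functions.LinearRegression;
import weka.core.DenseInstance;
import weka.core.Instance;

// Creating a linear regression based classifier
Classifier classifier = new LinearRegression();
// Training the classifier on the data
classifier.buildClassifier(data);
// Creating an Instance for predictions
Instance instance = new DenseInstance(data.numAttributes());
instance.setDataset(data);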
And now we can write the code that predicts a price from the area, the number of bedrooms, and the distance to the center:
import weka.classifiers.Evaluation;

public void predictPrice(double area, int bedrooms, double milesAway) throws Exception {
    // Describe the property we want a price for
    instance.setValue(0, area);
    instance.setValue(1, bedrooms);
    instance.setValue(2, milesAway);
    // Predict the price
    double predictedPrice = classifier.classifyInstance(instance);
    System.out.println("Predicted price: " + predictedPrice);
    // Evaluate the model on the training data; for a numeric class,
    // errorRate() returns the root mean squared error
    Evaluation eval = new Evaluation(data);
    eval.evaluateModel(classifier, data);
    System.out.println("Calculation error rate: " + eval.errorRate());
}
The code mentioned in this article can be found on GitHub.
4. Running Application
So basically, we're ready to launch the application.
It reads a file with the initial dataset to learn prices in Seattle, trains a linear regression model on it, and is then ready to predict prices from the given parameters (area, distance from center, bedrooms).
The application is asked to predict prices for several properties, and the output is:
-- Predicting price for [area - 2000.0 sqft, bedrooms - 4, miles away - 1.0 mi]
Predicted price: 1367914.915677933
...
-- Predicting price for [area - 2000.0 sqft, bedrooms - 4, miles away - 10.0 mi]
Predicted price: 857712.4223102874
Calculation error rate: 167943.05185960032
Analyzing Results
Prediction results are collected in the table below:
Area, sqft (input) | Bedrooms, bd (input) | Miles away, mi (input) | Predicted price, $ (output) |
---|---|---|---|
2000 | 4 | 1 | 1 367 915 |
2000 | 4 | 2 | 1 311 226 |
2000 | 4 | 3 | 1 254 537 |
1000 | 3 | 5 | 905 812 |
1500 | 2 | 7 | 557 087 |
2000 | 4 | 10 | 857 712 |
The predicted prices are close to those on the Seattle real estate market, so the application's predictions look reasonably accurate.
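The application reported an error of about $167,943, which is 12-20% of the predicted prices above, and that also seems to be a good result. Note that for a numeric target, Weka's errorRate() returns the root mean squared error, and here it is measured on the same data the model was trained on.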
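To make this figure more concrete, here is a minimal sketch of how a root mean squared error is computed and how the 12-20% range can be reproduced. It treats the reported figure as a root mean squared error and assumes the percentages relate it to the highest and lowest predicted prices in the output above. `RmseSketch` and its methods are hypothetical names, and the three-element arrays are toy values, not real prices.

```java
/** Hypothetical helper: root mean squared error and its size relative to a price. */
final class RmseSketch {

    /** Square root of the mean of the squared differences. */
    static double rmse(double[] actual, double[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("Arrays must be non-empty and of equal length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double diff = predicted[i] - actual[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares / actual.length);
    }

    /** Error expressed as a fraction of a given price. */
    static double relativeError(double error, double price) {
        return error / price;
    }

    public static void main(String[] args) {
        // Toy values, only to show the arithmetic
        double[] actual = {100, 200, 300};
        double[] predicted = {110, 190, 300};
        System.out.println("RMSE: " + rmse(actual, predicted));

        // Figures copied from the application output above
        double reportedError = 167943.05;
        System.out.printf("Relative to 1367914.92: %.1f%%%n",
                100 * relativeError(reportedError, 1367914.92));
        System.out.printf("Relative to 857712.42: %.1f%%%n",
                100 * relativeError(reportedError, 857712.42));
    }
}
```

The two relative values come out at roughly 12% and 20%, which matches the range quoted above.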
But in some cases, predictions may differ from real prices much more. We didn't provide the application with enough data, and the price does not depend only on the parameters we passed (area, bedrooms, and distance to the center). Real price estimation is much more complicated; we didn't consider factors like the distance to a school and its rating, the neighborhood, the condition of the property, and so on.
In addition, the application didn't learn prices far away from Seattle at all. The farthest property from the center in the dataset is located 12.9 miles away. Let's take a look at predictions for properties located 25 and 30 miles away from Seattle:
Area, sqft (input) | Bedrooms, bd (input) | Miles away, mi (input) | Predicted price, $ (output) |
---|---|---|---|
2500 | 4 | 25 | 7 375 |
2500 | 4 | 30 | -276 070 |
We provided the application with such a small and non-diverse dataset that it has learned that there is no life outside of Seattle at all. It is crucial to have accurate and relevant data when training ML applications.
But results within Seattle itself are still accurate and correspond to the real estate market.
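The negative price follows directly from how a linear model works: every extra mile changes the prediction by the same fixed amount, no matter how far from the training data we go. The sketch below reproduces this arithmetic using only the rounded values from the tables above. `ExtrapolationSketch` and its methods are hypothetical names, and the range check is just one possible safeguard, not something the application does.

```java
/** Hypothetical helper: shows why a linear model yields negative prices far away. */
final class ExtrapolationSketch {

    // Predictions for 2000 sqft, 4 bedrooms, taken from the first table
    static final long PRICE_AT_1_MILE = 1_367_915;
    static final long PRICE_AT_2_MILES = 1_311_226;
    // Farthest property in the training dataset
    static final double MAX_TRAINED_DISTANCE_MILES = 12.9;

    /** Price change per additional mile implied by the two predictions. */
    static long pricePerMile() {
        return PRICE_AT_2_MILES - PRICE_AT_1_MILE;
    }

    /** Moves a known prediction to another distance along the same straight line. */
    static long extrapolate(long knownPrice, double knownMiles, double targetMiles) {
        return Math.round(knownPrice + pricePerMile() * (targetMiles - knownMiles));
    }

    /** True if the distance lies beyond anything the model was trained on. */
    static boolean beyondTrainingData(double miles) {
        return miles > MAX_TRAINED_DISTANCE_MILES;
    }

    public static void main(String[] args) {
        System.out.println("Price change per mile: " + pricePerMile()); // prints -56689
        // Start from the 25-mile prediction in the second table
        System.out.println("Price at 30 miles: " + extrapolate(7_375, 25, 30)); // prints -276070
        System.out.println("25 miles beyond training data: " + beyondTrainingData(25));
    }
}
```

A check like `beyondTrainingData` could be used to warn the caller, or to refuse to answer, when a request falls outside the range the model has seen.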
Conclusion
In conclusion, using Java and libraries like Weka offers a powerful and flexible approach to developing machine learning applications.
In this article, we developed an application whose price predictions within Seattle are reasonably accurate and correspond to the real estate market. But the application didn't learn anything about prices outside of Seattle, so its results there were wrong.
That's why it is important to highlight that successful machine learning applications require careful consideration of data quality, feature and model selection (we didn't consider school ratings or the condition of a property), and model evaluation to ensure reliable and accurate results.