Predicting Housing Prices Using Google AutoML Tables

An introduction to Google Cloud AutoML Tables, which can train either regression or classification models depending on the type of the column we want to predict.

Mudit Jain

Preetam Joshi

Apr. 01, 21 · Tutorial

Overview of Problem

Tabular data is common in business and engineering problems. Machine learning can be used to predict a column of interest in such a table, using the other columns as input features. As an example, we will use historical house sales data to predict sale prices for houses that come on the market in the future. The house prices dataset from Kaggle contains such data for Ames, Iowa. It has 79 predictive features in total, including house area, neighborhood name, type of building, house style, condition, and year sold. Some of these features are categorical and others are numerical. Our goal is to predict the Sale Price (a numeric column) of a house from these features. A sample of the data, with a subset of the columns, is shown here:

Id	LotArea	Neighborhood	BldgType	Style	Cond	YrBuilt	1stFlrSF	2ndFlrSF	Fireplaces	YrSold	SalePrice
1	8450	CollgCr	1Fam	2Story	5	2003	856	854	0	2008	208500
2	9600	Veenker	1Fam	1Story	8	1976	1262	0	1	2007	181500
3	11250	CollgCr	1Fam	2Story	5	2001	920	866	1	2008	223500
4	9550	CollgCr	1Fam	2Story	5	1915	961	756	1	2006	140000
5	14260	NoRidge	1Fam	2Story	5	2000	1145	1053	1	2008	250000
6	14115	Mitchel	1Fam	1.5Fin	5	1993	796	566	0	2009	143000
7	10084	Somerst	1Fam	1Story	5	2004	1694	0	1	2007	307000
8	10382	NWAmes	1Fam	2Story	6	1973	1107	983	2	2009	200000
9	6120	OldTown	1Fam	1.5Fin	5	1931	1022	752	2	2008	129900
10	7420	BrkSide	2fmCon	1.5Fin	6	1939	1077	0	2	2008	118000
11	11200	Sawyer	1Fam	1Story	5	1965	1040	0	0	2008	129500
12	11924	NridgHt	1Fam	2Story	5	2005	1182	1142	2	2006	345000
13	12968	Sawyer	1Fam	1Story	6	1962	912	0	0	2008	144000
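
As a rough illustration of the problem setup, the sketch below parses a few rows of the sample above and separates the target column from the input features. It uses only the Python standard library. The helper name `split_features_and_target` is hypothetical, and none of this is required by AutoML Tables, which works directly on the uploaded CSV.

```python
import csv
import io

# A few rows copied from the sample above (tab-separated, as in the listing).
SAMPLE = (
    "Id\tLotArea\tNeighborhood\tBldgType\tStyle\tCond\tYrBuilt\t"
    "1stFlrSF\t2ndFlrSF\tFireplaces\tYrSold\tSalePrice\n"
    "1\t8450\tCollgCr\t1Fam\t2Story\t5\t2003\t856\t854\t0\t2008\t208500\n"
    "2\t9600\tVeenker\t1Fam\t1Story\t8\t1976\t1262\t0\t1\t2007\t181500\n"
    "3\t11250\tCollgCr\t1Fam\t2Story\t5\t2001\t920\t866\t1\t2008\t223500\n"
)

TARGET = "SalePrice"  # the column we want to predict
ROW_ID = "Id"         # a row identifier, not a predictive feature


def split_features_and_target(text):
    """Hypothetical helper: return (feature dicts, target values) from TSV text."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    features = [
        {name: value for name, value in row.items() if name not in (TARGET, ROW_ID)}
        for row in rows
    ]
    targets = [float(row[TARGET]) for row in rows]
    return features, targets


features, targets = split_features_and_target(SAMPLE)
print(len(features), "rows with", len(features[0]), "feature columns each")
print("targets:", targets)
```

Values are kept as strings here on purpose: deciding which columns are numeric and which are categorical is a separate step.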

Overview of Google AutoML Tables

Google AutoML Tables enables quick, high-accuracy training and subsequent hosting of ML models for this kind of problem. Users can import and visualize the data, train a model, evaluate it on a test set, iterate to improve model accuracy, and then host the best model for online or offline predictions. All of this is available as a service, with no ML expertise and no hardware or software installation required from users.
AutoML Tables can train both regression and classification models, depending on the type of the column we are trying to predict.

Initial Setup

We first log in to our Google Cloud Platform (GCP) account (or create one if we don't have it) and create a new project. We then enable AutoML Tables by selecting 'Tables' and enabling the API.

Importing Data

To import data, we go to the Import tab and select the source type, which is either a CSV file or a BigQuery table. In our case, we will upload the 'train.csv' file of the housing prices dataset that we downloaded earlier. If a destination Google Cloud Storage (GCS) bucket for the upload doesn't already exist, we can create a single-region bucket, e.g., 'gs://house_prices_dataset_1'. AutoML Tables will import the data and automatically analyze it to validate it and detect the data type of each column.

Exploring Data

We can explore the imported data schema once the import completes. AutoML Tables shows the column name, data type (i.e., category, numeric, or text), missing values, and distinct values for each column. We should set the prediction target column; in our case, we'll predict the SalePrice column. To help us understand how valuable each feature is individually, AutoML Tables also generates a correlation score between each column and the target column. In addition, we can explore the distribution of values in each column.
In some cases, a feature's data type may be detected as numeric when it is actually categorical, because its category values are numbers rather than text. Year Sold is one example. We can override the detected type when this happens.
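
To make the type-detection pitfall concrete, here is a minimal sketch of a check we could run ourselves before import. It flags columns whose values are all integers but take only a few distinct values. This is a hypothetical heuristic written for illustration, not how AutoML Tables detects types, and the `max_distinct` threshold is an arbitrary assumption.

```python
def numeric_columns_that_may_be_categorical(rows, max_distinct=5):
    """Hypothetical heuristic: integer-valued columns with few distinct values."""
    flagged = []
    for name in rows[0]:
        values = [row[name] for row in rows]
        all_integers = all(v.lstrip("-").isdigit() for v in values)
        if all_integers and len(set(values)) <= max_distinct:
            flagged.append(name)
    return flagged


# Three columns from the first six rows of the sample data, kept as strings.
rows = [
    {"LotArea": "8450", "YrSold": "2008", "SalePrice": "208500"},
    {"LotArea": "9600", "YrSold": "2007", "SalePrice": "181500"},
    {"LotArea": "11250", "YrSold": "2008", "SalePrice": "223500"},
    {"LotArea": "9550", "YrSold": "2006", "SalePrice": "140000"},
    {"LotArea": "14260", "YrSold": "2008", "SalePrice": "250000"},
    {"LotArea": "14115", "YrSold": "2009", "SalePrice": "143000"},
]

flagged = numeric_columns_that_may_be_categorical(rows)
print("Review these columns before training:", flagged)
```

A flagged column is only a candidate for review. Whether a year or a rating should be treated as a category or as a number still depends on how we expect it to relate to the price.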

Training

We can now select the input columns and set certain data and training parameters. We first remove the ID column from the inputs, since it is a unique identifier of the row and not a feature. We then specify the train, validation, and test dataset split that AutoML will use during training. The split can either happen automatically (randomly), or we can assign rows to the train, validation, and test sets ourselves with an additional column. Next, we can choose whether any column should be treated as a weight column. A weight column gives higher importance to certain rows, which is helpful if we want our model to be more accurate for certain subsets of data, for example certain house types or regions.
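
If we choose the manual split, the additional column has to be filled in before upload. The sketch below shows one way to do that, assuming an 80/10/10 split and the label values `TRAIN`, `VALIDATE`, and `TEST`. The label values should be checked against the current AutoML Tables documentation, and the `assign_split` helper and the `split` column name are hypothetical. Hashing the row Id keeps the assignment stable across runs, unlike a fresh random draw.

```python
import hashlib


def assign_split(row_id, train=0.8, validate=0.1):
    """Hypothetical helper: map a row id to a split label deterministically."""
    digest = hashlib.sha256(str(row_id).encode("utf-8")).hexdigest()
    # Turn the first 32 bits of the hash into a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket < train:
        return "TRAIN"
    if bucket < train + validate:
        return "VALIDATE"
    return "TEST"


# Ids and prices taken from the first three rows of the sample data.
rows = [{"Id": i, "SalePrice": p} for i, p in [(1, 208500), (2, 181500), (3, 223500)]]
for row in rows:
    row["split"] = assign_split(row["Id"])  # assumed name for the extra column

for row in rows:
    print(row)
```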

In advanced features, we can select the duration for which we should train the model. In case the model converges earlier, AutoML will automatically stop training ('Early stopping') before our specified duration. Since our dataset is small, we can select a budget of just 1 hour.

Finally, we can select an objective function from among:

RMSE (Root Mean Square Error): Because errors are squared, large deviations are penalized heavily. This objective is used when large absolute deviations should matter more, including small relative deviations on high-valued houses.
MAE (Mean Absolute Error): This is similar to RMSE, but large deviations matter less since we take the absolute difference (L1) instead of the squared difference (L2) used in RMSE.
RMSLE (Root Mean Square Log Error): This objective is used when we want to treat deviations at large and small scales equally, since we take the log of both the predictions and the ground truth. It effectively measures relative rather than absolute error.
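
To see how these three objectives differ, the sketch below computes each one by hand for two made-up sets of predictions: one that is 10% off on a cheap house and one that is 10% off on an expensive house. The prices are invented for illustration. The RMSLE here uses log(1 + x), a common convention, and the exact formula AutoML Tables uses internally is not specified in this article.

```python
import math


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmsle(actual, predicted):
    """Root mean square log error, using log(1 + x) (a common convention)."""
    squared_log_errors = (
        (math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted)
    )
    return math.sqrt(sum(squared_log_errors) / len(actual))


# Illustrative prices only: one cheap house and one expensive house.
actual = [100_000, 500_000]
cheap_off = [110_000, 500_000]      # 10% error on the cheap house
expensive_off = [100_000, 550_000]  # 10% error on the expensive house

for label, predicted in [("cheap off", cheap_off), ("expensive off", expensive_off)]:
    print(
        label,
        "RMSE=%.1f" % rmse(actual, predicted),
        "MAE=%.1f" % mae(actual, predicted),
        "RMSLE=%.4f" % rmsle(actual, predicted),
    )
```

RMSE and MAE come out five times larger when the expensive house is mispredicted, while RMSLE is nearly identical in both cases, because it responds to the relative error rather than the absolute one.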

Testing and Results

Training the model can take anywhere from 30 minutes to 20 hours, depending on the budget specified and how quickly training converges. Once training completes, we can see the results of our regression model: on the test split of our training data, its predictions are within 14% of the actual price on average, as measured by the mean absolute percentage error (MAPE). With more training data, this error can likely be reduced further.
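
For readers unfamiliar with MAPE, here is a minimal sketch of how it is computed. The two prices and predictions are made up so that the result comes out near 14%, and they are not the model's actual outputs.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Assumes no actual value is zero."""
    relative_errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(relative_errors) / len(relative_errors)


# Illustrative numbers only: errors of 10% and 18% average to about 14%.
actual = [200_000, 100_000]
predicted = [180_000, 118_000]
print("MAPE: %.1f%%" % mape(actual, predicted))
```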

We can also see an importance score for each feature. 'Ground Living Area' is the most important one. Other top features include 'Lot Area,' 'Open Porch Area,' '1st Floor Area,' 'Quality of House,' 'Year Built,' and 'Year Remodelled,' which are quite intuitive and suggest that the model is learning sensible features for predicting the sale price.

Prediction

We can use this model in three modes for prediction on new data:

Online prediction: In online mode, we can issue live requests to our model, e.g., from a production service. The model is hosted by AutoML, which replicates it to deliver a high-availability, low-latency SLO. For this mode, the model needs to be deployed.
Batch prediction: In batch mode, AutoML runs the model as a one-off job to predict over a larger batch of data we already have. The model does not need to be deployed, so this mode is cheaper than online mode.
Self-hosted: We can export a Docker image of the model and host it on our own VMs and containers. In this mode, we are responsible for the reliability and maintenance of the model. This mode is useful if the model must run on-prem to predict on data that cannot leave an on-prem environment, or if the costs of AutoML online or batch prediction are too high.

Conclusion

To conclude, in this article we showed how AutoML Tables can be used to train and host a good-quality ML model for tabular data while requiring only minimal knowledge of ML/AI and no effort to set up training and hosting environments.

AutoML Tables takes care of these requirements for you and provides you with 'Automatic ML.'
