Applying Artificial Neural Networks to Predict Housing Price: An Illustration

Machine learning (ML) is a central part of Artificial Intelligence (AI), and Artificial Neural Networks (ANN) are one of its major approaches (Figure 1).

An ANN is a framework in which many different machine learning algorithms work together to process complex data inputs; it “learns” to make predictions through “training”, without being programmed with any task-specific rules, as mentioned in my previous article [1]. Plenty of articles and videos already teach how ANNs work, so I am not going to repeat them here. Instead, I am going to show the results of applying an ANN to housing price estimation. Existing software packages can now run an ANN without the user having to handle the mathematical complexity or the programming details, so I will only discuss the results from one of them, XLSTAT, because it provides an online illustration using the Boston house-price data of Harrison and Rubinfeld (1978) [2]. The data set is downloadable at http://lib.stat.cmu.edu/datasets/boston and includes 506 records with the following 13 explanatory variables and 1 dependent variable, MEDV (Median Value of Owner-occupied Homes in \$1000's):

Variables in order:

- `CRIM`: per capita crime rate by town
- `ZN`: proportion of residential land zoned for lots over 25,000 sq.ft.
- `INDUS`: proportion of non-retail business acres per town
- `CHAS`: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- `NOX`: nitric oxides concentration (parts per 10 million)
- `RM`: average number of rooms per dwelling
- `AGE`: proportion of owner-occupied units built prior to 1940
- `DIS`: weighted distances to five Boston employment centres
- `RAD`: index of accessibility to radial highways
- `TAX`: full-value property-tax rate per \$10,000
- `PTRATIO`: pupil-teacher ratio by town
- `B`: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- `LSTAT`: % lower status of the population
- `MEDV`: median value of owner-occupied homes in \$1000's

Figure 2 Variables of the Boston Housing Value Predictions. Source: http://lib.stat.cmu.edu/datasets/boston
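For readers who would rather inspect the raw file than use a spreadsheet add-on, here is a minimal sketch of how it might be parsed in Python. It assumes the file starts with 22 lines of descriptive header and that each record's 14 values are spread over two lines of whitespace-separated numbers; `parse_boston` and `load_boston` are names made up for this illustration, not part of any library.

```python
from urllib.request import urlopen

COLUMNS = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]


def parse_boston(text, header_lines=22):
    """Turn the raw file text into a list of {column: value} records.

    Assumption: the first `header_lines` lines are description, and the
    remaining numbers can be read as one stream, 14 values per record.
    """
    body = text.splitlines()[header_lines:]
    values = [float(token) for line in body for token in line.split()]
    if len(values) % len(COLUMNS) != 0:
        raise ValueError("number of values is not a multiple of 14")
    return [dict(zip(COLUMNS, values[i:i + len(COLUMNS)]))
            for i in range(0, len(values), len(COLUMNS))]


def load_boston(url="http://lib.stat.cmu.edu/datasets/boston"):
    """Download the file and parse it (needs network access)."""
    with urlopen(url) as response:
        return parse_boston(response.read().decode("utf-8", errors="replace"))
```

Calling `load_boston()` should then give one dictionary per census tract, with `MEDV` as the value to be predicted and the other 13 keys as inputs.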

Traditionally, economists use a Hedonic Pricing Model (regression) to estimate the effects of these explanatory variables on housing price. However, it requires a correctly specified regression model, and unless interaction or non-linear terms are explicitly specified, it assumes that the variables’ effects are independent of one another and that each variable’s effect is linear.
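To make the independence and linearity assumptions concrete, here is a small sketch of an additive hedonic equation. The coefficients are made-up numbers chosen only for illustration; they are not estimates from Harrison and Rubinfeld or from any other study.

```python
# Hypothetical coefficients (price change in $1000s per unit of each variable).
# These are illustrative numbers, not estimates from any study.
INTERCEPT = -5.0
COEFFICIENTS = {"RM": 5.0, "NOX": -15.0}


def hedonic_price(home):
    """Additive hedonic model: intercept plus coefficient * value for each variable."""
    return INTERCEPT + sum(COEFFICIENTS[name] * home[name] for name in COEFFICIENTS)


def marginal_effect(home, variable, step=1.0):
    """Price change from raising one variable by `step`, holding the rest fixed."""
    bumped = dict(home)
    bumped[variable] += step
    return hedonic_price(bumped) - hedonic_price(home)


clean_air_home = {"RM": 6.0, "NOX": 0.4}
polluted_home = {"RM": 6.0, "NOX": 0.8}

# One extra room is worth the same amount in both homes, because the model
# has no interaction term between RM and NOX.
print(marginal_effect(clean_air_home, "RM"))
print(marginal_effect(polluted_home, "RM"))
```

In this specification the value of an extra room cannot depend on air quality unless the analyst adds an interaction term by hand, which is the limitation described above.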

In contrast, an ANN uses multiple layers (deep learning) and multiple neurons (nodes) to “learn” the interactive effects of these variables on housing prices. Yet this Boston house-price example is not perfect, as it still requires human beings to pre-select the 13 explanatory variables, which may result in omitted-variable bias and specification bias.

For example, the study focuses on the effects of air quality on housing prices, but it includes only the nitric oxides concentration, without taking into account other pollutants such as SO2, NO2, PM2.5 and PM10. Home buyers may care a lot about all these pollutants, but their effects cannot be reflected in the estimates. A true machine learning system should be able to collect and determine its own dataset for the analysis.

Anyway, just for the sake of illustration, let’s run the ANN. XLSTAT is an add-on for the Excel spreadsheet. Figure 3 shows the ANN with the 13 variables in the INPUT LAYER and 2 HIDDEN LAYERS with 5 NODES and 3 NODES respectively; there is only one node in the OUTPUT LAYER, for the prediction of housing value, MEDV.
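The 13-5-3-1 layout can be sketched in a few lines of plain Python. This is not XLSTAT's implementation: the tanh activation in the hidden layers, the linear output node and the random starting weights are all assumptions made for the sake of the example.

```python
import math
import random

LAYER_SIZES = [13, 5, 3, 1]  # input, hidden layer 1, hidden layer 2, output


def init_network(sizes, seed=0):
    """Create one weight matrix and one bias vector per pair of adjacent layers."""
    rng = random.Random(seed)
    return [
        {"W": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
         "b": [0.0] * n_out}
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(network, x):
    """Propagate one record through the network and return the predicted MEDV."""
    activations = list(x)
    for index, layer in enumerate(network):
        z = [sum(w * a for w, a in zip(row, activations)) + bias
             for row, bias in zip(layer["W"], layer["b"])]
        is_output = index == len(network) - 1
        # Assumed activations: tanh in hidden layers, identity at the output.
        activations = z if is_output else [math.tanh(v) for v in z]
    return activations[0]


def count_parameters(network):
    """Number of weights and biases that training has to estimate."""
    return sum(len(layer["W"]) * len(layer["W"][0]) + len(layer["b"])
               for layer in network)


net = init_network(LAYER_SIZES)
print(count_parameters(net))  # 92 = (13*5 + 5) + (5*3 + 3) + (3*1 + 1)
```

Even this small network has 92 adjustable numbers, compared with 14 (13 slopes and an intercept) in a linear regression on the same variables.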

The numbers in Figure 3 are the final estimated weights of the connections between the nodes. The ANN programme adjusts the weights in each iteration of the BACK-PROPAGATION process so as to reduce the error (the difference between the actual price and the predicted price), and it stops once the error is smaller than the designated threshold (0.01 in this case) (Figure 4).
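The adjust-and-stop loop described above can be sketched as follows. To keep it short, the sketch uses a toy network (one input, one hidden layer of 3 tanh nodes) and toy data rather than the Boston set, and it assumes that the 0.01 threshold applies to the mean squared error; how XLSTAT actually measures the error is not stated here.

```python
import math
import random


def train_network(xs, ys, hidden=3, lr=0.05, threshold=0.01,
                  max_epochs=5000, seed=0):
    """Full-batch gradient descent with back-propagation.

    Returns the mean squared error measured at the start of each epoch.
    Training stops when the error falls below `threshold` or after
    `max_epochs`, whichever comes first.
    """
    rng = random.Random(seed)
    w1 = [rng.uniform(-1, 1) for _ in range(hidden)]  # input -> hidden
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-1, 1) for _ in range(hidden)]  # hidden -> output
    b2 = 0.0
    n = len(xs)
    history = []
    for _ in range(max_epochs):
        g_w1, g_b1, g_w2 = [0.0] * hidden, [0.0] * hidden, [0.0] * hidden
        g_b2 = 0.0
        loss = 0.0
        for x, y in zip(xs, ys):
            # Forward pass: compute the prediction and its error.
            h = [math.tanh(w1[j] * x + b1[j]) for j in range(hidden)]
            err = sum(w2[j] * h[j] for j in range(hidden)) + b2 - y
            loss += err * err / n
            # Backward pass: push the error back to every weight.
            d_out = 2.0 * err / n
            g_b2 += d_out
            for j in range(hidden):
                g_w2[j] += d_out * h[j]
                d_hid = d_out * w2[j] * (1.0 - h[j] ** 2)
                g_w1[j] += d_hid * x
                g_b1[j] += d_hid
        history.append(loss)
        if loss < threshold:
            break
        # Move every weight a small step against its gradient.
        for j in range(hidden):
            w1[j] -= lr * g_w1[j]
            b1[j] -= lr * g_b1[j]
            w2[j] -= lr * g_w2[j]
        b2 -= lr * g_b2
    return history


# Toy data: a curved relationship that a straight line cannot fit.
xs = [k / 10 for k in range(-10, 11)]
ys = [x * x for x in xs]
history = train_network(xs, ys)
print(len(history), history[0], history[-1])
```

The printed values are the number of epochs used, the starting error and the final error; whether the threshold is reached before `max_epochs` depends on the learning rate and the starting weights.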

After carrying out this TRAINING process, the ANN has LEARNED how to predict housing prices in Boston from these 13 variables. The training allows the ANN to make reasonably accurate predictions for this dataset, but it does not guarantee a good prediction in the next attempt, because the effects of the variables on housing price are dynamic and may change over time. In other words, even if it predicts their effects 100% correctly for last month’s transactions, it does not imply that the same effects will still hold in later transactions. This is, however, one of the reasons why AI can outperform human beings. Normally, after doing several hedonic pricing analyses, we would stop doing new ones and assume that the effect of each variable is fixed and does not change over time. Unfortunately, such an assumption has been found to be wrong. An AI system can feed every new data point into the model automatically to keep its predictions accurate, and even if the effects change over time, the AI system can track the changes continuously and in real time.
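A toy sketch of this continuous updating is given below. It assumes a single variable whose true effect on price has drifted from 2 to 3, with noise-free transactions; the numbers and the function name `online_update` are invented for the illustration.

```python
def online_update(beta, x, price, lr=0.1):
    """One stochastic-gradient step on the squared error of price = beta * x."""
    error = price - beta * x
    return beta + lr * error * x


frozen_beta = 2.0   # estimated once from old data and never revisited
online_beta = 2.0   # starts from the same estimate but keeps learning
true_beta = 3.0     # hypothetical effect of the variable after the market shifts

# Thirty new transactions arrive one by one (noise-free for clarity).
new_observations = [0.5, 1.0, 1.5] * 10
for x in new_observations:
    price = true_beta * x
    online_beta = online_update(online_beta, x, price)

print(frozen_beta, online_beta)
```

The frozen estimate stays at 2, while the continuously updated one moves towards the new effect of 3 as fresh transactions are fed in.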

There have been some studies comparing the accuracy of ANN against the Hedonic Pricing Model, for example a comparison study on the New Zealand housing market [3]. However, it is interesting that the scholars compared ANN results with hedonic model results, which seems to assume that the hedonic model results are correct (which I doubt). A better way is to let the trained ANN predict future real transaction data, and to feed the accuracy of those predictions back to refine the model. Unfortunately, so far very few studies have adopted this approach to compare the performance of ANN and the Hedonic Pricing Model.
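One way such a forward-looking comparison might be organised is sketched below: fit each model on earlier transactions only, then score it on later ones it has never seen. The record layout (`date` and `price` keys) and the function names are assumptions for this sketch, and each model is represented simply as a function that takes training records and returns a predictor.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted prices."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def chronological_split(transactions, train_fraction=0.8):
    """Earlier transactions for training, later ones for testing."""
    ordered = sorted(transactions, key=lambda t: t["date"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def evaluate(fit_functions, transactions, train_fraction=0.8):
    """Fit every model on the past and score it on the future.

    `fit_functions` maps a model name to a function that takes the training
    records and returns a predictor (record -> predicted price).
    """
    train, test = chronological_split(transactions, train_fraction)
    actual = [t["price"] for t in test]
    scores = {}
    for name, fit in fit_functions.items():
        predict = fit(train)
        scores[name] = rmse(actual, [predict(t) for t in test])
    return scores


def fit_mean(train):
    """Baseline model: always predict the average training price."""
    mean_price = sum(t["price"] for t in train) / len(train)
    return lambda t: mean_price
```

A hedonic regression and an ANN could be plugged in as further entries of `fit_functions`, so that both are judged against real later transactions rather than against each other.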

Theoretically, an ANN with ZERO HIDDEN LAYERS and a linear output node is equivalent to a HEDONIC PRICING MODEL of linear regression (you may try running it yourself to verify). Both estimate the best-fit line by the least squares method. But when the ANN contains HIDDEN LAYERS, the interactive effects among the variables are also estimated. Thus, if there are strong and complex interactive effects among variables, the ANN should outperform the Hedonic Pricing Model. Can anyone suggest how to test this hypothesis?
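The zero-hidden-layer equivalence can be checked with a small sketch. It assumes a single explanatory variable, a linear (identity) output node and a squared-error loss, and it uses five made-up data points; under those assumptions, gradient descent on the network should arrive at the same slope and intercept as the closed-form least squares solution.

```python
def fit_ols(xs, ys):
    """Closed-form least squares for price = slope * x + intercept."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar


def fit_zero_hidden_layer_net(xs, ys, lr=0.05, epochs=2000):
    """A 'network' with no hidden layer: one weight, one bias, identity output.

    Trained by plain gradient descent on the mean squared error.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Made-up data: centred room counts and prices, for illustration only.
rooms = [-2.0, -1.0, 0.0, 1.0, 2.0]
prices = [1.1, 2.9, 5.2, 6.8, 9.1]

print(fit_ols(rooms, prices))
print(fit_zero_hidden_layer_net(rooms, prices))
```

The two printed pairs should agree to many decimal places. The same reasoning extends to 13 variables, where the network's 13 weights and one bias play the role of the regression coefficients and the intercept.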

References

[1] Yiu, C.Y. (2018) From Automation to Machine Learning, Jan 3. https://medium.com/@edwardyiu/from-automation-to-machine-learning-c61fefe483f5

[2] Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air, Journal of Environmental Economics & Management, 5, 81–102.

[3] Limsombunchai, V., Gan, C. and Lee, M. (2004) House Price Prediction: Hedonic Price Model vs. Artificial Neural Network, American Journal of Applied Sciences 1 (3): 193–201. https://thescipub.com/pdf/10.3844/ajassp.2004.193.201
