Build My First AVM by Sklearn in Colab

6 min read · Jun 7, 2021


Nowadays bankers and real estate agents can get an almost immediate estimate of a house price from a mobile app just by specifying housing attributes. It also helps mortgage borrowers get a better idea of how much they can borrow from the bank. A Residential Automated Valuation Model (AVM) runs a machine learning algorithm that accounts for the house’s size, number of rooms, housing quality attributes, etc. to make up-to-date house price predictions. Simply put, it is a house price estimator (Figure 0).

Figure 0 An AVM — house price estimator by inputting housing attributes

Last week, we discussed how to predict house prices with FB Prophet (Yiu, 2021a), a time series predictor that analyzes trend, seasonality and holidays. In other words, the only dimension of variation is TIME! This article, however, builds a cross-sectional estimator: a simple AVM that estimates house prices from housing attributes.

Traditionally, we use the Hedonic Price Model, a statistical regression model, to identify the effect of each housing attribute on house prices. More recently, free machine learning libraries for Python such as Scikit-learn (sklearn) have become available, and they include simple algorithms based on the linear regression approach. They allow us to build our own AVM to estimate house prices. I am learning to use sklearn, and here is my first trial of building a simple AVM. To keep it very simple, I do only the following three tasks:

  1. read and scatter plot a csv data file of house prices and attributes from Google Drive into Colab;
  2. estimate a house price from given attributes using sklearn's LinearRegression; and
  3. compare the results with statistical linear regression.

If you do not know how to use Colab, please refer to my previous article (Yiu, 2020). For more information on Scikit-learn, see its official documentation. An introduction to Colab and sklearn, with a simple example of car price prediction, is provided by Logallo (2021).

This simple example has only 12 routines; see whether they work for you. All lines that start with a # sign are remarks only:

1. install sklearn

#install scikit-learn
! pip install scikit-learn

2. import tools and Linear Regression Algorithm

#import tools: NumPy for Advanced linear algebra, Matplotlib for Visualization and data plotting, Pandas for Data manipulation and analysis, Seaborn for heatmap plot, Sklearn for Optimization.
#from sklearn import the algorithm to split the training set and testing set, import the Linear Regression algorithm to estimate the coefficients for the AVM, import r2_score to report the R-squared.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

3. read the csv datafile from Google Drive (your directory name and file name can be different). You may also use other data entry methods, such as reading from a URL.

#This example demonstrates how to retrieve a csv file from a google drive first
#click FILE above and click 'Locate in Drive' to upload the csv file to your authorized google drive, it requires an authorization process
from google.colab import drive
#mount the google drive at /content/drive (assumed mount point, matching the relative path below)
drive.mount('/content/drive')
#specify the drive/.../filename to read
data=pd.read_csv("drive/MyDrive/Colab Notebooks/AKL Housing Prices 2017.csv")

A csv data template file (akl_housing_prices_2017_template.csv) is available for download at my GREA webpage (4. Build the 1st AVM in Colab), and you can add more data to it to try.

4. show the summary statistics of the data

#show the information and descriptions of the data collected
#only numeric data can be processed by regression models
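
A minimal sketch of this step, assuming the remarks refer to pandas' standard `DataFrame.info()` and `DataFrame.describe()`. The table below is made up for illustration; only the column names `Bedrm`, `Bldg_Area` and `price` are taken from this article.

```python
import pandas as pd

# Hypothetical toy data; the real notebook uses the Auckland csv file instead.
data = pd.DataFrame({
    "Bedrm": [2, 3, 4, 2, 3],
    "Bldg_Area": [78, 120, 160, 65, 110],
    "price": [755000, 980000, 1250000, 690000, 905000],
})

# Column names, dtypes and non-null counts
data.info()

# Count, mean, std, min, quartiles and max for each numeric column
summary = data.describe()
print(summary)

# Keep only numeric columns, since regression models cannot use text columns directly
numeric_data = data.select_dtypes(include="number")
```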

5a. plot scatterplots between each attribute and price

#plot scatterplots of each attribute with price
#if dataset contains non-numeric data, then add if data[col].dtypes!="O" inside the brackets to exclude them
attributes = [col for col in data.columns]
for attribute in attributes:
    sns.scatterplot(x = data[attribute], y = data['price'])
    plt.show() #draw each scatterplot as a separate figure

Figure 1 Scatterplots of No. of Bedrooms and Building Floor Area with House Prices.

The scatterplots show a positive correlation between Bldg_Area and Price, and most of the houses are smaller than 150 square metres (sm).

5b. plot a heatmap of all the correlation coefficients

#Plot all the Pearson Correlation Coefficients by a Heatmap
plt.figure(figsize=(20,20))
cor = data.corr()
sns.heatmap(cor, annot=True) #any further styling arguments are omitted here
plt.show()

Figure 2 a heatmap of the correlation coefficients

5c. Report a specified correlation coefficient

#Report a specified correlation coefficient
print(data[["Bedrm","price"]].corr())
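
To see what the reported number means, here is a small sketch that computes the Pearson correlation coefficient by hand and checks it against `DataFrame.corr()`. The data values are made up for illustration; only the column names `Bedrm` and `price` come from the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Auckland dataset
data = pd.DataFrame({
    "Bedrm": [1, 2, 2, 3, 4, 5],
    "price": [520000, 700000, 755000, 930000, 1100000, 1400000],
})

x = data["Bedrm"].to_numpy(dtype=float)
y = data["price"].to_numpy(dtype=float)

# Pearson r = sum of cross-deviations / sqrt(product of the sums of squared deviations)
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()
)

# pandas computes the same quantity (Pearson is the default method)
r_pandas = data[["Bedrm", "price"]].corr().loc["Bedrm", "price"]

print("manual:", r_manual, "pandas:", r_pandas)
```

A value close to 1 means price tends to rise almost linearly with the number of bedrooms; a value close to 0 means there is no linear relationship.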

6. Build a simple AVM

#Build a simple AVM by splitting the dataset into training set and testing set
#Here specify a test set of a randomized 10% of the data
X=data.drop(['price'], axis=1) #axis=1 means along the column, axis=0 means along the row
y=data['price'] #the target to be estimated
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.1,random_state=42)
#Machine Learning by Linear Regression
lr=LinearRegression()
lr.fit(X_train,y_train)

First, it defines X to be the dataset with the ‘price’ column dropped, and defines y as the ‘price’ data. The train_test_split function randomly assigns 10% (0.1) of the data to the testing set: X_test, y_test. The remaining 90% of the data forms the training set: X_train, y_train. Then, it applies the linear regression algorithm [lr=LinearRegression()] and fits it to the training data [lr.fit(X_train, y_train)] to estimate the coefficient of each attribute.
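
The split-and-fit step can be tried on its own with synthetic data. This is a sketch, not the article's dataset: the attribute names `Bedrm` and `Bldg_Area` are borrowed from the example, while the numbers and the price formula are invented so that the true coefficients are known in advance.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): price = 100000 + 50000*Bedrm + 4000*Bldg_Area + noise
rng = np.random.default_rng(42)
n = 200
data = pd.DataFrame({
    "Bedrm": rng.integers(1, 6, size=n),
    "Bldg_Area": rng.uniform(40, 250, size=n),
})
data["price"] = (100000 + 50000 * data["Bedrm"] + 4000 * data["Bldg_Area"]
                 + rng.normal(0, 20000, size=n))

X = data.drop(["price"], axis=1)  # attributes only
y = data["price"]                 # target

# Hold out a random 10% as the testing set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)

lr = LinearRegression()
lr.fit(X_train, y_train)  # estimate the intercept and one coefficient per attribute

print(len(X_train), len(X_test))       # 180 training rows, 20 testing rows
print(dict(zip(X.columns, lr.coef_)))  # should be close to 50000 and 4000
```

Fixing random_state makes the split reproducible, so the same rows land in the testing set every time the cell is run.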

9. Report R-squared and Compare the Actual and the Predicted (Estimate)

#Report R-squared and MSE
y_predLR = lr.predict(X_test)
r2=r2_score(y_test, y_predLR)
print("R-Squared", format(r2))
mse = metrics.mean_squared_error(y_test, y_predLR)
print("Mean Squared Error {}".format(mse))
#compare the actual and the predicted in 2 decimal places (y_test, y_predLR)
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_predLR})
print(df.round(2))

After training the model, let’s test the accuracy of the AVM estimates by feeding it the testing data. Here, y_predLR is defined as the estimates produced from the testing dataset [y_predLR=lr.predict(X_test)]. The explanatory power of the estimates, i.e. R-squared, is reported [r2_score(y_test, y_predLR)], together with the mean squared error. The actual house prices (y_test) and the AVM estimates (y_predLR) are also tabulated for comparison.
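
For readers who want to see what these two numbers measure, here is a small sketch that computes R-squared and the mean squared error by hand and checks them against sklearn. The actual and predicted prices are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted prices
y_test = np.array([755000.0, 980000.0, 1250000.0, 690000.0])
y_pred = np.array([800000.0, 950000.0, 1200000.0, 720000.0])

# MSE: average of the squared prediction errors
mse_manual = np.mean((y_test - y_pred) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("R-squared", r2_manual, r2_score(y_test, y_pred))
print("MSE", mse_manual, mean_squared_error(y_test, y_pred))
```

Because MSE is in squared dollars, its square root (RMSE) is often easier to read as a typical error in dollars.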

10. plot the scatterplot comparing the Actual and the Predicted

#scatterplot the actual and the predicted
plt.scatter(y_test, y_predLR, color='red')

Here are the results. Figure 3 shows the scatterplot of the Actual House Prices against the Predicted House Prices by the sklearn ML algorithm. The R-squared is about 81.5%.

Figure 3 Actual and Predicted by sklearn LR

11. Make a new estimate by giving a set of values for the attributes

#make a new prediction
Xnew = [[2,1,78,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0]]
ynew = lr.predict(Xnew)
print("house price estimate", ynew)

After training and testing, the AVM is now ready to use. Let’s try feeding it new data to see how good it is. Here Xnew is a new data array representing a house with 2 bedrooms, 1 bathroom, size=78sm, freehold land, built in 2010 and transacted in June 2017 in Auckland Center. The AVM estimate is $814,752. If you compare it with the fourth record in the datafile above, a house with the same attributes was transacted at $755,000. The result is not bad, right?

house price estimate [814752.3465421]
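
Typing 24 numbers in the right order is error-prone. One possible alternative, sketched here with a hypothetical four-attribute model rather than the article's 24 columns, is to start from a row of zeros that has the same columns as X and set only the attributes that apply. The column names `Bathrm` and `Freehold` and all numbers are assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training data with a dummy (0/1) attribute
X = pd.DataFrame({
    "Bedrm":     [2, 3, 4, 2, 3, 1],
    "Bathrm":    [1, 2, 2, 1, 1, 1],
    "Bldg_Area": [78, 120, 160, 65, 110, 45],
    "Freehold":  [1, 1, 0, 0, 1, 0],
})
y = pd.Series([755000, 980000, 1150000, 640000, 905000, 480000])

lr = LinearRegression().fit(X, y)

# Start with all attributes at 0, then fill in the ones we know
Xnew = pd.DataFrame(0, index=[0], columns=X.columns)
Xnew.loc[0, ["Bedrm", "Bathrm", "Bldg_Area", "Freehold"]] = [2, 1, 78, 1]

ynew = lr.predict(Xnew)
print("house price estimate", ynew)
```

Keeping the column names also keeps the new row in the same order as the training data, which a plain list of numbers cannot guarantee.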

12. The last step is simply a comparison with traditional statistical regression results. First, it reports the AVM’s intercept and coefficients for the attributes. Then it shows the regression results with the p-values.

#AVM Coefficients
print("Const", lr.intercept_, "Attributes Coeff", lr.coef_)
#c.f. Statistical Linear Regression Results
#X=data.drop(['price'], axis=1)
import statsmodels.api as sm
X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary()) #shows the coefficients with their p-values
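
The comparison can also be illustrated without the housing data. The sketch below uses synthetic numbers and, in place of statsmodels, NumPy's least-squares solver with an explicit constant column (the same role that sm.add_constant(X) plays). It is only meant to show that sklearn's LinearRegression and ordinary least squares give the same coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): y = 10 + 3*x1 - 2*x2 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 10 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

# sklearn fit
lr = LinearRegression().fit(X, y)

# Ordinary least squares with an explicit constant column
X2 = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X2, y, rcond=None)

print("sklearn:", lr.intercept_, lr.coef_)
print("OLS    :", beta[0], beta[1:])
```

The two sets of numbers should agree, because LinearRegression is an ordinary least squares fit. What statsmodels adds is the inference output, such as standard errors and p-values.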

I have uploaded my1stAVM to GitHub for knowledge co-creation.

I have also produced a YouTube video (Yiu, 2021b) that explains this in more detail.

[Caveat: this is just my first trial and the estimate is not sophisticated enough for real-life estimations. Please do not consider the results as any advice on house price predictions.]


Logallo, N. (2021) Machine Learning with Google Colab, Medium, Apr 4.

Yiu, C.Y. (2020) Learning Machine Learning — How to Code without Learning Coding, Medium, Feb 6.

Yiu, C.Y. (2021a) Forecasting by FB Prophet in Colab, Medium, May 31.

Yiu, C.Y. (2021b) My First AVM by sklearn in Colab, Youtube, June 7.




ecyY (Edward Yiu) — easy to understand why, easy to study why. Finding the truths scientifically is the theme.