Build My First AVM with scikit-learn in Colab

Figure 0: An AVM, a house price estimator that takes housing attributes as inputs.
#install scikit-learn
! pip install scikit-learn
#import tools: NumPy for linear algebra, Matplotlib for visualization and data
#plotting, Pandas for data manipulation and analysis, Seaborn for the heatmap,
#and scikit-learn for modelling.
#From sklearn, import train_test_split to split the data into training and
#testing sets, LinearRegression to estimate the coefficients for the AVM, and
#r2_score to report the R-squared.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn.metrics as metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
#This example first demonstrates how to retrieve a csv file from Google Drive
#click File above, then 'Locate in Drive', and upload the csv file to your Google Drive; mounting the drive below requires an authorization step
from google.colab import drive
drive.mount('/content/drive/')
#specify the drive/.../filename to read
data=pd.read_csv("drive/MyDrive/Colab Notebooks/AKL Housing Prices 2017.csv")
data.head()
#show the information and descriptions of the data collected
#only numeric data can be processed by regression models
data.info()
data.describe()
#plot a scatterplot of each attribute against price
#if the dataset contains non-numeric columns, add a check such as
#if data[attribute].dtypes!="O" inside the loop to exclude them
attributes = [col for col in data.columns]
attributes
for attribute in attributes:
    sns.scatterplot(x = data[attribute], y = data['price'])
    plt.show()
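If the CSV does include non-numeric columns (a suburb name, say), another way to keep only the columns a regression model can consume is pandas' `select_dtypes`. A minimal sketch on a made-up toy frame (not the housing CSV):

```python
import pandas as pd

# Toy frame mixing numeric and text columns (stand-in for the housing CSV)
toy = pd.DataFrame({"Bedrm": [2, 3],
                    "suburb": ["CBD", "Epsom"],
                    "price": [650000, 910000]})

# Keep only the numeric columns
numeric = toy.select_dtypes(include="number")
print(list(numeric.columns))  # ['Bedrm', 'price']
```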
Figure 1: Scatterplots of the number of bedrooms and the building floor area against house price.
#Plot all the Pearson correlation coefficients as a heatmap
plt.figure(figsize=(20,20))
cor = data.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()
Figure 2: A heatmap of the correlation coefficients.
#Report specified correlation coefficients
print(data[["Bedrm","price"]].corr())
print(data[["Bldg_Area","price"]].corr())
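Rather than reading one pair at a time off the heatmap, the whole column of correlations with price can be ranked in one line with `corr()["price"].sort_values()`. A sketch on toy numbers (column names borrowed from the tutorial, values invented):

```python
import pandas as pd

toy = pd.DataFrame({"Bedrm": [2, 3, 4, 5],
                    "Bldg_Area": [70, 95, 120, 160],
                    "price": [600000, 820000, 990000, 1300000]})

# Correlation of every numeric column with price, strongest first
rank = toy.corr()["price"].sort_values(ascending=False)
print(rank.index[0])  # price correlates perfectly with itself
```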
#Build a simple AVM by splitting the dataset into a training set and a testing set
#Here, the test set is a randomized 10% of the data
X=data.drop(['price'], axis=1) #axis=1 means along the column, axis=0 means along the row
y=data['price']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.1,random_state=42)
#Machine learning by linear regression
lr=LinearRegression()
lr.fit(X_train,y_train)
#Report R-squared and MSE
y_predLR = lr.predict(X_test)
r2=r2_score(y_test, y_predLR)
print("R-Squared {}".format(r2))
mse = metrics.mean_squared_error(y_test, y_predLR)
print("Mean Squared Error {}".format(mse))
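As a sanity check on what these two numbers measure, both can be computed by hand on a tiny pair of vectors. A sketch with toy values (not the housing data):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 310.0])

# MSE: mean of squared residuals = (10^2 + 10^2 + 10^2) / 3
mse = mean_squared_error(y_true, y_pred)
# R^2: 1 - SS_res / SS_tot = 1 - 300 / 20000
r2 = r2_score(y_true, y_pred)
print(mse)            # 100.0
print(round(r2, 3))   # 0.985
```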
#compare the actual and the predicted values to 2 decimal places
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_predLR})
df.round(2)
#scatterplot the actual and the predicted
plt.scatter(y_test, y_predLR, color='red')
plt.show()
Figure 3: Actual vs. predicted prices by the scikit-learn linear regression.
#make a new prediction
Xnew = [[2,1,78,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0]]
ynew = lr.predict(Xnew)
print(ynew)
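Because `lr` was fitted on a DataFrame, recent scikit-learn versions warn when `predict` receives a bare list with no column names. Wrapping the new observation in a DataFrame that reuses the training columns avoids this. A minimal sketch with two made-up attributes (the real model has 24):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy training frame standing in for X_train
X_toy = pd.DataFrame({"Bedrm": [2, 3, 4], "Bldg_Area": [70, 100, 130]})
y_toy = [600000, 800000, 1000000]

model = LinearRegression().fit(X_toy, y_toy)

# The new observation carries the same column names as the training data
Xnew = pd.DataFrame([[3, 110]], columns=X_toy.columns)
print(model.predict(Xnew).shape)  # (1,)
```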
House price estimate: [814752.3465421]
#AVM Coefficients
print("Const", lr.intercept_, "Attributes Coeff", lr.coef_)
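A flat array of 24 coefficients is hard to read; pairing each coefficient with its column name makes the AVM easier to interpret. A sketch on a toy fit (invented numbers, column names borrowed from the tutorial):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

X_toy = pd.DataFrame({"Bedrm": [2, 3, 4, 5],
                      "Bldg_Area": [70, 100, 130, 160]})
y_toy = [600000, 800000, 1000000, 1200000]

model = LinearRegression().fit(X_toy, y_toy)

# One labeled coefficient per attribute
coeffs = pd.Series(model.coef_, index=X_toy.columns)
print(len(coeffs))  # 2
```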
#cf. the statistical linear regression results from statsmodels
#X=data.drop(['price'], axis=1)
#y=data['price']
import statsmodels.api as sm
X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())

ecyY — easy to understand why, easy to study why. Finding the truths scientifically is the theme.
