Data Analytics — Coding in Colab

Having written many articles on analytics, I want to write an index article that summarises the whole picture and shows how the articles relate to one another. First, data analytics is defined as follows:

“Data analytics is the pursuit of extracting meaning from raw data using specialized computer systems. These systems transform, organize, and model the data to draw conclusions and identify patterns.” (Informatica, 2021)

It therefore involves many processes, including data acquisition, data processing, data visualisation, data storage, data analysis, data forecasting, data simulations, predictive analytics by machine learning and artificial intelligence, decision making, etc. (Figure 1)

Figure 1 Data Analytics Processes. Source: the author

Each process can involve different software and different skills. Figure 2 shows some of the common software and apps for each process. For example, ParseHub can automatically scrape data from webpages, Parabola can automatically download data from APIs, and Excel is powerful in data processing, providing Power Query to handle large datasets. Data analysis can be divided into numerical data analysis and geographical data analysis. The former is served by many powerful statistical or econometric packages such as EViews, Stata and SPSS; the latter by ArcGIS and QGIS, to name just a few. Data visualisation can be done in many different packages, such as Power BI and Tableau. Data storage options include Google Cloud and Dropbox. AnyLogic is good at simulations, while Azure, Colab and Jupyter are development platforms for machine learning and artificial intelligence.

Figure 2 Examples of different software and apps for different data analytics processes

Having to learn and link so many different software packages and apps can cause a lot of communication and learning problems. Sometimes we have to use several packages to accomplish just one goal. I recently had a case that involved using ArcGIS to geocode a big dataset, then using Excel's Power Query and pivot tables to tabulate and extract the required data, and then importing the result into EViews to carry out econometric analysis.

That is the reason why I started to use Google Colab for data analytics tasks. First, it does not require local processing or data storage facilities, as all processing can be done on the CPU and GPU provided by Colab and the files can be saved to Google Drive / Cloud or GitHub. It also facilitates co-creation of knowledge and sharing of experience. Better still, it becomes a common platform for using different Python libraries. I have shown the use of OpenCV for face detection, NumPy for linear algebra, Pandas for panel data analysis, GeoPandas for GIS analysis, Contextily for geographical data visualisation, Matplotlib for numerical data visualisation, FbProphet for forecasting, TensorFlow Keras for machine learning (Yiu, 2021f), Scikit-learn for optimisation, etc. (Figure 3)

Figure 3 Tools and space accessible by Google Colab
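
To illustrate what a common platform can mean in practice, here is a minimal sketch of how the tabulate-then-regress part of a multi-tool workflow might look in a single Colab cell. It is not the code from any of my earlier articles. The toy records, the column names (district, year, floor_area, price) and the simple linear model are all hypothetical, chosen only to show Pandas and NumPy working side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records standing in for a geocoded dataset (not real data).
records = pd.DataFrame({
    "district": ["A", "A", "B", "B", "C", "C"],
    "year": [2020, 2021, 2020, 2021, 2020, 2021],
    "floor_area": [40.0, 50.0, 60.0, 70.0, 80.0, 90.0],
    "price": [4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
})

# Step 1: tabulate mean price by district and year,
# the role a pivot table would play in Excel.
mean_price = records.pivot_table(
    index="district", columns="year", values="price", aggfunc="mean"
)

# Step 2: fit price = intercept + slope * floor_area by ordinary least squares,
# a simple stand-in for the econometric step.
X = np.column_stack([np.ones(len(records)), records["floor_area"].to_numpy()])
y = records["price"].to_numpy()
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = coef

print(mean_price)
print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

Both steps share the same DataFrame in memory, so nothing has to be exported from one program and imported into another. A real analysis would of course need proper data and a properly specified model.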

