Piloting machine learning at speed – Utilizing Google Cloud and AutoML
Can modern machine learning tools do a week's work in an afternoon? Developing machine learning models has traditionally been a highly iterative process. A traditional machine learning project starts with selecting the data sets and then cleaning and pre-processing them. Only then can the actual development of the machine learning model begin.
It is very rare, virtually impossible, for a new machine learning model to make sufficiently good predictions on the first try. Development work traditionally involves a significant number of failed attempts, both in selecting algorithms and in fine-tuning them, which in technical terms means tuning hyperparameters.
All of this takes working time, in other words, money. What if, after cleaning the data, all the remaining development steps could be automated? What if the whole development project could be carried through as a fast-paced sprint in a single day?
Machine learning and automation
In recent years, the automation of building machine learning models (AutoML) has taken significant leaps. Roughly speaking, in traditional machine learning the data scientist builds a machine learning model and trains it with a large dataset. AutoML, on the other hand, is a relatively new approach in which the machine learning model builds and trains itself using a large dataset.
All the data scientist needs to do is tell the tool what the problem is. This can be a problem in machine vision, pricing or text analysis, for example. However, AutoML will not make data scientists unemployed. The workload shifts from fine-tuning the model to validating it and using explainable AI tools.
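To make the idea concrete, here is a minimal, purely illustrative sketch of the kind of search such automation performs: it tries several candidate models and hyperparameter values and keeps the one with the lowest validation error. It is not how GCP's AutoML is implemented. The data, the candidate models (fit_mean, fit_ridge) and the hyperparameter grid are all assumptions made up for this example.

```python
# Toy illustration of automated model selection (not GCP AutoML itself).
# All function names and data below are hypothetical.

def fit_mean(xs, ys):
    """Baseline model: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_ridge(xs, ys, alpha):
    """One-feature ridge regression; alpha is the regularization hyperparameter."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / (sxx + alpha)
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def rmse(model, xs, ys):
    """Root mean squared error of a model on the given data."""
    return (sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)) ** 0.5

def auto_select(train, valid, alphas=(0.0, 1.0, 10.0, 100.0)):
    """Fit every candidate and return (name, model, validation RMSE) of the best."""
    candidates = [("mean", fit_mean(*train))]
    candidates += [(f"ridge(alpha={a})", fit_ridge(*train, a)) for a in alphas]
    scored = [(name, model, rmse(model, *valid)) for name, model in candidates]
    return min(scored, key=lambda item: item[2])

# Made-up data: price (in thousands) versus living area (in square metres).
train = ([50, 70, 90, 110, 130], [150, 205, 270, 330, 395])
valid = ([60, 100, 120], [182, 298, 365])

best_name, best_model, best_rmse = auto_select(train, valid)
print(best_name, round(best_rmse, 2))
```

A real AutoML service searches a far larger space of model families and settings, but the principle of scoring candidates on held-out data is the same.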
Google Cloud and AutoML used to solve a practical challenge
Some time ago, we at Codento tested Google Cloud Platform’s (GCP) AutoML-based machine learning tools. Our goal was to find out how well GCP’s AutoML tool solves the Kaggle House Prices – Advanced Regression Techniques challenge.
The goal of the challenge is to build the most accurate tool possible for predicting the selling prices of properties based on their characteristics. The data set used to build the pricing model contained data on approximately 1,400 properties: 80 different parameters in total that could potentially affect the price, as well as the actual sale prices. Some of the parameters were numerical, some were categorical.
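As a rough illustration of this numerical/categorical split, the sketch below reads a CSV and marks a column as numeric if every non-missing value parses as a number. The sample rows are illustrative only, and the split_columns helper is hypothetical; the column names follow the Kaggle data set, where "NA" marks a missing value.

```python
import csv
import io

def split_columns(csv_text, missing=("", "NA")):
    """Return (numeric, categorical) column names.

    A column counts as numeric when every non-missing value parses as a float.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    numeric, categorical = [], []
    for column in rows[0].keys():
        values = [row[column] for row in rows if row[column] not in missing]
        try:
            for value in values:
                float(value)
            numeric.append(column)
        except ValueError:
            categorical.append(column)
    return numeric, categorical

# Illustrative sample rows, not the full data set.
sample = """Id,LotArea,Neighborhood,YearBuilt,SalePrice
1,8450,CollgCr,2003,208500
2,9600,Veenker,1976,181500
3,11250,NA,2001,223500
"""

numeric, categorical = split_columns(sample)
print("numeric:", numeric)
print("categorical:", categorical)
```

A heuristic like this cannot tell when a number is really a category code (MSSubClass in the Kaggle data is one such column), which is why the column types still need a human check.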
Building a model in practice
The data used was pre-cleaned, so the first phase of building the machine learning model was already complete. First, the data set, a file in CSV format, was uploaded as is to GCP’s BigQuery data warehouse. The upload took advantage of BigQuery’s ability to detect the table schema directly from the file structure. The AutoML Tabular feature of the Vertex AI tool was used to build the actual model.
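The article does not say whether the upload was done in the console or in code. For readers who prefer code, the following is a minimal sketch of a CSV load with schema auto-detection using the google-cloud-bigquery Python client. The project, dataset, table and file names are hypothetical placeholders, and running it requires the client library and GCP credentials.

```python
def table_id(project, dataset, table):
    """Build the fully qualified BigQuery table ID 'project.dataset.table'."""
    return f"{project}.{dataset}.{table}"

def load_csv_with_autodetect(csv_path, destination):
    """Upload a local CSV file to BigQuery and let BigQuery infer the schema."""
    # Imported lazily so this sketch can be read and imported without the SDK.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses application default credentials
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # the first row holds the column names
        autodetect=True,      # infer column names and types from the file
    )
    with open(csv_path, "rb") as source_file:
        job = client.load_table_from_file(
            source_file, destination, job_config=job_config
        )
    job.result()  # block until the load job has finished
    return client.get_table(destination)

# Hypothetical identifiers; replace with your own project, dataset and table.
DESTINATION = table_id("my-project", "house_prices", "train")

# Example call (needs credentials and a local train.csv):
# table = load_csv_with_autodetect("train.csv", DESTINATION)
# print(table.num_rows, len(table.schema))
```

With autodetect enabled, BigQuery samples the file to guess column types, so it is worth reviewing the resulting schema before training on it.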
After some clicking, the tool had been told which of the price-predicting parameters were numeric and which were categorical. In addition, the tool was told which column contains the value to be predicted. All of this took about an hour of work. After that, the training was started and we began waiting for the results. About 2.5 hours later, GCP sent an automated email stating that the model was ready.
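The configuration above was done by clicking through the console. As an alternative, the same choices (column types, target column, regression task) can be sketched with the Vertex AI Python SDK. This is not the procedure the authors used. The display names, the region, the table ID and the training budget are assumptions for illustration, and build_column_specs is a hypothetical helper.

```python
def build_column_specs(numeric, categorical):
    """Map each feature column to the transformation AutoML should apply."""
    specs = {name: "numeric" for name in numeric}
    specs.update({name: "categorical" for name in categorical})
    return specs

def train_price_model(project, location, bq_table, target, column_specs):
    """Create a tabular dataset from BigQuery and start AutoML regression training."""
    # Imported lazily so this sketch can be read and imported without the SDK.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)
    dataset = aiplatform.TabularDataset.create(
        display_name="house-prices",           # assumed name
        bq_source=f"bq://{bq_table}",
    )
    job = aiplatform.AutoMLTabularTrainingJob(
        display_name="house-prices-automl",    # assumed name
        optimization_prediction_type="regression",
        column_specs=column_specs,
    )
    # Blocks until training has finished and returns the trained model.
    return job.run(
        dataset=dataset,
        target_column=target,
        budget_milli_node_hours=1000,          # 1 node hour; an assumed budget
        model_display_name="house-prices-model",
    )

# Only a few of the 80 columns are listed here, for brevity.
specs = build_column_specs(
    numeric=["LotArea", "YearBuilt"],
    categorical=["Neighborhood"],
)

# Example call (needs credentials; identifiers are hypothetical):
# model = train_price_model(
#     "my-project", "europe-west4", "my-project.house_prices.train",
#     "SalePrice", specs,
# )
```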
The final result was a positive surprise
The accuracy of the model created by AutoML surprised the developers. GCP AutoML was able to independently build a pricing model that predicts home prices with approximately 90% accuracy. This level of accuracy in itself is in line with what pricing models generally achieve. What is noteworthy, however, is that developing this model took half a working day in total.
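The article does not state how the 90% figure was measured. For a regression model, one plausible reading (an assumption, not something the authors state) is a mean absolute percentage error of roughly 10%. The Kaggle challenge itself ranks submissions by the root mean squared error between the logarithms of the predicted and actual prices. The sketch below computes both on made-up prices.

```python
import math

def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (0.10 means 10%)."""
    return sum(abs(p - a) / a for a, p in zip(actual, predicted)) / len(actual)

def rmse_of_logs(actual, predicted):
    """Root mean squared error between log-predicted and log-actual prices."""
    squared = [(math.log(p) - math.log(a)) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up prices in which every prediction is off by exactly 10%.
actual = [200_000, 150_000, 320_000, 275_000]
predicted = [220_000, 135_000, 288_000, 302_500]

error = mape(actual, predicted)
print(f"MAPE: {error:.1%}")
print(f"RMSE of log prices: {rmse_of_logs(actual, predicted):.3f}")
```

Working on logarithms means that an error of a given percentage weighs the same for a cheap house as for an expensive one.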
However, the benefits of GCP AutoML do not end there. The model could be integrated into a GCP data pipeline with very little effort. It could also be exported as a container and deployed on other cloud platforms.
An approach that pays off in the future as well
For good reason, tools based on AutoML can be considered the latest major development in machine learning. Thanks to these tools, developing an individual machine learning model no longer has to be treated as a project or an investment. When the full potential of these tools is used, models can be built on a budget close to zero. New forecasting models based on machine learning can be built almost on a whim.
However, the effective deployment of AutoML tools requires a significant initial investment. The entire data infrastructure (data warehouses and lakes, data pipelines, and visualization layers) must first be built with cloud-native tools. Codento’s certified cloud architects and data engineers can help with these challenges.
GCP AutoML, https://cloud.google.com/automl/
Kaggle, House Prices – Advanced Regression Techniques, https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/
The author of the article is Jari Rinta-aho, Senior Data Scientist & Consultant, Codento. Jari is a consultant and physicist interested in machine learning and mathematics, with extensive experience in utilizing machine learning in nuclear energy. He has also taught physics at several universities and led international research projects. Jari’s interests include ML-Ops, AutoML, Explainable AI and Industry 4.0.