This article presents an example cloud solution for an end-to-end data science architecture. It is based on Azure, although other vendors offer comparable services.
Let’s start by focusing on the “Data Lake Storage” component. We want all of our data in one place so that we can join data from various data sources. In this scenario, we have chosen to load everything into a Data Lake, and we use Azure Data Factory to orchestrate the data loads from the Various Data Sources.
For the data lake, we will be using ELT rather than ETL. This means that we will load the data as-is rather than transforming it as required for each use case. There are several disadvantages to transforming the data as we load it (ETL). Some data science teams, the end users of ETL’ed data, have been hitting their heads against these disadvantages for years:
– The data becomes structured for a specific purpose, for example finance and risk reporting. Information that is relevant for marketing and customer analytics can be lost. In the best case, the data has to be restructured again for the new purpose.
– The transformation step may have bugs, and the long-term data becomes permanently mangled. The accounting department may have audited the aggregate profit and loss numbers, but they would only have checked that the data was fit for their purpose. The bugs come out when you try to use the data for a new purpose.
– The transformation step requires a development team to write the transformation code. The organisation may not have the budget to write code for every field, so some fields may be left out.
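The contrast above can be sketched in a few lines of Python. This is a minimal, hypothetical illustration (the file layout, field names, and helper functions are all invented, not part of any Azure API): in ELT, raw records land in the lake untouched, and each use case applies its own transformation at read time.

```python
import json
from pathlib import Path

def load_raw(records, lake_dir, source, load_date):
    """ELT 'load' step: persist records exactly as received,
    partitioned by source system and load date."""
    path = Path(lake_dir) / source / load_date
    path.mkdir(parents=True, exist_ok=True)
    out = path / "data.json"
    out.write_text(json.dumps(records))  # no transformation: nothing is lost
    return out

def finance_view(raw_file):
    """One use case's 'transform' step, applied at read time:
    aggregate sales for finance reporting."""
    records = json.loads(Path(raw_file).read_text())
    return sum(r["amount"] for r in records if r["type"] == "sale")

def marketing_view(raw_file):
    """A different use case reads the same untouched raw data
    with its own transformation: the set of active customers."""
    records = json.loads(Path(raw_file).read_text())
    return {r["customer_id"] for r in records}
```

Because both views read the same untouched raw file, a bug in one team’s transformation cannot mangle the stored data, and a new use case never needs the original load to be restructured.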
We will build and deploy our machine learning models within the Azure Machine Learning Services framework. Given that most of the underlying model training libraries are open source, you may be wondering why. Azure Machine Learning Services gives us a nice workflow and an MLOps pipeline, and it makes it easier to deploy our model somewhere customer-facing. We also have the option of using Azure’s AutoML, which could be a good first iteration for a supervised learning use case. In some organisations, your team will consist only of software engineers; in that case, AutoML might be your best option.
The code is written on the Data Scientist’s Local Machine, and the model can be trained either on the Local Machine or on cloud infrastructure such as the Training VM in the diagram above.
Our business has some kind of customer-facing Application. The Application requests and receives real-time decisions from our machine learning models, which can be deployed to a Kubernetes Cluster. The Kubernetes Cluster can scale the number of containers as required by the demand on our system.
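To make the real-time flow concrete, here is a sketch of the scoring logic each container might run. This is a stand-in, not Azure ML’s actual scoring contract: the “model” is a hard-coded threshold rule and the feature names are invented. In a real deployment, a serialised model is loaded when the container starts and an HTTP layer wraps the per-request handler.

```python
# Loaded once at container start-up, not on every request.
# In practice this would be a deserialised trained model.
MODEL = {"threshold": 0.5}

def predict_proba(features):
    """Stand-in for the trained model's probability output
    (a hypothetical weighted sum of two invented features)."""
    score = 0.3 * features["recency"] + 0.7 * features["frequency"]
    return min(max(score, 0.0), 1.0)

def handle_request(payload):
    """Per-request handler: turn a feature payload from the
    Application into a real-time decision."""
    proba = predict_proba(payload["features"])
    decision = "offer" if proba >= MODEL["threshold"] else "no_offer"
    return {"decision": decision, "score": proba}
```

Because each container is stateless apart from the model loaded at start-up, Kubernetes can add or remove identical replicas as demand rises and falls.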
Some machine learning models may be used to predict who should get an email offer or who should be contacted by a relationship manager. These predictions are loaded into the data lake and also into our CRM.
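This batch flow can be sketched as follows. Everything here is a hypothetical stand-in: the scoring rule, the field names, and the lake and CRM (represented as a plain list and dict) are invented for illustration; in the architecture above, the lake would be Data Lake Storage and the CRM a real system of record.

```python
def batch_score(customers, threshold=0.6):
    """Score each customer offline and decide the next-best action:
    a relationship-manager contact for high scores, an email otherwise."""
    predictions = []
    for c in customers:
        score = c["engagement"]  # stand-in for a real model's output
        action = "contact" if score >= threshold else "email"
        predictions.append({"customer_id": c["id"],
                            "score": score,
                            "action": action})
    return predictions

def publish(predictions, lake, crm):
    """Write the same predictions to both destinations."""
    lake.extend(predictions)                 # append-only load into the lake
    for p in predictions:
        crm[p["customer_id"]] = p["action"]  # upsert into the CRM
```

Writing the predictions to the lake as well as the CRM means future models and reports can join against past decisions.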
Our Business Users require specific reports, which need to be refreshed regularly. For example, the accounting department will need to know profit and loss, and the marketing teams will need to know how campaigns are performing. We can serve these reports as Dashboards, and we will probably have some kind of Dashboard Server to serve and control access to the Dashboards from one central point.
The Dashboards will need specific, aggregated data. The data will need to be accessed quickly. For this purpose, we store certain materialised views in the Azure SQL Data Warehouse.
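The materialised-view pattern can be sketched with SQLite standing in for Azure SQL Data Warehouse (the table and column names are invented for illustration): a pre-aggregated table is refreshed on a schedule, so dashboard queries read a few summary rows instead of scanning raw transactions.

```python
import sqlite3

# In-memory SQLite as a stand-in for the data warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (dept TEXT, amount REAL);
    INSERT INTO transactions VALUES
        ('retail', 100.0), ('retail', -40.0), ('online', 25.0);
    -- The 'materialised view': a pre-aggregated table the dashboards read.
    CREATE TABLE profit_and_loss (dept TEXT PRIMARY KEY, total REAL);
""")

def refresh_profit_and_loss(conn):
    """Recompute the aggregate on a schedule, so dashboard
    queries stay fast regardless of transaction volume."""
    conn.execute("DELETE FROM profit_and_loss")
    conn.execute("""
        INSERT INTO profit_and_loss
        SELECT dept, SUM(amount) FROM transactions GROUP BY dept
    """)
    conn.commit()

refresh_profit_and_loss(conn)
```

The trade-off is freshness for speed: the Dashboard only sees data as of the last refresh, which is usually acceptable for regularly scheduled reports.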
Why should we bother with all of this infrastructure? Our next article will present a SWOT analysis of data science projects in large organisations.
Do you want to see real-life examples of data science architectures in large organisations? That is the focus of the Enterprise Data Science Architecture Conference, so reserve your place.
The Enterprise Data Science Architecture Conference focuses on how to properly productionise data science solutions at scale. We have confirmed speakers from ANZ Bank, Coles Group, SEEK, ENGIE, Latitude Financial, Microsoft, AWS and Growing Data. The combination of presentations is intended to paint a complete picture of what it takes to productionise a profitable data science solution. As an industry, we are still figuring out how best to build end-to-end machine learning solutions, and as the field matures, knowledge of best practices in end-to-end machine learning pipelines will become an essential skill. I invite you to view our list of confirmed speakers and talks at https://edsaconf.io and keep your skills current.
Meet the right people and up-skill. The conference will be held on the 27th of March at the Melbourne Marriott Hotel. It is a fully catered conference with coffee, lunch, morning/afternoon tea, and evening drinks and canapés. I invite you to reserve your place at https://edsaconf.io; it is the best place to learn the emerging best practices.
- Mitigating Information Asymmetry – Unsupervised vs Supervised Learning - 31 January 2020
- Data Virtualisation – The Value Add - 22 January 2020
- SWOT Analysis of Data Science Projects in Large Enterprises - 6 January 2020