Mitigating Information Asymmetry – Unsupervised vs Supervised Learning

The distinction between “supervised” and “unsupervised” learning has existed for a long time. Only recently, however, has there been such a storm of misunderstanding about what these terms actually mean. Read this article to help you discern which consultants your organisation needs to fire.

Learn to discern the wolves

A supervised learning problem has labels and features. Labels are what you want to predict. Features are the inputs that you will have access to at the time when you make a prediction. What you are asking is “learn to predict these outcomes.” We have some examples below to make it concrete.

  • A real estate automated valuation model (AVM). The labels are the prices at which houses have previously sold. The features are location, number of bedrooms, number of bathrooms, land size, …etc.
  • A credit risk model. The labels are whether or not a customer is deemed to have defaulted on their loan. The features are time with bank, number of credit enquiries, income, living expenses…etc.
  • Sales forecasting model. The labels are past sales numbers. The features are also past sales numbers. This is a time series forecasting example.
  • Image recognition. The labels are the description of what is in the image. These are usually discrete categories. For example, “cat”, “hot dog”, “tiger” and “number 7”. The features are the images.
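To make the supervised setting concrete, here is a minimal sketch of the AVM idea: fitting a straight line from a single feature (bedrooms) to price with ordinary least squares. The numbers are invented, and a real AVM would use many features and a far richer model.

```python
# Toy "AVM": learn price from one feature (bedrooms) with least squares.
# The data below is illustrative, not real sales history.
bedrooms = [2, 3, 4]                    # the feature
prices = [400_000, 500_000, 600_000]    # the labels: past sale prices

n = len(bedrooms)
mean_x = sum(bedrooms) / n
mean_y = sum(prices) / n

# Closed-form slope and intercept for one-feature linear regression.
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(bedrooms, prices)) \
        / sum((x - mean_x) ** 2 for x in bedrooms)
intercept = mean_y - slope * mean_x

def predict(n_bedrooms):
    """Predict a sale price from the number of bedrooms."""
    return intercept + slope * n_bedrooms

print(predict(5))  # 700000.0
```

The point is the shape of the problem, not the model: past labels plus features in, a prediction function out.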

An unsupervised learning problem does not have labels. Only features. What you are asking is “find me some interesting patterns in this data.” We have to think harder to come up with some good examples – see below.

  • Customer feedback text clustering. Our company has a free text feedback form. We take customers’ text feedback and assign it to several groups. We then read a few responses from each group to understand the nature of the feedback in that group.
  • Suppose that we don’t have house prices, but we still want to see what kinds of houses we can buy. The features are location, number of bedrooms, number of bathrooms, land size, …etc. We could cluster houses on these features and then investigate further – for example, examining how land sizes differ across suburbs.
  • Customer segmentation. Assign customers to segments. The marketing team uses the segments to plan their marketing strategy.
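To make the unsupervised setting concrete, here is a minimal one-dimensional k-means sketch that groups houses by land size alone – the kind of pattern-finding described above. The data is invented, and a real segmentation would cluster on many features at once, almost certainly with a library implementation.

```python
# Toy k-means on a single feature (land size in square metres).
def kmeans_1d(values, k, iters=100):
    lo, hi = min(values), max(values)
    # Spread the initial centroids evenly across the range of the data.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    clusters = []
    for _ in range(iters):
        # Assign each value to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster.
        new = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:  # converged
            break
        centroids = new
    return centroids, clusters

land_sizes = [180, 200, 210, 650, 700, 720]  # illustrative data
centroids, clusters = kmeans_1d(land_sizes, k=2)
print(clusters)  # [[180, 200, 210], [650, 700, 720]]
```

Nobody told the algorithm which houses belong together – there are no labels. The groups fall out of the data itself, and it is up to us to read them and decide whether they mean anything.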

How can we implement these systems in a large organisation? Check out our example data science architecture.

Productionise Properly – come see how it’s done. The Enterprise Data Science Architecture Conference focuses on how to properly productionise data science solutions at scale. We have confirmed speakers from ANZ Bank, Coles Group, SEEK, ENGIE, Latitude Financial, Microsoft, AWS and Growing Data. The combination of presentations is intended to paint a complete picture of what it takes to productionise a profitable data science solution. As an industry, we are figuring out how to best build end-to-end machine learning solutions. As the field matures, knowledge of best practices in end-to-end machine learning pipelines will become essential skills. I invite you to view our list of confirmed speakers and talks at https://edsaconf.io because this is the right place to meet the right people and up-skill.

Meet the right people and up-skill. The conference will be on the 27th March at the Melbourne Marriott Hotel. A fully catered conference with coffee, lunch, morning/afternoon tea and evening drinks & canapes. I invite you to reserve your place at https://edsaconf.io – this is the best place to learn the emerging best practices.

Data Virtualisation – The Value Add

The problem that we are solving: Organisations have several places where different pieces of data are stored. We need information from each source to paint a complete picture that adds business value.

Imagine that we have the following data sources in our organisation:

  • The company data warehouse, with most of the information required by the finance and risks teams. However it’s missing the fields that we need to build a cross-sell model.
  • The database with historical customer transactions – essential to your data science project. It’s owned by marketing and your team can’t have access.
  • The finance team’s special database. The company can’t calculate EBIT without it.
  • The company CRM.
  • The company CRM for B2B customers.
  • The web analytics data store. The web analytics team are kind enough to provide your team with monthly extracts.
  • The company data lake. It stores outputs from your team’s machine learning models. Some source system data has been loaded as well.

If we could have access to all of these pieces of information then we could build the best machine learning models, report the deepest insights and place our company firmly in first place. But how? Our data science team can’t get access to most of those databases. Copying them into the data lake is an ongoing two year project. Data virtualisation could help.

The data virtualisation software will connect to and query our data sources. Our data users will connect to and query the data virtualisation software as if it were any other database. They will be able to query and join all of the data across all of the data sources.

Users will only need to apply for access to one system and their credentials only need to be removed from one system if they leave the company. The data virtualisation software may also be able to mask certain sensitive fields for certain types of users. For example, we can hide customer names from teams who don’t need to know them.
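As a toy illustration of the “one query surface over many sources” idea – emphatically not real data-virtualisation software – SQLite’s ATTACH lets a single connection join tables that live in separate database files, which is roughly the experience a virtualisation layer provides over real source systems. All names and data below are made up.

```python
import os
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()
crm_db = os.path.join(workdir, "crm.db")
web_db = os.path.join(workdir, "web.db")

# Two separate "source systems".
with sqlite3.connect(crm_db) as c:
    c.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    c.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])

with sqlite3.connect(web_db) as c:
    c.execute("CREATE TABLE visits (customer_id INTEGER, pages INTEGER)")
    c.executemany("INSERT INTO visits VALUES (?, ?)", [(1, 12), (2, 3)])

# One connection, one query surface across both "sources".
conn = sqlite3.connect(crm_db)
conn.execute("ATTACH DATABASE ? AS web", (web_db,))
rows = conn.execute(
    "SELECT c.name, v.pages FROM customers c "
    "JOIN web.visits v ON v.customer_id = c.id ORDER BY c.name"
).fetchall()
conn.close()
print(rows)  # [('Alice', 12), ('Bob', 3)]
```

A real virtualisation product does this across heterogeneous systems (warehouses, CRMs, lakes), handles access control and masking centrally, and pushes work down to each source – but the user experience is the same: write one query, join everything.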

Data virtualisation is one piece of the picture. What you do with the data makes the difference between best practices and wasting money. As a specialist, you would have seen countless examples of adding value: increasing profit, saving lives, managing risk, automating manual labour. On the other hand, if you are a non-specialist, check out our example for non-specialists.

SWOT Analysis of Data Science Projects in Large Enterprises

Your organisation has started its data science journey – what should you watch out for? This broad brush analysis will hopefully resonate with your experience. You will see these Strengths, Weaknesses, Opportunities and Threats (SWOT) in most large organisations, on most data science projects. The leaders in the field are overcoming the weaknesses and threats described below – some of these leaders will be presenting at the Enterprise Data Science Architecture Conference.

Strengths: The strength of a large and established enterprise is its current business. Your organisation has built up its business over a long time. You have an established customer base, established processes, economies of scale and brand equity. If you are a bank, then you probably have a competitive advantage in cost of funds.

Weaknesses: To work properly, data science must be integrated into your organisation. You will need new processes, new teams, new specialised roles, new ways of working, new infrastructure. The data science architecture presented in the previous article is radically different from what most large organisations currently have implemented. For example, many ASX50 companies do not have real time personalisation on their websites. Change takes time in large organisations. The right companies are building their data science platforms right now.

Information asymmetry. Data science is a new field and everyone is a self-proclaimed expert. Senior leaders who have come up through the business side need to sift through the salespeople – despite the information asymmetry. Which consultants to engage? Whom to hire? Whom to promote? Whose experience is relevant?

You will also need to recruit leaders to take your company on this journey. These leaders will need to have relevant skills and experience. Managing an older-style BI team is not relevant experience. This is the opportunity for someone who understands the latest technology to level up. Meet the right leaders at the Enterprise Data Science Architecture Conference.

Opportunities: Properly productionised data science can increase the profitability of any large enterprise. Although some companies are further ahead than others, everyone is just beginning their journey. This is the opportunity to pull ahead of your competitors – if you get it right. Employees at all levels have the opportunity to grow their careers by building up a track record with the right experience.

Threats: If your competitors go further along the data science journey by a meaningful amount, then you will lose market share. Properly productionised data science can be an unfair advantage against an unprepared competitor. For example, your competitor could send a just-in-time retention offer before you have even noticed that you have acquired a new customer.

Career threats for technical staff. Working on successful, cutting-edge projects is career gold. Working with stale technologies and irrelevant KPIs is career death because you will be de-skilling yourself. Team members who understand this will leave to work on cutting-edge projects. On the other hand, team members who do work on cutting-edge projects will also leave when they find higher-paying jobs.

Career threats for management staff. Building a track record as a leader in a leading company is a great career boost. However, if you recruit the wrong team, the implementation of data science solutions in your company may head in the wrong direction.

Data Science Architecture – An Azure Example

This article will show an example cloud solution for an end-to-end data science architecture. It’s based on Azure – there are other great vendors out there as well.

[Diagram: Data Science Architecture – An Azure Example]

Let’s start to understand this by focusing on the “Data Lake Storage” component. We want to have all of our data in one place so that we can join data from various data sources. In this scenario, we have chosen to load everything into a Data Lake. We use Azure Data Factory to orchestrate the data loads – from the Various Data Sources.

For the data lake, we will be using ELT rather than ETL. This means that we will load the data as-is rather than transform it as required for each use case. There are several disadvantages to transforming the data when we load it (ETL). Some data science teams, the end users of ETL’ed data, have been hitting their heads against these disadvantages for years.
– The data becomes structured for a specific purpose. For example, finance and risk reporting. Information that’s relevant for marketing and customer analytics can be lost. In the best case, the data has to be restructured again for the new purpose.
– The transformation step may have bugs, and the long-term data becomes permanently mangled. The accounting department may have audited the aggregate profit and loss numbers, but the bugs only come out when you try to use the data for a new purpose. The accounting team would only have checked whether the data was fit for their purpose.
– The transformation step requires a development team to write the transformation code. The organisation may not have budget to write code for every field. Hence some fields may be left out.
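The ELT idea above can be sketched in a few lines: land the source extract untouched, and apply each use case’s transformation later, on read. The file layout and field names here are invented for illustration.

```python
# ELT sketch: load raw data unchanged, transform later per use case.
import json
import pathlib
import tempfile

# A source-system extract, exactly as received (amounts still strings, etc.).
raw_records = [
    {"id": 1, "amount": "100.50", "channel": "web"},
    {"id": 2, "amount": "75.00", "channel": "store"},
]

lake = pathlib.Path(tempfile.mkdtemp())

# LOAD: write the extract as-is -- no fields dropped, no schema imposed,
# so future use cases lose nothing.
raw_path = lake / "raw" / "sales" / "2024-01-01.json"
raw_path.parent.mkdir(parents=True)
raw_path.write_text(json.dumps(raw_records))

# TRANSFORM (later, per use case): marketing only needs channel totals today,
# but finance can transform the same raw file differently tomorrow.
records = json.loads(raw_path.read_text())
channel_totals = {}
for r in records:
    channel_totals[r["channel"]] = channel_totals.get(r["channel"], 0.0) + float(r["amount"])
print(channel_totals)  # {'web': 100.5, 'store': 75.0}
```

Because the raw file is preserved, a transformation bug mangles only a derived view, never the long-term data – which is exactly the failure mode of ETL described above.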

We will build and deploy our machine learning models within the Azure Machine Learning Services framework. Given that most of the underlying model training libraries are open source, you may be wondering “why?”. Azure Machine Learning Services gives us a nice workflow, an MLOps pipeline and makes it easier to deploy our model to somewhere customer facing. We also have the option of using Azure’s AutoML. AutoML could be a good first iteration for a supervised learning use case. In some organisations, your team will consist of only software engineers. In this case, AutoML might be your best option.

The code is written on the Data Scientist’s Local Machine, and the model can be trained either on the Local Machine or on cloud infrastructure such as the Training VM in the diagram above.

Our business has some kind of customer facing Application. The Application requests and receives real-time decisions from our machine learning models. Our machine learning models can be deployed to a Kubernetes Cluster. The Kubernetes Cluster can scale the number of containers as required by the demand on our system.
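To sketch what one of those deployed containers actually does per request – all feature names and coefficients below are invented; the real model would come from the training step above:

```python
import json
import math

# Invented toy "model": a logistic scorer over two illustrative features.
MODEL_COEF = {"tenure_months": 0.02, "visits_last_7d": 0.1}
INTERCEPT = -1.0

def score(features: dict) -> float:
    """Linear score squashed to a probability-like value in (0, 1)."""
    z = INTERCEPT + sum(MODEL_COEF[k] * features.get(k, 0.0) for k in MODEL_COEF)
    return 1.0 / (1.0 + math.exp(-z))

def handle_request(body: str) -> str:
    """What a POST /score handler in the container would do:
    parse the JSON payload, score it, return a JSON decision."""
    features = json.loads(body)
    s = score(features)
    return json.dumps({"score": round(s, 4), "decision": s > 0.5})

print(handle_request('{"tenure_months": 60, "visits_last_7d": 5}'))
```

Each container holds the model in memory and turns a small JSON request into a decision in milliseconds; Kubernetes simply runs more or fewer copies of this as demand changes.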

Some machine learning models may be used to predict who should get an email offer or who should be contacted by a relationship manager. These predictions are loaded into the data lake and also into our CRM.

Our Business Users require certain specific reports, which need to be refreshed regularly. For example, the accounting department will need to know profit and loss. The marketing teams will need to know how campaigns are performing. We can serve these reports as Dashboards. We will probably have some kind of Dashboard Server to serve and control access to the Dashboards from one central point.

The Dashboards will need specific, aggregated data. The data will need to be accessed quickly. For this purpose, we store certain materialised views in the Azure SQL Data Warehouse.
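Building such a materialised view amounts to precomputing the aggregate once, on a schedule, so the dashboard reads a small summary table instead of scanning raw transactions on every page load. The table and field names below are illustrative.

```python
# Sketch: precompute the aggregated "view" a campaign dashboard reads.
from collections import defaultdict

# Raw transactions (in reality, millions of rows in the lake/warehouse).
transactions = [
    ("2024-01", "campaign_a", 120.0),
    ("2024-01", "campaign_b", 80.0),
    ("2024-02", "campaign_a", 150.0),
]

# The materialised view: revenue by month and campaign, refreshed nightly.
revenue_by_month_campaign = defaultdict(float)
for month, campaign, amount in transactions:
    revenue_by_month_campaign[(month, campaign)] += amount

print(dict(revenue_by_month_campaign))
```

In the architecture above, this aggregation would run as SQL inside the warehouse rather than in Python – the principle is the same: pay the aggregation cost once, serve the dashboard many times.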

Why should we bother with all of this infrastructure? Our next article will present a SWOT analysis of data science projects in large organisations.

Do you want to see real live examples of data science architectures in large organisations? Reserve your place at the Enterprise Data Science Architecture Conference because that is exactly what the conference focuses on.

Adding Value with Data Science – An Example for Non-Specialists

You may be wondering “What is this data science thing and how does it help my business?” You may have built a career in an established branch of IT. You may be a business professional with a solid grounding in what makes your business work. Many non-specialist articles and presentations have made general statements such as “AI will revolutionise industry [X]”. In other words, “all fluff”.

This article cuts through the hype and provides a concrete example of how data science can be used to add business value. No fluff.

We will walk through a hypothetical scenario that is likely to be profitable if implemented. The terms “AI”, “machine learning” and “data science” are used as synonyms.

Here’s our scenario: We run a sports betting business. Customers can place bets on our website, mobile app or in our physical store. If we can recommend the right sports and bet types, at the right time, to the right customer, then we can increase the number of bets that a customer places. More bets leads to more turnover and lower variance, which lead to higher profits. 

But why use machine learning? If each customer were to bring us a seven-figure profit each year, then we could afford a relationship manager for each customer. No need for machine learning. However, each customer only brings in a few dollars of profit. We aim to spend a few cents per recommendation, per customer, to earn an additional dollar per customer. If we get this right, each customer will spend more money with us and our total profits will increase. So how do we do it?
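The unit economics above can be sanity-checked with back-of-envelope arithmetic (all figures are illustrative):

```python
# Back-of-envelope: "spend a few cents per customer to earn an extra dollar".
customers = 1_000_000
cost_per_customer = 0.05     # recommendation-serving cost, say 5 cents
uplift_per_customer = 1.00   # additional profit per customer

net_gain = customers * (uplift_per_customer - cost_per_customer)
print(f"net gain: ${net_gain:,.0f}")  # net gain: $950,000
```

At relationship-manager prices the same uplift would cost more than it earns; at machine prices the margin per decision is enormous. That asymmetry is the entire business case.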

An end-to-end solution

Our company already has the following components:

  • An existing system that takes bets.
  • Archives of past bets, past pages, past marketing contact activity and anything else.
  • A marketing team who need to track the ROI of this project.
  • User facing front ends; desktop, mobile, in store.

Our solution will add the following components:

  • A microservice that will serve recommendations to the front ends.
  • A modification to the front ends to display the recommendations that they are instructed to display.
  • A machine learning model that decides which recommendations to serve and to whom.
  • Dashboards for reporting campaign performance and ROI.
  • An end-to-end pipeline for building machine learning models, deploying them and tracking their ROI versus control groups. These pipelines are sometimes referred to as MLOps.
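One small but important piece of that pipeline – tracking ROI versus control groups – needs a stable treatment/control split. A common trick, sketched below with invented names, is to hash the customer id so the assignment is deterministic across the website, the app and the store, without any shared state.

```python
import hashlib

def assign_group(customer_id: str, holdout_pct: int = 10) -> str:
    """Deterministically map a customer to 'control' or 'treatment'.

    Hashing (rather than random.choice) means the same customer lands in
    the same group on every channel and every run, so the ROI dashboards
    compare like with like.
    """
    bucket = int(hashlib.md5(customer_id.encode()).hexdigest(), 16) % 100
    return "control" if bucket < holdout_pct else "treatment"

groups = {cid: assign_group(cid) for cid in ["cust_001", "cust_002", "cust_003"]}
print(groups)
```

Customers in the control group never see recommendations; the difference in spend between the two groups is the uplift the marketing dashboards report.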

How to integrate all of these components together? This is a massive topic that we will start exploring in the next article.

The Enterprise Data Science Architecture Conference will present real solutions that have been deployed in large companies. I invite you to reserve your place now because it is the best place to learn the emerging best practices.
