Do You Really Need Machine Learning?

By Lindsay Paterson

Machine Learning.

Companies and tech blogs are raving about how “artificial intelligence” is the future and how they’re going to apply “machine learning” to improve their tech and outpace the competition. But what actually is machine learning, how do you use it, and is it just the big buzzword of 2017?

Short answer: most of the time, yes – but where it is useful, it can be revolutionary.

So what is machine learning? In its rawest form, machine learning is the art of function approximation, or making an educated guess. It’s the same concept as a professional, say a plumber, having the experience to look at a leak in a house and quickly and accurately guess what caused it. In machine learning, we call this experience “big data.” With each individual problem the plumber sees and solves, she gets a new “data point” and can use this knowledge to solve similar problems she encounters in the future.

[Figure: the approximated function (red) is “close enough” to the pattern of the data (blue)]
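To make “function approximation” concrete, here’s a minimal sketch of the idea behind that picture: fit a simple curve to noisy observations, then use it to make educated guesses at points we never saw. The sine pattern and polynomial degree below are purely illustrative.

```python
import numpy as np

# Hypothetical "experience": 50 noisy observations of some hidden pattern
x = np.linspace(0, 10, 50)
y = np.sin(x) + np.random.normal(0, 0.2, size=x.shape)

# Approximate the hidden pattern with a degree-5 polynomial (the "red" curve)
approx = np.poly1d(np.polyfit(x, y, deg=5))

# The educated guess: evaluate the approximation at a point we never observed
print(approx(4.2))
```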

All this seems great, but despite the recent jump in machine learning hype – and this is why I dare to call it a buzzword – machine learning is almost never the answer. It easily complicates problems that could be solved far more simply – there’s no reason to reinvent the for-loop! Most companies employing “machine learning” today either aren’t actually using machine learning techniques and are labelling normal algorithm development as such for marketing purposes, or are producing overly complex, computationally intensive, expensive, and unnecessary solutions that often perform worse than if they had just solved the problem by normal means.
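To make that concrete, here’s a hypothetical example of the kind of problem that should never go near machine learning: it’s deterministic, tiny, and a plain loop solves it exactly.

```python
# Finding the warmest day on record: deterministic, exact, and cheap.
# No training data, no model, no guessing required.
readings = [("Feb 21", -4.0), ("Feb 22", -1.5), ("Feb 23", 2.0)]

warmest_day, warmest_temp = readings[0]
for day, temp in readings:
    if temp > warmest_temp:
        warmest_day, warmest_temp = day, temp

print(warmest_day, warmest_temp)
```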

This doesn’t mean that machine learning is never useful, though. In fact, it can be an incredible tool when applied correctly to a valid problem. But what makes a valid problem? Although not a definitive test, here’s a handy checklist for determining whether a problem is worth a machine learning approach, or would be better left to a standard analytical solution.

A Machine Learning Problem:

  • Has “Big Data” – many data points (a big project may not see good results without several million data points!)
  • Is Complex – hard, if not impossible, to solve by standard methods; often requires an expert in the field
  • Is Nondeterministic – the same input doesn’t guarantee the same output
  • Is Many-Dimensional – the rule of thumb is to have at least 9 different categories or dimensions to draw your data points from before machine learning becomes worthwhile (or in math speak, n >= 9)

Some popular examples of machine learning problems that fit this checklist include: medical image processing, product recommendation, understanding speech, text analysis, facial recognition, search engines, autonomous vehicles, augmented reality, and predicting human behaviour.

One of the biggest challenges of machine learning is handling nondeterminism in a system (i.e. the same input does not guarantee a consistent output). This is best explained with an example that we’ll carry through the rest of this post: trying to predict the weather in Toronto. We have big data – hundreds of years of weather history for Toronto. The issue is complex – accurate forecasting requires experts with training and experience in the science of weather. The problem is nondeterministic – just because February 23rd, 2016 was cold doesn’t mean February 23rd, 2017 will be, even though the two dates share the same historical data. The problem is also many-dimensional – wind patterns, rain patterns, and every other factor that affects the weather can be taken as a new dimension of the problem. Since the problem is nondeterministic, we have to do our best to predict the output of the system (forecast the weather) using the information we do have – we make our best guess.

For machine learning, our best guess, or function approximation, is almost always just a creative use of math – whether it be statistics and probability, vectors, optimization, or other mathematical methods. There are a few core types of machine learning problems that can help identify what kind of solution best fits a problem: classification, regression, and clustering. In our example, we’re looking at a regression problem – trying to predict continuous trends in data. There are also a few core ways of training the system, or providing experience for it to learn from: supervised learning, unsupervised learning, and reinforcement learning. In our case, we’re looking at supervised learning, where the inputs and outputs of all training data are known. We give the system a historical date in Toronto (input) and we know what the weather was (output). Defining the problem type and training model makes it much easier to determine what method will be used to train the machine learning algorithm down the line.
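As a sketch of what “supervised” means here, every training example pairs an input with its known output. The feature names and values below are invented for illustration:

```python
# Supervised training data for the weather example: each input (features
# describing a historical date) is paired with a known output (the observed
# temperature). All numbers here are made up for illustration.
training_data = [
    # ((day_of_year, wind_speed_kmh, rainfall_mm), temperature_c)
    ((54, 20.0, 0.0), -5.1),  # Feb 23, 2015
    ((54, 12.0, 1.2), -1.3),  # Feb 23, 2016
    ((55, 30.0, 0.0), -8.0),  # Feb 24, 2016
]

inputs = [features for features, _ in training_data]
outputs = [temperature for _, temperature in training_data]
```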

So you’re going to forecast the weather (or solve a different machine learning problem), and you’ve determined it’s a real machine learning problem using the handy checklist above. But where to start? Here’s a handy guide to the steps for solving a machine learning problem:

  1. Define Meaningful Data
  2. Define the Problem
  3. Determine the Method of Attack
  4. Generate Training and Test Data – rule of thumb: 70% train, 30% test
  5. Train and Test the Algorithm

Let’s go through the steps with our weather forecasting problem:

Step 1 is to define meaningful data. What attributes matter, and what makes a “good” versus a “bad” data point? For the sake of our example we can take a few attributes, say temperature, rainfall, and wind speed, that together give us a pretty good idea of how the weather was on a given day. If we also had data on, for example, the median age of people in Toronto on a given date, we’d want to exclude it, as it is irrelevant to the problem and may cloud the results.
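As a sketch of step 1 (with a hypothetical record layout), this can boil down to keeping only the relevant attributes:

```python
# One raw record with a hypothetical layout; the median-age field is
# irrelevant to forecasting, so we drop it before training.
raw_record = {
    "date": "2016-02-23",
    "temperature_c": -1.3,
    "rainfall_mm": 1.2,
    "wind_speed_kmh": 12.0,
    "median_age_toronto": 39.3,  # irrelevant -- exclude it
}

RELEVANT = ("temperature_c", "rainfall_mm", "wind_speed_kmh")
data_point = {key: raw_record[key] for key in RELEVANT}
print(data_point)
```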

Step 2 we already covered by determining that the problem is a regression problem that uses supervised learning. Step 3 involves choosing an actual machine learning method, which we won’t go into in much detail here; for simplicity and cohesiveness we’ll choose linear regression. Step 4 is actually gathering the data (and setting aside 30% of it to test the results!), and step 5 is the actual training and testing process.
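Here’s a minimal sketch of steps 4 and 5 using scikit-learn (one library option among many, not named above). The data is synthetic and linear by construction, which real weather certainly isn’t – it’s just enough to exercise the split/train/test pipeline:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: one feature with a roughly linear relationship
np.random.seed(0)
X = np.random.uniform(0, 40, size=(1000, 1))
y = 2.0 * X[:, 0] + np.random.normal(0, 3.0, size=1000)

# Step 4: set aside 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Step 5: train on the 70%, then evaluate on data the model has never seen
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out test data:", model.score(X_test, y_test))
```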

As you may have noticed from these steps, actually training the algorithm is the last and least crucial step. The key to creating strong machine learning is to ensure that you have meaningful data, a well-defined and specific problem, and a planned solution, all before ever touching code.

That being said, even with a well-defined solution that’s classified correctly, has meaningful data and proper test data, and accounts for outliers in the data, there is still plenty of room for mistakes. The most common mistake, and often the fatal error behind many machine learning solutions, is under- or overfitting. Underfitting, or high bias, means that the final function approximation is too simple and doesn’t represent the trend of the data well. Think about trying to draw a straight line through all the temperatures over a year in Toronto on a graph – you’d be lucky if it hit any data points at all! More common, and the more dangerous of the two, is overfitting, or high variance. In this case, the approximated function is far too complex: it fits the training data so closely, noise and all, that it misses the underlying pattern and generalizes poorly to new data. Overfitting often produces an even worse solution than underfitting and is an easy trap to fall into.
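A quick way to see both failure modes is to sketch them with illustrative polynomial fits: a degree-1 fit is too simple (underfitting), a degree-15 fit chases the noise (overfitting), and error on held-out test points exposes both:

```python
import numpy as np

np.random.seed(1)
x = np.linspace(0, 10, 30)
y = np.sin(x) + np.random.normal(0, 0.2, size=x.shape)          # training data
x_test = np.linspace(0.25, 9.75, 20)
y_test = np.sin(x_test) + np.random.normal(0, 0.2, size=x_test.shape)

for degree in (1, 5, 15):
    fit = np.poly1d(np.polyfit(x, y, degree))                   # fit training data
    test_error = np.mean((fit(x_test) - y_test) ** 2)           # judge on held-out data
    print("degree", degree, "-> test error", round(test_error, 3))
```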

This has only been a basic introduction to machine learning, but resources for learning more are rapidly becoming more available, and plenty of out-of-the-box machine learning algorithms and test datasets now exist to get started and experiment with, in an assortment of languages and GUIs (some of the best ML resources today are available in Python), including Theano, TensorFlow, and Weka, or even Octave and MATLAB.
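For instance, a first experiment with scikit-learn (another popular Python option) and one of its bundled test datasets takes only a few lines:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# A bundled test dataset plus an out-of-the-box algorithm is enough to start
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print(LinearRegression().fit(X_train, y_train).score(X_test, y_test))
```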

Interested in solving cool problems and making amazing things? Come work with us! Check out all our open positions here.