How to ML – Monitoring

As much as machine learning developers like to think that the job is done once they’ve got a good enough model and deployed it, that’s not quite the case.

The first couple of weeks after deployment are critical. Is the model really as good as the offline tests said it is? Maybe something is different in production than in all your test data. Maybe the data you collected for offline predictions includes pieces of data that are not available at inference time. For example, if you are trying to predict click-through rates for items in a list and use them to rank the items, it’s easy to include the rank of the item when building the training dataset. But the model won’t have that when making predictions, because it’s what you’re trying to infer. Surprise, the model will perform very poorly in production.
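One cheap guard against this kind of leak is to compare the columns used in training with the columns the production service can actually supply. Here is a minimal sketch of that check. The column names and the helper function are made up for illustration, they don't come from any real pipeline.

```python
def unavailable_at_inference(training_columns, serving_columns, label):
    """Training columns (other than the label) that production cannot supply."""
    return sorted(set(training_columns) - set(serving_columns) - {label})

# Hypothetical column names for the click-through example.
training_columns = ["item_id", "price", "category", "rank", "clicked"]
serving_columns = ["item_id", "price", "category"]

leaks = unavailable_at_inference(training_columns, serving_columns, label="clicked")
print(leaks)  # ['rank']
```

If the list is not empty, either drop those columns from training or find a way to compute them at request time.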

Or maybe simply A/B testing reveals that the fancy ML model doesn’t really perform better in production than the old rules written with lots of elbow grease by lots of developers and business analysts, using lots of domain knowledge and years of experience.

But even if the model does well at the beginning, will it continue to do so? Maybe there will be an external change in user behavior and they will start searching for other kinds of queries, which your model was not developed for. Or maybe your model will introduce a “positive” feedback loop: it suggests some items, users click on them, so those items get suggested more often, so more users click on them. This leads to a “rich get richer” kind of situation, but the algorithm is actually not making better and better suggestions.

Maybe you are on top of this and you keep retraining your model weekly to keep it in step with user behavior. But then you need to have a staggered release of the model, to make sure that the new one is really performing better across all relevant dimensions. Is inference speed still good enough? Are predictions relatively stable, meaning we don’t recommend only action movies one week and then only comedies next week? Are models even comparable from one week to another? Or is there a significant random component to them which makes it really hard to see how they improved? For example, how are the clusters from the user post data built up? K-means starts with random centroids and clusters from one run have only passing similarity to the ones from another run. How will you deal with that?
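One possible way to deal with the clustering question, sketched under a few assumptions: fix the random seed so that runs are repeatable, and compare two runs with a measure that ignores the arbitrary cluster ids. The Rand index below is one such measure (it counts the pairs of points on which two clusterings agree). The assignments are invented for the example, and a real project would more likely use a library implementation than this hand-rolled one.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two clusterings agree.

    A pair "agrees" when both clusterings put the two points together,
    or both put them apart. Cluster ids are never compared directly,
    so renaming the clusters does not change the score.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must cover the same points")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agreeing = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreeing / len(pairs)

# Hypothetical cluster assignments for the same six users from weekly runs.
last_week = [0, 0, 1, 1, 2, 2]
this_week = [2, 2, 0, 0, 1, 1]  # same grouping, different cluster ids
drifted = [0, 1, 1, 2, 2, 0]    # a genuinely different grouping

print(rand_index(last_week, this_week))  # 1.0
print(rand_index(last_week, drifted))    # 0.6
```

A score that suddenly drops from one week to the next is a hint that the clusters really changed, not just their names.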

How to ML – Deploying

So the ML engineer presented the model to the business stakeholders and they agreed that it performed well enough on the key metrics in testing that it’s time to deploy it to production.

So now we have to make sure the model runs reliably in production. We have to answer some more questions in order to make some trade-offs.

How important is latency? Is the model making an inference in response to a user action, so it’s crucial to have the answer in tens of milliseconds? Then it’s time to optimize the model: quantize the weights, distill the knowledge into a smaller model, prune weights and so on. Hopefully, your metrics won’t go down due to the optimization.
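To give a feeling for what quantizing weights means, here is a toy sketch of symmetric int8 quantization on a plain list of floats. This is only the basic idea under simplifying assumptions (one scale for the whole list, made-up weights). The tooling that ships with the ML frameworks does considerably more than this.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]

# Made-up weights for illustration.
weights = [0.5, -1.27, 0.031, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding moves each weight by at most half a quantization step.
worst_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, worst_error)
```

Each weight now fits in one byte instead of four, at the price of a small rounding error, which is why the metrics have to be checked again afterwards.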

Can the results be precomputed? For example, if you want to make movie recommendations, maybe there can be a batch job that runs every night, does the inference for every user and stores the results in a database. Then when the user makes a request, the recommendations are simply loaded from the database. This is possible only if you have a finite range of predictions to make.
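The nightly batch idea could look roughly like this. It is a sketch with stand-ins: the `score` function plays the role of the real model, and a plain dict plays the role of the database. Both are assumptions made to keep the example self-contained.

```python
def score(user_id, movie_id):
    # Hypothetical stand-in for the real model's predicted rating.
    return ((user_id * 31 + movie_id * 17) % 100) / 100

def nightly_batch(user_ids, movie_ids, top_k=3):
    """Run inference for every user and keep only their top_k movies."""
    store = {}
    for user in user_ids:
        ranked = sorted(movie_ids, key=lambda m: score(user, m), reverse=True)
        store[user] = ranked[:top_k]
    return store

def recommend(store, user_id, fallback=()):
    """Request-time path: a lookup, with no model call."""
    return list(store.get(user_id, fallback))

store = nightly_batch(user_ids=[1, 2, 3], movie_ids=[10, 11, 12, 13, 14])
print(recommend(store, 1))                 # [14, 13, 12]
print(recommend(store, 99, fallback=[10])) # [10]
```

The fallback matters: a user who signed up after the last batch run has no precomputed row, so the request path needs some default to return.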

Where are you running the model? On big beefy servers with a GPU? On mobile devices, which are much less powerful? Or on some edge devices that don’t even have an OS? Depending on the answer, you might have to convert the model to a different format or optimize it to be able to fit in memory.

Even in the easy case where you are running the model on servers and latency can be several seconds, you still have to do the whole dance of making it work there. “Works on my machine” is all too often a problem. Maybe production runs a different version of Linux, which has a different BLAS library, and the security team won’t let you update things. Simple, just use Docker, right? Right, better hope you are good friends with the DevOps team to help you out with setting up the CI/CD pipelines.

But you’ve killed all the dragons, now it’s time to keep watch… aka monitoring the model’s performance in production.

How to ML – Models

So we finally got our data and we can get to machine learning. Without the data, there is no machine learning, there is at best human learning, where somebody tries to write an algorithm by hand to do the task at hand.

This is the part that most people who want to do machine learning are excited about. I read Bishop’s and Murphy’s textbooks, watched Andrew Ng’s online course about ML and learned about different kinds of ML algorithms and I couldn’t wait to try them out and to see which machine learning model is the best for the data at hand.

You start off with a simple one, a linear or logistic regression, to get a baseline. Maybe you even play around with the hyperparameters. Then you move on to a more complicated model, such as a random forest. You spend more time fiddling with it, getting 20% better results. Then you switch to the big guns, neural networks. You start with a simple one, with just 3 layers, and progressively end up with 100 ReLU and SIREN layers, dropout, batchnorm, Adam, convolutions, attention mechanisms and finally you get to 99% accuracy.

And then you wake up from your nice dream.

In practice, playing around with ML algorithms is just 10% of the job for an ML engineer. You do try out different algorithms, but you rarely write new ones from scratch. When running in production, if it’s not in one of the sklearn, TensorFlow or PyTorch libraries, it won’t fly. For proof of concept projects you might try to use the GitHub repo that accompanies a paper, but that path is full of pain, trying to find all the dependencies of undocumented code and to make it work.

For the hyperparameter tuning, there are libraries to help you with that, and anyway, the time it takes to finish the training runs is much larger than the time you spend coding it up, for any real life datasets.

And in practice, you run into many issues with the data. You’ll find that some of the columns in the data have lots of missing values. Or some of the datapoints that come from different sources have different meanings for the same columns. You’ll find conflicting or invalid labels. And that means going back to the data pipelines and fixing the bugs that occur there.
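Two of these checks are easy to sketch: how much of each column is missing, and which records show up with more than one label. The rows and column names below are invented for the example, and the helpers are illustrations, not part of any library.

```python
from collections import defaultdict

def missing_fraction(rows, columns):
    """Share of rows where each column is absent, None or an empty string."""
    if not rows:
        return {col: 0.0 for col in columns}
    return {
        col: sum(1 for row in rows if row.get(col) in (None, "")) / len(rows)
        for col in columns
    }

def conflicting_labels(rows, key, label):
    """Keys that appear with more than one distinct label."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[key]].add(row[label])
    return {k: labels for k, labels in seen.items() if len(labels) > 1}

# Hypothetical records merged from two sources.
rows = [
    {"id": "a", "age": 34, "city": "Cluj", "label": "spam"},
    {"id": "b", "age": None, "city": "", "label": "ham"},
    {"id": "a", "age": 34, "city": "Cluj", "label": "ham"},
    {"id": "c", "city": "Oradea", "label": "ham"},
]

missing = missing_fraction(rows, ["age", "city"])
conflicts = conflicting_labels(rows, key="id", label="label")
print(missing)    # {'age': 0.5, 'city': 0.25}
print(conflicts)  # record "a" was labeled both "spam" and "ham"
```

Running something like this before every training run is a cheap way to notice that a data pipeline broke upstream.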

If you do get a model that is good enough, it’s time to deploy it, which comes with its own fun…

How to ML – Data

So we’ve decided what metrics we want to track for our machine learning project. Because ML needs data, we need to get it.

In some cases we get lucky and we already have it. Maybe we want to predict the failure of pieces of equipment in a factory. There are already lots of sensors measuring the performance of the equipment and there are service logs saying what was replaced on each piece of equipment. In theory, all we need is a bit of a big data processing pipeline, say with Apache Spark, and we can get the data in the form of (input, output) pairs that can be fed into a machine learning classifier that predicts whether a piece of equipment will fail based on the last 10 values measured by its sensors. In practice, we’ll find that sensors of the same type that come from different manufacturers have different ranges of possible values, so they will all have to be normalized. Or that the service logs are filled out differently by different people, so that will have to be standardized as well. Or worse, the sensor data is good, but it’s kept only for 1 month to save on storage costs, so we have to fix that and wait a couple of months for more training data to accumulate.
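On a small scale, the two preparation steps mentioned here could look like the sketch below: rescale each sensor's readings by its manufacturer's value range, then cut the series into windows of 10 consecutive readings. The ranges and readings are made up, and a real pipeline would do this in Spark or similar, not on Python lists.

```python
def min_max_scale(values, low, high):
    """Map readings from a sensor's [low, high] range onto [0, 1]."""
    span = high - low
    return [(v - low) / span for v in values]

def windows(values, size=10):
    """All runs of `size` consecutive readings, oldest first."""
    return [values[i:i + size] for i in range(len(values) - size + 1)]

# Hypothetical ranges: manufacturer A reports 0..100, manufacturer B 0..5.
scaled_a = min_max_scale([0, 50, 100], low=0, high=100)
scaled_b = min_max_scale([0, 2.5, 5], low=0, high=5)
print(scaled_a, scaled_b)  # both [0.0, 0.5, 1.0]

# Twelve readings give three overlapping windows of ten.
readings = list(range(12))
model_inputs = windows(readings, size=10)
print(len(model_inputs))  # 3
```

After scaling, the same physical state produces the same number no matter who built the sensor, which is what the classifier needs.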

The next best case is that we don’t have the data, but we can get it somehow. Maybe there are already datasets on the internet that we can download for free. This is the case for most face recognition applications: there are plenty of annotated face datasets out there, with various licenses. In some cases the dataset must be bought. For example, if we want to start a new ad network, there are plenty of datasets of personal data about everyone available online, which can then be used to predict the likelihood of clicking on an ad. That’s the business model of many startups…

The worst case is that we don’t have data and we can’t find it out there. Maybe it’s because we have a very specific niche, such as finding defects in the manufacturing process of our specific widgets, so we can’t use random images from the internet to learn this. Or maybe we want to do something that is really new (or very valuable), in which case we will have to gather the data ourselves. If we want to solve something in the physical world, that will mean installing sensors to gather data. After we get the raw data, such as images of our widgets coming off the production line, we will have to annotate those images. This means getting them in front of humans who know how to tell if a widget is good or defective. There needs to be a QA process in this, because even humans have an error rate, so each image will have to be labeled by at least three humans. We need several thousand samples, so this will take some time to set up, even if we can use crowdsourcing websites such as Amazon Mechanical Turk to distribute the tasks to many workers across the world.
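Having three labels per image makes a simple majority vote possible. The sketch below shows one way to combine the votes. The label names are made up, and returning None on a tie (so the image goes back for another look) is just one possible policy.

```python
from collections import Counter

def majority_label(votes):
    """Return the label most annotators chose, or None if there is no winner."""
    if not votes:
        return None
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: send the image back for review
    return counts[0][0]

print(majority_label(["good", "good", "defective"]))  # good
print(majority_label(["good", "defective"]))          # None
```

With an odd number of annotators and only two possible labels, a tie can't happen, which is one reason three is a popular minimum.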

All this is done, we finally have data. Time to start doing the actual ML…

How to ML – Metrics

We saw that machine learning algorithms process large amounts of data to find patterns. But how exactly do they do that?

The first step in a machine learning project is establishing metrics. What exactly do we want to do and how do we know we’re doing it well?

Are we trying to predict a number? How much will Bitcoin cost next year? That’s a regression problem. Are we trying to predict who will win the election? That’s a binary classification problem (at least in the USA). Are we trying to recognize objects in an image? That’s a multi-class classification problem.

Another question that has to be answered is what kind of mistakes are worse. Machine learning is not all-knowing, so it will make mistakes, but there are trade-offs to be made. Maybe we are building a system to find tumors in X-rays: in that case it might be better that we cry wolf too often and have false positives, rather than missing a tumor. Or maybe it’s the opposite: we are trying to implement a facial recognition system. If the system recognizes a burglar incorrectly, then the wrong person will get sent to jail, which is a very bad consequence for a mistake made by “THE algorithm”.
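In many classifiers this trade-off comes down to a single knob: the score above which we flag a case as positive. A small sketch with invented scores shows how moving that threshold swaps one kind of mistake for the other. The numbers are made up and stand for no real model.

```python
def confusion_counts(scores, labels, threshold):
    """Count outcomes when every score >= threshold is flagged positive."""
    tp = fp = fn = tn = 0
    for score, is_positive in zip(scores, labels):
        flagged = score >= threshold
        if flagged and is_positive:
            tp += 1
        elif flagged:
            fp += 1
        elif is_positive:
            fn += 1
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn}

# Made-up model scores for six scans; 1 means a tumor is really there.
scores = [0.9, 0.7, 0.4, 0.35, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0]

strict = confusion_counts(scores, labels, threshold=0.5)
cautious = confusion_counts(scores, labels, threshold=0.3)
print(strict)    # {'tp': 2, 'fp': 0, 'fn': 1, 'tn': 3}
print(cautious)  # {'tp': 3, 'fp': 1, 'fn': 0, 'tn': 2}
```

The strict threshold misses one tumor and raises no false alarm. The cautious one catches all three tumors at the price of one false alarm. Which of the two is acceptable is a business decision, not something the algorithm can answer.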

These are not just theoretical concerns, they actually matter a lot in building machine learning systems. Because of this, many ML projects are human-in-the-loop, meaning the model doesn’t decide by itself what to do, it merely makes a suggestion which a human will then confirm. In many cases, that is valuable enough, because it makes the human much more efficient. For example, the security guard doesn’t have to look at 20 screens at once, but can look only at the footage that was flagged as anomalous.

Tomorrow we’ll look at the next step: gathering the data.

What is ML? part 3

Yesterday we saw that machine learning is behind some successful products and it does have the potential to bring many more changes to our life.

So what is it?

Well, the textbook definition is that it’s the building of algorithms that can perform tasks they were not explicitly programmed to do. In practice, this means that we have algorithms that analyze large quantities of data to learn some patterns in the data, which can then be used to make predictions about new data points.

This is in contrast with the classical way of programming computers, where a programmer would either use their domain knowledge or analyze the data themselves, and then write the program that produces the correct output.

So one of the crucial distinctions is that in machine learning, the machine has to learn from the data. If a human being figures out the pattern and writes a regular expression to find addresses in text, that’s human learning, and we all go to school to do that.

Now does that mean that machine learning is a solution for everything? No. In some cases, it’s easier or cheaper to have a data analyst or a programmer find the pattern and code it up.

But there are plenty of cases where despite decades long efforts of big teams of researchers, humans haven’t been able to find an explicit pattern. The simplest example of this would be recognizing dogs in pictures. 99.99% of humans over the age of 5 have no problem recognizing a dog, whether a puppy, a golden retriever or a Saint Bernard, but they have zero insight into how they do it, what makes a bunch of pixels on the screen a dog and not a cat. And this is where machine learning shines: you give it a lot of photos (several thousands at least), pair each photo with a label of what it contains and the neural network will learn by itself what makes a dog a dog and not a cat.

Machine learning is just one tool at our disposal, among many other tools. It’s a very powerful tool and it’s one that gets “sharpened” all the time, with lots of research being done all around the world to find better algorithms, to speed up their training and to make them more accurate.

Come back tomorrow to find out how the sausage is made, on a high level.

What is ML? part 2

Yesterday I wrote how AI made big promises in the past but it failed to deliver, but that now it’s different, because of machine learning.

What’s changed?

Well, now we have several products that work well with machine learning. My favorite examples are Google Photos, Synology Moments and PhotoPrism. They are all photo management applications which use machine learning to automatically recognize all faces in pictures (easy, we’ve had this for 15 years), automatically recognize which pictures are of the same person (hard, but doable by hand if you had too much time) and, more than that, index photos by all kinds of objects that are found in them, so that you can search by what items appear in your photos (really hard, nobody had time to do that manually).

I have more than 10 years of photos uploaded to my Synology and one of my favorite party tricks when talking to someone is to whip out my phone and show them all the photos I have of them, since they were kids, or the last time that we met, or that funny thing that happened to them and that I have photographic evidence of. Everyone is amazed by that (and some are horrified and deny that they looked like that when they were children). And there is not one, but at least three options to do this, one of which is open source, so that anyone can run it at home on their computer, for free. So there is demand for such a product.

Other successful examples are in the domain of recommender systems, YouTube being a good example. I have a love/hate relationship with it: on one hand, I wasted so many hours of my life on the recommendations it makes (which is proof of how good it is at making personalized suggestions), on the other hand, I found plenty of cool videos with it. This deep learning based recommender system is one of the factors behind the growth of the watch time on YouTube, which is basically the key metric behind revenue (more watch time, more ads).

These are just two examples that are available for everyone to use, and which serve as evidence that machine learning based AI now is not just hot air.

But I still haven’t answered the question what is ML… tomorrow, I promise.

What is ML?

Machine learning is everywhere these days. Mostly in newspapers, but it’s seeping into many real life, actual use cases. But what is it actually?

If you read only articles on TechCrunch, Forbes, Business Insider or even MIT Technology Review, you’d think it’s something that brings Model T800 to life soon, or that it will cure cancer and make radiologists useless, or that it will enable humans to upload their minds to the cloud and live forever, or that it will bring fully self driving cars by the end of the year (every year for the last 5 years).

Many companies want to get in on the ML bandwagon. It’s understandable: 1) that’s where the money is (some 10 billion dollars were invested in it in 2018) and 2) correctly done, applied to the right problems, ML can actually be really valuable, either by automating things that were previously done with manual labor or even by enabling things that were previously unfeasible.

But at the same time, a lot of ML projects make unrealistic promises, eat a lot of money and then deliver something that doesn’t work well enough to have a positive ROI. The ML engineers and researchers are happy: they got paid, analyzed the data and played around with building ML models, and maybe even published a paper or two. But the business is not happy, because it is not better off in any way.

This is not a new phenomenon. Artificial Intelligence, of which Machine Learning is a subdomain, has been plagued by similar bubbles ever since it was founded. AI has gone through several AI winters already, in the 60s, 80s and late 90s. Big promises, few results.

To paraphrase Battlestar Galactica, “All this has happened before, all this will happen again but this time it’s different”. But why is it different? More about that tomorrow.