So we finally got our data and we can get to machine learning. Without the data, there is no machine learning. There is at best human learning, where somebody tries to write an algorithm by hand to do the task.
This is the part that most people who want to do machine learning are excited about. I was no different: I read Bishop’s and Murphy’s textbooks, watched Andrew Ng’s online ML course, learned about the different kinds of ML algorithms, and couldn’t wait to try them out and see which model works best for the data at hand.
You start off with a simple one, a linear or logistic regression, to get a baseline. Maybe you even play around with the hyperparameters. Then you move on to a more complicated model, such as a random forest. You spend more time fiddling with it, getting 20% better results. Then you switch to the big guns, neural networks. You start with a simple one, with just 3 layers, and progressively end up with 100 ReLU and SIREN layers, dropout, batchnorm, Adam, convolutions and attention mechanisms, and finally you get to 99% accuracy.
And then you wake up from your nice dream.
In practice, playing around with ML algorithms is just 10% of the job for an ML engineer. You do try out different algorithms, but you rarely write new ones from scratch. For anything running in production, if it’s not in scikit-learn, TensorFlow or PyTorch, it won’t fly. For proof-of-concept projects you might try to use the GitHub repo that accompanies a paper, but that path is full of pain: you have to hunt down all the dependencies of undocumented code and then make it work.
As for hyperparameter tuning, there are libraries to help you with that. And anyway, for any real-life dataset, the training runs take much longer to finish than the code takes to write.
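To give a feel for how little code the tuning itself takes, here is a minimal sketch using scikit-learn’s `GridSearchCV`. The synthetic dataset, the random forest and the tiny search space are all assumptions made up for illustration, not a recommendation for your data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for a real dataset (assumption: yours comes from a data pipeline).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical search space; useful ranges depend entirely on the data.
param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}

# 3-fold cross-validation over every combination in the grid.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_train, y_train)

best_params = search.best_params_
# score() uses the best model, refit on the whole training set.
test_accuracy = search.score(X_test, y_test)
print(best_params, test_accuracy)
```

The grid here has four combinations and three folds, so that is twelve training runs plus the final refit. On a toy dataset that takes seconds; on a real one, this is where the waiting happens.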
And in practice, you run into many issues with the data. You’ll find that some of the columns have lots of missing values. Or datapoints that come from different sources use the same column to mean different things. You’ll find conflicting or invalid labels. And that means going back to the data pipelines and fixing the bugs that occur there.
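A few quick checks catch most of these problems early. The sketch below uses pandas on a toy table; the column names, the two sources and the set of valid labels are hypothetical, chosen only to show each kind of issue once.

```python
import pandas as pd

# Hypothetical toy dataset; all names and values are made up for illustration.
df = pd.DataFrame({
    "source": ["a", "a", "b", "b", "b"],
    "age": [34, None, 51, 51, None],
    "height": [1.80, 1.65, 172.0, 172.0, 168.0],  # source "b" reports cm, not m
    "label": [1, 0, 1, 0, 2],
})

# Fraction of missing values per column.
missing_fraction = df.isna().mean()

# Per-source averages: wildly different scales hint that a column
# means different things depending on where the row came from.
height_by_source = df.groupby("source")["height"].mean()

# Invalid labels: anything outside the expected set (assumed to be {0, 1}).
valid_labels = {0, 1}
invalid_rows = df[~df["label"].isin(valid_labels)]

# Conflicting labels: identical features with more than one distinct label.
feature_cols = ["source", "age", "height"]
label_counts = df.groupby(feature_cols, dropna=False)["label"].nunique()
conflicts = label_counts[label_counts > 1]

print(missing_fraction, height_by_source, invalid_rows, conflicts, sep="\n\n")
```

None of this fixes anything, of course. It only tells you which part of the pipeline to go and look at.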
If you do get a model that is good enough, it’s time to deploy it, which comes with its own fun…