The Best Text Classification library for a Quick Baseline

Text classification is a very common use case for machine learning (ML) and natural language processing (NLP). It’s used for things like spam detection in emails, sentiment analysis for social media posts, or intent detection in chatbots.

In this series I am going to compare several libraries that can be used to train text classification models.

The fastText library

fastText is a tool from Facebook made specifically for efficient text classification. It’s written in C++ and optimized for multi-core training, so it’s very fast, being able to process hundreds of thousands of words per second per core. It’s very straightforward to use, either as a Python library or through a CLI tool.

Despite using an older machine learning model (a neural network architecture from 2016), fastText is still very competitive and provides an excellent baseline. If you also take resource usage into account, fastText is very hard to beat: the models that outperform it typically require powerful GPUs.

Getting started with text classification with fastText

fastText requires the training data for text classification to be in a special format: each document should be on a single line and the labels should be at the start of the line, with the prefix __label__, like this:

Training data format

__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?

If you use Doccano for annotating the text data, it has an option to export the data in fastText format. But even if you used another tool for annotation, it’s only a couple of lines of Python code to convert to the appropriate format. Let’s say we have our data in a JSONL format, with each JSON object having a labels key and a text key. To convert to fastText format, we can use the following short snippet:

import json

with open("fasttext.txt", "w") as output:
    with open("dataset.jsonl", encoding="utf8") as f:
        for raw_line in f:
            doc = json.loads(raw_line)
            # fastText labels cannot contain spaces, so replace them
            labels = [x.replace(" ", "_") for x in doc['labels']]
            labels = " ".join(f"__label__{x}" for x in labels)
            # each document must fit on a single line
            txt = " ".join(doc['text'].splitlines())
            output.write(f"{labels} {txt}\n")

Training text classification models with fastText

After you have the data in the right format, the simplest way to use fastText is through its CLI tool. Once you have installed it, you can train a model with the supervised subcommand:

> ./fasttext supervised -input fasttext.txt -output model
Read 0M words
Number of words:  16568
Number of labels: 736
Progress: 100.0% words/sec/thread:   47065 lr:  0.000000 avg.loss: 10.027837 ETA:   0h 0m 0s
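
The same training can also be done from Python with the fasttext package. Here is a minimal sketch, assuming the same file names as in the CLI example above:

import fasttext

# Train a supervised classifier on the same file used with the CLI.
model = fasttext.train_supervised(input="fasttext.txt")

# Save the model so it can be reused later (equivalent to the -output flag).
model.save_model("model.bin")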

You can evaluate the model on a separate dataset with the test subcommand, which reports the precision and recall at 1 (considering only the top predicted label):

> ./fasttext test model.bin validation.txt
N       15404
P@1     0.162
R@1     0.0701

You can also get predictions for new documents:

> ./fasttext predict model.bin -
How to make lasagna?
__label__baking
Best way to chop meat
__label__food-safety
How to store steak
__label__food-safety
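
If you prefer to stay in Python, evaluation and prediction work much the same way. A small sketch using the fasttext package, assuming the model and validation file from above:

import fasttext

model = fasttext.load_model("model.bin")

# test() returns the number of examples, precision@1 and recall@1.
n, precision_at_1, recall_at_1 = model.test("validation.txt")
print(n, precision_at_1, recall_at_1)

# predict() returns the top k labels and their probabilities for a document.
labels, probabilities = model.predict("How to make lasagna?", k=3)
print(labels, probabilities)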

fastText also comes with a built-in hyperparameter optimizer, which searches for the best model on a validation dataset within a given time budget (5 minutes by default):

> ./fasttext supervised -input fasttext.txt -output model -autotune-validation validation.txt

If we reevaluate this model we’ll find it performs much better:

> ./fasttext test model.bin validation.txt
N       15404
P@1     0.727
R@1     0.315

A precision of 0.72, compared to 0.16 before. Not bad for 10 minutes of our time, 5 of which were spent waiting for the computer to find us a better model.

Optimizing for different metrics

This library provides several knobs you can turn to try to obtain better models: what kind of n-grams to use, how big the learning rate should be, which loss function to use, and also which metric you are trying to optimize. Is precision or recall better aligned with your business KPIs? Is it more important that the top result is a really good one, or are you looking for several good results in the top 5? Are you only interested in high-confidence results? All of this depends on the problem you are trying to solve, and fastText provides ways to optimize for each of these, as sketched below.
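
For example, here is roughly what those knobs look like in the Python API. This is a hedged sketch: the parameter names mirror the CLI flags, and the label passed to the autotune metric is just an illustration:

import fasttext

# Tune word n-grams, learning rate and loss by hand...
model = fasttext.train_supervised(
    input="fasttext.txt",
    wordNgrams=2,   # use bigrams in addition to single words
    lr=0.5,         # learning rate
    loss="ova",     # one-vs-all loss, useful for multi-label problems
)

# ...or let the autotuner search for you, optimizing the F1 score of one
# specific label instead of the default precision@1.
model = fasttext.train_supervised(
    input="fasttext.txt",
    autotuneValidationFile="validation.txt",
    autotuneMetric="f1:__label__baking",  # example label from our dataset
    autotuneDuration=300,                 # seconds; 5 minutes is the default
)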

Cons of fastText

Of course, fastText has some disadvantages:

  • Not much flexibility – only one neural network architecture from 2016 implemented with very few parameters to tune
  • No option to speed up using GPU
  • Can be used only for text classification and word embeddings
  • Not widely supported by other tools (for deployment, for example)

Conclusion

fastText is a great library to use when you want to start solving a text classification problem. In less than half an hour, you can get a good baseline going, which will tell you if this is a problem that is worth pursuing or not.

Getting Started with Text Annotation

Data is crucial to any machine learning effort. And not just any data, but annotated data, so that the machine learning algorithms can learn what outcome they should predict. In some cases, we can get the data from existing processes in the business, but more often than not, we need to set up a manual annotation process.

For annotating freeform text data (text generated by people) there is a great open source tool called Doccano. It is used to gather data for a wide range of common natural language processing (NLP) tasks, such as sentiment analysis, document classification, named entity recognition (NER), summarization, question answering, translation and others.

Text Annotation types

There are three types of annotation projects in Doccano.

Text classification

Document classification task in Doccano

This kind of project enables you to annotate labels that apply to the entire document. For example, in a sentiment analysis task, you could label a document as being positive or negative. In a document classification task, you annotate the topic of the document. You can choose multiple labels for each document.

Sequence Labeling

Named Entity Recognition task in Doccano

This is generally used for NER tasks, where you select relevant fragments from the text, for example where persons or organizations are mentioned in documents. Several fragments can be selected in each document.

Sequence to Sequence

Sequence to Sequence task in Doccano

The Seq2seq annotation is for tasks such as summarization, question answering or translation from one language to another. There is a text box where you can write the appropriate response. For summarization, this would be the summary of the document. For question answering, you can write several answers.

Setting up Doccano

Doccano offers 1-click installs for AWS, Azure and Heroku, or you can run it locally using Docker.

After you have Doccano running, you must create a new project and import your documents. Doccano is quite flexible and you can import data in multiple formats, such as plain text, CSV, JSONL or even fastText format.

You can create multiple users who will work on annotation. They can review each other’s work, or they can annotate each document independently. In the latter case, the annotations from different labelers can be compared. If there are big differences, maybe the task is not clear and better guidelines are needed – if humans can’t solve the problem, machine learning won’t be able to solve it either.
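
Measuring that agreement takes only a few lines once you export the annotations. Here is a rough sketch using Cohen’s kappa from scikit-learn, assuming two annotators labeled the same documents with a single label each (the labels below are made up):

from sklearn.metrics import cohen_kappa_score

# One label per document, in the same document order for both annotators.
annotator_a = ["positive", "negative", "positive", "neutral", "positive"]
annotator_b = ["positive", "negative", "neutral", "neutral", "positive"]

# Values close to 1 mean strong agreement; values near 0 mean the annotators
# agree about as often as they would by chance.
print(cohen_kappa_score(annotator_a, annotator_b))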

Doccano features

Doccano is trying to make the annotation workflow as efficient as possible by giving keyboard shortcuts for most actions.

It has a dashboard where you can see statistics about how many documents were annotated, what’s the frequency of labels and how many documents were processed by each labeler.

You can also speed up the process by using an existing machine learning model to bootstrap the annotations. Either you specify some existing labels when uploading the data, or you configure Doccano to call another REST API and get annotations from there. Then the labelers only have to review the output of the algorithm, instead of annotating from scratch.

Text Annotation Alternatives

There are other annotation tools as well. One example is Prodigy, from the makers of spaCy, one of the most popular NLP libraries. It has tight integration with spaCy and supports active learning, but it’s a paid product, unlike Doccano.

Another option is Label Studio, which supports annotating images, audio and time series, not just text.

If you need help setting up a text annotation pipeline to make sure that you are gathering the right data for your problem, don’t hesitate to contact me.

The easiest way to get started with text classification

Machine learning (ML) has exploded in the last decade. Most companies try to apply ML in all kinds of areas, from image processing problems (such as recognizing defects in manufacturing), to forecasting, to trying to extract meaning from unstructured text, and many other problems. A quite common task is that of trying to classify documents into various classes. For example, you have many news articles and you want to group them by their topics, such as politics, entertainment, health, sports and travel. Another example would be a company that has many documents and wants to classify them by their type: invoices, resumes, various reports, and so on. 

One of the big challenges of machine learning is that it requires a lot of annotated data. It’s not enough to just get a lot of news articles: a human has to go and annotate at least several thousand of them with their topic, and only then can you start applying ML algorithms to solve your problem. In general, the more annotated data points you have, the better accuracy you get.

But getting the data is time-consuming and expensive. In some cases, you can crowdsource the data gathering using a service such as Mechanical Turk, but in other cases, where more business domain knowledge is needed, the data annotation has to happen in house. If reading and classifying a document takes one minute, then annotating ten thousand documents will take over 160 hours – about a month of full-time work for someone. And because even human labelers make mistakes, each document should be labeled by at least three people to ensure the labels are accurate. So the costs quickly add up.

SentenceBERT to the rescue

Recent developments in Natural Language Processing (NLP) research have led to the creation of neural networks that have a good understanding of language out of the box. One of them in particular can be used, with a clever reframing of the problem, to solve our text classification problem, or at least make it easier.

SentenceBERT is a follow-up to BERT that improves on it by using siamese networks, and it is used to generate sentence embeddings. None of this makes any sense? No problem, you don’t need to understand it to get started, but I’ll still try to explain the gist of it.

(For some reason, many models in NLP are named after Sesame Street characters: ELMo, BERT, Rosita, ERNIE, Grover, KERMIT, Big BIRD 😄)

Figure 1: BERT model

The problem SentenceBERT is trained to solve is Natural Language Inference (NLI): given two sentences, a premise and a hypothesis, the model has to say what the relationship between them is. Does the premise entail the hypothesis, are they neutral (unrelated), or are they contradictory? For example, “A soccer game with multiple males playing.” entails “Some men are playing a sport.”, but “A man inspects the uniform of a figure in some East Asian country.” contradicts “The man is sleeping.”.

A side effect of trying to solve this problem is that SentenceBERT learns to “understand” sentences quite well. Whether it truly understands them is quite a philosophical debate, but what I mean is that it reduces a sentence (or even a paragraph) to a vector of numbers, such that sentences that are similar in meaning have similar vectors assigned to them. These vectors are called embeddings, and they enable us to compare sentences.

How does this help us? Remember, we wanted to classify single documents, not to figure out the relationship between two documents. Well, some researchers from the University of Pennsylvania found a clever way to reframe one problem as the other.

Let’s say you want to classify news articles into topics such as politics, entertainment, health, sports, and travel. You take each topic and construct a sentence like “This text is about politics”. Now this is an NLI problem: does the article entail our artificial sentence, which contains our topic?

It’s a deceptively simple idea, but it works out quite well in practice.

Let’s put it into practice

We are going to use the Transformers library from an awesome company called HuggingFace 🤗. They provide a pipeline that does all of this for us, so it takes only a few lines:

from transformers import pipeline

# device=0 runs the model on the first GPU; omit it to run on the CPU
classifier = pipeline("zero-shot-classification", device=0)

sequence = "Who are you voting for in 2020?"
candidate_labels = ["politics", "public health", "economics"]

result = classifier(sequence, candidate_labels)
print(result)

And the output is: 

{'labels': ['politics', 'economics', 'public health'],
 'scores': [0.972518801689148, 0.014584126882255077, 0.012897057458758354],
 'sequence': 'Who are you voting for in 2020?'}

In this simple example, the question “Who are you voting for in 2020?” was classified as being about politics with 97% probability, economics with 1.4%, and public health with 1.2%, so it got this example right.

Running this is best done on a GPU, and it requires having all the libraries installed. It’s not hard to set everything up on your own computer, but it works better out of the box on Colab, a free environment Google provides for running Python notebooks in the cloud. You can even request a GPU in Colab. A more detailed notebook about this can be found here.

If you want an even simpler way to try it out, without messing around with notebooks, HuggingFace offers a demo on their website, where you just paste in different texts and the list of labels and it classifies them for you.

Other languages 

All I presented above was for texts in English. But the same approach can work for other languages as well! There are pretrained models that are tuned for other languages, such as xlm-roberta-large-xnli, which supports 100 different languages, including Romanian. In general, results are best for English, because that’s where most of the data is (the XLM-RoBERTa model was trained on 300 GB of English text) and where most of the research has been focused, but even for Romanian there are about 60 GB of training data, so that should be enough to get things started.
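
Switching to the multilingual model is a one-line change in the pipeline call. A hedged sketch, assuming the commonly referenced joeddav/xlm-roberta-large-xnli checkpoint on the HuggingFace hub (the Romanian example is just the earlier question translated):

from transformers import pipeline

# Swap in a multilingual NLI model; everything else stays the same.
classifier = pipeline("zero-shot-classification",
                      model="joeddav/xlm-roberta-large-xnli")

# "Who are you voting for in 2020?", this time in Romanian.
sequence = "Cu cine votezi în 2020?"
candidate_labels = ["politică", "sănătate publică", "economie"]

print(classifier(sequence, candidate_labels))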

When to use this

As I mentioned before, this is best run on GPUs. You can run it on CPUs, but it will be much slower (10-20 times slower). The more labels you have, the slower it is. It’s quite a complicated model, so it takes a lot of resources.

For text classification, there are many other models that are simpler, faster, and cheaper to run. But they have the disadvantage of requiring annotated data. If you have annotated data, try those first.

But if for example you are prototyping an idea for a startup and you don’t have annotated data yet, this approach is very good to get you started. In the beginning, you will not have many documents to classify anyway, so the fact that it’s slower is not too problematic, and it will help you quickly validate your idea. If it works, you can then invest in gathering annotated data and then switch to a simpler model.

Another way this model can help is by bootstrapping the annotation process. You have a large set of documents without labels; you run this model over them to generate candidate labels, which might have only 50% accuracy. Then the human labelers only have to verify the suggested labels, which speeds up the annotation process.
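
A minimal sketch of that bootstrapping loop, reusing the zero-shot pipeline from earlier (the documents, labels and confidence threshold below are placeholders you would adapt to your own data):

from transformers import pipeline

classifier = pipeline("zero-shot-classification")

candidate_labels = ["politics", "entertainment", "health", "sports", "travel"]
documents = ["Put your unlabeled documents here"]

suggestions = []
for doc in documents:
    result = classifier(doc, candidate_labels)
    # Keep only reasonably confident suggestions; the rest go to the
    # human annotators without any pre-filled label.
    if result["scores"][0] > 0.5:
        suggestions.append((doc, result["labels"][0]))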

Conclusion

Six years ago, computer vision had its so-called “ImageNet” moment, when the challenge of labeling objects in images was “solved”. A new model was presented then which blew away all previous models. NLP is now getting closer to such a moment, with models such as SentenceBERT. In this article, I presented only how to use them for text classification, but they have many other use cases, such as finding similar articles, paraphrase mining, and so on.

It’s an exciting time to be doing NLP!

Misunderstood ML suggestions

A couple of years ago I was working on a calendar application, on the machine learning team, to make it smarter. We had many great ideas, one of them being that once you indicated you wanted to meet with a group of people, the app would automatically suggest a time slot for the meeting.

We worked on it for several months. Because we wanted to do things like learn every user’s working hours, which could vary based on many things, we couldn’t just use simple hand-coded rules. In the end, we implemented the feature using a combination of hand-coded rules (to avoid some bad edge cases) and machine learning. We did lots of testing, both automated and manual, within our team.

Once the UI was ready, we did some user testing, where the new prototype was put in front of real users, unrelated to our team, who were recorded while they tried to use it and then were asked questions about the product. When the reports came in, the whole team banged their heads against the desk: most users thought we were suggesting times when the meeting couldn’t take place!

What happened? If you included many people, or even just one very busy person, there was no empty slot that worked for everyone. So our algorithm would make three suggestions, noting for each one a different person who might not be able to make the meeting.

In our own testing, it was obvious to us what was happening, so we didn’t consider it a big problem. But users who didn’t know the system found it confusing and kept going to the classic grid to manually find a slot for the meeting.

Lesson: machine learning algorithms are never perfect and every project needs to be prepared to deal with mistakes.

How will your machine learning project handle failures? How will you explain to the end users the decisions the algorithm made? If you need help answering these questions, let’s talk.

GPT-3 and AGI

One of the most impressive/controversial papers from 2020 was GPT-3 from OpenAI. It’s nothing particularly new conceptually: it’s mostly a scaled-up version of GPT-2, which came out in 2019. At 175 billion parameters, GPT-3 was by far the largest machine learning model at the time it was released.

It’s a fairly simple algorithm: it’s learning to predict the next word in a text. It learns to do this by training on several hundred gigabytes of text gathered from the Internet. Then to use it, you give it a prompt (a starting sequence of words) and then it will start generating more words. Eventually it will decide to finish the text by emitting a stop token.

Using this seemingly stupid approach, GPT-3 is capable of generating a wide variety of interesting texts: it can write poems (not prize winning, but still), write news articles, imitate well-known authors, make jokes, argue for its self-awareness, do basic math and, shockingly to programmers all over the world, who are now afraid the robots will take their jobs, it can code simple programs.

That’s amazing for such a simple approach. The internet was divided upon seeing these results. Some were welcoming our GPT-3 AI overlords, while others were skeptical, calling it just fancy parroting, without a real understanding of what it says.

I think both sides have a grain of truth. On one hand, it’s easy to find failure cases. It’s easy to make it say things like “a horse has five legs”, showing it doesn’t really know what a horse is. But are humans that different? Think of a small child who is being taught by his parents to say “Please” before his requests. I remember being amused by a small child saying “But I said please” when he was refused by his parents. The kid probably thought that “Please” is a magic word that can unlock anything. Well, not really, in real life we just use it because society likes polite people, but saying please when wishing for a unicorn won’t make it any more likely to happen.

And it’s not just little humans who do that. Sometimes even grownups parrot stuff without thinking about it, because that’s what they heard all their life and they never questioned it. It actually takes a lot of effort to think, to ensure consistency in your thoughts and to produce novel ideas. In this sense, expecting an artificial intelligence that is around human level might be a disappointment.

On the other hand, I believe there is a reason why this amazing result happened in the field of natural language processing and not say, computer vision. It has been long recognized that language is a powerful tool, there is even a saying about it: “The pen is mightier than the sword”. Human language is so powerful that we can encode everything that there is in this universe into it, and then some (think of all the sci-fi and fantasy books). More than that, we use language to get others to do our bidding, to motivate them, to cooperate with them and to change their inner state, making them happy or inciting them to anger.

While there is a common ground in the physical world, often that is not very relevant to the point we are making: “A rose by any other name would smell as sweet”. Does it matter what a rose is when the rallying call is to get more roses? As long as the message gets across and is understood in the same way by all listeners, no, it doesn’t. Similarly, if GPTx can effect the desired change in its readers, it might be good enough, even if it doesn’t have a mythical understanding of what those words mean.

How to ML – Monitoring

As much as machine learning developers like to think that once they’ve got a good enough model and deployed it the job is done, it’s not quite so.

The first couple of weeks after deployment are critical. Is the model really as good as the offline tests said it was? Maybe something is different in production than in your test data. Maybe the data you collected for offline training includes pieces of information that are not available at inference time. For example, if you are trying to predict click-through rates for items in a list and use them to rank the items, it’s easy to include the rank of each item when building the training dataset. But the model won’t have that rank when making predictions, because it’s exactly what you’re trying to infer. Surprise, the model will perform very poorly in production.
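
A tiny sketch of that kind of leakage with a hypothetical click-through dataset: the rank column is only known after the list was shown, so it has to be dropped before training (all column names are made up):

import pandas as pd

# Hypothetical training data logged from production.
log = pd.DataFrame({
    "item_price": [9.99, 4.50, 12.00],
    "item_category": ["books", "toys", "books"],
    "rank_in_list": [1, 2, 3],   # only known AFTER ranking: leakage!
    "clicked": [1, 0, 0],        # the label we want to predict
})

# Features must contain only what is available at inference time.
features = log.drop(columns=["rank_in_list", "clicked"])
labels = log["clicked"]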

Or maybe simply A/B testing reveals that the fancy ML model doesn’t really perform better in production than the old rules written with lots of elbow grease by lots of developers and business analysts, using lots of domain knowledge and years of experience.

But even if the model does well at the beginning, will it continue to do so? Maybe there will be an external change in user behavior and they will start searching for other kinds of queries, which your model was not developed for. Or maybe your model will introduce a “positive” feedback loop: it suggests some items, users click on them, so those items get suggested more often, so more users click on them. This leads to a “rich get richer” kind of situation, but the algorithm is actually not making better and better suggestions.

Maybe you are on top of this and you keep retraining your model weekly to keep it in step with user behavior. But then you need a staggered release of the model, to make sure that the new one really performs better across all relevant dimensions. Is inference speed still good enough? Are predictions relatively stable, meaning we don’t recommend only action movies one week and then only comedies the next? Are models even comparable from one week to another? Or is there a significant random component to them which makes it really hard to see whether they improved? For example, how are the clusters from the user post data built up? K-means starts with random centroids, and the clusters from one run have only a passing similarity to those from another run. How will you deal with that?
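
To make that last point concrete, here is a small sketch showing how two k-means runs on the same data can group points differently just because of the random initialization (synthetic data, scikit-learn):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the "user post" features, just for illustration.
X, _ = make_blobs(n_samples=500, centers=8, random_state=0)

# Same algorithm, same data, different random starting centroids.
run_a = KMeans(n_clusters=8, n_init=1, random_state=1).fit_predict(X)
run_b = KMeans(n_clusters=8, n_init=1, random_state=2).fit_predict(X)

# 1.0 would mean the two clusterings are identical; lower values mean the
# runs grouped the points differently even though nothing else changed.
print(adjusted_rand_score(run_a, run_b))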

How to ML – Deploying

So the ML engineer presented the model to the business stakeholders and they agreed that it performed well enough on the key metrics in testing that it’s time to deploy it to production.

So now we have to make sure the models run reliably in production. We have to answer some more questions in order to make some trade-offs.

How important is latency? Is the model making an inference in response to a user action, so it’s crucial to have the answer in tens of milliseconds? Then it’s time to optimize the model: quantize weights, distill knowledge into a smaller model, prune weights, and so on. Hopefully, your metrics won’t go down due to the optimization.
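
As one example of such an optimization, PyTorch can apply dynamic quantization to the linear layers of a trained model in a couple of lines. A rough sketch (the model here is just a placeholder network):

import torch

# Placeholder for a trained model; in practice this would be your own network.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 2),
)

# Convert the weights of the Linear layers to int8, trading a bit of accuracy
# for a smaller and usually faster model on CPU.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)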

Can the results be precomputed? For example, if you want to make movie recommendations, maybe there can be a batch job that runs every night, does the inference for every user and stores the results in a database. Then when the user makes a request, the recommendations are simply loaded from the database. This is possible only if you have a finite range of predictions to make.

Where are you running the model? On big beefy servers with a GPU? On mobile devices, which are much less powerful? Or on some edge devices that don’t even have an OS? Depending on the answer, you might have to convert the model to a different format or optimize it to be able to fit in memory.

Even in the easy case where you are running the model on servers and latency can be several seconds, you still have to do the whole dance of making it work there. “Works on my machine” is all too often a problem. Maybe production runs a different version of Linux, which has a different BLAS library, and the security team won’t let you update things. Simple, just use Docker, right? Right, better hope you are good friends with the DevOps team who can help you set up the CI/CD pipelines.

But you’ve killed all the dragons, now it’s time to keep watch… aka monitoring the model’s performance in production.

How to ML – Models

So we finally got our data and we can get to machine learning. Without the data, there is no machine learning, there is at best human learning, where somebody tries to write an algorithm by hand to do the task at hand.

This is the part that most people who want to do machine learning are excited about. I read Bishop’s and Murphy’s textbooks, watched Andrew Ng’s online course about ML and learned about different kinds of ML algorithms and I couldn’t wait to try them out and to see which machine learning model is the best for the data at hand.

You start off with a simple one, a linear or logistic regression, to get a baseline. Maybe you even play around with the hyperparameters. Then you move on to a more complicated model, such as a random forest. You spend more time fiddling with it, getting 20% better results. Then you switch to the big guns, neural networks. You start with a simple one, with just 3 layers, and progressively end up with 100 ReLU and SIREN layers, dropout, batchnorm, ADAM, convolutions, attention mechanism and finally you get to 99% accuracy.

And then you wake up from your nice dream.

In practice, playing around with ML algorithms is just 10% of the job for an ML engineer. You do try out different algorithms, but you rarely write new ones from scratch. When running in production, if it’s not in scikit-learn, TensorFlow or PyTorch, it won’t fly. For proof-of-concept projects you might try to use the GitHub repo that accompanies a paper, but that path is full of pain: trying to find all the dependencies of undocumented code and making it work.

For hyperparameter tuning, there are libraries to help you, and anyway, for any real-life dataset, the time it takes to finish the training runs is much larger than the time you spend coding it up.

And in practice, you run into many issues with the data. You’ll find that some of the columns in the data have lots of missing values. Or some of the datapoints that come from different sources have different meanings for the same columns. You’ll find conflicting or invalid labels. And that means going back to the data pipelines and fixing the bugs that occur there.

If you do get a model that is good enough, it’s time to deploy it, which comes with its own fun…

How to ML – Data

So we’ve decided what metrics we want to track for our machine learning project. Because ML needs data, we need to get it.

In some cases we get lucky and we already have it. Maybe we want to predict the failure of pieces of equipment in a factory. There are already lots of sensors measuring the performance of the equipment, and there are service logs saying what was replaced for each piece of equipment. In theory, all we need is a bit of a big data processing pipeline, say with Apache Spark, and we can get the data in the form of (input, output) pairs that can be fed into a machine learning classifier that predicts whether a piece of equipment will fail based on the last 10 values measured by its sensors. In practice, we’ll find that sensors of the same type that come from different manufacturers have different ranges of possible values, so they will all have to be normalized. Or that the service logs are filled out differently by different people, so those will have to be standardized as well. Or worse, the sensor data is good, but it’s kept only for 1 month to save on storage costs, so we have to fix that and wait a couple of months for more training data to accumulate.
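
A sketch of what that normalization step might look like with pandas, assuming each reading is tagged with its manufacturer and each manufacturer has a known value range (all names and numbers are invented for illustration):

import pandas as pd

# Hypothetical value ranges per manufacturer for the same sensor type.
sensor_ranges = {
    "acme":   (0.0, 100.0),
    "globex": (0.0, 4095.0),   # e.g. a raw 12-bit ADC reading
}

readings = pd.DataFrame({
    "manufacturer": ["acme", "globex", "globex"],
    "value": [42.0, 2048.0, 1024.0],
})

def normalize(row):
    low, high = sensor_ranges[row["manufacturer"]]
    # Rescale every reading to the same 0..1 range.
    return (row["value"] - low) / (high - low)

readings["value_normalized"] = readings.apply(normalize, axis=1)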

The next best case is that we don’t have the data, but we can get it somehow. Maybe there are already datasets on the internet that we can download for free. This is the case for most face recognition applications: there are plenty of annotated face datasets out there, with various licenses. In some cases the dataset must be bought, for example, if we want to start a new ad network, there are plenty of datasets available online of personal data about everyone, which can be used then to predict the likelihood of clicking on an ad. That’s the business model of many startups…

The worst case is that we don’t have data and we can’t find it out there. Maybe it’s because we have a very specific niche, such as wanting to find defects in the manufacturing process of our specific widgets, so we can’t use random images from the internet to learn this. Or maybe we want to do something that is really new (or very valuable), in which case we will have to gather the data ourselves. If we want to solve something in the physical world, that will mean installing sensors to gather data. After we get the raw data, such as images of our widgets coming off the production line, we will have to annotate those images. This means getting them in front of humans who know how to tell if a widget is good or defective. There needs to be a QA process in this, because even humans have an error rate, so each image will have to be labeled by at least three humans. We need several thousand samples, so this will take some time to set up, even if we can use crowdsourcing websites such as AWS Mechanical Turk to distribute the tasks to many workers across the world.
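
Aggregating those three labels per image is usually a simple majority vote; a minimal sketch (file names and labels are made up):

from collections import Counter

# Three annotators' labels for each image, keyed by image id.
annotations = {
    "widget_001.jpg": ["good", "good", "defective"],
    "widget_002.jpg": ["defective", "defective", "defective"],
}

final_labels = {}
for image, labels in annotations.items():
    label, votes = Counter(labels).most_common(1)[0]
    # Keep only images where at least 2 of the 3 annotators agree;
    # the rest go back for another round of review.
    if votes >= 2:
        final_labels[image] = label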

Once all this is done, we finally have data. Time to start doing the actual ML…

How to ML – Metrics

We saw that machine learning algorithms process large amounts of data to find patterns. But how exactly do they do that?

The first step in a machine learning project is establishing metrics. What exactly do we want to do and how do we know we’re doing it well?

Are we trying to predict a number? How much will Bitcoin cost next year? That’s a regression problem. Are we trying to predict who will win the election? That’s a binary classification problem (at least in the USA). Are we trying to recognize objects in an image? That’s a multi class classification problem.

Another question that has to be answered is what kind of mistakes are worse. Machine learning is not all-knowing, so it will make mistakes, but there are trade-offs to be made. Maybe we are building a system to find tumors in X-rays: in that case it might be better to cry wolf too often and have false positives, rather than miss a tumor. Or maybe it’s the opposite: we are trying to implement a facial recognition system. If the system misidentifies someone as a burglar, the wrong person could get sent to jail, which is a very bad consequence for a mistake made by “THE algorithm”.
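
In practice, that trade-off often comes down to the decision threshold you put on the model’s scores; here is a small sketch with scikit-learn (the labels and scores are made up):

from sklearn.metrics import precision_score, recall_score

# Hypothetical ground truth (1 = tumor) and model scores for 8 X-rays.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.55, 0.35, 0.2, 0.1]

for threshold in (0.5, 0.3):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    # A lower threshold catches more tumors (higher recall)
    # at the price of more false alarms (lower precision).
    print(threshold,
          precision_score(y_true, y_pred),
          recall_score(y_true, y_pred))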

These are not just theoretical concerns; they actually matter a lot in building machine learning systems. Because of this, many ML projects are human-in-the-loop, meaning the model doesn’t decide by itself what to do; it merely makes a suggestion which a human will then confirm. In many cases, that is valuable enough, because it makes the human much more efficient. For example, the security guard doesn’t have to look at 20 screens at once, but can look only at the footage that was flagged as anomalous.

Tomorrow we’ll look at the next step: gathering the data.