Text classification is a very common use case for machine learning (ML) and natural language processing (NLP). It's used for things like spam detection in email, sentiment analysis of social media posts, and intent detection in chatbots.
In this series I am going to compare several libraries that can be used to train text classification models.
The fastText library
fastText is a tool from Facebook made specifically for efficient text classification. It’s written in C++ and optimized for multi-core training, so it’s very fast, being able to process hundreds of thousands of words per second per core. It’s very straightforward to use, either as a Python library or through a CLI tool.
Despite using an older machine learning model (a neural network architecture from 2016), fastText is still very competitive and provides an excellent baseline. If you also take into account resource usage, it will be all but impossible to improve on the fastText results, considering that the only models that perform better require powerful GPUs.
Getting started with text classification with fastText
fastText requires the training data for text classification to be in a special format: each document should be on a single line and the labels should be at the start of the line, with the prefix __label__, like this:
Training data format
__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?
If you use Doccano for annotating the text data, it has an option to export the data in fastText format. But even if you used another tool for annotation, it takes only a couple of lines of Python code to convert to the appropriate format. Let's say we have our data in JSONL format, with each JSON object having a labels key and a text key. To convert to fastText format, we can use the following short snippet:
with open("fasttext.txt", "w") as output:
with open("dataset.jsonl", encoding="utf8") as f:
for l in f:
doc = json.loads(l)
labels = [x.replace(" ", "_") for x in doc['labels']]
labels = " ".join(f"__label__{x}" for x in labels)
txt = " ".join(l['text'].splitlines())
line = f"{labels} {txt}\n"
output.write(line)
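For reference, here is a made-up input line in dataset.jsonl and the training line the snippet would produce from it (the field values are just an illustration):

{"labels": ["sauce", "cheese"], "text": "How much does potato starch affect a cheese sauce recipe?"}

__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?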
Training text classification models with fastText
After you have the data in the right format, the simplest way to use fastText is through its CLI tool. Once you have installed it, you can train a model with the supervised subcommand:
> ./fasttext supervised -input fasttext.txt -output model
Read 0M words
Number of words: 16568
Number of labels: 736
Progress: 100.0% words/sec/thread: 47065 lr: 0.000000 avg.loss: 10.027837 ETA: 0h 0m 0s
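The same training run can also be done from Python with the fasttext package; here is a minimal sketch, assuming the training file prepared above:

import fasttext

# train a classifier on the file we prepared earlier
model = fasttext.train_supervised(input="fasttext.txt")

# save it for later use, like the CLI's -output flag does
model.save_model("model.bin")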
You can evaluate the model on a separate dataset with the test subcommand and you will get the precision and recall for the first candidate label:
> ./fasttext test model.bin validation.txt
N 15404
P@1 0.162
R@1 0.0701
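From Python, the equivalent is model.test, which returns the number of examples together with precision and recall at k; a short sketch, reusing the model saved above:

import fasttext

# load the model trained earlier and evaluate it on the held-out file
model = fasttext.load_model("model.bin")

# returns (number of examples, precision@1, recall@1)
n, precision, recall = model.test("validation.txt")
print(n, precision, recall)

# precision and recall among the top 5 predicted labels
print(model.test("validation.txt", k=5))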
You can also get predictions for new documents:
> ./fasttext predict model.bin -
How to make lasagna?
__label__baking
Best way to chop meat
__label__food-safety
How to store steak
__label__food-safety
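In Python, predictions come from model.predict, which also gives you the probability of each predicted label; a minimal sketch:

import fasttext

model = fasttext.load_model("model.bin")

# top prediction for a single document
labels, probabilities = model.predict("How to make lasagna?")
print(labels, probabilities)

# top 3 candidate labels instead of just the best one
print(model.predict("How to make lasagna?", k=3))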
fastText comes with a built-in hyperparameter optimizer that searches for the best model on a validation dataset within a given time budget (5 minutes by default):
> ./fasttext supervised -input fasttext.txt -output model -autotune-validation validation.txt
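In Python, autotuning is triggered by passing the validation file to train_supervised; a sketch (autotuneDuration is given in seconds and is optional):

import fasttext

# let fastText search for good hyperparameters on the validation set
model = fasttext.train_supervised(
    input="fasttext.txt",
    autotuneValidationFile="validation.txt",
    autotuneDuration=300,  # seconds; 5 minutes is the default
)
model.save_model("model.bin")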
If we reevaluate this model we’ll find it performs much better:
> ./fasttext test model.bin validation.txt
N 15404
P@1 0.727
R@1 0.315
A precision of 0.72, compared to 0.16 before. Not bad for 10 minutes of our time, out of which 5 were spent waiting for the computer to find us a better model.1

1 Autotuning and performance evaluation should happen on separate datasets, to avoid overfitting, so real-world performance is likely a bit worse than what we got here.
Optimizing for different metrics
This library provides a number of knobs you can use to try to obtain better models: what kind of n-grams to use, how big the learning rate should be, which loss function to use, but also which metric you are trying to optimize. Is precision or recall better aligned with your business KPIs? Is it more important that the top result is a really good one, or are you looking for several good results among the top 5? Are you only interested in high-confidence results? All of this depends on the problem you are trying to solve, and fastText provides ways to optimize for each of those, as sketched below.
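As an illustration, here is a sketch of what that looks like with the Python API: the autotuner can target the F1 score of one specific label instead of overall performance, and the usual training knobs are plain keyword arguments. The __label__baking label is just an example from the dataset above, and the parameter names follow the fastText documentation, so double-check them against your installed version.

import fasttext

# tune hyperparameters for the F1 score of one specific label
model = fasttext.train_supervised(
    input="fasttext.txt",
    autotuneValidationFile="validation.txt",
    autotuneMetric="f1:__label__baking",
)

# or set the knobs manually: word bigrams, a higher learning rate,
# more epochs, and one-vs-all loss, which suits multi-label problems
model = fasttext.train_supervised(
    input="fasttext.txt",
    wordNgrams=2,
    lr=0.5,
    epoch=25,
    loss="ova",
)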
Cons of fastText
Of course, fastText has some disadvantages:
- Not much flexibility – only one neural network architecture from 2016 implemented with very few parameters to tune
- No option to speed up training with a GPU
- Can be used only for text classification and word embeddings
- Not widely supported by other tools (for deployment, for example)
Conclusion
fastText is a great library to use when you want to start solving a text classification problem. In less than half an hour, you can get a good baseline going, which will tell you if this is a problem that is worth pursuing or not.
Hi, nitpick: looks like you’re autotuning on the validation set and asserting better performance on the validation set, probably best to get a third test set, perhaps even partition that up and show how it improves on different parts of unseen test data after hyperparameter tuning.
Can be tricky to not muddy the waters there, but for the best representation of accuracy/performance, best to always use a new unseen set of data for verifying performance.
Yes, using the same data for both autotuning and evaluating performance will lead to some overfitting, so the obtained performance numbers are not perfectly in line with what you will see in real world usage. I wrote it this way to keep the example short and because overfitting is another can of worms, worth a post by itself. But I added a footnote to clarify this.
I have to say, this sounds very fast.
I see it learns word vectors. In the paper I see the classification is using bag-of-ngram-vectors (and some hashing speedup tricks).
Do you know of a way to get sentence or document vectors? Would an RNN be useful?
RNNs are not really used anymore. Transformers are more popular and give good results. There is a library made specifically for obtaining sentence vectors: https://www.sbert.net/