How to train a Spacy model for multi label classification


Let’s take a look at how to do multi label text classification with Spacy. In multi label text classification each text document can have zero, one or more labels associated with it. This makes the problem more difficult than regular multi-class classification, both from a learning perspective and from an evaluation perspective. Spacy offers some tools that make it easier.


Spacy is a great general purpose NLP library that can be used out of the box for things like part of speech tagging, named entity recognition, dependency parsing, morphological analysis and so on. Besides the built-in modules, it can also be used to train custom models, for example for text classification.

Spacy is quite powerful out of the box, but the documentation is often lacking and there are some gotchas that can prevent a model from training, so below is a short guide to training a simple multi label text classification model with this library.

Training data format

Spacy requires training data to be in its own binary data format, so the first step will be to transform our data into this format. I will be working with the lex_glue/ecthr_a dataset in this example.

First, we have to load the dataset.

from datasets import load_dataset

dataset = load_dataset("lex_glue", 'ecthr_a')
print(dataset)
print(dataset['train'][0])

Which will output the following:

DatasetDict({
    train: Dataset({
        features: ['text', 'labels'],
        num_rows: 9000
    })
    test: Dataset({
        features: ['text', 'labels'],
        num_rows: 1000
    })
    validation: Dataset({
        features: ['text', 'labels'],
        num_rows: 1000
    })
})
{'text': ['11.  At the beginning of the events relevant to the application, K. had a daughter, P., and a son, M., born in 1986 and 1988 respectively. P.’s father is X and M.’s father is V. From March to May 1989 K. was voluntarily hospitalised for about three months, having been diagnosed as suffering from schizophrenia. From August to November 1989 and from December 1989 to March 1990, she was again hospitalised for periods of about three months on account of this illness. In 1991 she was hospitalised for less than a week, diagnosed as suffering from an atypical and undefinable psychosis. It appears that social welfare and health authorities have been in contact with the family since 1989.',
  '12.  The applicants initially cohabited from the summer of 1991 to July 1993. In 1991 both P. and M. were living with them. From 1991 to 1993 K. and X were involved in a custody and access dispute concerning P. In May 1992 a residence order was made transferring custody of P. to X.',
  '93.  J. and M.’s foster mother died in May 2001.'],
 'labels': [4]}

The dataset comes with a train, validation and test split. The documents themselves are split into multiple paragraphs and the labels are just integers, not the actual string descriptions of labels. The actual labels are:

labels = ["Article 2", "Article 3", "Article 5", "Article 6", "Article 8", "Article 9", "Article 10", "Article 11", "Article 14", "Article 1 of Protocol 1"]

To transform a single document into the DocBin format, we have to parse the combined paragraphs with Spacy and add all the labels to the document. The parsing we do here is not very important, so we can use the smallest English model from Spacy.

import spacy

nlp = spacy.load("en_core_web_sm")
d = dataset['train'][0]
text = "\n\n".join(d['text'])
doc = nlp(text)
# The dataset stores labels as integer ids, so map them to their names first
doc_labels = [labels[idx] for idx in d['labels']]
for l in labels:
    if l in doc_labels:
        doc.cats[l] = 1
    else:
        doc.cats[l] = 0

print(doc[:10])
print(doc.cats)

Which will output:

11. At the beginning of the events relevant
{'Article 2': 0, 'Article 3': 0, 'Article 5': 0, 'Article 6': 0, 'Article 8': 1, 'Article 9': 0, 'Article 10': 0, 'Article 11': 0, 'Article 14': 0, 'Article 1 of Protocol 1': 0}

One gotcha that I ran into was that you have to specify all the labels for each document (unlike with Fasttext): the ones that apply to the document with “probability” 1, and the ones that do not apply with “probability” 0. Spacy won’t give any errors (unlike scikit-learn) if you don’t do this, but the model will not train and you will always get an accuracy of 0.
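
The rule above can be captured in a small helper. This is only a sketch in plain Python: `make_cats` is a hypothetical name, not part of Spacy, and it just builds the dictionary that you would then copy into `doc.cats`.

```python
from typing import Dict, List

# Label names in the order of the dataset's integer ids (same list as above).
LABELS = ["Article 2", "Article 3", "Article 5", "Article 6", "Article 8",
          "Article 9", "Article 10", "Article 11", "Article 14",
          "Article 1 of Protocol 1"]


def make_cats(label_ids: List[int], label_names: List[str] = LABELS) -> Dict[str, float]:
    """Return a score for every label: 1.0 if it applies, 0.0 otherwise."""
    active = {label_names[idx] for idx in label_ids}
    return {name: 1.0 if name in active else 0.0 for name in label_names}


cats = make_cats([4])
print(cats["Article 8"], cats["Article 2"])  # 1.0 0.0
```

Because every label gets an explicit value, a document without any labels still produces a full dictionary of zeros instead of an empty one.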

The above snippet can be made more efficient by using the built-in pipeline from Spacy, which processes documents in batches, but we will have to go over the documents twice, once to build up the list of joined paragraphs (which Spacy can process) and once to add the labels.

from spacy.tokens import DocBin
from tqdm import tqdm

for t, o in [(dataset['train'], "ecthr_train.spacy"), (dataset['test'], "ecthr_dev.spacy")]:
    db = DocBin()
    docs = []
    cats = []
    print("Extracting text and labels")
    for d in tqdm(t):
        # First pass: collect the joined paragraphs and the label names
        docs.append("\n\n".join(d['text']))
        cats.append([labels[idx] for idx in d['labels']])
    print("Processing docs with spaCy")
    docs = nlp.pipe(docs, disable=["ner", "parser"])
    print("Adding docs to DocBin")
    for doc, cat in tqdm(zip(docs, cats), total=len(cats)):
        # Second pass: set every label explicitly to 1 or 0
        for l in labels:
            if l in cat:
                doc.cats[l] = 1
            else:
                doc.cats[l] = 0
        db.add(doc)
    print(f"Writing to disk {o}")
    db.to_disk(o)

Generating the model config

Spacy has its own config system for training models. You can generate a config with the following command:

> spacy init config --pipeline textcat_multilabel  config_efficiency.cfg
Generated config template specific for your use case
- Language: en
- Pipeline: textcat_multilabel
- Optimize for: efficiency
- Hardware: CPU
- Transformer: None
✔ Auto-filled config with all values
✔ Saved config
You can now add your data and train your pipeline:
python -m spacy train config_efficiency.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy

By default, it uses a simple bag of words model, but you can set it to use a bigger convolutional model:

spacy init config --pipeline textcat_multilabel --optimize accuracy config.cfg

One thing that I usually change in the generated config is the logging system. I either enable the Weights & Biases logger (which requires wandb to be installed in the virtual environment) or at least enable the progress bar:

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = true

You can modify any of the hyperparameters of the pipeline here, such as optimizer type or the ngram_size of the model, which is 1 by default (and I usually increase it to 2-3).
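
To get a feel for what raising ngram_size does: as far as I understand the bag of words architecture, the setting is the maximum n-gram length, so a value of 2 adds word pairs on top of the single words. The sketch below illustrates that idea in plain Python. It is not Spacy's implementation, and `bow_features` is a hypothetical helper.

```python
from typing import List, Tuple


def bow_features(tokens: List[str], ngram_size: int) -> List[Tuple[str, ...]]:
    """Collect all n-grams of length 1..ngram_size, shortest first."""
    features = []
    for n in range(1, ngram_size + 1):
        for start in range(len(tokens) - n + 1):
            features.append(tuple(tokens[start:start + n]))
    return features


tokens = ["right", "to", "a", "fair", "trial"]
print(len(bow_features(tokens, 1)))  # 5
print(len(bow_features(tokens, 2)))  # 9
```

With ngram_size 2 the model can tell "fair trial" apart from the two words appearing separately, at the cost of a larger feature space.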

Another thing you can set here is how Spacy should determine the best model at the end of training. You can weight the different metrics: micro/macro recall/precision/F1 scores. By default it looks only at the F1 score. Setting this depends very much on what problem you are trying to solve and what is more important from a business perspective.
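
The difference between the micro and macro variants matters when some labels are rare. Here is a small sketch of how the two F1 averages could be computed from gold and predicted label sets. It is written from the textbook definitions, not taken from Spacy's scorer, and the function names are hypothetical.

```python
from typing import Dict, List, Set


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold: List[Set[str]], pred: List[Set[str]], labels: List[str]) -> Dict[str, float]:
    counts = {l: {"tp": 0, "fp": 0, "fn": 0} for l in labels}
    for g, p in zip(gold, pred):
        for l in labels:
            if l in g and l in p:
                counts[l]["tp"] += 1
            elif l in p:
                counts[l]["fp"] += 1
            elif l in g:
                counts[l]["fn"] += 1
    # Macro: average the per-label F1 scores, so rare labels weigh as much as frequent ones.
    macro = sum(f1(**c) for c in counts.values()) / len(labels)
    # Micro: pool the counts over all labels first, so frequent labels dominate.
    total = {k: sum(c[k] for c in counts.values()) for k in ("tp", "fp", "fn")}
    return {"micro_f": f1(**total), "macro_f": macro}


gold = [{"A"}, {"A"}, {"A", "B"}, {"B"}]
pred = [{"A"}, {"A"}, {"A"}, set()]
scores = micro_macro_f1(gold, pred, ["A", "B"])
print(scores)  # {'micro_f': 0.75, 'macro_f': 0.5}
```

In this toy example the model never predicts the rarer label B. The micro score still looks decent, while the macro score drops to 0.5, which is why the choice of metric should follow from how much the rare labels matter to you.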

Training the model

Spacy makes this super simple:

> spacy train config_efficiency.cfg --paths.train ./ecthr_train.spacy --paths.dev ./ecthr_dev.spacy -o ecthr_model
ℹ  Saving to output directory: ecthr_model                                                                                                                                              
ℹ Using CPU         

=========================== Initializing pipeline ===========================
[2022-09-15 11:24:46,510] [INFO] Set up nlp object from config
[2022-09-15 11:24:46,519] [INFO] Pipeline: ['textcat_multilabel']
[2022-09-15 11:24:46,528] [INFO] Created vocabulary
[2022-09-15 11:24:46,529] [INFO] Finished initializing nlp object
[2022-09-15 11:26:41,921] [INFO] Initialized pipeline components: ['textcat_multilabel']
✔ Initialized pipeline

============================= Training pipeline =============================
ℹ Pipeline: ['textcat_multilabel']
ℹ Initial learn rate: 0.001
E    #       LOSS TEXTC...  CATS_SCORE  SCORE 
---  ------  -------------  ----------  ------
  0       0           0.25       53.31    0.53
  0     200          18.69       54.04    0.54
  0     400          16.42       53.91    0.54
  0     600          15.38       53.36    0.53
  0     800          15.10       54.18    0.54
  0    1000          14.22       54.17    0.54
  0    1200          14.58       53.56    0.54
  0    1400          15.57       55.64    0.56
  0    1600          14.29       56.39    0.56
  0    1800          15.79       56.68    0.57
  0    2000          13.49       57.56    0.58
  0    2200          14.21       56.86    0.57
  0    2400          17.43       57.06    0.57
  0    2600          15.71       58.51    0.59
  0    2800          13.17       56.02    0.56
  0    3000          14.36       57.86    0.58
  0    3200          17.20       58.35    0.58
  0    3400          14.84       57.91    0.58
  0    3600          14.22       56.84    0.57
  0    3800          17.36       59.81    0.60
  0    4000          15.39       54.60    0.55
  0    4200          12.04       58.29    0.58
  0    4400          12.85       58.35    0.58
  0    4600          12.25       58.71    0.59
  0    4800          14.68       59.31    0.59
  0    5000          18.53       59.00    0.59
  0    5200          13.58       59.54    0.60
  0    5400          16.04       58.90    0.59
✔ Saved pipeline to output directory

And now we have two models in the ecthr_model folder: the last one and the one that scored best according to the metrics defined in the config file.

Using the trained model

To use the model, load it in your inference pipeline and use it like any other Spacy model. The only difference will be that the resulting Doc object will have the cats attribute filled with the predictions for your multilabel classification problem.

import spacy

nlp = spacy.load("ecthr_model/model-best")

d = nlp(text)
print(d.cats)
{'Article 2': 0.3531339466571808,  
'Article 3': 0.2542854845523834,  
'Article 5': 0.34043481945991516,  
'Article 6': 0.4782226085662842,  
'Article 8': 0.450054407119751,  
'Article 9': 0.45071953535079956,  
'Article 10': 0.3821248412132263,  
'Article 11': 0.5566793084144592,  
'Article 14': 0.47893860936164856,  
'Article 1 of Protocol 1': 0.3836081027984619}

The output is the probability for each class. The model was trained with a threshold of 0.5, so it would consider only “Article 11” to be applied to this document, but you can choose a different threshold if you want a different precision/recall balance.
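
Applying your own threshold is a one-liner on top of `doc.cats`. A minimal sketch, using the scores from the prediction above (rounded); `labels_above` is a hypothetical helper name:

```python
from typing import Dict, List


def labels_above(cats: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return the labels whose score reaches the threshold, best first."""
    picked = [(score, label) for label, score in cats.items() if score >= threshold]
    return [label for score, label in sorted(picked, reverse=True)]


# Scores copied from the prediction shown above, rounded to three decimals.
cats = {
    "Article 2": 0.353, "Article 3": 0.254, "Article 5": 0.340,
    "Article 6": 0.478, "Article 8": 0.450, "Article 9": 0.451,
    "Article 10": 0.382, "Article 11": 0.557, "Article 14": 0.479,
    "Article 1 of Protocol 1": 0.384,
}

print(labels_above(cats))        # ['Article 11']
print(labels_above(cats, 0.47))  # ['Article 11', 'Article 14', 'Article 6']
```

Lowering the threshold trades precision for recall: more labels are returned, and more of them will be wrong.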

Cons of Spacy

  • Training is slow. Even the efficient architecture, which uses an n-gram bag of words model (with a linear layer on top, I guess), takes half an hour to train. In contrast, scikit-learn can train a logistic regression in minutes.
  • Documentation has gaps: you often have to dig into the source code of Spacy to know exactly what is going on. And searching the internet is not always helpful, because there are many outdated answers and tutorials, which were written for previous versions of Spacy and are no longer relevant.


Spacy is another library that can be used to start training text classification models. It’s particularly great if you are already using it for some of the other things it provides, because then you need fewer dependencies and that can simplify your model maintenance and deployment.

How to use patterns for multilabel text classification annotation in Prodigy


Prodigy is a great tool for annotating the datasets needed to train machine learning models. It has built in support for many kinds of tasks, from text classification, to named entity recognition and even for image and audio annotation.

One of the cool things about Prodigy is that it integrates with Spacy (they are created by the same company), so you can use active learning (having a model suggest annotations and then being corrected by humans) or you can leverage Spacy patterns to automatically suggest annotations.

Prodigy has various recipes for these things, but it doesn’t come with a recipe to use only patterns for manual annotation for a multilabel text classification problem, only in combination with an active learning loop. The problem is that for multi-label annotation, Prodigy does binary annotation for each document, meaning the human annotator will be shown only one label at a time and they’ll have to decide if it’s relevant to the document or not. If you have many labels, it means each document might be shown as many times as there are labels.

I recently had to solve a problem where I knew that most of the documents would have a single label, but in a few cases there would be multiple labels. I also had some pretty good patterns to help bootstrap the process, so I wrote a custom recipe that used only patterns for a multilabel text classification problem.

Code for custom recipe

To do this, I combined some code from the recipes that are provided by Prodigy for text categorization. Let’s see how it works.

First, let’s define the CLI arguments in the recipe file. We’ll need the dataset name, the source file, the spaCy model, the labels and the patterns file:

import prodigy
# Import path as used by Prodigy's built-in recipes; it may differ between versions
from prodigy.util import get_labels

@prodigy.recipe(
    "textcat.manual_patterns",  # Name of the recipe
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("File path with data to annotate", "positional", None, str),
    spacy_model=("Loadable spaCy pipeline or blank:lang (e.g. blank:en)", "positional", None, str),
    labels=("Comma-separated label(s) to annotate or text file with one label per line", "option", "l", get_labels),
    patterns=("Path to match patterns file", "option", "pt", str),
)

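Then we need to define the function that loads the stream of data, runs the PatternMatcher on it and returns the project config. Besides Prodigy’s get_stream, PatternMatcher and log helpers, it uses spacy and the typing module, all of which have to be imported at the top of the file: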

def manual(
    dataset: str,
    source: Union[str, Iterable[dict]],
    spacy_model: str,
    labels: Optional[List[str]] = None,
    patterns: Optional[str] = None,
):
    log("RECIPE: Starting recipe textcat.manual_patterns", locals())
    log(f"RECIPE: Annotating with {len(labels)} labels", labels)
    stream = get_stream(
        source, rehash=True, dedup=True, input_key="text"
    )
    nlp = spacy.load(spacy_model)

    # Match the patterns against the stream and attach the label to each task
    matcher = PatternMatcher(nlp, prior_correct=5.0, prior_incorrect=5.0,
        label_span=False, label_task=True, filter_labels=labels,
        combine_matches=True, task_hash_keys=("label",),
    )
    matcher = matcher.from_disk(patterns)
    stream = add_suggestions(stream, matcher, labels)

    return {
        "view_id": "choice",
        "dataset": dataset,
        "stream": stream,
        "config": {
            "labels": labels,
            "choice_style": "multiple",
            "choice_auto_accept": False,
            "exclude_by": "task",
            "auto_count_stream": True,
        },
    }

The last bit is the function which takes the suggestions generated by the PatternMatcher and preselects them in the UI, so that the annotators can quickly accept them (it uses copy from the standard library):

def add_suggestions(stream, matcher, labels):
    texts = (eg for score, eg in matcher(stream))
    options = [{"id": label, "text": label} for label in labels]

    for eg in texts:
        task = copy.deepcopy(eg)

        task["options"] = options
        if 'label' in task:
            task["accept"] = [task['label']]
            del task['label']
        yield task
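
To see what this function does to a task without starting Prodigy, the same logic can be run on plain dictionaries. This is a sketch under one assumption: each example coming out of the matcher is a dict with a text key and, when a pattern matched, a label key. The `preselect` name is hypothetical.

```python
import copy
from typing import Dict, Iterable, Iterator, List


def preselect(examples: Iterable[Dict], labels: List[str]) -> Iterator[Dict]:
    """Turn examples into multiple choice tasks with the matched label preselected."""
    options = [{"id": label, "text": label} for label in labels]
    for eg in examples:
        task = copy.deepcopy(eg)
        task["options"] = options
        if "label" in task:
            # Move the matched label into "accept", the list of selected option ids.
            task["accept"] = [task.pop("label")]
        yield task


examples = [
    {"text": "Brad Pitt is divorcing Angelina Jolie", "label": "Entertainment"},
    {"text": "Pearl Automation, Founded by Apple Veterans, Shuts Down"},
]
tasks = list(preselect(examples, ["Science", "Technology", "Entertainment", "Politics"]))
print(tasks[0]["accept"])    # ['Entertainment']
print("accept" in tasks[1])  # False
```

Examples without a pattern match still get the full list of options, just with nothing selected, so the annotator picks the labels by hand.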

Expected file formats

Now let’s run the recipe. Assuming we have a news_headlines.jsonl file in the following format:

{"text":"Pearl Automation, Founded by Apple Veterans, Shuts Down"}
{"text":"Silicon Valley Investors Flexed Their Muscles in Uber Fight"}
{"text":"Uber is a Creature of an Industry Struggling to Grow Up"}
{"text": "Brad Pitt is divorcing Angelina Jolie"}
{"text": "Physicists discover new exotic particle"}

And a patterns file patterns.jsonl:

{"pattern": "Uber", "label": "Technology"}
{"pattern": "Brad Pitt", "label": "Entertainment"}
{"pattern": "Angelina Jolie", "label": "Entertainment"}
{"pattern": "physicists", "label": "Science"}
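
Since the recipe passes the label list to the matcher as filter_labels, a typo in a pattern label would likely mean that the pattern never produces a suggestion. A quick sanity check before starting the server can catch that. This is a hypothetical helper in plain Python, not something Prodigy provides:

```python
import json
from typing import List


def unknown_pattern_labels(jsonl_lines: List[str], labels: List[str]) -> List[str]:
    """Return labels used in the patterns that are not in the label set."""
    known = set(labels)
    unknown = []
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        entry = json.loads(line)
        if entry["label"] not in known and entry["label"] not in unknown:
            unknown.append(entry["label"])
    return unknown


patterns = [
    '{"pattern": "Uber", "label": "Technology"}',
    '{"pattern": "Brad Pitt", "label": "Entertainment"}',
    '{"pattern": "physicists", "label": "Science"}',
]
print(unknown_pattern_labels(patterns, ["Science", "Technology", "Entertainment", "Politics"]))  # []
print(unknown_pattern_labels(patterns, ["Science", "Technology"]))  # ['Entertainment']
```

For a real file you would pass the lines read from patterns.jsonl instead of the inline list.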

Running the custom recipe

You can start Prodigy with the following command:

> python -m prodigy textcat.manual_patterns news_headlines news_headlines.jsonl  blank:en --label "Science,Technology,Entertainment,Politics" --patterns patterns.jsonl -F .\

Using 4 label(s): Science, Technology, Entertainment, Politics
Added dataset news_headlines to database SQLite.
D:\Work\staa\prodigy_models\ UserWarning: [W036] The component 'matcher' does not have any patterns defined.
  texts = (eg for score, eg in matcher(stream))

✨  Starting the web server at http://localhost:8080 ...
Open the app in your browser and start annotating!

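In the browser you should see each headline with the four labels as multiple choice options, with the labels suggested by the patterns already selected.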

The full code for the recipe can be found here.

The Best Text Classification library for a Quick Baseline

Text classification is a very frequent use case for machine learning (ML) and natural language processing (NLP). It’s used for things like spam detection in emails, sentiment analysis for social media posts, or intent detection in chat bots.

In this series I am going to compare several libraries that can be used to train text classification models.

The fastText library

fastText is a tool from Facebook made specifically for efficient text classification. It’s written in C++ and optimized for multi-core training, so it’s very fast, being able to process hundreds of thousands of words per second per core. It’s very straightforward to use, either as a Python library or through a CLI tool.

Despite using an older machine learning model (a neural network architecture from 2016), fastText is still very competitive and provides an excellent baseline. If you also take resource usage into account, it is very hard to improve on the fastText results, considering that the models that perform better typically require powerful GPUs.

Getting started with text classification with fastText

fastText requires the training data for text classification to be in a special format: each document should be on a single line and the labels should be at the start of the line, with the prefix __label__, like this:

__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?

If you use Doccano for annotating the text data, it has an option to export the data in fastText format. But even if you used another tool for annotation, it’s only a couple of lines of Python code to convert to the appropriate format. Let’s say we have our data in a JSONL format, with each JSON object having a labels key and a text key. To convert to fastText format, we can use the following short snippet:

import json

with open("fasttext.txt", "w", encoding="utf8") as output:
    with open("dataset.jsonl", encoding="utf8") as f:
        for l in f:
            doc = json.loads(l)
            # fastText labels cannot contain spaces
            labels = [x.replace(" ", "_") for x in doc['labels']]
            labels = " ".join(f"__label__{x}" for x in labels)
            # Each document has to fit on a single line
            txt = " ".join(doc['text'].splitlines())
            line = f"{labels} {txt}\n"
            output.write(line)

Training text classification models with fastText

After you have the data in the right format, the simplest way to use fastText is through its CLI tool. Once it is installed, you can train a model with the supervised subcommand:

> ./fasttext supervised -input fasttext.txt -output model
Read 0M words
Number of words:  16568
Number of labels: 736
Progress: 100.0% words/sec/thread:   47065 lr:  0.000000 avg.loss: 10.027837 ETA:   0h 0m 0s

You can evaluate the model on a separate dataset with the test subcommand and you will get the precision and recall for the first candidate label:

> ./fasttext test model.bin validation.txt
N       15404
P@1     0.162
R@1     0.0701
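
As I understand these numbers, P@1 is the fraction of top-ranked predictions that are correct, and R@1 is the fraction of all true labels that those predictions recover. The sketch below computes both from ranked predictions. It follows that definition and is not fastText's own code; the function name and the toy data are made up for illustration.

```python
from typing import List, Set, Tuple


def precision_recall_at_k(gold: List[Set[str]], ranked: List[List[str]], k: int = 1) -> Tuple[float, float]:
    """Precision and recall when keeping the top-k predictions per document."""
    n_correct = n_predicted = n_gold = 0
    for g, r in zip(gold, ranked):
        top = r[:k]
        n_correct += sum(1 for label in top if label in g)
        n_predicted += len(top)
        n_gold += len(g)
    return n_correct / n_predicted, n_correct / n_gold


gold = [{"sauce", "cheese"}, {"food-safety", "acidity"}, {"cast-iron", "stove"}]
ranked = [["sauce", "baking"], ["baking", "acidity"], ["stove", "cast-iron"]]
p, r = precision_recall_at_k(gold, ranked, k=1)
print(round(p, 3), round(r, 3))  # 0.667 0.333
```

This also shows why recall at 1 stays low on multi label data: with two true labels per document, a single prediction can recover at most half of them.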

You can also get predictions for new documents:

> ./fasttext predict model.bin -
How to make lasagna?
Best way to chop meat
How to store steak

fastText comes with a builtin hyperparameter optimizer, to find the best model on a validation dataset, within the given time (5 minutes by default):

> ./fasttext supervised -input fasttext.txt -output model -autotune-validation validation.txt

If we reevaluate this model we’ll find it performs much better:

> ./fasttext test model.bin validation.txt
N       15404
P@1     0.727
R@1     0.315

A precision of 0.73, compared to 0.16 before. Not bad for 10 minutes of our time, 5 of which were spent waiting for the computer to find us a better model. Keep in mind that autotuning and performance evaluation should happen on separate datasets to avoid overfitting, so real world performance is likely a bit worse than what we got here.

Optimizing for different metrics

This library provides a few knobs you can use to try to obtain better models: what kind of n-grams to use, how big the learning rate should be, which loss function to use, and also which metric to optimize. Is precision or recall better aligned with your business KPIs? Is it more important to have the top result be a really good one or are you looking for several good results in the top 5? Are you only interested in high confidence results? All this depends on the problem you are trying to solve and fastText provides ways to optimize for each of those.

Cons of fastText

Of course, fastText has some disadvantages:

  • Not much flexibility – only one neural network architecture from 2016 implemented with very few parameters to tune
  • No option to speed up using GPU
  • Can be used only for text classification and word embeddings
  • Limited support in other tools (for deployment, for example)


fastText is a great library to use when you want to start solving a text classification problem. In less than half an hour, you can get a good baseline going, which will tell you if this is a problem that is worth pursuing or not.