
How to train a customized Named Entity Recognition (NER) model based on a spaCy pre-trained model

There are a bunch of online resources that teach you how to train your own NER model with spaCy, so I will just attach some links for reference and skip this part.

spaCy: Training the named entity recognizer
Convert SpaCy training data to Dataturks NER JSON output
Custom Named Entity Recognition Using spaCy

Back to the main point of this article: let's check the following code. I'm trying to train a customized NER model while retaining the other pipeline components in spaCy, i.e. the tagger and parser. You can add your own pipeline components to the model as well.

Write a function as below; it loads the model if you've specified one, and otherwise starts from a blank English model. I used the default pre-trained model en_core_web_sm here.

import os
import pickle
import random

import spacy
from spacy.util import minibatch, compounding


def train_spacy(self, training_data, testing_data, dropout, display_freq=1,
                output_dir=None, new_model_name="en_model"):
    # load the pre-trained model if one is specified,
    # otherwise create a blank English Language class
    if self.model:
        nlp = spacy.load(self.model)
    else:
        nlp = spacy.blank('en')

The following is just like the tutorial on the spaCy official website: create the ner pipe or grab the existing one, then add the labels from the training data you've already annotated.

    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    else:
        ner = nlp.get_pipe('ner')

    # add labels: each entity annotation is a (start, end, label) tuple
    for _, annotations in training_data:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])
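
For reference, training_data here uses spaCy v2's offset format: a list of (text, annotations) pairs where each entity is a (start_char, end_char, label) tuple, which is why ent[2] above is the label. A minimal, made-up example:

training_data = [
    ("Uber blew through $1 million a week", {"entities": [(0, 4, "ORG")]}),
    ("Google rebrands its business apps", {"entities": [(0, 6, "ORG")]}),
]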

And now, let’s train. The initial best loss is set to 100000, and I set up early stopping to avoid overfitting.

    # get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    with nlp.disable_pipes(*other_pipes):  # only train NER
        optimizer = nlp.begin_training()
        print(">>>>>>>>>>  Training the model  <<<<<<<<<<\n")

        losses_best = 100000  # deliberately high initial best loss
        early_stop = 0        # counts iterations without improvement

Use stochastic gradient descent as the optimizer with a mini-batch configuration; you can get more insights here.

        for itn in range(self.n_iter):
            print(f"Starting iteration {itn + 1}")
            random.shuffle(training_data)
            losses = {}
            batches = minibatch(training_data, size=compounding(4., 32., 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=dropout,  # dropout - make it harder to memorise data
                    sgd=optimizer,  # callable to update weights
                    losses=losses)

            if itn % display_freq == 0:
                print(f"Iteration {itn + 1} Loss: {losses}")

            # reset the early-stop counter whenever the NER loss improves
            if losses["ner"] < losses_best:
                early_stop = 0
                losses_best = losses["ner"]
            else:
                early_stop += 1

            print(f"Training will stop if this value reaches {self.not_improve}, "
                  f"and it's {early_stop} now.\n")

            if early_stop >= self.not_improve:
                break

        print(">>>>>>>>>>  Finished training  <<<<<<<<<<")

After training, dump the model as binary data to your local disk. Putting the last few lines inside the with nlp.disable_pipes(*other_pipes): block would save only the ner pipeline; keeping them outside saves the entire model, which is exactly what I need.

        if output_dir:
            path = os.path.join(output_dir, f"en_model_ner_{round(losses_best, 2)}")
        else:
            path = os.path.join(os.getcwd(), "lib", "inactive_model",
                                f"en_model_ner_{round(losses_best, 2)}")
        os.makedirs(path, exist_ok=True)  # create the target folder if needed
        if testing_data:
            self.validate_spacy(model=nlp, data=testing_data)

    # serialize with the averaged weights from the optimizer
    with nlp.use_params(optimizer.averages):
        nlp.meta["name"] = new_model_name
        bytes_data = nlp.to_bytes()
        lang = nlp.meta["lang"]
        pipeline = nlp.meta["pipeline"]

    model_data = dict(bytes_data=bytes_data, lang=lang, pipeline=pipeline)

    with open(path + '/model.pkl', 'wb') as f:
        pickle.dump(model_data, f)
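
Since train_spacy() references self.model, self.n_iter, and self.not_improve, it is meant to be a method on a class that isn't shown here. Below is a minimal hypothetical wrapper, just to illustrate how it might be wired up and called:

# hypothetical wrapper class; the attribute names are assumptions
# derived from the self.* references in train_spacy() above
class NERTrainer:
    def __init__(self, model="en_core_web_sm", n_iter=30, not_improve=5):
        self.model = model              # pre-trained model to start from
        self.n_iter = n_iter            # maximum number of training iterations
        self.not_improve = not_improve  # early-stop patience

    train_spacy = train_spacy  # attach the function above as a method

trainer = NERTrainer()
trainer.train_spacy(training_data, testing_data=None, dropout=0.35)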

You can use the following function to load the model back from the path you saved it to on your disk.

def load_model(model_path):
    # read the pickled dict written by train_spacy()
    with open(model_path + '/model.pkl', 'rb') as f:
        model = pickle.load(f)
    # rebuild a blank pipeline with the same language and pipe names,
    # then restore the trained weights from the serialized bytes
    nlp = spacy.blank(model['lang'])
    for pipe_name in model['pipeline']:
        pipe = nlp.create_pipe(pipe_name)
        nlp.add_pipe(pipe)
    nlp.from_bytes(model['bytes_data'])
    return nlp
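
And a quick usage sketch (the path is just an example of a folder name produced by train_spacy()):

nlp = load_model("lib/inactive_model/en_model_ner_12.34")  # example path
doc = nlp("Apple is looking at buying a U.K. startup.")
print([(ent.text, ent.label_) for ent in doc.ents])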