.. Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
Machine Translation with Transformer
====================================
In this notebook, we will show how to train the Transformer model
introduced in [1] and evaluate a pretrained model using GluonNLP. The
model is both more accurate and lighter to train than previous seq2seq
models. Together, we will go through the following:
1) Use the state-of-the-art pretrained Transformer model: we will
evaluate the pretrained SOTA Transformer model and translate a few
sentences ourselves with the ``BeamSearchTranslator`` using the SOTA
model;
2) Train the Transformer yourself: this includes loading and processing
the dataset, defining the Transformer model, writing the training
script, and evaluating the trained model. Note that obtaining
state-of-the-art results on the WMT 2014 English-German dataset takes
around one day of training. To let you run through the Transformer
quickly, we suggest starting with the ``TOY`` dataset sampled from the
WMT dataset (the default in this notebook).
Preparation
-----------
Load MXNet and GluonNLP
~~~~~~~~~~~~~~~~~~~~~~~
.. code:: python
import warnings
warnings.filterwarnings('ignore')
import random
import numpy as np
import mxnet as mx
from mxnet import gluon
import gluonnlp as nlp
Set Environment
~~~~~~~~~~~~~~~
.. code:: python
np.random.seed(100)
random.seed(100)
mx.random.seed(10000)
ctx = mx.gpu(0)
Use the SOTA Pretrained Transformer model
-----------------------------------------
In this subsection, we first load the SOTA Transformer model from the
GluonNLP model zoo, then load the full WMT 2014 English-German test
dataset, and finally evaluate the model.
Get the SOTA Transformer
~~~~~~~~~~~~~~~~~~~~~~~~
Next, we load the pretrained SOTA Transformer using the model API in
GluonNLP. This gives us easy access to the SOTA machine translation
model, which we can then use in our own applications.
.. code:: python
import nmt
wmt_model_name = 'transformer_en_de_512'
wmt_transformer_model, wmt_src_vocab, wmt_tgt_vocab = \
nmt.transformer.get_model(wmt_model_name,
dataset_name='WMT2014',
pretrained=True,
ctx=ctx)
print(wmt_src_vocab)
print(wmt_tgt_vocab)
The Transformer model architecture is shown below:
.. raw:: html
<div style="width: 500px;">
|transformer|
.. raw:: html
</div>
.. code:: python
print(wmt_transformer_model)
Load and Preprocess WMT 2014 Dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We then load the WMT 2014 English-German test dataset for evaluation
purposes.
The following shows how to process the dataset and cache the processed
dataset for future use. The processing steps include:
1) clip the source and target sequences;
2) split the string input into a list of tokens;
3) map the string tokens to their indices in the vocabulary;
4) append the EOS token to the source sentence and add BOS and EOS
tokens to the target sentence.
A minimal sketch of these steps on a toy example is shown below.
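The sketch uses a small hypothetical vocabulary (``toy_vocab`` is made
up purely for illustration; the real pipeline uses the WMT BPE
vocabularies ``wmt_src_vocab``/``wmt_tgt_vocab`` loaded above):
.. code:: python
# Illustration only: a toy vocabulary standing in for the WMT BPE vocabularies.
toy_vocab = nlp.Vocab(nlp.data.count_tokens('we love each other'.split()))
src_sentence = 'we love each other'
max_len = 10
tokens = src_sentence.split()[:max_len]                 # 1) clip and 2) tokenize
token_ids = toy_vocab[tokens]                           # 3) map tokens to indices
# 4) append EOS on the source side; add BOS and EOS on the target side
src_ids = token_ids + [toy_vocab[toy_vocab.eos_token]]
tgt_ids = [toy_vocab[toy_vocab.bos_token]] + token_ids + [toy_vocab[toy_vocab.eos_token]]
print(src_ids)
print(tgt_ids)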
Let's first look at the WMT 2014 corpus.
.. code:: python
import hyperparameters as hparams
wmt_data_test = nlp.data.WMT2014BPE('newstest2014',
src_lang=hparams.src_lang,
tgt_lang=hparams.tgt_lang,
full=False)
print('Source language %s, Target language %s' % (hparams.src_lang, hparams.tgt_lang))
wmt_data_test[0]
.. code:: python
wmt_test_text = nlp.data.WMT2014('newstest2014',
src_lang=hparams.src_lang,
tgt_lang=hparams.tgt_lang,
full=False)
wmt_test_text[0]
We then extract the gold target translations for later evaluation.
.. code:: python
wmt_test_tgt_sentences = list(wmt_test_text.transform(lambda src, tgt: tgt))
wmt_test_tgt_sentences[0]
.. code:: python
import dataprocessor
print(dataprocessor.TrainValDataTransform.__doc__)
.. code:: python
wmt_transform_fn = dataprocessor.TrainValDataTransform(wmt_src_vocab, wmt_tgt_vocab, -1, -1)
wmt_dataset_processed = wmt_data_test.transform(wmt_transform_fn, lazy=False)
print(*wmt_dataset_processed[0], sep='\n')
Create Sampler and DataLoader for WMT 2014 Dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code:: python
wmt_data_test_with_len = gluon.data.SimpleDataset([(ele[0], ele[1], len(
ele[0]), len(ele[1]), i) for i, ele in enumerate(wmt_dataset_processed)])
Now that we have the processed test dataset, the next step is to
construct the sampler and DataLoader. We first construct the batchify
function, which pads and stacks sequences to form mini-batches.
.. code:: python
wmt_test_batchify_fn = nlp.data.batchify.Tuple(
nlp.data.batchify.Pad(),
nlp.data.batchify.Pad(),
nlp.data.batchify.Stack(dtype='float32'),
nlp.data.batchify.Stack(dtype='float32'),
nlp.data.batchify.Stack())
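As a quick sanity check (with made-up samples, purely for illustration),
the batchify function pads the variable-length sequences and stacks the
scalar fields:
.. code:: python
# Two fake samples of (src_ids, tgt_ids, src_length, tgt_length, index).
toy_samples = [([1, 2, 3], [4, 5], 3, 2, 0), ([6, 7], [8, 9, 10], 2, 3, 1)]
toy_batch = wmt_test_batchify_fn(toy_samples)
print(toy_batch[0])  # padded source batch of shape (2, 3)
print(toy_batch[2])  # stacked source lengths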
We can then construct bucketing samplers, which generate batches by
grouping sequences with similar lengths.
.. code:: python
wmt_bucket_scheme = nlp.data.ExpWidthBucket(bucket_len_step=1.2)
.. code:: python
wmt_test_batch_sampler = nlp.data.FixedBucketSampler(
lengths=wmt_dataset_processed.transform(lambda src, tgt: len(tgt)),
use_average_length=True,
bucket_scheme=wmt_bucket_scheme,
batch_size=256)
print(wmt_test_batch_sampler.stats())
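To get a feel for what the bucketing sampler produces (illustration
only), we can peek at the first batch of example indices it yields:
.. code:: python
# Each element yielded by the sampler is a list of example indices for one mini-batch.
first_batch_indices = next(iter(wmt_test_batch_sampler))
print(len(first_batch_indices), first_batch_indices[:5])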
Given the sampler, we can create a DataLoader, which is iterable.
.. code:: python
wmt_test_data_loader = gluon.data.DataLoader(
wmt_data_test_with_len,
batch_sampler=wmt_test_batch_sampler,
batchify_fn=wmt_test_batchify_fn,
num_workers=8)
len(wmt_test_data_loader)
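To inspect the structure of one mini-batch produced by the DataLoader
(illustration only), we can fetch a single batch and look at the shapes:
.. code:: python
# A batch is (padded src, padded tgt, src lengths, tgt lengths, example indices).
src_seq, tgt_seq, src_len, tgt_len, idx = next(iter(wmt_test_data_loader))
print(src_seq.shape, tgt_seq.shape, src_len.shape)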
Evaluate Transformer
~~~~~~~~~~~~~~~~~~~~
Next, we reproduce the SOTA results on the WMT test dataset. As the
result shows, we are able to reach a BLEU score of 27.35, matching the
state of the art.
We first define the ``BeamSearchTranslator`` to generate the actual
translations.
.. code:: python
wmt_translator = nmt.translation.BeamSearchTranslator(
model=wmt_transformer_model,
beam_size=hparams.beam_size,
scorer=nlp.model.BeamSearchScorer(alpha=hparams.lp_alpha, K=hparams.lp_k),
max_length=200)
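For intuition, ``nlp.model.BeamSearchScorer(alpha, K)`` normalizes the
accumulated log-probability of each candidate by a GNMT-style length
penalty. The sketch below reimplements the penalty term as we understand
it, purely for illustration:
.. code:: python
# Illustrative reimplementation of the GNMT-style length penalty.
def length_penalty(length, alpha=hparams.lp_alpha, K=hparams.lp_k):
    return ((K + length) / (K + 1.0)) ** alpha
print([round(length_penalty(l), 3) for l in (5, 10, 20)])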
Then we calculate the ``loss`` as well as the ``bleu`` score on the WMT
2014 English-German test dataset. Note that the following evaluation
process takes roughly 13 minutes to complete.
.. code:: python
import time
import utils
eval_start_time = time.time()
wmt_test_loss_function = nmt.loss.SoftmaxCEMaskedLoss()
wmt_test_loss_function.hybridize()
wmt_detokenizer = nlp.data.SacreMosesDetokenizer()
wmt_test_loss, wmt_test_translation_out = utils.evaluate(wmt_transformer_model,
wmt_test_data_loader,
wmt_test_loss_function,
wmt_translator,
wmt_tgt_vocab,
wmt_detokenizer,
ctx)
wmt_test_bleu_score, _, _, _, _ = nmt.bleu.compute_bleu([wmt_test_tgt_sentences],
wmt_test_translation_out,
tokenized=False,
tokenizer=hparams.bleu,
split_compound_word=False,
bpe=False)
print('WMT14 EN-DE SOTA model test loss: %.2f; test bleu score: %.2f; time cost %.2fs'
%(wmt_test_loss, wmt_test_bleu_score * 100, (time.time() - eval_start_time)))
.. code:: python
print('Sample translations:')
num_pairs = 3
for i in range(num_pairs):
print('EN:')
print(wmt_test_text[i][0])
print('DE-Candidate:')
print(wmt_test_translation_out[i])
print('DE-Reference:')
print(wmt_test_tgt_sentences[i])
print('========')
Translation Inference
~~~~~~~~~~~~~~~~~~~~~
Here we show an actual translation example (EN-DE): given a source
sentence, we translate it with the SOTA Transformer model.
.. code:: python
import utils
print('Translate the following English sentence into German:')
sample_src_seq = 'We love each other'
print('[\'' + sample_src_seq + '\']')
sample_tgt_seq = utils.translate(wmt_translator,
sample_src_seq,
wmt_src_vocab,
wmt_tgt_vocab,
wmt_detokenizer,
ctx)
print('The German translation is:')
print(sample_tgt_seq)
Train Your Own Transformer
--------------------------
In this subsection, we will go through the whole process: loading the
translation dataset in a more unified way, creating the sampler and
DataLoader, defining the Transformer model, and finally writing the
training script to train the model yourself.
Load and Preprocess TOY Dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Note that we use demo mode (the ``TOY`` dataset) by default, since
training on the full WMT 2014 English-German dataset ``WMT2014BPE``
would be slow (~1 day). If you really want to train the model to reach
the SOTA result, please set ``demo = False``. To make the data
processing blocks run more efficiently, we package them (``transform``
etc.) into the ``load_translation_data`` function used below. The
function also returns the gold target sentences as well as the
vocabularies.
.. code:: python
demo = True
if demo:
dataset = 'TOY'
else:
dataset = 'WMT2014BPE'
data_train, data_val, data_test, val_tgt_sentences, test_tgt_sentences, src_vocab, tgt_vocab = \
dataprocessor.load_translation_data(
dataset=dataset,
src_lang=hparams.src_lang,
tgt_lang=hparams.tgt_lang)
data_train_lengths = dataprocessor.get_data_lengths(data_train)
data_val_lengths = dataprocessor.get_data_lengths(data_val)
data_test_lengths = dataprocessor.get_data_lengths(data_test)
data_train = data_train.transform(lambda src, tgt: (src, tgt, len(src), len(tgt)), lazy=False)
data_val = gluon.data.SimpleDataset([(ele[0], ele[1], len(ele[0]), len(ele[1]), i)
for i, ele in enumerate(data_val)])
data_test = gluon.data.SimpleDataset([(ele[0], ele[1], len(ele[0]), len(ele[1]), i)
for i, ele in enumerate(data_test)])
Create Sampler and DataLoader for TOY Dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Now we have obtained ``data_train``, ``data_val``, and ``data_test``.
The next step is to construct the sampler and DataLoader. We first
construct the batchify function, which pads and stacks sequences to form
mini-batches.
.. code:: python
train_batchify_fn = nlp.data.batchify.Tuple(
nlp.data.batchify.Pad(),
nlp.data.batchify.Pad(),
nlp.data.batchify.Stack(dtype='float32'),
nlp.data.batchify.Stack(dtype='float32'))
test_batchify_fn = nlp.data.batchify.Tuple(
nlp.data.batchify.Pad(),
nlp.data.batchify.Pad(),
nlp.data.batchify.Stack(dtype='float32'),
nlp.data.batchify.Stack(dtype='float32'),
nlp.data.batchify.Stack())
target_val_lengths = list(map(lambda x: x[-1], data_val_lengths))
target_test_lengths = list(map(lambda x: x[-1], data_test_lengths))
We can then construct bucketing samplers, which generate batches by
grouping sequences with similar lengths.
.. code:: python
bucket_scheme = nlp.data.ExpWidthBucket(bucket_len_step=1.2)
train_batch_sampler = nlp.data.FixedBucketSampler(lengths=data_train_lengths,
batch_size=hparams.batch_size,
num_buckets=hparams.num_buckets,
ratio=0.0,
shuffle=True,
use_average_length=True,
num_shards=1,
bucket_scheme=bucket_scheme)
print('Train Batch Sampler:')
print(train_batch_sampler.stats())
val_batch_sampler = nlp.data.FixedBucketSampler(lengths=target_val_lengths,
batch_size=hparams.test_batch_size,
num_buckets=hparams.num_buckets,
ratio=0.0,
shuffle=False,
use_average_length=True,
bucket_scheme=bucket_scheme)
print('Validation Batch Sampler:')
print(val_batch_sampler.stats())
test_batch_sampler = nlp.data.FixedBucketSampler(lengths=target_test_lengths,
batch_size=hparams.test_batch_size,
num_buckets=hparams.num_buckets,
ratio=0.0,
shuffle=False,
use_average_length=True,
bucket_scheme=bucket_scheme)
print('Test Batch Sampler:')
print(test_batch_sampler.stats())
Given the samplers, we can create the DataLoaders, which are iterable.
Note that the data loaders for the validation and test datasets share
the same batchify function ``test_batchify_fn``.
.. code:: python
train_data_loader = nlp.data.ShardedDataLoader(data_train,
batch_sampler=train_batch_sampler,
batchify_fn=train_batchify_fn,
num_workers=8)
print('Length of train_data_loader: %d' % len(train_data_loader))
val_data_loader = gluon.data.DataLoader(data_val,
batch_sampler=val_batch_sampler,
batchify_fn=test_batchify_fn,
num_workers=8)
print('Length of val_data_loader: %d' % len(val_data_loader))
test_data_loader = gluon.data.DataLoader(data_test,
batch_sampler=test_batch_sampler,
batchify_fn=test_batchify_fn,
num_workers=8)
print('Length of test_data_loader: %d' % len(test_data_loader))
Define Transformer Model
~~~~~~~~~~~~~~~~~~~~~~~~
After obtaining the DataLoaders, we define the Transformer model. The
encoder and decoder of the Transformer can be easily obtained by calling
the ``get_transformer_encoder_decoder`` function. Then, we plug the
encoder and decoder into ``NMTModel`` to construct the full Transformer
model. ``model.hybridize`` allows computation to be done using the
symbolic backend. We also use ``label_smoothing``, illustrated briefly
below.
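As a rough illustration of label smoothing (standard formulation; the
exact constants inside ``nmt.loss.LabelSmoothing`` may differ slightly),
a one-hot target is softened so that the true class receives probability
``1 - epsilon`` and the remaining mass is spread over the other classes:
.. code:: python
# Toy example with a vocabulary of size 5 and epsilon = 0.1 (illustrative values).
V, eps = 5, 0.1
true_idx = 1
smoothed = np.full(V, eps / (V - 1))
smoothed[true_idx] = 1 - eps
print(smoothed, smoothed.sum())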
.. code:: python
encoder, decoder = nmt.transformer.get_transformer_encoder_decoder(units=hparams.num_units,
hidden_size=hparams.hidden_size,
dropout=hparams.dropout,
num_layers=hparams.num_layers,
num_heads=hparams.num_heads,
max_src_length=530,
max_tgt_length=549,
scaled=hparams.scaled)
model = nmt.translation.NMTModel(src_vocab=src_vocab, tgt_vocab=tgt_vocab, encoder=encoder, decoder=decoder,
share_embed=True, embed_size=hparams.num_units, tie_weights=True,
embed_initializer=None, prefix='transformer_')
model.initialize(init=mx.init.Xavier(magnitude=3.0), ctx=ctx)
model.hybridize()
print(model)
label_smoothing = nmt.loss.LabelSmoothing(epsilon=hparams.epsilon, units=len(tgt_vocab))
label_smoothing.hybridize()
loss_function = nmt.loss.SoftmaxCEMaskedLoss(sparse_label=False)
loss_function.hybridize()
test_loss_function = nmt.loss.SoftmaxCEMaskedLoss()
test_loss_function.hybridize()
detokenizer = nlp.data.SacreMosesDetokenizer()
Here, we build the translator using beam search.
.. code:: python
translator = nmt.translation.BeamSearchTranslator(model=model,
beam_size=hparams.beam_size,
scorer=nlp.model.BeamSearchScorer(alpha=hparams.lp_alpha,
K=hparams.lp_k),
max_length=200)
print('Use beam_size=%d, alpha=%.2f, K=%d' % (hparams.beam_size, hparams.lp_alpha, hparams.lp_k))
Training Loop
~~~~~~~~~~~~~
Before training, we need to create a trainer for updating the
parameters. In the following example, we create a trainer that uses the
Adam optimizer.
.. code:: python
trainer = gluon.Trainer(model.collect_params(), hparams.optimizer,
{'learning_rate': hparams.lr, 'beta2': 0.98, 'epsilon': 1e-9})
print('Use learning_rate=%.2f'
% (trainer.learning_rate))
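The learning rate is not kept constant: following [1], a warmup schedule
is applied during training (in this tutorial it is handled inside the
training utility via ``hparams.warmup_steps``). The sketch below only
illustrates the shape of the schedule; the exact scaling used by the
utility may differ:
.. code:: python
# Hedged sketch of the Transformer learning-rate schedule from [1].
def transformer_lr(step, d_model=hparams.num_units, warmup=hparams.warmup_steps):
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
print([round(transformer_lr(s), 6) for s in (1, 1000, 4000, 16000)])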
We can then write the training loop. During training, we evaluate on the
validation and test datasets every epoch, and record the parameters that
give the highest BLEU score on the validation dataset. Before performing
the forward and backward passes, we first use the ``as_in_context``
function to copy the mini-batch to the GPU. The statement
``with mx.autograd.record()`` tells the Gluon backend to compute the
gradients for the operations inside the block. To quickly observe the
convergence of the loss, we set ``epochs = 3``. Note that, in order to
obtain the best BLEU score, we would need more epochs and larger warmup
steps, following the original paper; the resulting SOTA numbers are
shown in the first subsection. Besides, we use averaged SGD [2] to
update the parameters, since it is more robust for the machine
translation task.
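Before the full loop, here is a simplified sketch of what a single
training step inside ``utils.train_one_epoch`` roughly does. It is our
illustration only and omits gradient accumulation, the learning-rate
warmup, and parameter averaging:
.. code:: python
# Simplified single training step (illustration only).
def train_step(src_seq, tgt_seq, src_len, tgt_len):
    # Copy the mini-batch to the GPU.
    src_seq, tgt_seq = src_seq.as_in_context(ctx), tgt_seq.as_in_context(ctx)
    src_len, tgt_len = src_len.as_in_context(ctx), tgt_len.as_in_context(ctx)
    with mx.autograd.record():
        # The decoder input drops the last token; the label drops the leading BOS.
        out, _ = model(src_seq, tgt_seq[:, :-1], src_len, tgt_len - 1)
        smoothed_label = label_smoothing(tgt_seq[:, 1:])
        step_loss = loss_function(out, smoothed_label, tgt_len - 1).mean()
        step_loss.backward()
    trainer.step(1)
    # Zero the gradients explicitly, since the loop below sets grad_req to 'add'.
    model.collect_params().zero_grad()
    return step_loss.asscalar()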
.. code:: python
best_valid_loss = float('Inf')
step_num = 0
#We use warmup steps as introduced in [1].
warmup_steps = hparams.warmup_steps
grad_interval = hparams.num_accumulated
model.setattr('grad_req', 'add')
#We use Averaging SGD [2] to update the parameters.
average_start = (len(train_data_loader) // grad_interval) * \
(hparams.epochs - hparams.average_start)
average_param_dict = {k: mx.nd.array([0]) for k, v in
model.collect_params().items()}
update_average_param_dict = True
model.zero_grad()
for epoch_id in range(hparams.epochs):
utils.train_one_epoch(epoch_id, model, train_data_loader, trainer,
label_smoothing, loss_function, grad_interval,
average_param_dict, update_average_param_dict,
step_num, ctx)
mx.nd.waitall()
# The `evaluate` function uses the beam search translator to generate
# outputs for the validation and test datasets.
valid_loss, _ = utils.evaluate(model, val_data_loader,
test_loss_function, translator,
tgt_vocab, detokenizer, ctx)
print('Epoch %d, valid Loss=%.4f, valid ppl=%.4f'
% (epoch_id, valid_loss, np.exp(valid_loss)))
test_loss, _ = utils.evaluate(model, test_data_loader,
test_loss_function, translator,
tgt_vocab, detokenizer, ctx)
print('Epoch %d, test Loss=%.4f, test ppl=%.4f'
% (epoch_id, test_loss, np.exp(test_loss)))
if valid_loss < best_valid_loss:
best_valid_loss = valid_loss
model.save_parameters('{}.{}'.format(hparams.save_dir, 'valid_best.params'))
model.save_parameters('{}.epoch{:d}.params'.format(hparams.save_dir, epoch_id))
mx.nd.save('{}.{}'.format(hparams.save_dir, 'average.params'), average_param_dict)
if hparams.average_start > 0:
for k, v in model.collect_params().items():
v.set_data(average_param_dict[k])
else:
model.load_parameters('{}.{}'.format(hparams.save_dir, 'valid_best.params'), ctx)
valid_loss, _ = utils.evaluate(model, val_data_loader,
test_loss_function, translator,
tgt_vocab, detokenizer, ctx)
print('Best model valid Loss=%.4f, valid ppl=%.4f'
% (valid_loss, np.exp(valid_loss)))
test_loss, _ = utils.evaluate(model, test_data_loader,
test_loss_function, translator,
tgt_vocab, detokenizer, ctx)
print('Best model test Loss=%.4f, test ppl=%.4f'
% (test_loss, np.exp(test_loss)))
Conclusion
----------
- With the Transformer showcase, we demonstrate support for deep neural
networks on seq2seq tasks and achieve SOTA results on the WMT 2014
English-German task.
- The Gluon NLP Toolkit provides high-level APIs that drastically
simplify the development of models for NLP tasks sharing the
encoder-decoder structure.
- Low-level APIs in the toolkit enable easy customization.
Documentation can be found at https://gluon-nlp.mxnet.io/index.html and
the code is at https://github.com/dmlc/gluon-nlp
References
----------
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in
Neural Information Processing Systems. 2017.
[2] Polyak, Boris T., and Anatoli B. Juditsky. "Acceleration of
stochastic approximation by averaging." SIAM Journal on Control and
Optimization. 1992.
.. |transformer| image:: transformer.png