# Chapter 8. Advanced Sequence Modeling for Natural Language Processing

## Capturing More from a Sequence: Bidirectional Recurrent Models

Consider the sentence: "The man who hunts ducks out on the weekends." If a model observes the sequence only from left to right, its representation of "ducks" will differ from that of a model that also observes the words from right to left. Humans perform this kind of retroactive updating of meaning all the time.
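Bidirectional recurrent models capture both directions by running one RNN over the sequence from left to right and another from right to left, and concatenating the two hidden states at each position. Below is a minimal sketch using PyTorch's nn.GRU with bidirectional=True; the tensor names and sizes are illustrative assumptions, not the models built later in this chapter.

import torch
import torch.nn as nn

# Minimal bidirectional GRU sketch (illustrative sizes, not the chapter's model).
embedding_size, hidden_size = 8, 16
bi_gru = nn.GRU(embedding_size, hidden_size, bidirectional=True, batch_first=True)

x = torch.randn(2, 5, embedding_size)    # (batch, seq, embedding_size)
outputs, h_n = bi_gru(x)

# outputs concatenates the forward and backward hidden states at every position:
print(outputs.shape)   # torch.Size([2, 5, 32])  -> hidden_size * 2
# h_n holds the final hidden state of each direction separately:
print(h_n.shape)       # torch.Size([2, 2, 16])  -> (num_directions, batch, hidden_size)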

## Capturing More from a Sequence: Attention

One problem with the S2S formulation introduced in "Sequence-to-Sequence Models, Encoder-Decoder Models, and Conditioned Generation" is that it turns the entire input sentence into a single vector (the "encoding") φ and uses that encoding to generate the output, as shown in Figure 8-7. Although this might work for very short sentences, for long sentences such models fail to capture the information in the entire input; see, for example, Bengio et al. (1994) and Le and Zuidema (2016). This is a limitation of using only the final hidden state as the encoding. Another problem with long inputs is that gradients vanish when backpropagating through time over the full sequence, making training difficult.

For bilingual or multilingual readers who have tried to translate, this encode-first-then-decode process might seem a little strange. As humans, we do not typically distill the meaning of a sentence and then generate a translation from that meaning. For the example in Figure 8-7, when we see "pour" we know there will be a "for"; similarly, "breakfast" is on our mind when we see "petit-déjeuner," and so on. In other words, our mind focuses on the relevant parts of the input while producing output. This phenomenon is called attention. Attention has been widely studied in neuroscience and other allied fields, and it is what makes us quite successful despite having limited memories. Attention happens everywhere. In fact, it is happening right now to you, dear reader. Each. Word. You. Are. Reading. Now. Is. Being. Attended. To. Even if you have an exceptional memory, you're probably not reading this entire book as a string. When you read a word, you pay attention to the neighboring words, possibly the topic of the section and chapter, and so on.
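As a conceptual sketch of this difference, the toy snippet below contrasts the two encoder interfaces: one that hands the decoder only a single compressed vector, and one that also keeps every per-step state so the decoder can attend over them. The names and sizes are assumptions for illustration, not the chapter's model.

import torch
import torch.nn as nn

# Illustrative toy (assumed names/sizes) contrasting the two encoder interfaces.
gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(1, 12, 8)                 # a 12-step "sentence"

all_states, final_state = gru(x)

# S2S without attention: the decoder sees only this single vector (the encoding phi).
phi = final_state.squeeze(0)              # shape (1, 16), regardless of input length

# S2S with attention: the decoder can also look back at every encoder step.
print(all_states.shape)                   # torch.Size([1, 12, 16]) -> one vector per word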

Note

When building new models and trying out new architectures, you should aim for a faster iteration cycle between making modeling choices and evaluating those choices.

### A Vectorization Pipeline for NMT

class NMTVectorizer(object):
    """ The Vectorizer which coordinates the Vocabularies and puts them to use"""
    def __init__(self, source_vocab, target_vocab, max_source_length,
                 max_target_length):
        """
        Args:
            source_vocab (SequenceVocabulary): maps source words to integers
            target_vocab (SequenceVocabulary): maps target words to integers
            max_source_length (int): the longest sequence in the source dataset
            max_target_length (int): the longest sequence in the target dataset
        """
        self.source_vocab = source_vocab
        self.target_vocab = target_vocab

        self.max_source_length = max_source_length
        self.max_target_length = max_target_length

    @classmethod
    def from_dataframe(cls, bitext_df):
        """Instantiate the vectorizer from the dataset dataframe

        Args:
            bitext_df (pandas.DataFrame): the parallel text dataset
        Returns:
            an instance of the NMTVectorizer
        """
        source_vocab = SequenceVocabulary()
        target_vocab = SequenceVocabulary()
        max_source_length, max_target_length = 0, 0

        for _, row in bitext_df.iterrows():
            source_tokens = row["source_language"].split(" ")
            if len(source_tokens) > max_source_length:
                max_source_length = len(source_tokens)
            for token in source_tokens:
                source_vocab.add_token(token)

            target_tokens = row["target_language"].split(" ")
            if len(target_tokens) > max_target_length:
                max_target_length = len(target_tokens)
            for token in target_tokens:
                target_vocab.add_token(token)

        return cls(source_vocab, target_vocab, max_source_length,
                   max_target_length)


class NMTVectorizer(object):
    """ The Vectorizer which coordinates the Vocabularies and puts them to use"""
    # ... __init__ and from_dataframe are shown above ...

    def _vectorize(self, indices, vector_length=-1):
        """Vectorize the provided indices

        Args:
            indices (list): a list of integers that represent a sequence
            vector_length (int): an argument for forcing the length of index vector
        """
        if vector_length < 0:
            vector_length = len(indices)
        vector = np.zeros(vector_length, dtype=np.int64)
        vector[:len(indices)] = indices
        return vector

    def _get_source_indices(self, text):
        """Return the vectorized source text

        Args:
            text (str): the source text; tokens should be separated by spaces
        Returns:
            indices (list): list of integers representing the text
        """
        indices = [self.source_vocab.begin_seq_index]
        indices.extend(self.source_vocab.lookup_token(token)
                       for token in text.split(" "))
        indices.append(self.source_vocab.end_seq_index)
        return indices

    def _get_target_indices(self, text):
        """Return the vectorized target text

        Args:
            text (str): the target text; tokens should be separated by spaces
        Returns:
            a tuple: (x_indices, y_indices)
                x_indices (list): list of ints; observations in target decoder
                y_indices (list): list of ints; predictions in target decoder
        """
        indices = [self.target_vocab.lookup_token(token)
                   for token in text.split(" ")]
        x_indices = [self.target_vocab.begin_seq_index] + indices
        y_indices = indices + [self.target_vocab.end_seq_index]
        return x_indices, y_indices

    def vectorize(self, source_text, target_text, use_dataset_max_lengths=True):
        """Return the vectorized source and target text

        Args:
            source_text (str): text from the source language
            target_text (str): text from the target language
            use_dataset_max_lengths (bool): whether to use the max vector lengths
        Returns:
            The vectorized data point as a dictionary with the keys:
                source_vector, target_x_vector, target_y_vector, source_length
        """
        source_vector_length = -1
        target_vector_length = -1

        if use_dataset_max_lengths:
            source_vector_length = self.max_source_length + 2
            target_vector_length = self.max_target_length + 1

        source_indices = self._get_source_indices(source_text)
        source_vector = self._vectorize(source_indices,
                                        vector_length=source_vector_length)

        target_x_indices, target_y_indices = self._get_target_indices(target_text)
        target_x_vector = self._vectorize(target_x_indices,
                                          vector_length=target_vector_length)
        target_y_vector = self._vectorize(target_y_indices,
                                          vector_length=target_vector_length)
        return {"source_vector": source_vector,
                "target_x_vector": target_x_vector,
                "target_y_vector": target_y_vector,
                "source_length": len(source_indices)}


Example 8-3. Generating minibatches for the NMT example

def generate_nmt_batches(dataset, batch_size, shuffle=True,
                         drop_last=True, device="cpu"):
    """A generator function which wraps the PyTorch DataLoader; NMT version"""
    dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
                            shuffle=shuffle, drop_last=drop_last)

    for data_dict in dataloader:
        lengths = data_dict['x_source_length'].numpy()
        sorted_length_indices = lengths.argsort()[::-1].tolist()

        out_data_dict = {}
        for name, tensor in data_dict.items():
            out_data_dict[name] = data_dict[name][sorted_length_indices].to(device)
        yield out_data_dict
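Each batch is sorted by descending source length because pack_padded_sequence, used in the encoder below, expects sequences ordered from longest to shortest. A minimal, hedged usage sketch follows; the dataset object and the key names ('x_source', 'x_source_length', and so on) are assumptions carried over from the dataset's __getitem__, not shown here.

# Illustrative usage (assumes `dataset` yields dicts keyed by 'x_source',
# 'x_target', 'y_target', and 'x_source_length').
batch_generator = generate_nmt_batches(dataset, batch_size=32, device="cpu")

batch = next(batch_generator)
print(batch['x_source'].shape)          # (32, max_source_length + 2)
print(batch['x_source_length'][:5])     # lengths are sorted in descending order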


### Encoding and Decoding in the NMT Model

Example 8-4. The NMTModel encapsulates and coordinates the encoder and decoder in a single forward method.

class NMTModel(nn.Module):
    """ A Neural Machine Translation Model """
    def __init__(self, source_vocab_size, source_embedding_size,
                 target_vocab_size, target_embedding_size, encoding_size,
                 target_bos_index):
        """
        Args:
            source_vocab_size (int): number of unique words in source language
            source_embedding_size (int): size of the source embedding vectors
            target_vocab_size (int): number of unique words in target language
            target_embedding_size (int): size of the target embedding vectors
            encoding_size (int): size of the encoder RNN hidden state
            target_bos_index (int): index for BEGIN-OF-SEQUENCE token
        """
        super(NMTModel, self).__init__()
        self.encoder = NMTEncoder(num_embeddings=source_vocab_size,
                                  embedding_size=source_embedding_size,
                                  rnn_hidden_size=encoding_size)
        decoding_size = encoding_size * 2
        self.decoder = NMTDecoder(num_embeddings=target_vocab_size,
                                  embedding_size=target_embedding_size,
                                  rnn_hidden_size=decoding_size,
                                  bos_index=target_bos_index)

    def forward(self, x_source, x_source_lengths, target_sequence):
        """The forward pass of the model

        Args:
            x_source (torch.Tensor): the source text data tensor;
                x_source.shape should be (batch, vectorizer.max_source_length)
            x_source_lengths (torch.Tensor): the lengths of the sequences in x_source
            target_sequence (torch.Tensor): the target text data tensor
        Returns:
            decoded_states (torch.Tensor): prediction vectors at each output step
        """
        encoder_state, final_hidden_states = self.encoder(x_source,
                                                          x_source_lengths)
        decoded_states = self.decoder(encoder_state=encoder_state,
                                      initial_hidden_state=final_hidden_states,
                                      target_sequence=target_sequence)
        return decoded_states
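Once the NMTEncoder and NMTDecoder defined below are available, the model can be exercised end to end. The instantiation sketch here uses arbitrary toy sizes (assumptions, not the book's hyperparameters) to show the shapes that flow through forward.

# Toy instantiation (arbitrary sizes and BOS index chosen for illustration).
model = NMTModel(source_vocab_size=500, source_embedding_size=64,
                 target_vocab_size=600, target_embedding_size=64,
                 encoding_size=64, target_bos_index=2)

x_source = torch.randint(0, 500, (4, 25))            # (batch, max_source_length)
x_source_lengths = torch.tensor([25, 20, 18, 10])     # sorted, longest first
target_sequence = torch.randint(0, 600, (4, 30))      # (batch, max_target_length)

decoded_states = model(x_source, x_source_lengths, target_sequence)
print(decoded_states.shape)   # (batch, max_target_length, target_vocab_size)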


THE ENCODER

Example 8-5. The encoder embeds the source words and extracts features with a bi-GRU

from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class NMTEncoder(nn.Module):
    def __init__(self, num_embeddings, embedding_size, rnn_hidden_size):
        """
        Args:
            num_embeddings (int): size of source vocabulary
            embedding_size (int): size of the embedding vectors
            rnn_hidden_size (int): size of the RNN hidden state vectors
        """
        super(NMTEncoder, self).__init__()

        self.source_embedding = nn.Embedding(num_embeddings, embedding_size,
                                             padding_idx=0)
        self.birnn = nn.GRU(embedding_size, rnn_hidden_size, bidirectional=True,
                            batch_first=True)

    def forward(self, x_source, x_lengths):
        """The forward pass of the model

        Args:
            x_source (torch.Tensor): the input data tensor;
                x_source.shape is (batch, seq_size)
            x_lengths (torch.Tensor): vector of lengths for each item in the batch
        Returns:
            a tuple: x_unpacked (torch.Tensor), x_birnn_h (torch.Tensor)
                x_unpacked.shape = (batch, seq_size, rnn_hidden_size * 2)
                x_birnn_h.shape = (batch, rnn_hidden_size * 2)
        """
        x_embedded = self.source_embedding(x_source)
        # create PackedSequence; x_packed.data.shape=(number_items, embedding_size)
        x_lengths = x_lengths.detach().cpu().numpy()
        x_packed = pack_padded_sequence(x_embedded, x_lengths, batch_first=True)

        # x_birnn_h.shape = (num_rnn, batch_size, feature_size)
        x_birnn_out, x_birnn_h = self.birnn(x_packed)
        # permute to (batch_size, num_rnn, feature_size)
        x_birnn_h = x_birnn_h.permute(1, 0, 2)

        # flatten features; reshape to (batch_size, num_rnn * feature_size)
        #  (recall: -1 takes the remaining positions,
        #           flattening the two RNN hidden vectors into 1)
        x_birnn_h = x_birnn_h.contiguous().view(x_birnn_h.size(0), -1)

        x_unpacked, _ = pad_packed_sequence(x_birnn_out, batch_first=True)

        return x_unpacked, x_birnn_h


The following cells demonstrate packing padded sequences into a PackedSequence and unpacking them again (describe is the helper function from Chapter 1).

Input[0]
abcd_padded = torch.tensor([1, 2, 3, 4], dtype=torch.float32)
efg_padded = torch.tensor([5, 6, 7, 0], dtype=torch.float32)
h_padded = torch.tensor([8, 0, 0, 0], dtype=torch.float32)

padded_tensor = torch.stack([abcd_padded, efg_padded, h_padded])

describe(padded_tensor)
Output[0]
Type: torch.FloatTensor
Shape/size: torch.Size([3, 4])
Values:
tensor([[ 1.,  2.,  3.,  4.],
[ 5.,  6.,  7.,  0.],
[ 8.,  0.,  0.,  0.]])
Input[1]
lengths = [4, 3, 1]
packed_tensor = pack_padded_sequence(padded_tensor, lengths,
                                     batch_first=True)
packed_tensor
Output[1]
PackedSequence(data=tensor([ 1.,  5.,  8.,  2.,  6.,  3.,  7.,  4.]),
batch_sizes=tensor([ 3,  2,  2,  1]))
Input[2]
unpacked_tensor, unpacked_lengths = \
    pad_packed_sequence(packed_tensor, batch_first=True)

describe(unpacked_tensor)
describe(unpacked_lengths)
Output[2]
Type: torch.FloatTensor
Shape/size: torch.Size([3, 4])
Values:
tensor([[ 1.,  2.,  3.,  4.],
[ 5.,  6.,  7.,  0.],
[ 8.,  0.,  0.,  0.]])
Type: torch.LongTensor
Shape/size: torch.Size([3])
Values:
tensor([ 4,  3,  1])


Example 8-7. The NMT Decoder constructs a target sentence from the encoded source sentence

class NMTDecoder(nn.Module):
    def __init__(self, num_embeddings, embedding_size, rnn_hidden_size, bos_index):
        """
        Args:
            num_embeddings (int): number of embeddings, which is also the number
                of unique words in the target vocabulary
            embedding_size (int): the embedding vector size
            rnn_hidden_size (int): size of the hidden RNN state
            bos_index (int): begin-of-sequence index
        """
        super(NMTDecoder, self).__init__()
        self._rnn_hidden_size = rnn_hidden_size
        self.target_embedding = nn.Embedding(num_embeddings=num_embeddings,
                                             embedding_dim=embedding_size,
                                             padding_idx=0)
        self.gru_cell = nn.GRUCell(embedding_size + rnn_hidden_size,
                                   rnn_hidden_size)
        self.hidden_map = nn.Linear(rnn_hidden_size, rnn_hidden_size)
        self.classifier = nn.Linear(rnn_hidden_size * 2, num_embeddings)
        self.bos_index = bos_index

    def _init_indices(self, batch_size):
        """ return the BEGIN-OF-SEQUENCE index vector """
        return torch.ones(batch_size, dtype=torch.int64) * self.bos_index

    def _init_context_vectors(self, batch_size):
        """ return a zeros vector for initializing the context """
        return torch.zeros(batch_size, self._rnn_hidden_size)

    def forward(self, encoder_state, initial_hidden_state, target_sequence):
        """The forward pass of the model

        Args:
            encoder_state (torch.Tensor): the output of the NMTEncoder
            initial_hidden_state (torch.Tensor): the last hidden state of the NMTEncoder
            target_sequence (torch.Tensor): the target text data tensor
        Returns:
            output_vectors (torch.Tensor): prediction vectors at each output step
        """
        # We are making an assumption here: the batch dimension comes first.
        # The input is (Batch, Seq)
        # We want to iterate over the sequence, so we permute it to (Seq, Batch)
        target_sequence = target_sequence.permute(1, 0)

        # use the provided encoder hidden state as the initial hidden state
        h_t = self.hidden_map(initial_hidden_state)

        batch_size = encoder_state.size(0)
        # initialize context vectors to zeros
        context_vectors = self._init_context_vectors(batch_size)
        # initialize first y_t word as BOS
        y_t_index = self._init_indices(batch_size)

        h_t = h_t.to(encoder_state.device)
        y_t_index = y_t_index.to(encoder_state.device)
        context_vectors = context_vectors.to(encoder_state.device)

        output_vectors = []
        # All cached tensors are moved from the GPU and stored for analysis
        self._cached_p_attn = []
        self._cached_ht = []
        self._cached_decoder_state = encoder_state.cpu().detach().numpy()

        output_sequence_size = target_sequence.size(0)
        for i in range(output_sequence_size):

            # Step 1: Embed word and concat with previous context
            y_input_vector = self.target_embedding(target_sequence[i])
            rnn_input = torch.cat([y_input_vector, context_vectors], dim=1)

            # Step 2: Make a GRU step, getting a new hidden vector
            h_t = self.gru_cell(rnn_input, h_t)
            self._cached_ht.append(h_t.cpu().data.numpy())

            # Step 3: Use the current hidden state to attend to the encoder state
            context_vectors, p_attn, _ = \
                verbose_attention(encoder_state_vectors=encoder_state,
                                  query_vector=h_t)

            # auxiliary: cache the attention probabilities for visualization
            self._cached_p_attn.append(p_attn.cpu().detach().numpy())

            # Step 4: Use the current hidden and context vectors
            #         to make a prediction about the next word
            prediction_vector = torch.cat((context_vectors, h_t), dim=1)
            score_for_y_t_index = self.classifier(prediction_vector)

            # auxiliary: collect the prediction scores
            output_vectors.append(score_for_y_t_index)

        output_vectors = torch.stack(output_vectors).permute(1, 0, 2)

        return output_vectors


A CLOSER LOOK AT ATTENTION

It is important to understand how the attention mechanism works in this example. Recall from the previous section that an attention mechanism can be described in terms of queries, keys, and values. A score function takes the query vector and the key vectors as input to compute a set of weights used to select among the value vectors. In this example we use a dot-product score function, but it is not the only one. Here, the decoder's hidden state serves as the query vector, and the set of encoder state vectors serves as both the keys and the values.

Example 8-8. Attention that does element-wise multiplication and summing more explicitly

def verbose_attention(encoder_state_vectors, query_vector):
    """
    encoder_state_vectors: 3dim tensor from bi-GRU in encoder
    query_vector: hidden state in decoder GRU
    """
    batch_size, num_vectors, vector_size = encoder_state_vectors.size()
    vector_scores = \
        torch.sum(encoder_state_vectors * query_vector.view(batch_size, 1,
                                                            vector_size),
                  dim=2)
    vector_probabilities = F.softmax(vector_scores, dim=1)
    weighted_vectors = \
        encoder_state_vectors * vector_probabilities.view(batch_size,
                                                          num_vectors, 1)
    context_vectors = torch.sum(weighted_vectors, dim=1)
    return context_vectors, vector_probabilities, vector_scores


def terse_attention(encoder_state_vectors, query_vector):
    """
    encoder_state_vectors: 3dim tensor from bi-GRU in encoder
    query_vector: hidden state
    """
    vector_scores = torch.matmul(encoder_state_vectors,
                                 query_vector.unsqueeze(dim=2)).squeeze()
    vector_probabilities = F.softmax(vector_scores, dim=-1)
    context_vectors = torch.matmul(encoder_state_vectors.transpose(-2, -1),
                                   vector_probabilities.unsqueeze(dim=2)).squeeze()
    return context_vectors, vector_probabilities
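As a quick, hedged sanity check on random data (illustrative sizes, not from the chapter's model), the two implementations should agree on both the context vectors and the attention probabilities:

import torch
import torch.nn.functional as F

# Sanity check on random tensors (illustrative sizes).
torch.manual_seed(0)
encoder_states = torch.randn(2, 7, 16)   # (batch, num_encoder_steps, rnn_hidden_size * 2)
query = torch.randn(2, 16)               # decoder hidden state

ctx_v, p_v, _ = verbose_attention(encoder_states, query)
ctx_t, p_t = terse_attention(encoder_states, query)

print(torch.allclose(ctx_v, ctx_t, atol=1e-6))   # True
print(torch.allclose(p_v, p_t, atol=1e-6))       # True
print(p_v.sum(dim=1))                            # each row sums to 1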


LEARNING TO SEARCH AND SCHEDULED SAMPLING

The decoder above is trained with teacher forcing: at every step it conditions on the ground-truth previous word. Scheduled sampling instead sometimes feeds the model's own prediction back in as the next input, so training more closely resembles the test-time generation procedure; see Daumé, Langford, and Marcu (2009) and Bengio et al. (2015).

Example 8-9. The decoder with a sampling procedure built into the forward pass (the added lines are marked with "new:" comments)

class NMTDecoder(nn.Module):
    def __init__(self, num_embeddings, embedding_size, rnn_hidden_size, bos_index):
        super(NMTDecoder, self).__init__()
        # ... other init code here ...

        # arbitrarily set; any small constant will be fine
        self._sampling_temperature = 3

    def forward(self, encoder_state, initial_hidden_state, target_sequence,
                sample_probability=0.0):
        if target_sequence is None:
            sample_probability = 1.0
        else:
            # We are making an assumption here: the batch dimension comes first.
            # The input is (Batch, Seq)
            # We want to iterate over the sequence, so we permute it to (Seq, Batch)
            target_sequence = target_sequence.permute(1, 0)
            output_sequence_size = target_sequence.size(0)

        # ... nothing changes from the other implementation

        for i in range(output_sequence_size):
            # new: a helper boolean and the teacher y_t_index
            use_sample = np.random.random() < sample_probability
            if not use_sample:
                y_t_index = target_sequence[i]

            # Step 1: Embed word and concat with previous context
            # ... code omitted for space
            # Step 2: Make a GRU step, getting a new hidden vector
            # ... code omitted for space
            # Step 3: Use the current hidden state to attend to the encoder state
            # ... code omitted for space
            # Step 4: Use the current hidden and context vectors
            #         to make a prediction about the next word
            prediction_vector = torch.cat((context_vectors, h_t), dim=1)
            score_for_y_t_index = self.classifier(prediction_vector)

            # new: sampling if the boolean is true
            if use_sample:
                # sampling temperature forces a peakier distribution
                p_y_t_index = F.softmax(score_for_y_t_index *
                                        self._sampling_temperature, dim=1)
                # method 1: choose most likely word
                # _, y_t_index = torch.max(p_y_t_index, 1)
                # method 2: sample from the distribution
                y_t_index = torch.multinomial(p_y_t_index, 1).squeeze()

            # auxiliary: collect the prediction scores
            output_vectors.append(score_for_y_t_index)

        output_vectors = torch.stack(output_vectors).permute(1, 0, 2)

        return output_vectors
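During training, sample_probability is typically annealed from 0 toward 1 so the decoder gradually relies more on its own predictions. Below is a hedged sketch of one such schedule; the simple linear ramp is an assumption chosen for illustration, not necessarily the exact recipe used in the book's training routine.

# Illustrative linear schedule for scheduled sampling (an assumption, not the
# book's exact recipe): start mostly teacher-forced, end mostly sampling.
num_epochs = 100
for epoch_index in range(num_epochs):
    sample_probability = epoch_index / num_epochs
    # inside the training loop (assumes the NMTModel forwards sample_probability
    # through to its decoder):
    # y_pred = model(batch['x_source'], batch['x_source_length'],
    #                batch['x_target'], sample_probability=sample_probability)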


## References

1. Yoshua Bengio, Patrice Simard, and Paolo Frasconi. (1994). "Learning Long-Term Dependencies with Gradient Descent Is Difficult." IEEE Transactions on Neural Networks.

2. Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. (2002). "BLEU: A Method for Automatic Evaluation of Machine Translation." In Proceedings of ACL.

3. Hal Daumé III, John Langford, and Daniel Marcu. (2009). "Search-Based Structured Prediction." Machine Learning Journal.

4. Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. (2015). "Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks." In Proceedings of NIPS.

5. Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. (2015). "Effective Approaches to Attention-Based Neural Machine Translation." In Proceedings of EMNLP.

6. Phong Le and Willem Zuidema. (2016). "Quantifying the Vanishing Gradient and Long Distance Dependency Problem in Recursive Neural Networks and Recursive LSTMs." In Proceedings of the 1st Workshop on Representation Learning for NLP.

7. Philipp Koehn and Rebecca Knowles. (2017). "Six Challenges for Neural Machine Translation." In Proceedings of the First Workshop on Neural Machine Translation.

8. Graham Neubig. (2017). "Neural Machine Translation and Sequence-to-Sequence Models: A Tutorial." arXiv:1703.01619.

9. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. (2017). "Attention Is All You Need." In Proceedings of NIPS.

## Notes

1. In this chapter, we reserve the symbol ϕ for encodings.

2. This is not possible for streaming applications, but a large number of practical NLP applications happen in a batch (non-streaming) context anyway.

3. Sentences like the one in this example are called "garden path sentences." Such sentences are more common than one would imagine; newspaper headlines, for example, use such constructs regularly. See https://en.wikipedia.org/wiki/Garden_path_sentence.

4. Consider the two meanings of "duck": (i) the bird (noun, quack quack) and (ii) to evade (verb).

5. The terminology of keys, values, and queries can be quite confusing for the beginner, but we introduce it here anyway because it has become a standard. It is worth reading this section (8.3.1) a few times until these concepts become clear. The "key, value, query" terminology comes from attention initially being thought of as a search task. For an extended review of these concepts and attention in general, visit https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html.

6. So much so that the original 2002 paper that proposed BLEU received a "test of time" award in 2018.

7. For an example, see https://github.com/nltk/nltk/blob/develop/nltk/translate/bleu_score.py. SacreBLEU is the standard when it comes to machine translation evaluation.

8. The dataset was retrieved from http://www.manythings.org/anki/.

9. We also include cases in which these subject-verb pairs are contractions, such as "I'm," "we're," and "he's."

10. This simply means that the model will be able to see the entire dataset 10 times faster. It doesn't exactly follow that convergence will happen in one-tenth the time, because the model might need to see the dataset for a different number of epochs, or some other confounding factor might intervene.

11. Sorting the sequences by length takes advantage of a low-level CUDA primitive for RNNs.

12. You should try to convince yourself of this by either visualizing the computations or drawing them out. As a hint, consider a single recurrent step: the input and the last hidden state are weighted and added together with the bias. If the input is all zeros, what effect does the bias have on the output?

13. We use the describe function shown in Section 1.4.

14. Moving from left to right along the sequence dimension, any position past the known length of the sequence is assumed to be masked.

15. The Vectorizer prepends the BEGIN-OF-SEQUENCE token to the sequence, so the first observation is always a special token indicating the boundary.

16. See Section 7.3 of Graham Neubig's tutorial (Neubig, 2017) for a discussion of connecting encoders and decoders in neural machine translation.

17. We refer you to Luong, Pham, and Manning (2015), in which they outline three different scoring functions.

18. Each batch item is a sequence, and the probabilities for each sequence sum to 1.

19. Broadcasting happens when a tensor has a dimension of size 1. Call this tensor Tensor A. When Tensor A is used in an element-wise mathematical operation (such as addition or subtraction) with another tensor, Tensor B, their shapes (the number of elements along each dimension) should be identical except for the dimension of size 1. The operation of Tensor A on Tensor B is repeated for each position in Tensor B. If Tensor A has shape (10, 1, 10) and Tensor B has shape (10, 5, 10), A + B repeats the addition of Tensor A for each of the five positions in Tensor B.

20. We refer you to two papers on this topic: "Search-Based Structured Prediction" by Daumé III, Langford, and Marcu, and "Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks" by Bengio, Vinyals, Jaitly, and Shazeer (2015).

21. If you're familiar with Monte Carlo sampling for optimization techniques such as Markov Chain Monte Carlo, you will recognize this pattern.

22. Primarily, this is because gradient descent and automatic differentiation provide an elegant abstraction between model definitions and their optimization.

23. https://github.com/joosthub/nlpbook/chapters/chapter_8/example_8_5

24. We omit a plot for the first model because it attended only to the final state in the encoder RNN. As noted by Koehn and Knowles (2017), attention weights can reflect many different situations. We suspect the first model did not need to rely on attention as much because the information it needed was already encoded in the states of the encoder GRU.