HuggingFace BERT Series - 1: MLM Experiment Check | Minskiter's Blog

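# Preface

When running MLM experiments, I kept being unsure what values the attention mask (`attention_mask`) should take in BERT. In my earlier experiments, whenever a token was replaced by [MASK], I also set its attention mask to 0 so that attention would ignore it. On reflection this seems wrong: if attention ignores the position, what is the difference between [MASK] and [PAD]?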
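To make "ignored by attention" concrete, here is a minimal sketch, assuming the common implementation in which key positions whose mask is 0 receive a large negative bias before the softmax. The function name `masked_softmax` and the scores are made up for illustration and are not taken from the library.

```python
import math

def masked_softmax(scores, attention_mask):
    """Softmax over key positions; keys with mask 0 get a large negative bias."""
    # Assumption: masking is additive, as in typical attention implementations.
    biased = [s if m == 1 else s - 1e9 for s, m in zip(scores, attention_mask)]
    peak = max(biased)  # subtract the maximum for numerical stability
    exps = [math.exp(b - peak) for b in biased]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw attention scores of one query over four key positions.
scores = [2.0, 1.0, 0.5, 1.5]
full = masked_softmax(scores, [1, 1, 1, 1])
hidden = masked_softmax(scores, [1, 1, 0, 1])  # third key hidden, as for [PAD]

print([round(w, 3) for w in full])
print([round(w, 3) for w in hidden])
```

In this sketch the hidden key ends up with weight 0 and the remaining weights are renormalised, so no query can read from that position. That is the treatment a [PAD] token gets, which is why applying it to [MASK] is questionable.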

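# Difference test

The check is simple. Pretrained BERT models have generally been trained on both the NSP and MLM tasks, so we only need to build an ordinary sentence, mask part of it, and ask the model to restore it:

Test 1:

Test sentence:

北京欢迎你！ (Beijing welcomes you!)

MLM task:

北京[MASK][MASK]你

Test code: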

	# load model
	from transformers import BertLMHeadModel, BertTokenizer

	model_path = "hfl/chinese-bert-wwm-ext"

	model = BertLMHeadModel.from_pretrained(
		model_path
	)
	tokenizer = BertTokenizer.from_pretrained(model_path)

	# predict
	test_text = "北京[MASK][MASK]你!"

	# preprocess
	inputs = tokenizer(test_text, return_tensors="pt")
	# inputs["attention_mask"][0,3:5] = 0
	print(inputs)

	output = model(**inputs)

	output_ids = output.logits.argmax(dim=-1)
	print(output_ids.tolist())
	print(tokenizer.convert_ids_to_tokens(output_ids.tolist()[0]))

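Because chinese-bert-wwm-ext was pretrained with whole-word masking, it should assign a high probability to the word 欢迎 ("welcome") at the two masked positions. The code above is the baseline, with the attention mask left at all ones. Running it gives: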

{'input_ids': tensor([[ 101, 1266,  776,  103,  103,  872,  106,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

[[4638, 1266, 776, 3614, 6816, 872, 106, 776]]

['的', '北', '京', '欢', '迎', '你', '!', '京']

The comparison run uncomments line 16 of the listing, `inputs["attention_mask"][0,3:5] = 0`, which sets the attention mask at the two positions of 欢迎 to 0 (ignored).

The result is:

{'input_ids': tensor([[ 101, 1266,  776,  103,  103,  872,  106,  102]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 0, 0, 1, 1, 1]])}

[[4638, 1266, 776, 3300, 3300, 872, 106, 776]]

['的', '北', '京', '有', '有', '你', '!', '京']

As you can see, the word 欢迎 is no longer predicted correctly: both masked positions come out as 有.
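The slice `3:5` in the comparison run is hard-coded. As a small sketch of where it comes from, the positions can be derived from the printed `input_ids`, in which the two [MASK] tokens appear as id 103. The helper names `mask_positions` and `ignore_positions` are hypothetical and work on plain Python lists rather than tensors.

```python
MASK_ID = 103  # id of [MASK] as it appears in the printed input_ids

def mask_positions(input_ids, mask_id=MASK_ID):
    """Return the indices at which the [MASK] id occurs."""
    return [i for i, token_id in enumerate(input_ids) if token_id == mask_id]

def ignore_positions(attention_mask, positions):
    """Return a copy of attention_mask with the given positions set to 0."""
    result = list(attention_mask)
    for i in positions:
        result[i] = 0
    return result

input_ids = [101, 1266, 776, 103, 103, 872, 106, 102]
positions = mask_positions(input_ids)
print(positions)                             # [3, 4]
print(ignore_positions([1] * 8, positions))  # [1, 1, 1, 0, 0, 1, 1, 1]
```

The second printed list matches the `attention_mask` of the comparison run, that is, the variant that failed to recover 欢迎.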

# Conclusion

For the MLM task (at least for models of the chinese-bert-wwm-ext family), the attention mask at [MASK] positions should be set to 1, not to 0 (ignored).
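A minimal sketch of what this means when preparing an MLM example by hand: [MASK] positions keep attention mask 1, and only padding gets 0. The function `build_mlm_example` is hypothetical, `PAD_ID = 0` is an assumption that should be checked against `tokenizer.pad_token_id`, and always replacing the chosen tokens with [MASK] is a simplification of real MLM preprocessing.

```python
IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss skips by default
MASK_ID = 103        # [MASK] id seen in the tokenizer output of the experiment
PAD_ID = 0           # assumed [PAD] id; verify with tokenizer.pad_token_id

def build_mlm_example(token_ids, positions_to_mask, max_len):
    """Mask the given positions and pad to max_len (plain lists, no tensors)."""
    input_ids = list(token_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i in positions_to_mask:
        labels[i] = input_ids[i]  # the original token is the prediction target
        input_ids[i] = MASK_ID
    attention_mask = [1] * len(input_ids)  # [MASK] positions stay visible
    pad = max(0, max_len - len(input_ids))
    input_ids += [PAD_ID] * pad
    attention_mask += [0] * pad  # only padding is hidden from attention
    labels += [IGNORE_INDEX] * pad
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Ids for 北京欢迎你! as they appear in the baseline prediction of the experiment.
original = [101, 1266, 776, 3614, 6816, 872, 106, 102]
example = build_mlm_example(original, [3, 4], max_len=10)
for key, value in example.items():
    print(key, value)
```

Here the two masked positions carry labels 3614 and 6816 (欢 and 迎) while every other position, including the padding, is labelled -100 and therefore does not contribute to the loss.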

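# Erratum (updated 2023-9-14)

While reading the actual source code, I found that in HuggingFace's `BertLMHeadModel` the labels are shifted by one position (that is, the model is trained to predict the next token, the way GPT is):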

	lm_loss = None
	if labels is not None:
		# we are doing next-token prediction; shift prediction scores and input ids by one
		shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()
		labels = labels[:, 1:].contiguous()
		loss_fct = CrossEntropyLoss()
		lm_loss = loss_fct(shifted_prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))

The model that is actually meant for MLM is `BertForMaskedLM`; its loss is the standard, unshifted cross-entropy:

	masked_lm_loss = None
	if labels is not None:
		loss_fct = CrossEntropyLoss()  # -100 index = padding token
		masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))

However, the experiment only ran prediction and never used the loss, so this correction does not change the experimental results.
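To see what the one-position shift changes, the sketch below lists which output position is scored against which label under each of the two loss computations. It uses plain lists and hypothetical helper names; it mirrors the slicing in the two excerpts (`prediction_scores[:, :-1]` against `labels[:, 1:]` versus no slicing) and does not call the library.

```python
IGNORE_INDEX = -100  # targets with this value are skipped by CrossEntropyLoss

def causal_lm_pairs(labels):
    """Pair output position i with labels[i + 1], as the shifted loss does."""
    pairs = [(i, labels[i + 1]) for i in range(len(labels) - 1)]
    return [(i, t) for i, t in pairs if t != IGNORE_INDEX]

def masked_lm_pairs(labels):
    """Pair output position i with labels[i], as the unshifted loss does."""
    return [(i, t) for i, t in enumerate(labels) if t != IGNORE_INDEX]

# Labels for 北京[MASK][MASK]你!: only the two masked positions have targets.
labels = [-100, -100, -100, 3614, 6816, -100, -100, -100]
print(masked_lm_pairs(labels))  # [(3, 3614), (4, 6816)]
print(causal_lm_pairs(labels))  # [(2, 3614), (3, 6816)]
```

With the shifted loss, the output at position 2 (京) would be trained to produce 欢, which is next-token prediction rather than filling in the [MASK] at position 3. This only matters when `labels` are passed, which the experiment did not do.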
