Transformers Tutorial

Pipeline

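How a Pipeline Works

[Figure: full_nlp_pipeline]

Preprocess the text into a format the model can understand;
Feed the preprocessed text to the model;
Post-process the model's predictions into a human-readable format.

Text Preprocessing

Split the input into words, subwords, or symbols (such as punctuation), collectively called tokens;
Map each token to its token ID (an integer) using the model's vocabulary;
Add any extra inputs the model requires.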
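The three preprocessing steps above can be sketched in plain Python. This is only a toy illustration: the vocabulary `VOCAB`, the regex-based splitting, and the function names are assumptions made for the example, whereas a real tokenizer uses a learned subword vocabulary.

```python
import re

# Hypothetical toy vocabulary (an assumption for illustration only).
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "i": 4, "hate": 5, "this": 6, "so": 7, "much": 8, "!": 9}

def tokenize(text):
    """Step 1: lowercase, then split into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def convert_tokens_to_ids(tokens):
    """Step 2: look each token up in the vocabulary, falling back to [UNK]."""
    return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in tokens]

def encode(text):
    """Step 3: add the extra inputs (special tokens and an attention mask)."""
    ids = [VOCAB["[CLS]"]] + convert_tokens_to_ids(tokenize(text)) + [VOCAB["[SEP]"]]
    return {"input_ids": ids, "attention_mask": [1] * len(ids)}

encoded = encode("I hate this so much!")
print(encoded)
```

The dictionary returned by `encode` mirrors the two fields a real tokenizer returns for this kind of model: the token IDs and a mask of the same length.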

1. Use the tokenizer to split the text into tokens and convert them to token IDs

from transformers import AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

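The output contains two tensors: input_ids holds the token IDs, and attention_mask marks which positions are real tokens (1) and which are padding (0).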

{
    'input_ids': tensor([
        [  101,  1045,  1005,  2310,  2042,  3403,  2005,  1037, 17662, 12172,  2607,  2026,  2878,  2166,  1012,   102],
        [  101,  1045,  5223,  2023,  2061,  2172,   999,   102,     0,     0,     0,     0,     0,     0,     0,     0]
    ]),
    'attention_mask': tensor([
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
    ])
}

Model Prediction

Load the model from the checkpoint, then run it on the inputs to get its outputs.

from transformers import AutoModel, AutoTokenizer

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModel.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs)
print(outputs)

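The output, printed further down, has three attributes: last_hidden_state, hidden_states, and attentions.

last_hidden_state

The output of the last layer of the Transformer model (usually the last encoder or decoder layer).

Shape: (batch_size, sequence_length, hidden_size)

hidden_states

The hidden states of every layer, including all intermediate layer outputs.

Shape: tuple(torch.FloatTensor), each tensor of shape (batch_size, sequence_length, hidden_size)

If the model has L layers, hidden_states contains L + 1 tensors:

The first is the output of the embedding layer.

The remaining L are the outputs of each layer, in order.

Requires output_hidden_states=True when calling the model.

attentions

The attention weights of every layer.

Shape: tuple(torch.FloatTensor), each tensor of shape (batch_size, num_heads, sequence_length, sequence_length)

These weights show how strongly each position in the sequence depends on every other position.

Requires output_attentions=True when calling the model.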

BaseModelOutput(last_hidden_state=tensor([[[-0.1798,  0.2333,  0.6321,  ..., -0.3017,  0.5008,  0.1481],
         [ 0.2758,  0.6497,  0.3200,  ..., -0.0760,  0.5136,  0.1329],
         [ 0.9046,  0.0985,  0.2950,  ...,  0.3352, -0.1407, -0.6464],
         ...,
         [ 0.1466,  0.5661,  0.3235,  ..., -0.3376,  0.5100, -0.0561],
         [ 0.7500,  0.0487,  0.1738,  ...,  0.4684,  0.0030, -0.6084],
         [ 0.0519,  0.3729,  0.5223,  ...,  0.3584,  0.6500, -0.3883]],

        [[-0.2937,  0.7283, -0.1497,  ..., -0.1187, -1.0227, -0.0422],
         [-0.2206,  0.9384, -0.0951,  ..., -0.3643, -0.6605,  0.2407],
         [-0.1536,  0.8988, -0.0728,  ..., -0.2189, -0.8528,  0.0710],
         ...,
         [-0.3017,  0.9002, -0.0200,  ..., -0.1082, -0.8412, -0.0861],
         [-0.3338,  0.9674, -0.0729,  ..., -0.1952, -0.8181, -0.0634],
         [-0.3454,  0.8824, -0.0426,  ..., -0.0993, -0.8329, -0.1065]]],
       grad_fn=<NativeLayerNormBackward0>), hidden_states=None, attentions=None)

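For sentiment analysis, a classification head (a few linear layers) has to sit on top of the last hidden state to produce the output logits. AutoModel does not include that head, so its output has no logits attribute; AutoModelForSequenceClassification loads the same checkpoint together with the head.

Post-processing

The model outputs raw, unnormalized scores (logits), so they still have to be adapted to the downstream task. For text classification, the logits go through a softmax layer to become probabilities:

import torch
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

This prints the probabilities below. Look up each column index in model.config.id2label to find the label it stands for.

tensor([[4.0195e-02, 9.5980e-01],
        [9.9946e-01, 5.4418e-04]], grad_fn=<SoftmaxBackward0>)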
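To make the post-processing step concrete, here is a minimal pure-Python sketch of softmax followed by a label lookup. The logit values and the `id2label` dictionary are illustrative assumptions that only imitate the shape of `outputs.logits` and `model.config.id2label`; they are not read from a real model.

```python
import math

def softmax(logits):
    """Turn one row of raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative values (assumptions): two sentences, two classes.
example_logits = [[-1.5607, 1.6123], [4.1692, -3.3464]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}  # same structure as model.config.id2label

predicted_labels = []
for row in example_logits:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
    predicted_labels.append(id2label[best])
    print(id2label[best], round(probs[best], 4))
```

Softmax never changes which class has the highest score; it only rescales the scores so they can be read as probabilities.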

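Common Pipelines

feature-extraction (get a vector representation of the text)

fill-mask (fill in masked words or spans)

ner (named entity recognition)

question-answering (question answering)

sentiment-analysis (sentiment analysis)

summarization (summarization)

text-generation (text generation)

translation (machine translation)

zero-shot-classification (classification without training examples)

Usage

To use your own fine-tuned model in a pipeline, pass its path through the model argument, e.g. model='~/user/model/xxxxx'.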

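Text Classification / Sentiment Analysis

from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # "text-classification" loads the same model
result = classifier("我喜欢看电影")  # "I like watching movies"
print(result)
# "I like watching this movie", "This person really annoys me"
results = classifier(["我喜欢看这个电影", "我好烦这个人啊"])
print(results)

Output:

[{'label': 'NEGATIVE', 'score': 0.9837913513183594}]
[{'label': 'NEGATIVE', 'score': 0.9648574590682983}, {'label': 'NEGATIVE', 'score': 0.9151368141174316}]

Every sentence comes out NEGATIVE, including the positive ones, because the underlying default model only supports English.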

The code above is roughly equivalent to:

import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# padding=True is required so that both sentences fit into one tensor
inputs = tokenizer(["我喜欢看这个电影", "我好烦这个人啊"], padding=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# take the argmax per sentence (dim=-1), not over the whole batch
predicted_class_ids = logits.argmax(dim=-1).tolist()
print([model.config.id2label[i] for i in predicted_class_ids])

Zero-shot Classification

from transformers import pipeline

classifier = pipeline("zero-shot-classification")
result = classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)
print(result)

This is roughly equivalent to scoring each candidate label separately with an NLI model:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
nli_model = AutoModelForSequenceClassification.from_pretrained('facebook/bart-large-mnli').to(device)
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli')

premise = "This is a course about the Transformers library"
labels = ["education", "politics", "business"]
for label in labels:
    hypothesis = f'This example is {label}.'
    # run through model pre-trained on MNLI
    x = tokenizer.encode(premise, hypothesis, return_tensors='pt', truncation='only_first')
    logits = nli_model(x.to(device))[0]
    # drop the "neutral" logit (index 1); keep contradiction (0) and entailment (2)
    entail_contradiction_logits = logits[:, [0, 2]]
    probs = entail_contradiction_logits.softmax(dim=1)
    prob_label_is_true = probs[:, 1]
    print(label, prob_label_is_true.item())

Text Generation

from transformers import pipeline

generator = pipeline("text-generation")
results = generator("In this course, we will teach you how to")
print(results)
results = generator(
    "In this course, we will teach you how to",
    num_return_sequences=2,
    max_length=50,
)
print(results)

Fill-Mask

Predict the word that has been masked out.

from transformers import pipeline

unmasker = pipeline("fill-mask")
results = unmasker("This course will teach you all about <mask> models.", top_k=2)
print(results)

top_k sets how many of the highest-probability candidates are returned.

[{'sequence': 'This course will teach you all about mathematical models.', 'score': 0.19619858264923096, 'token': 30412, 'token_str': ' mathematical'},
 {'sequence': 'This course will teach you all about computational models.', 'score': 0.04052719101309776, 'token': 38163, 'token_str': ' computational'}]

Named Entity Recognition

The NER task extracts entities such as person names and locations from text.

from transformers import pipeline

ner = pipeline("ner", grouped_entities=True)
results = ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")
print(results)

[{'entity_group': 'PER', 'score': 0.9981694, 'word': 'Sylvain', 'start': 11, 'end': 18},
 {'entity_group': 'ORG', 'score': 0.97960186, 'word': 'Hugging Face', 'start': 33, 'end': 45},
 {'entity_group': 'LOC', 'score': 0.99321055, 'word': 'Brooklyn', 'start': 49, 'end': 57}]

Setting grouped_entities=True makes the pipeline merge the subword tokens that belong to the same entity; here "Hugging" and "Face" are merged into a single organization entity. Sylvain was also merged from subwords, because the tokenizer splits it into four tokens: S, ##yl, ##va, and ##in. (Recent Transformers versions deprecate this flag in favor of aggregation_strategy="simple".)
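The merging described above can be sketched without the library. The functions below are hypothetical helpers, and the per-token tags are hand-written assumptions rather than model output; the sketch also ignores the B-/I- prefixes that real NER models emit.

```python
def merge_subwords(tokens):
    """Join WordPiece continuation pieces (prefixed with '##') onto the previous piece."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1] += tok[2:]
        else:
            words.append(tok)
    return words

def group_entities(tokens, tags):
    """Merge runs of consecutive tokens that share the same non-'O' tag."""
    groups = []        # each item is [tag, [tokens in the run]]
    previous = "O"
    for tok, tag in zip(tokens, tags):
        if tag != "O":
            if tag == previous:
                groups[-1][1].append(tok)   # continue the current entity
            else:
                groups.append([tag, [tok]])  # start a new entity
        previous = tag
    return [(tag, " ".join(merge_subwords(toks))) for tag, toks in groups]

tokens = ["My", "name", "is", "S", "##yl", "##va", "##in", "and", "I", "work",
          "at", "Hu", "##gging", "Face", "in", "Brooklyn", "."]
# Hand-written tags (assumption), one per token.
tags = ["O", "O", "O", "PER", "PER", "PER", "PER", "O", "O", "O",
        "O", "ORG", "ORG", "ORG", "O", "LOC", "O"]
entities = group_entities(tokens, tags)
print(entities)
```

Grouping first by tag and then gluing the `##` pieces back together is what lets "Hu", "##gging", "Face" come out as the single entity "Hugging Face".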

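Model

How the Model Works

The body of a pretrained model contains only the base Transformer module. For a given input, it outputs the values of some neurons, called hidden states or features. For NLP models, these can be understood as high-dimensional semantic representations of the text. The hidden states are usually passed on to another part of the model, called the head, to complete a specific task; for example, they are fed to a classification head for text classification. In other words, a pretrained model followed by a head of one or more linear layers can handle a specific task.

All the pipelines shown earlier share this model structure; only the final part of the model uses a different head for each task.

The Transformers library wraps many different architectures. Common ones include:

*Model (returns hidden states; suited to custom tasks, where you design your own task or head on top of them)
*ForCausalLM (causal language modeling; suited to generation tasks such as dialogue completion)
*ForMaskedLM (masked language modeling; fills in the masked words in the input)
*ForMultipleChoice (multiple choice; the input is a question plus several options, and the model predicts the best option)
*ForQuestionAnswering (question answering; the input is a question and a passage, and the output is the start and end positions of the answer in the passage)
*ForSequenceClassification (text classification)
*ForTokenClassification (token classification, e.g. NER)

These classes wrap a pretrained model for different task requirements; in practice, what changes between them is the head.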

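Usage

1. Load the model from a checkpoint (AutoModel alone is enough)

from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-cased")

To use a local model, replace the checkpoint name with the local path.

AutoModel.from_pretrained() automatically caches the downloaded weights, by default under ~/.cache/huggingface/transformers (recent versions use ~/.cache/huggingface/hub). The cache directory can be changed with the HF_HOME environment variable.

Loading by checkpoint name requires a network connection, so in most cases we load the model from a local path.

The Hub page of a model may list many files. Usually only config.json and pytorch_model.bin are needed for the model, plus tokenizer.json, tokenizer_config.json, and vocab.txt for the tokenizer.

2. Save the model

from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-cased")
model.save_pretrained("./models/bert-base-cased/")

This creates two files in the save directory:

config.json: the model configuration file, which stores architecture parameters such as the number of Transformer layers and the hidden size;
pytorch_model.bin: also called the state dictionary, which stores the model weights (recent versions write model.safetensors by default instead).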

Tokenizer

How It Works

Tokenization: split the text into tokens according to some strategy, with tokenizer.tokenize();
Mapping: convert the tokens to their token IDs, with convert_tokens_to_ids().

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
tokens = tokenizer.tokenize(sequence)
ids = tokenizer.convert_tokens_to_ids(tokens)

The encode() function combines the two steps and also adds the special tokens the model needs, such as [CLS] (ID 101) and [SEP] (ID 102), which mark the start and end of the sequence.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

sequence = "Using a Transformer network is simple"
sequence_ids = tokenizer.encode(sequence)
print(sequence_ids)

The most common approach is to call the tokenizer directly, tokenizer(sentence), because besides the special tokens it also returns the other inputs the model needs, such as attention_mask.

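Usage

Loading again goes through AutoTokenizer.from_pretrained(), and saving through save_pretrained():

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
tokenizer.save_pretrained("./models/bert-base-cased/")

The tokenizer needs three files:

special_tokens_map.json: the mapping of special tokens

tokenizer_config.json: the tokenizer configuration file (parameters and so on)

vocab.txt: the vocabulary that maps tokens to IDs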

Fast Tokenizers

Fast tokenizers are written in Rust. Besides encoding and decoding, they can track the mapping between the original text and the tokens. AutoTokenizer uses them by default.

Use is_fast to check whether a fast tokenizer is in use, and tokens() to see the resulting tokens.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
encoding = tokenizer(example)
print(encoding.is_fast)
# True
print(encoding.tokens())
# ['[CLS]', 'My', 'name', 'is', 'S', '##yl', '##va', '##in', 'and', 'I', 'work', 'at', 'Hu', '##gging', 'Face', 'in', 'Brooklyn', '.', '[SEP]']
print(encoding.word_ids())
# [None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]

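Tracking the Mapping

This means mapping between indices and words. A proper noun, for instance, may be split into several tokens, but word_ids() maps every token of the same word back to that word.

Word/token ⇒ text: word_to_chars() and token_to_chars() return the start and end character offsets of a word or token in the original text.

For example, the token at index 5 in the earlier example is '##yl' and its word index is 3, so the matching token span and word span can easily be extracted from the original text: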

token_index = 5
print('the 5th token is:', encoding.tokens()[token_index])
start, end = encoding.token_to_chars(token_index)
print('corresponding text span is:', example[start:end])
word_index = encoding.word_ids()[token_index] # 3
start, end = encoding.word_to_chars(word_index)
print('corresponding word span is:', example[start:end])

the 5th token is: ##yl
corresponding text span is: yl
corresponding word span is: Sylvain

Word ⇔ token: the earlier example used word_ids() to get the word index of every token in the sequence. Words and tokens can also be mapped to each other directly by index, with token_to_word() and word_to_tokens():

token_index = 5
print('the 5th token is:', encoding.tokens()[token_index])
corresp_word_index = encoding.token_to_word(token_index)
print('corresponding word index is:', corresp_word_index)
start, end = encoding.word_to_chars(corresp_word_index)
print('the word is:', example[start:end])
start, end = encoding.word_to_tokens(corresp_word_index)
print('corresponding tokens are:', encoding.tokens()[start:end])

the 5th token is: ##yl
corresponding word index is: 3
the word is: Sylvain
corresponding tokens are: ['S', '##yl', '##va', '##in']

Text ⇒ word/token: use the char_to_word() and char_to_token() methods:

chars = 'My name is Sylvain'
print('characters of "{}" are: {}'.format(chars, list(chars)))
print('corresponding word index: ')
for i, c in enumerate(chars):
    print('"{}": {} '.format(c, encoding.char_to_word(i)), end="")
print('\ncorresponding token index: ')
for i, c in enumerate(chars):
    print('"{}": {} '.format(c, encoding.char_to_token(i)), end="")

characters of "My name is Sylvain" are: ['M', 'y', ' ', 'n', 'a', 'm', 'e', ' ', 'i', 's', ' ', 'S', 'y', 'l', 'v', 'a', 'i', 'n']
corresponding word index: 
"M": 0 "y": 0 " ": None "n": 1 "a": 1 "m": 1 "e": 1 " ": None "i": 2 "s": 2 " ": None "S": 3 "y": 3 "l": 3 "v": 3 "a": 3 "i": 3 "n": 3 
corresponding token index: 
"M": 1 "y": 1 " ": None "n": 2 "a": 2 "m": 2 "e": 2 " ": None "i": 3 "s": 3 " ": None "S": 4 "y": 5 "l": 5 "v": 6 "a": 6 "i": 7 "n": 7

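Multiple Input Texts

In practice a whole batch of data is processed at once, and feeding in several texts together raises a few issues.

Padding and Truncation

The texts in a batch have different lengths, so padding is used to bring them to the same length, because a tensor must be rectangular.

Padding is controlled by the tokenizer's padding argument:

padding="longest": pad every sequence to the length of the longest sequence in the current batch;
padding="max_length": pad every sequence to the maximum length the model accepts, which is 512 for BERT.

The model's padding token ID is available as the pad_token_id attribute of its tokenizer.

# before padding: rows have different lengths
batched_ids = [
    [200, 200, 200],
    [200, 200],
]

# after padding: every row has the same length
padding_id = 100
batched_ids = [
    [200, 200, 200],
    [200, 200, padding_id],
]

Padding introduces a problem: when attention is computed, the padding tokens are encoded and attended to as well, so the result differs from the unpadded one. An attention mask solves this.

Truncation is controlled by the tokenizer's truncation argument. With truncation=True, any sequence longer than the model's maximum length is truncated; for BERT, sequences longer than 512 tokens are cut. The max_length argument can also be used to set the truncation length.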
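As a rough sketch of what padding to the longest sequence combined with truncation to a maximum length does to a batch of IDs, consider the plain-Python helper below. `pad_and_truncate` is a hypothetical function, not a library call, and it simplifies one detail: a real tokenizer keeps the special tokens when it truncates.

```python
def pad_and_truncate(batch, pad_id, max_length=None):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    if max_length is not None:
        batch = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

batched_ids = [[200, 200, 200], [200, 200], [200, 200, 200, 200, 200]]
padded = pad_and_truncate(batched_ids, pad_id=0, max_length=4)
print(padded)
```

After the call every row has the same length, so the batch can be turned into a rectangular tensor.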

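Attention Mask

An attention mask is a tensor with exactly the same shape as the input IDs that contains only 0s and 1s. A 0 means the token at that position is padding and takes no part in the computation. Some special model architectures also use the attention mask to hide specific tokens.

With the mask in place, the padded sequence gives the same result as the unpadded one.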
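The effect of the mask can be illustrated on a single row of attention scores. This is a simplified sketch with made-up scores: `masked_softmax` is a hypothetical helper that zeroes masked positions directly, whereas real implementations typically add a large negative value to the masked scores before the softmax.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over attention scores, giving zero weight where mask is 0."""
    kept = [s for s, m in zip(scores, mask) if m == 1]
    peak = max(kept)  # subtract the max of the unmasked scores for stability
    exps = [math.exp(s - peak) if m == 1 else 0.0 for s, m in zip(scores, mask)]
    total = sum(exps)
    return [e / total for e in exps]

scores_unpadded = [2.0, 1.0]
scores_padded = [2.0, 1.0, 5.0]  # the third score belongs to a padding token

weights_unpadded = masked_softmax(scores_unpadded, [1, 1])
weights_padded = masked_softmax(scores_padded, [1, 1, 0])
weights_no_mask = masked_softmax(scores_padded, [1, 1, 1])
print(weights_unpadded, weights_padded, weights_no_mask)
```

With the mask, the real tokens receive the same weights as in the unpadded case; without it, the padding token absorbs most of the attention.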

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequence1_ids = [[200, 200, 200]]
sequence2_ids = [[200, 200]]
batched_ids = [
    [200, 200, 200],
    [200, 200, tokenizer.pad_token_id],
]
batched_attention_masks = [
    [1, 1, 1],
    [1, 1, 0],
]

print(model(torch.tensor(sequence1_ids)).logits)
print(model(torch.tensor(sequence2_ids)).logits)
outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(batched_attention_masks))
print(outputs.logits)

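Handling Long Text

See also: 大模型解决长文本输入问题_大模型如何解决超长输入限制-CSDN博客 (a CSDN blog post on how large models handle over-long inputs)

Use a Transformer model that supports long text, such as Longformer or LED (maximum length 4096);
Set a maximum length max_sequence_length and truncate the input sequence: sequence = sequence[:max_sequence_length];
Split the long text into shorter chunks and encode each chunk separately.

With padding=True, truncation=True, all sequences in a batch are padded to the same length, and any sequence longer than the model's maximum length is truncated automatically.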

from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "So have I!",
]

tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt")
print(tokens)
output = model(**tokens)
print(output.logits)

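Long Text in Extraction Tasks

What if the context is so long that, once concatenated with the question, it exceeds the model's maximum length?

One option is to split the context into chunks. Pass return_overflowing_tokens=True when truncating, and the tokenizer automatically cuts the long context into several chunks. If a cut lands in the wrong place, however, the meaning can be distorted, so the stride parameter is used to control how many tokens consecutive chunks share.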

sentence = "This sentence is not too long but we are going to split it anyway."
inputs = tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)

for ids in inputs["input_ids"]:
    print(tokenizer.decode(ids))

[CLS] This sentence is not [SEP]
[CLS] is not too long [SEP]
[CLS] too long but we [SEP]
[CLS] but we are going [SEP]
[CLS] are going to split [SEP]
[CLS] to split it anyway [SEP]
[CLS] it anyway. [SEP]

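With max_length=6, stride=2, each chunk holds at most 6 tokens (special tokens included), and consecutive chunks overlap by 2 tokens. Printing the encoding in more detail shows that, besides the usual token IDs and attention mask, there is an overflow_to_sample_mapping entry, which records the index of the input sentence each chunk came from. For example: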

sentence = "This sentence is not too long but we are going to split it anyway."
inputs = tokenizer(
    sentence, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)
print(inputs.keys())
print(inputs["overflow_to_sample_mapping"])

sentences = [
    "This sentence is not too long but we are going to split it anyway.",
    "This sentence is shorter but will still get split.",
]
inputs = tokenizer(
    sentences, truncation=True, return_overflowing_tokens=True, max_length=6, stride=2
)
print(inputs["overflow_to_sample_mapping"])

dict_keys(['input_ids', 'attention_mask', 'overflow_to_sample_mapping'])
[0, 0, 0, 0, 0, 0, 0]
[0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
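The chunking behaviour shown above can be imitated in plain Python. This is a sketch under assumptions: tokens come from splitting on whitespace (which happens to match the one-token-per-word result in the example), two special tokens are reserved per chunk, and the function names are hypothetical.

```python
def chunk_with_stride(tokens, max_length, stride, num_special=2):
    """Split tokens into overlapping windows, leaving room for special tokens."""
    window = max_length - num_special   # content tokens per chunk
    step = window - stride              # how far the window moves each time
    if window <= 0 or step <= 0:
        raise ValueError("max_length must leave room for more tokens than the stride")
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += step
    return chunks

def chunk_batch(batch, max_length, stride):
    """Chunk every sample and remember which sample each chunk came from."""
    all_chunks, sample_mapping = [], []
    for sample_index, tokens in enumerate(batch):
        for chunk in chunk_with_stride(tokens, max_length, stride):
            all_chunks.append(["[CLS]"] + chunk + ["[SEP]"])
            sample_mapping.append(sample_index)
    return all_chunks, sample_mapping

first = "This sentence is not too long but we are going to split it anyway .".split()
second = "This sentence is shorter but will still get split .".split()
chunks, mapping = chunk_batch([first, second], max_length=6, stride=2)
for chunk in chunks:
    print(" ".join(chunk))
print(mapping)
```

The `mapping` list plays the role of overflow_to_sample_mapping: one entry per chunk, holding the index of the sentence the chunk was cut from.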

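Likewise, for each chunk we compute, for every possible text span in it, the probability that the span is the answer, then take the span with the highest probability, and finally map its token indices back to the original text to produce the result.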
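One way to picture this selection step is the sketch below. It rests on assumptions not stated in the text: a span's score is taken to be the product of a start probability and an end probability, the numbers are invented, and each chunk carries a hypothetical `offset` giving the position of its first token in the full sequence.

```python
def best_span(start_probs, end_probs, max_answer_len):
    """Return (score, start, end) of the highest-scoring valid span in one chunk."""
    best = None
    for start, s_prob in enumerate(start_probs):
        last = min(start + max_answer_len, len(end_probs))
        for end in range(start, last):          # only spans with end >= start
            score = s_prob * end_probs[end]
            if best is None or score > best[0]:
                best = (score, start, end)
    return best

# Invented per-token probabilities for two overlapping chunks (assumption).
chunk_results = [
    {"offset": 0, "start": [0.1, 0.6, 0.1, 0.2], "end": [0.1, 0.1, 0.7, 0.1]},
    {"offset": 2, "start": [0.2, 0.3, 0.4, 0.1], "end": [0.3, 0.3, 0.2, 0.2]},
]

candidates = []
for chunk in chunk_results:
    score, start, end = best_span(chunk["start"], chunk["end"], max_answer_len=3)
    # shift chunk-local indices back to positions in the full token sequence
    candidates.append((score, chunk["offset"] + start, chunk["offset"] + end))

answer = max(candidates)  # highest score across all chunks
print(answer)
```

The token positions in `answer` would then be converted to character offsets in the original text to extract the answer string.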

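Custom Tokens

When the input contains custom markers or custom tokens, the tokenizer may not recognize them, so the new tokens have to be added to the model's vocabulary.

add_tokens() adds ordinary tokens: the argument is a list of new tokens, and any token not already in the vocabulary is appended to its end.

add_special_tokens() adds special tokens: the argument is a dictionary of special tokens whose keys must come from bos_token, eos_token, unk_token, sep_token, pad_token, cls_token, mask_token, and additional_special_tokens. Again, any token not already in the vocabulary is appended to its end. Once added, these tokens can be accessed through dedicated attributes; for example, tokenizer.cls_token points to the cls token.

Special tokens are normalized slightly differently from ordinary tokens; for example, they are not lowercased.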

from transformers import AutoModel, AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

num_added_toks = tokenizer.add_tokens(["new_token1", "my_new-token2"])
special_tokens_dict = {"cls_token": "[MY_CLS]"}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)

# resize the embedding matrix to match the enlarged vocabulary
model.resize_token_embeddings(len(tokenizer))

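After new tokens are added to the vocabulary, the model's embedding matrix must be resized, that is, embeddings for the new tokens must be added to the matrix. Only then can the model work properly and map each token to its embedding.

Token Embedding Initialization

A newly added token needs an initial embedding. With plenty of training data it will learn a sensible value on its own; with too little, the embedding of a "synonym" should be used as its initial value.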

token_id = tokenizer.convert_tokens_to_ids('entity')
token_embedding = model.embeddings.word_embeddings.weight[token_id]
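The two lines above only read the embedding of an existing token. A minimal sketch of the full idea, on a toy embedding table stored as plain lists, might look like this. The helper `add_token_embedding`, the toy vocabulary, and the choice to average several source rows are all assumptions for illustration; with a real model the new row would be written into the resized embedding matrix instead.

```python
def add_token_embedding(embedding_matrix, source_ids):
    """Append one row initialised as the mean of the rows listed in source_ids."""
    dim = len(embedding_matrix[0])
    new_row = [
        sum(embedding_matrix[i][d] for i in source_ids) / len(source_ids)
        for d in range(dim)
    ]
    embedding_matrix.append(new_row)
    return len(embedding_matrix) - 1  # id assigned to the new token

# Toy vocabulary and 2-dimensional embeddings (assumptions).
toy_vocab = {"entity": 0, "object": 1, "the": 2}
toy_embeddings = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]

# Initialise a hypothetical "[ENT]" token from two related words.
new_id = add_token_embedding(toy_embeddings, [toy_vocab["entity"], toy_vocab["object"]])
toy_vocab["[ENT]"] = new_id
print(new_id, toy_embeddings[new_id])
```

Starting from related words places the new token near meaningful neighbours, so less training data is needed to move it to a useful position.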

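Fine-tuning a Model

Data Processing

Data processing takes three steps:

1. Create a Dataset object

2. Create a DataLoader object that iterates over the Dataset

3. Iterate over the DataLoader and feed the data and labels to the model for training.

The Dataset organizes the data; the DataLoader draws batch_size samples at a time from the organized data, following a given rule, for the network to train on.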

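from torch.utils.data import Dataset, DataLoader

dataset = MyDataset()  # step 1: build the Dataset object (MyDataset is your own Dataset subclass)
dataloader = DataLoader(dataset)  # step 2: wrap it in a DataLoader to get an iterable

num_epoches = 100
for epoch in range(num_epoches):
    for _, data in enumerate(dataloader):
        pass  # step 3: feed data and labels to the model here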

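The data-loading flow works as follows.

for i, data in enumerate(dataloader): calls the DataLoader's __iter__() method, which creates a DataLoader iterator; this is where single-process or multi-process loading is chosen. Each step calls the iterator's __next__() method to get a batch. Inside __next__(), _next_index() asks the sampler for the next indices, the dataset fetcher's fetch() method then calls the dataset's __getitem__() for those indices, and collate_fn packs the samples into a batch. Once the data is exhausted, __next__() raises StopIteration, the for loop ends, and that iterator is used up.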
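The flow just described (sampler, fetcher, collate function) can be imitated with a small generator. This is a toy sketch, not PyTorch's implementation: `ToyDataset`, `toy_dataloader`, and `split_xy` are hypothetical names, and everything runs in a single process.

```python
import random

class ToyDataset:
    """Minimal dataset: supports len() and indexing, like a map-style Dataset."""
    def __init__(self, samples):
        self.samples = samples
    def __len__(self):
        return len(self.samples)
    def __getitem__(self, index):
        return self.samples[index]

def toy_dataloader(dataset, batch_size, shuffle=False, drop_last=False,
                   collate_fn=list, seed=0):
    """Yield batches: pick indices, fetch the items, then collate them."""
    indices = list(range(len(dataset)))              # sampler: decide the order
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch_indices = indices[start:start + batch_size]
        if drop_last and len(batch_indices) < batch_size:
            break                                    # discard the incomplete batch
        samples = [dataset[i] for i in batch_indices]  # fetcher: calls __getitem__
        yield collate_fn(samples)                      # collate: pack into a batch

def split_xy(samples):
    """Example collate function: turn [(x, y), ...] into ([x, ...], [y, ...])."""
    xs, ys = zip(*samples)
    return list(xs), list(ys)

dataset = ToyDataset([(i, i % 2) for i in range(10)])
batches = list(toy_dataloader(dataset, batch_size=4, collate_fn=split_xy))
for batch in batches:
    print(batch)
```

When the generator runs out of indices it simply stops, which is the generator equivalent of raising StopIteration at the end of an epoch.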

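Dataset

A Dataset subclass must override two methods: __len__() and __getitem__(self, index).

from torch.utils.data import Dataset

class Mydata(Dataset):
    def __init__(self, data):
        # assumption: data is a sequence of (x, y) pairs
        self.x_data = [x for x, _ in data]
        self.y_data = [y for _, y in data]
        self.length = len(data)

    def __getitem__(self, index):
        return self.x_data[index], self.y_data[index]

    def __len__(self):
        return self.length

mydata = Mydata(mydataset)   # create an instance
first = mydata[0]            # get the first sample; calls __getitem__ automatically
length = len(mydata)         # get the number of samples; calls __len__ automatically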

DataLoader

A DataLoader does not need to be subclassed; instantiating it is enough.

Its job is to load data batch by batch and convert the samples into an input format the model accepts. For NLP tasks, this step encodes the text in each batch in the format the pretrained model expects (including padding, truncation, and so on).

import torch
from torch.utils.data import DataLoader

def my_collate_fn(batch_samples):
    batch_sentence_1, batch_sentence_2 = [], []
    batch_label = []
    for sample in batch_samples:
        batch_sentence_1.append(sample['sentence1'])
        batch_sentence_2.append(sample['sentence2'])
        batch_label.append(int(sample['label']))
    # encode the sentence pairs together, padding to the longest pair in the batch
    X = tokenizer(batch_sentence_1, batch_sentence_2, padding=True, truncation=True, return_tensors="pt")
    y = torch.tensor(batch_label)
    return X, y

my_dataload = DataLoader(mydata, batch_size=4, num_workers=4, pin_memory=True, drop_last=True, collate_fn=my_collate_fn)

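num_workers (int, optional): the number of subprocesses used to load the data.
pin_memory (bool, optional): if True, the data loader copies tensors into pinned (page-locked) memory before returning them, which speeds up transfer to the GPU.
drop_last (bool, optional): if True, the final batch is dropped when there is not enough data left to fill it. If False (the default), training proceeds as usual and the final batch is simply smaller.
collate_fn (callable, optional): the function that merges a list of samples into a mini-batch and gives the output a uniform format. When sequence lengths differ, the default DataLoader raises an error because variable-length samples cannot be stacked into one tensor, so a custom collate_fn is needed.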

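Building the Model

Add the layers your downstream task needs on top of the model. One way is to inherit from the model class you need and add layers to it.

Another is to inherit from nn.Module and load the model with AutoModel.from_pretrained(checkpoint).

This example adds a linear layer for classification on top of a BERT model.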

import torch
from torch import nn
from transformers import AutoConfig
from transformers import BertPreTrainedModel, BertModel

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using {device} device')

class BertForPairwiseCLS(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.bert = BertModel(config, add_pooling_layer=False)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(768, 2)  # 768 = hidden size, 2 = number of classes
        self.post_init()

    def forward(self, x):
        bert_output = self.bert(**x)
        # use the [CLS] vector (position 0) as the representation of the pair
        cls_vectors = bert_output.last_hidden_state[:, 0, :]
        cls_vectors = self.dropout(cls_vectors)
        logits = self.classifier(cls_vectors)
        return logits

# checkpoint is the name or local path of a BERT checkpoint, defined beforehand
config = AutoConfig.from_pretrained(checkpoint)
model = BertForPairwiseCLS.from_pretrained(checkpoint, config=config).to(device)

Training the Model

Training follows the usual deep-learning loop, so it is not covered in detail here.

from tqdm.auto import tqdm

def train_loop(dataloader, model, loss_fn, optimizer, lr_scheduler, epoch, total_loss):
    progress_bar = tqdm(range(len(dataloader)))
    progress_bar.set_description(f'loss: {0:>7f}')
    finish_step_num = (epoch-1)*len(dataloader)

    model.train()
    for step, (X, y) in enumerate(dataloader, start=1):
        X, y = X.to(device), y.to(device)
        pred = model(X)
        loss = loss_fn(pred, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        lr_scheduler.step()

        total_loss += loss.item()
        progress_bar.set_description(f'loss: {total_loss/(finish_step_num + step):>7f}')
        progress_bar.update(1)
    return total_loss

def test_loop(dataloader, model, mode='Test'):
    assert mode in ['Valid', 'Test']
    size = len(dataloader.dataset)
    correct = 0

    model.eval()
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

    correct /= size
    print(f"{mode} Accuracy: {(100*correct):>0.1f}%\n")
    return correct  # returned so the caller can track the best accuracy

The main routine is below; it keeps the weights with the best validation accuracy.

from torch.optim import AdamW  # the AdamW class in transformers is deprecated
from transformers import get_scheduler

learning_rate = 1e-5
epoch_num = 3

loss_fn = nn.CrossEntropyLoss()
optimizer = AdamW(model.parameters(), lr=learning_rate)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=epoch_num*len(train_dataloader),
)

total_loss = 0.
best_acc = 0.
for t in range(epoch_num):
    print(f"Epoch {t+1}/{epoch_num}\n-------------------------------")
    total_loss = train_loop(train_dataloader, model, loss_fn, optimizer, lr_scheduler, t+1, total_loss)
    valid_acc = test_loop(valid_dataloader, model, mode='Valid')
    if valid_acc > best_acc:
        best_acc = valid_acc
        print('saving new weights...\n')
        torch.save(model.state_dict(), f'epoch_{t+1}_valid_acc_{(100*valid_acc):0.1f}_model_weights.bin')
print("Done!")
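To see what a "linear" schedule does to the learning rate over training, here is a small pure-Python sketch. `linear_lr` is a hypothetical helper written for illustration, assuming the common behaviour of a linear warmup followed by a linear decay to zero; it is not the library's own code.

```python
def linear_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    decay_span = max(1, num_training_steps - num_warmup_steps)
    return base_lr * max(0.0, remaining / decay_span)

# Same settings as in the text: lr 1e-5, no warmup; 10 total steps is an assumption.
schedule = [linear_lr(s, 1e-5, 0, 10) for s in range(11)]
for step, lr in enumerate(schedule):
    print(step, lr)
```

With no warmup the rate starts at its full value and shrinks by the same amount at every step, which is why lr_scheduler.step() is called once per batch in the training loop.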

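Project Structure

project/
│
├── data/ # data folder
│ ├── raw/ # raw data
│ ├── processed/ # processed data
│ └── datasets.py # data loading and preprocessing script
├── src/ # source code folder
│ ├── models/ # model definition code
│ │ ├── __init__.py # package init file
│ │ └── model_architecture.py # model architecture definition
│ │
│ ├── training/ # training code
│ │ ├── __init__.py # package init file
│ │ ├── train.py # training logic
│ │ └── evaluation.py # model evaluation
│ │
│ ├── utils/ # utilities and shared code
│ │ ├── __init__.py # package init file
│ │ └── helpers.py # general helper functions
│ │
│ └── configs/ # configuration files
│ └── config.yaml # project parameter configuration
│
├── experiments/ # experiment records folder
│ ├── exp1/ # one folder per experiment
│ │ ├── logs/ # logs
│ │ ├── checkpoints/ # model checkpoints
│ │ └── results.json # experiment results
│ └── exp2/ # another experiment
│
├── tests/ # test code
│ ├── test_data_loading.py # tests for data loading
│ ├── test_model_training.py # tests for model training
│ └── test_utils.py # tests for utility functions
│
├── requirements.txt # Python dependency list
├── README.md # project overview and usage notes
├── .gitignore # Git ignore file
├── run_model1.sh # run script
└── main.py # main entry script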

References

Hello! · Transformers快速入门 (Transformers Quick Start)
