Maximum input length in words/sentences for the Pegasus model in the Transformers library

Posted 2024-06-26 02:16:18


In the Transformers library, what is the maximum input length in words and/or sentences for the Pegasus model? I read in the Pegasus research paper that the maximum is 512 tokens, but how many words and/or sentences is that? Also, can you increase the maximum of 512 tokens?


Tags: tokens, model, count, words, sentences, transformers, pegasus
1 Answer

In the Transformers library, what is the maximum input length of words and/or sentences of the Pegasus model? It actually depends on the pretraining. You can create a Pegasus model that supports a length of 100 tokens or 10,000 tokens. For example, the model google/pegasus-cnn_dailymail supports 1024 tokens, while google/pegasus-xsum supports 512:

from transformers import PegasusTokenizerFast

t = PegasusTokenizerFast.from_pretrained("google/pegasus-xsum")
t2 = PegasusTokenizerFast.from_pretrained("google/pegasus-cnn_dailymail")
# maximum input length in tokens, excluding special tokens
print(t.max_len_single_sentence)
print(t2.max_len_single_sentence)

Output:

511
1023

The numbers are reduced by one because of the special token that is added to each sequence.

I read in the Pegasus research paper that the max was 512 tokens, but how many words and/or sentences is that?

That depends on your vocabulary:

from transformers import PegasusTokenizerFast

t = PegasusTokenizerFast.from_pretrained("google/pegasus-xsum")
print(t.tokenize('This is a test sentence'))  # split the text into subword tokens
print("I know {} tokens".format(len(t)))      # len(t) is the vocabulary size

Output:

['▁This', '▁is', '▁a', '▁test', '▁sentence']
I know 96103 tokens

A word can be a single token, but it can also be split into several tokens:

print(t.tokenize('neuropsychiatric conditions'))

Output:

['▁neuro', 'psych', 'i', 'atric', '▁conditions']

Also, can you increase the maximum number of 512 tokens?

Yes, you can train a model with the Pegasus architecture for a different input length, but that is expensive.
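As a sketch, a fresh Pegasus model with a larger position limit can be created through `PegasusConfig` and would then need to be pretrained or fine-tuned from scratch. The tiny layer sizes below are illustrative assumptions to keep the example small, not realistic training settings:

```python
from transformers import PegasusConfig, PegasusForConditionalGeneration

# Deliberately tiny, illustrative configuration with a 2048-token position limit
config = PegasusConfig(
    max_position_embeddings=2048,  # raise the 512-token limit
    d_model=64,
    encoder_layers=1,
    decoder_layers=1,
    encoder_attention_heads=2,
    decoder_attention_heads=2,
    encoder_ffn_dim=128,
    decoder_ffn_dim=128,
)
# Randomly initialized: the model still has to be trained before it is useful
model = PegasusForConditionalGeneration(config)
print(model.config.max_position_embeddings)  # 2048
```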
