Following https://github.com/huggingface/tokenizers/issues/244, I am trying to use a WordLevel tokenizer with a RoBERTa transformers model. My vocabulary contains numbers (as strings) plus the special tokens. Something is going wrong, and I can locate where, but I don't know how to fix it. The situation is as follows:
from transformers import RobertaTokenizerFast
from transformers import LineByLineTextDataset

tokenizer = RobertaTokenizerFast.from_pretrained("wordlevel", max_len=num_secs_max)
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./optiver.txt",
    block_size=128,
)
I can see that LineByLineTextDataset splits numbers into individual digits, which is wrong for my purposes. This turns out to be the result of how tokenizer.batch_encode_plus works. The suggestion I found was to pass the is_split_into_words=True parameter when constructing RobertaTokenizerFast, but I had no success with it. Please explain how I can split the corpus by words rather than by individual characters.
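For example, this quick check (using the tokenizer loaded above; the printed result is what I observe with my vocab) makes the symptom visible:

# A word-level vocabulary should keep "1234" as a single token,
# but the fast tokenizer breaks it into single digits.
print(tokenizer.tokenize("1234"))
# observed: ['1', '2', '3', '4'] instead of ['1234']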
Here are more details on the code used:
from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import WordLevel
from tokenizers.processors import BertProcessing
from tokenizers.implementations import BaseTokenizer


class WordLevelBertTokenizer(BaseTokenizer):
    """WordLevelBertTokenizer

    Represents a simple word-level tokenization for BERT.
    """

    def __init__(self, vocab_file: str):
        # Build a word-level model from the vocab file and split on whitespace
        # only, so numbers such as "1234" stay intact as single tokens.
        tokenizer = Tokenizer(WordLevel.from_file(vocab_file))
        tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()

        if vocab_file is not None:
            sep_token_id = tokenizer.token_to_id("</s>")
            if sep_token_id is None:
                raise TypeError("sep_token not found in the vocabulary")
            cls_token_id = tokenizer.token_to_id("<s>")
            if cls_token_id is None:
                raise TypeError("cls_token not found in the vocabulary")

            # Wrap every encoded sequence as: <s> ... </s>
            tokenizer.post_processor = BertProcessing(
                ("</s>", sep_token_id), ("<s>", cls_token_id)
            )

        parameters = {
            "model": "WordLevel",
            "sep_token": "</s>",
            "cls_token": "<s>",
            "pad_token": "<pad>",
            "mask_token": "<mask>",
        }
        super().__init__(tokenizer, parameters)
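At the tokenizers level this class does what I want. A small sanity check ('120' and '7' are just example entries from my vocab file, a snippet of which is shown at the end; the ids follow its n -> n+1 pattern):

wl_tokenizer = WordLevelBertTokenizer("./wordlevel/vocab.json")

# WhitespaceSplit keeps each whitespace-separated word as one token,
# and BertProcessing wraps the sequence in <s> ... </s>.
enc = wl_tokenizer.encode("120 7")
print(enc.tokens)  # ['<s>', '120', '7', '</s>']
print(enc.ids)     # [1224, 121, 8, 1225]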
from transformers import RobertaConfig

tokenizer = WordLevelBertTokenizer("./wordlevel/vocab.json")

config = RobertaConfig(
    vocab_size=tokenizer.get_vocab_size(),
    max_position_embeddings=tokenizer.get_vocab_size(),
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
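A side note on the config, unrelated to the tokenization problem: tying max_position_embeddings to the vocabulary size is probably accidental. If I understand the RoBERTa convention correctly, it only needs to cover block_size plus the offset of 2 that RoBERTa's position ids reserve for the padding index, so something like this should suffice:

config = RobertaConfig(
    vocab_size=tokenizer.get_vocab_size(),
    max_position_embeddings=128 + 2,  # block_size + RoBERTa's padding_idx offset
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)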
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("wordlevel", max_len=num_secs_max, add_prefix_space=True)

from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)
print(f'Num of model parameters = {model.num_parameters()}')

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./optiver.txt",
    block_size=128,
)
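What I suspect is happening: RobertaTokenizerFast applies its own byte-level pre-tokenizer instead of the WhitespaceSplit one stored with my WordLevel model, and that is exactly what splits numbers into digits. A workaround I am considering (an untested sketch; PreTrainedTokenizerFast and its tokenizer_object argument are generic transformers machinery, and ._tokenizer is the inner tokenizers.Tokenizer held by the class above) is to wrap the raw tokenizer directly:

from transformers import PreTrainedTokenizerFast

wl = WordLevelBertTokenizer("./wordlevel/vocab.json")

# Wrap the underlying tokenizers.Tokenizer so its WhitespaceSplit
# pre-tokenizer survives, instead of RoBERTa's byte-level one.
tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=wl._tokenizer,
    cls_token="<s>",
    sep_token="</s>",
    pad_token="<pad>",
    mask_token="<mask>",
)
print(tokenizer.tokenize("120 7"))  # hoping for ['120', '7'], not single digits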
A simple test of batch_encode_plus shows the same behaviour:

t = [['1234']]
tokenizer.batch_encode_plus(t, is_split_into_words=True)

Output:

{'input_ids': [[1224, 2, 3, 4, 5, 1225]], 'attention_mask': [[1, 1, 1, 1, 1, 1]]}
This is not the output I want: the tokenizer has split the number into individual digits. Reading the ids against my vocab, 1224 is <s>, then 2, 3, 4, 5 are the digits '1', '2', '3', '4', and 1225 is </s>; I expected '1234' to be looked up as a single word-level token.
Here is a snippet of my vocab file:
{"0": 1, "1": 2, "2": 3, "3": 4, "4": 5, "5": 6, "6": 7, "7": 8, ... "1220": 1221, "1221": 1222, "<pad>": 1223, "<s>": 1224, "</s>": 1225, "<mask>": 1226}