How to force LineByLineTextDataset to split a text corpus by words instead of symbols

Published 2024-10-01 09:39:58


Following up on the issue https://github.com/huggingface/tokenizers/issues/244, I am trying to use a WordLevel tokenizer together with a roberta transformers model. My vocabulary contains numbers stored as strings, plus the special tokens. I am running into a problem: I can locate where it goes wrong, but I don't know how to fix it. The situation is as follows:

from transformers import RobertaTokenizerFast, LineByLineTextDataset

tokenizer = RobertaTokenizerFast.from_pretrained("wordlevel", max_len=num_secs_max)
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./optiver.txt",
    block_size=128,
)

I can see that LineByLineTextDataset splits the numbers into individual digits, which is wrong for my use case, and that this comes from how tokenizer.batch_encode_plus works. The suggestion I found was to pass the is_split_into_words=True parameter when constructing RobertaTokenizerFast, but I did not succeed with it. Please explain how I can get the corpus split by words rather than by symbols.
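For comparison, the underlying tokenizers-library WordLevel model with a WhitespaceSplit pre-tokenizer does keep whitespace-separated numbers whole. A minimal sketch with a toy vocabulary (not my real vocab.json) showing the behaviour I expect:

from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import WordLevel

# toy vocabulary: numbers stored as whole strings, plus an unknown token
toy_vocab = {"<unk>": 0, "1234": 1, "5": 2}
toy_tok = Tokenizer(WordLevel(toy_vocab, unk_token="<unk>"))
toy_tok.pre_tokenizer = pre_tokenizers.WhitespaceSplit()

print(toy_tok.encode("1234 5").tokens)  # ['1234', '5'] - whole words, not single digits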

Here are more details on the code used:

from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.models import WordLevel
from tokenizers.processors import BertProcessing
from tokenizers.implementations import BaseTokenizer
class WordLevelBertTokenizer(BaseTokenizer):
    """ WordLevelBertTokenizer
    Represents a simple word level tokenization for BERT.
    """

    def __init__(
        self,
        vocab_file: str,
    ):
        tokenizer = Tokenizer(WordLevel.from_file(vocab_file))
        tokenizer.pre_tokenizer = pre_tokenizers.WhitespaceSplit()
        
        if vocab_file is not None:
            sep_token_id = tokenizer.token_to_id(str("</s>"))
            if sep_token_id is None:
                raise TypeError("sep_token not found in the vocabulary")
            cls_token_id = tokenizer.token_to_id(str("<s>"))
            if cls_token_id is None:
                raise TypeError("cls_token not found in the vocabulary")

            tokenizer.post_processor = BertProcessing(
                (str("</s>"), sep_token_id), (str("<s>"), cls_token_id)
            )

        parameters = {
            "model": "WordLevel",
            "sep_token": "</s>",
            "cls_token": "<s>",
            "pad_token": "<pad>",
            "mask_token": "<mask>",
        }

        super().__init__(tokenizer, parameters)


from transformers import RobertaConfig

tokenizer = WordLevelBertTokenizer("./wordlevel/vocab.json")
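
# Quick sanity check (a sketch, not part of the original pipeline): the wrapper itself
# should split on whitespace, not into digits; "1", "2" and "3" are entries from the
# vocab.json snippet shown further below.
print(tokenizer.encode("1 2 3").tokens)  # expected: ['<s>', '1', '2', '3', '</s>']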

config = RobertaConfig(
    vocab_size=tokenizer.get_vocab_size(),
    max_position_embeddings=tokenizer.get_vocab_size(),
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("wordlevel", max_len=num_secs_max, add_prefix_space=True)

from transformers import RobertaForMaskedLM

model = RobertaForMaskedLM(config=config)

print(f'Num of model parameters = {model.num_parameters()}')

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./optiver.txt",
    block_size=128,
)

A simple test of batch_encode_plus is also possible:

t = [['1234']]
tokenizer.batch_encode_plus(t, is_split_into_words=True)

Output:

{'input_ids': [[1224, 2, 3, 4, 5, 1225]], 'attention_mask': [[1, 1, 1, 1, 1, 1]]}

It looks like this output is not what I want: the tokenizer splits the number into individual digits. Here is a snippet of my vocab file:

{"0": 1, "1": 2, "2": 3, "3": 4, "4": 5, "5": 6, "6": 7, "7": 8, ..., "1220": 1221, "1221": 1222, ...}

plus the four special tokens mapped to ids 1223-1226.
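
To see which tokens those ids actually correspond to (instead of guessing from the numbers), something like this should work with the fast tokenizer:

out = tokenizer.batch_encode_plus([['1234']], is_split_into_words=True)
print(tokenizer.convert_ids_to_tokens(out["input_ids"][0]))
# if the number is being split, this prints one token per digit plus the special tokens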


Tags: from, token, id, size, numbers, num, max, sep