"OSError: Model name './XX' was not found in tokenizers model name list": cannot load a custom tokenizer in Transformers

Posted 2024-09-28 20:49:32


I am trying to create my own tokenizer with SentencePiece, using my own dataset/vocabulary, and then use it with AlbertTokenizer from transformers.

I followed the Hugging Face tutorial on training a model from scratch very closely: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb#scrollTo=hO5M3vrAhcuj

    # Import relevant libraries
    from pathlib import Path
    from tokenizers import SentencePieceBPETokenizer
    from tokenizers.processors import BertProcessing
    from transformers import AlbertTokenizer

    # Collect every .txt file under ./data as training input
    paths = [str(x) for x in Path("./data").glob("**/*.txt")]

    # Initialize a tokenizer
    tokenizer = SentencePieceBPETokenizer(add_prefix_space=True)
    
    # Customize training
    tokenizer.train(files=paths, 
                    vocab_size=32000, 
                    min_frequency=2, 
                    show_progress=True,
                    special_tokens=['<unk>'],)

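    # Save the model (this writes vocab.json and merges.txt into the directory)
    tokenizer.save_model("Sent-AlBERT")

    # Reload the trained tokenizer from the saved files
    tokenizer = SentencePieceBPETokenizer(
        "./Sent-AlBERT/vocab.json",
        "./Sent-AlBERT/merges.txt",
    )

    tokenizer.enable_truncation(max_length=512)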

Everything works fine until I try to re-create the tokenizer in transformers:

    # Re-create our tokenizer in transformers
    tokenizer = AlbertTokenizer.from_pretrained("./Sent-AlBERT", do_lower_case=True)
  

This is the error message I keep getting:

OSError: Model name './Sent-AlBERT' was not found in tokenizers model name list (albert-base-v1, albert-large-v1, albert-xlarge-v1, albert-xxlarge-v1, albert-base-v2, albert-large-v2, albert-xlarge-v2, albert-xxlarge-v2). We assumed './Sent-AlBERT' was a path, a model identifier, or url to a directory containing vocabulary files named ['spiece.model'] but couldn't find such vocabulary files at this path or url.

For some reason, it works with RobertaTokenizerFast but not with AlbertTokenizer.
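A likely explanation, going by the error message, is a file-name mismatch: AlbertTokenizer looks for a file called spiece.model, while the directory only holds the vocab.json and merges.txt loaded earlier in the code. The sketch below is only a diagnostic under that assumption. The `EXPECTED_FILES` table and the `missing_files` helper are hypothetical names made up for illustration, and the RobertaTokenizerFast entry reflects my understanding of that class rather than anything stated in the question.

```python
from pathlib import Path

# File names each tokenizer class is assumed to look for in a local directory.
# The AlbertTokenizer entry is taken from the error message; the
# RobertaTokenizerFast entry is an assumption about that class.
EXPECTED_FILES = {
    "AlbertTokenizer": ["spiece.model"],
    "RobertaTokenizerFast": ["vocab.json", "merges.txt"],
}


def missing_files(directory, tokenizer_class):
    """Return the expected file names that are absent from `directory`."""
    root = Path(directory)
    return [
        name
        for name in EXPECTED_FILES[tokenizer_class]
        if not (root / name).is_file()
    ]


if __name__ == "__main__":
    # Report, per class, which expected files the saved directory lacks.
    for cls in EXPECTED_FILES:
        print(cls, "is missing:", missing_files("./Sent-AlBERT", cls))
```

If the directory contains only vocab.json and merges.txt, the helper reports spiece.model as missing for AlbertTokenizer and nothing missing for RobertaTokenizerFast, which would match the behaviour described.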

If anyone could give me a suggestion or any kind of guidance on how to use SentencePiece with AlbertTokenizer, I would really appreciate it.
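One direction that may be worth trying is to train with the sentencepiece package itself, so that the output is a spiece.model file of the kind the error message asks for. This is a rough sketch, not a verified fix. `build_trainer_args`, `train_spiece` and `load_albert_tokenizer` are hypothetical helper names, and the unigram model type and the special-token settings are assumptions modelled on how ALBERT vocabularies are usually built, so adjust them to your data.

```python
from pathlib import Path


def build_trainer_args(files, out_dir="./Sent-AlBERT", vocab_size=32000):
    """Build keyword arguments for sentencepiece's SentencePieceTrainer.train."""
    return {
        # Comma-separated list of training text files
        "input": ",".join(str(f) for f in files),
        # The trainer writes <model_prefix>.model and <model_prefix>.vocab
        "model_prefix": str(Path(out_dir) / "spiece"),
        "vocab_size": vocab_size,
        # Assumption: ALBERT-style vocabularies are unigram models
        "model_type": "unigram",
        # Assumption: ALBERT-style special tokens and ids
        "pad_id": 0,
        "unk_id": 1,
        "bos_id": -1,
        "eos_id": -1,
        "control_symbols": "[CLS],[SEP],[MASK]",
    }


def train_spiece(data_dir="./data", out_dir="./Sent-AlBERT", vocab_size=32000):
    """Train a SentencePiece model and return the path to spiece.model."""
    import sentencepiece as spm  # requires: pip install sentencepiece

    files = sorted(Path(data_dir).glob("**/*.txt"))
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    spm.SentencePieceTrainer.train(**build_trainer_args(files, out_dir, vocab_size))
    return Path(out_dir) / "spiece.model"


def load_albert_tokenizer(out_dir="./Sent-AlBERT"):
    """Load AlbertTokenizer from a directory that contains spiece.model."""
    from transformers import AlbertTokenizer

    return AlbertTokenizer.from_pretrained(out_dir, do_lower_case=True)
```

Calling `train_spiece()` and then `load_albert_tokenizer()` would be the intended order. The imports sit inside the functions so the argument builder can be inspected without either library installed.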

