ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.

Traceback (most recent call last):
  File "training_cross_data_2.py", line 240, in <module>
    training_data(f, root, testdir, dict_unc)
  File "training_cross_data_2.py", line 107, in training_data
    Xtrain_emb, mdlname = get_flaubert_layer(data)
  File "training_cross_data_2.py", line 40, in get_flaubert_layer
    tokenized = texte.apply((lambda x: flaubert_tokenizer.encode(x, add_special_tokens=True, max_length=512, truncation=True)))
  File "/home/getalp/kelodjoe/anaconda3/envs/env/lib/python3.6/site-packages/pandas/core/series.py", line 3848, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/lib.pyx", line 2329, in pandas._libs.lib.map_infer
  File "training_cross_data_2.py", line 40, in <lambda>
    tokenized = texte.apply((lambda x: flaubert_tokenizer.encode(x, add_special_tokens=True, max_length=512, truncation=True)))
  File "/home/anaconda3/envs/env/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 907, in encode
    **kwargs,
  File "/home/anaconda3/envs/env/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 1021, in encode_plus
    first_ids = get_input_ids(text)
  File "/home/anaconda3/envs/env/lib/python3.6/site-packages/transformers/tokenization_utils.py", line 1003, in get_input_ids
    "Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers."
ValueError: Input is not valid. Should be a string, a list/tuple of strings or a list/tuple of integers.

def get_flaubert_layer(texte):  # texte is a DataFrame that I load from an Excel file
    language_model_dir = os.path.expanduser(args.language_model_dir)
    lge_size = language_model_dir[16:-1]  # modify when on jean zay 27:-1
    print(lge_size)
    flaubert = FlaubertModel.from_pretrained(language_model_dir)
    flaubert_tokenizer = FlaubertTokenizer.from_pretrained(language_model_dir)
    tokenized = texte.apply((lambda x: flaubert_tokenizer.encode(x, add_special_tokens=True, max_length=512, truncation=True)))
    # Find the longest token sequence so that every row can be padded to it
    max_len = 0
    for i in tokenized.values:
        if len(i) > max_len:
            max_len = len(i)
    # Pad with 0 and mark real tokens with 1 in the attention mask
    padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized.values])
    attention_mask = np.where(padded != 0, 1, 0)

1 Answer

User

Answer #1 · Posted 2024-09-28 05:26:49

You may need to change this line:

tokenized = texte.apply((lambda x: flaubert_tokenizer.encode(x, add_special_tokens=True, max_length=512, truncation=True)))

to

tokenized = flaubert_tokenizer.encode(texte["verbatim"],
    add_special_tokens=True,
    max_length=512,
    truncation=True)

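This has two advantages:

You no longer pass a pandas row to the tokenizer (my guess is that this is what causes the error).
You no longer call the encode function once per row, which may speed up tokenization.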
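
If the guess above is right, the tokenizer is receiving something other than the types named in the error message. One common way this happens, assumed here rather than confirmed by the question, is that some cells are not strings (for example, empty Excel cells read as NaN). The following is a minimal sketch for finding and cleaning such values before tokenizing. The helper names `is_valid_tokenizer_input` and `clean_texts` and the sample cells are hypothetical; the check only mirrors the wording of the error message and is not the library's actual code.

```python
import math


def is_valid_tokenizer_input(x):
    """Hypothetical check mirroring the types listed in the error message."""
    if isinstance(x, str):
        return True
    # Empty sequences are rejected here as a conservative choice.
    if isinstance(x, (list, tuple)) and len(x) > 0:
        return all(isinstance(t, str) for t in x) or all(isinstance(t, int) for t in x)
    return False


def clean_texts(cells):
    """Drop missing values (None/NaN) and convert everything else to str."""
    cleaned = []
    for cell in cells:
        if cell is None:
            continue
        if isinstance(cell, float) and math.isnan(cell):
            continue
        cleaned.append(cell if isinstance(cell, str) else str(cell))
    return cleaned


# Made-up sample values standing in for one column of the spreadsheet.
cells = ["Bonjour le monde", float("nan"), None, 42]

# Values the tokenizer would reject according to the error message.
bad = [c for c in cells if not is_valid_tokenizer_input(c)]

# Values that are safe to hand to the tokenizer as plain strings.
texts = clean_texts(cells)
print(len(bad), texts)
```

Note that dropping rows changes the length of the data, so if labels are stored alongside the text, it may be safer to replace missing cells with an empty string instead of skipping them.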
