I am using a pre-trained BERT model (FlauBERT) to generate features that serve as input to several classifiers. The examples only show how to process a single sentence, but I have a whole file (a DataFrame) of about 40,000 sentences. Feeding the model all of them at once consumes a huge amount of memory, so I am looking for a way to pass small batches of sentences to the model without crashing the system or hitting an out-of-memory error. I came up with a solution that is supposed to process 2,000 rows at a time and, at the end, concatenate all the batches into a single numpy array. But for some reason I get the following error:
layer : <class 'numpy.ndarray'>
layer : <class 'numpy.ndarray'>
layer : <class 'numpy.ndarray'>
layer : <class 'numpy.ndarray'>
layer : <class 'numpy.ndarray'>
layer : <class 'numpy.ndarray'>
layer : <class 'numpy.ndarray'>
layer : <class 'numpy.ndarray'>
layer : <class 'numpy.ndarray'>
Traceback (most recent call last):
File "knn_case_1.py", line 83, in <module>
train_data_x_emb, mdl = get_flaubert_layer(train_data_x)
File "knn_case_1.py", line 45, in get_flaubert_layer
layer = flaubert(token_ids)
File "/ho/geta/kelod/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
result = self.forward(*input, **kwargs)
File "/ho/geta/kelod/anaconda3/lib/python3.7/site-packages/transformers/modeling_flaubert.py", line 176, in forward
assert lengths.max().item() <= slen
RuntimeError: invalid argument 1:cannot perform reduction function max on tensor with no elements because the operation does not have an identity
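If I read the traceback correctly, the failing line in modeling_flaubert.py calls lengths.max() on a tensor with no elements, which would mean a batch with zero rows somehow reached the model. A minimal snippet showing that max() on an empty tensor raises this kind of RuntimeError (the exact message depends on the PyTorch version; mine is the one in the traceback above):

import torch

# reducing an empty tensor with max() has no identity element,
# so PyTorch raises a RuntimeError
empty = torch.empty(0, dtype=torch.long)
try:
    empty.max()
except RuntimeError as e:
    print(e)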
Here is the script (for a reproducible environment, it imports the transformers and scikit-learn libraries):
import numpy as np
import pandas as pd
import torch
from transformers import FlaubertModel, FlaubertTokenizer

# read the train and test dataset
train_data = pd.read_csv('./corpus_train/corpus_ix_aug_FMC.csv', sep='\t')
#train_data = pd.read_csv('./corpus_train/corpus_or_et_aug_avec_all_FM')
train_data_x = train_data.verbatim
train_data_y = train_data.etiquette

def spliterate(buf, chunk):
    # yield consecutive slices of buf in steps of `chunk`
    for start in range(0, buf.size, chunk):
        yield buf[start:start + chunk]

def get_flaubert_layer(texte):
    modelname = "flaubert-base-cased"
    flaubert, log = FlaubertModel.from_pretrained(modelname, output_loading_info=True)
    flaubert_tokenizer = FlaubertTokenizer.from_pretrained(modelname, do_lowercase=False)
    # encode every sentence of the Series
    tokenized = texte.apply(lambda x: flaubert_tokenizer.encode(x, add_special_tokens=True, max_length=512))
    #tokenized = flaubert_tokenizer.encode(elt, add_special_tokens=True, max_length=512)
    # pad every sequence with zeros up to the longest one
    max_len = 0
    for i in tokenized.values:
        if len(i) > max_len:
            max_len = len(i)
    padded = np.array([i + [0] * (max_len - len(i)) for i in tokenized.values])
    print(padded.shape)
    print(padded)
    attention_mask = np.where(padded != 0, 1, 0)
    print(attention_mask)
    print(attention_mask.shape)
    # run the model chunk by chunk to limit memory usage
    last_layer_ = []
    for tmp in spliterate(padded, 2000):
        if len(tmp) != 0:
            token_ids = torch.tensor(tmp)
            with torch.no_grad():
                layer = flaubert(token_ids)
            layer = layer[0][:, 0, :].numpy()  # [CLS] embedding of each sentence
            print("layer :", type(layer))
            last_layer_.append(layer)
            #last_layer += layer
    print("last layer_ :", type(last_layer_))
    print(len(last_layer_))
    last_layer_np = np.stack(last_layer_, axis=0)
    print(len(last_layer_np))
    return last_layer_np, modelname

train_data_x_emb, mdl = get_flaubert_layer(train_data_x)
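To make the intent of spliterate explicit, here is a standalone toy version (a sketch only, with a renamed hypothetical helper; note that it steps over len(buf), the number of rows, whereas the script above steps over buf.size, which for a 2-D ndarray is the total element count):

import numpy as np

def spliterate_rows(buf, chunk):
    # intended behaviour: yield consecutive blocks of `chunk` rows
    for start in range(0, len(buf), chunk):  # len(buf) is the row count
        yield buf[start:start + chunk]

toy = np.arange(20).reshape(10, 2)  # 10 "sentences" of 2 "tokens" each
for tmp in spliterate_rows(toy, 3):
    print(tmp.shape)                # (3, 2), (3, 2), (3, 2), (1, 2)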
The padded variable looks like this:
[[ 1 1041 21565 ... 0 0 0]
[ 1 391 177 ... 0 0 0]
[ 1 150 14206 ... 0 0 0]
...
[ 1 150 5799 ... 0 0 0]
[ 1 59 48 ... 0 0 0]
[ 1 175 65 ... 0 0 0]]
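For context, the embeddings returned by get_flaubert_layer are meant to feed classifiers, roughly like this (a hypothetical sketch matching the script name knn_case_1.py; the random arrays stand in for train_data_x_emb and train_data_y):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# stand-ins: one 768-dim FlauBERT [CLS] vector per sentence, integer labels
X = np.random.rand(100, 768)
y = np.random.randint(0, 3, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))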