<p>I am currently writing a seminar paper on NLP, specifically on source-code function documentation summarization. For this I created my own dataset of roughly 64,000 samples (37,453 of which form the training set), and I want to fine-tune a BART model on it. I am using the simpletransformers package, which is built on top of Hugging Face transformers. My dataset is a pandas DataFrame.
A sample from my dataset:</p>
<p><a href="https://i.stack.imgur.com/AsNCG.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/AsNCG.png" alt="enter image description here"/></a></p>
<p>My code:</p>
<pre><code>train_df = pd.read_csv(train_path, index_col=0)
train_df.rename(columns={'text':'input_text', 'summary':'target_text'}, inplace=True)
# Logging
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)
# Hyperparameters
model_args = Seq2SeqArgs()
model_args.num_train_epochs = 10
# bart-base = 32, bart-large-cnn = 16
model_args.train_batch_size = 16
# model_args.no_save = True
# model_args.evaluate_generated_text = True
model_args.evaluate_during_training = True
model_args.evaluate_during_training_verbose = True
model_args.overwrite_output_dir = True
model_args.save_model_every_epoch = False
model_args.save_eval_checkpoints = False
model_args.save_optimizer_and_scheduler = False
model_args.save_steps = -1
best_model_dir = 'drive/MyDrive/outputs/bart-large-cnn/best_model/'
model_args.best_model_dir = best_model_dir
# Initialize model
model = Seq2SeqModel(
encoder_decoder_type="bart",
encoder_decoder_name="facebook/bart-base",
args=model_args,
use_cuda=True,
)
# Train the model
model.train_model(
train_df,
# eval_data=eval_df,
# matches=count_matches,
)
</code></pre>
<p>So far everything works fine, but I get the following error as soon as training starts:</p>
<p><a href="https://i.stack.imgur.com/F8RgB.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/F8RgB.png" alt="enter image description here"/></a></p>
<p>Here is the error I get when running in a Colab notebook:</p>
<pre><code>Exception in thread Thread-14:
Traceback (most recent call last):
  File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.7/multiprocessing/pool.py", line 470, in _handle_results
    task = get()
  File "/usr/lib/python3.7/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/reductions.py", line 287, in rebuild_storage_fd
    storage = cls._new_shared_fd(fd, size)
RuntimeError: unable to mmap 1024 bytes from file &lt;filename not specified&gt;: Cannot allocate memory (12)
</code></pre>
<p>One might assume I simply ran out of memory, but here is my system monitor about 3 seconds after the error occurred:</p>
<p><a href="https://i.stack.imgur.com/ADs7s.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/ADs7s.png" alt="enter image description here"/></a></p>
<p>And this is the lowest amount of free/available memory I observed between the start of training and the error:</p>
<p><a href="https://i.stack.imgur.com/U6r5w.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/U6r5w.png" alt="enter image description here"/></a></p>
<p>After a lot of tweaking I found that, for some reason, everything works fine as long as I train the model on a dataset of at most about 21,000 samples. It does not matter whether I train the &quot;base&quot; or the &quot;large-cnn&quot; version of BART; it only depends on the dataset size. The error always occurs during the &quot;creating features from dataset file at cache_dir/&quot; step.</p>
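<p>Since the crash happens inside a multiprocessing worker during the feature-caching step, one possible workaround (a sketch, not a confirmed fix: the flag names below are assumptions based on simpletransformers' model-args configuration and may differ between versions) is to disable multiprocessing entirely so feature conversion runs in the main process:</p>

```python
# Hypothetical workaround: turn off worker processes so feature creation
# and data loading happen in the main process. Verify these option names
# against your installed simpletransformers version.
workaround_flags = {
    "use_multiprocessing": False,                 # no pool for feature conversion
    "use_multiprocessing_for_evaluation": False,  # same for the eval pass
    "dataloader_num_workers": 0,                  # PyTorch DataLoader in main process
}

# Applied to the model_args from the question before building the model:
# for name, value in workaround_flags.items():
#     setattr(model_args, name, value)
print(workaround_flags)
```

<p>This trades speed for stability: feature creation becomes single-threaded, but the torch shared-memory file-descriptor passing that triggers the mmap error is avoided.</p>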
<p>What I have already tried:</p>
<ul>
<li><p>Added a lot of swap memory (as you can see in my system monitor screenshot)</p>
</li>
<li><p>Reduced the number of workers to 1</p>
</li>
<li><p>Raised the system's hard and soft limits for open files (ulimit -n) to 86000</p>
</li>
</ul>
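<p>For reference, the limits mentioned above can be double-checked from Python with the standard library (the <code>/proc</code> read is Linux-only):</p>

```python
import os
import resource

# Soft/hard limit on open file descriptors (what `ulimit -n` reports).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open files:", soft, hard)

# Kernel cap on memory-mapped regions per process (Linux only).
path = "/proc/sys/vm/max_map_count"
if os.path.exists(path):
    with open(path) as f:
        print("vm.max_map_count =", int(f.read()))
```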
<p><a href="https://i.stack.imgur.com/jHUVb.png" rel="nofollow noreferrer"><img src="https://i.stack.imgur.com/jHUVb.png" alt="enter image description here"/></a></p>
<p>I also tried to train the model in a Google Colab notebook, but ran into the same problem: training fails as soon as the dataset size exceeds roughly 21,000 samples. This happened even when I doubled the memory of my Colab session while keeping the dataset size only a little above the 21,000 limit.</p>
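<p>Given that training only fails above roughly 21,000 samples, one stopgap (a sketch, not a fix for the underlying mmap issue; treating repeated <code>train_model</code> calls on the same model object as incremental training is an assumption and changes the training schedule) would be to split the DataFrame into chunks below the threshold:</p>

```python
import pandas as pd

# Split a DataFrame into chunks below the ~21000-sample threshold.
def split_into_chunks(df, chunk_size=20000):
    return [df.iloc[i:i + chunk_size].reset_index(drop=True)
            for i in range(0, len(df), chunk_size)]

# Demo with dummy data shaped like the question's dataset.
demo = pd.DataFrame({"input_text": ["def f(): pass"] * 45000,
                     "target_text": ["a function"] * 45000})
chunks = split_into_chunks(demo)
print([len(c) for c in chunks])  # -> [20000, 20000, 5000]

# Hypothetical usage with the model from the question:
# for chunk in chunks:
#     model.train_model(chunk)
```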
<p>Setup:</p>
<ul>
<li><p>transformers 4.6.0</p></li>
<li><p>simpletransformers 0.61.4</p></li>
<li><p>Ubuntu 20.04.2 LTS</p></li>
</ul>
<p>I have been struggling with this problem for weeks now, and I would be very happy if any of you have an idea how I can solve it :)</p>
<p>(I am aware of this post: <a href="https://stackoverflow.com/questions/12667397/mmap-returns-can-not-allocate-memory-even-though-there-is-enough">mmap returns can not allocate memory, even though there is enough</a>. Unfortunately it could not solve my problem; my vm.max_map_count is 860000.)</p>