Minimal-memory way to save a large Python data structure to a database

Posted 2024-09-17 02:05:11


I need to save a very large Python data structure, made up of lists and dictionaries, to a MySQL database, but I get a MemoryError during the save operation.

I have benchmarked the save operation and tried several different ways of dumping the data, including binary formats, but every approach seems to consume a great deal of memory. The benchmarks:

Peak memory usage during the JSON save: 966.83 MB

                       json          pickle        msgpack
Dumped size            81.03 MB      66.79 MB      33.83 MB
Dump time              5.12 s        11.17 s       0.27 s
Load time              2.57 s        1.66 s        0.52 s
Peak memory (dump)     840.84 MB     1373.30 MB    732.67 MB
Peak memory (load)     921.41 MB     1481.25 MB    1006.12 MB
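
Figures like these can be gathered with the standard library's tracemalloc module. The following is only a sketch of such a measurement; the benchmark_dump helper is illustrative, not the code that produced the numbers above:

import json
import time
import tracemalloc

def benchmark_dump(data):
    # Track Python memory allocations while serializing
    tracemalloc.start()
    t0 = time.perf_counter()
    blob = json.dumps(data)
    elapsed = time.perf_counter() - t0
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print("size: %.2f MB, time: %.2f s, peak memory: %.2f MB"
          % (len(blob) / 2**20, elapsed, peak / 2**20))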

msgpack appears to be the best-performing library, but loading the data back still consumes a lot of memory. I also tried hickle, which is said to use very little memory, but the resulting file was 800 MB.
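
One way to cut msgpack's load-side memory, sketched here as a suggestion rather than something from the original post, is to pack each record separately and read them back incrementally with msgpack.Unpacker instead of unpacking one giant object; the file path and the handle function are illustrative:

import msgpack

records = [{"id": 1}, {"id": 2}]  # stand-in for the real list of dicts

# Write each record as its own msgpack object
with open('/tmp/data.msgpack', 'wb') as f:
    for record in records:
        f.write(msgpack.packb(record, use_bin_type=True))

# Unpacker yields one record at a time instead of the whole structure
with open('/tmp/data.msgpack', 'rb') as f:
    for record in msgpack.Unpacker(f, raw=False):
        handle(record)  # hypothetical per-record processing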

Does anyone have a suggestion? Should I simply raise the memory limit? Would MongoDB handle the save operation with less memory?

The stack trace is below:

Traceback (most recent call last):
  File "/opt/python/bundle/32/app/web_platform/kernel/kernel_worker/web_platform/call_kernel.py", line 139, in start_simulation
    simulation_job_object.save()
  File "/opt/python/bundle/32/app/web_platform/kernel/kernel_worker/web_platform/models.py", line 172, in save
    self.clean_fields()
  File "/opt/python/run/venv/local/lib/python3.6/site-packages/django/db/models/base.py", line 1223, in clean_fields
    setattr(self, f.attname, f.clean(raw_value, self))
  File "/opt/python/run/venv/local/lib/python3.6/site-packages/django/db/models/fields/__init__.py", line 630, in clean
    self.validate(value, model_instance)
  File "/opt/python/run/venv/local/lib/python3.6/site-packages/jsonfield/fields.py", line 54, in validate
    self.get_prep_value(value)
  File "/opt/python/bundle/32/app/web_platform/kernel/kernel_worker/web_platform/models.py", line 156, in get_prep_value
    return json.dumps(value, **self.encoder_kwargs)
  File "/usr/lib64/python3.6/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib64/python3.6/json/encoder.py", line 202, in encode
    return ''.join(chunks)
MemoryError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/python/run/venv/local/lib/python3.6/site-packages/background_task/tasks.py", line 43, in bg_runner
    func(*args, **kwargs)
  File "/opt/python/bundle/32/app/web_platform/kernel/kernel_worker/web_platform/call_kernel.py", line 157, in start_simulation
    simulation_job_object.save()
  File "/opt/python/bundle/32/app/web_platform/kernel/kernel_worker/web_platform/models.py", line 172, in save
    self.clean_fields()
  File "/opt/python/run/venv/local/lib/python3.6/site-packages/django/db/models/base.py", line 1223, in clean_fields
    setattr(self, f.attname, f.clean(raw_value, self))
  File "/opt/python/run/venv/local/lib/python3.6/site-packages/django/db/models/fields/__init__.py", line 630, in clean
    self.validate(value, model_instance)
  File "/opt/python/run/venv/local/lib/python3.6/site-packages/jsonfield/fields.py", line 54, in validate
    self.get_prep_value(value)
  File "/opt/python/bundle/32/app/web_platform/kernel/kernel_worker/web_platform/models.py", line 156, in get_prep_value
    return json.dumps(value, **self.encoder_kwargs)
  File "/usr/lib64/python3.6/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib64/python3.6/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib64/python3.6/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
MemoryError
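
Both traces end inside json's encoder: json.dumps builds the entire document as a single string (the ''.join(chunks) call), so the peak memory is the data plus a full text copy of it. When the target is a file rather than a database field, json.dump writes the encoder's chunks out as they are produced and avoids that second copy. A minimal sketch, with an illustrative file path; this does not by itself store the field in MySQL, it only shows where the dumps-based path runs out of memory:

import json

results = run_calculation()  # the large list/dict structure
with open('/tmp/results.json', 'w') as f:
    # json.dump iterates over encoder chunks and writes each one,
    # so the full JSON text never exists in memory at once
    json.dump(results, f)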

A sample of my code:

class Calculation(Model):
    name = db_models.CharField(max_length=120)
    # Saving this field runs json.dumps over the whole value
    # (see get_prep_value in the trace above)
    results = JsonNullField(blank=True, null=True)

results = run_calculation()
calculation = Calculation(name="calculation", results=results)
calculation.save()  # raises MemoryError while encoding results

1 Answer

User · #1 · Posted 2024-09-17 02:05:11

In essence, here is how I would reduce memory consumption and improve performance:

  1. Load the JSON file (as far as I know it cannot be streamed in Python)
  2. Split the array of dictionaries into smaller chunks
  3. Convert each chunk into model objects
  4. Call bulk_create
  5. Garbage-collect after every loop iteration

import json
import gc

from myapp.models import MyModel

filename = '/path/to/data.json'

# Load the whole file once; json.load cannot stream
with open(filename, 'r') as f:
    data = json.load(f)

chunk_size = 100
while data:
    # Slice off the next chunk and drop it from the source list
    chunk = data[:chunk_size]
    data = data[chunk_size:]
    # Build unsaved model instances and insert them in a single query
    chunk = [MyModel(**x) for x in chunk]
    MyModel.objects.bulk_create(chunk)
    # Reclaim memory from the processed chunk
    gc.collect()

You can tune chunk_size to trade performance against memory consumption.
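
Step 1 above says the standard library cannot stream-parse the file; for what it's worth, the third-party ijson package can, when the top level of the JSON document is an array. A sketch under the same assumptions as the code above; ijson is an extra dependency, not part of the original answer:

import gc
import ijson  # third-party: pip install ijson

from myapp.models import MyModel

chunk = []
chunk_size = 100
with open('/path/to/data.json', 'rb') as f:
    # 'item' yields each element of the top-level array lazily,
    # so the whole file is never held in memory at once
    for item in ijson.items(f, 'item'):
        chunk.append(MyModel(**item))
        if len(chunk) >= chunk_size:
            MyModel.objects.bulk_create(chunk)
            chunk = []
            gc.collect()
if chunk:
    MyModel.objects.bulk_create(chunk)  # flush the remainder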
