Why does a dictionary of data take twice as much memory as the same data on disk?

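import json
import os
from collections import defaultdict

def load_by_geohash(file, specificity=7):
    # Group the JSON-lines updates by their geohash prefix.
    results = defaultdict(list)
    filename = os.path.join(DATADIR, file)  # DATADIR is not shown in the question
    with open(filename, 'r') as f:
        updates = (json.loads(line) for line in f)
        for update in updates:
            geo_hash = update['geohash'][:specificity]
            results[geo_hash].append(update)
    return results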

1 Answer

User

#1 · Posted 2024-05-02 21:06:06

Yes, that can happen easily. Consider a simple list of strings:

>>> import json
>>> from sys import getsizeof
>>> x = ['a string', 'another string', 'yet another']
>>> sum(map(getsizeof, x)) + getsizeof(x)
268
>>> len(json.dumps(x).encode())
45

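In Python, everything is an object, so every (well, almost every) individual object carries some fixed overhead. Note the size of an empty string on my system: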

>>> getsizeof('')
49

For dict objects the difference is even larger. Consider:

>>> d
{'a': 'a string', 'b': 'another string', 'c': 'yet another'}
>>> sum(map(getsizeof, d)) + sum(map(getsizeof, d.values())) + getsizeof(d)
570
>>> len(json.dumps(d).encode())
60

For an empty dict the gap is especially dramatic:

>>> getsizeof({}), len(json.dumps({}).encode())
(240, 2)
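
The sums above add up getsizeof by hand, because getsizeof only reports the size of the container itself and not of the objects it refers to. As a rough sketch, a small recursive helper can do the same bookkeeping for nested dicts and lists. The name deep_getsizeof is hypothetical (it is not part of the standard library), and the result is only an estimate that depends on the Python version and platform:

```python
from sys import getsizeof

def deep_getsizeof(obj, seen=None):
    """Roughly estimate the memory held by obj and everything it contains."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        # Count an object that is referenced several times only once.
        return 0
    seen.add(id(obj))
    size = getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size

d = {'a': 'a string', 'b': 'another string', 'c': 'yet another'}
total = deep_getsizeof(d)
print(total)
```

Calling it on the results dict returned by load_by_geohash would give a comparable estimate for the whole structure, which can then be set against the size of the file on disk.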

Now, there are various options for storing the data more compactly, but which one fits depends on your use case.

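Here is a related question about the memory usage of many dictionaries. It also includes an example of storing data more compactly with numpy arrays and namedtuple objects. Note that namedtuple objects may be all you need, and the memory savings can be huge, because you no longer store the actual string objects for the keys in every record. If the structure of the sub-dictionaries is regular, I would suggest replacing those nested update dicts with nested namedtuple objects.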
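
As a minimal sketch of that suggestion, the snippet below compares the container overhead of one record stored as a dict and as a namedtuple. The field names (geohash, user, text) and the sample values are assumptions for illustration only; the real update records are not shown in the question.

```python
from collections import namedtuple
from sys import getsizeof

# Hypothetical record layout; the real update dicts may have other fields.
Update = namedtuple('Update', ['geohash', 'user', 'text'])

raw = {'geohash': '9q8yyk8', 'user': 'alice', 'text': 'hello'}

as_dict = dict(raw)
as_tuple = Update(**raw)

# Shallow sizes only: both containers refer to the same value objects,
# so the difference is the overhead of the container itself.
dict_size = getsizeof(as_dict)
tuple_size = getsizeof(as_tuple)
print(dict_size, tuple_size)
```

Fields are still accessible by name (as_tuple.geohash), so code that only reads the records needs little change. The exact numbers vary between Python versions, so it is worth measuring on your own data before converting everything.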
