What is taking up so much memory?

Posted 2024-06-24 13:08:06


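I have a simple piece of code that reads a CSV file, looks for duplicates based on the first two columns, writes the duplicates to a second CSV, and keeps the unique rows in a third CSV.

I am using a set: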

import csv

def my_func():
    area = "W09"

    inf = r'f:\JDo\Cleaned\_merged\\'+ area +'.csv'
    out  = r'f:\JDo\Cleaned\_merged\no_duplicates\\'+area+'_no_duplicates.csv'
    out2 = r'f:\JDo\Cleaned\_merged\duplicates\\'+area+"_duplicates.csv"



    #i = 0
    seen = set()

    with open(inf, 'r') as infile, open(out, 'w') as outfile1, open(out2, 'w') as outfile2:
        reader = csv.reader(infile, delimiter=" ")
        writer1 = csv.writer(outfile1, delimiter=" ")
        writer2 = csv.writer(outfile2, delimiter=" ")
        for row in reader:
            x, y = row[0], row[1]

            x = float(x)
            y = float(y)

            if (x, y) in seen:

                writer2.writerow(row)
                continue
            seen.add((x, y))
            writer1.writerow(row)



    seen.clear()

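I thought a set would be the best choice, but the set ends up roughly seven times the size of the input file. The input CSV files range from 140 MB to 50 GB, and RAM usage ranges from 1 GB to almost 400 GB (the server I am using has 768 GB of RAM).

I also ran a memory profiler on a small sample: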

Line #    Mem usage    Increment   Line Contents

 8   21.289 MiB   21.289 MiB   @profile
 9                             def my_func():
10   21.293 MiB    0.004 MiB       area = "W10"
11
12   21.293 MiB    0.000 MiB       inf = r'f:\JDo\Cleaned\_merged\\'+ area +'.csv'
13   21.293 MiB    0.000 MiB       out  = r'f:\JDo\Cleaned\_merged\no_duplicates\\'+area+'_no_duplicates.csv'
14   21.297 MiB    0.004 MiB       out2 = r'f:\JDo\Cleaned\_merged\duplicates\\'+area+"_duplicates.csv"
15
16
17
18                                 #i = 0
19   21.297 MiB    0.000 MiB       seen = set()
20
21   21.297 MiB    0.000 MiB       with open(inf, 'r') as infile, open(out,'w') as outfile1, open(out2, 'w') as outfile2:
22   21.297 MiB    0.000 MiB           reader = csv.reader(infile, delimiter=" ")
23   21.297 MiB    0.000 MiB           writer1 = csv.writer(outfile1, delimiter=" ")
24   21.297 MiB    0.000 MiB           writer2 = csv.writer(outfile2, delimiter=" ")
25 1089.914 MiB   -9.008 MiB           for row in reader:
26 1089.914 MiB   -7.977 MiB               x, y = row[0], row[1]
27
28 1089.914 MiB   -6.898 MiB               x = float(x)
29 1089.914 MiB  167.375 MiB               y = float(y)
30
31 1089.914 MiB  166.086 MiB               if (x, y) in seen:
32                                             #z = line.split(" ",3)[-1]
33                                             #if z == "5284":
34                                             #    print X, Y, z
35
36 1089.914 MiB    0.004 MiB                   writer2.writerow(row)
37 1089.914 MiB    0.000 MiB                   continue
38 1089.914 MiB  714.102 MiB               seen.add((x, y))
39 1089.914 MiB   -9.301 MiB               writer1.writerow(row)
40
41
42
43  690.426 MiB -399.488 MiB       seen.clear()

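Is something wrong here? Is there a faster way to filter the results, or a way that uses less memory?

CSV sample: these are GeoTIFFs converted to CSV files, so each row holds X, Y and a value.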

    475596 101832 4926
    475626 101832 4926
    475656 101832 4926
    475686 101832 4926
    475716 101832 4926
    475536 101802 4926
    475566 101802 4926
    475596 101802 4926
    475626 101802 4926
    475656 101802 4926
    475686 101802 4926
    475716 101802 4926
    475746 101802 4926
    475776 101802 4926
    475506 101772 4926
    475536 101772 4926
    475566 101772 4926
    475596 101772 4926
    475626 101772 4926
    475656 101772 4926
    475686 101772 4926
    475716 101772 4926
    475746 101772 4926
    475776 101772 4926
    475806 101772 4926
    475836 101772 4926
    475476 101742 4926
    475506 101742 4926

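Edit: I tried the solution provided by Jean: https://stackoverflow.com/a/49008391/9418396

On my small 140 MB CSV, the set is now about half its previous size, which is a good improvement. I will try running it on the larger data and see how it does. I cannot really run the profiler on that data, because the profiler increases the execution time enormously.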

Line #    Mem usage    Increment   Line Contents

 8   21.273 MiB   21.273 MiB   @profile
 9                             def my_func():
10   21.277 MiB    0.004 MiB       area = "W10"
11
12   21.277 MiB    0.000 MiB       inf = r'f:\JDo\Cleaned\_merged\\'+ area +'.csv'
13   21.277 MiB    0.000 MiB       out  = r'f:\JDo\Cleaned\_merged\no_duplicates\\'+area+'_no_duplicates.csv'
14   21.277 MiB    0.000 MiB       out2 = r'f:\JDo\Cleaned\_merged\duplicates\\'+area+"_duplicates.csv"
15
16
17   21.277 MiB    0.000 MiB       seen = set()
18
19   21.277 MiB    0.000 MiB       with open(inf, 'r') as infile, open(out,'w') as outfile1, open(out2, 'w') as outfile2:
20   21.277 MiB    0.000 MiB           reader = csv.reader(infile, delimiter=" ")
21   21.277 MiB    0.000 MiB           writer1 = csv.writer(outfile1, delimiter=" ")
22   21.277 MiB    0.000 MiB           writer2 = csv.writer(outfile2, delimiter=" ")
23  451.078 MiB -140.355 MiB           for row in reader:
24  451.078 MiB -140.613 MiB               hash = float(row[0])*10**7 + float(row[1])
25                                         #x, y = row[0], row[1]
26
27                                         #x = float(x)
28                                         #y = float(y)
29
30                                         #if (x, y) in seen:
31  451.078 MiB   32.242 MiB               if hash in seen:
32  451.078 MiB    0.000 MiB                   writer2.writerow(row)
33  451.078 MiB    0.000 MiB                   continue
34  451.078 MiB   78.500 MiB               seen.add((hash))
35  451.078 MiB -178.168 MiB               writer1.writerow(row)
36
37  195.074 MiB -256.004 MiB       seen.clear()

1 answer

Answer 1 · posted 2024-06-24 13:08:06

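You can create your own hash function so that, instead of storing a tuple of floats, you store a single value that combines the two floats in a unique way.

Assuming the coordinates cannot exceed 10 million (maybe you can even go down to 1 million), you can do: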

hash = x*10**7 + y

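(This acts like a kind of logical "OR" on the floats: x goes into the high digits and y into the low digits. Since the values are bounded, with 0 <= y < 10**7, x and y cannot get mixed up.)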
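To make the "no mix-up" claim concrete, here is a minimal sketch that packs two coordinates into one key and recovers them again. The helper names `pack_xy` and `unpack_xy` and the range check are illustrative assumptions, not part of the original answer; the sketch also assumes non-negative, whole-number coordinates like those in the sample data.

```python
def pack_xy(x, y, scale=10**7):
    """Combine two non-negative coordinates below `scale` into one float key."""
    if not (0 <= x < scale and 0 <= y < scale):
        raise ValueError("coordinates must lie in [0, scale)")
    return x * scale + y


def unpack_xy(key, scale=10**7):
    """Recover (x, y) from a key built by pack_xy (whole-number coordinates)."""
    x, y = divmod(key, scale)
    return x, y


key = pack_xy(475596.0, 101832.0)
print(key)             # 4755960101832.0
print(unpack_xy(key))  # (475596.0, 101832.0)
```

Because the key can be unpacked back into the original pair, two different pairs in the allowed range cannot share a key. A negative or out-of-range coordinate would break that guarantee, which is why the sketch rejects it.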

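Then put hash in your set instead of the tuple of floats. With values up to around 10**14 there is no risk of float absorption, as a quick check shows: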

>>> 10**14+1.5
100000000000001.5
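The check above can be made a little more systematic. This sketch assumes IEEE 754 doubles (what CPython uses on common platforms) and Python 3.9 or later for `math.ulp`; the variable names are illustrative. It shows that the largest possible key stays below 2**53, so whole-number coordinates are stored exactly, and it prints the spacing between adjacent floats at that magnitude.

```python
import math
import sys

scale = 10**7
# Largest key when both coordinates are whole numbers below `scale`.
max_key = float(scale - 1) * scale + float(scale - 1)

# Whole numbers up to 2**53 are exactly representable as doubles.
print(sys.float_info.mant_dig)  # 53 on IEEE 754 platforms
print(max_key < 2**53)          # True

# Gap between adjacent floats near the largest key (Python 3.9+).
print(math.ulp(max_key))        # 0.015625
```

So whole-number coordinates are safe, while fractional coordinates that differ by less than about 0.016 near the top of the range could collapse into the same key.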

The loop then becomes:

    for row in reader:
        hash = float(row[0])*10**7 + float(row[1])

        if hash in seen:
            writer2.writerow(row)
            continue
        seen.add(hash)
        writer1.writerow(row)
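For reference, the loop can be wrapped into a self-contained function that is easy to try on a few rows. This is only a sketch: the function name `split_duplicates`, its parameters, and the in-memory `io.StringIO` files are assumptions made for illustration, and the key is named `key` rather than `hash` to avoid shadowing the built-in.

```python
import csv
import io


def split_duplicates(infile, unique_out, duplicate_out, scale=10**7):
    """Write the first occurrence of each (x, y) to unique_out, repeats to duplicate_out."""
    reader = csv.reader(infile, delimiter=" ")
    unique_writer = csv.writer(unique_out, delimiter=" ", lineterminator="\n")
    duplicate_writer = csv.writer(duplicate_out, delimiter=" ", lineterminator="\n")
    seen = set()
    for row in reader:
        # One float per point instead of a tuple of two floats.
        key = float(row[0]) * scale + float(row[1])
        if key in seen:
            duplicate_writer.writerow(row)
            continue
        seen.add(key)
        unique_writer.writerow(row)
    return len(seen)


sample = io.StringIO("475596 101832 4926\n475626 101832 4926\n475596 101832 5284\n")
unique, duplicates = io.StringIO(), io.StringIO()
count = split_duplicates(sample, unique, duplicates)
print(count)                  # 2
print(duplicates.getvalue())  # 475596 101832 5284
```

With real files, the three arguments would be the handles opened in the `with` statement of the original function.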

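A single float, even a large one (floats have a fixed size), takes at least 2 to 3 times less memory than a tuple of 2 floats. Note that sys.getsizeof on a tuple does not count the float objects the tuple refers to, so the real difference is larger than the figures shown here. On my machine:

>>> import sys
>>> sys.getsizeof((0.44,0.2))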
64
>>> sys.getsizeof(14252362*10**7+35454555.0)
24
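Since `sys.getsizeof` reports only the container itself, a fairer comparison adds the two float objects held by the tuple. The helper `tuple_entry_size` below is a hypothetical name introduced for this sketch; the exact byte counts depend on the Python version and platform, so no specific output is shown.

```python
import sys


def tuple_entry_size(point):
    """Bytes for a tuple plus the float objects it refers to."""
    return sys.getsizeof(point) + sum(sys.getsizeof(v) for v in point)


point = (475596.0, 101832.0)
key = point[0] * 10**7 + point[1]

print(tuple_entry_size(point))  # tuple object plus two float objects
print(sys.getsizeof(key))       # a single float object
```

The set's own table overhead is not included in either figure; this only compares what is stored per entry.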
