比较python中两个巨大csv文件的最快方法（numpy）

2条回答

网友

1楼 · 编辑于 2024-10-06 14:21:10

我将给出一些Python伪代码的一般解决方案。你要解决的是这本书"Programming Pearls" by Jon Bentley中的经典问题。在

只需一个简单的位数组就可以非常有效地解决这个问题，因此我的评论是，电话号码有多长（有多少位数）。在

假设电话号码的长度最多为10位，而您可以拥有的最大电话号码是：9 999 999 999（使用空格来提高可读性）。在这里，我们可以使用每个数字1位来标识数字是否在集合中（位分别设置或未设置），因此我们将使用9 999 999 999位来标识每个数字，即：

bits[0]标识号码0 000 000 000
bits[193]标识号码0 000 000 193
{6567}将由

为此，我们需要预先分配9 999 999 999位，初始设置为0，即：9 999 999 999 / 8 / 1024 / 1024=大约1.2GB内存。在

我认为在末尾保留数字的交集将比位表示占用更多的空间=>；最多存储600k个int=>；64bit * 600k=大约4.6gb（actually int is not stored that efficiently and might use much more），如果这些是字符串，则可能会导致更多的内存需求。在

从CSV文件（逐行或缓冲文件读取器）解析电话号码字符串，将其转换为数字，比进行固定时间内存查找要快得多，这比处理字符串和合并字符串要快。不幸的是，我没有这些电话号码文件要测试，但我想听听你的发现。在

from bitstring import BitArray


max_number = 9999999999

found_phone_numbers = BitArray(length=max_number+1)


# replace this function with the file open function and retrieving
#  the next found phone number
def number_from_file_iteator(dummy_data):
    for number in dummy_data:
        yield number


def calculate_intersect():
    # should be open a file1 and getting the generator with numbers from it
    #  we use dummy data here
    for number in number_from_file_iteator([1, 25, 77, 224322323, 8292, 1232422]):
        found_phone_numbers[number] = True

    # open second file and check if the number is there
    for number in number_from_file_iteator([4, 24, 224322323, 1232422, max_number]):
        if found_phone_numbers[number]:
            yield number


number_intersection = set(calculate_intersect())

print number_intersection

我使用了来自bitstring pip package的BitArray，初始化整个位串大约需要2秒。之后，扫描文件将使用恒定内存。{13}用于存储项目。在

注意1:这个算法可以修改为只使用list。在这种情况下，一旦位号匹配，就必须重置第二个循环，这样重复项就不再匹配了。在

注意2:存储在set/list中时会出现延迟，因为我们在第二个for循环中使用了生成器。运行时复杂性是线性的，即O(N)。在

网友

2楼 · 编辑于 2024-10-06 14:21:10

将60万个电话号码读入set。在
逐行输入较大的文件，对照集合检查每一行。在
立即将匹配项写入输出文件。在

这样你就不必一次将所有数据加载到内存中。在

相关问题更多 >

编程相关推荐

热门问题

热门文章