比较python中两个巨大csv文件的最快方法（numpy）问题的回答

比较python中两个巨大csv文件的最快方法（numpy）

回答此问题可获得 20 贡献值，回答如果被采纳可获得 50 分。

0 条评论
分类：Python问答

默认排序时间排序

1 个回答

匿名 1天前

　擅长：python、mysql、java

我将给出一些Python伪代码的一般解决方案。你要解决的是这本书<a href="https://rads.stackoverflow.com/amzn/click/com/0201657880" rel="nofollow noreferrer">"Programming Pearls" by Jon Bentley</a>中的经典问题。在 只需一个简单的位数组就可以非常有效地解决这个问题，因此我的评论是，电话号码有多长（有多少位数）。在 假设电话号码的长度最多为10位，而您可以拥有的最大电话号码是：<code>9 999 999 999</code>（使用空格来提高可读性）。在这里，我们可以使用每个数字1位来标识数字是否在集合中（位分别设置或未设置），因此我们将使用<code>9 999 999 999</code>位来标识每个数字，即： <ul> <li><code>bits[0]</code>标识号码<code>0 000 000 000</code></li> <li><code>bits[193]</code>标识号码<code>0 000 000 193</code></li> <li>{6567}将由</li> </ul> 为此，我们需要预先分配<code>9 999 999 999</code>位，初始设置为<code>0</code>，即：<code>9 999 999 999 / 8 / 1024 / 1024</code>=大约1.2GB内存。在 我认为在末尾保留数字的交集将比位表示占用更多的空间=&gt；最多存储600k个int=&gt；<code>64bit * 600k</code>=大约4.6gb（<a href="https://stackoverflow.com/a/14329864/98693">actually int is not stored that efficiently and might use much more</a>），如果这些是字符串，则可能会导致更多的内存需求。在 从CSV文件（逐行或缓冲文件读取器）解析电话号码字符串，将其转换为数字，比进行固定时间内存查找要快得多，这比处理字符串和合并字符串要快。不幸的是，我没有这些电话号码文件要测试，但我想听听你的发现。在 <pre><code>from bitstring import BitArray max_number = 9999999999 found_phone_numbers = BitArray(length=max_number+1) # replace this function with the file open function and retrieving # the next found phone number def number_from_file_iteator(dummy_data): for number in dummy_data: yield number def calculate_intersect(): # should be open a file1 and getting the generator with numbers from it # we use dummy data here for number in number_from_file_iteator([1, 25, 77, 224322323, 8292, 1232422]): found_phone_numbers[number] = True # open second file and check if the number is there for number in number_from_file_iteator([4, 24, 224322323, 1232422, max_number]): if found_phone_numbers[number]: yield number number_intersection = set(calculate_intersect()) print number_intersection </code></pre> 我使用了来自<a href="https://pypi.python.org/pypi/bitstring/3.1.3" rel="nofollow noreferrer">bitstring pip package</a>的<code>BitArray</code>，初始化整个位串大约需要2秒。之后，扫描文件将使用恒定内存。{13}用于存储项目。在 注意1:这个算法可以修改为只使用<code>list</code>。在这种情况下，一旦位号匹配，就必须重置第二个循环，这样重复项就不再匹配了。在 注意2:存储在<code>set</code>/<code>list</code>中时会出现延迟，因为我们在第二个for循环中使用了生成器。运行时复杂性是线性的，即<code>O(N)</code>。在

比较python中两个巨大csv文件的最快方法（numpy）

1 个回答

相关Python问题