Python: looping over two CSV files to compare the number of duplicate entries in each

import csv

cred = open("AllCredits.csv", "r")
creader = csv.reader(cred)
pur = open("AllPurchases.csv", "r")
preader = csv.reader(pur)
out = open("output.txt", "r+")

for row in creader:
    tn = ...       # current phone number
    crednum = ...  # number of rows with that phone number
    for row in preader:
        purnum = ...  # number of rows with that phone number
    if crednum != 2 * purnum:
        out.write(str(tn) + "\n")

cred.close()
pur.close()
out.close()

3 Answers

User

Answer 1 · edited 2024-10-02 10:21:28

import csv

cred = open("AllCredits.csv", "r")
creader = csv.reader(cred)

pur = open("AllPurchases.csv", "r")
preader = csv.reader(pur)

out = open("output.txt", "w")  # "w" creates the file if it is missing and discards old contents

def x(reader):  # takes a csv reader and counts how often each number appears
    dictionary = {}  # a Python data type that holds key/value pairs
    for row in reader:  # for each row in the reader
        number = row[0]  # take the first element in the row (the number)
        if number == 'TN':  # skip the header row
            continue
        number = int(number)  # convert after the header check, since 'TN' cannot be converted
        if number in dictionary:  # the number has already been seen
            dictionary[number] = dictionary[number] + 1  # increment its count
        else:
            dictionary[number] = 1  # first occurrence: store a count of 1
    return dictionary  # maps each number to its row count

def assertDoubles(credits, purchases):
    outstr = ''
    for key in credits:
        crednum = credits[key]
        # .get(key, 0) avoids a KeyError for numbers that have no purchases
        if crednum != 2 * purchases.get(key, 0):
            outstr += str(key) + '\n'
            print(key)
    out.write(outstr)

credits = x(creader)
purchases = x(preader)

assertDoubles(credits,purchases)


#print(credits)
#print('-------')
#print(purchases)

cred.close()
pur.close()
out.close()

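I wrote some code. Essentially, it stores each number you want to check for duplicates as a key in a dictionary. The stored value is the number of times that number appears in the file. It skips the first row (the header).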
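The same counting idea can be sketched more compactly with `collections.Counter`. This is only an illustration on in-memory rows, not the answer's code: the function name `mismatched_numbers` and the row lists are assumptions made up for the example.

```python
from collections import Counter


def mismatched_numbers(credit_rows, purchase_rows):
    """Return numbers whose credit count is not twice their purchase count.

    Hypothetical helper: each row is a list whose first element is the number.
    """
    credits = Counter(row[0] for row in credit_rows)
    purchases = Counter(row[0] for row in purchase_rows)
    # A Counter returns 0 for missing keys, so numbers without purchases
    # need no special handling.
    return [tn for tn in credits if credits[tn] != 2 * purchases[tn]]
```

Because a missing key counts as zero, a number that appears only in the credits file is reported rather than raising a `KeyError`.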

The updated code above outputs only:

3654

Edit: I updated the code to fix the issue you pointed out.

User

Answer 2 · edited 2024-10-02 10:21:28

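Since you are not interested in new entries, you only need to run through the first file and collect every entry in its first column (counting them as you go), then run through the second file and check whether each first-column entry was collected in the first step, counting it if so. You cannot avoid the loops needed to read every line of both files, but you can use a hashmap (dict) for fast lookups, so: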

import csv
import collections

c_phones = collections.defaultdict(int)  # initiate a 'counter' dict to save us some typing

with open("AllCredits.csv", "r") as f:  # open the file for reading
    reader = csv.reader(f)  # create a CSV reader
    next(reader)  # skip the first row (header)
    for row in reader:  # iterate over the rest
        c_phones[row[0]] += 1  # increase the count of the current phone

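Now that you have counted every phone number from the first file in the c_phones dictionary, you should clone it with the counters reset, so that you can count how often those numbers occur in the second CSV file: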

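The snippet for this step did not survive in the page, so what follows is only a sketch of what the paragraph describes, not the answer's original code. The function name `count_known_in_second_file` is an assumption; the result plays the role of `p_phones` in the loop further down.

```python
import csv


def count_known_in_second_file(c_phones, path):
    """Count occurrences in the CSV at `path` of numbers already in c_phones.

    Hypothetical helper reconstructing the described step.
    """
    p_phones = {key: 0 for key in c_phones}  # same keys, counters reset to zero
    with open(path, "r", newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row, if there is one
        for row in reader:
            # Ignore blank lines and numbers that never appeared in the first file.
            if row and row[0] in p_phones:
                p_phones[row[0]] += 1
    return p_phones
```

Under that assumption it would be called as `p_phones = count_known_in_second_file(c_phones, "AllPurchases.csv")`.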

You now have two dictionaries holding the two sets of counts, so you can easily iterate over them and print the counts:

for key in c_phones:
    print("{:<15} Credits: {:<4} Purchases: {:<4}".format(key, c_phones[key], p_phones[key]))

With your sample data, this produces:

3654            Credits: 1    Purchases: 1   
2476            Credits: 2    Purchases: 1

User

Answer 3 · edited 2024-10-02 10:21:28

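To make this easier to follow, I'll break the problem into smaller, more manageable tasks:

Read the phone numbers from the first column of the two sorted CSV files.
Find the duplicate numbers that appear in both lists of phone numbers.

Reading the phone numbers is a reusable piece of functionality, so we split it out: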

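import csv

def read_phone_numbers(file_path):
    # Collect the first column of every row; the with block closes the file
    # even if reading fails part-way through.
    phone_numbers = []
    with open(file_path, 'r', newline='') as file_obj:
        for row in csv.reader(file_obj):
            if row:  # skip blank lines, which csv.reader yields as []
                phone_numbers.append(row[0])
    return phone_numbers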

For the task of finding duplicates, a set is a useful tool. From the Python docs:

A set is an unordered collection with no duplicate elements.

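The definition of `find_duplicates` is missing from the page. The sketch below is one plausible shape for it, not the author's original: it assumes the function returns one dict per shared number, with keys matching the `fieldnames` passed to `csv.DictWriter` in `main`.

```python
from collections import Counter


def find_duplicates(credit_nums, purchase_nums):
    """Return one row per phone number that appears in both lists.

    Assumed return shape: dicts keyed by phone_number, credit_count
    and purchase_count.
    """
    credit_counts = Counter(credit_nums)
    purchase_counts = Counter(purchase_nums)
    shared = set(credit_nums) & set(purchase_nums)  # set intersection
    return [
        {
            "phone_number": number,
            "credit_count": credit_counts[number],
            "purchase_count": purchase_counts[number],
        }
        for number in sorted(shared)  # sorted for a stable output order
    ]
```

The set intersection finds the shared numbers cheaply, while the two `Counter` objects supply the per-file counts that a set alone would discard.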

Putting it all together:

def main(credit_csv_path, purchase_csv_path, out_csv_path):
    credit_nums = read_phone_numbers(credit_csv_path)
    purchase_nums = read_phone_numbers(purchase_csv_path)
    duplicates = find_duplicates(credit_nums, purchase_nums)

    with open(out_csv_path, 'w') as file_obj:
        writer = csv.DictWriter(
            file_obj,
            fieldnames=['phone_number', 'credit_count', 'purchase_count'],
        )
        writer.writerows(duplicates)

If you need to process files hundreds of times larger, take a look at the ^{} module.
