How do I modify duplicate fields in a CSV file?

Posted 2024-10-01 13:41:06

I want to modify the email field in a CSV file, for example mycsv_file.csv:

john@gmail.com
mary@gmail.com
klarck@gmail.com
ralf@gmail.com
john@gmail.com
mary@gmail.com
klarck@gmail.com

My code for reading the CSV file so far:

import csv

with open('mycsv_file.csv', 'r') as csv_file: 
     spamreader = csv.reader(csv_file)
     for line in spamreader:
         ord = next.spamreader
         for k in ored:       
            if line[0]==k[0]:
               line[0]==????

The result I want:

john@gmail.com
mary@gmail.com
klarck@gmail.com
ralf@gmail.com
john1@gmail.com
mary1@gmail.com
klarck1@gmail.com

3 answers

Read, check, and write to a new file in a single loop.

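from csv import reader, writer

names = []  # local parts (the text before "@") seen so far
# Using "@" as the delimiter splits each line into (name, domain).
with open("Emails", newline="") as fin, open("Emails_New", "w", newline="") as fout:
    spamreader = reader(fin, delimiter="@")
    spamwriter = writer(fout, delimiter="@")
    for name, domain in spamreader:
        names.append(name)
        seen = names.count(name)  # occurrences so far, including this one
        # Append a numeric suffix to the second and later occurrences.
        new_name = name + str(seen - 1) if seen > 1 else name
        spamwriter.writerow([new_name, domain])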

$ cat Emails
john@gmail.com
mary@gmail.com
klarck@gmail.com
ralf@gmail.com
john@gmail.com
mary@gmail.com
klarck@gmail.com
mary@gmail.com

$ cat Emails_New
john@gmail.com
mary@gmail.com
klarck@gmail.com
ralf@gmail.com
john1@gmail.com
mary1@gmail.com
klarck1@gmail.com
mary2@gmail.com

This solution tracks the addresses seen so far in a dictionary and appends a number to any address that has been seen before.

import csv

addresses = []        # processed addresses, in input order
known_addresses = {}  # original address -> number of duplicates seen so far

with open('mycsv_file.csv', 'r', newline='') as csv_file:
    reader = csv.reader(csv_file)
    for line in reader:
        address = line[0]
        if address in known_addresses:
            # Seen before: bump the counter and insert it before the "@".
            known_addresses[address] += 1
            email, host = address.split("@")
            number = str(known_addresses[address])
            address = email + number + '@' + host
        else:
            known_addresses[address] = 0
        addresses.append(address)

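However, it cannot know whether a generated address also appears later in the list as an original entry, so the output may still contain duplicates.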

For example, given the list

mary@gmail.com
mary@gmail.com
mary1@gmail.com

you would get

mary@gmail.com
mary1@gmail.com
mary1@gmail.com

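If you need every address to be unique after processing, without losing any address from the original set, you can read all the addresses first and then process them, incrementing the suffix of each duplicate until the result is unused.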

import csv

# all addresses read from the file, with the number of times each occurs
addresses = {}  # { "user@host.com": 1 }

# addresses after duplicates have been renamed
processed_addresses = set()

with open('mycsv_file.csv', 'r', newline='') as csv_file:
    reader = csv.reader(csv_file)
    for line in reader:
        address = line[0]
        if address in addresses:
            addresses[address] += 1
        else:
            addresses[address] = 1

for address, count in addresses.items():  # .iteritems() if python 2.7
    num = 1
    for _ in range(count):
        if address not in processed_addresses:
            processed_addresses.add(address)
        else:
            # Try successive suffixes until an unused address is found.
            parts = address.split('@')
            added = False
            while not added:
                tentative_address = parts[0] + str(num) + '@' + parts[1]
                if tentative_address not in processed_addresses:
                    processed_addresses.add(tentative_address)
                    added = True
                num += 1

Given the input

mary@gmail.com
mary@gmail.com
mary1@gmail.com

this produces (as a set, so the order is not guaranteed)

mary@gmail.com
mary1@gmail.com
mary11@gmail.com

If you need a list of addresses, you can convert the set of processed entries to a list as follows.

addresses = list(processed_addresses)
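The set-based version above guarantees unique addresses but does not keep the input order. If order matters, one possible approach is a single pass that remembers every address already emitted. This is only a sketch: make_unique is a hypothetical helper name, and it assumes each address contains exactly one "@".

```python
def make_unique(addresses):
    """Return the addresses in input order, renaming any that collide."""
    used = set()       # every address emitted so far
    next_suffix = {}   # original address -> next numeric suffix to try
    result = []
    for address in addresses:
        candidate = address
        if candidate in used:
            # Assumption: the address has the form local@host.
            local, _, host = address.partition("@")
            n = next_suffix.get(address, 1)
            candidate = f"{local}{n}@{host}"
            # Skip suffixes that would collide with an emitted address.
            while candidate in used:
                n += 1
                candidate = f"{local}{n}@{host}"
            next_suffix[address] = n + 1
        used.add(candidate)
        result.append(candidate)
    return result


print(make_unique(["mary@gmail.com", "mary@gmail.com", "mary1@gmail.com"]))
# ['mary@gmail.com', 'mary1@gmail.com', 'mary11@gmail.com']
```

As with the set-based version, an original mary1@gmail.com that arrives after a generated one is the entry that gets renamed.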

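You can use collections.Counter to track how many times each email address has been seen so far, which tells you what number to append as a suffix to make it unique. To illustrate this, I added one more line to the end of the sample input, which is now: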

john@gmail.com
mary@gmail.com
klarck@gmail.com
ralf@gmail.com
john@gmail.com
mary@gmail.com
klarck@gmail.com
mary@gmail.com,third occurrence

The code:

import csv
from collections import Counter

# Note: For Python 2.x, use "open('mycsv_file.csv', 'rb')" below.
with open('mycsv_file.csv', 'r', newline='') as csv_file:
    occurrences = Counter()  # email -> number of times seen so far
    for line in csv.reader(csv_file):
        email = line[0]
        if email in occurrences:
            # Seen before: use the current count as the suffix.
            head, tail = email.split('@')
            print('{}@{}'.format(head + str(occurrences[email]), tail))
            occurrences[email] += 1
        else:
            print('{}'.format(email))
            occurrences[email] = 1

Output (note the mary2@gmail.com generated at the end, because that address had already appeared twice):

john@gmail.com
mary@gmail.com
klarck@gmail.com
ralf@gmail.com
john1@gmail.com
mary1@gmail.com
klarck1@gmail.com
mary2@gmail.com
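The snippet above only prints the renamed addresses and drops any other columns. If the goal is to write the result to a new file, something like the following might work. It is a sketch under assumptions: rename_duplicates and rewrite_file are hypothetical helper names, the email is assumed to be in the first column, and blank lines are skipped.

```python
import csv
from collections import Counter


def rename_duplicates(rows):
    """Yield rows with a numeric suffix added to repeated emails in column 0."""
    seen = Counter()
    for row in rows:
        if not row:              # assumption: blank lines can be skipped
            continue
        email = row[0]
        count = seen[email]      # Counter returns 0 for an unseen key
        seen[email] += 1
        if count:
            head, _, tail = email.partition("@")
            email = f"{head}{count}@{tail}"
        yield [email] + row[1:]  # keep any remaining columns unchanged


def rewrite_file(src, dst):
    """Read src, rename duplicate emails, and write the rows to dst."""
    with open(src, newline="") as fin, open(dst, "w", newline="") as fout:
        csv.writer(fout).writerows(rename_duplicates(csv.reader(fin)))


# Example call (the output file name is illustrative):
# rewrite_file("mycsv_file.csv", "mycsv_file_new.csv")
```

Writing to a separate file keeps the original intact, so the output can be checked before it replaces the input.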
