Extracting and counting domains from email addresses

Posted 2024-10-05 14:26:21


I have a list of email addresses and want to extract only the domains and count how many times each domain occurs:

Emails:

best@yahoo.com

hello@gmail.com

everybody@gmail.com

bye@gmail.com

day@yahoo.com

table.blue@gmail.com

life@yahoo.com

Script:

import re
from collections import Counter

with open("mails.txt", "r") as f:
    texte = f.read().split('\n')

    for line in texte:
        newline = re.search("@[\w.]+", line)
        newmail = newline.group()

        mails_value = Counter(newmail).most_common()

        print (mails_value)

Output:

[('@', 1), ('g', 1), ('6', 1), ('5', 1), ('.', 1), ('f', 1), ('r', 1)]

Traceback (most recent call last):
  File "counting.py", line 10, in
    newmail = newline.group()
AttributeError: 'NoneType' object has no attribute 'group'

Desired output:

@yahoo.com 3

@gmail.com 4


3 Answers

You're very close. There is no need to split the file into lines; just use re.findall with re.MULTILINE and the pattern @(.*)$

import re
import collections

with open("mails.txt") as f:
    text = f.read()
domains = re.findall(r'@(.*)$', text, re.MULTILINE)
mails_value = collections.Counter(domains)
print(mails_value)
# With the example input: Counter({'gmail.com': 4, 'yahoo.com': 3})

Running the regular expression over the whole text avoids creating an unnecessary list of lines.
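To print the counts in the format the question asks for (one "@domain count" line per domain), the result of re.findall can be formatted afterwards. The following is only a minimal sketch and not part of the answer above: the helper name count_domains and the inline sample text are assumptions made for illustration, and the pattern is changed to @\S+ so that the leading @ is kept.

```python
import re
from collections import Counter


def count_domains(text):
    """Count '@domain' occurrences in a block of text (hypothetical helper)."""
    # "@\S+" keeps the leading "@" and stops at the first whitespace,
    # so blank lines and trailing newlines are ignored.
    return Counter(re.findall(r"@\S+", text))


# Sample data copied from the question; in practice this would be f.read().
sample = """best@yahoo.com
hello@gmail.com
everybody@gmail.com
bye@gmail.com
day@yahoo.com
table.blue@gmail.com
life@yahoo.com
"""

counts = count_domains(sample)
# Sort by ascending count to mirror the order shown in the question.
for domain, n in sorted(counts.items(), key=lambda item: item[1]):
    print(domain, n)
```

With the sample above this prints "@yahoo.com 3" followed by "@gmail.com 4".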

import re
from collections import Counter

# Match the text between "@" and the first following dot (e.g. "gmail").
pattern = re.compile(r"(?<=@)[^.]+(?=\.)")

names = []
with open("mails.txt", "r") as f:
    for line in f:
        match = pattern.search(line)
        if match:  # skip blank or malformed lines
            names.append(match.group(0))

print(Counter(names))

Output:

Counter({'gmail': 4, 'yahoo': 3})
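Note that this pattern returns only the label between @ and the first dot ("gmail"), while the question asks for the full domain ("gmail.com"). The sketch below illustrates the difference and also shows why the question's script raised AttributeError. The second pattern and all variable names are assumptions chosen for illustration, not part of the answer.

```python
import re

# Pattern from the answer above: text after "@" up to (not including) the first dot.
first_label = re.compile(r"(?<=@)[^.]+(?=\.)")
# Assumed alternative: everything after "@" up to the next whitespace.
full_domain = re.compile(r"(?<=@)\S+")

address = "table.blue@gmail.com"
label = first_label.search(address).group(0)
domain = full_domain.search(address).group(0)
print(label)   # gmail
print(domain)  # gmail.com

# A blank line has no match, so search() returns None; calling .group() on
# None raises the AttributeError shown in the question.
blank = first_label.search("")
print(blank)  # None
```

Checking the result of search() before calling group(), as the answer does with its if test, is what prevents the crash on empty lines.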

You don't need a regular expression. If you trust that every input line is a well-formed email address, this is enough:

from collections import defaultdict

domain_count = defaultdict(int)

with open("mails.txt", "r") as f:
    for line in f:
        line = line.strip()  # drop the trailing newline
        if not line:
            continue  # skip blank lines
        domain = line.split('@')[-1]
        domain_count[domain] += 1

print(domain_count)
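The same no-regex idea can also be written with collections.Counter and str.rpartition, which makes it easy to skip lines that contain no @ at all. This is a hedged sketch rather than the answer's own method: the helper name count_domains_no_regex, the sample list, and the decision to lower-case domains are all assumptions.

```python
from collections import Counter


def count_domains_no_regex(lines):
    """Count domains without regular expressions (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # rpartition splits on the last "@"; sep is "" when no "@" is present.
        _, sep, domain = line.strip().rpartition("@")
        if sep and domain:
            # Lower-casing is an assumption: domain names are case-insensitive.
            counts[domain.lower()] += 1
    return counts


# Made-up sample, including a blank line and a line without "@".
lines = ["best@yahoo.com\n", "hello@GMAIL.com\n", "\n", "not-an-email\n", "bye@gmail.com"]
result = count_domains_no_regex(lines)
print(result.most_common())
```

An open file object can be passed instead of the list, since iterating over a file yields its lines.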
