Python: data cleaning, detecting a pattern in fraudulent email addresses

Published 2024-09-27 19:26:44


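I am cleaning a dataset that contains fraudulent email addresses, which I want to remove.

I have built several rules to catch duplicates and fraudulent domains. But there is one scenario for which I can't figure out how to write a rule in Python to flag the addresses.

I have rules like this: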

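import string
import numpy as np

# delete punctuation (apply returns a new Series; assign it back to keep the result)
df['email'].apply(lambda x: ''.join([i for i in x if i not in string.punctuation]))

# flag yopmail
pattern = "yopmail"
match = df['email'].str.contains(pattern)
df['yopmail'] = np.where(match, 'Y', '0')

# flag duplicates (keep=False marks every copy, not only the later ones)
df['duplicate'] = df.email.duplicated(keep=False)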
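One thing worth checking about the punctuation rule is which characters it removes. The sketch below uses only the standard library; `strip_punctuation` is a hypothetical helper name that mirrors the lambda in the rule, not something from the question.

```python
import string

def strip_punctuation(text):
    """Drop every character listed in string.punctuation."""
    return ''.join(ch for ch in text if ch not in string.punctuation)

# '@' and '.' are both part of string.punctuation, so they are removed too.
print('@' in string.punctuation, '.' in string.punctuation)  # True True
print(strip_punctuation('abc7020.14@gmail.com'))  # abc702014gmailcom
```

If the later rules need the domain or the dot before the trailing number, it may be safer to run this step on a copy of the column.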

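Here is the data I can't find a rule to catch. Basically, I am looking for a way to flag addresses that start the same way but end in consecutive numbers.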

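abc7020@gmail.com
abc7020.1@gmail.com
abc7020.10@gmail.com
abc7020.11@gmail.com
abc7020.12@gmail.com
abc7020.13@gmail.com
abc7020.14@gmail.com
abc7020.15@gmail.com
attn1@gmail.com
attn12@gmail.com
attn123@gmail.com
attn1234@gmail.com
attn12345@gmail.com
attn123456@gmail.com
attn1234567@gmail.com
attn12345678@gmail.com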

3 Answers

You can do this with a regular expression; an example follows:

import re

a = "attn12345@gmail.com"
b = "abc7020.14@gmail.com"
c = "abc7020@gmail.com"
d = "attn12345678@gmail.com"

# at least 3 digits, an optional dot, optionally more digits, then the '@'
pattern = re.compile(r"[0-9]{3,500}\.?[0-9]{0,500}?@")

if pattern.search(a):
    print("spam1")

if pattern.search(b):
    print("spam2")

if pattern.search(c):
    print("spam3")

if pattern.search(d):
    print("spam4")

If you run the code, you will see:

spam1
spam2
spam3
spam4

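The benefit of this approach is that it is standardized (regular expressions), and you can easily tune how strict the match is by adjusting the values inside the {}. That means you could keep a global configuration file where these values are set and adjusted. You can also change the regular expression itself without rewriting the surrounding code.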
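As a rough sketch of the configuration idea, the quantifier bounds can be passed in as parameters and the pattern compiled from them. `CONFIG`, `build_pattern`, `min_digits` and `max_digits` are hypothetical names introduced here; in practice the values might be loaded from a settings file instead of a dict.

```python
import re

# Hypothetical settings; these could just as well come from a config file.
CONFIG = {"min_digits": 3, "max_digits": 500}

def build_pattern(min_digits, max_digits):
    """Compile the answer's pattern with configurable digit-run bounds."""
    return re.compile(
        r"[0-9]{%d,%d}\.?[0-9]{0,%d}?@" % (min_digits, max_digits, max_digits)
    )

pattern = build_pattern(**CONFIG)
strict = build_pattern(min_digits=5, max_digits=500)

for address in ["abc7020@gmail.com", "attn12@gmail.com", "attn12345@gmail.com"]:
    # Second column: default bounds; third column: stricter bounds.
    print(address, bool(pattern.search(address)), bool(strict.search(address)))
```

Raising `min_digits` makes the rule less aggressive: with a minimum of 5, `abc7020@gmail.com` is no longer matched, while `attn12345@gmail.com` still is.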

First, take a look at the regexp question here.

Second, try filtering the email addresses like this:

import re

# Say the email is 'attn1234@gmail.com'
email = 'attn1234@gmail.com'
# Take the part before the '@'
email_name = email.split('@', maxsplit=1)[0]
# Here you get email_name = 'attn1234'
m = re.search(r'\d+$', email_name)
# If the string ends in digits, m will be a Match object; otherwise it is None.
if m is not None:
    print('%s is BAD' % email)
else:
    print('%s is good' % email)
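A trailing-digit check on a single address will also catch legitimate addresses that happen to end in a number. One possible refinement, sketched below under my own assumptions, is to strip the trailing digits (and an optional dot before them) to get a stem, then flag only stems shared by several addresses. `stem_of`, `flag_numbered_variants` and `min_group_size` are hypothetical names, not part of the answer.

```python
import re
from collections import defaultdict

# An optional dot followed by digits at the very end of the local part.
TRAILING_DIGITS = re.compile(r'\.?\d+$')

def stem_of(email):
    """Return the part before '@' with a trailing '.digits' or 'digits' removed."""
    local = email.split('@', maxsplit=1)[0]
    return TRAILING_DIGITS.sub('', local)

def flag_numbered_variants(emails, min_group_size=3):
    """Return the addresses whose stem is shared by at least min_group_size addresses."""
    groups = defaultdict(list)
    for email in emails:
        groups[stem_of(email)].append(email)
    flagged = set()
    for members in groups.values():
        if len(members) >= min_group_size:
            flagged.update(members)
    return flagged

sample = [
    'abc7020@gmail.com', 'abc7020.1@gmail.com', 'abc7020.10@gmail.com',
    'abc7020.11@gmail.com', 'attn1@gmail.com', 'attn12@gmail.com',
    'attn123@gmail.com', 'jane.doe@example.com',
]
flagged = flag_numbered_variants(sample)
print(sorted(flagged))
```

With this sample, the three dotted `abc7020.N` addresses and the three `attn` addresses are flagged, while `abc7020@gmail.com` and `jane.doe@example.com` are left alone. Note that this version flags every member of a group, including the shortest one, so whether to keep one address per group is a separate decision.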

My solution is neither efficient nor pretty. But see whether it works for you, @jeangelj. It definitely works for the examples you provided. Good luck!

import os
from random import shuffle
from difflib import SequenceMatcher

emails = [... ...] # for example the 16 email addresses you gave in your question
shuffle(emails) # the input order does not matter
emails = sorted(emails) # sort so that similar addresses end up next to each other
names = [email.split('@')[0] for email in emails]

T = 0.7 # <- set your string similarity threshold here!!

split_indices=[]
for i in range(1,len(emails)):
    if SequenceMatcher(None, emails[i], emails[i-1]).ratio() < T:
        split_indices.append(i) # we want to remember where dissimilar email address occurs

# cut the sorted list at every split index, so each group holds similar addresses
grouped=[]
start = 0
for i in split_indices:
    grouped.append(emails[start:i])
    start = i
grouped.append(emails[start:])
# now we have similar email addresses grouped, we want to find the common prefix for each group
prefix_strings=[]
for group in grouped:
    # compare only the part before the '@', so the prefix can match an entry in names
    prefix_strings.append(os.path.commonprefix([e.split('@')[0] for e in group]))

# finally: an address whose name equals its group's common prefix is kept as ham
ham=[]
spam=[]
# skip prefixes that are not themselves a name (names.index would raise ValueError)
true_ids = [names.index(p) for p in prefix_strings if p in names]
for i in range(len(emails)):
    if i in true_ids:
        ham.append(emails[i])
    else:
        spam.append(emails[i])

In [30]: ham
Out[30]: ['abc7020@gmail.com', 'attn1@gmail.com']

In [31]: spam
Out[31]: 
['abc7020.10@gmail.com',
 'abc7020.11@gmail.com',
 'abc7020.12@gmail.com',
 'abc7020.13@gmail.com',
 'abc7020.14@gmail.com',
 'abc7020.15@gmail.com',
 'abc7020.1@gmail.com',
 'attn12345678@gmail.com',
 'attn1234567@gmail.com',
 'attn123456@gmail.com',
 'attn12345@gmail.com',
 'attn1234@gmail.com',
 'attn123@gmail.com',
 'attn12@gmail.com']  

# THE TRUTH YALL!
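To get a feel for the threshold `T` and the prefix step used in this answer, it can help to look at the two standard-library calls in isolation. The snippet below is only an illustration; `similarity` is a hypothetical wrapper name, and the address pairs are taken from the sample data.

```python
import os
from difflib import SequenceMatcher

def similarity(a, b):
    """Ratio in [0, 1]; 1.0 means the two strings are identical."""
    return SequenceMatcher(None, a, b).ratio()

# Neighbours inside one family of addresses score high...
near = similarity('attn123@gmail.com', 'attn1234@gmail.com')
# ...while the boundary between two families scores noticeably lower.
far = similarity('abc7020@gmail.com', 'attn12345678@gmail.com')
print(round(near, 3), round(far, 3))

# commonprefix works character by character on any list of strings.
prefix = os.path.commonprefix(['abc7020.1@gmail.com', 'abc7020@gmail.com'])
print(prefix)  # abc7020
```

For these pairs the first ratio lands above 0.7 and the second below it, which is why a threshold of 0.7 separates the two families in the sample. Other datasets will likely need a different value.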
