<p>My solution is neither efficient nor pretty, but see whether it works for you, @jeangelj. It definitely works for the example you provided. Good luck!</p>
<pre><code>import os
from random import shuffle
from difflib import SequenceMatcher

emails = [... ...]  # for example the 16 email addresses you gave in your question
shuffle(emails)  # everyday i'm shuffling (input order does not matter)
emails = sorted(emails)  # sorting puts similar addresses next to each other
names = [email.split('@')[0] for email in emails]

T = 0.7  # string similarity threshold; tune this for your data
split_indices = []
for i in range(1, len(emails)):
    if SequenceMatcher(None, emails[i], emails[i-1]).ratio() < T:
        split_indices.append(i)  # remember where a dissimilar email address occurs

# cut the sorted list at every split index, so more than two groups also work
grouped = []
start = 0
for i in split_indices:
    grouped.append(emails[start:i])
    start = i
grouped.append(emails[start:])

# now we have similar email addresses grouped, we want to find the common prefix for each group
prefix_strings = []
for group in grouped:
    prefix_strings.append(os.path.commonprefix(group))

# finally: the address whose local part equals its group's common prefix is ham
# (names.index raises ValueError if a prefix is not itself one of the names)
ham = []
spam = []
true_ids = [names.index(p) for p in prefix_strings]
for i in range(len(emails)):
    if i in true_ids:
        ham.append(emails[i])
    else:
        spam.append(emails[i])

In [30]: ham
Out[30]: ['abc7020@gmail.com', 'attn1@gmail.com']
In [31]: spam
Out[31]:
['abc7020.10@gmail.com',
'abc7020.11@gmail.com',
'abc7020.12@gmail.com',
'abc7020.13@gmail.com',
'abc7020.14@gmail.com',
'abc7020.15@gmail.com',
'abc7020.1@gmail.com',
'attn12345678@gmail.com',
'attn1234567@gmail.com',
'attn123456@gmail.com',
'attn12345@gmail.com',
'attn1234@gmail.com',
'attn123@gmail.com',
'attn12@gmail.com']
# THE TRUTH YALL!
</code></pre>
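<p>For reuse, the same steps can be wrapped in a function. What follows is only a minimal sketch under stated assumptions: the name <code>split_ham_spam</code> and its <code>threshold</code> parameter are hypothetical, and treating an address with no similar neighbour as ham is an assumption the original example does not cover.</p>
<pre><code>import os
from difflib import SequenceMatcher


def split_ham_spam(emails, threshold=0.7):
    """Group sorted addresses by similarity, then keep each group's base address.

    Hypothetical helper: returns (ham, spam) as two lists.
    """
    ordered = sorted(emails)
    if not ordered:
        return [], []

    # Start a new group whenever neighbouring addresses are too dissimilar.
    groups = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if SequenceMatcher(None, cur, prev).ratio() >= threshold:
            groups[-1].append(cur)
        else:
            groups.append([cur])

    ham, spam = [], []
    for group in groups:
        prefix = os.path.commonprefix(group)
        for email in group:
            local_part = email.split('@')[0]
            # Assumption: a group of one (prefix == whole address) counts as ham.
            if local_part == prefix or email == prefix:
                ham.append(email)
            else:
                spam.append(email)
    return ham, spam


ham, spam = split_ham_spam([
    'abc7020@gmail.com', 'abc7020.1@gmail.com', 'abc7020.10@gmail.com',
    'attn1@gmail.com', 'attn12@gmail.com', 'attn12345678@gmail.com',
])
print(ham)
print(spam)
</code></pre>
<p>Note that if no address in a group equals the common prefix, this version puts the whole group into spam instead of raising an error. Whether that is the right behaviour depends on your data.</p>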