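<p>Well, since you insist on doing it with regex, you should strive to do it in a single call so you don't pay the penalty of context switches. The best approach is to write one pattern that captures every first/last name pair that contains no digits, with the pairs separated by commas. Let the regex engine capture them all, then iterate over the matches and map them into a dictionary so you end up with a last name => first names map:</p>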
<pre><code>import collections
import re

text = "Assaf Spanier, Assaf Din, Yo9ssi Levi, Yoram bibe9rman, David levi, " \
       "Bibi Netanyahu, Amnon Levi, Ehud sPanier, Barak Spa7nier, Sara Neta4nyahu"

full_name = re.compile(r"(?:^|\s|,)([^\d\s]+)\s+([^\d\s]+)(?=$|,)")  # compile the pattern

matches = collections.OrderedDict()  # store for the last=>first name map preserving order
for match in full_name.finditer(text):
    first_name = match.group(1)
    print(first_name)  # print the first name to match your desired output
    last_name = match.group(2).title()  # capitalize the last name for case-insensitivity
    if last_name in matches:  # repeated last name
        matches[last_name].append(first_name)  # add the first name to the map
    else:  # encountering this last name for the first time
        matches[last_name] = [first_name]  # initialize the map for this last name

print("========")  # print the separator...
# finally, print all the repeated last names to match your format
for k, v in matches.items():
    if len(v) > 1:  # print only those with more than one first name attached
        print(k)
</code></pre>
<p>This will give you:</p>
<pre><code>Assaf
Assaf
David
Bibi
Amnon
Ehud
========
Spanier
Levi
</code></pre>
<p>In addition, <code>matches</code> holds the complete last name => first names map.</p>
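<p>If you want that map as a reusable result rather than printed output, something along these lines might work. This is only a sketch: <code>group_by_last_name</code> is a hypothetical helper name, not part of the code above, and it relies on plain <code>dict</code> objects preserving insertion order (Python 3.7+) instead of <code>OrderedDict</code>.</p>

```python
import re

# The pattern from the answer: two digit-free words followed by a comma
# or the end of the string.
FULL_NAME = re.compile(r"(?:^|\s|,)([^\d\s]+)\s+([^\d\s]+)(?=$|,)")


def group_by_last_name(text):
    """Hypothetical helper: map each title-cased last name to its first names."""
    groups = {}  # assumes Python 3.7+, where dict keeps insertion order
    # With two capturing groups, findall() yields (first, last) tuples.
    for first_name, last_name in FULL_NAME.findall(text):
        groups.setdefault(last_name.title(), []).append(first_name)
    return groups


sample = "Assaf Spanier, Assaf Din, Yo9ssi Levi, David levi, Amnon Levi, Ehud sPanier"
name_map = group_by_last_name(sample)
for last_name, first_names in name_map.items():
    print(last_name, "=>", first_names)
```

<p>Names containing a digit (here <code>Yo9ssi Levi</code>) never make it into the map, because the pattern refuses to match them in the first place.</p>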
<p>As for the pattern, let's break it down piece by piece:</p>
<pre>
(?:^|\s|,)  - match the beginning of the string, whitespace or a comma (non-capturing)
([^\d\s]+)  - followed by one or more characters that are not digits or whitespace
              (capturing)
\s+         - followed by one or more whitespace characters (non-capturing)
([^\d\s]+)  - followed by the same pattern as for the first name (capturing)
(?=$|,)     - followed by a comma or end of the string (look-ahead, non-capturing)
</pre>
<p>The two captured groups (first and last name) are then available on the <code>match</code> object as we iterate over the matches. Easy-peasy.</p>
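<p>For readability, the same pattern could also be written with <code>re.VERBOSE</code> and named groups, so the breakdown lives next to the regex itself. Treat this as an illustrative variant: the group names <code>first</code> and <code>last</code> are my own choice and do not appear in the code above.</p>

```python
import re

# Assumption: same logic as the answer's pattern, rewritten with re.VERBOSE
# and named groups ("first" and "last" are illustrative names).
FULL_NAME = re.compile(
    r"""
    (?:^|\s|,)            # start of string, whitespace or a comma, not captured
    (?P<first>[^\d\s]+)   # first name: one or more non-digit, non-whitespace chars
    \s+                   # one or more whitespace characters
    (?P<last>[^\d\s]+)    # last name: same character class as the first name
    (?=$|,)               # look-ahead: end of string or a comma must follow
    """,
    re.VERBOSE,
)

sample = "Bibi Netanyahu, Yo9ssi Levi, Ehud sPanier"
# Each match object exposes the captured groups by name (or by index 1 and 2).
pairs = [(m.group("first"), m.group("last")) for m in FULL_NAME.finditer(sample)]
print(pairs)
```

<p>Named groups are optional here; <code>match.group(1)</code> and <code>match.group(2)</code> return the same values, the names just make the loop body easier to read.</p>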