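<p>Thanks to Ciprian Tomiagă for pointing me to the post <a href="https://stackoverflow.com/questions/42742810/speed-up-millions-of-regex-replacements-in-python-3">Speed up millions of regex replacements in Python 3</a>. The option given there by Eric Duminil (see "Use this method (with set lookup) if you want the fastest solution") works just as well in a pandas setting with a Series instead of a list. The sample code for this question is repeated below; on my large dataset the whole process finished in 2.54 seconds!</p>
<p>Input:</p>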
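<pre><code>import re

# xwords (the banned-word list) and df (a DataFrame with a "Text" column)
# are the objects defined in the question.
banned_words = set(word.strip().lower() for word in xwords)

def delete_banned_words(matchobj):
    # Replace a matched word with "" if it is banned; otherwise keep it.
    word = matchobj.group(0)
    if word.lower() in banned_words:
        return ""
    else:
        return word

sentences = df["Text"]
# Raw string so that \w is not treated as a string escape.
word_pattern = re.compile(r'\w+')
df["Text"] = [word_pattern.sub(delete_banned_words, sentence) for sentence in sentences]
print(df)
</code></pre>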
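<p>For anyone who wants to try the set-lookup idea without the original data, here is a minimal self-contained sketch. The function name <code>remove_banned_words</code> and the sample sentences and banned words are made up for illustration; they are not from the question. It uses a plain list, but any iterable of strings (including a pandas Series) should behave the same way.</p>

```python
import re

# Match runs of word characters; each match is checked against the banned set.
WORD_PATTERN = re.compile(r"\w+")


def remove_banned_words(sentences, banned):
    """Return the sentences with every banned word deleted (case-insensitive)."""
    # Normalise once so that each lookup is an O(1) set membership test.
    banned_set = {word.strip().lower() for word in banned}

    def replace(match):
        word = match.group(0)
        return "" if word.lower() in banned_set else word

    return [WORD_PATTERN.sub(replace, sentence) for sentence in sentences]


# Hypothetical sample data, not taken from the original post.
sample_sentences = ["The quick brown fox", "A lazy dog sleeps"]
sample_banned = ["quick ", "LAZY"]

cleaned = remove_banned_words(sample_sentences, sample_banned)
print(cleaned)  # ['The  brown fox', 'A  dog sleeps']
```

<p>Note that a deleted word leaves its surrounding spaces behind, so the results contain double spaces. With the names from the question, the call would presumably look like <code>df["Text"] = remove_banned_words(df["Text"], xwords)</code>.</p>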