Pandas strings: replacing multiple words without a for loop

import pandas as pd

List1 = ["George Lucas has a problem logging in",
         "George Clooney is trying to download data into a spreadsheet",
         "Bart Graham needs to logon to CRM urgently",
         "Lucy Anne George needs to pull management reports"]
List2 = ["Access Team", "Microsoft Team", "Access Team", "Reporting Team"]
df = pd.DataFrame({"Team": List2, "Text": List1})
xwords = pd.Series(["George", "Lucas", "Clooney", "Lucy", "Anne", "Bart", "Graham"])

for word in range(len(xwords)):
    # Just using ! in the example so one can clearly see the result
    df["Text"] = df["Text"].str.replace(xwords[word], "! ")

3 answers

User

#1 · edited 2024-10-02 08:26:30

I'd suggest tokenizing the text and using a set of names:

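xwords = set(["George", "Lucas", ...])
# Split each row on spaces, drop the banned names, then join the rest back together
df["Text"] = df["Text"].apply(lambda text: ' '.join(w for w in text.split(' ') if w not in xwords))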

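Depending on your strings, the tokenization may need to be more refined than just splitting on spaces.

There may be a pandas-specific way to do this, but I have little experience with it ;)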
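As a rough sketch of what finer tokenization could look like, the snippet below uses a regular expression instead of a plain split, so a name followed by punctuation (such as "George,") is still caught. The sample sentence and the helper name `drop_names` are made up for illustration and are not part of the original answer.

```python
import re
import pandas as pd

xwords = {"George", "Lucas", "Clooney", "Lucy", "Anne", "Bart", "Graham"}

# \w+ matches runs of word characters; [^\w\s] matches a single punctuation mark
token_pattern = re.compile(r"\w+|[^\w\s]")

def drop_names(text):
    """Hypothetical helper: tokenize, drop banned names, rejoin with spaces."""
    tokens = token_pattern.findall(text)
    return " ".join(tok for tok in tokens if tok not in xwords)

# Illustrative data, not from the question
df = pd.DataFrame({"Text": ["George, please reset the password for Lucy."]})
df["Text"] = df["Text"].apply(drop_names)
print(df["Text"].iloc[0])
```

Rejoining with single spaces leaves a space in front of punctuation, so this is only a starting point; a plain `split(' ')` would have left both names in place here because of the attached comma and period.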

User

#2 · edited 2024-10-02 08:26:30

pandas.Series.str.replace can take a compiled regular expression as the pattern:

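import re
# Escape each word so it is matched literally, then join the alternatives with |
patt = re.compile('|'.join(map(re.escape, xwords)))
# Pass regex=True explicitly: since pandas 2.0, str.replace defaults to literal matching
df["Text"] = df["Text"].str.replace(patt, "! ", regex=True)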

Maybe this helps? I don't have any experience with regular expressions this long, though.
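One caveat with a plain alternation is that it also matches inside longer words (for instance "Anne" inside "Annex"). A possible variant, sketched here with made-up sample rows, wraps the alternation in `\b` word boundaries; it is an illustration rather than the answerer's own code.

```python
import re
import pandas as pd

xwords = pd.Series(["George", "Lucas", "Clooney", "Lucy", "Anne", "Bart", "Graham"])

# Illustrative rows; the second one checks that "Annex" is left alone
df = pd.DataFrame({"Text": ["Bart Graham needs to logon to CRM urgently",
                            "Annex B was sent to George"]})

# \b anchors each alternative at word boundaries; re.escape guards against
# names that contain regex metacharacters
patt = re.compile(r"\b(?:" + "|".join(map(re.escape, xwords)) + r")\b")

# regex=True is passed explicitly so the compiled pattern is accepted
# regardless of the pandas default
df["Text"] = df["Text"].str.replace(patt, "!", regex=True)
print(df["Text"].tolist())
```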

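User

#3 · edited 2024-10-02 08:26:30

Thanks to Ciprian Tomiagă for pointing me to the post "Speed up millions of regex replacements in Python 3". The option provided by Eric Duminil (see "Use this method (with set lookup) if you want the fastest solution") works just as well in a Pandas setting with a Series instead of a list. The sample code for this question is repeated below; on my large dataset, the whole process finished in 2.54 seconds!

Input: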

import re

banned_words = set(word.strip().lower() for word in xwords)

def delete_banned_words(matchobj):
    word = matchobj.group(0)
    if word.lower() in banned_words:
        return ""
    else:
        return word

sentences = df["Text"]

word_pattern = re.compile(r'\w+')

df["Text"] = [word_pattern.sub(delete_banned_words, sentence) for sentence in sentences]
print(df)

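The same set-lookup callback can also be handed straight to `Series.str.replace`, which accepts a callable replacement when `regex=True`. The sketch below uses two of the question's sentences; the final whitespace clean-up step is an addition for illustration and is not part of the original answer.

```python
import pandas as pd

xwords = pd.Series(["George", "Lucas", "Clooney", "Lucy", "Anne", "Bart", "Graham"])
df = pd.DataFrame({"Text": ["George Lucas has a problem logging in",
                            "Lucy Anne George needs to pull management reports"]})

# Lower-cased set for O(1) membership tests
banned_words = set(word.strip().lower() for word in xwords)

def delete_banned_words(matchobj):
    """Return an empty string for banned words, otherwise the word unchanged."""
    word = matchobj.group(0)
    return "" if word.lower() in banned_words else word

# The callable receives each regex match object, as with re.sub
df["Text"] = df["Text"].str.replace(r"\w+", delete_banned_words, regex=True)

# Assumed extra step: collapse the spaces left behind by removed words
df["Text"] = df["Text"].str.replace(r"\s+", " ", regex=True).str.strip()
print(df["Text"].tolist())
```

Whether this is as fast as the list comprehension on a large dataset would need to be measured; it mainly keeps the operation inside the pandas string accessor.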
