Finding newly inserted words in a text

Posted 2024-09-24 04:25:36


I want to use Python to find the new words that have been inserted into a text file. For example:

Old: He is a new employee here.
New: He was a new, employee there.

I want this list of words as the output: ['was', ',', 'there']

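I tried difflib, but it reports the differences in an awkward format marked with '+', '-' and '?', so I would have to parse its output to find the new words. Is there a simpler way to do this in Python?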


2 Answers

I used Google's diff-match-patch. It works well.
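If a third-party dependency is not an option, a similar diff-based approach can be sketched with the standard library. The example below is only an illustration, not the diff-match-patch API: it assumes the text is split into words and commas, runs difflib.SequenceMatcher on the two token lists, and reads the opcodes directly instead of parsing the '+', '-' and '?' output. The function name words_added is made up for this sketch.

```python
import difflib
import re


def words_added(old, new):
    """Return the tokens that appear in `new` as insertions or replacements."""
    # Assumption: a token is a run of word characters or a single comma.
    old_tokens = re.findall(r"\w+|,", old)
    new_tokens = re.findall(r"\w+|,", new)

    # autojunk=False turns off the heuristic that ignores very frequent tokens.
    matcher = difflib.SequenceMatcher(None, old_tokens, new_tokens, autojunk=False)

    added = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # 'insert' and 'replace' are the opcodes that contribute new-side tokens.
        if tag in ("insert", "replace"):
            added.extend(new_tokens[j1:j2])
    return added


print(words_added("He is a new employee here.", "He was a new, employee there."))
```

Because SequenceMatcher aligns the two sequences, this variant respects word order: a word that is moved rather than added may still be reported, which may or may not be what you want.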

You can use the re module to do this.

import re

# compile a pattern that matches a whole word or a comma
regex = re.compile(r'\b\w+\b|,')

# the inputs
old = "He is a new employee here."
new = "He was a new, employee there."

# creating lists of the words (or commas) in each sentence
old_words = re.findall(regex, old)
new_words = re.findall(regex, new)

# collect every word in new_words that has no unused match in old_words;
# removing each matched word from old_words means a word that already
# existed but now occurs one more time is still reported as new
word_differences = []
for word in new_words:
    if word in old_words:
        old_words.remove(word)
    else:
        word_differences.append(word)

# print it out to verify (Python 3 syntax)
print(word_differences)

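Note that if you want to pick up other punctuation (such as an exclamation mark or a semicolon), you have to add it to the regular expression. As written, it only matches words and commas.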
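One way to avoid listing punctuation marks one by one is sketched below. It assumes that any single character that is neither a word character nor whitespace should count as its own token, and it keeps the remaining old words in a collections.Counter rather than a list. The names TOKEN_RE and inserted_tokens are made up for this example.

```python
import re
from collections import Counter

# A token is a run of word characters, or any single non-space, non-word character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def inserted_tokens(old, new):
    """Return tokens of `new` that have no unused counterpart in `old`."""
    remaining = Counter(TOKEN_RE.findall(old))
    added = []
    for token in TOKEN_RE.findall(new):
        if remaining[token] > 0:
            # This occurrence is accounted for by the old text.
            remaining[token] -= 1
        else:
            added.append(token)
    return added


print(inserted_tokens("He is a new employee here.", "He was a new, employee there."))
```

With this pattern the trailing full stop is tokenized too; it is present in both sentences, so it is not reported.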
