Comparing n-grams across multiple text files in Python

import nltk
from nltk.util import ngrams

text1 = 'Hello my name is Jason'
text2 = 'My name is not Mike'

n = 3
trigrams1 = ngrams(text1.split(), n)
trigrams2 = ngrams(text2.split(), n)

print(trigrams1)
for grams in trigrams1:
    print(grams)

def compare(trigrams1, trigrams2):
    for grams1 in trigrams1:
        # the original post used an undefined name, each_gram, here
        if grams1 in trigrams2:
            print(grams1)
    return False

3 Answers

User

#1 · edited 2024-10-01 13:40:11

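I was working on a task very similar to yours when I came across this old thread. It seems to work well, except for one bug. I am adding this answer in case someone else stumbles on it. ngrams from nltk.util returns a generator object, not a list. A generator can only be iterated once, and each `in` test consumes part of it, so it has to be converted to a list before it can be used with the compare function you wrote. Use lower() for case-insensitive matching.

Full example: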

import nltk
from nltk.util import ngrams

text1 = 'Hello my name is Jason'
text2 = 'My name is not Mike'

n = 3
trigrams1 = ngrams(text1.lower().split(), n)
trigrams2 = ngrams(text2.lower().split(), n)

def compare_ngrams(trigrams1, trigrams2):
    trigrams1 = list(trigrams1)
    trigrams2 = list(trigrams2)
    common=[]
    for gram in trigrams1:
        if gram in trigrams2:
            common.append(gram)
    return common

common = compare_ngrams(trigrams1, trigrams2)
print(common)

Output:

[('my', 'name', 'is')]
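
To make the generator issue concrete, here is a minimal sketch that does not need NLTK. simple_ngrams is a hypothetical stand-in written for illustration (it is not part of nltk); the assumption is only that, like nltk.util.ngrams, it hands back a lazy iterator that can be walked once.

```python
def simple_ngrams(tokens, n):
    """Hypothetical stand-in for nltk.util.ngrams: yields n-grams lazily."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])

tokens = 'my name is not mike'.split()

lazy = simple_ngrams(tokens, 3)
first_pass = list(lazy)    # consumes the generator
second_pass = list(lazy)   # nothing left: the generator is exhausted

print(first_pass)   # [('my', 'name', 'is'), ('name', 'is', 'not'), ('is', 'not', 'mike')]
print(second_pass)  # []

# A membership test also consumes a generator up to the match,
# so repeated `in` checks against it give unreliable results.
lazy = simple_ngrams(tokens, 3)
found_first = ('name', 'is', 'not') in lazy   # True, but two items are now gone
found_again = ('my', 'name', 'is') in lazy    # False: that item was already consumed
print(found_first, found_again)  # True False
```

Wrapping the result in list(...) once, up front, avoids both problems.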

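User

#2 · edited 2024-10-01 13:40:11

I think it may be easier to join the elements of each n-gram into a single string, build a list of those strings, and then compare the lists.

Let's walk through the process using the example you provided.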

text1 = 'Hello my name is Jason'
text2 = 'My name is not Mike'

After applying the ngrams function from nltk, you get the following two lists, which I have again named text1 and text2:

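text1 = [('Hello', 'my', 'name'), ('my', 'name', 'is'), ('name', 'is', 'Jason')]
text2 = [('My', 'name', 'is'), ('name', 'is', 'not'), ('is', 'not', 'Mike')]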

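When you compare n-grams, you should lowercase every element so that 'my' and 'My' are not treated as different tokens, which is clearly not what we want.

The following function does exactly that.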

def append_elements(n_gram):
    # n_gram must be a list (not a generator): it is modified in place.
    for index, gram in enumerate(n_gram):
        # Join the tokens of one n-gram into a single lowercase phrase.
        n_gram[index] = ' '.join(gram).lower()
    return n_gram

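So if we pass in text1, we get ['hello my name', 'my name is', 'name is jason'], which is much easier to work with.

Next we write the compare function. You were right that we can use a list to store what the two have in common. I have named it common here: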

def compare(n_gram1, n_gram2):
    n_gram1 = append_elements(n_gram1)
    n_gram2 = append_elements(n_gram2)
    common = []
    for phrase in n_gram1:
        if phrase in n_gram2:
            common.append(phrase)
    if not common:
        return False
        # or you could print a message saying no commonality was found
    else:
        for i in common:
            print(i)

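if not common checks whether the common list is empty; in that case the function returns False (or it could print a message instead).

In your example, when we run compare(text1, text2), the only common phrase is: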

>>> 
my name is
>>>

This is the correct answer.
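
The functions above modify the input lists in place and print the result rather than returning it. As a hedged alternative, the sketch below keeps the same join-and-lowercase idea but leaves its inputs untouched and returns the common phrases. The names to_phrases and common_phrases are my own and do not come from the answer.

```python
def to_phrases(n_grams):
    """Return a new list with each n-gram joined into one lowercase string."""
    return [' '.join(gram).lower() for gram in n_grams]

def common_phrases(n_grams1, n_grams2):
    """Return phrases from n_grams1 that also occur in n_grams2, in order."""
    phrases2 = set(to_phrases(n_grams2))   # a set gives fast membership tests
    return [p for p in to_phrases(n_grams1) if p in phrases2]

text1 = [('Hello', 'my', 'name'), ('my', 'name', 'is'), ('name', 'is', 'Jason')]
text2 = [('My', 'name', 'is'), ('name', 'is', 'not'), ('is', 'not', 'Mike')]

result = common_phrases(text1, text2)
print(result)  # ['my name is']
```

Because nothing is mutated, the function can be called repeatedly on the same lists, and an empty list simply signals that nothing was shared.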

User

#3 · edited 2024-10-01 13:40:11

Use a list, say common, inside the compare function. Append every n-gram that the two sets of trigrams share to that list, and return the list at the end:

>>> trigrams1 = list(ngrams(text1.lower().split(), n))  # lower() ignores case; list() materializes the generator
>>> trigrams2 = list(ngrams(text2.lower().split(), n))
>>> trigrams1
[('hello', 'my', 'name'), ('my', 'name', 'is'), ('name', 'is', 'jason')]
>>> trigrams2
[('my', 'name', 'is'), ('name', 'is', 'not'), ('is', 'not', 'mike')]
>>> def compare(trigrams1, trigrams2):
...    common=[]
...    for grams1 in trigrams1:
...       if grams1 in trigrams2:
...         common.append(grams1)
...    return common
... 
>>> compare(trigrams1, trigrams2)
[('my', 'name', 'is')]
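
Since the question title mentions several text files, here is a sketch that extends the same idea to any number of texts using set intersection. It uses a small pure-Python helper instead of NLTK, and every function name here (word_ngrams, common_ngrams, common_ngrams_in_files) is an assumption made for illustration.

```python
from pathlib import Path

def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in text (whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def common_ngrams(texts, n=3):
    """Return the n-grams that appear in every text."""
    gram_sets = [word_ngrams(text, n) for text in texts]
    if not gram_sets:
        return set()
    return set.intersection(*gram_sets)

def common_ngrams_in_files(paths, n=3):
    """Same as common_ngrams, but reads each text from a file path."""
    return common_ngrams([Path(p).read_text(encoding='utf-8') for p in paths], n)

texts = ['Hello my name is Jason', 'My name is not Mike', 'Is my name is on the list']
shared = common_ngrams(texts)
print(shared)  # {('my', 'name', 'is')}
```

For files on disk, a call such as common_ngrams_in_files(['a.txt', 'b.txt']) would do the same, where the two paths are placeholders. Sets discard duplicates and ordering, so this fits when only the shared n-grams matter, not how often or where they occur.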
