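Python - Need help understanding the difference between these two pieces of code - Q&A

2 answers

User

#1 · edited 2024-09-26 22:08:54

To get the Jaccard distance between two bags of words, i.e. the unique vocabulary of the two sentences: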

>>> from nltk.metrics import jaccard_distance
>>> from nltk import ngrams

>>> sent1 = "This is a foo bar sentence".split()
>>> sent2 = "A bar bar black sheep have you a sentence".split()

>>> set(sent1) # A list of unique words in sent1
set(['a', 'bar', 'sentence', 'This', 'is', 'foo'])
>>> set(sent2) # A list of unique words in sent2
set(['A', 'sheep', 'bar', 'sentence', 'black', 'a', 'have', 'you'])

>>> jaccard_distance(set(sent1), set(sent2))
0.7272727272727273

Now, if it is a bag of ngrams rather than single words, the Jaccard distance between the two sentences comes out as 1.0.
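As a rough sketch of the ngram comparison, here is a version built on plain-Python stand-ins so it runs without NLTK. Note that word_ngrams and jaccard are hypothetical helpers that only mimic nltk.ngrams and jaccard_distance on sets, and the order n=3 is an assumption.

```python
def word_ngrams(tokens, n):
    """Return the set of n-grams (as tuples) over a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a, b):
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    union = a | b
    return (len(union) - len(a & b)) / len(union)

sent1 = "This is a foo bar sentence".split()
sent2 = "A bar bar black sheep have you a sentence".split()

# No word trigram occurs in both sentences, so the distance is 1.0.
print(jaccard(word_ngrams(sent1, 3), word_ngrams(sent2, 3)))
```

With single words the sets overlap on 'a', 'bar' and 'sentence', which is why the earlier distance was below 1.0.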

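What does a Jaccard distance of 1.0 mean?

It means that the two sequences being compared are completely different; in this case, every ngram is unique to its own sentence.

Previously, we split each sentence string into a list of strings, so comparing the two sequences compared the words/ngrams in the sentences.

Now, if we iterate over two words instead of two sentences, each word is broken into a sequence of characters, so the ngrams are character ngrams over word1 and word2.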

To get the Jaccard distance between them:

>>> jaccard_distance(set(ngrams(word1, 3)), set(ngrams(word2, 3)))
0.9818181818181818
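To see where a number like that comes from, here is a small sketch with a plain-Python stand-in for the character trigrams. char_ngrams is a hypothetical helper, and the value of word2 is an assumption (it is one of the words used later in this answer and it reproduces the distance above).

```python
def char_ngrams(word, n):
    # Slide a window of n characters over the word; a set drops repeats.
    return {tuple(word[i:i + n]) for i in range(len(word) - n + 1)}

word1 = 'Supercalifragilisticexpialidocious'
word2 = 'Honorificabilitudinitatibus'  # assumed value of word2

tri1, tri2 = char_ngrams(word1, 3), char_ngrams(word2, 3)
shared = tri1 & tri2
distance = 1 - len(shared) / len(tri1 | tri2)

print(shared)    # the trigrams the two words have in common
print(distance)
```

The two words share a single trigram out of 55 distinct ones, so the distance lands just below 1.0.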

Now, to the OP's question:

distances = ((jaccard_distance(set(nltk.ngrams(entry, gram_number)),
                               set(nltk.ngrams(word,    gram_number))), word)
            for word in spellings)

versus

for word in spellings:
    distances = ((jaccard_distance(set(nltk.ngrams(entry, gram_number)),
                               set(nltk.ngrams(word,    gram_number))), word))

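The first thing you could try is to simplify the code:

Use namespaces, they are your friend
If you have to type the same thing again and again, use a function
Use explicit and clear variable names

Use namespaces

Instead of typing nltk.ngrams(...) over and over, you can do: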

>>> from nltk import ngrams
>>> list(ngrams('foobar', 3))
[('f', 'o', 'o'), ('o', 'o', 'b'), ('o', 'b', 'a'), ('b', 'a', 'r')]

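If you only use ngram orders of 2 or 3, i.e. bigrams or trigrams, NLTK also provides ready-made nltk.bigrams and nltk.trigrams functions.

If you want to get fancy and customize a function for the ngram order you want, you can try functools.partial: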

>>> from functools import partial
>>> from nltk import ngrams

>>> octagram = partial(ngrams, n=8)

>>> word = 'Supercalifragilisticexpialidocious'
>>> octagram(word)
<generator object ngrams at 0x10cafff00>

>>> list(octagram(word))
[('S', 'u', 'p', 'e', 'r', 'c', 'a', 'l'), ('u', 'p', 'e', 'r', 'c', 'a', 'l', 'i'), ('p', 'e', 'r', 'c', 'a', 'l', 'i', 'f'), ('e', 'r', 'c', 'a', 'l', 'i', 'f', 'r'), ('r', 'c', 'a', 'l', 'i', 'f', 'r', 'a'), ('c', 'a', 'l', 'i', 'f', 'r', 'a', 'g'), ('a', 'l', 'i', 'f', 'r', 'a', 'g', 'i'), ('l', 'i', 'f', 'r', 'a', 'g', 'i', 'l'), ('i', 'f', 'r', 'a', 'g', 'i', 'l', 'i'), ('f', 'r', 'a', 'g', 'i', 'l', 'i', 's'), ('r', 'a', 'g', 'i', 'l', 'i', 's', 't'), ('a', 'g', 'i', 'l', 'i', 's', 't', 'i'), ('g', 'i', 'l', 'i', 's', 't', 'i', 'c'), ('i', 'l', 'i', 's', 't', 'i', 'c', 'e'), ('l', 'i', 's', 't', 'i', 'c', 'e', 'x'), ('i', 's', 't', 'i', 'c', 'e', 'x', 'p'), ('s', 't', 'i', 'c', 'e', 'x', 'p', 'i'), ('t', 'i', 'c', 'e', 'x', 'p', 'i', 'a'), ('i', 'c', 'e', 'x', 'p', 'i', 'a', 'l'), ('c', 'e', 'x', 'p', 'i', 'a', 'l', 'i'), ('e', 'x', 'p', 'i', 'a', 'l', 'i', 'd'), ('x', 'p', 'i', 'a', 'l', 'i', 'd', 'o'), ('p', 'i', 'a', 'l', 'i', 'd', 'o', 'c'), ('i', 'a', 'l', 'i', 'd', 'o', 'c', 'i'), ('a', 'l', 'i', 'd', 'o', 'c', 'i', 'o'), ('l', 'i', 'd', 'o', 'c', 'i', 'o', 'u'), ('i', 'd', 'o', 'c', 'i', 'o', 'u', 's')]
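The same partial trick works for any order. Below is a minimal sketch that uses a hypothetical char_ngrams generator as a stand-in for nltk.ngrams, so it runs without NLTK installed; bigram and trigram are names made up for the illustration.

```python
from functools import partial

def char_ngrams(text, n):
    """Stand-in for nltk.ngrams: yield n-character tuples lazily."""
    for i in range(len(text) - n + 1):
        yield tuple(text[i:i + n])

# Fix n once; the resulting callables only need the text.
bigram = partial(char_ngrams, n=2)
trigram = partial(char_ngrams, n=3)

print(list(bigram('foobar')))
print(list(trigram('foobar')))
```

Like the octagram above, each call returns a generator, so wrap it in list() or set() before reusing the result.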

Use functions

Instead of rewriting set(nltk.ngrams(word, gram_number)) every time, you can simply call uco(word):

>>> from nltk import ngrams
>>> def unique_character_octagrams(text, n=8):
...     return set(ngrams(text, n))
... 
>>> uco = unique_character_octagrams
>>> uco(word1)
set([('e', 'x', 'p', 'i', 'a', 'l', 'i', 'd'), ('S', 'u', 'p', 'e', 'r', 'c', 'a', 'l'), ('i', 'c', 'e', 'x', 'p', 'i', 'a', 'l'), ('a', 'g', 'i', 'l', 'i', 's', 't', 'i'), ('t', 'i', 'c', 'e', 'x', 'p', 'i', 'a'), ('i', 'l', 'i', 's', 't', 'i', 'c', 'e'), ('i', 'd', 'o', 'c', 'i', 'o', 'u', 's'), ('c', 'e', 'x', 'p', 'i', 'a', 'l', 'i'), ('l', 'i', 's', 't', 'i', 'c', 'e', 'x'), ('f', 'r', 'a', 'g', 'i', 'l', 'i', 's'), ('l', 'i', 'f', 'r', 'a', 'g', 'i', 'l'), ('i', 'f', 'r', 'a', 'g', 'i', 'l', 'i'), ('p', 'i', 'a', 'l', 'i', 'd', 'o', 'c'), ('a', 'l', 'i', 'f', 'r', 'a', 'g', 'i'), ('x', 'p', 'i', 'a', 'l', 'i', 'd', 'o'), ('e', 'r', 'c', 'a', 'l', 'i', 'f', 'r'), ('l', 'i', 'd', 'o', 'c', 'i', 'o', 'u'), ('g', 'i', 'l', 'i', 's', 't', 'i', 'c'), ('i', 's', 't', 'i', 'c', 'e', 'x', 'p'), ('r', 'c', 'a', 'l', 'i', 'f', 'r', 'a'), ('r', 'a', 'g', 'i', 'l', 'i', 's', 't'), ('i', 'a', 'l', 'i', 'd', 'o', 'c', 'i'), ('p', 'e', 'r', 'c', 'a', 'l', 'i', 'f'), ('a', 'l', 'i', 'd', 'o', 'c', 'i', 'o'), ('u', 'p', 'e', 'r', 'c', 'a', 'l', 'i'), ('c', 'a', 'l', 'i', 'f', 'r', 'a', 'g'), ('s', 't', 'i', 'c', 'e', 'x', 'p', 'i')])

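Use explicit and clear variable names

In the original post, you iterate with for word in spellings, but it is unclear what spellings is. It would help to include a sample input for spellings in the question, so that answerers don't have to guess in the dark what spellings really is.

Judging from the loop and the use of Jaccard distance, spellings is a list of words, so a better variable name would be list_of_words, and the iteration reads clearly without a comment, e.g. for word in list_of_words.

Likewise, the entry variable is unclear; from its usage, it is most probably the query you want to run against the list of words, so a possible variable name is query_word.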

from nltk import ngrams
from nltk.metrics import jaccard_distance

def unique_character_trigrams(text, n=3):
    return set(ngrams(text, n))

uct = unique_character_trigrams

list_of_words = ['Supercalifragilisticexpialidocious', 'Honorificabilitudinitatibus']

query_word = 'Antidisestablishmentarianism'

for word in list_of_words:
    d = jaccard_distance(uct(query_word), uct(word))
    print("Comparing {} vs {}\nJaccard = {}\n".format(query_word, word, d))

[out]:

Comparing Antidisestablishmentarianism vs Supercalifragilisticexpialidocious
Jaccard = 0.982142857143

Comparing Antidisestablishmentarianism vs Honorificabilitudinitatibus
Jaccard = 1.0

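Now, back to the OP's question. Let's treat:

spellings as x, i.e. a list of numbers
entry as y, i.e. a fixed number
word as num, i.e. one number from the list
jaccard_distance as f, a simple subtraction function

In the first scenario, the inline-loop syntax looks like a list comprehension, but because it is wrapped in parentheses it is actually a generator expression. The output is a generator, so you have to materialize it with list(), and each element it yields is an output of f: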

>>> x = [10, 20, 30] # A list of numbers. 
>>> y = 3 # A number to compare against the list.
>>> f = lambda x, y: x - y # A simple function to do x - y
>>> f(10,3)
7
>>> f(20,3)
17
>>> result = (f(num,y) for num in x)
>>> result
<generator object <genexpr> at 0x10cafff00>
>>> list(result)
[7, 17, 27]

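The second scenario is the more traditional way to iterate: each pass through the loop produces a single integer, and result is rebound every time, so after the loop it holds only the last value: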

>>> for num in x:
...     result = f(num, y)
...     print(type(result), result)
... 
(<type 'int'>, 7)
(<type 'int'>, 17)
(<type 'int'>, 27)

User

#2 · edited 2024-09-26 22:08:54

In case 1:

distances is a generator that yields one (distance, word) tuple for every word in spellings, for example:

(0.1111111111111111, 'hello')

(0.2222222222222222, 'world')

(0.5, 'program')

(0.2727272727272727, 'computer')

(0.0, 'spell')

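In case 2:

distances is overwritten on every iteration, so after the loop it holds only the last value, which would be (0.0, 'spell') for the example list above.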
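Both cases can be put side by side in a small sketch. Everything here is made up for illustration: the spellings list, the entry value, the order n=2, and the char_ngrams and jaccard helpers, which are plain-Python stand-ins for nltk.ngrams and jaccard_distance.

```python
def char_ngrams(word, n):
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard(a, b):
    union = a | b
    return (len(union) - len(a & b)) / len(union)

spellings = ['hello', 'spell', 'world']  # hypothetical word list
entry = 'spel'                            # hypothetical misspelled query
n = 2

# Case 1: a generator expression yields one (distance, word) pair per word.
distances = ((jaccard(char_ngrams(entry, n), char_ngrams(word, n)), word)
             for word in spellings)
all_pairs = list(distances)
print(len(all_pairs))   # one pair per word in spellings
print(min(all_pairs))   # smallest distance, i.e. the closest spelling

# Case 2: the name is rebound on every pass, so only the last pair survives.
for word in spellings:
    last_pair = (jaccard(char_ngrams(entry, n), char_ngrams(word, n)), word)
print(last_pair)
```

Case 1 keeps every pair, so you can still pick the closest word afterwards; case 2 only ever knows about the final word in the list.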