Checking for a superset of lists in a given order

1 answer

User

#1 · Posted 2024-06-28 10:21:08

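I suggest taking every run of 4 consecutive tokens from your tokenized list and building a set of those runs. With Python's itertools module this can be done quite elegantly: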

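import itertools

my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
# Four iterators over the same list, each starting one token later
i1 = itertools.islice(my_list, 0, None)
i2 = itertools.islice(my_list, 1, None)
i3 = itertools.islice(my_list, 2, None)
i4 = itertools.islice(my_list, 3, None)
# zip stops at the shortest iterator, so only complete 4-tuples are produced
print zip(i1, i2, i3, i4)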

Output of the code above (nicely formatted):

[('The', 'quick', 'brown', 'fox'),
 ('quick', 'brown', 'fox', 'jumps'),
 ('brown', 'fox', 'jumps', 'over'),
 ('fox', 'jumps', 'over', 'the'),
 ('jumps', 'over', 'the', 'lazy'),
 ('over', 'the', 'lazy', 'dog')]

Actually, this is even more elegant:

my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
iterators = [itertools.islice(my_list, x, None) for x in range(4)]
print zip(*iterators)

The output is the same as before.
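The same sliding-window idea can be wrapped in a small helper that works for any window size. This is only a minimal sketch, written for Python 3 (where zip is lazy, hence the list call); the function name `ngrams` and its parameter `n` are my own and do not appear in the code above.

```python
import itertools

def ngrams(tokens, n=4):
    """Return all runs of n consecutive tokens as a list of tuples."""
    # One iterator per offset; zip stops when the shortest one runs out,
    # so incomplete windows at the end are dropped automatically.
    iterators = [itertools.islice(tokens, start, None) for start in range(n)]
    return list(zip(*iterators))

my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
windows = ngrams(my_list)
print(windows[0])    # ('The', 'quick', 'brown', 'fox')
print(len(windows))  # 6
```

A list with fewer than n tokens simply yields an empty list.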

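Now that you have every run of four consecutive tokens (as 4-tuples) for each list, you can put those tuples into a set and check whether the same 4-tuple shows up in two different sets: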

my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
set1 = set(zip(*[itertools.islice(my_list, x, None) for x in range(4)]))

other_list = ['The', 'quick', 'red', 'fox', 'goes', 'home']
set2 = set(zip(*[itertools.islice(other_list, x, None) for x in range(4)]))

print set1.intersection(set2) # Empty set
if set1.intersection(set2):
    print "Found something in common"
else:
    print "Nothing in common"
# Prints "Nothing in common"

third_list = ['The', 'quick', 'brown', 'fox', 'goes', 'to', 'school']
set3 = set(zip(*[itertools.islice(third_list, x, None) for x in range(4)]))

print set1.intersection(set3) # Set containing ('The', 'quick', 'brown', 'fox')
if set1.intersection(set3):
    print "Found something in common"
else:
    print "Nothing in common"
# Prints "Found something in common"
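When only a yes/no answer is needed, `set.isdisjoint` gives it without building the intersection. The following is a sketch for Python 3 under that assumption; `four_grams` and `shares_four_gram` are hypothetical helper names, not part of the original answer.

```python
import itertools

def four_grams(tokens):
    """Set of all 4-tuples of consecutive tokens."""
    return set(zip(*[itertools.islice(tokens, x, None) for x in range(4)]))

def shares_four_gram(tokens_a, tokens_b):
    """True if the two token lists have at least one 4-token run in common."""
    return not four_grams(tokens_a).isdisjoint(four_grams(tokens_b))

my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
other_list = ['The', 'quick', 'red', 'fox', 'goes', 'home']
third_list = ['The', 'quick', 'brown', 'fox', 'goes', 'to', 'school']

print(shares_four_gram(my_list, other_list))  # False
print(shares_four_gram(my_list, third_list))  # True
```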

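Note: if you are using Python 3, just replace every print "Something" statement with print("Something"): in Python 3, print is a function rather than a statement. In Python 3, zip also returns a lazy iterator instead of a list, so wrap it in list(...) before printing it. But if you are using NLTK, I suspect you are on Python 2.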
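For reference, here is roughly what the compact snippet looks like on Python 3. It is a sketch rather than a drop-in port; the variable names `lazy_tuples` and `four_tuples` are mine.

```python
import itertools

my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
iterators = [itertools.islice(my_list, x, None) for x in range(4)]

lazy_tuples = zip(*iterators)     # a lazy zip object in Python 3, not a list
four_tuples = list(lazy_tuples)   # materialize it once so it can be printed and reused
print(four_tuples[:2])
```

Passing the zip object straight to set(...) works unchanged on both versions, because set consumes any iterable.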

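Important: any itertools.islice object you create iterates through its source list once and is then "exhausted" (it has already returned all of its data, so putting it in a second for loop produces nothing). If you want to go through the same list several times, create several iterators (as I did in the examples).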
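The exhaustion behaviour is easy to see in a few lines. A small illustrative sketch in Python 3; the variable names are arbitrary.

```python
import itertools

tokens = ['The', 'quick', 'brown', 'fox']
it = itertools.islice(tokens, 1, None)

first_pass = list(it)    # consumes the iterator
second_pass = list(it)   # nothing left: the iterator is already exhausted
fresh_pass = list(itertools.islice(tokens, 1, None))  # a new iterator works again

print(first_pass)   # ['quick', 'brown', 'fox']
print(second_pass)  # []
print(fresh_pass)   # ['quick', 'brown', 'fox']
```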

Update: here is how to eliminate the lower-scoring sentences. First, replace this line:

tokenized_sents=[tokenize_words(sentence) for score,sentence in sent_scores]

with:

tokenized_sents=[(score,tokenize_words(sentence)) for score,sentence in sent_scores]

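Now you have a list of (score, sentence) tuples. Next we build a list called scores_and_sentences_and_sets, which is a list of (score, sentence, set_of_four_word_groups) tuples (where set_of_four_word_groups is the set of four-word runs for that sentence, built as in the examples above):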

scores_and_sentences_and_sets = [(score, sentence, set(zip(*[itertools.islice(sentence, x, None) for x in range(4)]))) for score,sentence in tokenized_sents]

Actually, that one-liner is probably a bit too clever, so let's unpack it to make it more readable:

scores_and_sentences_and_sets = []
for score, sentence in tokenized_sents:
    set_of_four_word_groups = set(zip(*[itertools.islice(sentence, x, None) for x in range(4)]))
    score_sentence_and_sets_tuple = (score, sentence, set_of_four_word_groups)
    scores_and_sentences_and_sets.append(score_sentence_and_sets_tuple)

Go ahead and experiment with these two snippets; you will find that they do exactly the same thing.
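One quick way to convince yourself is to run both forms on the same input and compare the results. The sketch below uses Python 3 and made-up data: the `tokenized_sents` values are hypothetical and only mimic the (score, tokens) shape described above.

```python
import itertools

# Hypothetical, already-tokenized input in the (score, tokens) shape used above.
tokenized_sents = [
    (0.9, ['The', 'quick', 'brown', 'fox', 'jumps']),
    (0.4, ['A', 'short', 'one']),
]

# The compact comprehension
one_liner = [(score, sentence, set(zip(*[itertools.islice(sentence, x, None) for x in range(4)])))
             for score, sentence in tokenized_sents]

# The unrolled loop
unrolled = []
for score, sentence in tokenized_sents:
    groups = set(zip(*[itertools.islice(sentence, x, None) for x in range(4)]))
    unrolled.append((score, sentence, groups))

print(one_liner == unrolled)  # True
print(unrolled[1][2])         # set()
```

The second print also shows an edge case worth knowing: a sentence with fewer than four tokens yields an empty set, so it can never overlap with any other sentence.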

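OK, now we have a list of (score, sentence, set of four-word groups) tuples. We will walk through the list in order and build a result list containing only the sentences we want to keep. Because the list is already sorted in descending order of score, things get a little simpler: at any point in the list we only have to look at the items that have already been "accepted" and see whether any of them is a duplicate. If any accepted item duplicates the one we are looking at, we do not even need to compare scores, because the accepted item came earlier in the list and therefore must have a higher score than the current one.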

So here is some code that should do what you want:

accepted_items = []
for current_tuple in scores_and_sentences_and_sets:
    score, sentence, set_of_four_words = current_tuple
    found = False
    for accepted_tuple in accepted_items:
        accepted_score, accepted_sentence, accepted_set = accepted_tuple
        if set_of_four_words.intersection(accepted_set):
            found = True
            break
    if not found:
        accepted_items.append(current_tuple)
print accepted_items # Prints a whole bunch of tuples
sentences_only = [sentence for score, sentence, word_set in accepted_items]
print sentences_only # Prints just the sentences
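Putting the pieces together, the whole filter can be sketched as one function. This is a Python 3 sketch under the same assumption as above (input sorted by score, highest first); `four_grams`, `drop_overlapping`, and the sample data are hypothetical names and values used for illustration only.

```python
import itertools

def four_grams(tokens):
    """Set of all 4-tuples of consecutive tokens."""
    return set(zip(*[itertools.islice(tokens, x, None) for x in range(4)]))

def drop_overlapping(scored_sentences):
    """Keep a sentence unless it shares a 4-token run with an earlier kept one.

    Assumes scored_sentences is a list of (score, tokens) pairs sorted by
    score, highest first, so an earlier item always outranks a later one.
    """
    accepted = []  # list of (score, tokens, four_gram_set)
    for score, tokens in scored_sentences:
        grams = four_grams(tokens)
        if all(grams.isdisjoint(kept_grams) for _, _, kept_grams in accepted):
            accepted.append((score, tokens, grams))
    return [tokens for _, tokens, _ in accepted]

# Hypothetical input, already sorted by descending score.
scored = [
    (0.9, ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']),
    (0.7, ['The', 'quick', 'brown', 'fox', 'goes', 'to', 'school']),
    (0.5, ['The', 'quick', 'red', 'fox', 'goes', 'home']),
]
kept = drop_overlapping(scored)
print(len(kept))  # 2
```

The second sentence is dropped because it shares ('The', 'quick', 'brown', 'fox') with the first; the third shares no four-token run with anything kept, so it stays.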
