Checking for a superset of a list in a given order

Posted 2024-06-28 10:21:08

I have a list of (float, string) tuples sorted in descending order of score.

print sent_scores
[(0.10507038451969995,'Deadly stampede in Shanghai - Emergency personnel help victims.'),
 (0.078586381821416265,'Deadly stampede in Shanghai - Police and medical staff help injured people after the stampede.'),
 (0.072031446647399661, '- Emergency personnel help victims.')]

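If two sentences in the list share the same four consecutive words, I want to remove the tuple with the lower score from the list. The new list should also preserve the original order.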

Desired output for the list above:

[(0.10507038451969995,'Deadly stampede in Shanghai - Emergency personnel help victims.')]

This will certainly involve tokenizing the words first, which can be done with the code below:

from nltk.tokenize import TreebankWordTokenizer

def tokenize_words(text):
    tokens = TreebankWordTokenizer().tokenize(text)
    contractions = ["n't", "'ll", "'m","'s"]
    fix = []
    for i in range(len(tokens)):
        for c in contractions:
            if tokens[i] == c: fix.append(i)
    fix_offset = 0
    for fix_id in fix:
        idx = fix_id - 1 - fix_offset
        tokens[idx] = tokens[idx] + tokens[idx+1]
        del tokens[idx+1]
        fix_offset += 1
    return tokens
tokenized_sents = [tokenize_words(sentence) for score, sentence in sent_scores]

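Earlier I tried turning the words of each sentence into sets, one set per group of 4 words, and then calling issuperset against the other sentences. But that does not check that the words are consecutive.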
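One possible illustration of why a plain set test falls short (a guess at the attempt described above, written for Python 3, with a made-up sentence and phrase): sets forget word order, so issuperset reports a match even when the four words never occur as a consecutive run.

```python
# Hypothetical example data, not taken from the question.
sentence = 'Emergency staff help personnel and victims'.split()
phrase = ['Emergency', 'personnel', 'help', 'victims']

# Set test: True, because every word of the phrase occurs somewhere.
set_match = set(sentence).issuperset(phrase)

# Contiguity test: compare every window of len(phrase) consecutive words.
window = len(phrase)
run_match = any(sentence[i:i + window] == phrase
                for i in range(len(sentence) - window + 1))

print(set_match)  # True
print(run_match)  # False
```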


1 answer
Answer 1 · Posted 2024-06-28 10:21:08

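I suggest taking every run of 4 consecutive tokens from the tokenized list and building a set of those runs. With Python's itertools module this can be done quite elegantly: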

import itertools

my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
i1 = itertools.islice(my_list, 0, None)
i2 = itertools.islice(my_list, 1, None)
i3 = itertools.islice(my_list, 2, None)
i4 = itertools.islice(my_list, 3, None)
print zip(i1, i2, i3, i4)

Output of the code above (nicely formatted):

[('The', 'quick', 'brown', 'fox'),
 ('quick', 'brown', 'fox', 'jumps'),
 ('brown', 'fox', 'jumps', 'over'),
 ('fox', 'jumps', 'over', 'the'),
 ('jumps', 'over', 'the', 'lazy'),
 ('over', 'the', 'lazy', 'dog')]

Actually, this is even more elegant:

my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
iterators = [itertools.islice(my_list, x, None) for x in range(4)]
print zip(*iterators)

The output is the same as before.
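The same idea can be wrapped in a small reusable helper. This is only a sketch for Python 3; the name ngrams and the parameter n are my own, not part of the answer's code. It assumes tokens is a list or another re-iterable sequence, because each islice needs its own pass over the data.

```python
from itertools import islice

def ngrams(tokens, n=4):
    """Return every run of n consecutive tokens as a list of n-tuples."""
    # One iterator per offset 0..n-1. zip stops at the shortest one,
    # so incomplete groups at the end of the list are dropped.
    iterators = [islice(tokens, offset, None) for offset in range(n)]
    return list(zip(*iterators))

words = ['The', 'quick', 'brown', 'fox', 'jumps', 'over']
print(ngrams(words))       # three overlapping 4-tuples
print(ngrams(words[:3]))   # [] because there are fewer than 4 tokens
```

A list with fewer than n tokens simply produces no groups, which matters later: such a sentence can never be flagged as overlapping another one.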

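Now that you have every run of four consecutive tokens for each list (as 4-tuples), you can put those tuples into a set and check whether the same 4-tuple appears in two different sets: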

my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
set1 = set(zip(*[itertools.islice(my_list, x, None) for x in range(4)]))

other_list = ['The', 'quick', 'red', 'fox', 'goes', 'home']
set2 = set(zip(*[itertools.islice(other_list, x, None) for x in range(4)]))

print set1.intersection(set2) # Empty set
if set1.intersection(set2):
    print "Found something in common"
else:
    print "Nothing in common"
# Prints "Nothing in common"

third_list = ['The', 'quick', 'brown', 'fox', 'goes', 'to', 'school']
set3 = set(zip(*[itertools.islice(third_list, x, None) for x in range(4)]))

print set1.intersection(set3) # Set containing ('The', 'quick', 'brown', 'fox')
if set1.intersection(set3):
    print "Found something in common"
else:
    print "Nothing in common"
# Prints "Found something in common"
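The comparison above can also be folded into a single predicate. The following is a Python 3 sketch with hypothetical names (ngram_set, share_ngram); it uses set.isdisjoint, which answers the same question as testing whether the intersection is non-empty but without building the intersection set.

```python
from itertools import islice

def ngram_set(tokens, n=4):
    """Set of all runs of n consecutive tokens (as tuples)."""
    return set(zip(*[islice(tokens, i, None) for i in range(n)]))

def share_ngram(tokens_a, tokens_b, n=4):
    """True if both token lists contain at least one identical run of n tokens."""
    return not ngram_set(tokens_a, n).isdisjoint(ngram_set(tokens_b, n))

my_list = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
other_list = ['The', 'quick', 'red', 'fox', 'goes', 'home']
third_list = ['The', 'quick', 'brown', 'fox', 'goes', 'to', 'school']

print(share_ngram(my_list, other_list))  # False
print(share_ngram(my_list, third_list))  # True
```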

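Note: if you are using Python 3, replace every print "Something" statement with print("Something"): in Python 3, print is a function rather than a statement. In Python 3, zip also returns an iterator instead of a list, so wrap it in list(...) before printing it. But if you are using NLTK, I suspect you are on Python 2.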

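Important: any itertools.islice object you create will iterate over its underlying list once and then be "exhausted" (it has already returned all of its data, so putting it in a second for loop produces nothing). If you want to walk the same list several times, create several iterators (as I did in the examples).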
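A short Python 3 demonstration of that exhaustion behaviour (the variable names are only for illustration):

```python
from itertools import islice

my_list = ['The', 'quick', 'brown', 'fox']
it = islice(my_list, 1, None)

first_pass = list(it)    # consumes the iterator
second_pass = list(it)   # the iterator is exhausted, nothing is left

print(first_pass)    # ['quick', 'brown', 'fox']
print(second_pass)   # []

# Creating a new islice starts over from the underlying list.
fresh = list(islice(my_list, 1, None))
print(fresh)         # ['quick', 'brown', 'fox']
```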

Update: here is how to eliminate the lower-scoring sentences. First, replace this line:

tokenized_sents=[tokenize_words(sentence) for score,sentence in sent_scores]

with:

tokenized_sents=[(score,tokenize_words(sentence)) for score,sentence in sent_scores]

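Now you have a list of (score, tokenized sentence) tuples. Next we will construct a list called scores_and_sentences_and_sets, which will be a list of (score, sentence, set_of_four_word_groups) tuples (where set_of_four_word_groups is the set of four-word runs built as in the examples above):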

scores_and_sentences_and_sets = [(score, sentence, set(zip(*[itertools.islice(sentence, x, None) for x in range(4)]))) for score,sentence in tokenized_sents]

Actually, that line may be a bit too clever, so let's break it up to make it more readable:

scores_and_sentences_and_sets = []
for score, sentence in tokenized_sents:
    set_of_four_word_groups = set(zip(*[itertools.islice(sentence, x, None) for x in range(4)]))
    score_sentence_and_sets_tuple = (score, sentence, set_of_four_word_groups)
    scores_and_sentences_and_sets.append(score_sentence_and_sets_tuple)

Go ahead and experiment with these two snippets; you will find that they do exactly the same thing.

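OK, now we have a list of (score, sentence, set of four-word groups) tuples. We will walk through the list in order and build up a result list containing only the sentences we want to keep. Because the list is already sorted in descending order, things are a bit simpler: at any point in the list, we only need to look at the items that have already been "accepted" and see whether any of them shares a four-word group with the current item. If an accepted item does overlap the item we are looking at, we do not even need to compare scores, because the accepted item came earlier in the list, so its score must be at least as high as the current one.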

So here is some code that does what you need:

accepted_items = []
for current_tuple in scores_and_sentences_and_sets:
    score, sentence, set_of_four_words = current_tuple
    found = False
    for accepted_tuple in accepted_items:
        accepted_score, accepted_sentence, accepted_set = accepted_tuple
        if set_of_four_words.intersection(accepted_set):
            found = True
            break
    if not found:
        accepted_items.append(current_tuple)
print accepted_items # Prints a whole bunch of tuples
sentences_only = [sentence for score, sentence, word_set in accepted_items]
print sentences_only # Prints just the sentences
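For reference, the whole pipeline can be sketched as one self-contained Python 3 function. The names four_word_groups and drop_overlapping are mine, and the default tokenizer is plain str.split, which is an assumption made only to avoid the NLTK dependency; pass the question's tokenize_words as the tokenize argument to reproduce the original behaviour.

```python
from itertools import islice

def four_word_groups(tokens, n=4):
    """Set of all runs of n consecutive tokens."""
    return set(zip(*[islice(tokens, i, None) for i in range(n)]))

def drop_overlapping(sent_scores, tokenize=str.split, n=4):
    """Keep only sentences that share no n-word run with a kept, higher-ranked one.

    sent_scores must already be sorted by score, highest first.
    """
    accepted = []        # (score, sentence) tuples we keep, in order
    accepted_sets = []   # the n-gram set of each accepted sentence
    for score, sentence in sent_scores:
        groups = four_word_groups(tokenize(sentence), n)
        if any(not groups.isdisjoint(seen) for seen in accepted_sets):
            continue     # overlaps an earlier, higher-scoring sentence
        accepted.append((score, sentence))
        accepted_sets.append(groups)
    return accepted

sent_scores = [
    (0.10507038451969995, 'Deadly stampede in Shanghai - Emergency personnel help victims.'),
    (0.078586381821416265, 'Deadly stampede in Shanghai - Police and medical staff help injured people after the stampede.'),
    (0.072031446647399661, '- Emergency personnel help victims.'),
]
result = drop_overlapping(sent_scores)
print(result)   # only the first tuple remains
```

With whitespace tokenization, the second sentence is dropped because it shares "Deadly stampede in Shanghai" with the first, and the third because it shares "- Emergency personnel help". Sentences shorter than n tokens produce an empty set and are therefore always kept.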
