Recursively grouping sentences based on conjunctions

Posted 2024-10-03 15:24:01


I have a list of sentences, for example:

Sentence 1.
And Sentence 2.
Or Sentence 3.
New Sentence 4.
New Sentence 5.
And Sentence 6.

I am trying to group these sentences by a "conjunction criterion": if the next sentence starts with a conjunction (currently just "and" or "or"), I want to attach it to the previous group, so that the result is:

[['Sentence 1.', 'And Sentence 2.', 'Or Sentence 3.'], ['New Sentence 4.'], ['New Sentence 5.', 'And Sentence 6.']]

I wrote the code below, which detects some of the consecutive sentences, but not all of them.

How can I write this recursively? I tried to write it iteratively, but there are cases where it doesn't work, and I can't figure out how to express it recursively.

import nltk

# Note: conjucture_list is not defined in the original snippet; it is assumed
# to hold the conjunctions of interest, e.g.:
conjucture_list = ["and", "or"]

tokens = ["Sentence 1.","And Sentence 2.","Or Sentence 3.","New Sentence 4.","New Sentence 5.","And Sentence 6."]
already_selected = []
attachlist = {}
for i in tokens:
    attachlist[i] = []

for i in range(len(tokens)):
    if i in already_selected:
        pass
    else:
        for j in range(i+1, len(tokens)):
            if j not in already_selected:
                first_word = nltk.tokenize.word_tokenize(tokens[j].lower())[0]
                if first_word in conjucture_list:
                    attachlist[tokens[i]].append(tokens[j])
                    already_selected.append(j)
                else:
                    break

3 answers

This problem is better solved iteratively rather than recursively, because the output only needs a single level of grouping. If you are looking for a recursive solution, please give an example with arbitrary levels of grouping.

def is_conjunction(sentence):
    return sentence.startswith('And') or sentence.startswith('Or')

tokens = ["Sentence 1.","And Sentence 2.","Or Sentence 3.",
          "New Sentence 4.","New Sentence 5.","And Sentence 6."]
def group_sentences_by_conjunction(sentences):
    result = []
    for s in sentences:
        if result and not is_conjunction(s):
            yield result #flush the last group
            result = []
        result.append(s)
    if result:
        yield result #flush the rest of the result buffer

>>> groups = group_sentences_by_conjunction(tokens)

Using yield is preferable when the result might not fit in memory, for example when reading all the sentences of a book stored in a file. If for some reason you need the result as a list, use

>>> groups = list(group_sentences_by_conjunction(tokens))

Result:

[['Sentence 1.', 'And Sentence 2.', 'Or Sentence 3.'], ['New Sentence 4.'], ['New Sentence 5.', 'And Sentence 6.']]

If you need group numbers, use enumerate(groups).
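For example, a minimal sketch of numbering the groups with enumerate, reusing group_sentences_by_conjunction and tokens from above:

for group_number, group in enumerate(group_sentences_by_conjunction(tokens)):
    print(group_number, group)
# 0 ['Sentence 1.', 'And Sentence 2.', 'Or Sentence 3.']
# 1 ['New Sentence 4.']
# 2 ['New Sentence 5.', 'And Sentence 6.']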

is_conjunction has the same weakness mentioned in the other answers (it also matches sentences like "Andy ..."). Modify it as needed to fit your criteria.
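One possible tightening, as an assumption rather than part of the original answer, is to compare the first whitespace-delimited word case-insensitively:

def is_conjunction(sentence):
    # Compare only the first word, so "Andy ..." or "Orwell ..." are not treated as conjunctions.
    words = sentence.split()
    return bool(words) and words[0].lower() in ("and", "or")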

I like embedded iterators and generic code, so here is a very generic approach:

import re

class split_by:
    """Lazily split an iterable into sections, starting a new section
    whenever the predicate is true for an element."""

    def __init__(self, iterable, predicate=None):
        self.iter = iter(iterable)
        self.predicate = predicate or bool

        # Pre-fetch the first element so each section knows its head.
        try:
            self.head = next(self.iter)
        except StopIteration:
            self.finished = True
        else:
            self.finished = False

    def __iter__(self):
        return self

    def _section(self):
        # Yield the stored head, then keep yielding until the predicate
        # signals the start of the next section; that element is kept in
        # self.head as the head of the following section.
        yield self.head

        for self.head in self.iter:
            if self.predicate(self.head):
                break

            yield self.head

        else:
            # The underlying iterator is exhausted.
            self.finished = True

    def __next__(self):
        if self.finished:
            raise StopIteration

        section = self._section()
        return section

[list(x) for x in split_by(tokens, lambda sentence: not re.match("(?i)or|and", sentence))]
#>>> [['Sentence 1.', 'And Sentence 2.', 'Or Sentence 3.'], ['New Sentence 4.'], ['New Sentence 5.', 'And Sentence 6.']]

It is longer, but it uses O(1) extra space and takes a predicate of your choice.
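As a minimal sketch of reusing split_by with a different predicate (the numbers example is an assumption, not part of the original answer), you can start a new group at every negative number:

numbers = [1, 2, -3, 4, -5, 6, 7]
print([list(group) for group in split_by(numbers, lambda n: n < 0)])
# [[1, 2], [-3, 4], [-5, 6, 7]]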

tokens = ["Sentence 1.","And Sentence 2.","Or Sentence 3.",
          "New Sentence 4.","New Sentence 5.","And Sentence 6."]
result = list()
for token in tokens:
    # Trailing whitespace in the prefixes guards against cases like "Andy ..." and "Orwell ...".
    # Note: this assumes the first sentence does not start with a conjunction,
    # otherwise result[-1] would fail on the empty list.
    if not token.startswith("And ") and not token.startswith("Or "):
        result.append([token])
    else:
        result[-1].append(token)

Result:

[['Sentence 1.', 'And Sentence 2.', 'Or Sentence 3.'], ['New Sentence 4.'], ['New Sentence 5.', 'And Sentence 6.']]
