How can I efficiently build a sentence-level inverted index?

Posted 2024-10-05 14:24:00


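I am trying to build a sentence-level inverted index on top of a document-level inverted index. The document-level inverted index I already have looks like this: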

'aaa': [doc1, doc2, doc3]
'bbb': [doc1, doc3]
'ccc': [doc2]
......

and a set of documents whose full text, split into sentences, looks like this:

doc1: ['this is a sentence containing aaa.', 'sentence containing bbb.', 'other sentence']
doc2: ['other sentence', 'sentence containing aaa', 'ccc']
doc3: ['aaa', 'bbb']
......

The sentence-level inverted index I want is:

'aaa': [doc1-sent0, doc2-sent1, doc3-sent0]
'bbb': [doc1-sent1, doc3-sent1]
'ccc': [doc2-sent2]
.......

First, I look up all the related words for each document. Taking doc1 as an example:

doc1: ['aaa', 'bbb']
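
As a rough sketch of this lookup step, the per-document word lists can be derived by inverting the document-level index once. The names `doc_index`, `invert`, and `words_per_doc` are illustrative only, and document IDs are assumed to be plain strings.

```python
from collections import defaultdict

# Document-level inverted index from the example above (IDs assumed to be strings).
doc_index = {
    'aaa': ['doc1', 'doc2', 'doc3'],
    'bbb': ['doc1', 'doc3'],
    'ccc': ['doc2'],
}

def invert(doc_index):
    """Turn word -> [doc ids] into doc id -> [words] (a forward index)."""
    words_per_doc = defaultdict(list)
    for word, doc_ids in doc_index.items():
        for doc_id in doc_ids:
            words_per_doc[doc_id].append(word)
    return dict(words_per_doc)

words_per_doc = invert(doc_index)
print(words_per_doc['doc1'])  # ['aaa', 'bbb']
```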

Then I iterate over every sentence of every document with a regular expression:

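import re

for doc in documents:
    result = {}  # sentence number -> first related word matched in that sentence
    for i, sent in enumerate(doc.sentences):
        # related_word_list holds this document's words, e.g. ['aaa', 'bbb'] for doc1
        s = re.search('|'.join(related_word_list), sent)
        if s is not None:
            result[i] = s.group(0)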

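This code works correctly, but it is very slow: about 3 seconds per document (the documents are scientific papers averaging 8K to 9K words). Is there a faster approach? Reaching 10 documents per second would be great.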

I am not using multiprocessing. Can I improve performance without it?
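
One possible direction that does not need multiprocessing is to tokenize each sentence once and intersect its tokens with a set of the document's related words, instead of running a regex alternation per sentence. The sketch below assumes index terms are single word-like tokens matched case-sensitively as whole words (unlike the `re.search` loop, which also matches substrings and records only the first hit per sentence). `build_sentence_index`, `documents`, and `words_per_doc` are hypothetical names, and no timing has been measured.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r'\w+')  # compiled once; assumes terms are runs of word characters

def build_sentence_index(documents, words_per_doc):
    """documents: doc id -> list of sentences; words_per_doc: doc id -> related words.
    Returns word -> list of (doc id, sentence number) pairs."""
    sent_index = defaultdict(list)
    for doc_id, sentences in documents.items():
        related = set(words_per_doc.get(doc_id, ()))  # set gives fast membership tests
        for i, sent in enumerate(sentences):
            # Tokenize the sentence once, then keep only the related words it contains.
            for word in related.intersection(TOKEN.findall(sent)):
                sent_index[word].append((doc_id, i))
    return dict(sent_index)

# Example data copied from the question.
documents = {
    'doc1': ['this is a sentence containing aaa.', 'sentence containing bbb.', 'other sentence'],
    'doc2': ['other sentence', 'sentence containing aaa', 'ccc'],
    'doc3': ['aaa', 'bbb'],
}
words_per_doc = {'doc1': ['aaa', 'bbb'], 'doc2': ['aaa', 'ccc'], 'doc3': ['aaa', 'bbb']}

sent_index = build_sentence_index(documents, words_per_doc)
print(sent_index['aaa'])  # [('doc1', 0), ('doc2', 1), ('doc3', 0)]
```

Each `(doc id, sentence number)` pair corresponds to an entry such as `doc1-sent0` in the desired index. Multi-word terms or substring matching would need a different matcher than this whole-token comparison.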
