Python: Getting the text before and after a keyword

keywords = ("banana", "apple", "orange", ...)
before = 50
after = 100
TEXT = "a big text string, i.e., a page of a book"

for k in keywords:
    if k in TEXT:
        # cut = portion of text starting 'beforeText' chars before occurrence of 'k' and ending 'afterText' chars after occurrence of 'k'
        # finalcut = 'cut' with first and last WORDS trimmed to assure starting words are not cut in the middle

3 Answers

User

#1 · edited 2024-09-28 18:55:03

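You can use re.finditer to find all matches in the string. Each match object has a start() method that gives the position of the match in the string. You also don't need to check whether the keyword is in the string, because finditer simply returns an empty iterator when there is no match: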

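import re

keywords = ("banana", "apple", "orange")  # ... and so on
before = 50
after = 100
TEXT = "a big text string,  i.e., a page of a book"

for k in keywords:
    # re.escape guards against keywords that contain regex metacharacters
    for match in re.finditer(re.escape(k), TEXT):
        position = match.start()
        # max() is needed because a negative start index would count from the end of TEXT
        cut = TEXT[max(position - before, 0):position + after]
        # re.DOTALL lets '.' match newlines inside the cut
        trimmed_match = re.match(r"\w*?\W+(.*)\W+\w*", cut, re.DOTALL)
        # fall back to the untrimmed cut if it has nothing to trim
        finalcut = trimmed_match.group(1) if trimmed_match else cut

The regex trims everything up to and including the first run of non-word characters, and everything from the last non-word character onward. re.DOTALL is passed in case the text contains newlines, so that . matches them too (re.MULTILINE only changes how ^ and $ behave).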

User

#2 · edited 2024-09-28 18:55:03

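import re
import string

# keywords, before, after and TEXT are defined as in the question
alphabet = string.ascii_letters  # string.lowercase + string.uppercase in Python 2
pattern = "|".join(re.escape(k) for k in keywords)
regex1 = re.compile("(%s)" % pattern)   # keyword anywhere in the text
regex2 = re.compile("^(%s)" % pattern)  # keyword at the very start of the cut
regex3 = re.compile("(%s)$" % pattern)  # keyword at the very end of the cut

for match in regex1.finditer(TEXT):
    cut = TEXT[max(match.start() - before, 0) : match.end() + after]
    finalcut = cut
    # strip the leading word unless the cut begins with a keyword
    if not regex2.search(cut):
        finalcut = finalcut.lstrip(alphabet)
    # strip the trailing word unless the cut ends with a keyword
    if not regex3.search(cut):
        finalcut = finalcut.rstrip(alphabet)
    print(cut, finalcut)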

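This could be improved further: there are only two situations in which a keyword can sit at the very beginning or end of the text, and in those it should not be stripped.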

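A possible refinement along those lines, sketched here under assumptions: instead of testing the cut against the keyword list, look at the characters of the text just outside the cut and strip letters only when the cut really begins or ends in the middle of a word. The helper name context_cut, its parameters, and the final whitespace strip are hypothetical choices, not taken from the answer.

```python
import re
import string

ALPHABET = string.ascii_letters

def context_cut(text, match, before, after):
    """Cut around a regex match and drop partial words at either edge."""
    start = max(match.start() - before, 0)
    end = min(match.end() + after, len(text))
    cut = text[start:end]
    # The left edge splits a word only if the character just before
    # the cut is a letter.
    if start > 0 and text[start - 1] in ALPHABET:
        cut = cut.lstrip(ALPHABET)
    # Same check on the right, against the character just after the cut.
    if end < len(text) and text[end] in ALPHABET:
        cut = cut.rstrip(ALPHABET)
    # Assumption: surrounding whitespace is not wanted in the result.
    return cut.strip()

# Hypothetical sample text for illustration.
sample = "I like to eat a ripe banana every single morning."
m = re.search("banana", sample)
print(context_cut(sample, m, before=9, after=10))  # a ripe banana every
print(context_cut(sample, m, before=5, after=6))   # ripe banana every
```

In the second call the cut happens to start and end exactly on word boundaries, so nothing is stripped and complete words are kept.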

User

#3 · edited 2024-09-28 18:55:03

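You need to rethink your algorithm. As written, it is O(n*m), where n is the number of keywords and m is the length of the text. That will not scale well.

Instead:

Make keywords a set rather than a tuple. You only care about membership tests against keywords, and set membership tests are O(1).
You need to tokenize TEXT. This is a bit more involved than just calling split(), because you also have to deal with removing punctuation and newlines.
Finally, use a "sliding window" iterator to walk over the tokens in chunks of 3. If the middle token is in the keywords set, grab the tokens around it and move on.

That's it. So, some pseudocode: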

keywords = {"banana", "apple", "orange", ...}
tokens = tokenize(TEXT)

for before, target, after in window(tokens, n=3):
    if target in keywords:
        #do stuff with `before` and `after`

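Here, window is a sliding-window implementation of your choice, and tokenize is either your own implementation involving split plus stripping of punctuation or, if you want a library solution, nltk.tokenize.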
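A runnable sketch of that pseudocode, using only the standard library. The regex-based tokenize (which also lowercases, an added assumption), the deque-based window, and the sample text are all illustrative choices rather than the implementations the answer had in mind.

```python
import re
from collections import deque

def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation and newlines."""
    return re.findall(r"\w+", text.lower())

def window(seq, n=3):
    """Yield successive overlapping tuples of length n from seq."""
    buf = deque(maxlen=n)
    for item in seq:
        buf.append(item)
        if len(buf) == n:
            yield tuple(buf)

keywords = {"banana", "apple", "orange"}
# Hypothetical sample text for illustration.
TEXT = "I ate a ripe banana, then an apple.\nOranges came later."

hits = []
for prev_tok, target, next_tok in window(tokenize(TEXT), n=3):
    if target in keywords:  # O(1) set membership test
        hits.append((prev_tok, target, next_tok))

print(hits)
# [('ripe', 'banana', 'then'), ('an', 'apple', 'oranges')]
```

Two limits of this sketch: matching is on exact tokens, so "oranges" does not match "orange", and a keyword that is the very first or last token never sits in the middle of a window, so it is skipped unless the token list is padded.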
