How do I create a Google-like text snippet from a string in Python?

Posted 2024-10-03 23:22:12


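I'm trying to build something similar to Google's text snippets. A Google snippet contains the highlighted keyword and nicely "shifts" the text in case the keyword does not appear at the beginning of the analyzed string.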

For example:

Keyword: "nike"

干草堆串“lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lore ipsum dorlor lorem ipsum dorlor loreipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum也难怪nike是世界上最大的品牌多勒·洛勒姆·伊普苏姆·多勒

should become this snippet:

... lorem ipsum dorlor it is no wonder that nike is one of the largest brands in the world is not lorem ipsum dorlor lorem lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ...

So far, my idea is:

keywordPosition = haystack.lower().index(keyword.lower())
snippetStart = keywordPosition - 100
snippetEnd = keywordPosition + 200
haystack = " ..." + haystack[snippetStart:snippetEnd] + " ..."

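Is there an elegant way in Python to adjust snippetStart and snippetEnd dynamically? In many cases the approach above obviously raises an exception, because the haystack slice indices are out of range.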


1 Answer
Anonymous user
#1 · Posted 2024-10-03 23:22:12

I've created a small example with comments here.

http://pythonfiddle.com/google-snippet

haystack = "lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor it is no wonder that nike is one of the largest brands in the world is not lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor lorem ipsum dorlor"

needle = "nike"  # Use a value that is not in the haystack (e.g. "nike342") to exercise the not-found branch

lookahead = 7  # Number of tokens to show before "nike"

tokens = haystack.split(" ")  # Split string into a list of tokens

# Look up the index of the token that equals the needle.  list.index raises ValueError when the needle is absent, so the lookup happens inside the try block below.

try:
    found_index = tokens.index(needle)
    # Get the max of the found index minus the number of words to show before the needle, and 0
    found_index = max(found_index - lookahead, 0)        

    # Create a sub-list of the tokens from found_index to the end, then join those tokens back together with a space.
    snippet = " ".join(tokens[found_index:])

except ValueError:
    snippet = ""  # No snippet or whatever error handling you are going to do

print(snippet)
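
As a follow-up to the character-based approach in the question, here is a minimal sketch that clamps the window to the string bounds. The names `make_snippet`, `highlight`, `before`, and `after` are made up for illustration, and wrapping the keyword in `<b>` tags is only an assumption about how the highlighting might be rendered. Two details are worth knowing: `str.index` raises `ValueError` when the keyword is missing, while a slice never raises for out-of-range indices. Instead, a negative start counts from the end of the string, so `keywordPosition - 100` can silently produce the wrong text or an empty string when the keyword sits near the beginning.

```python
import re


def make_snippet(haystack, keyword, before=100, after=200):
    """Return a window of haystack around the first case-insensitive match.

    Hypothetical helper: assumes lower() does not change the string length,
    which holds for ASCII text like the example in the question.
    """
    position = haystack.lower().find(keyword.lower())  # find returns -1 instead of raising
    if position == -1:
        return ""  # keyword not present, so there is nothing to show

    # Clamp both ends so the slice never wraps around or runs past the end.
    start = max(position - before, 0)
    end = min(position + len(keyword) + after, len(haystack))

    # Only add an ellipsis on a side where text was actually cut off.
    prefix = "... " if start > 0 else ""
    suffix = " ..." if end < len(haystack) else ""
    return prefix + haystack[start:end] + suffix


def highlight(snippet, keyword):
    """Wrap every case-insensitive match of keyword in <b> tags (assumed markup)."""
    pattern = re.escape(keyword)  # treat the keyword as literal text, not a regex
    return re.sub(pattern, lambda m: "<b>" + m.group(0) + "</b>", snippet, flags=re.IGNORECASE)
```

Calling `highlight(make_snippet(haystack, "nike"), "nike")` on the haystack above would give a snippet that starts 100 characters before the keyword. The cut can land in the middle of a word, so the token-based version may read better if whole words matter.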
