How can I get the original sentence from a text file in Python when I know a word's offset?

2 Answers

User

Answer 1 · edited 2024-09-28 16:21:51

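If you already know the position of the word, tokenization is not what you want. Tokenizing turns a sequence in which you know the word's position into a list of words in which you no longer know which element is your word.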
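A minimal sketch of that point, assuming a made-up sample sentence (the sentence and variable names are illustrative only): slicing by character offset finds the word directly, while splitting into tokens leaves you with a token index that the character offset cannot give you.

```python
# Hypothetical sample sentence, used only for illustration.
sentence = "Ceci est une wheat phrase corn."

# Slicing by character offset keeps the position information.
word_by_offset = sentence[13:18]

# Splitting yields a list of tokens; the character offsets are gone and the
# word now lives at a token index unrelated to the offset 13.
tokens = sentence.split()
token_index = tokens.index("wheat")

print(word_by_offset)  # wheat
print(token_index)     # 3
```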

So keep the text as whole phrases and simply compare the relevant slice of each phrase with your word:

with open("test.txt") as f:
    # Compare the slice at the known offset on each line with the word.
    for phrase in f:
        # .lower() is only necessary if the word might be in upper case.
        if phrase[13:18].lower() == "wheat":
            print(phrase.rstrip("\n"))

This returns only the sentences in which wheat sits at position [13:18]. Occurrences of wheat at any other position will not be found.
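To illustrate that limitation, here is a small sketch with hypothetical in-memory lines standing in for the contents of test.txt. Only the line with the word at exactly [13:18] is reported; the other two contain the word but are skipped.

```python
# Hypothetical lines standing in for the contents of test.txt.
lines = [
    "Ceci est une wheat phrase corn.",
    "This is the third wheat word.",
    "Wheat at the very start.",
]

# Same comparison as in the answer: only a match at [13:18] counts.
matches = [line for line in lines if line[13:18].lower() == "wheat"]

print(matches)  # ['Ceci est une wheat phrase corn.']
```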

User

Answer 2 · edited 2024-09-28 16:21:51

If you are looking for raw speed, the standard library is probably the best approach.

# Generate a large text file with 10,000,001 lines.
with open('very-big.txt', 'w') as file:
    for _ in range(10000000):
        file.write("All work and no play makes Jack a dull boy.\n")
    file.write("Finally we get to the line containing the word 'wheat'.\n")

Given search_word and its offset within the line we are looking for, we can compute the limit used for the string comparison:

search_word = 'wheat'
offset = 48
limit = offset + len(search_word)
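As a quick sanity check of those values, the sketch below derives the offset from the target line itself with str.index (an illustrative shortcut; in the question the offset is already known) and confirms that the slice [offset:limit] is exactly the search word.

```python
# The target line written by the file generator above.
line = "Finally we get to the line containing the word 'wheat'.\n"
search_word = "wheat"

# str.index returns the character position of the first occurrence.
offset = line.index(search_word)
limit = offset + len(search_word)

print(offset, limit, line[offset:limit])  # 48 53 wheat
```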

The simplest approach is to iterate over the enumerated lines of the text and perform a string comparison on each line:

with open('very-big.txt', 'r') as file:
    for line, text in enumerate(file, start=1):
        if (text[offset:limit] == search_word):
            print(f'Line {line}: "{text.strip()}"')

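This solution was timed on a 2012 Mac mini (2.3 GHz i7 CPU). That seems fairly fast for processing 10,000,001 lines, but it can be improved by checking the length of the text before attempting the string comparison: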

with open('very-big.txt', 'r') as file:
    for line, text in enumerate(file, start=1):
        if (len(text) >= limit) and (text[offset:limit] == search_word):
            print(f'Line {line}: "{text.strip()}"')

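The runtime of the improved solution is 71 ms on the same computer. That is a significant improvement, but of course your mileage will vary depending on the text file.

Generated output: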

Line 10000001: "Finally we get to the line containing the word 'wheat'."
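The answer does not show how the runtimes were measured. One way to compare the two variants yourself is sketched below, assuming time.perf_counter for timing and a much smaller in-memory stand-in for very-big.txt (100,000 filler lines); the scan helper and its check_length flag are hypothetical names introduced for this example, and the timings will differ from machine to machine.

```python
import io
import time

search_word = "wheat"
offset = 48
limit = offset + len(search_word)

# Assumption: a smaller in-memory stand-in for very-big.txt.
data = "All work and no play makes Jack a dull boy.\n" * 100_000
data += "Finally we get to the line containing the word 'wheat'.\n"


def scan(check_length):
    """Return the numbers of the lines that hold search_word at offset."""
    hits = []
    for line, text in enumerate(io.StringIO(data), start=1):
        if check_length and len(text) < limit:
            continue  # too short to contain the word at this offset
        if text[offset:limit] == search_word:
            hits.append(line)
    return hits


for check_length in (False, True):
    start = time.perf_counter()
    hits = scan(check_length)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"check_length={check_length}: lines {hits} in {elapsed_ms:.1f} ms")
```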

Edit: including file offset information

with open('very-big.txt', 'r') as file:
    file_offset = 0
    for line, text in enumerate(file, start=1):
        line_length = len(text)
        if line_length >= limit and (text[offset:limit] == search_word):
            print(f'[{file_offset + offset}, {file_offset + limit}] Line {line}: "{text.strip()}"')
        file_offset += line_length

Sample output:

[440000048, 440000053] Line 10000001: "Finally we get to the line containing the word 'wheat'."

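One more edit

This code checks whether the known offset of the text lies between the offsets of the start and the end of the current line. The text found at that offset is also verified.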

long_string = """Ceci est une wheat phrase corn.
Ceci est une deuxième phrase barley.
This is the third wheat word.
"""

import io

search_word = 'barley'
known_offset = 61
limit = known_offset + len(search_word)

# Use the multi-line string defined above as file input
with io.StringIO(long_string) as file:
    file_offset = 0
    for line, text in enumerate(file, start=1):
        line_length = len(text)
        # Inclusive lower bound, so a word at the very start of a line also matches.
        if (file_offset <= known_offset < file_offset + line_length
                and text[known_offset - file_offset:limit - file_offset] == search_word):
            print(f'[{known_offset},{limit}]\nLine: {line}\n{text}')
        file_offset += line_length

Output:

[61,67]
Line: 2
Ceci est une deuxième phrase barley.
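The same idea can be wrapped in a reusable function. The sketch below is one possible arrangement, not the answerer's code: sentence_at is a hypothetical helper name, it uses an inclusive lower bound so that a word at the very start of a line also matches, and it stops reading as soon as the line containing the offset has been reached. Offsets are character offsets into the text, as in the example above.

```python
import io


def sentence_at(file, known_offset, search_word):
    """Return (line_number, text) if search_word starts at known_offset, else None."""
    file_offset = 0
    for line, text in enumerate(file, start=1):
        line_end = file_offset + len(text)
        if file_offset <= known_offset < line_end:
            start = known_offset - file_offset
            if text[start:start + len(search_word)] == search_word:
                return line, text.rstrip("\n")
            return None  # the offset is on this line, but the word differs
        file_offset = line_end
    return None  # the offset lies beyond the end of the text


sample = (
    "Ceci est une wheat phrase corn.\n"
    "Ceci est une deuxième phrase barley.\n"
    "This is the third wheat word.\n"
)

print(sentence_at(io.StringIO(sample), 61, "barley"))
print(sentence_at(io.StringIO(sample), 0, "Ceci"))
print(sentence_at(io.StringIO(sample), 61, "wheat"))
```

The first call returns line 2 with its text, the second finds a word at offset 0 on line 1, and the third returns None because the text at offset 61 is not wheat.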
