<p>If you're looking for raw speed, the standard library is probably the best approach.</p>
<pre><code># Generate a large text file with 10,000,001 lines.
with open('very-big.txt', 'w') as file:
    for _ in range(10000000):
        file.write("All work and no play makes Jack a dull boy.\n")
    file.write("Finally we get to the line containing the word 'wheat'.\n")
</code></pre>
<p>Given the <code>search_word</code> and its <code>offset</code> within the line we're looking for, we can compute the <code>limit</code> used for the string comparison.</p>
<pre><code>search_word = 'wheat'
offset = 48
limit = offset + len(search_word)
</code></pre>
<p>The simplest approach is to iterate over the enumerated lines of the text and perform a string comparison on each line.</p>
<pre><code>with open('very-big.txt', 'r') as file:
    for line, text in enumerate(file, start=1):
        if text[offset:limit] == search_word:
            print(f'Line {line}: "{text.strip()}"')
</code></pre>
<p>This solution was timed on a 2012 Mac mini (2.3 GHz i7 CPU). That seems reasonably fast for processing 10,000,001 lines, but it can be improved by checking the text length before attempting the string comparison.</p>
<pre><code>with open('very-big.txt', 'r') as file:
    for line, text in enumerate(file, start=1):
        if len(text) >= limit and text[offset:limit] == search_word:
            print(f'Line {line}: "{text.strip()}"')
</code></pre>
<p>The improved solution runs in <code>71 ms</code> on the same machine. That's a significant improvement, but of course your mileage will vary depending on the text file.</p>
<p>Resulting output:</p>
<pre><code>Line 10000001: "Finally we get to the line containing the word 'wheat'."
</code></pre>
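<p>If you want to reproduce the timing yourself, the measurement can be wrapped in a small harness like the sketch below. It uses <code>time.perf_counter</code> and a reduced in-memory sample (100 filler lines plus one matching line) instead of the full 10,000,001-line file; the <code>scan</code> helper is a name introduced here for illustration, and absolute numbers will of course differ per machine.</p>
<pre><code>import io
import time

search_word = 'wheat'
offset = 48
limit = offset + len(search_word)

def scan(file):
    """Return (line_number, stripped_text) for every matching line."""
    matches = []
    for line, text in enumerate(file, start=1):
        # Length check first, then the string comparison.
        if len(text) >= limit and text[offset:limit] == search_word:
            matches.append((line, text.strip()))
    return matches

# Small in-memory sample instead of the 10,000,001-line file.
sample = ("All work and no play makes Jack a dull boy.\n" * 100
          + "Finally we get to the line containing the word 'wheat'.\n")

start = time.perf_counter()
matches = scan(io.StringIO(sample))
elapsed = time.perf_counter() - start
print(matches)
print(f'{elapsed * 1000:.3f} ms')
</code></pre>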
<p><strong>Edit:</strong> including file offset information</p>
<pre><code>with open('very-big.txt', 'r') as file:
    file_offset = 0
    for line, text in enumerate(file, start=1):
        line_length = len(text)
        if line_length >= limit and text[offset:limit] == search_word:
            print(f'[{file_offset + offset}, {file_offset + limit}] Line {line}: "{text.strip()}"')
        file_offset += line_length
</code></pre>
<p>Sample output:</p>
<pre><code>[430000048, 430000053] Line 10000001: "Finally we get to the line containing the word 'wheat'."
</code></pre>
<p><strong>One more time</strong></p>
<p>This code checks whether the known offset of the text falls between the offsets of the start and the end of the current line. The text found at that offset is also verified.</p>
<pre><code>import io

long_string = """Ceci est une wheat phrase corn.
Ceci est une deuxième phrase barley.
This is the third wheat word.
"""

search_word = 'barley'
known_offset = 61
limit = known_offset + len(search_word)

# Use the multi-line string defined above as file input
with io.StringIO(long_string) as file:
    file_offset = 0
    for line, text in enumerate(file, start=1):
        line_length = len(text)
        if file_offset < known_offset < (file_offset + line_length) \
                and text[(known_offset - file_offset):(limit - file_offset)] == search_word:
            print(f'[{known_offset},{limit}]\nLine: {line}\n{text}')
        file_offset += line_length
</code></pre>
<p>Output:</p>
<pre><code>[61,67]
Line: 2
Ceci est une deuxième phrase barley.
</code></pre>
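<p>For reuse, the offset-range check can be packaged as a generator. The <code>find_at_offset</code> name below is hypothetical; the logic is the same as the loop above:</p>
<pre><code>import io

def find_at_offset(file, search_word, known_offset):
    """Yield (line_number, text) where search_word occurs at known_offset."""
    limit = known_offset + len(search_word)
    file_offset = 0
    for line, text in enumerate(file, start=1):
        line_length = len(text)
        # Offset must fall inside this line, and the text there must match.
        if (file_offset < known_offset < file_offset + line_length
                and text[known_offset - file_offset:limit - file_offset] == search_word):
            yield line, text
        file_offset += line_length

long_string = """Ceci est une wheat phrase corn.
Ceci est une deuxième phrase barley.
This is the third wheat word.
"""

with io.StringIO(long_string) as file:
    for line, text in find_at_offset(file, 'barley', 61):
        print(f'Line {line}: "{text.strip()}"')
</code></pre>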