<p>I found @timbiegeleisen's solution, with its complex regular expression and multiple substitutions, a bit confusing, so here's an alternative:</p>
<pre><code>import re
_file = """1
00:00:05.210 > 00:00:07.710
In this lecture, we're
going to talk about
2
00:00:07.710 > 00:00:10.815
pattern matching in strings
using regular expressions.
3
00:00:10.815 > 00:00:13.139
Regular expressions or regexes
4
00:00:13.139 > 00:00:15.825
are written in a condensed
formatting language.
"""
# matches blank lines, bare cue numbers like "1", and timestamp lines
non_fragments = re.compile(r'$|\d+($|:\d+.* > \d+.*$)')
full_text = " ".join([line for line in _file.splitlines() if not non_fragments.match(line)])
sentences = full_text.split('. ')
</code></pre>
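<p>To make the filter easier to follow, here is a quick check of which sample lines from <code>_file</code> the pattern matches; matched lines are the ones dropped from the output, and blank separator lines match the lone <code>$</code> alternative:</p>

```python
import re

# same pattern as above: matches blank lines, bare cue numbers, and timestamp lines
non_fragments = re.compile(r'$|\d+($|:\d+.* > \d+.*$)')

samples = ["1", "00:00:05.210 > 00:00:07.710", "In this lecture, we're", ""]
for line in samples:
    print(repr(line), "dropped" if non_fragments.match(line) else "kept")
```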
<p>This returns:</p>
<pre><code>print(full_text)
In this lecture, we're going to talk about pattern matching in strings using regular expressions. Regular expressions or regexes are written in a condensed formatting language.
print(sentences)
["In this lecture, we're going to talk about pattern matching in strings using regular expressions", 'Regular expressions or regexes are written in a condensed formatting language.']
</code></pre>
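<p>Note that <code>split('. ')</code> drops the period from every sentence except the last, as the output above shows. If the trailing periods matter, splitting on the whitespace that follows a period keeps them (a small variation, not part of the original answer):</p>

```python
import re

full_text = ("In this lecture, we're going to talk about pattern matching "
             "in strings using regular expressions. Regular expressions or "
             "regexes are written in a condensed formatting language.")

# split on the whitespace that follows a period, keeping the period itself
sentences = re.split(r'(?<=\.)\s+', full_text)
print(sentences)
```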
<hr/>
<p>As an extra (small) bonus, this option is at least twice as fast as using re.sub/re.findall.</p>
<p>It is fastest when the regex is precompiled. I haven't tested this on very large samples:</p>
<pre><code>%%timeit
_full_text = " ".join([line for line in _file.splitlines() if not non_fragments.match(line)])
_sentences = _full_text.split('. ')
6.75 µs ± 831 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
</code></pre>
<p>But even if we recompile the regex on every iteration, it is still faster than the re.sub approach:</p>
<pre><code>%%timeit
non_fragments = re.compile(r'$|\d+($|:\d+.* > \d+.*$)')
_full_text = " ".join([line for line in _file.splitlines() if not non_fragments.match(line)])
_sentences = _full_text.split('. ')
7.97 µs ± 1.13 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
</code></pre>
<p>That one takes at least twice as long. Not sure how it behaves on very large texts:</p>
<pre><code>%%timeit
output = re.sub(r'(?:^|\r?\n)\d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3} > \d{2}:\d{2}:\d{2}\.\d{3}\r?\n', '', _file)
output = re.sub(r'\r?\n', ' ', output)
sentences = re.findall(r'(.*?\.)\s*', output)
15.2 µs ± 423 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
</code></pre>
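<p>The <code>%%timeit</code> numbers above come from IPython; outside a notebook, the same comparison can be reproduced with the standard-library <code>timeit</code> module (a sketch using a trimmed-down copy of the snippets above, so the absolute numbers will differ):</p>

```python
import re
import timeit

_file = """1
00:00:05.210 > 00:00:07.710
In this lecture, we're
going to talk about
"""

non_fragments = re.compile(r'$|\d+($|:\d+.* > \d+.*$)')

def precompiled():
    # the answer's approach: filter lines with a precompiled pattern
    full_text = " ".join(
        line for line in _file.splitlines() if not non_fragments.match(line)
    )
    return full_text.split('. ')

def sub_findall():
    # the re.sub/re.findall approach being compared against
    output = re.sub(
        r'(?:^|\r?\n)\d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3} > \d{2}:\d{2}:\d{2}\.\d{3}\r?\n',
        '', _file)
    output = re.sub(r'\r?\n', ' ', output)
    return re.findall(r'(.*?\.)\s*', output)

print("precompiled :", timeit.timeit(precompiled, number=10_000))
print("sub/findall :", timeit.timeit(sub_findall, number=10_000))
```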