<p>I found @timbiegeleisen's solution, with its complex regular expression and multiple substitutions, a bit confusing, so here's an alternative:</p>
<pre><code>import re
_file = """1
00:00:05.210 > 00:00:07.710
In this lecture, we're
going to talk about
2
00:00:07.710 > 00:00:10.815
pattern matching in strings
using regular expressions.
3
00:00:10.815 > 00:00:13.139
Regular expressions or regexes
4
00:00:13.139 > 00:00:15.825
are written in a condensed
formatting language.
"""
# matches blank lines, bare cue numbers like "1", and timestamp lines
non_fragments = re.compile(r'$|\d+($|:\d+.* > \d+.*$)')
full_text = " ".join([line for line in _file.splitlines() if not non_fragments.match(line)])
sentences = full_text.split('. ')
</code></pre>
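<p>To make the filter easier to follow, here is a quick check of which sample lines from <code>_file</code> the pattern matches; matched lines are the ones dropped from the output, and blank separator lines match the lone <code>$</code> alternative:</p>

```python
import re

# same pattern as above: matches blank lines, bare cue numbers, and timestamp lines
non_fragments = re.compile(r'$|\d+($|:\d+.* > \d+.*$)')

samples = ["1", "00:00:05.210 > 00:00:07.710", "In this lecture, we're", ""]
for line in samples:
    print(repr(line), "dropped" if non_fragments.match(line) else "kept")
```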
<p>This returns:</p>
<pre><code>print(full_text)
In this lecture, we're going to talk about pattern matching in strings using regular expressions. Regular expressions or regexes are written in a condensed formatting language.
print(sentences)
["In this lecture, we're going to talk about pattern matching in strings using regular expressions", 'Regular expressions or regexes are written in a condensed formatting language.']
</code></pre>
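<p>Note that <code>split('. ')</code> drops the period from every sentence except the last, as the output above shows. If the trailing periods matter, splitting on the whitespace that follows a period keeps them (a small variation, not part of the original answer):</p>

```python
import re

full_text = ("In this lecture, we're going to talk about pattern matching "
             "in strings using regular expressions. Regular expressions or "
             "regexes are written in a condensed formatting language.")

# split on the whitespace that follows a period, keeping the period itself
sentences = re.split(r'(?<=\.)\s+', full_text)
print(sentences)
```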
<hr/>
<p>As an extra (small) bonus, this option is at least twice as fast as using re.sub/re.findall.</p>
<p>It is fastest when the regex is precompiled. I haven't tested this on very large samples:</p>
<pre><code>%%timeit
_full_text = " ".join([line for line in _file.splitlines() if not non_fragments.match(line)])
_sentences = _full_text.split('. ')
6.75 µs ± 831 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
</code></pre>
<p>But even if we recompile the regex on every iteration, it is still faster than the re.sub approach:</p>
<pre><code>%%timeit
non_fragments = re.compile(r'$|\d+($|:\d+.* > \d+.*$)')
_full_text = " ".join([line for line in _file.splitlines() if not non_fragments.match(line)])
_sentences = _full_text.split('. ')
7.97 µs ± 1.13 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
</code></pre>
<p>That one takes at least twice as long. Not sure how it behaves on very large texts:</p>
<pre><code>%%timeit
output = re.sub(r'(?:^|\r?\n)\d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3} > \d{2}:\d{2}:\d{2}\.\d{3}\r?\n', '', _file)
output = re.sub(r'\r?\n', ' ', output)
sentences = re.findall(r'(.*?\.)\s*', output)
15.2 µs ± 423 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
</code></pre>
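<p>The <code>%%timeit</code> numbers above come from IPython; outside a notebook, the same comparison can be reproduced with the standard-library <code>timeit</code> module (a sketch using a trimmed-down copy of the snippets above, so the absolute numbers will differ):</p>

```python
import re
import timeit

_file = """1
00:00:05.210 > 00:00:07.710
In this lecture, we're
going to talk about
"""

non_fragments = re.compile(r'$|\d+($|:\d+.* > \d+.*$)')

def precompiled():
    # the answer's approach: filter lines with a precompiled pattern
    full_text = " ".join(
        line for line in _file.splitlines() if not non_fragments.match(line)
    )
    return full_text.split('. ')

def sub_findall():
    # the re.sub/re.findall approach being compared against
    output = re.sub(
        r'(?:^|\r?\n)\d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3} > \d{2}:\d{2}:\d{2}\.\d{3}\r?\n',
        '', _file)
    output = re.sub(r'\r?\n', ' ', output)
    return re.findall(r'(.*?\.)\s*', output)

print("precompiled :", timeit.timeit(precompiled, number=10_000))
print("sub/findall :", timeit.timeit(sub_findall, number=10_000))
```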