How to use a regular expression to extract whole sentences from fragments in Python

3 Answers

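User

Answer 1 · edited 2024-09-29 02:24:02

Using re.sub, we can first remove the unwanted repeating text (the cue numbers and timestamp lines). A second replacement then turns the remaining newlines into single spaces, and re.findall collects each sentence up to its period: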

import re

inp = """1
00:00:05.210  > 00:00:07.710
In this lecture, we're
going to talk about

2
00:00:07.710  > 00:00:10.815
pattern matching in strings
using regular expressions.

3
00:00:10.815  > 00:00:13.139
Regular expressions or regexes

4
00:00:13.139  > 00:00:15.825
are written in a condensed
formatting language."""

output = re.sub(r'(?:^|\r?\n)\d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3}  > \d{2}:\d{2}:\d{2}\.\d{3}\r?\n', '', inp)
output = re.sub(r'\r?\n', ' ', output)
sentences = re.findall(r'(.*?\.)\s*', output)
print(sentences)

This prints:

["In this lecture, we're going to talk about pattern matching in strings using regular expressions.",
 'Regular expressions or regexes are written in a condensed formatting language.']
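
To see what each of the three calls contributes, the steps can be run one at a time on a shortened sample. This is a minimal sketch: `sample`, `header` and the `step1`..`step3` names are illustrative and not part of the answer, and the sample keeps the same two-space `  >` separator as the input above.

```python
import re

# Shortened sample in the same layout as the input above (illustrative)
sample = (
    "1\n"
    "00:00:05.210  > 00:00:07.710\n"
    "In this lecture, we're\n"
    "going to talk about\n"
    "\n"
    "2\n"
    "00:00:07.710  > 00:00:10.815\n"
    "pattern matching in strings\n"
    "using regular expressions."
)

header = r'(?:^|\r?\n)\d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3}  > \d{2}:\d{2}:\d{2}\.\d{3}\r?\n'

# Step 1: drop each cue number together with its timestamp line
step1 = re.sub(header, '', sample)
# Step 2: turn the remaining line breaks into single spaces
step2 = re.sub(r'\r?\n', ' ', step1)
# Step 3: lazily capture everything up to and including each period
step3 = re.findall(r'(.*?\.)\s*', step2)

print(repr(step1))
print(step3)
```

Note that the first pattern also consumes the newline of the blank line before each cue number, which is why the second step leaves exactly one space between fragments.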

User

Answer 2 · edited 2024-09-29 02:24:02

You could also match the structure of the data in the file, to make sure that it is there:

^\d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3}  >.*\r?\n((?:(?!\d+\r?\n\d\d:).*(?:\r?\n|$))*)

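Explanation

^ Start of the string (start of each line when re.M is used)
\d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3}  >.* Match 1+ digits, a newline and a time-like pattern
\r?\n Match a newline
( Capture group 1
- (?: Non-capture group
  - (?!\d+\r?\n\d\d:).*(?:\r?\n|$) Match the whole line if it does not start with a number followed by a time-like pattern
- )* Close the group and repeat 0+ times to match all lines
) Close group 1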
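
The same pattern can be written with re.VERBOSE so that the explanation sits next to each part. This is a sketch rather than the answerer's code: the `pattern`, `sample` and `groups` names are illustrative, and `[ ]{2}` stands in for the two literal spaces because verbose mode ignores unescaped whitespace.

```python
import re

pattern = re.compile(r"""
    ^\d+\r?\n                          # cue number on its own line
    \d{2}:\d{2}:\d{2}\.\d{3}[ ]{2}>    # start time, two spaces, then '>'
    .*\r?\n                            # rest of the timestamp line
    (                                  # group 1: the text of the cue
      (?:
        (?!\d+\r?\n\d\d:)              # stop where the next cue header begins
        .*(?:\r?\n|$)                  # otherwise take the whole line
      )*
    )
""", re.MULTILINE | re.VERBOSE)

# Shortened sample in the same layout as the question's data (illustrative)
sample = (
    "1\n00:00:05.210  > 00:00:07.710\nIn this lecture, we're\ngoing to talk about\n\n"
    "2\n00:00:07.710  > 00:00:10.815\npattern matching in strings\nusing regular expressions."
)

# findall returns only group 1, i.e. the text that follows each timestamp line
groups = pattern.findall(sample)
print(groups)
```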

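See the online regex demo and the Python demo.

The pattern captures all the text after each timestamp line in group 1, and re.findall returns those groups as a list.

Then join all the parts with an empty string, replace the newlines with a space, and split on 1 or more whitespace characters that follow a dot.

Example code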

import re

# `text` is assumed to hold the subtitle content shown in the question
regex = r"^\d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3}  >.*\r?\n((?:(?!\d+\r?\n\d\d:).*(?:\r?\n|$))*)"
content = [i for i in re.split(r"(?<=\.)\s+", re.sub(r"[\r\n]+", " ", "".join(re.findall(regex, text, re.M)))) if i]
print(content)

Output

["In this lecture, we're going to talk about pattern matching in strings using regular expressions.", 'Regular expressions or regexes are written in a condensed formatting language.']
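
The one-liner above packs four operations into a single expression. A step-by-step version may be easier to follow; `extract_sentences` is a hypothetical helper name and `sample` is a shortened, illustrative input, while the regex is the one from this answer.

```python
import re

regex = r"^\d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3}  >.*\r?\n((?:(?!\d+\r?\n\d\d:).*(?:\r?\n|$))*)"

def extract_sentences(text):
    """Hypothetical helper: the one-liner above, one step per line."""
    parts = re.findall(regex, text, re.M)        # text of each cue (group 1)
    joined = "".join(parts)                      # concatenate the cues
    flat = re.sub(r"[\r\n]+", " ", joined)       # runs of line breaks -> one space
    pieces = re.split(r"(?<=\.)\s+", flat)       # split on whitespace after a period
    return [p for p in pieces if p]              # drop empty strings

# Shortened sample in the same layout as the question's data (illustrative)
sample = (
    "1\n00:00:10.815  > 00:00:13.139\nRegular expressions or regexes\n\n"
    "2\n00:00:13.139  > 00:00:15.825\nare written in a condensed\nformatting language.\n"
)

result = extract_sentences(sample)
print(result)
```

The final filter matters when the text ends with a newline: the split then yields a trailing empty string, which the list comprehension removes.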

User

Answer 3 · edited 2024-09-29 02:24:02

I found @timbiegeleisen's solution, with its complex regex and multiple substitutions, a bit confusing, so here is another option:

import re

_file = """1
00:00:05.210  > 00:00:07.710
In this lecture, we're
going to talk about

2
00:00:07.710  > 00:00:10.815
pattern matching in strings
using regular expressions.

3
00:00:10.815  > 00:00:13.139
Regular expressions or regexes

4
00:00:13.139  > 00:00:15.825
are written in a condensed
formatting language.
"""

non_fragments = re.compile(r'$|\d+($|:\d+.*  > \d+.*$)')

full_text = " ".join([line for line in _file.splitlines() if not non_fragments.match(line)])
sentences = full_text.split('. ')

This returns:

print(full_text)
In this lecture, we're going to talk about pattern matching in strings using regular expressions. Regular expressions or regexes are written in a condensed formatting language.

print(sentences)
["In this lecture, we're going to talk about pattern matching in strings using regular expressions", 'Regular expressions or regexes are written in a condensed formatting language.']
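
As the output shows, `split('. ')` removes the period from every sentence except the last one. If the periods should be kept, one possible variation (not part of this answer; it borrows the lookbehind split used in Answer 2) is sketched below. The `plain` and `kept` names are illustrative.

```python
import re

full_text = ("In this lecture, we're going to talk about pattern matching "
             "in strings using regular expressions. Regular expressions or "
             "regexes are written in a condensed formatting language.")

# str.split consumes the delimiter, so the first sentence loses its period
plain = full_text.split('. ')

# Splitting on whitespace that follows a period keeps the period in place
kept = re.split(r'(?<=\.)\s+', full_text)

print(plain[0][-12:])
print(kept[0][-12:])
```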

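As a (small) extra bonus, this option is at least twice as fast as the one using re.sub/re.findall.

It is most efficient when the regex is precompiled. This was not tested with a very large sample.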

%%timeit
_full_text = " ".join([line for line in _file.splitlines() if not non_fragments.match(line)])
_sentences = _full_text.split('. ')
6.75 µs ± 831 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

But it is still faster than the re.sub approach even if we include the re.compile call in each iteration:

%%timeit
non_fragments = re.compile(r'$|\d+($|:\d+.*  > \d+.*$)')
_full_text = " ".join([line for line in _file.splitlines() if not non_fragments.match(line)])
_sentences = _full_text.split('. ')  
7.97 µs ± 1.13 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

This one takes at least twice as long. I am not sure how it behaves on very large texts:

%%timeit
output = re.sub(r'(?:^|\r?\n)\d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3}  > \d{2}:\d{2}:\d{2}\.\d{3}\r?\n', '', _file)
output = re.sub(r'\r?\n', ' ', output)
sentences = re.findall(r'(.*?\.)\s*', output)
15.2 µs ± 423 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
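
The `%%timeit` cells above only run in IPython or Jupyter. Outside a notebook, a comparable measurement can be sketched with the standard `timeit` module. The function names `by_lines` and `by_substitution`, the shortened `sample` and the iteration count are assumptions made for this sketch, and the timings will differ from the figures quoted above.

```python
import re
import timeit

# Shortened sample in the same layout as the question's data (illustrative)
sample = (
    "1\n00:00:05.210  > 00:00:07.710\nIn this lecture, we're\ngoing to talk about\n\n"
    "2\n00:00:07.710  > 00:00:10.815\npattern matching in strings\nusing regular expressions."
)

non_fragments = re.compile(r'$|\d+($|:\d+.*  > \d+.*$)')

def by_lines(text=sample):
    # Answer 3: filter out blank, number and timestamp lines, then split
    full_text = " ".join(line for line in text.splitlines()
                         if not non_fragments.match(line))
    return full_text.split('. ')

def by_substitution(text=sample):
    # Answer 1: two substitutions followed by findall
    output = re.sub(r'(?:^|\r?\n)\d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3}  > '
                    r'\d{2}:\d{2}:\d{2}\.\d{3}\r?\n', '', text)
    output = re.sub(r'\r?\n', ' ', output)
    return re.findall(r'(.*?\.)\s*', output)

number = 10_000  # arbitrary iteration count for this sketch
t_lines = timeit.timeit(by_lines, number=number)
t_sub = timeit.timeit(by_substitution, number=number)

print(f"by_lines:        {t_lines / number * 1e6:.2f} µs per call")
print(f"by_substitution: {t_sub / number * 1e6:.2f} µs per call")
```

On a one-sentence sample both functions return the same list, so the comparison measures only the cost of the two approaches, not a difference in results.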
