How to use regular expressions to extract whole sentences from fragments in Python

Published 2024-09-29 02:24:02


I have a VTT file that looks like this:

WEBVTT

1
00:00:05.210 --> 00:00:07.710
In this lecture, we're
going to talk about

2
00:00:07.710 --> 00:00:10.815
pattern matching in strings
using regular expressions.

3
00:00:10.815 --> 00:00:13.139
Regular expressions or regexes

4
00:00:13.139 --> 00:00:15.825
are written in a condensed
formatting language.

I want to extract the fragments from the file and merge them into sentences. The output should look like this:

["In this lecture, we're going to talk about pattern matching in strings using regular expressions.", 'Regular expressions or regexes are written in a condensed formatting language.']

I can extract the fragments with this:

pattern = r"[A-z0-9 ,.*?='\";\n-\/%$#@!()]+"

content = [i for i in re.findall(pattern, text) if (re.search('[a-zA-Z]', i))]

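I don't know how to extract complete sentences rather than fragments.

Also note that this is only a sample of the VTT file. The whole file contains about 630 fragments, and some of them also contain integers and other special characters.

Thanks for your help.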


3 answers

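Using re.sub, we can first strip the unwanted repeated text (the cue numbers and timing lines). A second replacement then turns the remaining newlines into single spaces: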

import re

inp = """1
00:00:05.210 --> 00:00:07.710
In this lecture, we're
going to talk about

2
00:00:07.710 --> 00:00:10.815
pattern matching in strings
using regular expressions.

3
00:00:10.815 --> 00:00:13.139
Regular expressions or regexes

4
00:00:13.139 --> 00:00:15.825
are written in a condensed
formatting language."""

# Remove each cue number line together with its timing line
output = re.sub(r'(?:^|\r?\n)\d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}\r?\n', '', inp)
# Replace the remaining newlines with single spaces
output = re.sub(r'\r?\n', ' ', output)
# Collect sentences, each one ending at a period
sentences = re.findall(r'(.*?\.)\s*', output)
print(sentences)

This prints:

["In this lecture, we're going to talk about pattern matching in strings using regular expressions.",
 'Regular expressions or regexes are written in a condensed formatting language.']
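The question's file starts with a WEBVTT header line and, per the asker, some fragments contain numbers. The following is only a sketch of how the same two-step substitution could be wrapped to cope with both, assuming every timing line has the form HH:MM:SS.mmm --> HH:MM:SS.mmm. The helper name vtt_to_sentences and the CUE_HEADER constant are hypothetical, not part of the answer above.

```python
import re

# Cue number line followed by its timing line (format assumed from the sample).
CUE_HEADER = re.compile(
    r'(?:^|\r?\n)\d+\r?\n'
    r'\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}\r?\n'
)

def vtt_to_sentences(text):
    """Hypothetical helper: merge VTT cue fragments into sentences."""
    # Drop the "WEBVTT" header line if the file starts with one.
    text = re.sub(r'\AWEBVTT[^\n]*\n', '', text)
    # Remove cue numbers and timing lines.
    text = CUE_HEADER.sub('', text)
    # Collapse all remaining whitespace, including newlines, to single spaces.
    text = ' '.join(text.split())
    # A sentence ends at a period followed by whitespace or the end of the text,
    # so a period inside a number such as 3.14 does not end it.
    return re.findall(r'(.*?\.)(?:\s+|$)', text)

sample = """WEBVTT

1
00:00:05.210 --> 00:00:07.710
In this lecture, we're
going to talk about

2
00:00:07.710 --> 00:00:10.815
pattern matching in strings
using regular expressions.

3
00:00:10.815 --> 00:00:13.139
Regular expressions or regexes

4
00:00:13.139 --> 00:00:15.825
are written in a condensed
formatting language.
"""

sentences = vtt_to_sentences(sample)
print(sentences)
```

This is still a heuristic: abbreviations such as "e.g. " would be treated as sentence ends, and trailing text without a final period is dropped.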

You could also match the structure of the data in the file, to make sure it is actually there:

^\d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3} -->.*\r?\n((?:(?!\d+\r?\n\d\d:).*(?:\r?\n|$))*)

Explanation

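  • ^ Start of the string (start of each line when re.M is used)
  • \d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3} -->.* Match 1+ digits, a newline, and a time-like pattern
  • \r?\n Match a newline
  • ( Capture group 1
    • (?: Non-capturing group
      • (?!\d+\r?\n\d\d:).*(?:\r?\n|$) Match the whole line, unless it starts a cue number followed by a time-like pattern
    • )* Close the group and repeat it 0+ times to match all lines
  • ) Close group 1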
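The breakdown above can also be written directly into the pattern with re.VERBOSE, which allows whitespace and comments inside a regex. This is a sketch for illustration only; the names CUE, sample, blocks and cleaned are hypothetical. In verbose mode a literal space has to be written as [ ], which is the only change to the pattern.

```python
import re

CUE = re.compile(r"""
    ^\d+\r?\n                               # cue number on its own line
    \d{2}:\d{2}:\d{2}\.\d{3}[ ]-->.*\r?\n   # timing line, up to its newline
    (                                       # group 1: the cue text
      (?:
        (?!\d+\r?\n\d\d:)                   # stop before the next cue number and timing
        .*(?:\r?\n|$)                       # otherwise take the whole line
      )*                                    # repeat for every line of the cue
    )
""", re.M | re.X)

sample = """1
00:00:05.210 --> 00:00:07.710
In this lecture, we're
going to talk about

2
00:00:07.710 --> 00:00:10.815
pattern matching in strings
using regular expressions.
"""

# findall returns the contents of group 1 for every cue.
blocks = CUE.findall(sample)
# Normalise the newlines inside each block to single spaces.
cleaned = [" ".join(block.split()) for block in blocks]
print(cleaned)
```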


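re.findall returns the capture group values, that is, the text following each timing line, as a list.

Then join all the parts with an empty string, replace the newlines with a space, and split on one or more whitespace characters that follow a dot.

Example code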

import re

# text holds the contents of the VTT file from the question
regex = r"^\d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3} -->.*\r?\n((?:(?!\d+\r?\n\d\d:).*(?:\r?\n|$))*)"
content = [i for i in re.split(r"(?<=\.)\s+", re.sub(r"[\r\n]+", " ", "".join(re.findall(regex, text, re.M)))) if i]
print(content)

Output

["In this lecture, we're going to talk about pattern matching in strings using regular expressions.", 'Regular expressions or regexes are written in a condensed formatting language.']

I found @timbiegeleisen's solution, with its complex regex and multiple substitutions, a bit confusing, so here is another option:

import re

_file = """1
00:00:05.210 --> 00:00:07.710
In this lecture, we're
going to talk about

2
00:00:07.710 --> 00:00:10.815
pattern matching in strings
using regular expressions.

3
00:00:10.815 --> 00:00:13.139
Regular expressions or regexes

4
00:00:13.139 --> 00:00:15.825
are written in a condensed
formatting language.
"""

# Matches empty lines, cue number lines, and timing lines
non_fragments = re.compile(r'$|\d+($|:\d+.* --> \d+.*$)')

# Keep only the text lines and join them with single spaces
full_text = " ".join([line for line in _file.splitlines() if not non_fragments.match(line)])
sentences = full_text.split('. ')

This returns:

print(full_text)
In this lecture, we're going to talk about pattern matching in strings using regular expressions. Regular expressions or regexes are written in a condensed formatting language.

print(sentences)
["In this lecture, we're going to talk about pattern matching in strings using regular expressions", 'Regular expressions or regexes are written in a condensed formatting language.']
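As the output shows, split('. ') consumes the period of every sentence except the last. If the periods should be kept, one possible variation (a sketch, not part of the answer above) is to split on the whitespace that follows a period by using a lookbehind. The names vtt and NON_FRAGMENT are hypothetical.

```python
import re

vtt = """1
00:00:05.210 --> 00:00:07.710
In this lecture, we're
going to talk about

2
00:00:07.710 --> 00:00:10.815
pattern matching in strings
using regular expressions.

3
00:00:10.815 --> 00:00:13.139
Regular expressions or regexes

4
00:00:13.139 --> 00:00:15.825
are written in a condensed
formatting language.
"""

# Matches empty lines, cue number lines, and timing lines.
NON_FRAGMENT = re.compile(r'$|\d+($|:\d+.* --> \d+.*$)')

# Keep only the text lines and join them with single spaces.
lines = [line for line in vtt.splitlines() if not NON_FRAGMENT.match(line)]
full_text = " ".join(lines)

# (?<=\.) asserts that a period precedes the split point without consuming it,
# so every sentence keeps its final period.
sentences = re.split(r'(?<=\.)\s+', full_text)
print(sentences)
```

Note that the line filter also discards any caption line that consists of digits only, because such a line looks like a cue number.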

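As an added (small) bonus, this option is at least twice as fast as using re.sub/re.findall.

It is most efficient when the regex is precompiled. I have not tested it with a very large sample.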

%%timeit
_full_text = " ".join([line for line in _file.splitlines() if not non_fragments.match(line)])
_sentences = _full_text.split('. ')
6.75 µs ± 831 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

It is still faster even if the re.compile call is included in every iteration:

%%timeit
non_fragments = re.compile(r'$|\d+($|:\d+.* --> \d+.*$)')
_full_text = " ".join([line for line in _file.splitlines() if not non_fragments.match(line)])
_sentences = _full_text.split('. ')  
7.97 µs ± 1.13 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

The re.sub/re.findall version takes about twice as long. I am not sure how either behaves on very large texts.

%%timeit
output = re.sub(r'(?:^|\r?\n)\d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}\r?\n', '', _file)
output = re.sub(r'\r?\n', ' ', output)
sentences = re.findall(r'(.*?\.)\s*', output)
15.2 µs ± 423 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
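The %%timeit cells above only run inside IPython or Jupyter. To repeat the comparison in a plain Python script, the standard timeit module can be used instead. This is a rough sketch: the function names, the shortened two-cue sample and the NUMBER constant are assumptions made for illustration, and the timings will differ from those quoted above.

```python
import re
import timeit

sample = """1
00:00:05.210 --> 00:00:07.710
In this lecture, we're
going to talk about

2
00:00:07.710 --> 00:00:10.815
pattern matching in strings
using regular expressions.
"""

NON_FRAGMENT = re.compile(r'$|\d+($|:\d+.* --> \d+.*$)')
CUE_HEADER = r'(?:^|\r?\n)\d+\r?\n\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}\r?\n'

def by_line_filter():
    # Drop non-text lines, join the rest, split on ". ".
    full_text = " ".join(
        line for line in sample.splitlines() if not NON_FRAGMENT.match(line)
    )
    return full_text.split('. ')

def by_substitution():
    # Remove cue headers, flatten newlines, collect sentences.
    output = re.sub(CUE_HEADER, '', sample)
    output = re.sub(r'\r?\n', ' ', output)
    return re.findall(r'(.*?\.)\s*', output)

NUMBER = 1000  # calls per timing run
for fn in (by_line_filter, by_substitution):
    # The minimum of several runs is the measurement least disturbed by other load.
    best = min(timeit.repeat(fn, number=NUMBER, repeat=5))
    print(f"{fn.__name__}: {best / NUMBER * 1e6:.2f} µs per call")
```

For a meaningful comparison on a real file, the sample string should be replaced with the full contents of the VTT file.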
