How do I split a string into substrings of a given length without breaking sentences?

Posted 2024-09-28 21:33:19

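I have a string containing a large text and need to split it into several substrings of length <= N characters (as close to N as possible; N is always greater than the longest sentence), but I also must not break any sentence.

For example, with N=80 and the given text: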

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam. Nam sit amet iaculis lacus, non sagittis nulla. Nam blandit quam eget velit maximus, eu consectetur sapien sodales. Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel.

I want to get this list of strings:

"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam."
"Nam sit amet iaculis lacus, non sagittis nulla."
"Nam blandit quam eget velit maximus, eu consectetur sapien sodales."
"Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel."

I would also like this to work with both English and Russian.

How can I do this?


2 answers

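The steps I would take:

  • Initialise a list to store the lines, and a current line variable to hold the string of the line being built
  • Split the paragraph into sentences: this requires you to .split on '.', drop the trailing empty sentence (""), strip leading and trailing whitespace (.strip), and then add the full stop back
  • Loop over these sentences:
    • If the sentence fits on the current line, add it
    • Otherwise, append the current working line to the list of lines and set the current line to this sentence
  • After the loop, append whatever is left in the current line to the list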

So, in Python, something like:

para = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam. Nam sit amet iaculis lacus, non sagittis nulla. Nam blandit quam eget velit maximus, eu consectetur sapien sodales. Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel."
max_len = 80
lines = []
line = ''
# split on '.', skip empty pieces, strip whitespace and put the full stop back
for sentence in (s.strip() + '.' for s in para.split('.') if s.strip()):
    if not line:                                      # first sentence of a line: no leading space
        line = sentence
    elif len(line) + 1 + len(sentence) <= max_len:    # fits => add a space then this sentence
        line += ' ' + sentence
    else:                                             # doesn't fit => store the line, start a new one
        lines.append(line)
        line = sentence
if line:                                              # don't lose the last line
    lines.append(line)

which gives lines as:

[
 "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Integer in tellus quam.",
 "Nam sit amet iaculis lacus, non sagittis nulla.",
 "Nam blandit quam eget velit maximus, eu consectetur sapien sodales.",
 "Etiam efficitur blandit arcu, quis rhoncus mauris elementum vel."
]
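
The question also asks for English and Russian text, and real text tends to end sentences with '!' and '?' as well as '.'. As a rough sketch of the same greedy idea wrapped in a function, the version below splits with a regular expression instead of str.split. The name pack_sentences and the choice of sentence-ending characters are assumptions made for illustration, and the splitting is still naive: an abbreviation such as "e.g. " is treated as a sentence end.

```python
import re

def pack_sentences(text, max_len):
    """Greedily pack whole sentences into chunks of at most max_len characters."""
    # Split on whitespace that follows '.', '!' or '?'. The lookbehind keeps
    # the punctuation attached to the sentence it ends.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    chunks = []
    current = ''
    for sentence in sentences:
        if not current:
            # First sentence of a chunk: taken as-is, even if it is too long.
            current = sentence
        elif len(current) + 1 + len(sentence) <= max_len:
            # Still fits, counting the joining space.
            current += ' ' + sentence
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks

text = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
        "Integer in tellus quam. Nam sit amet iaculis lacus, non sagittis nulla.")
for chunk in pack_sentences(text, 80):
    print(len(chunk), chunk)
```

Because len counts characters rather than bytes, the same function can be used on Cyrillic text. A sentence longer than max_len is kept whole in its own chunk, which the question rules out anyway by saying N is always greater than the longest sentence.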

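I couldn't find a built-in for this, so here is a start. It could be made smarter by checking where a sentence should go both before and after adding it, rather than only before. The length includes spaces, because I'm splitting naïvely rather than using a regular expression or anything like that.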

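def get_sentences(text, min_length):
    # Naive split on ". "; put the separator back on every piece except one
    # that already ends with a full stop (normally the last piece).
    sentences = (sentence if sentence.endswith(".") else sentence + ". "
                 for sentence in text.split(". "))
    current_line = ""
    for sentence in sentences:
        # Once the line has reached min_length, emit it and start a new one.
        if len(current_line) >= min_length:
            yield current_line
            current_line = sentence
        else:
            current_line += sentence
    yield current_line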

It is slow for long lines, but it works.
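
If the slowness comes from rebuilding the string with += for every sentence (an assumption; nothing was measured here), one possible variant collects the sentences of the current chunk in a list and joins them only when the chunk is emitted. The name iter_chunks is hypothetical, and the function expects sentences that have already been split. Like the generator above, it treats the number as a minimum, so a chunk can end up longer than min_length.

```python
def iter_chunks(sentences, min_length):
    """Yield chunks of whole sentences, emitting each once it reaches min_length."""
    parts = []   # sentences collected for the current chunk
    size = 0     # length the joined chunk would have, including spaces
    for sentence in sentences:
        if parts and size >= min_length:
            # The chunk is long enough: emit it and start a new one.
            yield ' '.join(parts)
            parts, size = [], 0
        # Count one joining space for every sentence after the first.
        size += len(sentence) + (1 if parts else 0)
        parts.append(sentence)
    if parts:
        yield ' '.join(parts)

for chunk in iter_chunks(["Aa.", "Bb.", "Cc."], 5):
    print(chunk)
```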
