Python regex: splitting a large text file into smaller parts

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| First Block of text   |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~

----------------------- Monday 8 August 2021 -----------------------

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| Second Block of text  |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~

----------------------- Friday 12 August 2021 -----------------------

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| 3rd Block of text     |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~

----------------------- Friday 19 August 2021 -----------------------

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| 4th Block of text     |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~

3 Answers

User

#1 · edited 2024-10-03 17:19:11

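In your pattern, you match only a single - on the left and on the right, and .*? matches 0 or more characters other than a newline, non-greedily.

That gives you a lot of partial matches instead of matching the whole line.

You can also work with matches instead, using capture group 1 as the filename and capture group 2 as the data.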

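^-+([^-]+)-+((?:\n(?!--).*)*)

Explanation

^ Start of a line (the pattern is used with re.M)
-+ Match - one or more times
([^-]+) Capture group 1 for the date part: match one or more characters other than -
-+ Match - one or more times
( Capture group 2 for the data part
- (?:\n(?!--).*)* Match every following line that does not start with --
) Close group 2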

Regex demo

For example

import re

pattern = r"^-+([^-]+)-+((?:\n(?!--).*)*)"

s = (" ~~~~~~~~~~~~~~~~~~~~~~~\n"
    "|                       |\n"
    "| First Block of text   |\n"
    "|                       |\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n\n"
    "----------------------- Monday 8 August 2021 -----------------------\n\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
    "|                       |\n"
    "| Second Block of text  |\n"
    "|                       |\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n\n"
    "----------------------- Friday 12 August 2021 -----------------------\n\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
    "|                       |\n"
    "| 3rd Block of text     |\n"
    "|                       |\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
    " \n"
    "----------------------- Friday 19 August 2021 -----------------------\n\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
    "|                       |\n"
    "| 4th Block of text     |\n"
    "|                       |\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n")

# re.M lets ^ match at the start of every line, not only at the start of s
matches = re.findall(pattern, s, re.M)
if matches:
    # Each match is a (date, data) tuple; only the first one is shown here
    filename = matches[0][0].strip()
    data = matches[0][1].strip()

    print(filename)
    print(data)

Output

Monday 8 August 2021
~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| Second Block of text  |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~
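The example above prints only the first match. To split the whole text, one option is to loop over every match and pair each date with its block. This is a minimal sketch rather than a definitive implementation: the helper name split_sections, the shortened sample text and the idea of writing one file per date are all assumptions for illustration, and the separator lines are assumed to be runs of dashes around the date.

```python
import re

# Same idea as the pattern above: a dashed date line, followed by every
# line that does not start with "--".
PATTERN = re.compile(r"^-+([^-]+)-+((?:\n(?!--).*)*)", re.M)


def split_sections(text):
    """Return a list of (date, block) pairs found in text (hypothetical helper)."""
    return [(m.group(1).strip(), m.group(2).strip())
            for m in PATTERN.finditer(text)]


# Shortened stand-in for the real file contents (assumption)
sample = (
    "intro block\n"
    "----- Monday 8 August 2021 -----\n"
    "second block\n"
    "----- Friday 12 August 2021 -----\n"
    "third block\n"
)

sections = split_sections(sample)
for date, block in sections:
    # A real script could write each block to its own file here, e.g.
    # pathlib.Path(date + ".txt").write_text(block)
    print(date, "->", block)
```

Note that the text before the first date line ("intro block" here) is not part of any match, so it would need separate handling if it has to be kept.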

User

#2 · edited 2024-10-03 17:19:11

You can match the blocks with the following expression and use the first group as the filename:

^-+([^-]+)-+$
(.+?(?=^--|\Z))

See a demo on regex101.com (mind the modifiers).
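As a rough sketch of how this expression might be used from Python: the modifiers are not listed in the answer, so the combination below (MULTILINE, DOTALL and VERBOSE) is an assumption, as are the "--" in the lookahead, the name BLOCK_RE and the shortened sample text.

```python
import re

# Assumed modifiers:
#   re.M  ^ and $ match at every line boundary
#   re.S  . also matches newlines, so a block can span several lines
#   re.X  whitespace and comments inside the pattern are ignored
BLOCK_RE = re.compile(
    r"""
    ^-+([^-]+)-+$        # dashed line, group 1 is the date (used as filename)
    (.+?(?=^--|\Z))      # group 2: everything up to the next dashed line or the end
    """,
    re.M | re.S | re.X,
)

# Shortened stand-in for the real file contents (assumption)
sample = (
    "intro block\n"
    "----- Monday 8 August 2021 -----\n"
    "second block\n"
    "----- Friday 12 August 2021 -----\n"
    "third block\n"
)

blocks = {m.group(1).strip(): m.group(2).strip()
          for m in BLOCK_RE.finditer(sample)}
print(blocks)
```

The lazy .+? together with the lookahead stops each block right before the next dashed line, so the date lines themselves never end up inside a block.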

User

#3 · edited 2024-10-03 17:19:11

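You can use:

import re

input_text = """
 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| First Block of text   |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~

----------------------- Monday 8 August 2021 -----------------------

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| Second Block of text  |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~

----------------------- Friday 12 August 2021 -----------------------

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| 3rd Block of text     |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~
 
----------------------- Friday 19 August 2021 -----------------------

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| 4th Block of text     |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~
"""

# The capture group keeps the dates in the result, between the blocks
a = re.split(r'-+(.*?)-+', input_text)

# Strip surrounding whitespace from every part
for k, v in enumerate(a):
    a[k] = v.strip()

print(a)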

A list comprehension is more concise, as suggested by @fsimonjetz:

# input_text is the same sample string as above
result = [x.strip() for x in re.split(r'-+(.*?)-+', input_text)]
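Because the pattern contains a capture group, re.split returns the blocks and the dates in alternating order: block, date, block, date, and so on. One way to pair them up is sketched below; the shortened sample text and the names first_block and dated are assumptions for illustration.

```python
import re

# Shortened stand-in for the real file contents (assumption)
sample = (
    "intro block\n"
    "----- Monday 8 August 2021 -----\n"
    "second block\n"
    "----- Friday 12 August 2021 -----\n"
    "third block\n"
)

parts = [x.strip() for x in re.split(r"-+(.*?)-+", sample)]

# parts alternates: block, date, block, date, ..., block
first_block = parts[0]                       # text before the first date line
dated = dict(zip(parts[1::2], parts[2::2]))  # each date mapped to the block after it

print(first_block)
print(dated)
```

Keep in mind that this simple pattern is not anchored to whole lines, so a line inside a block that contains two separate runs of hyphens (for example a-b-c) would also be treated as a separator.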
