Python regex: splitting a large text file into smaller parts

Published 2024-10-03 17:19:11


Consider the following text file.

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| First Block of text   |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~

----------------------- Monday 8 August 2021 -----------------------

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| Second Block of text  |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~

----------------------- Friday 12 August 2021 -----------------------

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| 3rd Block of text     |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~
 
----------------------- Friday 19 August 2021 -----------------------

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| 4th Block of text     |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~

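How can I extract the second, third, and fourth blocks and save each one under the date that appears above it? For example, I need to extract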

     ~~~~~~~~~~~~~~~~~~~~~~~
    |                       |
    | Second Block of text  |
    |                       |
     ~~~~~~~~~~~~~~~~~~~~~~~

and then save it to a file or variable named Monday 8 August 2021.

With the following regular expression I can find the lines that contain the dates: https://regex101.com/r/nKW1W4/1

-(?P<date>.*?)-

3 answers

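In your pattern, you match only a single - on the left and on the right, and .*? matches zero or more characters other than a newline, non-greedily.

This gives you many partial matches instead of one match for the whole line.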
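To make the partial-match problem concrete, here is a minimal sketch that runs the pattern from the question against a single separator line. The 23-dash line and the stricter, anchored variant are assumptions for illustration; they are not taken from the question or the linked demo.

```python
import re

# One separator line shaped like those in the question (23 dashes on each side).
line = "-" * 23 + " Monday 8 August 2021 " + "-" * 23

# Pattern from the question: one dash, as few characters as possible, one dash.
loose = re.findall(r"-(?P<date>.*?)-", line)
print(len(loose))                # 23 matches, almost all of them empty
print([d for d in loose if d])   # [' Monday 8 August 2021 ']

# Assumed stricter variant: anchor to the whole line and allow runs of dashes.
strict = re.findall(r"^-+\s*(?P<date>[^-]+?)\s*-+$", line, re.M)
print(strict)                    # ['Monday 8 August 2021']
```

Each pair of adjacent dashes satisfies the loose pattern with an empty date, which is why only one of the 23 matches carries the text you want.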


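You can also match each section, using capture group 1 as the filename and capture group 2 as the data:

^-+([^-]+)-+((?:\n(?!-).*)*)

Explanation

  • ^ Start of a line (with the multiline flag)
  • -+ Match - one or more times
  • ([^-]+) Capture group 1 for the date part: match any characters except -
  • -+ Match - one or more times
  • ( Capture group 2 for the data part
    • (?:\n(?!-).*)* Match all following lines that do not start with -
  • ) Close group 2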


For example:

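import re

# Group 1 is the date; group 2 collects the lines that follow,
# stopping before the next line that starts with "-".
pattern = r"^-+([^-]+)-+((?:\n(?!-).*)*)"

s = (" ~~~~~~~~~~~~~~~~~~~~~~~\n"
    "|                       |\n"
    "| First Block of text   |\n"
    "|                       |\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n\n"
    "----------------------- Monday 8 August 2021 -----------------------\n\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
    "|                       |\n"
    "| Second Block of text  |\n"
    "|                       |\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n\n"
    "----------------------- Friday 12 August 2021 -----------------------\n\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
    "|                       |\n"
    "| 3rd Block of text     |\n"
    "|                       |\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
    " \n"
    "----------------------- Friday 19 August 2021 -----------------------\n\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
    "|                       |\n"
    "| 4th Block of text     |\n"
    "|                       |\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n")

# re.M lets ^ match at the start of every line, not only at the start of the string.
matches = re.findall(pattern, s, re.M)
if matches:
    # Each match is a (date, block) tuple; only the first one is printed here.
    filename = matches[0][0].strip()
    data = matches[0][1].strip()

    print(filename)
    print(data)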

Output

Monday 8 August 2021
~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| Second Block of text  |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~
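Since the question asks to save each block under its date, the same pattern can be looped over with re.finditer. This is only a sketch: the shortened sample text, the temporary output directory, and the .txt suffix are assumptions for illustration, and it assumes that no line inside a block starts with -.

```python
import re
import tempfile
from pathlib import Path

# Shortened sample in the same layout as the question's file.
text = (
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
    "| First Block of text   |\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n\n"
    "----------------------- Monday 8 August 2021 -----------------------\n\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
    "| Second Block of text  |\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n\n"
    "----------------------- Friday 12 August 2021 -----------------------\n\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
    "| 3rd Block of text     |\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
)

pattern = re.compile(r"^-+([^-]+)-+((?:\n(?!-).*)*)", re.M)

out_dir = Path(tempfile.mkdtemp())  # hypothetical output location
written = []
for m in pattern.finditer(text):
    name = m.group(1).strip()       # e.g. "Monday 8 August 2021"
    block = m.group(2).strip("\n")  # drop blank lines, keep the leading space
    path = out_dir / f"{name}.txt"
    path.write_text(block + "\n", encoding="utf-8")
    written.append(path.name)

print(written)
```

Stripping only newlines, rather than calling strip() with no argument, keeps the leading space of the first tilde line so the box stays aligned in the saved file.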

You can match the blocks with the following expression and use the first group as the filename:

^
-+([^-]+)-+$
(.+?(?=^-|\Z))

See a demo on regex101.com (mind the modifiers).
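A possible Python rendering of this expression is sketched below. The demo's modifiers are not visible here, so the flag combination re.M | re.S | re.X is an assumption based on how the pattern is written (split over several lines, with ^ and $ per line and . crossing newlines). The sample text and the dictionary are likewise only for illustration.

```python
import re

text = (
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
    "| First Block of text   |\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n\n"
    "----------------------- Monday 8 August 2021 -----------------------\n\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
    "| Second Block of text  |\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n\n"
    "----------------------- Friday 12 August 2021 -----------------------\n\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
    "| 3rd Block of text     |\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
)

block_re = re.compile(
    r"""
    ^
    -+([^-]+)-+$      # date line; group 1 is the date
    (.+?(?=^-|\Z))    # group 2: everything up to the next date line or the end
    """,
    re.M | re.S | re.X,  # assumed modifiers: multiline, dot-all, verbose
)

# Map each date to the block that follows it.
blocks = {
    m.group(1).strip(): m.group(2).strip("\n")
    for m in block_re.finditer(text)
}

for date, block in blocks.items():
    print(date)
    print(block)
```

Without the dot-all flag, .+? could not cross line breaks, and without the verbose flag the line breaks inside the pattern would be matched literally.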

You can use:

import re

input_text = """
 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| First Block of text   |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~

           - Monday 8 August 2021            -

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| Second Block of text  |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~

           - Friday 12 August 2021            -

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| 3rd Block of text     |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~
 
           - Friday 19 August 2021            -

 ~~~~~~~~~~~~~~~~~~~~~~~
|                       |
| 4th Block of text     |
|                       |
 ~~~~~~~~~~~~~~~~~~~~~~~
"""

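# re.split keeps the text of the capture group, so the result alternates:
# [first block, date, block, date, block, ...]
a = re.split(r'-+(.*?)-+', input_text)

for k, v in enumerate(a):
    a[k] = v.strip()

print(a)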

A more concise version using a list comprehension, suggested by @fsimonjetz (input_text is the same string as above):

result = [x.strip() for x in re.split(r'-+(.*?)-+', input_text)]
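Because re.split returns the captured dates between the blocks, the flat list can be paired up into a dictionary keyed by date. The following is a minimal sketch with a shortened sample; the names parts and by_date are made up for illustration.

```python
import re

input_text = (
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
    "| First Block of text   |\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n\n"
    "----------------------- Monday 8 August 2021 -----------------------\n\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
    "| Second Block of text  |\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n\n"
    "----------------------- Friday 12 August 2021 -----------------------\n\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
    "| 3rd Block of text     |\n"
    " ~~~~~~~~~~~~~~~~~~~~~~~\n"
)

parts = [p.strip() for p in re.split(r"-+(.*?)-+", input_text)]

# parts[0] is the text before the first date line; after that,
# dates sit at odd indexes and their blocks at the following even indexes.
by_date = dict(zip(parts[1::2], parts[2::2]))

for date, block in by_date.items():
    print(date)
    print(block)
```

Note that this pattern is not anchored to whole lines, so any hyphen pair inside a block (for example two hyphenated words) would also be treated as a separator. If your real data contains hyphens, one of the line-anchored patterns from the other answers is probably the safer choice.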
