How do I make my regex match stop at a lookahead?

Posted 2024-09-28 20:49:16


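I have some text extracted from a PDF file, and I want to split it into a list of strings in which each string starts with a number and a period and stops just before the next number.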

For example, I want to convert this:

'3.1 First liens  15,209,670,396  0  15,209,670,396  14,216,703,858 
3.2 Other than first liens     0  0 
4. Real estate:
4.1 Properties occupied by  the company (less $  43,332,898 
encumbrances)  68,122,291  0  68,122,291  64,237,046 
4.2 Properties held for  the production of income (less 
$    encumbrances)       0  0 
4.3 Properties held for sale (less $  
encumbrances)      0  0 
5. Cash ($  (101,130,138)), cash equivalents 
($ 850,185,973 ) and short-term
 investments ($ 0 )  749,055,835  0  749,055,835  1,867,997,055 
6. Contract loans (including $   premium notes)  253,533,676  0  253,533,676  233,680,271 
7. Derivatives  3,194,189,871  0  3,194,189,871  2,390,781,023 
8. Other invested assets  749,074,191  11,899,360  737,174,831  692,916,503' 

into this:

['3.1 First liens  15,209,670,396  0  15,209,670,396  14,216,703,858 ',
'3.2 Other than first liens     0  0 ',
'4. Real estate:',
'4.1 Properties occupied by  the company (less $  43,332,898 encumbrances)  68,122,291  0  68,122,291  64,237,046',
'4.2 Properties held for  the production of income (less $    encumbrances)       0  0',
'4.3 Properties held for sale (less $  encumbrances)      0  0',
'5. Cash ($  (101,130,138)), cash equivalents ($ 850,185,973 ) and short-term investments ($ 0 )  749,055,835  0  749,055,835  1,867,997,055',
'6. Contract loans (including $   premium notes)  253,533,676  0  253,533,676  233,680,271',
'7. Derivatives  3,194,189,871  0  3,194,189,871  2,390,781,023',
'8. Other invested assets  749,074,191  11,899,360  737,174,831  692,916,503']
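The problem is that the original string has "\n" scattered through the middle of the item names (for example, in 4.1 there is a \n before the word "encumbrances").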
(\d+\.[\s\S]*(?!\d+\.))

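This is the regex I have been trying, but it matches the whole string instead of each numbered line. Is there a way to make my regex stop matching before the next numbered line?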


3 Answers

Loop through each capture group found with:

^[\']?(?=[\d].)[\d].[\d]*([\s\w\,\:\(\)\$\-]*)[\']?[ ]*(\n|\Z)

Something like:

items = re.findall(r"^\d+\..*?(?=^\d+\.|\Z)", text, re.MULTILINE | re.DOTALL)
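
As a rough sketch of how the one-liner above could be used end to end: the lazy `.*?` together with the positive lookahead `(?=^\d+\.|\Z)` stops each match just before the next line that begins with a number and a period. The pattern in the question, by contrast, uses a greedy `[\s\S]*` that runs to the end of the string, where the negative lookahead `(?!\d+\.)` trivially succeeds. The shortened sample `text`, the name `ITEM`, and the helper `split_items` are made up for illustration, and collapsing whitespace afterwards is an assumption based on the expected output, which joins the wrapped lines.

```python
import re

# Hypothetical shortened sample; the real text comes from the PDF.
text = (
    "4. Real estate:\n"
    "4.1 Properties occupied by the company (less $ 43,332,898 \n"
    "encumbrances) 68,122,291 0\n"
    "5. Cash ($ (101,130,138)), cash equivalents \n"
    "($ 850,185,973 ) and short-term\n"
    " investments ($ 0 ) 749,055,835\n"
)

# MULTILINE lets ^ match at every line start; DOTALL lets . cross newlines.
ITEM = re.compile(r"^\d+\..*?(?=^\d+\.|\Z)", re.MULTILINE | re.DOTALL)

def split_items(raw):
    """Return one string per numbered item, with wrapped lines joined."""
    # " ".join(s.split()) collapses each run of whitespace (including \n) to one space.
    return [" ".join(match.split()) for match in ITEM.findall(raw)]

items = split_items(text)
for item in items:
    print(item)

# The question's pattern, for comparison: it yields a single match.
greedy = re.findall(r"(\d+\.[\s\S]*(?!\d+\.))", text)
print(len(greedy))
```

If the original spacing matters, drop the whitespace-collapsing step and use `ITEM.findall(raw)` directly; each item then keeps its embedded newlines.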

I can explain further on request.

import re

txt = '''3.1 First liens  15,209,670,396  0  15,209,670,396  14,216,703,858 
3.2 Other than first liens     0  0 
4. Real estate:
4.1 Properties occupied by  the company (less $  43,332,898 
encumbrances)  68,122,291  0  68,122,291  64,237,046 
4.2 Properties held for  the production of income (less 
$    encumbrances)       0  0 
4.3 Properties held for sale (less $  
encumbrances)      0  0 
5. Cash ($  (101,130,138)), cash equivalents 
($ 850,185,973 ) and short-term
 investments ($ 0 )  749,055,835  0  749,055,835  1,867,997,055 
6. Contract loans (including $   premium notes)  253,533,676  0  253,533,676  233,680,271 
7. Derivatives  3,194,189,871  0  3,194,189,871  2,390,781,023 
8. Other invested assets  749,074,191  11,899,360  737,174,831  692,916,503'''

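# Raw string avoids invalid-escape warnings; (?m)^ anchors the item number to a line start.
pattern = r'(?m)^[0-9]+\.[0-9]*'
x = re.split(pattern, txt)    # text between item numbers; x[0] is whatever precedes the first one
y = re.findall(pattern, txt)  # the item numbers themselves, e.g. '3.1' or '4.'
z = []

# Re-attach each number to the text that followed it.
for num, body in zip(y, x[1:]):
    # Replacing '\n' with a space joins items that wrap across lines.
    z.append((num + body).replace('\n', ' '))

print(z)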

The replace is only needed if you want the newlines turned into spaces.
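
A possible variation on the split idea, offered only as a sketch: splitting on a zero-width lookahead keeps each number attached to its item, so no separate `findall` pass is needed. This assumes Python 3.7 or later, where `re.split` accepts patterns that match the empty string; the shortened sample and the names `parts` and `rows` are hypothetical.

```python
import re

# Hypothetical shortened sample standing in for the PDF text.
txt = (
    "3.1 First liens 15,209,670,396 0\n"
    "4.3 Properties held for sale (less $ \n"
    "encumbrances) 0 0\n"
    "8. Other invested assets 749,074,191\n"
)

# Split at the start of every line that begins with "<digits>.".
# The lookahead consumes nothing, so the number stays with its item.
# Assumption: Python 3.7+ (older versions do not split on empty matches).
parts = re.split(r"(?m)^(?=\d+\.)", txt)

# Drop the empty leading piece and turn embedded newlines into spaces.
rows = [p.replace("\n", " ").strip() for p in parts if p.strip()]
print(rows)
```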
