How to extract data between two lines in a Python script

Posted 2024-09-28 01:33:04


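I want to extract the data between two lines. My text files come in different layouts. I have Python code that works for numbers but not for text, so I need help.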

My text file, format 1

TAX INVOICE (Under Rule 46 of the Central Goods & Service Tax Rules, 2017)
ANURAG ENTERPRISES ANURAG ENTERPRISES, VEDAVATHI NAGAR,CHALLAKERE ROAD HIRIYUR
State Code: 29

My text file, format 2

Page 1 of 1
KS LINGAPPA AND SON Industrial Area, Plot No 14. KSSIDC TBDam Road, Hosapete-583201 State Karnataka
State Code 29

The output I want

1.ANURAG ENTERPRISES ANURAG ENTERPRISES
2.KS LINGAPPA AND SON

for name in files:
    with open(name, encoding="utf8") as infile:
        copy = False
        cnt = 0
        for line in infile:
            if line.strip() == "Page":
                copy = True
                continue
            if line.strip() == "TAX":
                copy = True
                continue
            elif line.strip() == "State":
                copy = False
                continue
            elif copy:
                print(line)

1 answer
User
#1 · Posted 2024-09-28 01:33:04

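As Onno Rouast commented, it is not entirely clear what the extraction rule is. The following works for both of your examples, but who can say what the future will bring.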

Regex Demo

import re

rex = r"""(?xm)         # extended mode and multiline
(?:^(?:Page|TAX).*\n)   # preceded by a line starting with either Page or TAX
\b([A-Z ]+)\b           # Looking for all capital letters or spaces"""

text = """TAX INVOICE (Under Rule 46 of the Central Goods & Service Tax Rules, 2017)
ANURAG ENTERPRISES ANURAG ENTERPRISES, VEDAVATHI NAGAR,CHALLAKERE ROAD HIRIYUR
State Code: 29

Page 1 of 1
KS LINGAPPA AND SON Industrial Area, Plot No 14. KSSIDC TBDam Road, Hosapete-583201 State Karnataka
State Code 29"""

companies = [s.strip() for s in re.findall(rex, text)]
print(companies)

Prints:

['ANURAG ENTERPRISES ANURAG ENTERPRISES', 'KS LINGAPPA AND SON']
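
For comparison, here is a minimal line-by-line sketch of the same rule, closer to the loop in the question. A likely reason that loop prints nothing is that `line.strip() == "Page"` is true only when the whole line is exactly `Page`; `str.startswith` tests the prefix instead. The names `company_names` and `CAPS_PREFIX` are made up for this illustration, and the rule (take the leading capitals on the line after a `Page` or `TAX` line) is an assumption based on the two samples.

```python
import re

# Leading run of capital letters and spaces, ending on a word boundary.
CAPS_PREFIX = re.compile(r"[A-Z ]+\b")


def company_names(lines):
    """Collect the all-caps prefix of each line that follows a Page/TAX line."""
    names = []
    take_next = False  # True right after a line starting with "Page" or "TAX"
    for line in lines:
        line = line.strip()
        if take_next:
            match = CAPS_PREFIX.match(line)
            if match:
                names.append(match.group().strip())
            take_next = False
        elif line.startswith(("Page", "TAX")):
            take_next = True
    return names


sample = """TAX INVOICE (Under Rule 46 of the Central Goods & Service Tax Rules, 2017)
ANURAG ENTERPRISES ANURAG ENTERPRISES, VEDAVATHI NAGAR,CHALLAKERE ROAD HIRIYUR
State Code: 29

Page 1 of 1
KS LINGAPPA AND SON Industrial Area, Plot No 14. KSSIDC TBDam Road, Hosapete-583201 State Karnataka
State Code 29"""

print(company_names(sample.splitlines()))
```

Because the function only iterates over lines, an open file object can be passed in place of `sample.splitlines()`.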

Update

import re

rex = r"""(?xm)         # extended mode and multiline
(?:^(?:Page|TAX).*\n)   # preceded by a line starting with either Page or TAX
\b([A-Z ]+)\b           # Looking for all capital letters or spaces"""

files = ['name1', 'name2', 'etc.']
all_companies = []
for name in files:
    with open(name, encoding="utf8") as infile:
        text = infile.read()
        # in case there can be multiple occurrences in each file (it's not clear):
        companies = [s.strip() for s in re.findall(rex, text)]
        print(companies)
        all_companies.extend(companies) # list of all companies found in all files
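
Since the extraction rule is not fully specified, it may help to see where the `[A-Z ]+` character class stops. The sketch below runs the same pattern (written without extended mode) against a made-up line containing `&`, then tries a hypothetical wider class. The sample text and the `loose` pattern are assumptions for illustration only, not something the question asked for.

```python
import re

# Pattern from the answer: only capital letters and spaces are captured.
strict = re.compile(r"(?m)^(?:Page|TAX).*\n\b([A-Z ]+)\b")

# Hypothetical wider variant: also allows digits, '&', '.', '/' and '-'.
loose = re.compile(r"(?m)^(?:Page|TAX).*\n\b([A-Z0-9 &./-]+)\b")

# Made-up sample, not taken from the question.
sample = "Page 1 of 1\nRAM & CO TRADERS, Main Road, Hosapete\nState Code 29"

print([s.strip() for s in strict.findall(sample)])  # ['RAM']
print([s.strip() for s in loose.findall(sample)])   # ['RAM & CO TRADERS']
```

Whether the wider class is appropriate depends on what the real files contain, so it is worth checking against more samples before relying on it.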
