Deleting lines relative to a regex match

Posted 2024-09-30 03:22:11

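I used pdftotext to create a text file, and it includes the page footers from the source PDF. The footers get in the way of other parsing I need to do. Each footer looks like this: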

This is important text.

9
Title 2012 and 2013

\fCompany
Important text begins again.

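The Company line is the only one that does not recur elsewhere in the file. It appears as \x0cCompany\n. I want to find this line by searching for \x0cCompany\n, then delete it along with the three preceding lines (the page number, the title, and the blank line). Here is what I have so far: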

import re

report = open('file.txt').readlines()
data = range(len(report))
name = []

for line_i in data:
    line = report[line_i]

    if re.match('.*\\x0cCompany', line):
        name.append(report[line_i])

print(name)

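This lets me find the lines where the footer marker occurs, but I don't know how to delete those lines together with the three lines before each of them. It seems I need another loop built on top of this one, but I can't get it to work.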

1 Answer

#1 · Posted 2024-09-30 03:22:11

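Rather than looping to collect the indices of the lines to delete, loop over the lines and append only the ones you want to keep.

Iterating over the file object itself is also more efficient than reading the whole file into a list first: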

import re

keeplines = []

with open('file.txt') as b:
    for line in b:
        if re.match(r'.*\x0cCompany', line):
            # Footer marker found: drop the three preceding lines already kept
            keeplines = keeplines[:-3]
        else:
            keeplines.append(line)

# Write the kept lines back over the original file
with open('file.txt', 'w') as out:
    for line in keeplines:
        out.write(line)
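
The same keep-list idea can be tried in memory on the sample footer from the question, without touching any file. This is only a minimal sketch: the helper name `strip_footers` and its `marker` and `n_before` parameters are hypothetical, introduced here for illustration, and it assumes the marker line always follows the lines to be dropped.

```python
import re


# Hypothetical helper (not part of the original answer): drops every line
# matching `marker` together with the `n_before` lines kept just before it.
def strip_footers(lines, marker=r'\x0cCompany', n_before=3):
    pattern = re.compile(marker)
    kept = []
    for line in lines:
        if pattern.search(line):
            # Discard the page number, title and blank line collected so far.
            # The guard avoids `del kept[-0:]`, which would empty the list.
            if n_before:
                del kept[-n_before:]
        else:
            kept.append(line)
    return kept


# The footer sample from the question, one list item per line.
sample = [
    'This is important text.\n',
    '\n',
    '9\n',
    'Title 2012 and 2013\n',
    '\n',
    '\x0cCompany\n',
    'Important text begins again.\n',
]

cleaned = strip_footers(sample)
print(''.join(cleaned), end='')
```

Note that the blank line directly after "This is important text." survives, because only the three lines immediately before the marker are removed.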
