Python: extract everything between <html> and </html> from a txt file

import os

def html_part(filepath):
    """
    Generator returning only the HTML lines from an
    SEC Edgar SGML multi-part file.
    """
    start, stop = '<html>\n', '</html>\n'
    filepath = os.path.expanduser(filepath)
    with open(filepath) as f:
        # find start indicator, yield it
        for line in f:
            if line == start:
                yield line
                break
        # yield lines until stop indicator found, yield and stop
        for line in f:
            yield line
            if line == stop:
                # "raise StopIteration" inside a generator is a RuntimeError
                # since Python 3.7 (PEP 479); a plain return ends it cleanly
                return

3 answers

User

Answer 1 · edited 2024-09-28 05:24:53

You can use a regex to do this:

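import re

# Read the whole file; the context manager closes it afterwards.
with open("filepath.txt", "r") as f:
    content = f.read()

# re.DOTALL lets "." match newlines, so multi-line HTML parts are found.
htmlPart = re.findall("<html>.*?</html>", content, re.DOTALL)
# Strip the leading "<html>" (6 chars) and the trailing "</html>" (7 chars).
htmlPart = [i[6:-7] for i in htmlPart]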
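
A variant of the regex approach, as a minimal sketch. By default `.` does not match newlines, so a multi-line HTML part is only found with `re.DOTALL`, and a capture group avoids slicing by fixed offsets. The function name `extract_html_parts` and the sample text are hypothetical, and `re.IGNORECASE` is an assumption in case the tags appear as `<HTML>`.

```python
import re

# Non-greedy match between the tags; DOTALL lets "." span newlines.
# IGNORECASE is an assumption in case the file uses <HTML> ... </HTML>.
HTML_PART = re.compile(r"<html>(.*?)</html>", re.DOTALL | re.IGNORECASE)


def extract_html_parts(text):
    """Return the text between each <html> ... </html> pair (tags excluded)."""
    return HTML_PART.findall(text)


# Made-up sample standing in for the contents of the txt file.
sample = "header\n<html>\n<body>one</body>\n</html>\nfiller\n<HTML>two</HTML>\n"
parts = extract_html_parts(sample)
print(parts)
```

This reads the whole file into memory, which is usually fine for a single filing but may be a poor fit for very large files; the line-by-line approaches in the other answers avoid that.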

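User

Answer 2 · edited 2024-09-28 05:24:53

You need to set this up so that the loop can pass through more than one HTML part of the same file. Also, do start and stop need to include \n? What happens if <html> is not followed by a newline and the markup continues on the same line? HTML allows this; the whole document could even be written on a single line.

So I would first change the start and stop variables so that they no longer contain \n.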

start, stop = "<html>", "</html>"

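Next, adjust the loop so that it keeps scanning the same file and picks up every HTML part, testing with "in" instead of "==":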

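with open(filepath) as f:
    # switch is 1 while we are inside an HTML part, 0 otherwise
    switch = 0
    for line in f:
        # find start indicator
        if switch == 0 and start in line:
            switch = 1
        if switch == 1:
            yield line
            # stop indicator may be on the same line as the start indicator
            if stop in line:
                switch = 0
    # no "raise StopIteration": the generator ends when the loop finishes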
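
The state-flag loop above can be wrapped into a complete generator, sketched here under a few assumptions: the name `html_lines` is hypothetical, it takes any iterable of lines (an open file works) instead of a path, and the sample input is made up so the block runs without a real filing.

```python
import io


def html_lines(lines, start="<html>", stop="</html>"):
    """Yield every line that belongs to an HTML part, across all parts."""
    inside = False
    for line in lines:
        # Substring test, so the tags need not sit alone on a line.
        if not inside and start in line:
            inside = True
        if inside:
            yield line
            if stop in line:
                inside = False
    # Falling off the end finishes the generator; no StopIteration needed.


# Two HTML parts: one spread over several lines, one on a single line.
sample = io.StringIO(
    "junk\n<html>\n<p>a</p>\n</html>\njunk\n<html><p>b</p></html>\n"
)
result = list(html_lines(sample))
print(result)
```

Because the test is `start in line` and not `line == start`, a part whose opening and closing tags share one line is still captured, as are files with `\r\n` line endings.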

User

Answer 3 · edited 2024-09-28 05:24:53

This should do the job and collect all of the HTML parts into a single .html file:

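# start/stop as in the question; assumes each tag sits alone on its line
start, stop = '<html>\n', '</html>\n'
writing = False
html_file = open('my_file.html', 'a')
with open(origpath) as f:  # origpath: path to the source .txt file
    for line in f:
        # start copying at the start indicator
        if line == start:
            writing = True
        if writing:
            # line already ends with '\n', so no extra newline is added
            html_file.write(line)
        # stop copying once the stop indicator has been written
        if line == stop:
            writing = False

html_file.close()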
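
The same idea can be sketched as a function that opens both files in one `with` statement, so the output file is closed even if an error occurs. The name `copy_html_parts` and the file names in the demo are hypothetical; the sketch opens the output in `"w"` mode (overwrite) where the answer uses `'a'` (append), and it uses a substring test for the tags, both of which are assumptions about the desired behavior.

```python
import os
import tempfile


def copy_html_parts(src_path, dst_path, start="<html>", stop="</html>"):
    """Copy the lines of every HTML part to dst_path; return the line count."""
    written = 0
    writing = False
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if not writing and start in line:
                writing = True
            if writing:
                dst.write(line)  # line already carries its own newline
                written += 1
                if stop in line:
                    writing = False
    return written


# Demo on a throwaway file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "filing.txt")
    dst_path = os.path.join(tmp, "my_file.html")
    with open(src_path, "w") as f:
        f.write("SGML header\n<html>\n<p>hi</p>\n</html>\ntrailer\n")
    count = copy_html_parts(src_path, dst_path)
    with open(dst_path) as f:
        copied = f.read()

print(count)
print(copied)
```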
