BeautifulSoup html.parser takes a long time to parse an HTML file

Posted 2024-10-02 22:32:09


I am trying to extract results from an HTML file with BeautifulSoup:

from bs4 import BeautifulSoup

with open(r'/home/maria/Desktop/iqyylog.html', "r") as f:
    page = f.read()
soup = BeautifulSoup(page, 'html.parser')
for tag in soup.find_all('details'):
    print(tag)

The problem is that iqyylog.html contains more than 2500 nodes, so loading and parsing the data takes a long time. Is there another way to parse an HTML file that contains this much data? When I use the lxml parser, it only picks up the first 25 nodes.
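For reference, a minimal sketch assuming only the <details> elements are needed: BeautifulSoup's SoupStrainer can limit the tree that html.parser builds, which usually shortens parse time on large files.

from bs4 import BeautifulSoup, SoupStrainer

# Build only <details> tags instead of the full document tree.
only_details = SoupStrainer('details')
with open(r'/home/maria/Desktop/iqyylog.html', "r") as f:
    soup = BeautifulSoup(f.read(), 'html.parser', parse_only=only_details)
for tag in soup.find_all('details'):
    print(tag)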


1 answer

#1 · Posted 2024-10-02 22:32:09

Try this:

from simplified_scrapy import SimplifiedDoc, utils

html = utils.getFileContent(r'test.html')  # read the whole file as a string
doc = SimplifiedDoc(html)
details = doc.selects('details')  # select all <details> nodes
for detail in details:
    print(detail.tag)

If that is still too slow, try the following, which reads the file line by line and parses each <details> block separately:

import io
from simplified_scrapy import SimplifiedDoc, utils
def getDetails(fileName):
    details = []
    tag = 'details'
    with io.open(fileName, "r", encoding='utf-8') as file:
        # Suppose the start and end tags are not on the same line, as shown below
        # <details>
        #   some words
        # </details>
        line = file.readline()  # Read data line by line
        stanza = None # Store a details node
        while line != '':
            if line.strip() == '':
                line = file.readline()
                continue
            if stanza and line.find('</' + tag + '>') >= 0:
                doc = SimplifiedDoc(stanza + '</' + tag + '>')  # Instantiate a doc
                details.append(doc.select(tag))
                stanza = None
            elif stanza:
                stanza = stanza + line
            else:
                if line.find('<' + tag) >= 0:
                    stanza = line

            line = file.readline()
    return details


details = getDetails('test.html')
for detail in details:
    print(detail.tag)
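If a third-party dependency is not an option, the same streaming idea can be sketched with the standard library's html.parser module. This is a minimal sketch that only collects the text inside each <details> element and assumes the tags are not nested:

from html.parser import HTMLParser

class DetailsCollector(HTMLParser):
    # Collects the text content of every <details> element.
    def __init__(self):
        super().__init__()
        self.inside = False
        self.current = []
        self.details = []

    def handle_starttag(self, tag, attrs):
        if tag == 'details':
            self.inside = True
            self.current = []

    def handle_endtag(self, tag):
        if tag == 'details' and self.inside:
            self.inside = False
            self.details.append(''.join(self.current))

    def handle_data(self, data):
        if self.inside:
            self.current.append(data)

parser = DetailsCollector()
with open('test.html', 'r', encoding='utf-8') as f:
    for line in f:  # feed the file incrementally instead of reading it all at once
        parser.feed(line)

for text in parser.details:
    print(text)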
