BeautifulSoup html.parser takes a long time to parse an HTML file

Posted 2024-10-02 22:32:09


I am trying to extract results from an HTML file with BeautifulSoup:

from bs4 import BeautifulSoup

with open(r'/home/maria/Desktop/iqyylog.html', "r") as f:
    page = f.read()
soup = BeautifulSoup(page, 'html.parser')
for tag in soup.find_all('details'):
    print(tag)

The problem is that iqyylog.html contains more than 2500 nodes, so loading and parsing the data takes a long time. Is there another way to parse an HTML file that contains this much data? When I use the lxml parser, it only picks up the first 25 nodes.
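For reference, a minimal sketch assuming only the <details> elements are needed: BeautifulSoup's SoupStrainer can limit the tree that html.parser builds, which usually shortens parse time on large files.

from bs4 import BeautifulSoup, SoupStrainer

# Build only <details> tags instead of the full document tree.
only_details = SoupStrainer('details')
with open(r'/home/maria/Desktop/iqyylog.html', "r") as f:
    soup = BeautifulSoup(f.read(), 'html.parser', parse_only=only_details)
for tag in soup.find_all('details'):
    print(tag)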


1 answer

#1 · Posted 2024-10-02 22:32:09

Try this:

from simplified_scrapy import SimplifiedDoc, utils

html = utils.getFileContent(r'test.html')  # read the whole file as a string
doc = SimplifiedDoc(html)
details = doc.selects('details')  # select all <details> nodes
for detail in details:
    print(detail.tag)

If that is still too slow, try the following, which reads the file line by line and parses each <details> block separately:

import io
from simplified_scrapy import SimplifiedDoc, utils
def getDetails(fileName):
    details = []
    tag = 'details'
    with io.open(fileName, "r", encoding='utf-8') as file:
        # Suppose the start and end tags are not on the same line, as shown below
        # <details>
        #   some words
        # </details>
        line = file.readline()  # Read data line by line
        stanza = None # Store a details node
        while line != '':
            if line.strip() == '':
                line = file.readline()
                continue
            if stanza and line.find('</' + tag + '>') >= 0:
                doc = SimplifiedDoc(stanza + '</' + tag + '>')  # Instantiate a doc
                details.append(doc.select(tag))
                stanza = None
            elif stanza:
                stanza = stanza + line
            else:
                if line.find('<' + tag) >= 0:
                    stanza = line

            line = file.readline()
    return details


details = getDetails('test.html')
for detail in details:
    print(detail.tag)
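If a third-party dependency is not an option, the same streaming idea can be sketched with the standard library's html.parser module. This is a minimal sketch that only collects the text inside each <details> element and assumes the tags are not nested:

from html.parser import HTMLParser

class DetailsCollector(HTMLParser):
    # Collects the text content of every <details> element.
    def __init__(self):
        super().__init__()
        self.inside = False
        self.current = []
        self.details = []

    def handle_starttag(self, tag, attrs):
        if tag == 'details':
            self.inside = True
            self.current = []

    def handle_endtag(self, tag):
        if tag == 'details' and self.inside:
            self.inside = False
            self.details.append(''.join(self.current))

    def handle_data(self, data):
        if self.inside:
            self.current.append(data)

parser = DetailsCollector()
with open('test.html', 'r', encoding='utf-8') as f:
    for line in f:  # feed the file incrementally instead of reading it all at once
        parser.feed(line)

for text in parser.details:
    print(text)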
