Trying to collect data from local files using BeautifulSoup

import os
import sys

from bs4 import BeautifulSoup, SoupStrainer
from unipath import Path


def main():
    ROOT = Path(os.path.realpath(__file__)).ancestor(3)
    src = ROOT.child("src")
    templatedir = src.child("templates")
    for (dirpath, dirs, files) in os.walk(templatedir):
        for path in (Path(dirpath, f) for f in files):
            if path.endswith(".html"):
                for link in BeautifulSoup(path, parse_only=SoupStrainer(target="_blank")):
                    print link


if __name__ == "__main__":
    sys.exit(main())

2 Answers

User

Answer 1 · edited 2024-10-03 11:19:37

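Using BeautifulSoup is fine, but you should pass in the HTML itself, not just the path to the HTML file. BeautifulSoup takes markup as its argument, not a file path; it will not open the file and read its contents for you, so you have to do that yourself. If you pass in a.html, the soup is built from that literal string, not from the contents of the file, so of course it contains no links. You should use BeautifulSoup(open(path).read(), ...).

Edit:
It also accepts an open file object, so BeautifulSoup(open(path), ...) is enough.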
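The difference between passing a path and passing an open file can be checked with a small experiment. This is only a sketch: it assumes bs4 is installed, uses the built-in "html.parser", and the helper name blank_target_links, the sample markup, and the temporary file are made up for illustration.

```python
import os
import tempfile

from bs4 import BeautifulSoup, SoupStrainer


def blank_target_links(markup):
    """Return the tags with target="_blank" found in a markup string or open file."""
    only_blank = SoupStrainer(target="_blank")
    soup = BeautifulSoup(markup, "html.parser", parse_only=only_blank)
    return soup.find_all(target="_blank")


# Hypothetical sample page written to a temporary file for the demonstration.
SAMPLE_HTML = '<p><a href="/x" target="_blank">x</a> <a href="/y">y</a></p>'
tmpdir = tempfile.mkdtemp()
sample_path = os.path.join(tmpdir, "a.html")
with open(sample_path, "w") as f:
    f.write(SAMPLE_HTML)

# Passing the path: the path string itself is parsed as markup, so nothing matches.
from_path = blank_target_links(sample_path)

# Passing an open file object: the file's contents are read and parsed.
with open(sample_path) as f:
    from_file = blank_target_links(f)
```

Here from_path should come back empty while from_file holds the one link that opens in a new tab. Recent bs4 versions may also emit a warning when the markup looks like a file name, which is a useful hint that a path was passed by mistake.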

User

Answer 2 · edited 2024-10-03 11:19:37

I think you need something like this:

if path.endswith(".html"):
    # Open the file itself; dirpath is only the containing directory.
    with open(path) as htmlfile:
        for link in BeautifulSoup(htmlfile, parse_only=SoupStrainer(target="_blank")):
            print(link)
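Combining the directory walk from the question with the advice to open each file, a fuller version might look roughly like the sketch below. It is not the original poster's code: it assumes Python 3 with bs4 installed, swaps unipath for os.path, names the parser explicitly, and the function name find_blank_links is hypothetical.

```python
import os

from bs4 import BeautifulSoup, SoupStrainer


def find_blank_links(template_dir):
    """Yield (file path, tag) for every target="_blank" tag in .html files."""
    only_blank = SoupStrainer(target="_blank")
    for dirpath, _dirs, files in os.walk(template_dir):
        for name in files:
            if not name.endswith(".html"):
                continue
            path = os.path.join(dirpath, name)
            # Open the file itself (not the directory) and let bs4 read it.
            with open(path, encoding="utf-8") as htmlfile:
                soup = BeautifulSoup(htmlfile, "html.parser", parse_only=only_blank)
            for tag in soup.find_all(target="_blank"):
                yield path, tag
```

A caller could then loop with "for path, link in find_blank_links(templatedir): print(path, link)". Yielding the path alongside each tag makes it easier to see which template a link came from.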
