BeautifulSoup find_all("img") doesn't work on all websites

Published 2024-06-28 20:44:14


I'm trying to write a Python script that downloads images from any website. It works, but inconsistently. Specifically, find_all("img") doesn't find the images for the second URL. The script is:

# works for http://proof.nationalgeographic.com/2016/02/02/photo-of-the-day-best-of-january-3/
# but not http://www.nationalgeographic.com/photography/proof/2017/05/lake-chad-desertification/
import requests
from PIL import Image
from io import BytesIO
from bs4 import BeautifulSoup

def url_to_image(url, filename):
    # get HTTP response, open as bytes, save the image
    # http://docs.python-requests.org/en/master/user/quickstart/#binary-response-content
    req = requests.get(url)
    i = Image.open(BytesIO(req.content))
    i.save(filename)

# open page, get HTML request and parse with BeautifulSoup
html = requests.get("http://proof.nationalgeographic.com/2016/02/02/photo-of-the-day-best-of-january-3/")
soup = BeautifulSoup(html.text, "html.parser")

# find all JPEGS in our soup and write their "src" attribute to array
urls = []
for img in soup.find_all("img"):
    src = img.get("src", "")  # some <img> tags have no src attribute; avoid a KeyError
    if src.endswith("jpg"):
        print("endswith jpg")
        urls.append(src)
    print(img)

jpeg_no = 00
for url in urls:
    url_to_image(url, filename="NatGeoPix/" + str(jpeg_no) + ".jpg")
    jpeg_no += 1
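The src filter above can be exercised offline against a small HTML snippet, which helps separate parsing problems from page-rendering problems. This is a minimal sketch with made-up markup; it uses `img.get("src")` so an `<img>` tag without a src attribute doesn't raise a KeyError:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one JPEG, one PNG, and one <img> with no src at all.
html = """
<html><body>
  <img src="http://example.com/photo1.jpg">
  <img src="http://example.com/icon.png">
  <img>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only srcs that end in "jpg"; .get() returns "" for missing attributes.
urls = [img.get("src") for img in soup.find_all("img")
        if img.get("src", "").endswith("jpg")]
print(urls)
```

Running this against the static snippet keeps exactly the JPEG URL, which confirms the filtering logic itself is sound; the inconsistency between the two National Geographic pages must come from the HTML each one serves.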

1 Answer

On the page that fails, the images are rendered with JavaScript. Render the page with dryscrape first.

(If you don't want to use dryscrape, see Web-scraping JavaScript page with Python.)

For example:

import requests
from PIL import Image
from io import BytesIO
from bs4 import BeautifulSoup
import dryscrape

def url_to_image(url, filename):
    # get HTTP response, open as bytes, save the image
    # http://docs.python-requests.org/en/master/user/quickstart/#binary-response-content
    req = requests.get(url)
    i = Image.open(BytesIO(req.content))
    i.save(filename)

# open page, get HTML request and parse with BeautifulSoup

session = dryscrape.Session()
session.visit("http://www.nationalgeographic.com/photography/proof/2017/05/lake-chad-desertification/")
response = session.body()
soup = BeautifulSoup(response, "html.parser")

# find all JPEGS in our soup and write their "src" attribute to array
urls = []
for img in soup.find_all("img"):
    src = img.get("src", "")  # guard against <img> tags without a src attribute
    if src.endswith("jpg"):
        print("endswith jpg")
        urls.append(src)
        print(img)

jpeg_no = 00
for url in urls:
    url_to_image(url, filename="NatGeoPix/" + str(jpeg_no) + ".jpg")
    jpeg_no += 1
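The download loop at the end will also raise an IOError from `Image.save()` if the NatGeoPix directory doesn't exist yet. A minimal sketch (with placeholder URLs standing in for the scraped results) that creates the directory first and lets `enumerate()` supply the counter instead of a manual `jpeg_no`:

```python
import os

# Placeholder URLs standing in for the scraped results.
urls = ["http://example.com/0.jpg", "http://example.com/1.jpg"]

# Create the output directory up front so Image.save() has somewhere to write.
os.makedirs("NatGeoPix", exist_ok=True)

# enumerate() pairs each URL with its index, replacing the manual counter.
filenames = ["NatGeoPix/{}.jpg".format(no) for no, url in enumerate(urls)]
print(filenames)
```

Each generated filename would then be passed to `url_to_image(url, filename)` exactly as in the script above.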

But I would also check that you have an absolute URL rather than a relative URL.
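A minimal sketch of that check, assuming the standard library's `urllib.parse.urljoin` (the base is the failing page from the answer; the src values are hypothetical): a relative src is resolved against the page URL before downloading, while an already-absolute src passes through unchanged.

```python
from urllib.parse import urljoin

# The page URL from the answer above.
base = "http://www.nationalgeographic.com/photography/proof/2017/05/lake-chad-desertification/"

# urljoin resolves a relative src against the page URL...
relative = urljoin(base, "/u/fQYSUbVfts.jpg")
# ...and leaves an already-absolute src untouched.
absolute = urljoin(base, "http://cdn.example.com/a.jpg")

print(relative)
print(absolute)
```

Calling `urljoin` on every src before handing it to `requests.get` makes the downloader work whether the page uses relative or absolute image paths.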
