Unable to fetch the highest-resolution images from a web page with BeautifulSoup and Python

Posted 2024-09-30 20:34:06


I have a link listing computer games, and for each game I want to extract only the product image with the highest resolution, not every img tag. So far I have:

import re
import requests

# collect every img tag on the page
img_tags = soup2.find_all('img')
# build a list of their src URLs
urls_img = [img['src'] for img in img_tags]
# download each image whose URL ends in .jpg or .png
for murl in urls_img:
    filename = re.search(r'/([\w_-]+[.](jpg|png))$', murl)
    if filename is not None:
        if 'http' not in murl:
            murl = '{}{}'.format(site, murl)  # make relative URLs absolute
        response = requests.get(murl)
        # only open the file once we know the download succeeded
        if response.status_code == 200:
            with open(filename.group(1), 'wb') as f:
                f.write(response.content)

2 Answers

Edit: following the discussion, the code below collects the initial product URLs (excluding placeholders) and visits each product page to find the largest image, which carries a ['data-large_image'] attribute.

I use a Session for the efficiency of re-using the connection.

import requests
from bs4 import BeautifulSoup as bs

url = 'http://zegetron.gr/b2b/product-category/pc/?products-per-page=all'
images = []
with requests.Session() as s:
    r = s.get(url)
    soup = bs(r.content, 'lxml')
    # product links, skipping items whose only image is the WooCommerce placeholder
    product_links = [item.select_one('a')['href'] for item in soup.select('.product-wrapper') if item.select_one('[src]:not(.woocommerce-placeholder)')]

    for link in product_links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        # the largest image URL is stored in the data-large_image attribute
        images.append(soup.select_one('[data-large_image]')['data-large_image'])
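Once images is populated, the URLs still need to be written to disk, which was the question's original goal. A minimal sketch, assuming the collected URLs are absolute; filename_for and download_images are my own names, and I use the standard library's urllib rather than the answer's Session purely to keep the snippet self-contained:

```python
import os
from urllib.parse import urlparse
from urllib.request import urlopen

def filename_for(url):
    """Derive a file name from the URL path ('' if the path has none)."""
    return os.path.basename(urlparse(url).path)

def download_images(urls, dest_dir='images'):
    """Save each image URL into dest_dir; returns the paths written."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for url in urls:
        name = filename_for(url)
        if not name:
            continue  # no usable file name in this URL
        path = os.path.join(dest_dir, name)
        with urlopen(url) as r, open(path, 'wb') as f:
            f.write(r.read())
        saved.append(path)
    return saved
```

Swapping the urlopen call for the Session from the answer above is straightforward if you are downloading many images from the same host.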

Previous answer (based on the original single URL listing all products):

Try the following, which looks for a srcset attribute in each listing. If present, it takes the last src link listed (they are ordered by ascending resolution); otherwise it falls back to the src attribute.

from bs4 import BeautifulSoup as bs
import requests

r = requests.get('http://zegetron.gr/b2b/product-category/pc/?products-per-page=all')
soup = bs(r.content, 'lxml')
listings = soup.select('.thumb-wrapper')
images = []

for listing in listings:
    link = ''
    if listing.select_one(':has([srcset])'):
        # srcset entries are ordered by ascending resolution; take the last URL
        links = listing.select_one('[srcset]')['srcset']
        link = links.split(',')[-1]
        link = link.split(' ')[1]  # "[space]url 600w" -> url
    else:
        # fall back to src, skipping the WooCommerce placeholder image
        if listing.select_one('[src]:not(.woocommerce-placeholder)'):
            link = listing.select_one('img[src]')['src']
    if link:
        images.append(link)
print(images)
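The srcset handling above can be isolated into a small helper. A sketch of the same last-entry logic (the function name is mine, and it strips whitespace rather than relying on the leading space left by the comma split):

```python
def largest_from_srcset(srcset):
    """Return the URL of the last (highest-resolution) candidate in a srcset string.

    srcset looks like "url1 300w, url2 600w"; candidates are assumed to be
    ordered by ascending resolution, as on this site.
    """
    last = srcset.split(',')[-1].strip()  # e.g. "url2 600w"
    return last.split(' ')[0]             # drop the width descriptor
```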

I found that this may be an easier way to solve my problem:

for each_img_tag in img_tags:
    width = each_img_tag.get('width')
    # keep only images whose declared width exceeds 500 px
    if width is not None and int(width) > 500:
        urls_img.append(each_img_tag['src'])

even though I don't know whether it is faster.
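If the goal is a single highest-resolution image per page rather than everything above a fixed threshold, the same width check can select the maximum instead. A sketch operating on the src/width pairs pulled from the img tags (the function name is mine; it assumes width attributes are plain integers and skips tags without one):

```python
def widest_image(imgs):
    """Return the src of the entry with the largest numeric 'width', or None.

    imgs: iterable of dicts such as {'src': 'a.jpg', 'width': '100'},
    e.g. built from [{'src': t['src'], 'width': t.get('width')} for t in img_tags].
    """
    best_src, best_w = None, -1
    for img in imgs:
        w = img.get('width')
        if w is not None and str(w).isdigit() and int(w) > best_w:
            best_w, best_src = int(w), img['src']
    return best_src
```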
