Getting product categories with Python

Posted 2024-10-01 13:28:46


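I am trying to scrape this page, which has about 21,000 products.

My question is how to get the product name, image, and full category hierarchy for all 21,000 products. The image and name are on the listing page, but the category is only on each product's own page.

Because of pagination, I can only get the titles and images of the 32 products on the first page.

Code that fetches the titles from the first page: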

import requests
from bs4 import BeautifulSoup

main_url = "https://paytmmall.com/fmcg-foods-glpid-101405?discoverability=online&use_mw=1"

result = requests.get(main_url)
print(result.text)

sp = BeautifulSoup(result.text, 'html.parser')
print(sp.prettify())

# Each product tile is a div with class "_3WhJ"; its link carries the title
getallTitle = [x.a.get('title') for x in sp.find_all("div", class_="_3WhJ")]

print(str(len(getallTitle)) + " fetched products Title")
print("\n")
print(getallTitle[2])

3 Answers

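Here is how to deal with the pagination. Pagination simply means the site sends requests on demand instead of fetching everything at once. Each time you click a page number, something changes, depending on how the site is designed. In your case, the URL query changes every time you click a page link. The resulting URL is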

https://paytmmall.com/fmcg-foods-glpid-101405?discoverability=online&use_mw=1&category=101405&page=2

If you keep changing page=2 to whichever page you want to scrape, you can scrape the whole site.

Logic:

^{pr2}$
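
The idea of changing only the page number in the URL can be sketched with a small helper. This is a minimal sketch, not the answerer's code: the helper name page_url is made up for illustration, and only the URL and its page parameter come from the answer above.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

BASE_URL = ("https://paytmmall.com/fmcg-foods-glpid-101405"
            "?discoverability=online&use_mw=1&category=101405&page=2")

def page_url(url, page):
    """Return url with its 'page' query parameter set to page (hypothetical helper)."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))   # existing query parameters, in order
    query["page"] = str(page)              # overwrite or add the page number
    return urlunsplit(parts._replace(query=urlencode(query)))

print(page_url(BASE_URL, 3))
```

Looping over page numbers and passing each result to requests.get would then fetch one listing page at a time.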

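You can access the JSON response for each page. Keep in mind, though, that each page holds only 32 products, which means you will make 659 requests.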

import requests
import math

url = 'https://middleware.paytmmall.com/fmcg-foods-glpid-101405'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}

payload = {
    'channel': 'web',
    'child_site_id': '6',
    'site_id': '2',
    'version': '2',
    'discoverability': 'online',
    'use_mw': '1',
    'category': '101405',
    'page': '1',
    'page_count': '1',
    'items_per_page': '32'}

# Get total pages needed
jsonData = requests.post(url, headers=headers, data=payload).json()
total_count = jsonData['totalCount']
total_pages = total_count / 32
pages = math.ceil(total_pages)


# Iterate through each page
for page in range(1,pages + 1):
    payload.update({'page':page, 'page_count':page})

    jsonData = requests.post(url, headers=headers, data=payload).json()

    for product in jsonData['grid_layout']:
        name = product['name']
        brand = product['brand']
        actual_price = product['actual_price']
        try:
            category = product['attributes']['type']
        except (KeyError, TypeError):
            # Some products have no 'attributes' or no 'type' entry
            category = 'N/A'

        print ('%-20s ₹%-5s %-20s ₹%s' %(category, actual_price, brand, name))

Output:

^{pr2}$
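
The field extraction in the loop above can also be written so that a missing key never raises. The following is a minimal sketch under assumptions: the function name extract_row and the sample records are made up for illustration, and only the key names (name, brand, actual_price, attributes, type) come from the code above.

```python
def extract_row(product):
    """Pull the fields used above from one product dict, tolerating missing keys."""
    attributes = product.get("attributes") or {}
    return {
        "name": product.get("name", "N/A"),
        "brand": product.get("brand", "N/A"),
        "actual_price": product.get("actual_price", "N/A"),
        "category": attributes.get("type", "N/A"),
    }

# Made-up sample records, not real API output.
sample = [
    {"name": "Example Tea", "brand": "ExampleBrand", "actual_price": 100,
     "attributes": {"type": "Tea"}},
    {"name": "Example Snack", "brand": "ExampleBrand", "actual_price": 50},
]

rows = [extract_row(p) for p in sample]
for row in rows:
    print(row)
```

Collecting rows in a list like this, rather than printing inside the loop, makes it easier to write the results to a file later.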

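Edit:

If you want the hierarchy, you need to follow each product's link and pull it from there. I have provided code that does this, but keep in mind that it will take FOREVER. Assuming each request takes about 2-3 seconds, it would take nearly 18 hours.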

# Iterate through each page
for page in range(1,pages + 1):
    payload.update({'page':page, 'page_count':page})

    jsonData = requests.post(url, headers=headers, data=payload).json()

    for product in jsonData['grid_layout']:
        name = product['name']
        brand = product['brand']
        actual_price = product['actual_price']
        img = product['image_url']
        category_id = product['category_id']

        new_url = product['newurl']

        jsonData_product = requests.get(new_url, headers=headers).json()

        category = '/'.join( [each['name'] for each in jsonData_product['ancestors'] ] )

        print ('Name: %s\nImage: %s\nCategory: %s\n' %(name, img, category))

Output:

Name: Red Label Tea 500 gm
Image: https://assetscdn1.paytm.com/images/catalog/product/F/FA/FASRED-LABEL-TETBL497475164B959/a_4.jpg
Category: Supermarket/Foods/Drinks & Beverages/Tea & Coffee/Red Label Tea 500 gm

Name: Tata Tea Premium Leaf 250 gm
Image: https://assetscdn1.paytm.com/images/catalog/product/F/FA/FASTATA-TEA-PREINNO985832A1E145F5/8.jpg
Category: Supermarket/Foods/Drinks & Beverages/Tea & Coffee/Tata Tea Premium Leaf 250 gm

Name: Red Label Natural Care Tea 500 gm
Image: https://assetscdn1.paytm.com/images/catalog/product/F/FA/FASRLNC-C-500GNTBL4974726639099/a_14.jpg
Category: Supermarket/Foods/Drinks & Beverages/Tea & Coffee/Red Label Tea & Coffee 500 Gm

Name: Taj Mahal Tea 500 gm
Image: https://assetscdn1.paytm.com/images/catalog/product/F/FA/FASTAJ-MAHAL-TEBIGB985832F0512392/0.jpg
Category: Supermarket/Foods/Drinks & Beverages/Tea & Coffee/Taj Mahal Tea 500 gm

Name: Red Label Natural Care Tea 250 gm
Image: https://assetscdn1.paytm.com/images/catalog/product/F/FA/FASNEW-RED-LABETBL49747FC4B364F/a_7.jpg
Category: Supermarket/Foods/Drinks & Beverages/Tea & Coffee/Red Label Natural Care Tea 250 gm

Name: Nestle Everyday Dairy Whitener Milk 1 kg
Image: https://assetscdn1.paytm.com/images/catalog/product/F/FA/FASNESTLE-EVERYTBL497478E1F2966/a_8.jpg
Category: Supermarket/Foods/Dairy Products/Dairy Whitener/Nestle Everyday Dairy Whitener Milk 1 kg
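
The running-time estimate given before this code can be checked with simple arithmetic. This sketch assumes one request per product and uses the rounded figure of 21,000 products from the question; it ignores the 659 listing-page requests, which add comparatively little.

```python
def estimated_hours(num_requests, seconds_per_request):
    """Total time in hours for sequential requests of a fixed duration."""
    return num_requests * seconds_per_request / 3600

products = 21000  # approximate product count from the question

low = estimated_hours(products, 2)   # at 2 seconds per request
high = estimated_hours(products, 3)  # at 3 seconds per request

print(f"{low:.1f} to {high:.1f} hours")  # prints "11.7 to 17.5 hours"
```

The upper end, 17.5 hours, is where the "nearly 18 hours" figure comes from.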

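If all the products fall under the same category, you only need to fetch the category for the first product and then apply it to all the others as you iterate through the pages: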

import requests
import math

url = 'https://middleware.paytmmall.com/fmcg-foods-glpid-101405'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}

payload = {
    'channel': 'web',
    'child_site_id': '6',
    'site_id': '2',
    'version': '2',
    'discoverability': 'online',
    'use_mw': '1',
    'category': '101405',
    'page': '1',
    'page_count': '1',
    'items_per_page': '32'}

# Get total pages needed
jsonData = requests.post(url, headers=headers, data=payload).json()
total_count = jsonData['totalCount']
total_pages = total_count / 32
pages = math.ceil(total_pages)


# Iterate through each page
category = ''
for page in range(1,pages + 1):
    payload.update({'page':page, 'page_count':page})

    jsonData = requests.post(url, headers=headers, data=payload).json()

    for product in jsonData['grid_layout']:
        name = product['name']
        brand = product['brand']
        actual_price = product['actual_price']
        img = product['image_url']
        category_id = product['category_id']

        if category == '':
            new_url = product['newurl']
            jsonData_product = requests.get(new_url, headers=headers).json()
            category = '/'.join( [each['name'] for each in jsonData_product['ancestors'] ][:-1] )

        print ('Name: %s\nImage: %s\nCategory: %s\n' %(name, img, category))
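
A middle ground between the two scripts above is to fetch the hierarchy once per distinct category_id instead of once per product or once overall. This is a sketch only, and it rests on an assumption the answer does not confirm: that products sharing a category_id share the same ancestor path. The names make_category_lookup and fake_fetch are hypothetical, and the fetcher is passed in so the example runs without network access.

```python
def make_category_lookup(fetch_ancestors):
    """Return a lookup that fetches the category path once per category_id.

    fetch_ancestors(url) must return the list of ancestor names for a product.
    ASSUMPTION: equal category_id implies an equal ancestor path.
    """
    cache = {}

    def lookup(product):
        key = product["category_id"]
        if key not in cache:
            names = fetch_ancestors(product["newurl"])
            cache[key] = "/".join(names[:-1])  # drop the last entry (the product itself)
        return cache[key]

    return lookup

# Stand-in fetcher with made-up data; it records how often it is called.
calls = []
def fake_fetch(url):
    calls.append(url)
    return ["Supermarket", "Foods", "Tea & Coffee", "Some product"]

lookup = make_category_lookup(fake_fetch)
products = [
    {"category_id": 1, "newurl": "https://example.invalid/a"},
    {"category_id": 1, "newurl": "https://example.invalid/b"},
]
paths = [lookup(p) for p in products]
print(paths, len(calls))
```

In the script above, the fetcher could wrap the existing requests.get(new_url, headers=headers).json() call and return the names from its 'ancestors' list.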

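For page one, the page makes a request like the one below (it returns JSON). See whether you can alter the parameters to get all the results.

It looks like you can change the page by altering the URL, both in the Referer header and in current_page in the body, so that it includes the page number: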

https://paytmmall.com/fmcg-foods-glpid-101405?discoverability=online&use_mw=1&page=2

You can extract the total result count from the first request:

^{pr2}$

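You know you are requesting in batches of 32 (though try increasing this value to the maximum the server allows). You can then calculate the number of pages/requests and issue them in a loop.

Python (request for page 1)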

import requests

headers = {
    'Content-Type' : 'application/json',
    'Referer' : 'https://paytmmall.com/fmcg-foods-glpid-101405?discoverability=online&use_mw=1',
    'User-Agent' : 'Mozilla/5.0'
}

body = {"tracking":{"current_page":"https://paytmmall.com/fmcg-foods-glpid-101405?discoverability=online&use_mw=1","prev_page":''},"context":{"device":{"os":"Win32","device_type":"PC","browser_uuid":"GA1.2.105449259.1558439396","ua":"Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36","connection_type":"Unknown"},"channel":"WEB","user":{"ga_id":"GA1.2.105449259.1558439396","user_id":''}}}

r = requests.post('https://middleware.paytmmall.com/fmcg-foods-glpid-101405?channel=web&child_site_id=6&site_id=2&version=2&discoverability=online&use_mw=1&items_per_page=32', json = body, headers = headers).json()
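
For later pages, the change described above (page number in the Referer header and in current_page) might be prepared as follows. This is a sketch under assumptions: the helper name build_page_request is made up, the sample headers and body are trimmed-down stand-ins for the full ones above, and whether the server actually honors these fields has not been verified here.

```python
import copy

LISTING_URL = "https://paytmmall.com/fmcg-foods-glpid-101405?discoverability=online&use_mw=1"

def build_page_request(headers, body, page):
    """Return copies of headers and body that point at the given page (hypothetical helper)."""
    page_url = f"{LISTING_URL}&page={page}"
    new_headers = dict(headers)
    new_headers["Referer"] = page_url
    new_body = copy.deepcopy(body)          # deep copy so the nested dicts are not shared
    new_body["tracking"]["current_page"] = page_url
    return new_headers, new_body

# Trimmed-down stand-ins for the headers and body shown above.
sample_headers = {"Content-Type": "application/json", "Referer": LISTING_URL}
sample_body = {"tracking": {"current_page": LISTING_URL, "prev_page": ""}}

page2_headers, page2_body = build_page_request(sample_headers, sample_body, 2)
print(page2_headers["Referer"])
```

The returned pair can be passed as headers= and json= to the same requests.post call used for page 1.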
