Getting product categories with Python

3 Answers

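User

Answer 1 · edited 2024-10-01 13:28:46

Here is how to handle the pagination. Pagination simply means the site sends requests on demand instead of fetching everything at once. Each time you click a page number, something changes, depending on how the site is designed. In your case, the URL query string changes every time you click a page link. The resulting URL is: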

https://paytmmall.com/fmcg-foods-glpid-101405?discoverability=online&use_mw=1&category=101405&page=2

If you keep changing page=2 to whichever page you want to scrape, you can scrape the whole site.

Logic:

[code block not preserved in the source]
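A minimal sketch of that logic, assuming only that the `page` query parameter selects the page, as in the URL above. The helper name `page_url` is hypothetical, and the sketch only builds the URLs; fetching and parsing each page is left out.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# URL taken from the answer above
BASE_URL = ("https://paytmmall.com/fmcg-foods-glpid-101405"
            "?discoverability=online&use_mw=1&category=101405&page=2")


def page_url(url, page):
    """Return `url` with its `page` query parameter set to `page` (hypothetical helper)."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    query["page"] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))


# Print the URLs for the first three pages
for n in range(1, 4):
    print(page_url(BASE_URL, n))
```

Each generated URL could then be passed to whatever fetching code you already use.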

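User

Answer 2 · edited 2024-10-01 13:28:46

You can access the JSON response for each page. Keep in mind, though, that each page only holds 32 products, which means you will be making 659 requests.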

import requests
import math

url = 'https://middleware.paytmmall.com/fmcg-foods-glpid-101405'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}

payload = {
'channel': 'web',
'child_site_id': '6',
'site_id': '2',
'version': '2',
'discoverability': 'online',
'use_mw': '1',
'category': '101405',
'page': '1',
'page_count': '1',
'items_per_page': '32'}

# Get total pages needed
jsonData = requests.post(url, headers=headers, data=payload).json()
total_count = jsonData['totalCount']
total_pages = total_count / 32
pages = math.ceil(total_pages)


# Iterate through each page
for page in range(1,pages + 1):
    payload.update({'page':page, 'page_count':page})

    jsonData = requests.post(url, headers=headers, data=payload).json()

    for product in jsonData['grid_layout']:
        name = product['name']
        brand = product['brand']
        actual_price = product['actual_price']
        try:
            category = product['attributes']['type']
        except (KeyError, TypeError):
            # Fall back when 'attributes' or 'type' is missing
            category = 'N/A'

        print ('%-20s ₹%-5s %-20s ₹%s' %(category, actual_price, brand, name))

Output:

[output not preserved in the source]

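Edit:

If you want the hierarchy, you need to follow each product's link and pull it out from there. I have provided code to do that, but keep in mind that it will take FOREVER. Assuming each request takes about 2-3 seconds, it would take close to 18 hours.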

# Iterate through each page (reuses url, headers, payload and pages from the first snippet)
for page in range(1,pages + 1):
    payload.update({'page':page, 'page_count':page})

    jsonData = requests.post(url, headers=headers, data=payload).json()

    for product in jsonData['grid_layout']:
        name = product['name']
        brand = product['brand']
        actual_price = product['actual_price']
        img = product['image_url']
        category_id = product['category_id']

        new_url = product['newurl']

        jsonData_product = requests.get(new_url, headers=headers).json()

        category = '/'.join( [each['name'] for each in jsonData_product['ancestors'] ] )

        print ('Name: %s\nImage: %s\nCategory: %s\n' %(name, img, category))

Output:

Name: Red Label Tea 500 gm
Image: https://assetscdn1.paytm.com/images/catalog/product/F/FA/FASRED-LABEL-TETBL497475164B959/a_4.jpg
Category: Supermarket/Foods/Drinks & Beverages/Tea & Coffee/Red Label Tea 500 gm

Name: Tata Tea Premium Leaf 250 gm
Image: https://assetscdn1.paytm.com/images/catalog/product/F/FA/FASTATA-TEA-PREINNO985832A1E145F5/8.jpg
Category: Supermarket/Foods/Drinks & Beverages/Tea & Coffee/Tata Tea Premium Leaf 250 gm

Name: Red Label Natural Care Tea 500 gm
Image: https://assetscdn1.paytm.com/images/catalog/product/F/FA/FASRLNC-C-500GNTBL4974726639099/a_14.jpg
Category: Supermarket/Foods/Drinks & Beverages/Tea & Coffee/Red Label Tea & Coffee 500 Gm

Name: Taj Mahal Tea 500 gm
Image: https://assetscdn1.paytm.com/images/catalog/product/F/FA/FASTAJ-MAHAL-TEBIGB985832F0512392/0.jpg
Category: Supermarket/Foods/Drinks & Beverages/Tea & Coffee/Taj Mahal Tea 500 gm

Name: Red Label Natural Care Tea 250 gm
Image: https://assetscdn1.paytm.com/images/catalog/product/F/FA/FASNEW-RED-LABETBL49747FC4B364F/a_7.jpg
Category: Supermarket/Foods/Drinks & Beverages/Tea & Coffee/Red Label Natural Care Tea 250 gm

Name: Nestle Everyday Dairy Whitener Milk 1 kg
Image: https://assetscdn1.paytm.com/images/catalog/product/F/FA/FASNESTLE-EVERYTBL497478E1F2966/a_8.jpg
Category: Supermarket/Foods/Dairy Products/Dairy Whitener/Nestle Everyday Dairy Whitener Milk 1 kg
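The 18-hour figure can be sanity-checked with quick arithmetic. The sketch below assumes one request per listing page plus one per product, and treats every page as full (32 products), so it is an upper bound rather than a measurement. The function name `estimate_hours` is hypothetical.

```python
ITEMS_PER_PAGE = 32  # products per listing page, from the answer
PAGES = 659          # number of listing requests quoted in the answer


def estimate_hours(pages, items_per_page, seconds_per_request):
    """Upper-bound runtime: one request per listing page plus one per product."""
    requests_needed = pages + pages * items_per_page
    return requests_needed * seconds_per_request / 3600


# Try both ends of the 2-3 seconds per request range
for secs in (2, 3):
    hours = estimate_hours(PAGES, ITEMS_PER_PAGE, secs)
    print(f"{secs} s/request: about {hours:.1f} hours")
```

At 3 seconds per request the estimate lands just above 18 hours, which matches the figure quoted in the answer.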

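OR

If all the products fall under the same category, you only need to fetch the category for the first product and then apply it to all the others as you iterate through the pages: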

import requests
import math

url = 'https://middleware.paytmmall.com/fmcg-foods-glpid-101405'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}

payload = {
'channel': 'web',
'child_site_id': '6',
'site_id': '2',
'version': '2',
'discoverability': 'online',
'use_mw': '1',
'category': '101405',
'page': '1',
'page_count': '1',
'items_per_page': '32'}

# Get total pages needed
jsonData = requests.post(url, headers=headers, data=payload).json()
total_count = jsonData['totalCount']
total_pages = total_count / 32
pages = math.ceil(total_pages)


# Iterate through each page
category = ''
for page in range(1,pages + 1):
    payload.update({'page':page, 'page_count':page})

    jsonData = requests.post(url, headers=headers, data=payload).json()

    for product in jsonData['grid_layout']:
        name = product['name']
        brand = product['brand']
        actual_price = product['actual_price']
        img = product['image_url']
        category_id = product['category_id']

        if category == '':
            new_url = product['newurl']
            jsonData_product = requests.get(new_url, headers=headers).json()
            category = '/'.join( [each['name'] for each in jsonData_product['ancestors'] ][:-1] )

        print ('Name: %s\nImage: %s\nCategory: %s\n' %(name, img, category))
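A possible middle ground, sketched under an assumption the answer does not confirm: that products sharing a `category_id` also share the same ancestor path. In that case the category can be looked up once per distinct `category_id` and cached. The names `make_category_lookup` and `fake_fetch` are hypothetical, and the demo data is made up so the sketch runs without network access.

```python
def make_category_lookup(fetch_ancestors):
    """Build a lookup that calls fetch_ancestors once per distinct category_id."""
    cache = {}

    def lookup(product):
        key = product['category_id']
        if key not in cache:
            ancestors = fetch_ancestors(product['newurl'])
            # Drop the last ancestor, which is the product itself
            cache[key] = '/'.join(each['name'] for each in ancestors[:-1])
        return cache[key]

    return lookup


# Offline demo with a stand-in fetcher and made-up products
calls = []


def fake_fetch(url):
    calls.append(url)
    return [{'name': 'Supermarket'}, {'name': 'Foods'}, {'name': 'Product at ' + url}]


lookup = make_category_lookup(fake_fetch)
products = [
    {'category_id': 1, 'newurl': 'u1'},
    {'category_id': 1, 'newurl': 'u2'},
    {'category_id': 2, 'newurl': 'u3'},
]
for product in products:
    print(lookup(product))
print('requests made:', len(calls))
```

In the scraping loop above, the fetcher could be something like `lambda u: requests.get(u, headers=headers).json()['ancestors']`, which mirrors the call the answer already makes.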

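User

Answer 3 · edited 2024-10-01 13:28:46

The page makes a request like the one below for page one (it returns JSON). See whether you can alter the parameters to get all the results.

It looks like you can change the Referer header and the current page in the body by altering the URL to include the page: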

https://paytmmall.com/fmcg-foods-glpid-101405?discoverability=online&use_mw=1&page=2

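You can extract the total result count from the first request.

[code block not preserved in the source]

You know you are requesting in batches of 32 (though try increasing this value to the maximum the site allows). You can then calculate the number of pages/requests and issue them in a loop.

Python (page 1 request)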

import requests

headers = {
    'Content-Type' : 'application/json',
    'Referer' : 'https://paytmmall.com/fmcg-foods-glpid-101405?discoverability=online&use_mw=1',
    'User-Agent' : 'Mozilla/5.0'
}

body = {"tracking":{"current_page":"https://paytmmall.com/fmcg-foods-glpid-101405?discoverability=online&use_mw=1","prev_page":''},"context":{"device":{"os":"Win32","device_type":"PC","browser_uuid":"GA1.2.105449259.1558439396","ua":"Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36","connection_type":"Unknown"},"channel":"WEB","user":{"ga_id":"GA1.2.105449259.1558439396","user_id":''}}}

r = requests.post('https://middleware.paytmmall.com/fmcg-foods-glpid-101405?channel=web&child_site_id=6&site_id=2&version=2&discoverability=online&use_mw=1&items_per_page=32', json = body, headers = headers).json()
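The page calculation and loop described above could look roughly like this. It is a sketch, not a tested scraper: it assumes the page is selected by appending `&page=N` to the listing URL used for the Referer header and `current_page` field, which the answer only says "looks like" it works. The names `page_count`, `referer_for` and `requests_for` are hypothetical, and the total of 70 in the demo is made up.

```python
import math

# Listing URL used for the Referer header in the answer above
LISTING_URL = 'https://paytmmall.com/fmcg-foods-glpid-101405?discoverability=online&use_mw=1'


def page_count(total_count, items_per_page=32):
    """Number of requests needed to cover total_count results."""
    return math.ceil(total_count / items_per_page)


def referer_for(page):
    """Listing URL for a page; page 1 is the bare URL (assumption)."""
    return LISTING_URL if page == 1 else f'{LISTING_URL}&page={page}'


def requests_for(total_count, items_per_page=32):
    """Yield (page, referer) pairs, one per request to issue."""
    for page in range(1, page_count(total_count, items_per_page) + 1):
        yield page, referer_for(page)


# Demo with a made-up total of 70 results
for page, referer in requests_for(70):
    print(page, referer)
```

Inside the loop you would set `headers['Referer']` and `body['tracking']['current_page']` to the yielded URL before repeating the `requests.post` call shown above.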
